Managing and Reducing vLLM Warmup Time¶
This section provides guidance on reducing warmup time during vLLM model deployment on Intel® Gaudi® accelerators. It outlines the use of HPU graph caching, bucketing strategies, and experimental features to improve model performance. For more information about the warmup process, see Warmup.
Reducing Warmup Time with HPU Graph Caching¶
Intel Gaudi software supports caching of compiled HPU graphs using the PT_HPU_RECIPE_CACHE_CONFIG environment variable. This can significantly reduce startup time by reusing previously compiled graphs. For more details, see HPU Graph Capture.
Configuration¶
The environment variable is set using the following format:
export PT_HPU_RECIPE_CACHE_CONFIG=<RECIPE_CACHE_PATH>,<RECIPE_CACHE_DELETE>,<RECIPE_CACHE_SIZE_MB>
RECIPE_CACHE_PATH - Sets the directory in which compiled graph recipes are stored.
RECIPE_CACHE_DELETE:
True - Clears existing contents of the directory before storing newly compiled recipes.
False - Reuses the compiled recipes already stored in RECIPE_CACHE_PATH, which speeds up the warmup.
RECIPE_CACHE_SIZE_MB - Sets the maximum size of the cache directory in MB. If the cache size limit is reached, the PyTorch bridge automatically deletes the oldest recipes (based on file creation time). It is recommended to adjust the cache directory size according to the model and use case requirements.
Examples:
First-time run (store new recipes):
export PT_HPU_RECIPE_CACHE_CONFIG='/tmp/llama3_70b_recipe_cache/',True,8192
Subsequent run (reuse recipes):
export PT_HPU_RECIPE_CACHE_CONFIG='/tmp/llama3_70b_recipe_cache/',False,8192
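Before switching RECIPE_CACHE_DELETE to False for a subsequent run, you can confirm that the first run populated the cache directory. A minimal sketch, assuming the example path above:
# Confirm that compiled recipes were written during the first run
# (path matches the example above; adjust to your RECIPE_CACHE_PATH)
ls /tmp/llama3_70b_recipe_cache/ | head
du -sh /tmp/llama3_70b_recipe_cache/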
Note
The graph compilation process includes two stages: GC graph compilation and HPU graph compilation. When PT_HPU_RECIPE_CACHE_CONFIG is used, the GC stage is skipped by reusing cached graphs, which significantly reduces overall compilation time. However, the HPU graph compilation step is still performed.
The graphs have to be regenerated in the following cases:
PyTorch container or Gaudi software version changes.
Platform changes (e.g., Gaudi 2 to Gaudi 3).
Model tensor parallelism or data type changes (e.g., BF16 to FP8 or FP8 to BF16).
Storage Recommendations¶
Use a local disk when the cache is shared across processes (scale-up). Avoid remote filesystems (e.g., NFS) because they do not support the required file locking.
For Kubernetes:
Store cache in PVC/NFS.
Copy it to local disk before use.
For a usage example, refer to Intel Gaudi Tutorials.
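A minimal sketch of the Kubernetes copy step, assuming the PVC is mounted at /mnt/recipe-cache (a hypothetical mount path) and the local cache path used in the deployment example below:
# Copy cached recipes from the PVC/NFS mount to local disk before serving
# /mnt/recipe-cache is an assumed mount path; adjust to your volume
mkdir -p /tmp/llama3_8b_recipe_cache
cp -r /mnt/recipe-cache/llama3_8b_recipe_cache/. /tmp/llama3_8b_recipe_cache/
# Reuse the local copy (False = do not clear existing contents)
export PT_HPU_RECIPE_CACHE_CONFIG='/tmp/llama3_8b_recipe_cache/',False,8192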
Deployment with vLLM¶
Add the cache configuration to the serving command, as shown in the example below for Llama 3.1 8B:
# Store in cache
export PT_HPU_RECIPE_CACHE_CONFIG='/tmp/llama3_8b_recipe_cache/',True,8192
# Replay from cache
export PT_HPU_RECIPE_CACHE_CONFIG='/tmp/llama3_8b_recipe_cache/',False,8192
VLLM_PROMPT_BS_BUCKET_MAX=256 \
VLLM_DECODE_BS_BUCKET_MIN=128 \
VLLM_DECODE_BS_BUCKET_STEP=128 \
VLLM_DECODE_BS_BUCKET_MAX=128 \
VLLM_PROMPT_SEQ_BUCKET_MAX=1024 \
VLLM_DECODE_BLOCK_BUCKET_MAX=1024 \
PT_HPU_WEIGHT_SHARING=0 PT_HPU_MAX_COMPOUND_OP_SIZE=30 PT_HPU_LAZY_MODE=1 PT_HPU_ENABLE_LAZY_COLLECTIVES=true vllm serve meta-llama/Llama-3.1-8B-instruct -tp 1 --weights-load-device cpu --max-model-len 8192
Result:
Precision | Without Cache | With Cache | Time Reduction
BF16      | 66 sec        | 23 sec     | ~65% faster
FP8       | 504 sec       | 34 sec     | ~93% faster
No changes are required in the Dockerfile, as the recipe cache is specific to the model and use case. Use the -e flag to set the environment variable:
-e PT_HPU_RECIPE_CACHE_CONFIG='/tmp/llama3_8b_recipe_cache/',True,8192
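For example, a minimal container launch sketch; the image name and host cache directory are placeholders for your own deployment:
# Placeholder image name and host path; adapt to your deployment
docker run -it --runtime=habana \
  -e HABANA_VISIBLE_DEVICES=all \
  -v /data/llama3_8b_recipe_cache:/tmp/llama3_8b_recipe_cache \
  -e PT_HPU_RECIPE_CACHE_CONFIG='/tmp/llama3_8b_recipe_cache/',True,8192 \
  <vllm-gaudi-image>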
Bucket Management¶
vLLM warmup time is determined by the number of HPU graphs that must be compiled to support dynamic shapes, which are influenced by the batch_size and sequence_length. For more details, see Bucketing mechanism.
The following parameters define the upper limit for graph compilation. Setting them based on max_model_len ensures that additional graphs are not compiled during runtime:
Sequence length max (VLLM_PROMPT_SEQ_BUCKET_MAX): max_model_len
Block size max (VLLM_DECODE_BLOCK_BUCKET_MAX): max(128, (max_num_seqs*2048)/block_size)
In addition, it is recommended to follow the guidelines for setting initial bucket parameters mentioned in Recommended vLLM Parameters.
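As an illustration, a minimal sketch of deriving these upper limits in a shell script; the values of max_model_len, max_num_seqs, and block_size below are example placeholders, not recommendations:
# Example values only; substitute your own serving configuration
MAX_MODEL_LEN=8192
MAX_NUM_SEQS=128
BLOCK_SIZE=128
# Sequence length max follows max_model_len
export VLLM_PROMPT_SEQ_BUCKET_MAX=$MAX_MODEL_LEN
# Block size max follows max(128, (max_num_seqs*2048)/block_size)
export VLLM_DECODE_BLOCK_BUCKET_MAX=$(( MAX_NUM_SEQS * 2048 / BLOCK_SIZE > 128 ? MAX_NUM_SEQS * 2048 / BLOCK_SIZE : 128 ))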
Exponential Bucketing¶
The VLLM_EXPONENTIAL_BUCKETING=True flag, enabled by default starting with the 1.21.0-post1 vLLM release, switches the bucketing strategy from linear to exponential. This can reduce the number of buckets and warmup time by up to 80%, while generally maintaining equivalent inference performance. However, in some configurations, it may cause a performance drop due to increased padding. This setting is particularly effective for BF16 and FP8 models.
Linear bucketing can be enabled by setting VLLM_EXPONENTIAL_BUCKETING=False.
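For example, to fall back to linear bucketing for a configuration where the extra padding hurts performance, set the flag before the serving command:
# Revert to linear bucketing if exponential bucketing adds too much padding
export VLLM_EXPONENTIAL_BUCKETING=False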
Experimental Features¶
Runtime Scale Patching¶
vLLM warmup time for FP8 models is significantly longer than for BF16 due to additional graph compilations triggered by varying constant scale values in quantized model layers.
FP8 warmup time can be reduced by setting the RUNTIME_SCALE_PATCHING=1 environment variable and selecting a hardware-aligned per-tensor scale_method provided by the INC JSON config. This feature is recommended for larger models (e.g., 70B or 405B). When combined with VLLM_EXPONENTIAL_BUCKETING for FP8 models, it can reduce warmup time by up to 90%.
Note
This feature reduces FP8 warmup time but may lower model throughput by 5-20%. Future releases will improve performance and extend support to more ops.
Supported with Lazy mode (PT_HPU_LAZY_MODE=1) and torch.compile.
Supports Llama workloads using FP8 execution of Linear and FSDPA layers, and casting ops between BF16 and FP8. MoE and Convolution ops are not yet supported.
Trivial Scales Optimization¶
The PT_HPU_H2D_TRIVIAL_SCALES_MODE flag controls the trivial scales (i.e., a scale value equal to 1.0) optimization in the RUNTIME_SCALE_PATCHING mode. Enabling this optimization can increase warmup and compilation time due to the generation of additional graphs, but may improve runtime performance by reducing the number of multiplication operations.
The following values are supported:
0 - No optimization (default).
1 - Removes scales equal to 1.0 in cast_to_fp8_v2 and cast_from_fp8, disabling the corresponding mult_fwd (multiplication) node.
2 - Applies the same optimization as mode 1, and additionally removes reciprocal scales in fp8_gemm_v2.
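For example, a hedged sketch of a combination that trades longer warmup for fewer multiplication operations at runtime:
# Assumes runtime scale patching is already enabled (see above)
export RUNTIME_SCALE_PATCHING=1
# Mode 2: drop scales equal to 1.0 and reciprocal scales in fp8_gemm_v2
export PT_HPU_H2D_TRIVIAL_SCALES_MODE=2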