Managing and Reducing vLLM Warmup Time¶
This section provides guidance on reducing warmup time during vLLM model deployment on Intel® Gaudi® accelerators. It outlines the use of HPU graph caching, bucketing strategies, and experimental features to improve model performance. For more information about the warmup process, see Warmup.
Reducing Warmup Time with HPU Graph Caching¶
Intel Gaudi software supports caching of compiled HPU graphs using the PT_HPU_RECIPE_CACHE_CONFIG environment variable. This can significantly reduce startup time by reusing previously compiled graphs. For more details, see HPU Graph Capture.
Configuration¶
The environment variable is set using the following format:
export PT_HPU_RECIPE_CACHE_CONFIG=<RECIPE_CACHE_PATH>,<RECIPE_CACHE_DELETE>,<RECIPE_CACHE_SIZE_MB>
RECIPE_CACHE_PATH - Sets the directory where compiled graph recipes are stored.
RECIPE_CACHE_DELETE:
True - Clears existing contents before storing new graph-compiled recipes.
False - Reuses the graph-compiled recipes stored in RECIPE_CACHE_PATH, which speeds up the warmup.
RECIPE_CACHE_SIZE_MB - Sets the maximum size of the cache directory in MB. If the cache size limit is reached, the PyTorch bridge automatically deletes the oldest recipes (based on file creation time). It is recommended to adjust the cache directory size according to the model and use case requirements.
Examples:
First-time run (store new recipes):
export PT_HPU_RECIPE_CACHE_CONFIG='/tmp/llama3_70b_recipe_cache/',True,8192
Subsequent run (reuse recipes):
export PT_HPU_RECIPE_CACHE_CONFIG='/tmp/llama3_70b_recipe_cache/',False,8192
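The store/replay mode can also be selected automatically based on whether the cache directory is already populated. A minimal shell sketch, assuming an illustrative path and cache size:
CACHE_DIR=/tmp/llama3_70b_recipe_cache/   # illustrative path
if [ -d "$CACHE_DIR" ] && [ -n "$(ls -A "$CACHE_DIR" 2>/dev/null)" ]; then
    # Cache already populated: replay existing recipes
    export PT_HPU_RECIPE_CACHE_CONFIG="${CACHE_DIR},False,8192"
else
    # First run: store new recipes
    export PT_HPU_RECIPE_CACHE_CONFIG="${CACHE_DIR},True,8192"
fi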
Note
The graph compilation process includes two stages: GC graph compilation and HPU graph compilation. When PT_HPU_RECIPE_CACHE_CONFIG is used, the GC stage is skipped by reusing cached graphs, which significantly reduces overall compilation time. However, the HPU graph compilation step is still performed.
The graphs have to be regenerated in the following cases:
PyTorch container or Gaudi software version changes.
Platform changes (e.g., Gaudi 2 to Gaudi 3).
Model tensor parallelism or data type changes (e.g., BF16 to FP8 or FP8 to BF16).
Storage Recommendations¶
Use a local disk when the cache is shared across processes (scale-up). Avoid remote filesystems (e.g., NFS) because they do not support the required file locking.
For Kubernetes:
Store cache in PVC/NFS.
Copy it to local disk before use.
For a usage example, refer to Intel Gaudi Tutorials.
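For example, a pod startup step might copy the recipes from the mounted volume to local disk before launching the server. A minimal sketch, assuming the PVC/NFS volume is mounted at /mnt/recipe-cache (both paths are illustrative):
# Copy recipes from the mounted PVC/NFS volume to local disk
mkdir -p /tmp/llama3_8b_recipe_cache
cp -r /mnt/recipe-cache/. /tmp/llama3_8b_recipe_cache/
# Replay from the local copy
export PT_HPU_RECIPE_CACHE_CONFIG='/tmp/llama3_8b_recipe_cache/',False,8192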
Deployment with vLLM¶
Add the cache environment variable to the serving command, as shown in the example below for Llama 3.1 8B:
# Store in cache
export PT_HPU_RECIPE_CACHE_CONFIG='/tmp/llama3_8b_recipe_cache/',True,8192
# Replay from cache
export PT_HPU_RECIPE_CACHE_CONFIG='/tmp/llama3_8b_recipe_cache/',False,8192
VLLM_PROMPT_BS_BUCKET_MAX=256 \
VLLM_DECODE_BS_BUCKET_MIN=128 \
VLLM_DECODE_BS_BUCKET_STEP=128 \
VLLM_DECODE_BS_BUCKET_MAX=128 \
VLLM_PROMPT_SEQ_BUCKET_MAX=1024 \
VLLM_DECODE_BLOCK_BUCKET_MAX=1024 \
PT_HPU_WEIGHT_SHARING=0 PT_HPU_MAX_COMPOUND_OP_SIZE=30 PT_HPU_LAZY_MODE=1 PT_HPU_ENABLE_LAZY_COLLECTIVES=true vllm serve meta-llama/Llama-3.1-8B-instruct -tp 1 --weights-load-device cpu --max-model-len 8192
Result:
Precision | Without Cache | With Cache | Time Reduction
BF16      | 66 sec        | 23 sec     | ~65% faster
FP8       | 504 sec       | 34 sec     | ~93% faster
No changes are required in the Dockerfile because the recipe cache is specific to the model and use case. Use the -e flag to set the environment variable when starting the container:
-e PT_HPU_RECIPE_CACHE_CONFIG='/tmp/llama3_8b_recipe_cache/',True,8192
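For example, a container launch might look like the following. This is a minimal sketch: the image name (vllm-gaudi:latest) and the host cache path (/data/recipe_cache) are illustrative, and the -v mount is assumed so the stored recipes persist across container restarts:
docker run -it --runtime=habana \
  -e HABANA_VISIBLE_DEVICES=all \
  -e PT_HPU_RECIPE_CACHE_CONFIG='/tmp/llama3_8b_recipe_cache/',True,8192 \
  -v /data/recipe_cache:/tmp/llama3_8b_recipe_cache \
  vllm-gaudi:latest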
Bucket Management¶
vLLM warmup time is determined by the number of HPU graphs that must be compiled to support dynamic shapes, which are influenced by the batch_size and sequence_length. For more details, see Bucketing mechanism.
The following parameters define the upper limit for graph compilation. Setting them based on max_model_len ensures that additional graphs are not compiled during runtime:
Sequence length max (VLLM_PROMPT_SEQ_BUCKET_MAX): max_model_len
Block size max (VLLM_DECODE_BLOCK_BUCKET_MAX): max(128, (max_num_seqs*2048)/block_size)
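For example, a worked calculation using illustrative values (max_model_len = 2048, max_num_seqs = 128, block_size = 128), following the formulas above:
# VLLM_PROMPT_SEQ_BUCKET_MAX   = max_model_len = 2048
# VLLM_DECODE_BLOCK_BUCKET_MAX = max(128, (max_num_seqs * 2048) / block_size)
#                              = max(128, (128 * 2048) / 128) = 2048
export VLLM_PROMPT_SEQ_BUCKET_MAX=2048
export VLLM_DECODE_BLOCK_BUCKET_MAX=2048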
In addition, it is recommended to follow the guidelines for setting initial bucket parameters mentioned in Recommended vLLM Parameters.
Experimental Features¶
Exponential Bucketing¶
Setting VLLM_EXPONENTIAL_BUCKETING=True enables exponential bucketing instead of the default linear grid method. This can reduce the number of buckets and warmup time by up to 80%, while generally maintaining equivalent inference performance. However, in some configurations, it may cause a performance drop due to increased padding. This setting is particularly effective for BF16 and FP8 models.
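A minimal sketch of enabling the feature on a serving command (model name and flags carried over from the earlier Llama 3.1 8B example; other options omitted for brevity):
# Enable exponential bucketing instead of the default linear grid
VLLM_EXPONENTIAL_BUCKETING=True \
vllm serve meta-llama/Llama-3.1-8B-instruct -tp 1 --max-model-len 8192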
Runtime Scale Patching¶
vLLM warmup time for FP8 models is significantly longer than for BF16 due to additional graph compilations triggered by varying constant scale values in quantized model layers.
FP8 warmup time can be reduced by setting the RUNTIME_SCALE_PATCHING=1 environment variable and selecting a hardware-aligned per-tensor scale_method provided by the INC JSON config. This feature is recommended for larger models (e.g., 70B or 405B). When combined with VLLM_EXPONENTIAL_BUCKETING for FP8 models, it can reduce warmup time by up to 90%.
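A minimal sketch of the relevant environment variables for an FP8 run; the INC JSON path is a placeholder, and pointing QUANT_CONFIG at an INC config that selects a hardware-aligned per-tensor scale_method is an assumption about the deployment setup. The vllm serve command is then launched as usual for FP8:
# Reduce FP8 warmup time with runtime scale patching (Lazy mode only)
export RUNTIME_SCALE_PATCHING=1
export PT_HPU_LAZY_MODE=1
# Optionally combine with exponential bucketing for a larger reduction
export VLLM_EXPONENTIAL_BUCKETING=True
# Assumed: INC JSON config with a hardware-aligned per-tensor scale_method
export QUANT_CONFIG=/path/to/inc_quant_config.json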
Note
This feature reduces FP8 warmup time but may lower model throughput by 5-10%. Future releases will improve performance and extend support to more ops.
Available only with Lazy mode (PT_HPU_LAZY_MODE=1). Support of torch.compile will be added in subsequent releases.
Supports Llama workloads using FP8 execution of Linear layers, and casting ops between BF16 and FP8. MoE and Convolution ops are not yet supported.
FSDPA on Gaudi 2 has a known accuracy issue when used with vLLM for single-card Llama workloads. To bypass this issue, use one of the following workarounds:
Exclude fused_scaled_dot_product_attention from INC quantization by adding it to the blocklist as described in the INC JSON config. This forces FSDPA to run in higher precision.
Set VLLM_PROMPT_USE_FUSEDSDPA=0 to use the standard scaled_dot_product_attention op instead, enabling quantized execution.