Managing and Reducing vLLM Warmup Time

This section provides guidance on reducing warmup time during vLLM model deployment on Intel® Gaudi® accelerators. It outlines the use of HPU graph caching, bucketing strategies, and experimental features to improve model performance. For more information about the warmup process, see Warmup.

Reducing Warmup Time with HPU Graph Caching

Intel Gaudi software supports caching of compiled HPU graphs using the PT_HPU_RECIPE_CACHE_CONFIG environment variable. This can significantly reduce startup time by reusing previously compiled graphs. For more details, see HPU Graph Capture.

Configuration

The environment variable is set using the following format:

export PT_HPU_RECIPE_CACHE_CONFIG=<RECIPE_CACHE_PATH>,<RECIPE_CACHE_DELETE>,<RECIPE_CACHE_SIZE_MB>
  • RECIPE_CACHE_PATH - Sets the directory where compiled graph recipes are stored.

  • RECIPE_CACHE_DELETE:

    • True - Clears existing contents before storing new graph-compiled recipes.

    • False - Reuses the graph-compiled recipes stored in RECIPE_CACHE_PATH, which speeds up warmup.

  • RECIPE_CACHE_SIZE_MB - Sets the maximum size of the cache directory in MB. If the cache size limit is reached, the PyTorch bridge automatically deletes the oldest recipes (based on file creation time). It is recommended to adjust the cache directory size according to the model and use case requirements.

Examples:

  • First-time run (store new recipes):

    export PT_HPU_RECIPE_CACHE_CONFIG='/tmp/llama3_70b_recipe_cache/',True,8192
    
  • Subsequent run (reuse recipes):

    export PT_HPU_RECIPE_CACHE_CONFIG='/tmp/llama3_70b_recipe_cache/',False,8192
    

Note

  • The graph compilation process includes two stages: GC graph compilation and HPU graph compilation. When PT_HPU_RECIPE_CACHE_CONFIG is used, the GC stage is skipped by reusing cached graphs, which significantly reduces overall compilation time. However, the HPU graph compilation step is still performed.

  • The graphs must be regenerated in the following cases (see the sketch below):

    • PyTorch container or Gaudi software version changes.

    • Platform changes (e.g., Gaudi 2 to Gaudi 3).

    • Model tensor parallelism or data type changes (e.g., BF16 to FP8 or FP8 to BF16).
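
A simple way to avoid reusing stale recipes is to key the cache directory on these factors. The following sketch assumes the version, platform, tensor parallelism, and data type are known to the launcher script; all values shown are placeholders:

# Placeholder values - set these to match the actual deployment.
SW_VERSION=1.21.0      # Intel Gaudi software / container version (example value)
PLATFORM=gaudi3        # e.g., gaudi2 or gaudi3
TP=1                   # tensor parallelism degree
DTYPE=bf16             # bf16 or fp8
export PT_HPU_RECIPE_CACHE_CONFIG="/tmp/recipe_cache_${SW_VERSION}_${PLATFORM}_tp${TP}_${DTYPE}/",False,8192
# (use True instead of False for the very first run with a given configuration)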

Storage Recommendations

  • Use a local disk when the cache is shared across processes (scale-up). Avoid remote filesystems (e.g., NFS) because they do not support the required file locking.

  • For Kubernetes:

    • Store cache in PVC/NFS.

    • Copy it to local disk before use, as sketched below.

    For a usage example, refer to Intel Gaudi Tutorials.
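
The copy step can be as simple as the following sketch, assuming the PVC or NFS share is mounted at /mnt/recipe-cache (a hypothetical mount point) and the local target matches the path used in PT_HPU_RECIPE_CACHE_CONFIG:

# Copy the persisted recipes from the shared mount to local disk,
# then point the cache config at the local copy in reuse mode.
mkdir -p /tmp/llama3_8b_recipe_cache/
cp -r /mnt/recipe-cache/. /tmp/llama3_8b_recipe_cache/
export PT_HPU_RECIPE_CACHE_CONFIG='/tmp/llama3_8b_recipe_cache/',False,8192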

Deployment with vLLM

Add the cache parameter to the serving command, as shown in the example below for Llama 3.1 8B:

# First-time run - store recipes in the cache
export PT_HPU_RECIPE_CACHE_CONFIG='/tmp/llama3_8b_recipe_cache/',True,8192
# Subsequent runs - replay recipes from the cache
export PT_HPU_RECIPE_CACHE_CONFIG='/tmp/llama3_8b_recipe_cache/',False,8192
VLLM_PROMPT_BS_BUCKET_MAX=256 \
VLLM_DECODE_BS_BUCKET_MIN=128 \
VLLM_DECODE_BS_BUCKET_STEP=128 \
VLLM_DECODE_BS_BUCKET_MAX=128 \
VLLM_PROMPT_SEQ_BUCKET_MAX=1024 \
VLLM_DECODE_BLOCK_BUCKET_MAX=1024 \
PT_HPU_WEIGHT_SHARING=0 \
PT_HPU_MAX_COMPOUND_OP_SIZE=30 \
PT_HPU_LAZY_MODE=1 \
PT_HPU_ENABLE_LAZY_COLLECTIVES=true \
vllm serve meta-llama/Llama-3.1-8B-instruct -tp 1 --weights-load-device cpu --max-model-len 8192

Result:

Precision    Without Cache    With Cache    Time Reduction
BF16         66 sec           23 sec        ~65% faster
FP8          504 sec          34 sec        ~93% faster

No changes are required in the Dockerfile, as the recipe cache is specific to the model and use case. Use the -e flag to set the environment variable at container launch:

-e PT_HPU_RECIPE_CACHE_CONFIG='/tmp/llama3_8b_recipe_cache/',True,8192
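
For example, a container launch might look like the sketch below. The image name and host cache path are placeholders, and the runtime flags follow the typical Gaudi container setup; adjust them to your deployment:

docker run -it --runtime=habana \
  -e HABANA_VISIBLE_DEVICES=all \
  -e PT_HPU_RECIPE_CACHE_CONFIG='/tmp/llama3_8b_recipe_cache/',True,8192 \
  -v /host/path/recipe_cache:/tmp/llama3_8b_recipe_cache \
  <your-vllm-gaudi-image>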

Bucket Management

vLLM warmup time is determined by the number of HPU graphs that must be compiled to support dynamic shapes, which are influenced by the batch_size and sequence_length. For more details, see Bucketing mechanism.

The following parameters define the upper limit for graph compilation. Setting them based on max_model_len ensures that additional graphs are not compiled during runtime:

  • Sequence length max (VLLM_PROMPT_SEQ_BUCKET_MAX): max_model_len

  • Block size max (VLLM_DECODE_BLOCK_BUCKET_MAX): max(128, (max_num_seqs*2048)/block_size)

In addition, it is recommended to follow the guidelines for setting initial bucket parameters mentioned in Recommended vLLM Parameters.
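
For illustration, with a hypothetical configuration of max_model_len=8192, max_num_seqs=128, and block_size=128, the upper limits above work out to:

# VLLM_PROMPT_SEQ_BUCKET_MAX   = max_model_len = 8192
# VLLM_DECODE_BLOCK_BUCKET_MAX = max(128, (max_num_seqs * 2048) / block_size)
#                              = max(128, (128 * 2048) / 128) = 2048
export VLLM_PROMPT_SEQ_BUCKET_MAX=8192
export VLLM_DECODE_BLOCK_BUCKET_MAX=2048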

Experimental Features

Exponential Bucketing

Setting VLLM_EXPONENTIAL_BUCKETING=True enables exponential bucketing instead of the default linear grid method. This can reduce the number of buckets and warmup time by up to 80%, while generally maintaining equivalent inference performance. However, in some configurations, it may cause a performance drop due to increased padding. This setting is particularly effective for BF16 and FP8 models.
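
Enabling the feature only requires setting the variable before starting the server; the serving arguments in the sketch below are illustrative:

# Use exponential bucketing instead of the default linear grid.
VLLM_EXPONENTIAL_BUCKETING=True \
PT_HPU_LAZY_MODE=1 \
vllm serve meta-llama/Llama-3.1-8B-instruct -tp 1 --max-model-len 8192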

Runtime Scale Patching

vLLM warmup time for FP8 models is significantly longer than for BF16 due to additional graph compilations triggered by varying constant scale values in quantized model layers.

FP8 warmup time can be reduced by setting the RUNTIME_SCALE_PATCHING=1 environment variable and selecting a hardware-aligned per-tensor scale_method provided by the INC JSON config. This feature is recommended for larger models (e.g., 70B or 405B). When combined with VLLM_EXPONENTIAL_BUCKETING for FP8 models, it can reduce warmup time by up to 90%.
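
A sketch of enabling the feature for an FP8 run is shown below. The QUANT_CONFIG variable and its path are assumptions based on a typical INC quantization setup; point it at a JSON config whose scale_method is hardware-aligned per-tensor:

# Runtime scale patching requires Lazy mode and an INC config with a
# hardware-aligned per-tensor scale_method.
export RUNTIME_SCALE_PATCHING=1
export PT_HPU_LAZY_MODE=1
export QUANT_CONFIG=/path/to/maxabs_quant_hw.json   # hypothetical INC JSON config path
export VLLM_EXPONENTIAL_BUCKETING=True              # optional: combine for up to ~90% reduction
# ...then launch vllm serve with the usual FP8 (INC) serving arguments.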

Note

  • This feature reduces FP8 warmup time but may lower model throughput by 5-10%. Future releases will improve performance and extend support to more ops.

  • Available only with Lazy mode (PT_HPU_LAZY_MODE=1). Support for torch.compile will be added in subsequent releases.

  • Supports Llama workloads using FP8 execution of Linear layers, and casting ops between BF16 and FP8. MoE and Convolution ops are not yet supported.

  • FSDPA on Gaudi 2 has a known accuracy issue when used with vLLM for single-card Llama workloads. To bypass this issue:

    • Exclude fused_scaled_dot_product_attention from INC quantization by adding it to the blocklist as described in the INC JSON config. This forces FSDPA to run in higher precision.

    • Set VLLM_PROMPT_USE_FUSEDSDPA=0 to use the standard scaled_dot_product_attention op instead, enabling quantized execution (see the sketch below).
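
The second workaround is a one-line setting, sketched below (the first instead requires adding fused_scaled_dot_product_attention to the blocklist in your INC JSON config):

# Bypass the Gaudi 2 single-card FSDPA accuracy issue by falling back to the
# standard scaled_dot_product_attention op, which can run quantized.
export VLLM_PROMPT_USE_FUSEDSDPA=0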