Managing and Reducing vLLM Warmup Time

This section provides guidance on reducing warmup time during vLLM model deployment on Intel® Gaudi® accelerators. It outlines the use of HPU graph caching, bucketing strategies, and experimental features to improve model performance. For more information about the warmup process, see Warmup.

Reducing Warmup Time with HPU Graph Caching

Intel Gaudi software supports caching of compiled HPU graphs using the PT_HPU_RECIPE_CACHE_CONFIG environment variable. This can significantly reduce startup time by reusing previously compiled graphs. For more details, see HPU Graph Capture.

Configuration

The environment variable is set using the following format:

export PT_HPU_RECIPE_CACHE_CONFIG=<RECIPE_CACHE_PATH>,<RECIPE_CACHE_DELETE>,<RECIPE_CACHE_SIZE_MB>
  • RECIPE_CACHE_PATH - Sets the directory where compiled graph recipes are stored.

  • RECIPE_CACHE_DELETE:

    • True - Clears existing contents before storing new graph-compiled recipes.

    • False - Reuses the graph-compiled recipes stored in RECIPE_CACHE_PATH, which speeds up warmup.

  • RECIPE_CACHE_SIZE_MB - Sets the maximum size of the cache directory in MB. If the cache size limit is reached, the PyTorch bridge automatically deletes the oldest recipes (based on file creation time). It is recommended to adjust the cache directory size according to the model and use case requirements.

Examples:

  • First-time run (store new recipes):

    export PT_HPU_RECIPE_CACHE_CONFIG='/tmp/llama3_70b_recipe_cache/',True,8192
    
  • Subsequent run (reuse recipes):

    export PT_HPU_RECIPE_CACHE_CONFIG='/tmp/llama3_70b_recipe_cache/',False,8192
    

Note

  • The graph compilation process includes two stages: GC graph compilation and HPU graph compilation. When PT_HPU_RECIPE_CACHE_CONFIG is used, the GC stage is skipped by reusing cached graphs, which significantly reduces overall compilation time. However, the HPU graph compilation step is still performed.

  • The graphs must be regenerated in the following cases (see the sketch below):

    • PyTorch container or Gaudi software version changes.

    • Platform changes (e.g., Gaudi 2 to Gaudi 3).

    • Model tensor parallelism or data type changes (e.g., BF16 to FP8 or FP8 to BF16).
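
A simple way to avoid reusing stale recipes is to key the cache directory on these factors. The following sketch assumes the version, platform, tensor parallelism, and data type are known to the launcher script; all values shown are placeholders:

# Placeholder values - set these to match the actual deployment.
SW_VERSION=1.21.0      # Intel Gaudi software / container version (example value)
PLATFORM=gaudi3        # e.g., gaudi2 or gaudi3
TP=1                   # tensor parallelism degree
DTYPE=bf16             # bf16 or fp8
export PT_HPU_RECIPE_CACHE_CONFIG="/tmp/recipe_cache_${SW_VERSION}_${PLATFORM}_tp${TP}_${DTYPE}/",False,8192
# (use True instead of False for the very first run with a given configuration)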

Storage Recommendations

  • Use a local disk when the cache is shared across processes (scale-up). Avoid remote filesystems (e.g., NFS) because they do not support the required file locking.

  • For Kubernetes:

    • Store cache in PVC/NFS.

    • Copy it to local disk before use, as sketched below.

    For a usage example, refer to Intel Gaudi Tutorials.
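
The copy step can be as simple as the following sketch, assuming the PVC or NFS share is mounted at /mnt/recipe-cache (a hypothetical mount point) and the local target matches the path used in PT_HPU_RECIPE_CACHE_CONFIG:

# Copy the persisted recipes from the shared mount to local disk,
# then point the cache config at the local copy in reuse mode.
mkdir -p /tmp/llama3_8b_recipe_cache/
cp -r /mnt/recipe-cache/. /tmp/llama3_8b_recipe_cache/
export PT_HPU_RECIPE_CACHE_CONFIG='/tmp/llama3_8b_recipe_cache/',False,8192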

Deployment with vLLM

Add the cache parameter to the serving command, as shown in the example below for Llama 3.1 8B:

# First-time run - store recipes in the cache
export PT_HPU_RECIPE_CACHE_CONFIG='/tmp/llama3_8b_recipe_cache/',True,8192
# Subsequent runs - replay recipes from the cache
export PT_HPU_RECIPE_CACHE_CONFIG='/tmp/llama3_8b_recipe_cache/',False,8192
VLLM_PROMPT_BS_BUCKET_MAX=256 \
VLLM_DECODE_BS_BUCKET_MIN=128 \
VLLM_DECODE_BS_BUCKET_STEP=128 \
VLLM_DECODE_BS_BUCKET_MAX=128 \
VLLM_PROMPT_SEQ_BUCKET_MAX=1024 \
VLLM_DECODE_BLOCK_BUCKET_MAX=1024 \
PT_HPU_WEIGHT_SHARING=0 \
PT_HPU_MAX_COMPOUND_OP_SIZE=30 \
PT_HPU_LAZY_MODE=1 \
PT_HPU_ENABLE_LAZY_COLLECTIVES=true \
vllm serve meta-llama/Llama-3.1-8B-instruct -tp 1 --weights-load-device cpu --max-model-len 8192

Result:

Precision    Without Cache    With Cache    Time Reduction
BF16         66 sec           23 sec        ~65% faster
FP8          504 sec          34 sec        ~93% faster

No changes are required in the Dockerfile, as the recipe cache is specific to the model and use case. Use the -e flag to set the environment variable at container launch:

-e PT_HPU_RECIPE_CACHE_CONFIG='/tmp/llama3_8b_recipe_cache/',True,8192
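
For example, a container launch might look like the sketch below. The image name and host cache path are placeholders, and the runtime flags follow the typical Gaudi container setup; adjust them to your deployment:

docker run -it --runtime=habana \
  -e HABANA_VISIBLE_DEVICES=all \
  -e PT_HPU_RECIPE_CACHE_CONFIG='/tmp/llama3_8b_recipe_cache/',True,8192 \
  -v /host/path/recipe_cache:/tmp/llama3_8b_recipe_cache \
  <your-vllm-gaudi-image>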

Bucket Management

vLLM warmup time is determined by the number of HPU graphs that must be compiled to support dynamic shapes, which are influenced by the batch_size and sequence_length. For more details, see Bucketing mechanism.

The following parameters define the upper limit for graph compilation. Setting them based on max_model_len ensures that additional graphs are not compiled during runtime:

  • Sequence length max (VLLM_PROMPT_SEQ_BUCKET_MAX): max_model_len

  • Block size max (VLLM_DECODE_BLOCK_BUCKET_MAX): max(128, (max_num_seqs*2048)/block_size)

In addition, it is recommended to follow the guidelines for setting initial bucket parameters mentioned in Recommended vLLM Parameters.
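
For illustration, with a hypothetical configuration of max_model_len=8192, max_num_seqs=128, and block_size=128, the upper limits above work out to:

# VLLM_PROMPT_SEQ_BUCKET_MAX   = max_model_len = 8192
# VLLM_DECODE_BLOCK_BUCKET_MAX = max(128, (max_num_seqs * 2048) / block_size)
#                              = max(128, (128 * 2048) / 128) = 2048
export VLLM_PROMPT_SEQ_BUCKET_MAX=8192
export VLLM_DECODE_BLOCK_BUCKET_MAX=2048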

Experimental Features

Exponential Bucketing

Setting VLLM_EXPONENTIAL_BUCKETING=True enables exponential bucketing instead of the default linear grid method. This can reduce the number of buckets and warmup time by up to 80%, while generally maintaining equivalent inference performance. However, in some configurations, it may cause a performance drop due to increased padding. This setting is particularly effective for BF16 and FP8 models.
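
Enabling the feature only requires setting the variable before starting the server; the serving arguments in the sketch below are illustrative:

# Use exponential bucketing instead of the default linear grid.
VLLM_EXPONENTIAL_BUCKETING=True \
PT_HPU_LAZY_MODE=1 \
vllm serve meta-llama/Llama-3.1-8B-instruct -tp 1 --max-model-len 8192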

Runtime Scale Patching

vLLM warmup time for FP8 models is significantly longer than for BF16 due to additional graph compilations triggered by varying constant scale values in quantized model layers.

FP8 warmup time can be reduced by setting the RUNTIME_SCALE_PATCHING=1 environment variable and selecting a hardware-aligned per-tensor scale_method provided by the INC JSON config. This feature is recommended for larger models (e.g., 70B or 405B). When combined with VLLM_EXPONENTIAL_BUCKETING for FP8 models, it can reduce warmup time by up to 90%.
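
A sketch of enabling the feature for an FP8 run is shown below. The QUANT_CONFIG variable and its path are assumptions based on a typical INC quantization setup; point it at a JSON config whose scale_method is hardware-aligned per-tensor:

# Runtime scale patching requires Lazy mode and an INC config with a
# hardware-aligned per-tensor scale_method.
export RUNTIME_SCALE_PATCHING=1
export PT_HPU_LAZY_MODE=1
export QUANT_CONFIG=/path/to/maxabs_quant_hw.json   # hypothetical INC JSON config path
export VLLM_EXPONENTIAL_BUCKETING=True              # optional: combine for up to ~90% reduction
# ...then launch vllm serve with the usual FP8 (INC) serving arguments.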

Note

  • This feature reduces FP8 warmup time but may lower model throughput by 5-10%. Future releases will improve performance and extend support to more ops.

  • Available only with Lazy mode (PT_HPU_LAZY_MODE=1). Support for torch.compile will be added in subsequent releases.

  • Supports Llama workloads using FP8 execution of Linear layers, and casting ops between BF16 and FP8. MoE and Convolution ops are not yet supported.

  • FSDPA on Gaudi 2 has a known accuracy issue when used with vLLM for single-card Llama workloads. To bypass this issue:

    • Exclude fused_scaled_dot_product_attention from INC quantization by adding it to the blocklist as described in the INC JSON config. This forces FSDPA to run in higher precision.

    • Set VLLM_PROMPT_USE_FUSEDSDPA=0 to use the standard scaled_dot_product_attention op instead, enabling quantized execution (see the sketch below).
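
The second workaround is a one-line setting, sketched below (the first instead requires adding fused_scaled_dot_product_attention to the blocklist in your INC JSON config):

# Bypass the Gaudi 2 single-card FSDPA accuracy issue by falling back to the
# standard scaled_dot_product_attention op, which can run quantized.
export VLLM_PROMPT_USE_FUSEDSDPA=0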