Managing and Reducing vLLM Warmup Time

This section provides guidance on reducing warmup time when deploying vLLM models on Intel® Gaudi® accelerators. It outlines the use of HPU graph caching, bucketing strategies, and experimental features to improve model performance. For more information about the warmup process, see Warmup.

Reducing Warmup Time with HPU Graph Caching

Intel Gaudi software supports caching of compiled HPU graphs using the PT_HPU_RECIPE_CACHE_CONFIG environment variable. This can significantly reduce startup time by reusing previously compiled graphs. For more details, see HPU Graph Capture.

Configuration

The environment variable is set using the following format:

export PT_HPU_RECIPE_CACHE_CONFIG=<RECIPE_CACHE_PATH>,<RECIPE_CACHE_DELETE>,<RECIPE_CACHE_SIZE_MB>
  • RECIPE_CACHE_PATH - Sets the directory where compiled graph recipes are stored.

  • RECIPE_CACHE_DELETE:

    • True - Clears the existing cache contents before storing newly compiled graph recipes.

    • False - Reuses the compiled graph recipes stored in RECIPE_CACHE_PATH, which speeds up warmup.

  • RECIPE_CACHE_SIZE_MB - Sets the maximum size of the cache directory in MB. If the cache size limit is reached, the PyTorch bridge automatically deletes the oldest recipes (based on file creation time). It is recommended to adjust the cache directory size according to the model and use case requirements.

Examples:

  • First-time run (store new recipes):

    export PT_HPU_RECIPE_CACHE_CONFIG='/tmp/llama3_70b_recipe_cache/',True,8192
    
  • Subsequent run (reuse recipes):

    export PT_HPU_RECIPE_CACHE_CONFIG='/tmp/llama3_70b_recipe_cache/',False,8192
    

Note

  • The graph compilation process includes two stages: GC graph compilation and HPU graph compilation. When PT_HPU_RECIPE_CACHE_CONFIG is used, the GC stage is skipped by reusing cached graphs, which significantly reduces overall compilation time. However, the HPU graph compilation step is still performed.

  • The graph has to be regenerated in the following cases:

    • PyTorch container or Gaudi software version changes.

    • Platform changes (e.g., Gaudi 2 to Gaudi 3).

    • Model tensor parallelism or data type changes (e.g., BF16 to FP8 or FP8 to BF16).

Storage Recommendations

  • Use a local disk when the cache is shared across processes (scale-up). Avoid remote filesystems (e.g., NFS) because they do not support the required file locking.

  • For Kubernetes:

    • Store cache in PVC/NFS.

    • Copy it to local disk before use (see the sketch after this list).

    For a usage example, refer to Intel Gaudi Tutorials.
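
A minimal sketch of this flow is shown below. The PVC mount point (/mnt/recipe-cache-pvc) and the cache directory name are hypothetical placeholders; adapt them to your deployment:

# Hypothetical paths - adjust to your PVC/NFS mount and local scratch directory
cp -r /mnt/recipe-cache-pvc/llama3_8b_recipe_cache /tmp/llama3_8b_recipe_cache
# Replay from the local copy
export PT_HPU_RECIPE_CACHE_CONFIG='/tmp/llama3_8b_recipe_cache/',False,8192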

Deployment with vLLM

Add the cache environment variable to the serving command, as shown in the following example for Llama 3.1 8B:

# First run - store recipes in the cache
export PT_HPU_RECIPE_CACHE_CONFIG='/tmp/llama3_8b_recipe_cache/',True,8192
# Subsequent runs - replay recipes from the cache
export PT_HPU_RECIPE_CACHE_CONFIG='/tmp/llama3_8b_recipe_cache/',False,8192
VLLM_PROMPT_BS_BUCKET_MAX=256 \
VLLM_DECODE_BS_BUCKET_MIN=128 \
VLLM_DECODE_BS_BUCKET_STEP=128 \
VLLM_DECODE_BS_BUCKET_MAX=128 \
VLLM_PROMPT_SEQ_BUCKET_MAX=1024 \
VLLM_DECODE_BLOCK_BUCKET_MAX=1024 \
PT_HPU_WEIGHT_SHARING=0 \
PT_HPU_MAX_COMPOUND_OP_SIZE=30 \
PT_HPU_LAZY_MODE=1 \
PT_HPU_ENABLE_LAZY_COLLECTIVES=true \
vllm serve meta-llama/Llama-3.1-8B-instruct -tp 1 --weights-load-device cpu --max-model-len 8192

Result:

Precision      Without Cache      With Cache      Time Reduction
BF16           66 sec             23 sec          ~65% faster
FP8            504 sec            34 sec          ~93% faster

No changes are required in the Dockerfile, since the recipe cache is specific to the model and use case. Instead, set the environment variable at container launch using the -e flag:

-e PT_HPU_RECIPE_CACHE_CONFIG='/tmp/llama3_8b_recipe_cache/',True,8192
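
As an illustrative sketch only (the image name, device flags, and host mount path below are placeholders, not part of any official command), a container launch could look like this:

docker run -it --runtime=habana \
  -e HABANA_VISIBLE_DEVICES=all \
  -e PT_HPU_RECIPE_CACHE_CONFIG='/tmp/llama3_8b_recipe_cache/',True,8192 \
  -v /tmp/llama3_8b_recipe_cache:/tmp/llama3_8b_recipe_cache \
  <your-vllm-gaudi-image>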

Bucket Management

vLLM warmup time is determined by the number of HPU graphs that must be compiled to support dynamic shapes, which depend on batch_size and sequence_length. For more details, see Bucketing mechanism.

The following parameters define the upper limit for graph compilation. Setting them based on max_model_len ensures that additional graphs are not compiled during runtime:

  • Sequence length max (VLLM_PROMPT_SEQ_BUCKET_MAX): max_model_len

  • Block size max (VLLM_DECODE_BLOCK_BUCKET_MAX): max(128, (max_num_seqs*2048)/block_size)
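
For example, with illustrative values of max_model_len=8192, max_num_seqs=128, and block_size=128, these limits work out as follows:

# Illustrative values: max_model_len=8192, max_num_seqs=128, block_size=128
export VLLM_PROMPT_SEQ_BUCKET_MAX=8192    # equal to max_model_len
export VLLM_DECODE_BLOCK_BUCKET_MAX=2048  # max(128, (128*2048)/128) = 2048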

In addition, it is recommended to follow the guidelines for setting initial bucket parameters mentioned in Recommended vLLM Parameters.

Exponential Bucketing

The VLLM_EXPONENTIAL_BUCKETING=True flag, enabled by default starting with the 1.21.0-post1 vLLM release, switches the bucketing strategy from linear to exponential. This can reduce the number of buckets and warmup time by up to 80%, while generally maintaining equivalent inference performance. However, in some configurations, it may cause a performance drop due to increased padding. This setting is particularly effective for BF16 and FP8 models. Linear bucketing can be enabled by setting VLLM_EXPONENTIAL_BUCKETING=False.
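For example, to control the strategy explicitly:

# Exponential bucketing (default since the 1.21.0-post1 release) - fewer buckets, faster warmup
export VLLM_EXPONENTIAL_BUCKETING=True
# Revert to linear bucketing if the added padding degrades performance
# export VLLM_EXPONENTIAL_BUCKETING=False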

Experimental Features

Runtime Scale Patching

vLLM warmup time for FP8 models is significantly longer than for BF16 due to additional graph compilations triggered by varying constant scale values in quantized model layers.

FP8 warmup time can be reduced by setting the RUNTIME_SCALE_PATCHING=1 environment variable and selecting a hardware-aligned per-tensor scale_method provided by the INC JSON config. This feature is recommended for larger models (e.g., 70B or 405B). When combined with VLLM_EXPONENTIAL_BUCKETING for FP8 models, it can reduce warmup time by up to 90%.
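
A minimal sketch is shown below. It assumes the INC quantization config is passed through the QUANT_CONFIG variable and already specifies a hardware-aligned per-tensor scale_method; the file path is a placeholder:

# Assumes an INC JSON config with a hardware-aligned per-tensor scale_method
# is passed via QUANT_CONFIG (path is a placeholder)
export QUANT_CONFIG=/path/to/inc_fp8_quant_config.json
export RUNTIME_SCALE_PATCHING=1
# Optionally combine with exponential bucketing for FP8 warmup reduction
export VLLM_EXPONENTIAL_BUCKETING=True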

Note

  • This feature reduces FP8 warmup time but may lower model throughput by 5-20%. Future releases will improve performance and extend support to more ops.

  • Supported with Lazy mode (PT_HPU_LAZY_MODE=1) and torch.compile.

  • Supports Llama workloads using FP8 execution of Linear and FSDPA layers, and casting ops between BF16 and FP8. MoE and Convolution ops are not yet supported.

Trivial Scales Optimization

The PT_HPU_H2D_TRIVIAL_SCALES_MODE flag controls the trivial scales optimization (i.e., scales equal to 1.0) in RUNTIME_SCALE_PATCHING mode. Enabling this optimization can increase warmup and compilation time due to the additional graphs generated, but may improve runtime performance by reducing the number of multiplication operations.

The following values are supported:

  • 0 - No optimization (default).

  • 1 - Removes scales equal to 1.0 in cast_to_fp8_v2 and cast_from_fp8, disabling the corresponding mult_fwd (multiplication) node.

  • 2 - Applies the same optimization as mode 1, and additionally removes reciprocal scales in fp8_gemm_v2.
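
For example, to apply the most aggressive mode while runtime scale patching is enabled:

export RUNTIME_SCALE_PATCHING=1
# 0 = no optimization (default), 1 = drop scales equal to 1.0 in casts,
# 2 = additionally drop reciprocal scales in fp8_gemm_v2
export PT_HPU_H2D_TRIVIAL_SCALES_MODE=2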