Managing and Reducing SGLang Warmup Time
This section provides guidance on reducing warmup time during SGLang model deployment on Intel® Gaudi® accelerators. It outlines the use of HPU graph caching, optimization strategies, and configuration parameters to improve model performance. For more information about the warmup process, see the SGLang Gaudi Documentation.
Reducing Warmup Time with HPU Graph Caching
Intel Gaudi software supports caching of compiled HPU graphs using the PT_HPU_RECIPE_CACHE_CONFIG environment variable. This can significantly reduce startup time by reusing previously compiled graphs.
Configuration
The environment variable is set using the following format:
export PT_HPU_RECIPE_CACHE_CONFIG=<RECIPE_CACHE_PATH>,<RECIPE_CACHE_DELETE>,<RECIPE_CACHE_SIZE_MB>
RECIPE_CACHE_PATH - Sets the directory used to store compiled graph recipes.
RECIPE_CACHE_DELETE:
  True - Clears existing contents before storing new graph-compiled recipes.
  False - Uses the graph-compiled recipes stored in the RECIPE_CACHE_PATH and speeds up the warmup.
RECIPE_CACHE_SIZE_MB - Sets the maximum size of the cache directory in MB. If the cache size limit is reached, the PyTorch bridge automatically deletes the oldest recipes (based on file creation time). It is recommended to adjust the cache directory size according to the model and use case requirements.
Examples:
First-time run (store new recipes):
export PT_HPU_RECIPE_CACHE_CONFIG='/tmp/llama3_8b_sglang_cache/',True,8192
Subsequent run (reuse recipes):
export PT_HPU_RECIPE_CACHE_CONFIG='/tmp/llama3_8b_sglang_cache/',False,8192
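To verify how much of the configured RECIPE_CACHE_SIZE_MB budget is already in use, the cache directory can be inspected with standard tools. A minimal check, assuming the same cache path as in the examples above:
# Report the on-disk size of the recipe cache and the number of stored recipe files
du -sh /tmp/llama3_8b_sglang_cache/
find /tmp/llama3_8b_sglang_cache/ -type f | wc -l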
Note
The graph compilation process includes two stages: GC graph compilation and HPU graph compilation. When PT_HPU_RECIPE_CACHE_CONFIG is used, the GC stage is skipped by reusing cached graphs, which significantly reduces overall compilation time. However, the HPU graph compilation step is still performed.
The graph has to be regenerated in the following cases:
PyTorch container or Gaudi software version changes.
Platform changes (e.g., Gaudi 2 to Gaudi 3).
Model tensor parallelism or data type changes (e.g., BF16 to FP8 or FP8 to BF16).
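Because recipes compiled for one configuration cannot be replayed after any of the changes listed above, it can help to encode the configuration in the cache path so each setup gets its own directory. The sketch below is one possible naming scheme; the model tag, data type, tensor-parallel size, and software version values are illustrative placeholders, not SGLang or Gaudi settings:
# Hypothetical naming scheme: one cache directory per model/data type/TP size/software release
MODEL_TAG="llama3_8b"
DTYPE="bf16"          # e.g., switch to fp8 when serving a quantized model
TP_SIZE=1
GAUDI_SW="1.21.0"     # substitute the installed Gaudi software release
CACHE_DIR="/tmp/sglang_cache/${MODEL_TAG}_${DTYPE}_tp${TP_SIZE}_sw${GAUDI_SW}"
mkdir -p "${CACHE_DIR}"
# Use True on the first run for a new directory, False on later runs
export PT_HPU_RECIPE_CACHE_CONFIG="${CACHE_DIR},True,8192"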
Storage Recommendations
Use local disk when the cache is shared across processes (scale-up). Avoid remote filesystems (e.g., NFS), as file locking is not supported on them.
For Kubernetes:
Store the cache in a PVC or on NFS.
Copy it to local disk before use, as shown in the sketch below.
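A minimal sketch of the copy step for a Kubernetes pod startup; the PVC mount point and local path below are assumptions and should be adjusted to the actual volume layout:
# Copy the shared recipe cache from the PVC/NFS mount to node-local disk
PVC_CACHE="/mnt/recipe-cache/llama3_8b_sglang_cache"   # assumed PVC/NFS mount path
LOCAL_CACHE="/tmp/llama3_8b_sglang_cache"              # node-local disk
mkdir -p "${LOCAL_CACHE}"
cp -r "${PVC_CACHE}/." "${LOCAL_CACHE}/"
# Replay from the local copy (False = reuse existing recipes)
export PT_HPU_RECIPE_CACHE_CONFIG="${LOCAL_CACHE},False,8192"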
Deployment with SGLang
Set the cache environment variable before the serving command, as shown in the example below for Llama 3.1 8B:
# First-time run: store new recipes in the cache
export PT_HPU_RECIPE_CACHE_CONFIG='/tmp/llama3_8b_sglang_cache/',True,8192
# Subsequent runs: replay from the cache
export PT_HPU_RECIPE_CACHE_CONFIG='/tmp/llama3_8b_sglang_cache/',False,8192
python -m sglang.launch_server \
--model-path meta-llama/Llama-3.1-8B-Instruct \
--tp-size 1 \
--max-total-tokens 8192 \
--host 0.0.0.0 \
--port 30000
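Once the server reports that it is ready, a quick request confirms that warmup completed and the endpoint is serving. The sketch below assumes the OpenAI-compatible route exposed by sglang.launch_server on the host and port configured above:
# Simple smoke test against the OpenAI-compatible endpoint
curl http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "meta-llama/Llama-3.1-8B-Instruct", "messages": [{"role": "user", "content": "Hello"}], "max_tokens": 16}'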