Managing and Reducing SGLang Warmup Time
This section provides guidance on reducing warmup time during SGLang model deployment on Intel® Gaudi® accelerators. It outlines the use of HPU graph caching, optimization strategies, and configuration parameters to improve model performance. For more information about the warmup process, see the SGLang Gaudi Documentation.
Reducing Warmup Time with HPU Graph Caching
Intel Gaudi software supports caching of compiled HPU graphs using the PT_HPU_RECIPE_CACHE_CONFIG environment variable. This can significantly reduce startup time by reusing previously compiled graphs.
Configuration
The environment variable is set using the following format:
export PT_HPU_RECIPE_CACHE_CONFIG=<RECIPE_CACHE_PATH>,<RECIPE_CACHE_DELETE>,<RECIPE_CACHE_SIZE_MB>
RECIPE_CACHE_PATH - Sets the directory used to store compiled graph recipes.
RECIPE_CACHE_DELETE:
  True - Clears existing contents before storing new graph-compiled recipes.
  False - Uses the graph-compiled recipes stored in the RECIPE_CACHE_PATH and speeds up the warmup.
RECIPE_CACHE_SIZE_MB - Sets the maximum size of the cache directory in MB. If the cache size limit is reached, the PyTorch bridge automatically deletes the oldest recipes (based on file creation time). It is recommended to adjust the cache directory size according to the model and use case requirements.
Examples:
First-time run (store new recipes):
export PT_HPU_RECIPE_CACHE_CONFIG='/tmp/llama3_8b_sglang_cache/',True,8192
Subsequent run (reuse recipes):
export PT_HPU_RECIPE_CACHE_CONFIG='/tmp/llama3_8b_sglang_cache/',False,8192
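To verify how much of the configured RECIPE_CACHE_SIZE_MB budget is already in use, the cache directory can be inspected with standard tools. A minimal check, assuming the same cache path as in the examples above:
# Report the on-disk size of the recipe cache and the number of stored recipe files
du -sh /tmp/llama3_8b_sglang_cache/
find /tmp/llama3_8b_sglang_cache/ -type f | wc -l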
Note
The graph compilation process includes two stages: GC graph compilation and HPU graph compilation. When PT_HPU_RECIPE_CACHE_CONFIG is used, the GC stage is skipped by reusing cached graphs, which significantly reduces overall compilation time. However, the HPU graph compilation step is still performed.
The graph has to be regenerated in the following cases:
PyTorch container or Gaudi software version changes.
Platform changes (e.g., Gaudi 2 to Gaudi 3).
Model tensor parallelism or data type changes (e.g., BF16 to FP8 or FP8 to BF16).
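Because recipes compiled for one configuration cannot be replayed after any of the changes listed above, it can help to encode the configuration in the cache path so each setup gets its own directory. The sketch below is one possible naming scheme; the model tag, data type, tensor-parallel size, and software version values are illustrative placeholders, not SGLang or Gaudi settings:
# Hypothetical naming scheme: one cache directory per model/data type/TP size/software release
MODEL_TAG="llama3_8b"
DTYPE="bf16"          # e.g., switch to fp8 when serving a quantized model
TP_SIZE=1
GAUDI_SW="1.21.0"     # substitute the installed Gaudi software release
CACHE_DIR="/tmp/sglang_cache/${MODEL_TAG}_${DTYPE}_tp${TP_SIZE}_sw${GAUDI_SW}"
mkdir -p "${CACHE_DIR}"
# Use True on the first run for a new directory, False on later runs
export PT_HPU_RECIPE_CACHE_CONFIG="${CACHE_DIR},True,8192"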
Storage Recommendations
Use local disk when the cache is shared across processes (scale-up). Avoid remote filesystems (e.g., NFS), as file locking is not supported on them.
For Kubernetes:
Store the cache in a PVC or on NFS.
Copy it to local disk before use, as shown in the sketch below.
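A minimal sketch of the copy step for a Kubernetes pod startup; the PVC mount point and local path below are assumptions and should be adjusted to the actual volume layout:
# Copy the shared recipe cache from the PVC/NFS mount to node-local disk
PVC_CACHE="/mnt/recipe-cache/llama3_8b_sglang_cache"   # assumed PVC/NFS mount path
LOCAL_CACHE="/tmp/llama3_8b_sglang_cache"              # node-local disk
mkdir -p "${LOCAL_CACHE}"
cp -r "${PVC_CACHE}/." "${LOCAL_CACHE}/"
# Replay from the local copy (False = reuse existing recipes)
export PT_HPU_RECIPE_CACHE_CONFIG="${LOCAL_CACHE},False,8192"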
Deployment with SGLang
Set the cache environment variable before the serving command, as shown in the example below for Llama 3.1 8B:
# First-time run: store new recipes in the cache
export PT_HPU_RECIPE_CACHE_CONFIG='/tmp/llama3_8b_sglang_cache/',True,8192
# Subsequent runs: replay from the cache
export PT_HPU_RECIPE_CACHE_CONFIG='/tmp/llama3_8b_sglang_cache/',False,8192
python -m sglang.launch_server \
--model-path meta-llama/Llama-3.1-8B-Instruct \
--tp-size 1 \
--max-total-tokens 8192 \
--host 0.0.0.0 \
--port 30000
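Once the server reports that it is ready, a quick request confirms that warmup completed and the endpoint is serving. The sketch below assumes the OpenAI-compatible route exposed by sglang.launch_server on the host and port configured above:
# Simple smoke test against the OpenAI-compatible endpoint
curl http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "meta-llama/Llama-3.1-8B-Instruct", "messages": [{"role": "user", "content": "Hello"}], "max_tokens": 16}'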