Inference Using SGLang¶
The following sections provide instructions for deploying models using SGLang with the Intel® Gaudi® AI accelerator. The document is based on the SGLang Inference Server documentation and the SGLang fork for Gaudi integration. The process involves:
Creating an SGLang Docker image for Gaudi
Building and installing the SGLang Inference Server
Sending an Inference Request
For frequently asked questions about using Gaudi with SGLang, see SGLang with Gaudi FAQs.
Creating a Docker Image for Gaudi¶
Since an SGLang server is launched within a Docker container, a Docker image tailored for Gaudi is needed. Follow the instructions below to create and run a Docker image for SGLang on Gaudi.
Prerequisites¶
Ensure you have Intel Gaudi drivers installed on your host system.
Install Docker and ensure it can access Gaudi devices.
Clone the SGLang repository with Gaudi support:
git clone https://github.com/HabanaAI/sglang-fork.git
cd sglang-fork
Building the Docker Image¶
Build the SGLang Docker image optimized for Gaudi:
docker build -f docker/Dockerfile.gaudi -t sglang-gaudi:latest .
Running the Docker Container¶
Launch the SGLang container with Gaudi device access:
docker run -it --runtime=habana \
  -e HABANA_VISIBLE_DEVICES=all \
  -e OMPI_MCA_btl_vader_single_copy_mechanism=none \
  --cap-add=sys_nice \
  --net=host \
  --ipc=host \
  -v /path/to/models:/models \
  sglang-gaudi:latest
Building and Installing SGLang for Gaudi¶
To build and install SGLang with Gaudi support, follow these steps:
Installation from Source¶
Clone the SGLang repository with Gaudi support:
git clone https://github.com/HabanaAI/sglang-fork.git
cd sglang-fork
Install the required dependencies:
pip install -e "python[all_hpu]"
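After installation, a quick sanity check is to import the package and print its version. Upstream SGLang exposes sglang.__version__; this is assumed to hold for the Gaudi fork as well:

python -c "import sglang; print(sglang.__version__)"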
Sending an Inference Request¶
There are multiple methods to send inference requests to SGLang on Gaudi:
OpenAI-Compatible Server: Use SGLang’s OpenAI-compatible API endpoint
Native SGLang API: Use SGLang’s native Python API for more advanced features
HTTP REST API: Send direct HTTP requests to the SGLang server
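As a quick preview of the third option, once a server is running (see the launch command in the next section), a prompt can be sent directly over HTTP. The /generate endpoint and payload below follow the upstream SGLang native REST API and are assumptions here; check the Gaudi fork's documentation for the exact schema:

curl http://localhost:30000/generate \
  -H "Content-Type: application/json" \
  -d '{
    "text": "The capital of France is",
    "sampling_params": {
      "max_new_tokens": 32,
      "temperature": 0
    }
  }'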
OpenAI-Compatible Server¶
Start the SGLang server with OpenAI-compatible API:
python -m sglang.launch_server \
    --model-path meta-llama/Meta-Llama-3.1-8B \
    --host 0.0.0.0 \
    --port 30000 \
    --tp-size 1 \
    --dtype bfloat16
Send requests using OpenAI Python client:
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:30000/v1",
    api_key="EMPTY"
)

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B",
    messages=[
        {"role": "user", "content": "What is the capital of France?"}
    ],
    max_tokens=100,
    temperature=0.7
)

print(response.choices[0].message.content)
Native SGLang API¶
Use SGLang’s native API for advanced features like structured generation:
import sglang as sgl
from sglang import function, system, user, assistant, gen

@function
def multi_turn_question(s, question_1, question_2):
    s += system("You are a helpful assistant.")
    s += user(question_1)
    s += assistant(gen("answer_1", max_tokens=256))
    s += user(question_2)
    s += assistant(gen("answer_2", max_tokens=256))

# Set the default backend
sgl.set_default_backend(sgl.Runtime(model_path="meta-llama/Meta-Llama-3.1-8B"))

# Run the function
state = multi_turn_question.run(
    question_1="What is the capital of the United States?",
    question_2="List two local attractions there."
)

# Print the generated answers
print("answer_1:", state["answer_1"])
print("answer_2:", state["answer_2"])
For a list of popular models and supported configurations validated on Gaudi, see the supported models documentation.
Understanding SGLang Logs¶
This section uses a basic example that runs the Meta-Llama-3.1-8B model on an SGLang server on Gaudi with default settings:
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B --dtype bfloat16 --tp-size 1 --host 0.0.0.0 --port 30000
The following shows the initial part of the server log:
INFO 2025-01-15 10:30:15 server.py:123] Loading model meta-llama/Meta-Llama-3.1-8B
INFO 2025-01-15 10:30:16 habana_worker.py:89] Initializing Gaudi device: hpu:0
INFO 2025-01-15 10:30:16 habana_worker.py:156] Model loading on hpu:0 took 12.8 GiB of device memory (12.8 GiB/94.62 GiB used) and 950 MiB of host memory
INFO 2025-01-15 10:30:17 habana_worker.py:178] HPU Graph compilation started
INFO 2025-01-15 10:30:18 habana_worker.py:201] Free device memory: 81.82 GiB, 73.64 GiB usable (memory_utilization=0.9), 65.27 GiB reserved for KV cache
INFO 2025-01-15 10:30:18 habana_executor.py:67] # HPU blocks: 4181, # CPU blocks: 512
INFO 2025-01-15 10:30:19 habana_worker.py:234] Cache engine initialization took 65.27 GiB of device memory (78.07 GiB/94.62 GiB used) and 1.2 GiB of host memory
This section shows the memory consumption trends of the chosen model. It includes the device memory used for loading the model weights, HPU Graph compilation, and the memory allocation for KV cache before the warmup phase begins. This information can be used to determine optimal memory utilization settings.
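As a rough illustration of how these numbers relate, the sketch below reproduces the log values with simple arithmetic. The exact accounting is implementation-dependent, so treat this as a back-of-the-envelope check rather than the server's actual formula:

# Back-of-the-envelope check using the values from the log excerpt above.
free_after_load_gib = 81.82   # free device memory after the model weights are loaded
memory_utilization = 0.9      # fraction of free memory the server is allowed to use
kv_cache_gib = 65.27          # memory reserved for the KV cache

usable_gib = free_after_load_gib * memory_utilization   # ~73.64 GiB usable
headroom_gib = usable_gib - kv_cache_gib                 # ~8.37 GiB left for HPU Graphs and overhead

print(f"usable: {usable_gib:.2f} GiB, remaining for graphs/overhead: {headroom_gib:.2f} GiB")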
The following shows the warmup phase logs:
INFO 2025-01-15 10:30:25 habana_model_runner.py:445] Warmup started with batch_sizes=[1, 2, 4, 8] and seq_lens=[128, 256, 512, 1024, 2048]
INFO 2025-01-15 10:30:27 habana_model_runner.py:467] Prefill warmup: batch_size=1, seq_len=128, time=0.12s
INFO 2025-01-15 10:30:28 habana_model_runner.py:467] Prefill warmup: batch_size=1, seq_len=256, time=0.15s
INFO 2025-01-15 10:30:30 habana_model_runner.py:467] Prefill warmup: batch_size=4, seq_len=1024, time=0.35s
INFO 2025-01-15 10:30:32 habana_model_runner.py:489] Decode warmup: batch_size=8, seq_len=128, time=0.08s
INFO 2025-01-15 10:30:33 habana_model_runner.py:512] Warmup finished in 8.2 secs, allocated 156 MiB of additional device memory
INFO 2025-01-15 10:30:33 server.py:234] SGLang server is ready at http://0.0.0.0:30000
After analyzing the warmup phase logs, you should have a good idea of how much free device memory remains for overhead and how much more could still be utilized by adjusting the memory utilization settings. Balance the memory requirements for HPU Graphs and the KV cache based on your workload.
Basic Troubleshooting for OOM Errors¶
Due to various factors such as available HPU memory, model size, and input sequence length, the standard inference command may not always work for your model, potentially leading to OOM errors. The following steps help mitigate OOM errors:
Increase --mem-fraction-static - This addresses insufficient available memory. SGLang pre-allocates the HPU cache using this fraction of memory. By increasing this value, you can provide more KV cache space.
Decrease --max-running-requests - This may reduce the number of concurrent requests in a batch, thereby requiring less KV cache space.
Increase --tp-size - This approach shards model weights across multiple HPUs, so each HPU has more memory available for KV cache.
Use --chunked-prefill-size - This breaks large prefill sequences into smaller chunks to reduce memory pressure.
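As an illustration, a launch command that combines several of these mitigations might look like the following. The specific values are examples only and must be tuned for your model, sequence lengths, and available HPU memory:

python -m sglang.launch_server \
    --model-path meta-llama/Meta-Llama-3.1-8B \
    --dtype bfloat16 \
    --tp-size 2 \
    --mem-fraction-static 0.95 \
    --max-running-requests 32 \
    --chunked-prefill-size 4096 \
    --host 0.0.0.0 \
    --port 30000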
During the development phase, when evaluating a model for inference on SGLang, you may skip the warmup phase of the server. This helps achieve faster testing turnaround times and can be set using the --disable-warmup flag.
Note
Disable warmup only during development. It is highly recommended to enable it in production.
Guidelines for Performance Tuning¶
Some general guidelines for tweaking performance are noted below. Your results may vary:
Warmup should be enabled during deployment for optimal performance. Warmup time depends on many factors, e.g. input and output sequence length, batch size, number of prefill/decode combinations, and data type. Warmup typically takes 30 seconds to a few minutes, depending on the configuration.
Memory Management: Since HPU Graphs and KV Cache share the same memory pool, a balance is required between the two:
Maximizing KV Cache size helps to accommodate bigger batches resulting in increased overall throughput.
Enabling HPU Graphs reduces host overhead times and can be useful for reducing latency.
Use --mem-fraction-static to control the fraction of memory used for static allocations.
Batch Size Optimization:
Use --max-running-requests to control the maximum number of concurrent requests.
Use --max-prefill-tokens to limit the number of tokens processed in the prefill stage.
Use --max-total-tokens to set the maximum total number of tokens that can be processed.
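For example, the following adds explicit batching limits to the basic launch command; the numbers are placeholders to adapt to your workload:

python -m sglang.launch_server \
    --model-path meta-llama/Meta-Llama-3.1-8B \
    --dtype bfloat16 \
    --max-running-requests 64 \
    --max-prefill-tokens 8192 \
    --max-total-tokens 65536 \
    --host 0.0.0.0 \
    --port 30000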
Chunked Prefill: Use --chunked-prefill-size to break large prefill sequences into smaller chunks. This helps with:
Better memory utilization
Reduced time-to-first-token for long sequences
More stable memory usage patterns
Tensor Parallelism: Use --tp-size to distribute model weights across multiple HPUs:
Reduces per-device memory usage
Enables serving larger models
Can improve throughput for certain workloads
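For example, a larger model that does not fit on a single HPU can be sharded across eight devices. The model name below is illustrative and assumes you have access to its weights:

python -m sglang.launch_server \
    --model-path meta-llama/Meta-Llama-3.1-70B \
    --dtype bfloat16 \
    --tp-size 8 \
    --host 0.0.0.0 \
    --port 30000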
Data Types: Consider using different precision formats:
--dtype bfloat16 - Standard precision with a good balance of accuracy and performance
--dtype float16 - Narrower dynamic range than bfloat16; potentially faster but may affect output quality
Advanced Features:
RadixAttention: Automatic prefix caching for improved efficiency with repeated prefixes (see the example after this list)
Speculative Decoding: Use smaller draft models to accelerate generation
Continuous Batching: Automatic batching of requests for better throughput
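RadixAttention benefits workloads where many requests share a common prefix, such as a long system prompt. The sketch below reuses the OpenAI-compatible client from earlier and issues two requests with the same system prompt; the prompt text is purely illustrative, and the cached prefix is reused automatically by the server:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

# A long, shared system prompt: its KV cache can be reused across requests.
system_prompt = (
    "You are a support assistant for a home appliance catalog. "
    "Answer concisely and mention the product name in every answer."
)

for question in ["How do I descale the kettle?", "Is the blender jar dishwasher safe?"]:
    response = client.chat.completions.create(
        model="meta-llama/Meta-Llama-3.1-8B",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": question},
        ],
        max_tokens=100,
    )
    print(response.choices[0].message.content)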
Performance Monitoring¶
Monitor key metrics during inference:
Throughput: Tokens per second generated
Latency: Time to first token and inter-token latency
Memory Usage: HPU memory utilization
Request Queue: Number of pending requests
Use the built-in metrics endpoint to track these values:
curl http://localhost:30000/metrics
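If the metrics endpoint returns Prometheus-style text output (an assumption; verify against your SGLang version, as metric names and formats vary), a small script can pull out the values of interest:

# Sketch: fetch the metrics endpoint and print token- and request-related lines.
import urllib.request

METRICS_URL = "http://localhost:30000/metrics"

with urllib.request.urlopen(METRICS_URL) as resp:
    text = resp.read().decode("utf-8")

for line in text.splitlines():
    if line.startswith("#"):
        continue  # skip Prometheus comment/metadata lines
    if "token" in line or "request" in line:
        print(line)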
Environment Variables¶
Key environment variables for performance tuning:
PT_HPU_LAZY_MODE=1 - Enable lazy mode for better performance
PT_HPU_LAZY_MODE=0 - Disable lazy mode to use eager or compile mode
HABANA_VISIBLE_DEVICES=all - Make all HPU devices visible
SGLANG_HPU_SKIP_WARMUP=1 - Disable warmup (development only)
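A typical way to combine these variables with a server launch is shown below. Which lazy-mode setting performs best depends on your model and software stack, so treat the values as a starting point:

export PT_HPU_LAZY_MODE=1
export HABANA_VISIBLE_DEVICES=all
# Development only: skip warmup to shorten iteration time.
# export SGLANG_HPU_SKIP_WARMUP=1

python -m sglang.launch_server \
    --model-path meta-llama/Meta-Llama-3.1-8B \
    --dtype bfloat16 \
    --host 0.0.0.0 \
    --port 30000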