Inference Using SGLang¶
This document outlines the process of deploying models with SGLang using the Intel® Gaudi® AI accelerator. It covers how to create a Gaudi-compatible SGLang Docker image, build and install the SGLang Inference Server, and send inference requests.
Prerequisites¶
Intel® Gaudi® software and drivers installed. Refer to Driver and Software Installation.
Docker installed (optional - for containerized deployment). Refer to Docker Installation.
Creating and Running a Docker Image for Gaudi¶
Since an SGLang server is launched within a Docker container, a Docker image tailored for Gaudi is needed. Follow the instructions below to create and run a Docker image for SGLang on Gaudi.
Clone SGLang repository:
git clone https://github.com/HabanaAI/sglang-fork.git
cd sglang-fork
Build the SGLang Docker image optimized for Gaudi:
docker build -f docker/Dockerfile.gaudi -t sglang-gaudi:latest .
Launch the SGLang container with Gaudi device access:
docker run -it --runtime=habana \
  -e HABANA_VISIBLE_DEVICES=all \
  -e OMPI_MCA_btl_vader_single_copy_mechanism=none \
  --cap-add=sys_nice \
  --net=host \
  --ipc=host \
  -v /path/to/models:/models \
  sglang-gaudi:latest
Building and Installing SGLang for Gaudi¶
To build and install SGLang with Gaudi support, follow the steps below:
Clone the SGLang repository with Gaudi support:
git clone https://github.com/HabanaAI/sglang-fork.git
cd sglang-fork
Install the required dependencies:
pip install -e "python[all_hpu]"
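After installation, a quick import check can confirm that SGLang and the Gaudi PyTorch bridge are visible to Python. The snippet below is a minimal sketch; it assumes the `habana_frameworks` package that ships with the Intel Gaudi software stack is installed and that `sglang` exposes a `__version__` attribute.

```python
# Minimal post-install sanity check (assumes the Intel Gaudi PyTorch bridge is installed).
import sglang
import habana_frameworks.torch.hpu as hpu

print("SGLang version:", sglang.__version__)
print("HPU available:", hpu.is_available())
print("HPU device count:", hpu.device_count())
```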
Sending an Inference Request¶
The following sections describe the available methods to send inference requests to SGLang running on Gaudi.
OpenAI-Compatible API¶
Use an API endpoint that mimics the OpenAI interface, allowing for easy integration with existing OpenAI-compatible clients.
Start the SGLang server with OpenAI-compatible API:
python -m sglang.launch_server \
    --model-path meta-llama/Meta-Llama-3.1-8B \
    --host 0.0.0.0 \
    --port 30000 \
    --tp-size 1 \
    --dtype bfloat16
Send requests using OpenAI Python client:
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:30000/v1",
    api_key="EMPTY"
)

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B",
    messages=[
        {"role": "user", "content": "What is the capital of France?"}
    ],
    max_tokens=100,
    temperature=0.7
)

print(response.choices[0].message.content)
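The same endpoint also supports streamed responses. The sketch below reuses the server started above and relies only on the standard OpenAI Python client's `stream=True` option; the prompt is illustrative.

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

# Stream the completion so tokens are printed as they arrive.
stream = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B",
    messages=[{"role": "user", "content": "Write a haiku about AI accelerators."}],
    max_tokens=100,
    temperature=0.7,
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()
```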
SGLang Native API¶
Use SGLang’s native API for advanced features such as structured generation:
import sglang as sgl
from sglang import function, system, user, assistant, gen, select

@function
def multi_turn_question(s, question_1, question_2):
    s += system("You are a helpful assistant.")
    s += user(question_1)
    s += assistant(gen("answer_1", max_tokens=256))
    s += user(question_2)
    s += assistant(gen("answer_2", max_tokens=256))

# Set the default backend
sgl.set_default_backend(sgl.Runtime(model_path="meta-llama/Meta-Llama-3.1-8B"))

# Run the function
state = multi_turn_question.run(
    question_1="What is the capital of the United States?",
    question_2="List two local attractions there."
)

# Print the captured generations by name
print("answer_1:", state["answer_1"])
print("answer_2:", state["answer_2"])
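The `select` primitive imported above constrains a generation to a fixed set of choices, which is useful for classification or routing steps. The following sketch reuses the default backend set in the previous example; the prompt text and choice labels are illustrative and not taken from the official examples.

```python
from sglang import assistant, function, select, user

@function
def classify_sentiment(s, review):
    s += user("Classify the sentiment of this review: " + review)
    # Constrain the model's answer to one of the listed labels.
    s += assistant(select("label", choices=["positive", "negative", "neutral"]))

state = classify_sentiment.run(
    review="The battery life is fantastic and setup was painless."
)
print("label:", state["label"])
```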
For a list of popular models and supported configurations validated on Gaudi, see the supported models documentation.
Analyzing SGLang Logs¶
This section demonstrates how to interpret SGLang server logs using a basic example that runs the Meta-Llama-3.1-8B model on a Gaudi-based SGLang server with default settings:
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B --dtype bfloat16 --tp-size 1 --host 0.0.0.0 --port 30000
Below is an excerpt from the initial server log output:
INFO 2025-01-15 10:30:15 server.py:123] Loading model meta-llama/Meta-Llama-3.1-8B
INFO 2025-01-15 10:30:16 habana_worker.py:89] Initializing Gaudi device: hpu:0
INFO 2025-01-15 10:30:16 habana_worker.py:156] Model loading on hpu:0 took 12.8 GiB of device memory (12.8 GiB/94.62 GiB used) and 950 MiB of host memory
INFO 2025-01-15 10:30:17 habana_worker.py:178] HPU Graph compilation started
INFO 2025-01-15 10:30:18 habana_worker.py:201] Free device memory: 81.82 GiB, 73.64 GiB usable (memory_utilization=0.9), 65.27 GiB reserved for KV cache
INFO 2025-01-15 10:30:18 habana_executor.py:67] # HPU blocks: 4181, # CPU blocks: 512
INFO 2025-01-15 10:30:19 habana_worker.py:234] Cache engine initialization took 65.27 GiB of device memory (78.07 GiB/94.62 GiB used) and 1.2 GiB of host memory
These logs illustrate the memory usage pattern during the model initialization phase. They highlight how much device memory is consumed for loading model weights, compiling HPU Graphs, and allocating KV cache memory prior to the warmup stage. This data is useful for tuning memory utilization and optimizing server performance.
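As a quick sanity check, the memory figures in the excerpt above can be reconciled as follows. The arithmetic is only an illustration of how to read these log lines; the exact accounting may differ between software releases.

```python
# Memory accounting implied by the example log excerpt (all values in GiB).
total_device_memory = 94.62     # total HPU memory reported in the log
model_weights = 12.80           # "Model loading ... took 12.8 GiB"
free_after_load = 81.82         # "Free device memory: 81.82 GiB"
memory_utilization = 0.9        # fraction of free memory the server may use
kv_cache = 65.27                # "reserved for KV cache"

usable = free_after_load * memory_utilization  # ~73.64 GiB, matches the log
used_total = model_weights + kv_cache          # ~78.07 GiB, matches "78.07 GiB/94.62 GiB used"
headroom = total_device_memory - used_total    # left for HPU Graphs and runtime overhead

print(f"usable={usable:.2f} GiB, used={used_total:.2f} GiB, headroom={headroom:.2f} GiB")
```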
Below is an excerpt from the warmup phase logs:
INFO 2025-01-15 10:30:25 habana_model_runner.py:445] Warmup started with batch_sizes=[1, 2, 4, 8] and seq_lens=[128, 256, 512, 1024, 2048]
INFO 2025-01-15 10:30:27 habana_model_runner.py:467] Prefill warmup: batch_size=1, seq_len=128, time=0.12s
INFO 2025-01-15 10:30:28 habana_model_runner.py:467] Prefill warmup: batch_size=1, seq_len=256, time=0.15s
INFO 2025-01-15 10:30:30 habana_model_runner.py:467] Prefill warmup: batch_size=4, seq_len=1024, time=0.35s
INFO 2025-01-15 10:30:32 habana_model_runner.py:489] Decode warmup: batch_size=8, seq_len=128, time=0.08s
INFO 2025-01-15 10:30:33 habana_model_runner.py:512] Warmup finished in 8.2 secs, allocated 156 MiB of additional device memory
INFO 2025-01-15 10:30:33 server.py:234] SGLang server is ready at http://0.0.0.0:30000
Analyzing the warmup phase shows how much memory remains available for runtime overhead and how much headroom is left for further optimization. These details help you fine-tune memory utilization settings by balancing the requirements of HPU Graphs and the KV cache according to your specific workload and performance goals.
Troubleshooting¶
This section provides troubleshooting tips for common OOM errors as well as instructions for reducing server startup time during development.
Troubleshooting OOM Errors¶
Out-of-memory (OOM) errors may occur depending on factors such as available HPU memory, model size, and input sequence length. If the standard inference command fails, the following options can help mitigate OOM issues; a sketch that combines several of them follows the table:
| Option | Description |
|---|---|
| Increase `--mem-fraction-static` | Increases the memory fraction pre-allocated for HPU cache. Helps allocate more space for the KV cache to avoid OOM errors. |
| Lower `--max-running-requests` | Reduces the number of concurrent requests in a batch. This lowers the overall KV cache usage. |
| Increase `--tp-size` | Enables tensor parallelism by sharding model weights across multiple HPUs. This frees up memory on each HPU for the KV cache. |
| Set `--chunked-prefill-size` | Breaks large prefill sequences into smaller chunks. Reduces peak memory consumption during the prefill stage. |
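The sketch below shows how several of these mitigations might be combined when launching SGLang through its `Runtime` wrapper, assuming these server options are also accepted as `Runtime` keyword arguments (as `model_path` is in the earlier example). The values are illustrative, not recommendations.

```python
import sglang as sgl

# Illustrative OOM-mitigation settings; tune the values to your model and hardware.
runtime = sgl.Runtime(
    model_path="meta-llama/Meta-Llama-3.1-8B",
    dtype="bfloat16",
    mem_fraction_static=0.85,   # give a larger fraction of device memory to weights + KV cache
    max_running_requests=32,    # cap concurrent requests to bound KV cache usage
    tp_size=2,                  # shard weights across two HPUs
    chunked_prefill_size=2048,  # split long prefills into smaller chunks
)
sgl.set_default_backend(runtime)
```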
Guidelines for Performance Tuning¶
The following guidelines can help you optimize performance. Results may vary depending on your specific workload and configuration:
| Aspect | Guidelines |
|---|---|
| Warmup | Enable warmup during deployment for optimal performance. Warmup duration depends on factors such as input/output sequence lengths, batch size, number of prefill/decode combinations, and data type. Typically, warmup takes from 30 seconds to a few minutes. |
| Memory Management | HPU Graphs and the KV cache share the same memory pool, so balancing them is crucial: maximizing KV cache size supports larger batches and higher throughput; enabling HPU Graphs reduces host overhead and can help lower latency; control the static memory allocation fraction using `--mem-fraction-static`. |
| Batch Size Optimization | Tune the maximum number of concurrent requests (for example, via `--max-running-requests`) to your workload: larger batches improve throughput but increase KV cache usage and per-request latency. |
| Chunked Prefill | Use `--chunked-prefill-size` to break large prefill sequences into smaller chunks and reduce peak memory consumption. |
| Tensor Parallelism | Use `--tp-size` to shard model weights across multiple HPUs, freeing memory on each device for the KV cache. |
| Data Types | Choose precision formats based on your needs: `bfloat16` (used in the examples above) offers a good balance of accuracy and performance on Gaudi. |
| Advanced Features | SGLang's native frontend API provides structured generation primitives such as `gen` and `select` (see the example above). |
Performance Monitoring¶
Monitor the following key metrics during inference:
| Metric | Description |
|---|---|
| Throughput | Tokens generated per second |
| Latency | Time to first token and inter-token latency |
| Memory Usage | HPU memory utilization |
| Request Queue | Number of pending requests |
Use the built-in metrics endpoint to track these values:
curl http://localhost:30000/metrics
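Server-side metrics can be complemented with a simple client-side measurement. The sketch below times the first streamed token and the overall generation rate through the OpenAI-compatible endpoint started earlier; chunk counts only approximate token counts, so treat it as a rough illustration rather than a benchmarking tool.

```python
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

start = time.perf_counter()
first_token_latency = None
num_chunks = 0

stream = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B",
    messages=[{"role": "user", "content": "Summarize the benefits of KV caching."}],
    max_tokens=200,
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        if first_token_latency is None:
            first_token_latency = time.perf_counter() - start  # time to first token
        num_chunks += 1
total = time.perf_counter() - start

print(f"Time to first token: {first_token_latency:.3f}s")
print(f"Approx. throughput: {num_chunks / total:.1f} tokens/s over {total:.2f}s")
```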
Environment Variables¶
The following table provides the environment variables for performance tuning: