Inference Using SGLang

This document outlines the process of deploying models with SGLang using the Intel® Gaudi® AI accelerator. It covers how to create a Gaudi-compatible SGLang Docker image, build and install the SGLang Inference Server, and send inference requests.

Prerequisites

Creating and Running a Docker Image for Gaudi

Since an SGLang server is launched within a Docker container, a Docker image tailored for Gaudi is needed. Follow the instructions below to create and run a Docker image for SGLang on Gaudi.

  1. Clone the SGLang repository:

    git clone https://github.com/HabanaAI/sglang-fork.git
    cd sglang-fork
    
  2. Build the SGLang Docker image optimized for Gaudi:

    docker build -f docker/Dockerfile.gaudi -t sglang-gaudi:latest .
    
  3. Launch the SGLang container with Gaudi device access:

    docker run -it --runtime=habana \
        -e HABANA_VISIBLE_DEVICES=all \
        -e OMPI_MCA_btl_vader_single_copy_mechanism=none \
        --cap-add=sys_nice \
        --net=host \
        --ipc=host \
        -v /path/to/models:/models \
        sglang-gaudi:latest
    

Building and Installing SGLang for Gaudi

To build and install SGLang with Gaudi support, follow the steps below:

  1. Clone the SGLang repository with Gaudi support:

    git clone https://github.com/HabanaAI/sglang-fork.git
    cd sglang-fork
    
  2. Install the required dependencies:

    pip install -e python[all_hpu]
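
To confirm the installation succeeded, a quick import check can be run. This is a minimal sketch that assumes the sglang package exposes a __version__ attribute:

    # Minimal post-install sanity check; an ImportError indicates a broken install.
    import sglang
    print(sglang.__version__)  # version attribute assumed to be present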
    

Sending an Inference Request

The following sections describe the available methods to send inference requests to SGLang running on Gaudi.

OpenAI-Compatible API

Use an API endpoint that mimics the OpenAI interface, allowing for easy integration with existing OpenAI-compatible clients.

  1. Start the SGLang server with the OpenAI-compatible API:

    python -m sglang.launch_server \
        --model-path meta-llama/Meta-Llama-3.1-8B \
        --host 0.0.0.0 \
        --port 30000 \
        --tp-size 1 \
        --dtype bfloat16
    
  2. Send requests using the OpenAI Python client (a streaming variant follows this example):

    from openai import OpenAI
    
    client = OpenAI(
        base_url="http://localhost:30000/v1",
        api_key="EMPTY"
    )
    
    response = client.chat.completions.create(
        model="meta-llama/Meta-Llama-3.1-8B",
        messages=[
            {"role": "user", "content": "What is the capital of France?"}
        ],
        max_tokens=100,
        temperature=0.7
    )
    
    print(response.choices[0].message.content)
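
Streaming responses can also be requested through the same OpenAI-compatible endpoint. The sketch below uses the standard stream=True option of the OpenAI Python client and assumes the server started in step 1 is still running; the prompt is illustrative only:

    from openai import OpenAI

    client = OpenAI(
        base_url="http://localhost:30000/v1",
        api_key="EMPTY"
    )

    # Request a streamed completion; tokens arrive as incremental chunks.
    stream = client.chat.completions.create(
        model="meta-llama/Meta-Llama-3.1-8B",
        messages=[{"role": "user", "content": "Write a short poem about the sea."}],
        max_tokens=100,
        temperature=0.7,
        stream=True
    )

    for chunk in stream:
        # Each chunk carries a delta holding newly generated text, if any.
        delta = chunk.choices[0].delta.content
        if delta:
            print(delta, end="", flush=True)
    print()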
    

SGLang Native API

Use SGLang’s native API for advanced features such as structured generation. The example below runs a multi-turn conversation; a constrained-choice sketch follows it:

    import sglang as sgl
    from sglang import function, system, user, assistant, gen, select

    @function
    def multi_turn_question(s, question_1, question_2):
        s += system("You are a helpful assistant.")
        s += user(question_1)
        s += assistant(gen("answer_1", max_tokens=256))
        s += user(question_2)
        s += assistant(gen("answer_2", max_tokens=256))

    # Set the default backend
    sgl.set_default_backend(sgl.Runtime(model_path="meta-llama/Meta-Llama-3.1-8B"))

    # Run the function
    state = multi_turn_question.run(
        question_1="What is the capital of the United States?",
        question_2="List two local attractions there."
    )

    # Print each answer captured by gen()
    print("answer_1:", state["answer_1"])
    print("answer_2:", state["answer_2"])
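
The select primitive imported above can be used with the same pattern for constrained generation, restricting a generated field to a fixed set of choices. The following is a minimal sketch that reuses the imports and default backend from the previous example; the prompt and choice values are illustrative only:

    # Reuses the imports and default backend set in the previous example.
    @function
    def sentiment_check(s, review):
        s += system("You classify the sentiment of product reviews.")
        s += user(review)
        # select constrains the generated value to one of the listed choices.
        s += assistant(select("sentiment", choices=["positive", "negative", "neutral"]))

    state = sentiment_check.run(review="The battery life is excellent and setup was easy.")
    print(state["sentiment"])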

For a list of popular models and supported configurations validated on Gaudi, see the supported models documentation.

Analyzing SGLang Logs

This section demonstrates how to interpret SGLang server logs using a basic example that runs the Meta-Llama-3.1-8B model on a Gaudi-based SGLang server with default settings:

python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B --dtype bfloat16 --tp-size 1 --host 0.0.0.0 --port 30000

Below is an excerpt from the initial server log output:

    INFO 2025-01-15 10:30:15 server.py:123] Loading model meta-llama/Meta-Llama-3.1-8B
    INFO 2025-01-15 10:30:16 habana_worker.py:89] Initializing Gaudi device: hpu:0
    INFO 2025-01-15 10:30:16 habana_worker.py:156] Model loading on hpu:0 took 12.8 GiB of device memory (12.8 GiB/94.62 GiB used) and 950 MiB of host memory
    INFO 2025-01-15 10:30:17 habana_worker.py:178] HPU Graph compilation started
    INFO 2025-01-15 10:30:18 habana_worker.py:201] Free device memory: 81.82 GiB, 73.64 GiB usable (memory_utilization=0.9), 65.27 GiB reserved for KV cache
    INFO 2025-01-15 10:30:18 habana_executor.py:67] # HPU blocks: 4181, # CPU blocks: 512
    INFO 2025-01-15 10:30:19 habana_worker.py:234] Cache engine initialization took 65.27 GiB of device memory (78.07 GiB/94.62 GiB used) and 1.2 GiB of host memory

These logs illustrate the memory usage pattern during the model initialization phase. They highlight how much device memory is consumed for loading model weights, compiling HPU Graphs, and allocating KV cache memory prior to the warmup stage. This data is useful for tuning memory utilization and optimizing server performance.

Below is an excerpt from the warmup phase logs:

    INFO 2025-01-15 10:30:25 habana_model_runner.py:445] Warmup started with batch_sizes=[1, 2, 4, 8] and seq_lens=[128, 256, 512, 1024, 2048]
    INFO 2025-01-15 10:30:27 habana_model_runner.py:467] Prefill warmup: batch_size=1, seq_len=128, time=0.12s
    INFO 2025-01-15 10:30:28 habana_model_runner.py:467] Prefill warmup: batch_size=1, seq_len=256, time=0.15s
    INFO 2025-01-15 10:30:30 habana_model_runner.py:467] Prefill warmup: batch_size=4, seq_len=1024, time=0.35s
    INFO 2025-01-15 10:30:32 habana_model_runner.py:489] Decode warmup: batch_size=8, seq_len=128, time=0.08s
    INFO 2025-01-15 10:30:33 habana_model_runner.py:512] Warmup finished in 8.2 secs, allocated 156 MiB of additional device memory
    INFO 2025-01-15 10:30:33 server.py:234] SGLang server is ready at http://0.0.0.0:30000

By analyzing the warmup phase, you can see how much device memory remains for runtime overhead and how much headroom is left for further optimization. These details help you fine-tune memory utilization settings by balancing HPU Graphs and KV cache requirements according to your specific workload and performance goals.

Troubleshooting

This section provides troubleshooting tips for common OOM errors as well as instructions for reducing server startup time during development.

Troubleshooting OOM Errors

Out-of-memory (OOM) errors may occur depending on factors such as available HPU memory, model size, and input sequence length. If the standard inference command fails with an OOM error, the following server options can help mitigate the issue (a combined launch sketch follows the list):

  • --mem-fraction-static: Controls the fraction of memory pre-allocated for the HPU cache. Increasing it gives the KV cache more room, which can prevent KV cache OOM errors.

  • --max-running-requests: Limits the number of concurrent requests in a batch, lowering overall KV cache usage.

  • --tp-size: Enables tensor parallelism by sharding model weights across multiple HPUs, freeing memory on each HPU for the KV cache.

  • --chunked-prefill-size: Breaks large prefill sequences into smaller chunks, reducing peak memory consumption during the prefill stage.
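
As an illustration, the hedged sketch below launches the server from a Python script with several of these options combined; the flag values are placeholders only and should be tuned for your model, HPU count, and workload:

    import subprocess

    # Illustrative launch only: the values below are placeholders, not recommendations.
    cmd = [
        "python", "-m", "sglang.launch_server",
        "--model-path", "meta-llama/Meta-Llama-3.1-8B",
        "--dtype", "bfloat16",
        "--tp-size", "2",                  # shard weights across 2 HPUs
        "--max-running-requests", "32",    # cap concurrent requests to limit KV cache usage
        "--chunked-prefill-size", "4096",  # split long prefills into smaller chunks
        "--host", "0.0.0.0",
        "--port", "30000",
    ]
    subprocess.run(cmd, check=True)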

Guidelines for Performance Tuning

The following guidelines can help you optimize performance. Results may vary depending on your specific workload and configuration:

Warmup

Enable warmup during deployment for optimal performance. Warmup duration depends on factors such as input/output sequence lengths, batch size, number of prefill/decode combinations, and data type. Typically, warmup takes from 30 seconds to a few minutes.

Memory Management

HPU Graphs and the KV cache share the same memory pool, so balancing them is crucial:

  • Maximizing KV cache size supports larger batches and higher throughput.

  • Enabling HPU Graphs reduces host overhead and can help lower latency.

  • The split is controlled with --mem-fraction-static, which sets the static memory allocation fraction.

Batch Size Optimization

  • Use --max-running-requests to limit the number of concurrent requests.

  • Use --max-prefill-tokens to restrict tokens processed during the prefill stage.

  • Use --max-total-tokens to cap the total number of tokens held in the KV cache memory pool.

Chunked Prefill

Use --chunked-prefill-size to split large prefill sequences into smaller chunks, which helps:

  • Improve memory utilization

  • Reduce time-to-first-token for long sequences

  • Stabilize memory usage patterns

Tensor Parallelism

Use --tp-size to shard model weights across multiple HPUs, which:

  • Reduces per-device memory usage

  • Enables serving larger models

  • Can improve throughput for certain workloads

Data Types

Choose a precision format based on your needs:

  • --dtype bfloat16: Standard precision balancing accuracy and performance

  • --dtype float16: Lower precision with potential speed gains; may affect output quality

Advanced Features

  • RadixAttention: Enables automatic prefix caching for efficiency with repeated prefixes.

  • Speculative Decoding: Uses smaller draft models to accelerate generation.

  • Continuous Batching: Automatically batches requests to improve throughput.

Performance Monitoring

Monitor the following key metrics during inference:

  • Throughput: Tokens generated per second

  • Latency: Time to first token and inter-token latency

  • Memory Usage: HPU memory utilization

  • Request Queue: Number of pending requests

Use the built-in metrics endpoint to track these metrics:

curl http://localhost:30000/metrics
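
For programmatic monitoring, the same endpoint can be polled from Python. The following is a minimal sketch, assuming the requests package is installed and that the endpoint returns Prometheus-style plain text:

    import requests

    # Fetch the raw metrics text exposed by the SGLang server.
    resp = requests.get("http://localhost:30000/metrics", timeout=5)
    resp.raise_for_status()

    # Print only non-comment lines for a quick look at current values.
    for line in resp.text.splitlines():
        if line and not line.startswith("#"):
            print(line)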

Environment Variables

The following table provides the environment variables for performance tuning: