Inference Using SGLang

The following sections provide instructions on deploying models using SGLang with the Intel® Gaudi® AI accelerator. This document is based on the SGLang Inference Server documentation and the SGLang fork for Gaudi integration. The process involves:

  • Creating an SGLang Docker image for Gaudi

  • Building and installing the SGLang Inference Server

  • Sending an Inference Request

For frequently asked questions about using Gaudi with SGLang, see SGLang with Gaudi FAQs.

Creating a Docker Image for Gaudi

Since an SGLang server is launched within a Docker container, a Docker image tailored for Gaudi is needed. Follow the instructions below to create and run a Docker image for SGLang on Gaudi.

Prerequisites

  1. Ensure you have Intel Gaudi drivers installed on your host system.

  2. Install Docker and ensure it can access Gaudi devices.

  3. Clone the SGLang repository with Gaudi support:

    git clone https://github.com/HabanaAI/sglang-fork.git
    cd sglang-fork
    

Building the Docker Image

Build the SGLang Docker image optimized for Gaudi:

docker build -f docker/Dockerfile.gaudi -t sglang-gaudi:latest .
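
Optionally, confirm that the image was created:

docker images sglang-gaudi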

Running the Docker Container

Launch the SGLang container with Gaudi device access:

docker run -it --runtime=habana \
    -e HABANA_VISIBLE_DEVICES=all \
    -e OMPI_MCA_btl_vader_single_copy_mechanism=none \
    --cap-add=sys_nice \
    --net=host \
    --ipc=host \
    -v /path/to/models:/models \
    sglang-gaudi:latest
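
Once inside the container, you can optionally confirm that the Gaudi accelerators are accessible. The hl-smi utility ships with the Gaudi software stack; the exact output depends on your driver version:

hl-smi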

Building and Installing SGLang for Gaudi

To build and install SGLang with Gaudi support, follow these steps:

Installation from Source

  1. Clone the SGLang repository with Gaudi support:

    git clone https://github.com/HabanaAI/sglang-fork.git
    cd sglang-fork
    
  2. Install the required dependencies:

    pip install -e "python[all_hpu]"
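
To optionally confirm that the installation succeeded, check that the package imports. This is a minimal sanity check and assumes the fork exposes a __version__ attribute like upstream SGLang:

python -c "import sglang; print(sglang.__version__)"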
    

Sending an Inference Request

There are multiple methods to send inference requests to SGLang on Gaudi:

  • OpenAI-Compatible Server: Use SGLang’s OpenAI-compatible API endpoint

  • Native SGLang API: Use SGLang’s native Python API for more advanced features

  • HTTP REST API: Send direct HTTP requests to the SGLang server

OpenAI-Compatible Server

Start the SGLang server with the OpenAI-compatible API:

python -m sglang.launch_server \
    --model-path meta-llama/Meta-Llama-3.1-8B \
    --host 0.0.0.0 \
    --port 30000 \
    --tp-size 1 \
    --dtype bfloat16
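
Before sending requests, you can optionally verify that the server is up. Upstream SGLang exposes /health and /get_model_info endpoints; adjust the port if you changed it:

curl http://localhost:30000/health
curl http://localhost:30000/get_model_info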

Send requests using the OpenAI Python client:

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:30000/v1",
    api_key="EMPTY"
)

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B",
    messages=[
        {"role": "user", "content": "What is the capital of France?"}
    ],
    max_tokens=100,
    temperature=0.7
)

print(response.choices[0].message.content)
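
Since the launch example above serves a base (non-Instruct) checkpoint, the plain completions endpoint may be a better fit than chat completions. The following sketch assumes the standard OpenAI-compatible /v1/completions route:

curl http://localhost:30000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "meta-llama/Meta-Llama-3.1-8B", "prompt": "The capital of France is", "max_tokens": 32, "temperature": 0.7}'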

Native SGLang API

Use SGLang’s native API for advanced features like structured generation:

import sglang as sgl
from sglang import function, system, user, assistant, gen, select

@function
def multi_turn_question(s, question_1, question_2):
    s += system("You are a helpful assistant.")
    s += user(question_1)
    s += assistant(gen("answer_1", max_tokens=256))
    s += user(question_2)
    s += assistant(gen("answer_2", max_tokens=256))

# Set the default backend
sgl.set_default_backend(sgl.Runtime(model_path="meta-llama/Meta-Llama-3.1-8B"))

# Run the function
state = multi_turn_question.run(
    question_1="What is the capital of the United States?",
    question_2="List two local attractions there."
)

# Print the generated answers
print(state["answer_1"])
print(state["answer_2"])

For a list of popular models and supported configurations validated on Gaudi, see the supported models documentation.

Understanding SGLang Logs

This section uses a basic example that runs the Meta-Llama-3.1-8B model on an SGLang server on Gaudi with default settings:

python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B --dtype bfloat16 --tp-size 1 --host 0.0.0.0 --port 30000

The following shows the initial part of the server log:

  INFO 2025-01-15 10:30:15 server.py:123] Loading model meta-llama/Meta-Llama-3.1-8B
  INFO 2025-01-15 10:30:16 habana_worker.py:89] Initializing Gaudi device: hpu:0
  INFO 2025-01-15 10:30:16 habana_worker.py:156] Model loading on hpu:0 took 12.8 GiB of device memory (12.8 GiB/94.62 GiB used) and 950 MiB of host memory
  INFO 2025-01-15 10:30:17 habana_worker.py:178] HPU Graph compilation started
  INFO 2025-01-15 10:30:18 habana_worker.py:201] Free device memory: 81.82 GiB, 73.64 GiB usable (memory_utilization=0.9), 65.27 GiB reserved for KV cache
  INFO 2025-01-15 10:30:18 habana_executor.py:67] # HPU blocks: 4181, # CPU blocks: 512
  INFO 2025-01-15 10:30:19 habana_worker.py:234] Cache engine initialization took 65.27 GiB of device memory (78.07 GiB/94.62 GiB used) and 1.2 GiB of host memory

This part of the log shows the memory consumption of the chosen model. It includes the device memory used for loading the model weights, the HPU Graph compilation, and the memory allocated for the KV cache before the warmup phase begins. This information can be used to determine optimal memory utilization settings.

The following shows the warmup phase logs:

  INFO 2025-01-15 10:30:25 habana_model_runner.py:445] Warmup started with batch_sizes=[1, 2, 4, 8] and seq_lens=[128, 256, 512, 1024, 2048]
  INFO 2025-01-15 10:30:27 habana_model_runner.py:467] Prefill warmup: batch_size=1, seq_len=128, time=0.12s
  INFO 2025-01-15 10:30:28 habana_model_runner.py:467] Prefill warmup: batch_size=1, seq_len=256, time=0.15s
  INFO 2025-01-15 10:30:30 habana_model_runner.py:467] Prefill warmup: batch_size=4, seq_len=1024, time=0.35s
  INFO 2025-01-15 10:30:32 habana_model_runner.py:489] Decode warmup: batch_size=8, seq_len=128, time=0.08s
  INFO 2025-01-15 10:30:33 habana_model_runner.py:512] Warmup finished in 8.2 secs, allocated 156 MiB of additional device memory
  INFO 2025-01-15 10:30:33 server.py:234] SGLang server is ready at http://0.0.0.0:30000

After analyzing the warmup phase logs, you should have a good idea of how much free device memory remains and how much more could still be utilized by adjusting the memory utilization settings. Balance the memory requirements of HPU Graphs and the KV cache based on your workload.

Basic Troubleshooting for OOM Errors

  • Due to factors such as available HPU memory, model size, and input sequence length, the standard inference command may not always work for your model and can lead to out-of-memory (OOM) errors. The following steps help mitigate OOM errors; an example launch command that combines these options is shown after this list:

    • Increase --mem-fraction-static - This addresses insufficient available memory. SGLang pre-allocates HPU cache by using this fraction of memory. By increasing this value, you can provide more KV cache space.

    • Decrease --max-running-requests - This may reduce the number of concurrent requests in a batch, thereby requiring less KV cache space.

    • Increase --tp-size - This approach shards model weights across multiple HPUs, so each HPU has more memory available for KV cache.

    • Use --chunked-prefill-size - This breaks large prefill sequences into smaller chunks to reduce memory pressure.

  • During development, when evaluating a model for inference on SGLang, you can skip the server's warmup phase. This shortens testing turnaround times and can be enabled with the --disable-warmup flag.

    Note

    Disable warmup only during development; it is highly recommended to keep warmup enabled in production.
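
As referenced above, the following is an illustrative launch command that combines several of the OOM mitigation options. The values are starting points for experimentation, not recommendations, and should be tuned for your model and hardware:

python -m sglang.launch_server \
    --model-path meta-llama/Meta-Llama-3.1-8B \
    --dtype bfloat16 \
    --tp-size 2 \
    --mem-fraction-static 0.85 \
    --max-running-requests 32 \
    --chunked-prefill-size 4096 \
    --host 0.0.0.0 \
    --port 30000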

Guidelines for Performance Tuning

Some general guidelines for tweaking performance are noted below. Your results may vary:

  • Warmup should be enabled during deployment for optimal performance. Warmup time depends on many factors, such as input and output sequence lengths, batch size, the number of prefill/decode combinations, and data type; it typically takes 30 seconds to a few minutes, depending on the configuration.

  • Memory Management: Since HPU Graphs and KV Cache share the same memory pool, a balance is required between the two:

    • Maximizing KV Cache size helps to accommodate bigger batches resulting in increased overall throughput.

    • Enabling HPU Graphs reduces host overhead times and can be useful for reducing latency.

    • Use --mem-fraction-static to control the fraction of memory used for static allocations.

  • Batch Size Optimization:

    • Use --max-running-requests to control the maximum number of concurrent requests.

    • Use --max-prefill-tokens to limit the number of tokens processed in the prefill stage.

    • Use --max-total-tokens to set the maximum total tokens that can be processed.

  • Chunked Prefill: Use --chunked-prefill-size to break large prefill sequences into smaller chunks. This helps with:

    • Better memory utilization

    • Reduced time-to-first-token for long sequences

    • More stable memory usage patterns

  • Tensor Parallelism: Use --tp-size to distribute model weights across multiple HPUs:

    • Reduces per-device memory usage

    • Enables serving larger models

    • Can improve throughput for certain workloads

  • Data Types: Consider using different precision formats:

    • --dtype bfloat16 - Standard precision with good balance of accuracy and performance

    • --dtype float16 - Lower precision, potentially faster but may affect quality

  • Advanced Features:

    • RadixAttention: Automatic prefix caching for improved efficiency with repeated prefixes

    • Speculative Decoding: Use smaller draft models to accelerate generation

    • Continuous Batching: Automatic batching of requests for better throughput

Performance Monitoring

Monitor key metrics during inference:

  • Throughput: Tokens per second generated

  • Latency: Time to first token and inter-token latency

  • Memory Usage: HPU memory utilization

  • Request Queue: Number of pending requests

Use the built-in metrics endpoint to track these values:

curl http://localhost:30000/metrics
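
Depending on your SGLang version, Prometheus metrics may need to be enabled at server launch; upstream SGLang uses an --enable-metrics flag for this purpose.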

Environment Variables

Key environment variables for performance tuning are listed below; an illustrative launch invocation follows the list:

  • PT_HPU_LAZY_MODE=1 - Enable lazy mode for better performance

  • PT_HPU_LAZY_MODE=0 - Disable lazy mode to use eager or compile mode

  • HABANA_VISIBLE_DEVICES=all - Make all HPU devices visible

  • SGLANG_HPU_SKIP_WARMUP=1 - Disable warmup (development only)
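
For example, the variables can be set inline when launching the server (the values shown are illustrative only):

HABANA_VISIBLE_DEVICES=all PT_HPU_LAZY_MODE=1 python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B --dtype bfloat16 --tp-size 1 --host 0.0.0.0 --port 30000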