Inference Using SGLang

This document outlines the process of deploying models with SGLang using the Intel® Gaudi® AI accelerator. It covers how to create a Gaudi-compatible SGLang Docker image, build and install the SGLang Inference Server, and send inference requests.

Prerequisites

Creating and Running a Docker Image for Gaudi

Since an SGLang server is launched within a Docker container, a Docker image tailored for Gaudi is needed. Follow the instructions below to create and run a Docker image for SGLang on Gaudi.

  1. Clone the SGLang repository:

    git clone https://github.com/HabanaAI/sglang-fork.git
    cd sglang-fork
    
  2. Build the SGLang Docker image optimized for Gaudi:

    docker build -f docker/Dockerfile.gaudi -t sglang-gaudi:latest .
    
  3. Launch the SGLang container with Gaudi device access:

    docker run -it --runtime=habana \
        -e HABANA_VISIBLE_DEVICES=all \
        -e OMPI_MCA_btl_vader_single_copy_mechanism=none \
        --cap-add=sys_nice \
        --net=host \
        --ipc=host \
        -v /path/to/models:/models \
        sglang-gaudi:latest
    

Building and Installing SGLang for Gaudi

To build and install SGLang with Gaudi support, follow the steps below:

  1. Clone the SGLang repository with Gaudi support:

    git clone https://github.com/HabanaAI/sglang-fork.git
    cd sglang-fork
    
  2. Install the required dependencies:

    pip install -e python[all_hpu]
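
To confirm the installation succeeded, a quick import check can be run. This is a minimal sketch that assumes the sglang package exposes a __version__ attribute:

    # Minimal post-install sanity check; an ImportError indicates a broken install.
    import sglang
    print(sglang.__version__)  # version attribute assumed to be present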
    

Sending an Inference Request

The following sections describe the available methods to send inference requests to SGLang running on Gaudi.

OpenAI-Compatible API

Use an API endpoint that mimics the OpenAI interface, allowing for easy integration with existing OpenAI-compatible clients.

  1. Start the SGLang server with the OpenAI-compatible API:

    python -m sglang.launch_server \
        --model-path meta-llama/Meta-Llama-3.1-8B \
        --host 0.0.0.0 \
        --port 30000 \
        --tp-size 1 \
        --dtype bfloat16
    
  2. Send requests using the OpenAI Python client (a streaming variant follows this example):

    from openai import OpenAI
    
    client = OpenAI(
        base_url="http://localhost:30000/v1",
        api_key="EMPTY"
    )
    
    response = client.chat.completions.create(
        model="meta-llama/Meta-Llama-3.1-8B",
        messages=[
            {"role": "user", "content": "What is the capital of France?"}
        ],
        max_tokens=100,
        temperature=0.7
    )
    
    print(response.choices[0].message.content)
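
Streaming responses can also be requested through the same OpenAI-compatible endpoint. The sketch below uses the standard stream=True option of the OpenAI Python client and assumes the server started in step 1 is still running; the prompt is illustrative only:

    from openai import OpenAI

    client = OpenAI(
        base_url="http://localhost:30000/v1",
        api_key="EMPTY"
    )

    # Request a streamed completion; tokens arrive as incremental chunks.
    stream = client.chat.completions.create(
        model="meta-llama/Meta-Llama-3.1-8B",
        messages=[{"role": "user", "content": "Write a short poem about the sea."}],
        max_tokens=100,
        temperature=0.7,
        stream=True
    )

    for chunk in stream:
        # Each chunk carries a delta holding newly generated text, if any.
        delta = chunk.choices[0].delta.content
        if delta:
            print(delta, end="", flush=True)
    print()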
    

SGLang Native API

Use SGLang’s native API for advanced features such as structured generation. The example below runs a multi-turn conversation; a constrained-choice sketch follows it:

    import sglang as sgl
    from sglang import function, system, user, assistant, gen, select

    @function
    def multi_turn_question(s, question_1, question_2):
        s += system("You are a helpful assistant.")
        s += user(question_1)
        s += assistant(gen("answer_1", max_tokens=256))
        s += user(question_2)
        s += assistant(gen("answer_2", max_tokens=256))

    # Set the default backend
    sgl.set_default_backend(sgl.Runtime(model_path="meta-llama/Meta-Llama-3.1-8B"))

    # Run the function
    state = multi_turn_question.run(
        question_1="What is the capital of the United States?",
        question_2="List two local attractions there."
    )

    # Print each answer captured by gen()
    print("answer_1:", state["answer_1"])
    print("answer_2:", state["answer_2"])
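
The select primitive imported above can be used with the same pattern for constrained generation, restricting a generated field to a fixed set of choices. The following is a minimal sketch that reuses the imports and default backend from the previous example; the prompt and choice values are illustrative only:

    # Reuses the imports and default backend set in the previous example.
    @function
    def sentiment_check(s, review):
        s += system("You classify the sentiment of product reviews.")
        s += user(review)
        # select constrains the generated value to one of the listed choices.
        s += assistant(select("sentiment", choices=["positive", "negative", "neutral"]))

    state = sentiment_check.run(review="The battery life is excellent and setup was easy.")
    print(state["sentiment"])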

For a list of popular models and supported configurations validated on Gaudi, see the supported models documentation.

Analyzing SGLang Logs

This section demonstrates how to interpret SGLang server logs using a basic example that runs the Meta-Llama-3.1-8B model on a Gaudi-based SGLang server with default settings:

python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B --dtype bfloat16 --tp-size 1 --host 0.0.0.0 --port 30000

Below is an excerpt from the initial server log output:

    INFO 2025-01-15 10:30:15 server.py:123] Loading model meta-llama/Meta-Llama-3.1-8B
    INFO 2025-01-15 10:30:16 habana_worker.py:89] Initializing Gaudi device: hpu:0
    INFO 2025-01-15 10:30:16 habana_worker.py:156] Model loading on hpu:0 took 12.8 GiB of device memory (12.8 GiB/94.62 GiB used) and 950 MiB of host memory
    INFO 2025-01-15 10:30:17 habana_worker.py:178] HPU Graph compilation started
    INFO 2025-01-15 10:30:18 habana_worker.py:201] Free device memory: 81.82 GiB, 73.64 GiB usable (memory_utilization=0.9), 65.27 GiB reserved for KV cache
    INFO 2025-01-15 10:30:18 habana_executor.py:67] # HPU blocks: 4181, # CPU blocks: 512
    INFO 2025-01-15 10:30:19 habana_worker.py:234] Cache engine initialization took 65.27 GiB of device memory (78.07 GiB/94.62 GiB used) and 1.2 GiB of host memory

These logs illustrate the memory usage pattern during the model initialization phase. They highlight how much device memory is consumed for loading model weights, compiling HPU Graphs, and allocating KV cache memory prior to the warmup stage. This data is useful for tuning memory utilization and optimizing server performance.

Below is an excerpt from the warmup phase logs:

    INFO 2025-01-15 10:30:25 habana_model_runner.py:445] Warmup started with batch_sizes=[1, 2, 4, 8] and seq_lens=[128, 256, 512, 1024, 2048]
    INFO 2025-01-15 10:30:27 habana_model_runner.py:467] Prefill warmup: batch_size=1, seq_len=128, time=0.12s
    INFO 2025-01-15 10:30:28 habana_model_runner.py:467] Prefill warmup: batch_size=1, seq_len=256, time=0.15s
    INFO 2025-01-15 10:30:30 habana_model_runner.py:467] Prefill warmup: batch_size=4, seq_len=1024, time=0.35s
    INFO 2025-01-15 10:30:32 habana_model_runner.py:489] Decode warmup: batch_size=8, seq_len=128, time=0.08s
    INFO 2025-01-15 10:30:33 habana_model_runner.py:512] Warmup finished in 8.2 secs, allocated 156 MiB of additional device memory
    INFO 2025-01-15 10:30:33 server.py:234] SGLang server is ready at http://0.0.0.0:30000

By analyzing the warmup phase, you can see how much device memory remains for runtime overhead and how much headroom is left for further optimization. These details help you fine-tune memory utilization settings by balancing HPU Graphs and KV cache requirements according to your specific workload and performance goals.

Troubleshooting

This section provides troubleshooting tips for common OOM errors as well as instructions for reducing server startup time during development.

Troubleshooting OOM Errors

Out-of-memory (OOM) errors may occur depending on factors such as available HPU memory, model size, and input sequence length. If the standard inference command fails with an OOM error, the following server options can help mitigate the issue (a combined launch sketch follows the list):

  • --mem-fraction-static: Controls the fraction of memory pre-allocated for the HPU cache. Increasing it gives the KV cache more room, which can prevent KV cache OOM errors.

  • --max-running-requests: Limits the number of concurrent requests in a batch, lowering overall KV cache usage.

  • --tp-size: Enables tensor parallelism by sharding model weights across multiple HPUs, freeing memory on each HPU for the KV cache.

  • --chunked-prefill-size: Breaks large prefill sequences into smaller chunks, reducing peak memory consumption during the prefill stage.
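
As an illustration, the hedged sketch below launches the server from a Python script with several of these options combined; the flag values are placeholders only and should be tuned for your model, HPU count, and workload:

    import subprocess

    # Illustrative launch only: the values below are placeholders, not recommendations.
    cmd = [
        "python", "-m", "sglang.launch_server",
        "--model-path", "meta-llama/Meta-Llama-3.1-8B",
        "--dtype", "bfloat16",
        "--tp-size", "2",                  # shard weights across 2 HPUs
        "--max-running-requests", "32",    # cap concurrent requests to limit KV cache usage
        "--chunked-prefill-size", "4096",  # split long prefills into smaller chunks
        "--host", "0.0.0.0",
        "--port", "30000",
    ]
    subprocess.run(cmd, check=True)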

Guidelines for Performance Tuning

The following guidelines can help you optimize performance. Results may vary depending on your specific workload and configuration:

Warmup

Enable warmup during deployment for optimal performance. Warmup duration depends on factors such as input/output sequence lengths, batch size, number of prefill/decode combinations, and data type. Typically, warmup takes from 30 seconds to a few minutes.

Memory Management

HPU Graphs and the KV cache share the same memory pool, so balancing them is crucial:

  • Maximizing KV cache size supports larger batches and higher throughput.

  • Enabling HPU Graphs reduces host overhead and can help lower latency.

  • The split is controlled with --mem-fraction-static, which sets the static memory allocation fraction.

Batch Size Optimization

  • Use --max-running-requests to limit the number of concurrent requests.

  • Use --max-prefill-tokens to restrict tokens processed during the prefill stage.

  • Use --max-total-tokens to cap the total number of tokens held in the KV cache memory pool.

Chunked Prefill

Use --chunked-prefill-size to split large prefill sequences into smaller chunks, which helps:

  • Improve memory utilization

  • Reduce time-to-first-token for long sequences

  • Stabilize memory usage patterns

Tensor Parallelism

Use --tp-size to shard model weights across multiple HPUs, which:

  • Reduces per-device memory usage

  • Enables serving larger models

  • Can improve throughput for certain workloads

Data Types

Choose a precision format based on your needs:

  • --dtype bfloat16: Standard precision balancing accuracy and performance

  • --dtype float16: Lower precision with potential speed gains; may affect output quality

Advanced Features

  • RadixAttention: Enables automatic prefix caching for efficiency with repeated prefixes.

  • Speculative Decoding: Uses smaller draft models to accelerate generation.

  • Continuous Batching: Automatically batches requests to improve throughput.

Performance Monitoring

Monitor the following key metrics during inference:

  • Throughput: Tokens generated per second

  • Latency: Time to first token and inter-token latency

  • Memory Usage: HPU memory utilization

  • Request Queue: Number of pending requests

Use the built-in metrics endpoint to track these metrics:

curl http://localhost:30000/metrics
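
For programmatic monitoring, the same endpoint can be polled from Python. The following is a minimal sketch, assuming the requests package is installed and that the endpoint returns Prometheus-style plain text:

    import requests

    # Fetch the raw metrics text exposed by the SGLang server.
    resp = requests.get("http://localhost:30000/metrics", timeout=5)
    resp.raise_for_status()

    # Print only non-comment lines for a quick look at current values.
    for line in resp.text.splitlines():
        if line and not line.startswith("#"):
            print(line)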

Environment Variables

The following table provides the environment variables for performance tuning: