SGLang with Gaudi FAQs

Prerequisites and System Requirements

What are the system requirements for running SGLang on Intel® Gaudi®?

  • Ubuntu 22.04 LTS OS.

  • Python 3.10.

  • Intel Gaudi 2 or Intel Gaudi 3 AI accelerator.

  • Intel Gaudi software version 1.18.0 and above.

  • 32GB+ system RAM (64GB+ recommended for larger models).
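
A quick way to check these prerequisites on the host (hl-smi ships with the Gaudi software stack; the other commands are standard Linux tools):

lsb_release -d        # expect Ubuntu 22.04 LTS
python3 --version     # expect Python 3.10.x
hl-smi                # lists Gaudi devices and the installed driver/software version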

Building and Installing SGLang

How do I install SGLang with Gaudi support?

  1. Clone the SGLang fork repository:

    git clone https://github.com/HabanaAI/sglang-fork.git
    cd sglang-fork
    
  2. Install SGLang with all dependencies:

    pip install -e "python[all]"
    
  3. Install Gaudi-specific dependencies:

    pip install habana-torch-plugin habana-torch-dataloader
    
  4. Set the environment variables:

    export HABANA_VISIBLE_DEVICES=all
    export PT_HPU_LAZY_MODE=0
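
  5. (Optional) Verify that PyTorch can see the Gaudi devices. This is a minimal check, assuming the habana_frameworks PyTorch bridge from step 3 is importable:

    python3 -c "import habana_frameworks.torch.hpu as hthpu; print(hthpu.is_available(), hthpu.device_count())"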
    

Can I run SGLang on multiple Gaudi devices?

Yes. Use tensor parallelism to distribute the model across multiple Gaudi devices (--tp-size must not exceed the number of visible devices):

python -m sglang.launch_server \
    --model-path meta-llama/Meta-Llama-3.1-8B \
    --tp-size 4 \
    --host 0.0.0.0 \
    --port 30000
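
Once the server is up, a quick smoke test against the OpenAI-compatible chat endpoint (described later in this FAQ) confirms the multi-device setup is serving; the request below is illustrative:

curl http://localhost:30000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "meta-llama/Meta-Llama-3.1-8B", "messages": [{"role": "user", "content": "Hello"}], "max_tokens": 16}'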

Model Support and Compatibility

Which models are supported on SGLang with Gaudi?

SGLang on Gaudi supports most popular model architectures, including:

  • LLaMA, LLaMA-2, LLaMA-3, LLaMA-3.1, LLaMA-3.2

  • Mistral-7B, Mixtral-8x7B, Mixtral-8x22B

  • Qwen-7B, Qwen-14B, Qwen-72B, Qwen2, Qwen2.5

  • DeepSeek-V2, DeepSeek-V3, DeepSeek-R1

Can I use custom or fine-tuned models?

Yes, SGLang supports custom models as long as they use compatible architectures. For fine-tuned models:

python -m sglang.launch_server \
    --model-path /path/to/your/fine-tuned-model \
    --host 0.0.0.0 \
    --port 30000
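
The path should point to a standard Hugging Face-style checkpoint directory. As an illustrative check, a compatible directory typically contains:

ls /path/to/your/fine-tuned-model
# config.json  generation_config.json  tokenizer.json  tokenizer_config.json  *.safetensors (or *.bin)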

Performance and Optimization

How can I improve SGLang performance on Gaudi?

Key optimization strategies:

  1. Enable warmup in production:

    # Don't use --disable-warmup in production
    
  2. Tune memory settings:

    --mem-fraction-static 0.9
    
  3. Use appropriate batch sizes:

    --max-running-requests 64
    
  4. Enable chunked prefill for long sequences:

    --chunked-prefill-size 8192
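
These options can be combined in a single launch command; the values below are the illustrative ones from the list above and should be tuned per model and workload:

python -m sglang.launch_server \
    --model-path meta-llama/Meta-Llama-3.1-8B \
    --mem-fraction-static 0.9 \
    --max-running-requests 64 \
    --chunked-prefill-size 8192 \
    --host 0.0.0.0 \
    --port 30000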
    

What’s the difference between prefill and decode performance?

  • Prefill: Processing the input prompt - typically measured in tokens/second

  • Decode: Generating output tokens - typically measured in tokens/second per sequence

SGLang optimizes both phases differently. Monitor both metrics to understand overall performance.
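
A simple way to observe the two phases separately is to time a streaming request: the delay before the first chunk is dominated by prefill, while the spacing of later chunks reflects decode speed. A minimal sketch against the OpenAI-compatible endpoint (client setup as in the API section below):

import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

start = time.perf_counter()
ttft = None
chunks = 0
stream = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B",
    messages=[{"role": "user", "content": "Explain tensor parallelism in two sentences."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        if ttft is None:
            ttft = time.perf_counter() - start  # prefill-dominated latency
        chunks += 1
elapsed = time.perf_counter() - start

print(f"Time to first token: {ttft:.2f}s")
if chunks > 1:
    print(f"Decode rate: {(chunks - 1) / (elapsed - ttft):.1f} chunks/s")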

Why is my first request slow?

This is normal due to:

  • JIT compilation of kernels

  • Memory allocation

  • Cache warming

Memory Management

How much memory does my model need?

Use the memory estimation tool:

python -m sglang.tools.estimate_memory \
    --model-path meta-llama/Meta-Llama-3.1-8B \
    --tp-size 1 \
    --max-seq-len 4096
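
As a rough cross-check, weight memory is approximately parameter count × bytes per parameter, and the KV cache grows linearly with context length. A back-of-the-envelope sketch, assuming bf16 weights and the published Llama-3.1-8B configuration (32 layers, 8 KV heads, head dimension 128):

# Rough memory estimate for Llama-3.1-8B in bf16 (2 bytes per value)
params = 8e9
bytes_per_value = 2
weights_gib = params * bytes_per_value / 2**30               # ~14.9 GiB of weights

# KV cache per token = 2 (K and V) * layers * kv_heads * head_dim * bytes
layers, kv_heads, head_dim = 32, 8, 128
kv_per_token = 2 * layers * kv_heads * head_dim * bytes_per_value   # 128 KiB per token

seq_len = 4096
kv_per_seq_gib = kv_per_token * seq_len / 2**30              # ~0.5 GiB per sequence

print(f"weights ~{weights_gib:.1f} GiB, KV cache ~{kv_per_seq_gib:.2f} GiB per {seq_len}-token sequence")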

I’m getting Out of Memory (OOM) errors. How do I fix this?

Try the following solutions in order:

  1. Reduce concurrent requests:

    --max-running-requests 32
    
  2. Use tensor parallelism:

    --tp-size 2  # or higher
    
  3. Reduce sequence length:

    --max-seq-len 2048
    
  4. Lower memory utilization:

    --mem-fraction-static 0.8
    

How do I monitor memory usage?

Use the following methods:

  1. Check HPU memory:

    hl-smi
    
  2. Retrieve SGLang server metrics (exposed in Prometheus text format; depending on the SGLang version, the server may need to be started with --enable-metrics):

    curl http://localhost:30000/metrics | grep -i mem
    
  3. Monitor in real-time:

    watch -n 1 'hl-smi | grep Memory'
    

What’s the difference between static and dynamic memory allocation?

  • Static memory: Pre-allocated for model weights and KV cache (controlled by --mem-fraction-static)

  • Dynamic memory: Allocated as needed for intermediate computations

Static allocation is more efficient but less flexible.

API and Integration

How do I use SGLang with the OpenAI Python client?

SGLang provides OpenAI-compatible endpoints:

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:30000/v1",
    api_key="EMPTY"  # SGLang doesn't require an API key by default
)

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B",
    messages=[
        {"role": "user", "content": "Hello!"}
    ]
)
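
The generated text is then available on the response object:

print(response.choices[0].message.content)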

Can I use SGLang with LangChain?

Yes, use the OpenAI integration:

from langchain.llms import OpenAI

llm = OpenAI(
    openai_api_base="http://localhost:30000/v1",
    openai_api_key="EMPTY",
    model_name="meta-llama/Meta-Llama-3.1-8B"
)
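
Depending on the LangChain version, the model is then invoked with llm.invoke(...) (newer releases) or by calling it directly (older releases):

print(llm.invoke("Briefly describe the Intel Gaudi accelerator."))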

How do I stream responses?

Enable streaming in your requests:

# OpenAI-compatible streaming
response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B",
    messages=[{"role": "user", "content": "Tell me a story"}],
    stream=True
)

for chunk in response:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")

How do I use structured generation (JSON mode)?

Use SGLang’s native API for structured generation:

import sglang as sgl

# Point the frontend language at the running SGLang server
sgl.set_default_backend(sgl.RuntimeEndpoint("http://localhost:30000"))

@sgl.function
def structured_gen(s, query):
    s += sgl.user(f"Generate JSON for: {query}")
    s += sgl.assistant(sgl.gen("json_output", regex=r'\{.*\}'))

result = structured_gen.run(query="A person with name and age")
print(result["json_output"])

Troubleshooting

SGLang server won’t start. What should I check?

Debug startup issues:

  1. Check Gaudi devices:

    hl-smi  # Should show available HPUs
    
  2. Verify environment:

    echo $HABANA_VISIBLE_DEVICES
    echo $PT_HPU_LAZY_MODE
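
  3. Confirm nothing else is bound to the server port and check whether the server process is listening (illustrative port shown):

    ss -ltnp | grep 30000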
    

Requests are hanging or timing out. What’s wrong?

Common causes and solutions:

  • Server overloaded: Reduce --max-running-requests

  • Memory pressure: Check memory usage with hl-smi

  • Network issues: Verify connectivity and firewall settings

  • Warmup not complete: Wait for warmup to finish or check logs
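
If requests legitimately take a long time (large prompts, big models, heavy load), also make sure the client-side timeout is not the limiting factor. For the OpenAI Python client shown earlier, the timeout can be raised explicitly:

from openai import OpenAI

# Allow up to 5 minutes per request instead of the client default
client = OpenAI(
    base_url="http://localhost:30000/v1",
    api_key="EMPTY",
    timeout=300,
)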