SGLang with Gaudi FAQs

This document provides answers to frequently asked questions about using SGLang with Intel® Gaudi® AI accelerator.

Installation and Setup

Q: How do I install SGLang with Gaudi support?

A: Follow these steps:

# Clone the Gaudi-optimized SGLang repository
git clone https://github.com/HabanaAI/sglang-fork.git
cd sglang-fork

# Install SGLang with all dependencies
pip install -e "python[all]"

# Install Gaudi-specific dependencies
pip install habana-torch-plugin habana-torch-dataloader

# Set environment variables
export HABANA_VISIBLE_DEVICES=all
export PT_HPU_LAZY_MODE=0

Q: What are the minimum system requirements for SGLang on Gaudi?

A:

  • Intel Gaudi2 or newer HPU

  • 32GB+ system RAM (64GB+ recommended for larger models)

  • Python 3.10 or later

  • Ubuntu 20.04+ or similar Linux distribution

  • Gaudi drivers and software stack installed

Q: Can I run SGLang on multiple Gaudi devices?

A: Yes, use tensor parallelism to distribute the model across multiple HPUs:

python -m sglang.launch_server \
    --model-path meta-llama/Meta-Llama-3.1-8B \
    --tp-size 4 \
    --host 0.0.0.0 \
    --port 30000

Model Support and Compatibility

Q: Which models are supported on SGLang with Gaudi?

A: SGLang on Gaudi supports most popular model architectures including:

  • LLaMA family: LLaMA, LLaMA-2, LLaMA-3, LLaMA-3.1, LLaMA-3.2

  • Mistral family: Mistral-7B, Mixtral-8x7B, Mixtral-8x22B

  • Qwen family: Qwen-7B, Qwen-14B, Qwen-72B, Qwen2, Qwen2.5

  • DeepSeek family: DeepSeek-V2, DeepSeek-V3, DeepSeek-R1

Q: Can I use custom or fine-tuned models?

A: Yes, SGLang supports custom models as long as they use compatible architectures. For fine-tuned models:

python -m sglang.launch_server \
    --model-path /path/to/your/fine-tuned-model \
    --host 0.0.0.0 \
    --port 30000
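
Before launching, it can help to confirm that the directory contains the files SGLang expects from a HuggingFace-format checkpoint. A minimal sketch (file names follow the usual HuggingFace convention; adjust for how your model was exported):

# Rough sanity check for a local HuggingFace-format checkpoint directory
from pathlib import Path

model_dir = Path("/path/to/your/fine-tuned-model")
for name in ["config.json", "tokenizer_config.json"]:
    print(f"{name}: {'found' if (model_dir / name).exists() else 'MISSING'}")

shards = list(model_dir.glob("*.safetensors")) + list(model_dir.glob("*.bin"))
print(f"weight shards found: {len(shards)}")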

Performance and Optimization

Q: How can I improve SGLang performance on Gaudi?

A: Key optimization strategies:

  1. Enable warmup in production:

    # Don't use --disable-warmup in production
    
  2. Tune memory settings:

    --mem-fraction-static 0.9
    
  3. Use appropriate batch sizes:

    --max-running-requests 64
    
  4. Enable chunked prefill for long sequences:

    --chunked-prefill-size 8192
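
Putting the flags above together, one way to script a tuned launch is to wrap the same CLI in Python (a sketch only; the model path and values are the examples used elsewhere in this document, so adjust them for your workload):

# Sketch: launch a tuned SGLang server by wrapping the CLI shown above
import subprocess

cmd = [
    "python", "-m", "sglang.launch_server",
    "--model-path", "meta-llama/Meta-Llama-3.1-8B",
    "--mem-fraction-static", "0.9",
    "--max-running-requests", "64",
    "--chunked-prefill-size", "8192",
    "--host", "0.0.0.0",
    "--port", "30000",
]
subprocess.run(cmd, check=True)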
    

Q: What’s the difference between prefill and decode performance?

A:

  • Prefill: Processing the input prompt, typically measured in tokens/second

  • Decode: Generating output tokens, typically measured in tokens/second per sequence

SGLang optimizes both phases differently. Monitor both metrics to understand overall performance.
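
A rough way to observe the two phases from the client side is to time a streaming request: time to first token is dominated by prefill, while the spacing of later chunks reflects decode speed. A sketch against the OpenAI-compatible endpoint (model name and prompt are just examples):

# Rough client-side timing: time to first token ~ prefill, inter-chunk rate ~ decode
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

start = time.perf_counter()
first_token_time = None
n_chunks = 0

stream = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B",
    messages=[{"role": "user", "content": "Explain tensor parallelism in two sentences."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        if first_token_time is None:
            first_token_time = time.perf_counter()
        n_chunks += 1
end = time.perf_counter()

if first_token_time is not None:
    print(f"time to first token: {first_token_time - start:.2f}s")
    if n_chunks > 1:
        print(f"decode rate: {(n_chunks - 1) / (end - first_token_time):.1f} chunks/s")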

Q: Why is my first request slow?

A: This is normal due to:

  • JIT compilation of kernels

  • Memory allocation

  • Cache warming
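
If that first-request latency matters, one option is to send a short throwaway request right after the server comes up so the first real user does not pay the cost. A minimal sketch using the OpenAI-compatible endpoint described later in this document:

# Send one short request after startup so later requests don't pay the
# first-request compilation/warmup cost
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")
client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B",
    messages=[{"role": "user", "content": "ping"}],
    max_tokens=1,
)
print("warmup request completed")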

Memory Management

Q: How much memory does my model need?

A: Use the memory estimation tool:

python -m sglang.tools.estimate_memory \
    --model-path meta-llama/Meta-Llama-3.1-8B \
    --tp-size 1 \
    --max-seq-len 4096
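
For a back-of-envelope estimate without the tool: weights take roughly parameter count × bytes per parameter, and the KV cache takes 2 × layers × KV heads × head dimension × bytes per value for every cached token. A sketch using Llama-3.1-8B's published architecture (32 layers, 8 KV heads, head dimension 128, bf16); treat the result as a floor, since activations and runtime overhead come on top:

# Back-of-envelope memory estimate for Llama-3.1-8B in bf16
GiB = 1024**3

params = 8.0e9
bytes_per_param = 2          # bf16
weights = params * bytes_per_param

layers, kv_heads, head_dim = 32, 8, 128
bytes_per_value = 2          # bf16 KV cache
kv_per_token = 2 * layers * kv_heads * head_dim * bytes_per_value  # K and V

seq_len, batch = 4096, 8
kv_cache = kv_per_token * seq_len * batch

print(f"weights:  {weights / GiB:.1f} GiB")
print(f"KV cache: {kv_cache / GiB:.1f} GiB for {batch} x {seq_len}-token sequences")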

Q: I’m getting Out of Memory (OOM) errors. How do I fix this?

A: Try these solutions in order:

  1. Reduce concurrent requests:

    --max-running-requests 32
    
  2. Use tensor parallelism:

    --tp-size 2  # or higher
    
  3. Reduce sequence length:

    --max-seq-len 2048
    
  4. Lower memory utilization:

    --mem-fraction-static 0.8
    

Q: How do I monitor memory usage?

A: Use these methods:

# Check HPU memory
hl-smi

# Get SGLang memory metrics (served as Prometheus-style text)
curl http://localhost:30000/metrics | grep -i mem

# Monitor in real-time
watch -n 1 'hl-smi | grep Memory'

Q: What’s the difference between static and dynamic memory allocation?

A:

  • Static memory: Pre-allocated for model weights and KV cache (controlled by --mem-fraction-static)

  • Dynamic memory: Allocated as needed for intermediate computations

Static allocation is more efficient but less flexible.
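
As a concrete, illustrative example of what the static fraction means: on a Gaudi2 card with 96 GB of HBM, --mem-fraction-static 0.9 reserves roughly 86 GB for weights plus KV cache and leaves the remainder for dynamic allocations (the 16 GB weights figure assumes a bf16 8B model):

# Illustrative budget split for --mem-fraction-static 0.9 on a 96 GB Gaudi2 HPU
total_hbm_gb = 96
mem_fraction_static = 0.9
weights_gb = 16                                       # assumes a bf16 8B model

static_pool_gb = total_hbm_gb * mem_fraction_static   # weights + KV cache
kv_cache_budget_gb = static_pool_gb - weights_gb
dynamic_gb = total_hbm_gb - static_pool_gb            # intermediate computations

print(f"static pool:      {static_pool_gb:.1f} GB")
print(f"KV cache budget:  {kv_cache_budget_gb:.1f} GB")
print(f"dynamic headroom: {dynamic_gb:.1f} GB")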

API and Integration

Q: How do I use SGLang with the OpenAI Python client?

A: SGLang provides OpenAI-compatible endpoints:

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:30000/v1",
    api_key="EMPTY"  # SGLang doesn't require authentication
)

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B",
    messages=[
        {"role": "user", "content": "Hello!"}
    ]
)

Q: Can I use SGLang with LangChain?

A: Yes, use the OpenAI integration:

from langchain.llms import OpenAI

llm = OpenAI(
    openai_api_base="http://localhost:30000/v1",
    openai_api_key="EMPTY",
    model_name="meta-llama/Meta-Llama-3.1-8B"
)

Q: How do I stream responses?

A: Enable streaming in your requests:

# OpenAI-compatible streaming
response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B",
    messages=[{"role": "user", "content": "Tell me a story"}],
    stream=True
)

for chunk in response:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")

Q: How do I use structured generation (JSON mode)?

A: Use SGLang’s native API for structured generation:

import sglang as sgl

# Point the frontend at the running SGLang server
sgl.set_default_backend(sgl.RuntimeEndpoint("http://localhost:30000"))

@sgl.function
def structured_gen(s, query):
    s += sgl.user(f"Generate JSON for: {query}")
    s += sgl.assistant(sgl.gen("json_output", regex=r'\{.*\}'))

result = structured_gen.run(query="A person with name and age")
print(result["json_output"])

Troubleshooting

Q: SGLang server won’t start. What should I check?

A: Debug startup issues:

  1. Check Gaudi devices:

    hl-smi  # Should show available HPUs
    
  2. Verify environment:

    echo $HABANA_VISIBLE_DEVICES
    echo $PT_HPU_LAZY_MODE
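
If both of those look fine, a quick Python-level check can confirm that PyTorch can actually see the devices (this assumes the habana_frameworks package that ships with the Habana PyTorch plugin installed earlier):

# Confirm PyTorch can see the HPUs through the Habana plugin
import habana_frameworks.torch.hpu as hthpu

print("HPU available:", hthpu.is_available())
print("HPU count:", hthpu.device_count())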
    

Q: Requests are hanging or timing out. What’s wrong?

A: Common causes and solutions:

  1. Server overloaded: Reduce --max-running-requests

  2. Memory pressure: Check memory usage with hl-smi

  3. Network issues: Verify connectivity and firewall settings

  4. Warmup not complete: Wait for warmup to finish or check logs
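
While diagnosing, it can also help to set an explicit client-side timeout so a stuck request fails fast instead of hanging indefinitely. With the OpenAI client used elsewhere in this document (the 30-second value is arbitrary):

# Fail fast instead of hanging: set a per-request timeout (seconds) on the client
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:30000/v1",
    api_key="EMPTY",
    timeout=30.0,
)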