SGLang with Gaudi FAQs
This document provides answers to frequently asked questions about using SGLang with Intel® Gaudi® AI accelerators.
Installation and Setup
Q: How do I install SGLang with Gaudi support?
A: Follow these steps:
# Clone the Gaudi-optimized SGLang repository
git clone https://github.com/HabanaAI/sglang-fork.git
cd sglang-fork
# Install SGLang with all dependencies
pip install -e "python[all]"
# Install Gaudi-specific dependencies
pip install habana-torch-plugin habana-torch-dataloader
# Set environment variables
export HABANA_VISIBLE_DEVICES=all
export PT_HPU_LAZY_MODE=0
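To confirm the Gaudi stack is visible to PyTorch before launching a server, a quick check along the following lines can help (a minimal sketch; it assumes the habana_frameworks PyTorch bridge from the Gaudi software stack is installed):
# Sanity check: confirm PyTorch can see the Gaudi HPUs
# (assumes the habana_frameworks PyTorch bridge is installed)
import habana_frameworks.torch.hpu as hthpu

print("HPU available:", hthpu.is_available())
print("HPU count:", hthpu.device_count())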
Q: What are the minimum system requirements for SGLang on Gaudi?
A:
- Intel Gaudi2 or newer HPU
- 32GB+ system RAM (64GB+ recommended for larger models)
- Python 3.10 or later
- Ubuntu 20.04+ or similar Linux distribution
- Gaudi drivers and software stack installed
Q: Can I run SGLang on multiple Gaudi devices?
A: Yes, use tensor parallelism to distribute the model across multiple HPUs:
python -m sglang.launch_server \
--model-path meta-llama/Meta-Llama-3.1-8B \
--tp-size 4 \
--host 0.0.0.0 \
--port 30000
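Once the server is up, a quick readiness check confirms it is serving the expected model before you send traffic. A minimal sketch, assuming the default /health and /get_model_info endpoints and the host/port used above:
# Readiness check for the SGLang server launched above
import requests

base = "http://localhost:30000"
print("health:", requests.get(f"{base}/health", timeout=10).status_code)
print("model info:", requests.get(f"{base}/get_model_info", timeout=10).json())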
Model Support and Compatibility
Q: Which models are supported on SGLang with Gaudi?
A: SGLang on Gaudi supports most popular model architectures including:
LLaMA family: LLaMA, LLaMA-2, LLaMA-3, LLaMA-3.1, LLaMA-3.2
Mistral family: Mistral-7B, Mixtral-8x7B, Mixtral-8x22B
Qwen family: Qwen-7B, Qwen-14B, Qwen-72B, Qwen2, Qwen2.5
DeepSeek family: DeepSeek-V2, DeepSeek-V3, DeepSeek-R1
Q: Can I use custom or fine-tuned models?
A: Yes, SGLang supports custom models as long as they use compatible architectures. For fine-tuned models:
python -m sglang.launch_server \
--model-path /path/to/your/fine-tuned-model \
--host 0.0.0.0 \
--port 30000
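After the server starts, a short smoke test verifies that the fine-tuned weights load and generate. A minimal sketch against SGLang's native /generate endpoint (the prompt and sampling parameters are placeholders):
# Smoke-test the fine-tuned model via the native /generate endpoint
import requests

resp = requests.post(
    "http://localhost:30000/generate",
    json={
        "text": "Briefly introduce yourself.",  # placeholder prompt
        "sampling_params": {"max_new_tokens": 64, "temperature": 0.7},
    },
    timeout=60,
)
print(resp.json()["text"])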
Performance and Optimization
Q: How can I improve SGLang performance on Gaudi?
A: Key optimization strategies:
- Enable warmup in production: do not pass --disable-warmup in production deployments.
- Tune memory settings: --mem-fraction-static 0.9
- Use appropriate batch sizes: --max-running-requests 64
- Enable chunked prefill for long sequences: --chunked-prefill-size 8192
Q: What’s the difference between prefill and decode performance?
A:
- Prefill: Processing the input prompt - typically measured in tokens/second
- Decode: Generating output tokens - typically measured in tokens/second per sequence
SGLang optimizes both phases differently. Monitor both metrics to understand overall performance.
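One practical way to see the two phases separately is to time a streaming request: the delay before the first token approximates prefill latency, and the rate of later tokens approximates decode throughput for that sequence. A rough sketch using the OpenAI-compatible endpoint (streamed chunks are used as a proxy for tokens):
# Rough prefill vs. decode measurement using a streaming request
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

start = time.perf_counter()
first_token_time = None
chunks = 0
stream = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B",
    messages=[{"role": "user", "content": "Write a short paragraph about HPUs."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        if first_token_time is None:
            first_token_time = time.perf_counter()
        chunks += 1
end = time.perf_counter()

if first_token_time:
    print(f"Time to first token (prefill proxy): {first_token_time - start:.3f}s")
    print(f"Decode rate (chunks/sec, per-sequence proxy): {chunks / (end - first_token_time):.1f}")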
Q: Why is my first request slow?
A: This is normal due to:
- JIT compilation of kernels
- Memory allocation
- Cache warming
Memory Management
Q: How much memory does my model need?
A: Use the memory estimation tool:
python -m sglang.tools.estimate_memory \
--model-path meta-llama/Meta-Llama-3.1-8B \
--tp-size 1 \
--max-seq-len 4096
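If the estimation tool is not available in your build, a back-of-envelope calculation is often enough to size a deployment: weights take roughly parameters × bytes per parameter, and the KV cache grows with batch size, sequence length, layer count, and KV-head dimensions. A rough sketch with illustrative numbers for an 8B LLaMA-style model (the layer and head values are assumptions for illustration, not read from the checkpoint):
# Back-of-envelope memory estimate (illustrative values, bf16 weights and KV cache)
params = 8e9                 # ~8B parameters (assumed)
bytes_per_param = 2          # bf16
weight_gb = params * bytes_per_param / 1e9

# KV cache per token = 2 (K and V) * layers * kv_heads * head_dim * bytes
layers, kv_heads, head_dim = 32, 8, 128   # LLaMA-3.1-8B-like values (assumed)
kv_per_token = 2 * layers * kv_heads * head_dim * bytes_per_param
max_seqs, seq_len = 64, 4096
kv_cache_gb = kv_per_token * max_seqs * seq_len / 1e9

print(f"Weights: ~{weight_gb:.0f} GB")
print(f"KV cache ({max_seqs} seqs x {seq_len} tokens): ~{kv_cache_gb:.0f} GB")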
Q: I’m getting Out of Memory (OOM) errors. How do I fix this?
A: Try these solutions in order:
- Reduce concurrent requests: --max-running-requests 32
- Use tensor parallelism: --tp-size 2 (or higher)
- Reduce sequence length: --max-seq-len 2048
- Lower memory utilization: --mem-fraction-static 0.8
Q: How do I monitor memory usage?
A: Use these methods:
# Check HPU memory
hl-smi
# Get SGLang memory metrics (Prometheus text format; requires launching with --enable-metrics)
curl http://localhost:30000/metrics | grep -i mem
# Monitor in real-time
watch -n 1 'hl-smi | grep Memory'
Q: What’s the difference between static and dynamic memory allocation?
A:
- Static memory: Pre-allocated for model weights and KV cache (controlled by --mem-fraction-static)
- Dynamic memory: Allocated as needed for intermediate computations
Static allocation is more efficient but less flexible.
API and Integration
Q: How do I use SGLang with the OpenAI Python client?
A: SGLang provides OpenAI-compatible endpoints:
from openai import OpenAI
client = OpenAI(
base_url="http://localhost:30000/v1",
api_key="EMPTY" # SGLang doesn't require authentication
)
response = client.chat.completions.create(
model="meta-llama/Meta-Llama-3.1-8B",
messages=[
{"role": "user", "content": "Hello!"}
]
)
Q: Can I use SGLang with LangChain?
A: Yes, use the OpenAI integration:
from langchain.llms import OpenAI
llm = OpenAI(
openai_api_base="http://localhost:30000/v1",
openai_api_key="EMPTY",
model_name="meta-llama/Meta-Llama-3.1-8B"
)
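Recent LangChain releases move the OpenAI integrations into the separate langchain-openai package. If you are on a newer version, a chat-model setup along these lines should work against the same endpoint (a sketch, assuming langchain-openai is installed):
# Same SGLang endpoint via the newer langchain-openai package (assumed installed)
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    model="meta-llama/Meta-Llama-3.1-8B",
    base_url="http://localhost:30000/v1",
    api_key="EMPTY",
)
print(llm.invoke("Hello!").content)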
Q: How do I stream responses?
A: Enable streaming in your requests:
# OpenAI-compatible streaming
response = client.chat.completions.create(
model="meta-llama/Meta-Llama-3.1-8B",
messages=[{"role": "user", "content": "Tell me a story"}],
stream=True
)
for chunk in response:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")
Q: How do I use structured generation (JSON mode)?
A: Use SGLang’s native API for structured generation:
import sglang as sgl

# Point the SGLang frontend at the running server before calling .run()
sgl.set_default_backend(sgl.RuntimeEndpoint("http://localhost:30000"))

@sgl.function
def structured_gen(s, query):
    s += sgl.user(f"Generate JSON for: {query}")
    s += sgl.assistant(sgl.gen("json_output", regex=r'\{.*\}'))

result = structured_gen.run(query="A person with name and age")
print(result["json_output"])
Troubleshooting
Q: SGLang server won’t start. What should I check?
A: Debug startup issues:
Check Gaudi devices:
hl-smi # Should show available HPUs
Verify environment:
echo $HABANA_VISIBLE_DEVICES
echo $PT_HPU_LAZY_MODE
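If the device and environment checks look fine, a small script can confirm the variables are actually set in the launching shell and that the chosen port is free (a sketch; adjust the host and port to your launch settings):
# Check Gaudi-related environment variables and whether port 30000 is already taken
import os
import socket

for var in ("HABANA_VISIBLE_DEVICES", "PT_HPU_LAZY_MODE"):
    print(f"{var} = {os.environ.get(var, '<not set>')}")

with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
    in_use = s.connect_ex(("127.0.0.1", 30000)) == 0
print("Port 30000 already in use:", in_use)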
Q: Requests are hanging or timing out. What’s wrong?
A: Common causes and solutions:
- Server overloaded: Reduce --max-running-requests
- Memory pressure: Check memory usage with hl-smi
- Network issues: Verify connectivity and firewall settings
- Warmup not complete: Wait for warmup to finish or check the server logs
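To tell an overloaded or still-warming server apart from a network problem, probe the lightweight endpoints with short timeouts. A minimal sketch (assumes the default /health and /get_model_info endpoints):
# Probe the server with short timeouts to distinguish overload from unreachability
import requests

base = "http://localhost:30000"
for path in ("/health", "/get_model_info"):
    try:
        r = requests.get(base + path, timeout=5)
        print(f"{path}: HTTP {r.status_code}")
    except requests.exceptions.RequestException as exc:
        print(f"{path}: {exc}")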