SGLang with Gaudi FAQs¶
Prerequisites and System Requirements¶
What are the system requirements for running SGLang on Intel® Gaudi®?
Ubuntu 22.04 LTS OS.
Python 3.10.
Intel Gaudi 2 or Intel Gaudi 3 AI accelerator.
Intel Gaudi software version 1.18.0 and above.
32GB+ system RAM (64GB+ recommended for larger models).
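To verify these prerequisites on an existing machine, a quick check like the following can be used (the package-name filter at the end is only an illustration and may differ between software releases):
# Confirm the driver and accelerators are visible
hl-smi
# Confirm the Python version
python3 --version
# Confirm the Gaudi PyTorch packages are installed (names may vary by release)
pip list | grep -i habana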
Building and Installing SGLang¶
How do I install SGLang with Gaudi support?
Clone the SGLang fork repository:
git clone https://github.com/HabanaAI/sglang-fork.git
cd sglang-fork
Install SGLang with all dependencies:
pip install -e "python[all]"
Install Gaudi-specific dependencies:
pip install habana-torch-plugin habana-torch-dataloader
Set the environment variables:
export HABANA_VISIBLE_DEVICES=all
export PT_HPU_LAZY_MODE=0
Can I run SGLang on multiple Gaudi devices?
Yes, use tensor parallelism to distribute the model across multiple Gaudi devices:
python -m sglang.launch_server \
--model-path meta-llama/Meta-Llama-3.1-8B \
--tp-size 4 \
--host 0.0.0.0 \
--port 30000
Model Support and Compatibility¶
Which models are supported on SGLang with Gaudi?
SGLang on Gaudi supports most popular model architectures including:
LLaMA, LLaMA-2, LLaMA-3, LLaMA-3.1, LLaMA-3.2
Mistral-7B, Mixtral-8x7B, Mixtral-8x22B
Qwen-7B, Qwen-14B, Qwen-72B, Qwen2, Qwen2.5
DeepSeek-V2, DeepSeek-V3, DeepSeek-R1
Can I use custom or fine-tuned models?
Yes, SGLang supports custom models as long as they use compatible architectures. For fine-tuned models:
python -m sglang.launch_server \
--model-path /path/to/your/fine-tuned-model \
--host 0.0.0.0 \
--port 30000
Performance and Optimization¶
How can I improve SGLang performance on Gaudi?
Key optimization strategies (a combined launch command follows this list):
Enable warmup in production:
# Don't use --disable-warmup in production
Tune memory settings:
--mem-fraction-static 0.9
Use appropriate batch sizes:
--max-running-requests 64
Enable chunked prefill for long sequences:
--chunked-prefill-size 8192
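As an illustration, these settings can be combined into a single launch command; the model name and values below are examples rather than tuned recommendations:
python -m sglang.launch_server \
--model-path meta-llama/Meta-Llama-3.1-8B \
--mem-fraction-static 0.9 \
--max-running-requests 64 \
--chunked-prefill-size 8192 \
--host 0.0.0.0 \
--port 30000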
What’s the difference between prefill and decode performance?
Prefill: Processing the input prompt - typically measured in tokens/second
Decode: Generating output tokens - typically measured in tokens/second per sequence
SGLang optimizes both phases differently. Monitor both metrics to understand overall performance.
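A rough client-side way to see both numbers is to time the first streamed token (dominated by prefill) separately from the remaining generation (decode). This sketch assumes the OpenAI-compatible server described in the API section below is running on the default port:
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

start = time.perf_counter()
first_token_time = None
n_chunks = 0
stream = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B",
    messages=[{"role": "user", "content": "Explain KV caching in one paragraph."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices[0].delta.content:
        if first_token_time is None:
            first_token_time = time.perf_counter()  # roughly the end of prefill
        n_chunks += 1  # streamed chunks approximate output tokens
end = time.perf_counter()

print(f"Time to first token (prefill-bound): {first_token_time - start:.2f}s")
print(f"Decode throughput: {n_chunks / (end - first_token_time):.1f} tokens/s")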
Why is my first request slow?
This is normal due to:
JIT compilation of kernels
Memory allocation
Cache warming
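If first-request latency matters, one option is to send a short throwaway request right after the server starts so that compilation and cache warming happen before real traffic arrives. A minimal example using the OpenAI-compatible endpoint (adjust the model name to your deployment):
curl http://localhost:30000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model": "meta-llama/Meta-Llama-3.1-8B", "messages": [{"role": "user", "content": "warmup"}], "max_tokens": 8}'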
Memory Management¶
How much memory does my model need?
Use the memory estimation tool:
python -m sglang.tools.estimate_memory \
--model-path meta-llama/Meta-Llama-3.1-8B \
--tp-size 1 \
--max-seq-len 4096
I’m getting Out of Memory (OOM) errors. How do I fix this?
Try the following solutions in order; a combined launch example follows this list:
Reduce concurrent requests:
--max-running-requests 32
Use tensor parallelism:
--tp-size 2 # or higher
Reduce sequence length:
--max-seq-len 2048
Lower memory utilization:
--mem-fraction-static 0.8
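If several of these changes are needed together, they can be combined into one conservative launch command; the values below are illustrative starting points, not tuned recommendations:
python -m sglang.launch_server \
--model-path meta-llama/Meta-Llama-3.1-8B \
--tp-size 2 \
--max-running-requests 32 \
--max-seq-len 2048 \
--mem-fraction-static 0.8 \
--host 0.0.0.0 \
--port 30000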
How do I monitor memory usage?
Use the following methods:
Check HPU memory
hl-smi
Retrieve SGLang memory metrics:
curl http://localhost:30000/metrics | grep -i mem
Monitor in real-time:
watch -n 1 'hl-smi | grep Memory'
What’s the difference between static and dynamic memory allocation?
Static memory: Pre-allocated for model weights and KV cache (controlled by --mem-fraction-static)
Dynamic memory: Allocated as needed for intermediate computations
Static allocation is more efficient but less flexible.
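As a rough worked example, assuming a device with 96 GB of HBM (the Gaudi 2 capacity; Gaudi 3 cards have more), the static pool is simply the configured fraction times the device memory:
# Illustrative arithmetic only; real usage depends on the model and device.
hbm_total_gb = 96          # assumed HBM capacity of the accelerator
mem_fraction_static = 0.9  # value passed via --mem-fraction-static

static_pool_gb = hbm_total_gb * mem_fraction_static   # weights + KV cache
dynamic_headroom_gb = hbm_total_gb - static_pool_gb   # intermediate computations

print(f"Static pool: {static_pool_gb:.1f} GB, headroom: {dynamic_headroom_gb:.1f} GB")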
API and Integration¶
How do I use SGLang with the OpenAI Python client?
SGLang provides OpenAI-compatible endpoints:
from openai import OpenAI
client = OpenAI(
    base_url="http://localhost:30000/v1",
    api_key="EMPTY"  # SGLang doesn't require authentication
)

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B",
    messages=[
        {"role": "user", "content": "Hello!"}
    ]
)
Can I use SGLang with LangChain?
Yes, use the OpenAI integration:
from langchain.llms import OpenAI
llm = OpenAI(
    openai_api_base="http://localhost:30000/v1",
    openai_api_key="EMPTY",
    model_name="meta-llama/Meta-Llama-3.1-8B"
)
How do I stream responses?
Enable streaming in your requests:
# OpenAI-compatible streaming
response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B",
    messages=[{"role": "user", "content": "Tell me a story"}],
    stream=True
)

for chunk in response:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")
How do I use structured generation (JSON mode)?
Use SGLang’s native API for structured generation:
import sglang as sgl

# Point the frontend at a running SGLang server (default local endpoint assumed)
sgl.set_default_backend(sgl.RuntimeEndpoint("http://localhost:30000"))

@sgl.function
def structured_gen(s, query):
    s += sgl.user(f"Generate JSON for: {query}")
    s += sgl.assistant(sgl.gen("json_output", regex=r'\{.*\}'))

result = structured_gen.run(query="A person with name and age")
Troubleshooting¶
SGLang server won’t start. What should I check?
Debug startup issues:
Check Gaudi devices:
hl-smi # Should show available HPUs
Verify environment:
echo $HABANA_VISIBLE_DEVICES
echo $PT_HPU_LAZY_MODE
Requests are hanging or timing out. What’s wrong?
Common causes and solutions:
Server overloaded: Reduce --max-running-requests
Memory pressure: Check memory usage with hl-smi
Network issues: Verify connectivity and firewall settings
Warmup not complete: Wait for warmup to finish or check logs