SGLang Quick Start Guide¶
This quick start guide helps you get SGLang running on the Intel® Gaudi® AI accelerator in just a few steps.
Prerequisites¶
Intel Gaudi drivers installed on your system
Python 3.10 or later
Docker (optional, for containerized deployment)
Quick Installation¶
Install SGLang with Gaudi Support
# Clone the Gaudi-optimized SGLang repository
git clone https://github.com/HabanaAI/sglang-fork.git
cd sglang-fork
# Install SGLang
pip install -e "python[all_hpu]"
# Install Gaudi dependencies
pip install habana-torch-plugin habana-torch-dataloader
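To confirm that PyTorch can see the Gaudi device before going further, a minimal sanity check along these lines should work (assuming the Habana PyTorch plugin installed above exposes the usual torch.hpu namespace):
# Sanity check: verify that PyTorch can see the Gaudi HPU.
# Assumes the habana-torch-plugin installed in the previous step.
import torch
import habana_frameworks.torch as htorch  # registers the "hpu" device with PyTorch
print(torch.hpu.is_available())   # should print True on a Gaudi machine
print(torch.hpu.device_count())   # number of visible HPUs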
Set Environment Variables
export HABANA_VISIBLE_DEVICES=all
export PT_HPU_LAZY_MODE=0
Download a Model (Optional - SGLang will auto-download)
# Example: Download Llama-3.1-8B
huggingface-cli download meta-llama/Meta-Llama-3.1-8B
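Note that Meta Llama checkpoints are gated on Hugging Face, so you may need to log in with a token that has access before downloading:
# Authenticate with Hugging Face (required for gated models such as Llama)
huggingface-cli login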
Quick Start Examples¶
Example 1: Start SGLang Server¶
Start the SGLang server with a popular model:
python -m sglang.launch_server \
--model-path meta-llama/Meta-Llama-3.1-8B \
--host 0.0.0.0 \
--port 30000 \
--tp-size 1
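Once the logs show the server is ready, you can confirm it is responding. A minimal check, assuming the standard SGLang /health endpoint is available in this build:
# Returns a 200 response when the server is up
curl http://localhost:30000/health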
Example 2: Send a Simple Request¶
Use curl to send a request:
curl http://localhost:30000/generate \
-H "Content-Type: application/json" \
-d '{
"text": "The capital of France is",
"sampling_params": {"temperature": 0.7, "max_new_tokens": 64}
}'
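The same /generate request can be sent from Python. A minimal sketch using the requests library:
import requests
# POST the same payload as the curl example above
response = requests.post(
    "http://localhost:30000/generate",
    json={
        "text": "The capital of France is",
        "sampling_params": {"temperature": 0.7, "max_new_tokens": 64},
    },
)
print(response.json())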
Example 3: Use OpenAI-Compatible API¶
from openai import OpenAI
client = OpenAI(
base_url="http://localhost:30000/v1",
api_key="EMPTY"
)
response = client.chat.completions.create(
model="meta-llama/Meta-Llama-3.1-8B",
messages=[{"role": "user", "content": "Hello! How are you?"}],
max_tokens=100
)
print(response.choices[0].message.content)
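Because the server speaks the OpenAI protocol, streaming also works with the same client. A minimal sketch, reusing the client created above:
# Stream tokens as they are generated instead of waiting for the full reply
stream = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B",
    messages=[{"role": "user", "content": "Hello! How are you?"}],
    max_tokens=100,
    stream=True,
)
for chunk in stream:
    # each chunk carries a delta holding the next piece of text (may be None)
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()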
Example 4: Use Native SGLang API¶
import sglang as sgl
from sglang import function, system, user, assistant, gen
@function
def simple_chat(s, question):
s += system("You are a helpful assistant.")
s += user(question)
s += assistant(gen("answer", max_tokens=100))
# Set backend
sgl.set_default_backend(sgl.Runtime(model_path="meta-llama/Meta-Llama-3.1-8B"))
# Run
state = simple_chat.run(question="What is machine learning?")
print(state["answer"])
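The same @function can also be applied to many inputs at once; SGLang's run_batch schedules them in parallel on the backend. A minimal sketch, reusing simple_chat from above:
# Run the same program over several questions in one batched call
states = simple_chat.run_batch(
    [
        {"question": "What is machine learning?"},
        {"question": "What is tensor parallelism?"},
    ]
)
for state in states:
    print(state["answer"])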
Docker Quick Start¶
Using Pre-built Container¶
# Pull and run SGLang container
docker run -it --runtime=habana \
-e HABANA_VISIBLE_DEVICES=all \
-e PT_HPU_LAZY_MODE=0 \
--net=host \
sglang-gaudi:latest \
python -m sglang.launch_server \
--model-path microsoft/DialoGPT-medium \
--host 0.0.0.0 \
--port 30000
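To avoid re-downloading weights every time the container starts, you can mount your local Hugging Face cache into the container (standard Docker volume mounting; the path below is the Hugging Face default cache location):
docker run -it --runtime=habana \
    -e HABANA_VISIBLE_DEVICES=all \
    -e PT_HPU_LAZY_MODE=0 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --net=host \
    sglang-gaudi:latest \
    python -m sglang.launch_server \
    --model-path meta-llama/Meta-Llama-3.1-8B \
    --host 0.0.0.0 \
    --port 30000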
Building Your Own Container¶
# Clone repository
git clone https://github.com/HabanaAI/sglang-fork.git
cd sglang-fork
# Build Docker image
docker build -f docker/Dockerfile.gaudi -t my-sglang-gaudi .
# Run container
docker run -it --runtime=habana \
-e HABANA_VISIBLE_DEVICES=all \
--net=host \
my-sglang-gaudi
Common Parameters¶
Key parameters for the SGLang server:
Parameter | Default | Description
---|---|---
--model-path | Required | Path or name of the model to serve
--tp-size | 1 | Tensor parallelism size (number of HPUs)
--host | 127.0.0.1 | Host address to bind the server
--port | 30000 | Port number for the server
--dtype | auto | Data type (bfloat16, float16, float32)
--max-running-requests | 1000 | Maximum concurrent requests
--mem-fraction-static | 0.9 | Fraction of memory for static allocation
--disable-warmup | False | Skip warmup phase (dev only)
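For example, a launch that shards a model across two HPUs in bfloat16 with a slightly lower static memory fraction might look like this (illustrative values; tune them for your model and hardware):
python -m sglang.launch_server \
    --model-path meta-llama/Meta-Llama-3.1-8B \
    --tp-size 2 \
    --dtype bfloat16 \
    --mem-fraction-static 0.8 \
    --host 0.0.0.0 \
    --port 30000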
Quick Troubleshooting¶
Server won’t start:
Check if Gaudi drivers are installed:
hl-smi
Verify environment variables are set
Ensure sufficient memory for the model
Out of Memory errors:
Increase --tp-size to distribute the model across more HPUs
Use a smaller model or lower precision (--dtype bfloat16)
Slow performance:
Enable warmup in production (remove --disable-warmup)
Tune --mem-fraction-static
Consider using --chunked-prefill-size
Connection issues:
Check firewall settings for the port
Use --host 0.0.0.0 to bind to all interfaces
Verify in the server logs that startup completed successfully (a quick check is shown below)
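A quick end-to-end check from the host running the server (the /get_model_info route is assumed here; any endpoint from the examples above works just as well):
# If this returns model metadata, the server is up and the port is reachable
curl http://localhost:30000/get_model_info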
Next Steps¶
Read the full Inference Using SGLang guide
Explore performance tuning in Managing and Reducing SGLang Warmup Time
Check out SGLang with Gaudi FAQs for common questions
For more detailed information and advanced configurations, continue to the full documentation sections.