SGLang Quick Start Guide¶
This quick start guide helps you get SGLang running on the Intel® Gaudi® AI accelerator in just a few steps.
Prerequisites¶
Intel® Gaudi® software and drivers installed. Refer to Driver and Software Installation.
Python 3.10 or later
Docker (optional - for containerized deployment). Refer to Docker Installation.
Running SGLang on Intel Gaudi¶
Install SGLang with Gaudi Support:
Clone the SGLang fork repository and navigate to the appropriate directory:
git clone https://github.com/HabanaAI/sglang-fork.git
cd sglang-fork
Install SGLang:
pip install -e "python[all_hpu]"
Install Gaudi dependencies:
pip install habana-torch-plugin habana-torch-dataloader
Set the environment variables:
export HABANA_VISIBLE_DEVICES=all
export PT_HPU_LAZY_MODE=0
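To verify the installation, check that PyTorch can see the Gaudi device. This is a minimal sketch assuming the habana_frameworks package installed above:
# Quick HPU visibility check (assumes habana_frameworks is installed)
import torch
import habana_frameworks.torch.hpu as hthpu

print("HPU available:", hthpu.is_available())
print("HPU count:", hthpu.device_count())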
Quick Start Examples¶
Start the SGLang server with a popular model:
python -m sglang.launch_server \
--model-path meta-llama/Meta-Llama-3.1-8B \
--host 0.0.0.0 \
--port 30000 \
--tp-size 1
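Once the server reports it is ready, you can confirm it is reachable before sending requests. Upstream SGLang exposes /health and /get_model_info endpoints; assuming the Gaudi fork keeps them:
# Liveness check
curl http://localhost:30000/health
# Returns the model path the server is serving
curl http://localhost:30000/get_model_info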
Use curl to send a request:
curl http://localhost:30000/generate \
-H "Content-Type: application/json" \
-d '{
"text": "The capital of France is",
"sampling_params": {"temperature": 0.7, "max_new_tokens": 64}
}'
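The /generate endpoint in upstream SGLang also supports token streaming; assuming the fork preserves this, set "stream": true and keep the connection open with curl -N:
curl -N http://localhost:30000/generate \
  -H "Content-Type: application/json" \
  -d '{
    "text": "The capital of France is",
    "sampling_params": {"temperature": 0.7, "max_new_tokens": 64},
    "stream": true
  }'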
Alternatively, query the server through its OpenAI-compatible API:
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:30000/v1",
    api_key="EMPTY"
)
response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B",
    messages=[{"role": "user", "content": "Hello! How are you?"}],
    max_tokens=100
)
print(response.choices[0].message.content)
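Streaming works through the OpenAI client as well; a minimal sketch using the same client object:
# Stream tokens as they are generated
stream = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B",
    messages=[{"role": "user", "content": "Hello! How are you?"}],
    max_tokens=100,
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)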
You can also use SGLang's native frontend language, which launches its own runtime rather than connecting to the server started above:
import sglang as sgl
from sglang import function, system, user, assistant, gen
@function
def simple_chat(s, question):
    s += system("You are a helpful assistant.")
    s += user(question)
    s += assistant(gen("answer", max_tokens=100))
# Set backend
sgl.set_default_backend(sgl.Runtime(model_path="meta-llama/Meta-Llama-3.1-8B"))
# Run
state = simple_chat.run(question="What is machine learning?")
print(state["answer"])
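The same decorated function can be applied to many inputs at once with run_batch; a short sketch:
# Run the prompt over several questions in a single batched call
states = simple_chat.run_batch([
    {"question": "What is machine learning?"},
    {"question": "What is tensor parallelism?"},
])
for state in states:
    print(state["answer"])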
Docker Quick Start¶
# Pull and run SGLang container
docker run -it --runtime=habana \
-e HABANA_VISIBLE_DEVICES=all \
-e PT_HPU_LAZY_MODE=0 \
--net=host \
sglang-gaudi:latest \
python -m sglang.launch_server \
--model-path microsoft/DialoGPT-medium \
--host 0.0.0.0 \
--port 30000
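Model weights downloaded inside the container are lost when it exits. Mounting your local Hugging Face cache, and passing a token for gated models such as Llama 3.1, avoids repeated downloads. The paths below are common defaults and an assumption about your setup:
docker run -it --runtime=habana \
  -e HABANA_VISIBLE_DEVICES=all \
  -e PT_HPU_LAZY_MODE=0 \
  -e HF_TOKEN=$HF_TOKEN \
  -v $HOME/.cache/huggingface:/root/.cache/huggingface \
  --net=host \
  sglang-gaudi:latest \
  python -m sglang.launch_server \
  --model-path meta-llama/Meta-Llama-3.1-8B \
  --host 0.0.0.0 \
  --port 30000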
To build the image yourself, clone the SGLang fork repository and navigate to the appropriate directory:
git clone https://github.com/HabanaAI/sglang-fork.git
cd sglang-fork
Build Docker image:
docker build -f docker/Dockerfile.gaudi -t my-sglang-gaudi .
Run container:
docker run -it --runtime=habana \
  -e HABANA_VISIBLE_DEVICES=all \
  --net=host \
  my-sglang-gaudi
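Inside the running container, you can confirm that the Gaudi devices are visible with the hl-smi utility that ships with the Gaudi software stack:
# Lists Gaudi devices, driver version, and memory usage
hl-smi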
Common Parameters¶
The following table lists the key parameters for the SGLang server.

| Parameter | Default | Description |
|---|---|---|
| --model-path | Required | Path or name of the model to serve |
| --tp-size | 1 | Tensor parallelism size (number of HPUs) |
| --host | 127.0.0.1 | Host address to bind the server |
| --port | 30000 | Port number for the server |
| --dtype | auto | Data type (BF16, FP16, FP32) |
| --max-running-requests | 1000 | Maximum concurrent requests |
| --mem-fraction-static | 0.9 | Fraction of memory for static allocation |
| --skip-server-warmup | False | Skip warmup phase (dev only) |
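For example, a launch on two HPUs with an explicit data type and a reduced static-memory fraction might look like the following; the values are illustrative, not tuned recommendations:
python -m sglang.launch_server \
  --model-path meta-llama/Meta-Llama-3.1-8B \
  --tp-size 2 \
  --dtype bfloat16 \
  --mem-fraction-static 0.85 \
  --max-running-requests 512 \
  --host 0.0.0.0 \
  --port 30000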
Quick Troubleshooting¶
The following table provides troubleshooting instructions for common issues that may occur when using SGLang on Intel Gaudi.
| Issue | Solutions |
|---|---|
| Server is not starting | Verify the Gaudi driver and software with hl-smi; confirm HABANA_VISIBLE_DEVICES and PT_HPU_LAZY_MODE are set as shown above; check the server log for model-loading errors |
| Out of Memory errors | Lower --mem-fraction-static; reduce --max-running-requests; increase --tp-size to shard the model across more HPUs, or serve a smaller model |
| Slow performance | Let the warmup phase complete (avoid --skip-server-warmup outside development); confirm PT_HPU_LAZY_MODE=0 is set; check device utilization with hl-smi |
| Connection issues | Bind to a reachable address (--host 0.0.0.0 for remote clients); verify the port with curl http://localhost:30000/health; check that no firewall blocks port 30000 |
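When in doubt, two quick checks cover most of the issues above, assuming the defaults used in this guide:
# Confirm the driver sees the Gaudi devices
hl-smi
# Confirm the server is reachable on the expected port
curl http://localhost:30000/health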