SGLang Quick Start Guide¶
This quick start guide helps you get SGLang running on the Intel® Gaudi® AI accelerator in just a few steps.
Prerequisites¶
Intel Gaudi drivers installed on your system
Python 3.10 or later
Docker (optional, for containerized deployment)
Quick Installation¶
Install SGLang with Gaudi Support
# Clone the Gaudi-optimized SGLang repository
git clone https://github.com/HabanaAI/sglang-fork.git
cd sglang-fork
# Install SGLang
pip install -e "python[all_hpu]"
# Install Gaudi dependencies
pip install habana-torch-plugin habana-torch-dataloader
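To confirm that PyTorch can see the Gaudi device before going further, a minimal sanity check along these lines should work (assuming the Habana PyTorch plugin installed above exposes the usual torch.hpu namespace):
# Sanity check: verify that PyTorch can see the Gaudi HPU.
# Assumes the habana-torch-plugin installed in the previous step.
import torch
import habana_frameworks.torch as htorch  # registers the "hpu" device with PyTorch
print(torch.hpu.is_available())   # should print True on a Gaudi machine
print(torch.hpu.device_count())   # number of visible HPUs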
Set Environment Variables
export HABANA_VISIBLE_DEVICES=all
export PT_HPU_LAZY_MODE=0
Download a Model (Optional - SGLang will auto-download)
# Example: Download Llama-3.1-8B
huggingface-cli download meta-llama/Meta-Llama-3.1-8B
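Note that Meta Llama checkpoints are gated on Hugging Face, so you may need to log in with a token that has access before downloading:
# Authenticate with Hugging Face (required for gated models such as Llama)
huggingface-cli login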
Quick Start Examples¶
Example 1: Start SGLang Server¶
Start the SGLang server with a popular model:
python -m sglang.launch_server \
--model-path meta-llama/Meta-Llama-3.1-8B \
--host 0.0.0.0 \
--port 30000 \
--tp-size 1
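Once the logs show the server is ready, you can confirm it is responding. A minimal check, assuming the standard SGLang /health endpoint is available in this build:
# Returns a 200 response when the server is up
curl http://localhost:30000/health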
Example 2: Send a Simple Request¶
Use curl to send a request:
curl http://localhost:30000/generate \
-H "Content-Type: application/json" \
-d '{
"text": "The capital of France is",
"sampling_params": {"temperature": 0.7, "max_new_tokens": 64}
}'
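The same /generate request can be sent from Python. A minimal sketch using the requests library:
import requests
# POST the same payload as the curl example above
response = requests.post(
    "http://localhost:30000/generate",
    json={
        "text": "The capital of France is",
        "sampling_params": {"temperature": 0.7, "max_new_tokens": 64},
    },
)
print(response.json())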
Example 3: Use OpenAI-Compatible API¶
from openai import OpenAI
client = OpenAI(
base_url="http://localhost:30000/v1",
api_key="EMPTY"
)
response = client.chat.completions.create(
model="meta-llama/Meta-Llama-3.1-8B",
messages=[{"role": "user", "content": "Hello! How are you?"}],
max_tokens=100
)
print(response.choices[0].message.content)
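Because the server speaks the OpenAI protocol, streaming also works with the same client. A minimal sketch, reusing the client created above:
# Stream tokens as they are generated instead of waiting for the full reply
stream = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B",
    messages=[{"role": "user", "content": "Hello! How are you?"}],
    max_tokens=100,
    stream=True,
)
for chunk in stream:
    # each chunk carries a delta holding the next piece of text (may be None)
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()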
Example 4: Use Native SGLang API¶
import sglang as sgl
from sglang import function, system, user, assistant, gen
@function
def simple_chat(s, question):
s += system("You are a helpful assistant.")
s += user(question)
s += assistant(gen("answer", max_tokens=100))
# Set backend
sgl.set_default_backend(sgl.Runtime(model_path="meta-llama/Meta-Llama-3.1-8B"))
# Run
state = simple_chat.run(question="What is machine learning?")
print(state["answer"])
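The same @function can also be applied to many inputs at once; SGLang's run_batch schedules them in parallel on the backend. A minimal sketch, reusing simple_chat from above:
# Run the same program over several questions in one batched call
states = simple_chat.run_batch(
    [
        {"question": "What is machine learning?"},
        {"question": "What is tensor parallelism?"},
    ]
)
for state in states:
    print(state["answer"])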
Docker Quick Start¶
Using Pre-built Container¶
# Pull and run SGLang container
docker run -it --runtime=habana \
-e HABANA_VISIBLE_DEVICES=all \
-e PT_HPU_LAZY_MODE=0 \
--net=host \
sglang-gaudi:latest \
python -m sglang.launch_server \
--model-path microsoft/DialoGPT-medium \
--host 0.0.0.0 \
--port 30000
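To avoid re-downloading weights every time the container starts, you can mount your local Hugging Face cache into the container (standard Docker volume mounting; the path below is the Hugging Face default cache location):
docker run -it --runtime=habana \
    -e HABANA_VISIBLE_DEVICES=all \
    -e PT_HPU_LAZY_MODE=0 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --net=host \
    sglang-gaudi:latest \
    python -m sglang.launch_server \
    --model-path meta-llama/Meta-Llama-3.1-8B \
    --host 0.0.0.0 \
    --port 30000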
Building Your Own Container¶
# Clone repository
git clone https://github.com/HabanaAI/sglang-fork.git
cd sglang-fork
# Build Docker image
docker build -f docker/Dockerfile.gaudi -t my-sglang-gaudi .
# Run container
docker run -it --runtime=habana \
-e HABANA_VISIBLE_DEVICES=all \
--net=host \
my-sglang-gaudi
Common Parameters¶
Key parameters for the SGLang server:
Parameter | Default | Description
---|---|---
--model-path | Required | Path or name of the model to serve
--tp-size | 1 | Tensor parallelism size (number of HPUs)
--host | 127.0.0.1 | Host address to bind the server
--port | 30000 | Port number for the server
--dtype | auto | Data type (bfloat16, float16, float32)
--max-running-requests | 1000 | Maximum concurrent requests
--mem-fraction-static | 0.9 | Fraction of memory for static allocation
--disable-warmup | False | Skip warmup phase (dev only)
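For example, a launch that shards a model across two HPUs in bfloat16 with a slightly lower static memory fraction might look like this (illustrative values; tune them for your model and hardware):
python -m sglang.launch_server \
    --model-path meta-llama/Meta-Llama-3.1-8B \
    --tp-size 2 \
    --dtype bfloat16 \
    --mem-fraction-static 0.8 \
    --host 0.0.0.0 \
    --port 30000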
Quick Troubleshooting¶
Server won’t start:
Check if Gaudi drivers are installed:
hl-smi
Verify environment variables are set
Ensure sufficient memory for the model
Out of Memory errors:
Increase --tp-size to distribute the model across more HPUs
Use a smaller model or lower precision (--dtype bfloat16)
Slow performance:
Enable warmup in production (remove --disable-warmup)
Tune --mem-fraction-static
Consider using --chunked-prefill-size
Connection issues:
Check firewall settings for the port
Use --host 0.0.0.0 to bind to all interfaces
Verify in the server logs that startup completed successfully (a quick check is shown below)
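A quick end-to-end check from the host running the server (the /get_model_info route is assumed here; any endpoint from the examples above works just as well):
# If this returns model metadata, the server is up and the port is reachable
curl http://localhost:30000/get_model_info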
Next Steps¶
Read the full Inference Using SGLang guide
Explore performance tuning in Managing and Reducing SGLang Warmup Time
Check out SGLang with Gaudi FAQs for common questions
For more detailed information and advanced configurations, continue to the full documentation sections.