SGLang Quick Start Guide

This quick start guide helps you get SGLang running on the Intel® Gaudi® AI accelerator in just a few steps.

Prerequisites

  • Intel Gaudi drivers installed on your system

  • Python 3.10 or later

  • Docker (optional, for containerized deployment)

Quick Installation

  1. Install SGLang with Gaudi Support

    # Clone the Gaudi-optimized SGLang repository
    git clone https://github.com/HabanaAI/sglang-fork.git
    cd sglang-fork
    
    # Install SGLang
    pip install -e "python[all_hpu]"
    
    # Install Gaudi dependencies
    pip install habana-torch-plugin habana-torch-dataloader
    
  2. Set Environment Variables

    export HABANA_VISIBLE_DEVICES=all
    export PT_HPU_LAZY_MODE=0
    
  3. Download a Model (Optional - SGLang will auto-download)

    # Example: Download Llama-3.1-8B
    huggingface-cli download meta-llama/Meta-Llama-3.1-8B
    

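Before starting the server, it can be worth confirming that the Gaudi devices are visible to PyTorch. The snippet below is a minimal sketch; it assumes the habana_frameworks package installed above exposes the usual HPU helper functions:

# Quick HPU visibility check (habana_frameworks API assumed from the install above)
import habana_frameworks.torch.hpu as hthpu

if hthpu.is_available():
    print(f"Found {hthpu.device_count()} HPU device(s)")
else:
    print("No HPU devices visible - check drivers (hl-smi) and HABANA_VISIBLE_DEVICES")
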
Quick Start Examples

Example 1: Start SGLang Server

Start the SGLang server with a popular model:

python -m sglang.launch_server \
    --model-path meta-llama/Meta-Llama-3.1-8B \
    --host 0.0.0.0 \
    --port 30000 \
    --tp-size 1

Example 2: Send a Simple Request

Use curl to send a request:

curl http://localhost:30000/generate \
    -H "Content-Type: application/json" \
    -d '{
        "text": "The capital of France is",
        "sampling_params": {"temperature": 0.7, "max_new_tokens": 64}
    }'
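
The same request can be sent from Python with the requests package; this is simply a Python equivalent of the curl call above, not a different API:

# Python equivalent of the curl request above (requires the `requests` package)
import requests

resp = requests.post(
    "http://localhost:30000/generate",
    json={
        "text": "The capital of France is",
        "sampling_params": {"temperature": 0.7, "max_new_tokens": 64},
    },
)
print(resp.json())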

Example 3: Use OpenAI-Compatible API

from openai import OpenAI

# Point the OpenAI client at the local SGLang server (no real API key is needed)
client = OpenAI(
    base_url="http://localhost:30000/v1",
    api_key="EMPTY"
)

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B",
    messages=[{"role": "user", "content": "Hello! How are you?"}],
    max_tokens=100
)

print(response.choices[0].message.content)
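
The OpenAI client also supports streaming. The sketch below reuses the client defined above and assumes the server follows the standard streaming behavior of the /v1 chat endpoint:

# Streaming variant of the request above
stream = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B",
    messages=[{"role": "user", "content": "Hello! How are you?"}],
    max_tokens=100,
    stream=True,
)

for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()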

Example 4: Use Native SGLang API

import sglang as sgl
from sglang import function, system, user, assistant, gen

@function
def simple_chat(s, question):
    s += system("You are a helpful assistant.")
    s += user(question)
    s += assistant(gen("answer", max_tokens=100))

# Set backend
sgl.set_default_backend(sgl.Runtime(model_path="meta-llama/Meta-Llama-3.1-8B"))

# Run
state = simple_chat.run(question="What is machine learning?")
print(state["answer"])
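
The same function can also be applied to a batch of inputs. run_batch is part of the SGLang frontend, but treat the exact call signature as an assumption for this build:

# Batched version of the example above (run_batch signature assumed)
states = simple_chat.run_batch([
    {"question": "What is machine learning?"},
    {"question": "What is tensor parallelism?"},
])
for state in states:
    print(state["answer"])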

Docker Quick Start

Using Pre-built Container

# Pull and run SGLang container
docker run -it --runtime=habana \
    -e HABANA_VISIBLE_DEVICES=all \
    -e PT_HPU_LAZY_MODE=0 \
    --net=host \
    sglang-gaudi:latest \
    python -m sglang.launch_server \
        --model-path microsoft/DialoGPT-medium \
        --host 0.0.0.0 \
        --port 30000

Building Your Own Container

# Clone repository
git clone https://github.com/HabanaAI/sglang-fork.git
cd sglang-fork

# Build Docker image
docker build -f docker/Dockerfile.gaudi -t my-sglang-gaudi .

# Run container
docker run -it --runtime=habana \
    -e HABANA_VISIBLE_DEVICES=all \
    --net=host \
    my-sglang-gaudi

Common Parameters

Key parameters for the SGLang server:

Parameter                 Default     Description
--model-path              Required    Path or name of the model to serve
--tp-size                 1           Tensor parallelism size (number of HPUs)
--host                    127.0.0.1   Host address to bind the server
--port                    30000       Port number for the server
--dtype                   auto        Data type (bfloat16, float16, float32)
--max-running-requests    1000        Maximum concurrent requests
--mem-fraction-static     0.9         Fraction of memory for static allocation
--disable-warmup          False       Skip warmup phase (development only)
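
Several of these flags have Python-side equivalents when using sgl.Runtime directly. The keyword names below mirror the CLI flags and are an assumption; verify them against your build:

# Hypothetical mapping of CLI flags to Runtime kwargs - verify against your build
import sglang as sgl

runtime = sgl.Runtime(
    model_path="meta-llama/Meta-Llama-3.1-8B",
    tp_size=1,
    mem_fraction_static=0.9,
    dtype="bfloat16",
)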

Quick Troubleshooting

Server won’t start:

  • Check if Gaudi drivers are installed: hl-smi

  • Verify environment variables are set

  • Ensure sufficient memory for the model

Out of Memory errors:

  • Increase --tp-size to distribute the model across more HPUs

  • Use a smaller model or lower precision (--dtype bfloat16)

Slow performance:

  • Enable warmup in production (remove --disable-warmup)

  • Tune --mem-fraction-static

  • Consider using --chunked-prefill-size

Connection issues:

  • Check firewall settings for the port

  • Use --host 0.0.0.0 to bind to all interfaces

  • Verify the server started successfully in logs
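
A quick programmatic reachability check can also help narrow things down; the /health route used here is an assumption and may differ between builds:

# Minimal server reachability check (the /health route is an assumption)
import requests

try:
    r = requests.get("http://localhost:30000/health", timeout=5)
    print("Server reachable, status:", r.status_code)
except requests.RequestException:
    print("Could not connect - is the server running and the port open?")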

Next Steps

For more detailed information and advanced configurations, continue to the full documentation sections.