SGLang Quick Start Guide

This quick start guide helps you get SGLang running on Intel® Gaudi® AI accelerator in just a few steps.

Prerequisites

Before you begin, make sure that:

  • The Intel Gaudi driver and software stack are installed (verify with hl-smi).
  • Python and pip are available on the host.
  • Docker with the habana runtime is installed, if you plan to use the Docker quick start.

Running SGLang on Intel Gaudi

  1. Clone the SGLang fork repository and navigate to the appropriate directory:

git clone https://github.com/HabanaAI/sglang-fork.git
cd sglang-fork

  2. Install SGLang with Gaudi support:

pip install -e "python[all_hpu]"

  3. Install Gaudi dependencies:

pip install habana-torch-plugin habana-torch-dataloader

  4. Set the environment variables:

    export HABANA_VISIBLE_DEVICES=all
    export PT_HPU_LAZY_MODE=0
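
With the environment set, you can confirm the accelerators are visible before starting the server. The following is a minimal sanity check, assuming the Habana PyTorch bridge installed in step 3 above:

import habana_frameworks.torch.hpu as hthpu

# Both calls come from the Habana PyTorch bridge; they report whether any HPU
# is usable and how many devices HABANA_VISIBLE_DEVICES exposes.
print(hthpu.is_available())
print(hthpu.device_count())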
    

Quick Start Examples

Start the SGLang server with a popular model:

python -m sglang.launch_server \
    --model-path meta-llama/Meta-Llama-3.1-8B \
    --host 0.0.0.0 \
    --port 30000 \
    --tp-size 1

Use curl to send a request:

curl http://localhost:30000/generate \
    -H "Content-Type: application/json" \
    -d '{
        "text": "The capital of France is",
        "sampling_params": {"temperature": 0.7, "max_new_tokens": 64}
    }'
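
The same request can be sent from Python. Here is a minimal sketch using the requests library (an assumption; any HTTP client works against the /generate endpoint):

import requests

# POST the same JSON payload as the curl example to the native endpoint.
resp = requests.post(
    "http://localhost:30000/generate",
    json={
        "text": "The capital of France is",
        "sampling_params": {"temperature": 0.7, "max_new_tokens": 64},
    },
)
print(resp.json()["text"])  # the generated continuation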

You can also use the OpenAI-compatible API:

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:30000/v1",
    api_key="EMPTY"
)

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B",
    messages=[{"role": "user", "content": "Hello! How are you?"}],
    max_tokens=100
)

print(response.choices[0].message.content)
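
The OpenAI-compatible endpoint also supports streaming responses. A short sketch reusing the client above (standard OpenAI client usage; verify the chunk layout against your server version):

# Request a streamed response and print tokens as they arrive.
stream = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B",
    messages=[{"role": "user", "content": "Hello! How are you?"}],
    max_tokens=100,
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:  # the final chunk may carry no content
        print(delta, end="", flush=True)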

Alternatively, use the SGLang frontend language:

import sglang as sgl
from sglang import function, system, user, assistant, gen

@function
def simple_chat(s, question):
    s += system("You are a helpful assistant.")
    s += user(question)
    s += assistant(gen("answer", max_tokens=100))

# Set backend
sgl.set_default_backend(sgl.Runtime(model_path="meta-llama/Meta-Llama-3.1-8B"))

# Run
state = simple_chat.run(question="What is machine learning?")
print(state["answer"])
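
The same function can run over many inputs at once. A brief sketch using SGLang's run_batch, which executes the whole batch through the runtime:

# run_batch takes a list of keyword-argument dicts and returns one state per input.
states = simple_chat.run_batch(
    [
        {"question": "What is machine learning?"},
        {"question": "What is tensor parallelism?"},
    ]
)
for s in states:
    print(s["answer"])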

Docker Quick Start

Run the SGLang server from a container image (the sglang-gaudi:latest tag assumes an image already built; see the build steps below):

docker run -it --runtime=habana \
    -e HABANA_VISIBLE_DEVICES=all \
    -e PT_HPU_LAZY_MODE=0 \
    --net=host \
    sglang-gaudi:latest \
    python -m sglang.launch_server \
        --model-path microsoft/DialoGPT-medium \
        --host 0.0.0.0 \
        --port 30000

To build the image yourself:

  1. Clone the SGLang fork repository and navigate to the appropriate directory:

git clone https://github.com/HabanaAI/sglang-fork.git
cd sglang-fork
  2. Build the Docker image:

docker build -f docker/Dockerfile.gaudi -t my-sglang-gaudi .
  3. Run the container:

docker run -it --runtime=habana \
    -e HABANA_VISIBLE_DEVICES=all \
    --net=host \
    my-sglang-gaudi
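
The server may take several minutes to load the model and warm up. The following is a small readiness check, assuming the server's /health endpoint and the host networking configured above:

import time
import requests

def wait_for_server(url="http://localhost:30000/health", timeout=600):
    """Poll the health endpoint until the server answers or the timeout expires."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            if requests.get(url, timeout=5).status_code == 200:
                return True
        except requests.exceptions.ConnectionError:
            pass  # server is not accepting connections yet
        time.sleep(5)
    return False

print("server ready:", wait_for_server())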

Common Parameters

The following table lists key parameters for the SGLang server.

Parameter                  Default      Description
--model-path               Required     Path or name of the model to serve
--tp-size                  1            Tensor parallelism size (number of HPUs)
--host                     127.0.0.1    Host address to bind the server
--port                     30000        Port number for the server
--dtype                    auto         Data type (for example, bfloat16, float16, float32)
--max-running-requests     1000         Maximum number of concurrent requests
--mem-fraction-static      0.9          Fraction of memory reserved for static allocation
--disable-warmup           False        Skip the warmup phase (development only)
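
For example, several of these parameters can be combined to shard a model across two HPUs. A sketch launching the server from Python via subprocess (equivalent to running the command in a shell; the flag values are illustrative):

import subprocess

# Start the server as a child process; adjust the flags to your deployment.
server = subprocess.Popen([
    "python", "-m", "sglang.launch_server",
    "--model-path", "meta-llama/Meta-Llama-3.1-8B",
    "--host", "0.0.0.0",            # bind all interfaces
    "--port", "30000",
    "--tp-size", "2",               # shard across 2 HPUs
    "--dtype", "bfloat16",
    "--mem-fraction-static", "0.9",
])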

Quick Troubleshooting

The following table provides troubleshooting instructions for common issues that may occur when using SGLang on Intel Gaudi.

Issue                     Solutions
Server is not starting    • Check that the Gaudi drivers are installed using hl-smi.
                          • Verify that the environment variables are set.
                          • Make sure there is sufficient memory for the model.
Out of memory errors      • Increase --tp-size to distribute the model across more HPUs.
                          • Use a smaller model or lower precision (--dtype bfloat16).
Slow performance          • Enable warmup in production (remove --disable-warmup).
                          • Tune --mem-fraction-static.
                          • Consider using --chunked-prefill-size.
Connection issues         • Check firewall settings for the port.
                          • Use --host 0.0.0.0 to bind to all interfaces.
                          • Verify in the logs that the server started successfully.