SGLang with Gaudi FAQs

This document provides answers to frequently asked questions about using SGLang with Intel® Gaudi® AI accelerator.

Installation and Setup

Q: How do I install SGLang with Gaudi support?

A: Follow these steps:

# Clone the Gaudi-optimized SGLang repository
git clone https://github.com/HabanaAI/sglang-fork.git
cd sglang-fork

# Install SGLang with all dependencies
pip install -e "python[all]"

# Install Gaudi-specific dependencies
pip install habana-torch-plugin habana-torch-dataloader

# Set environment variables
export HABANA_VISIBLE_DEVICES=all
export PT_HPU_LAZY_MODE=0

Q: What are the minimum system requirements for SGLang on Gaudi?

A:

  • Intel Gaudi2 or newer HPU

  • 32GB+ system RAM (64GB+ recommended for larger models)

  • Python 3.10 or later

  • Ubuntu 20.04+ or similar Linux distribution

  • Gaudi drivers and software stack installed

Q: Can I run SGLang on multiple Gaudi devices?

A: Yes, use tensor parallelism to distribute the model across multiple HPUs:

python -m sglang.launch_server \
    --model-path meta-llama/Meta-Llama-3.1-8B \
    --tp-size 4 \
    --host 0.0.0.0 \
    --port 30000

Model Support and Compatibility

Q: Which models are supported on SGLang with Gaudi?

A: SGLang on Gaudi supports most popular model architectures including:

  • LLaMA family: LLaMA, LLaMA-2, LLaMA-3, LLaMA-3.1, LLaMA-3.2

  • Mistral family: Mistral-7B, Mixtral-8x7B, Mixtral-8x22B

  • Qwen family: Qwen-7B, Qwen-14B, Qwen-72B, Qwen2, Qwen2.5

  • DeepSeek family: DeepSeek-V2, DeepSeek-V3, DeepSeek-R1

Q: Can I use custom or fine-tuned models?

A: Yes, SGLang supports custom models as long as they use compatible architectures. For fine-tuned models:

python -m sglang.launch_server \
    --model-path /path/to/your/fine-tuned-model \
    --host 0.0.0.0 \
    --port 30000
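
Before launching, it can help to confirm that the directory contains the files SGLang expects from a HuggingFace-format checkpoint. A minimal sketch (file names follow the usual HuggingFace convention; adjust for how your model was exported):

# Rough sanity check for a local HuggingFace-format checkpoint directory
from pathlib import Path

model_dir = Path("/path/to/your/fine-tuned-model")
for name in ["config.json", "tokenizer_config.json"]:
    print(f"{name}: {'found' if (model_dir / name).exists() else 'MISSING'}")

shards = list(model_dir.glob("*.safetensors")) + list(model_dir.glob("*.bin"))
print(f"weight shards found: {len(shards)}")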

Performance and Optimization

Q: How can I improve SGLang performance on Gaudi?

A: Key optimization strategies:

  1. Enable warmup in production:

    # Don't use --disable-warmup in production
    
  2. Tune memory settings:

    --mem-fraction-static 0.9
    
  3. Use appropriate batch sizes:

    --max-running-requests 64
    
  4. Enable chunked prefill for long sequences:

    --chunked-prefill-size 8192
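
Putting the flags above together, one way to script a tuned launch is to wrap the same CLI in Python (a sketch only; the model path and values are the examples used elsewhere in this document, so adjust them for your workload):

# Sketch: launch a tuned SGLang server by wrapping the CLI shown above
import subprocess

cmd = [
    "python", "-m", "sglang.launch_server",
    "--model-path", "meta-llama/Meta-Llama-3.1-8B",
    "--mem-fraction-static", "0.9",
    "--max-running-requests", "64",
    "--chunked-prefill-size", "8192",
    "--host", "0.0.0.0",
    "--port", "30000",
]
subprocess.run(cmd, check=True)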
    

Q: What’s the difference between prefill and decode performance?

A:

  • Prefill: Processing the input prompt, typically measured in tokens/second

  • Decode: Generating output tokens, typically measured in tokens/second per sequence

SGLang optimizes both phases differently. Monitor both metrics to understand overall performance.
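
A rough way to observe the two phases from the client side is to time a streaming request: time to first token is dominated by prefill, while the spacing of later chunks reflects decode speed. A sketch against the OpenAI-compatible endpoint (model name and prompt are just examples):

# Rough client-side timing: time to first token ~ prefill, inter-chunk rate ~ decode
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

start = time.perf_counter()
first_token_time = None
n_chunks = 0

stream = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B",
    messages=[{"role": "user", "content": "Explain tensor parallelism in two sentences."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        if first_token_time is None:
            first_token_time = time.perf_counter()
        n_chunks += 1
end = time.perf_counter()

if first_token_time is not None:
    print(f"time to first token: {first_token_time - start:.2f}s")
    if n_chunks > 1:
        print(f"decode rate: {(n_chunks - 1) / (end - first_token_time):.1f} chunks/s")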

Q: Why is my first request slow?

A: This is normal due to:

  • JIT compilation of kernels

  • Memory allocation

  • Cache warming
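
If that first-request latency matters, one option is to send a short throwaway request right after the server comes up so the first real user does not pay the cost. A minimal sketch using the OpenAI-compatible endpoint described later in this document:

# Send one short request after startup so later requests don't pay the
# first-request compilation/warmup cost
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")
client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B",
    messages=[{"role": "user", "content": "ping"}],
    max_tokens=1,
)
print("warmup request completed")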

Memory Management

Q: How much memory does my model need?

A: Use the memory estimation tool:

python -m sglang.tools.estimate_memory \
    --model-path meta-llama/Meta-Llama-3.1-8B \
    --tp-size 1 \
    --max-seq-len 4096
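
For a back-of-envelope estimate without the tool: weights take roughly parameter count × bytes per parameter, and the KV cache takes 2 × layers × KV heads × head dimension × bytes per value for every cached token. A sketch using Llama-3.1-8B's published architecture (32 layers, 8 KV heads, head dimension 128, bf16); treat the result as a floor, since activations and runtime overhead come on top:

# Back-of-envelope memory estimate for Llama-3.1-8B in bf16
GiB = 1024**3

params = 8.0e9
bytes_per_param = 2          # bf16
weights = params * bytes_per_param

layers, kv_heads, head_dim = 32, 8, 128
bytes_per_value = 2          # bf16 KV cache
kv_per_token = 2 * layers * kv_heads * head_dim * bytes_per_value  # K and V

seq_len, batch = 4096, 8
kv_cache = kv_per_token * seq_len * batch

print(f"weights:  {weights / GiB:.1f} GiB")
print(f"KV cache: {kv_cache / GiB:.1f} GiB for {batch} x {seq_len}-token sequences")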

Q: I’m getting Out of Memory (OOM) errors. How do I fix this?

A: Try these solutions in order:

  1. Reduce concurrent requests:

    --max-running-requests 32
    
  2. Use tensor parallelism:

    --tp-size 2  # or higher
    
  3. Reduce sequence length:

    --max-seq-len 2048
    
  4. Lower memory utilization:

    --mem-fraction-static 0.8
    

Q: How do I monitor memory usage?

A: Use these methods:

# Check HPU memory
hl-smi

# Get SGLang memory metrics (served as Prometheus-style text)
curl http://localhost:30000/metrics | grep -i mem

# Monitor in real-time
watch -n 1 'hl-smi | grep Memory'

Q: What’s the difference between static and dynamic memory allocation?

A:

  • Static memory: Pre-allocated for model weights and KV cache (controlled by --mem-fraction-static)

  • Dynamic memory: Allocated as needed for intermediate computations

Static allocation is more efficient but less flexible.
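
As a concrete, illustrative example of what the static fraction means: on a Gaudi2 card with 96 GB of HBM, --mem-fraction-static 0.9 reserves roughly 86 GB for weights plus KV cache and leaves the remainder for dynamic allocations (the 16 GB weights figure assumes a bf16 8B model):

# Illustrative budget split for --mem-fraction-static 0.9 on a 96 GB Gaudi2 HPU
total_hbm_gb = 96
mem_fraction_static = 0.9
weights_gb = 16                                       # assumes a bf16 8B model

static_pool_gb = total_hbm_gb * mem_fraction_static   # weights + KV cache
kv_cache_budget_gb = static_pool_gb - weights_gb
dynamic_gb = total_hbm_gb - static_pool_gb            # intermediate computations

print(f"static pool:      {static_pool_gb:.1f} GB")
print(f"KV cache budget:  {kv_cache_budget_gb:.1f} GB")
print(f"dynamic headroom: {dynamic_gb:.1f} GB")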

API and Integration

Q: How do I use SGLang with the OpenAI Python client?

A: SGLang provides OpenAI-compatible endpoints:

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:30000/v1",
    api_key="EMPTY"  # SGLang doesn't require authentication
)

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B",
    messages=[
        {"role": "user", "content": "Hello!"}
    ]
)

Q: Can I use SGLang with LangChain?

A: Yes, use the OpenAI integration:

from langchain.llms import OpenAI

llm = OpenAI(
    openai_api_base="http://localhost:30000/v1",
    openai_api_key="EMPTY",
    model_name="meta-llama/Meta-Llama-3.1-8B"
)

Q: How do I stream responses?

A: Enable streaming in your requests:

# OpenAI-compatible streaming
response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B",
    messages=[{"role": "user", "content": "Tell me a story"}],
    stream=True
)

for chunk in response:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")

Q: How do I use structured generation (JSON mode)?

A: Use SGLang’s native API for structured generation:

import sglang as sgl

# Point the frontend at the running SGLang server
sgl.set_default_backend(sgl.RuntimeEndpoint("http://localhost:30000"))

@sgl.function
def structured_gen(s, query):
    s += sgl.user(f"Generate JSON for: {query}")
    s += sgl.assistant(sgl.gen("json_output", regex=r'\{.*\}'))

result = structured_gen.run(query="A person with name and age")
print(result["json_output"])

Troubleshooting

Q: SGLang server won’t start. What should I check?

A: Debug startup issues:

  1. Check Gaudi devices:

    hl-smi  # Should show available HPUs
    
  2. Verify environment:

    echo $HABANA_VISIBLE_DEVICES
    echo $PT_HPU_LAZY_MODE
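
If both of those look fine, a quick Python-level check can confirm that PyTorch can actually see the devices (this assumes the habana_frameworks package that ships with the Habana PyTorch plugin installed earlier):

# Confirm PyTorch can see the HPUs through the Habana plugin
import habana_frameworks.torch.hpu as hthpu

print("HPU available:", hthpu.is_available())
print("HPU count:", hthpu.device_count())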
    

Q: Requests are hanging or timing out. What’s wrong?

A: Common causes and solutions:

  1. Server overloaded: Reduce --max-running-requests

  2. Memory pressure: Check memory usage with hl-smi

  3. Network issues: Verify connectivity and firewall settings

  4. Warmup not complete: Wait for warmup to finish or check logs
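
While diagnosing, it can also help to set an explicit client-side timeout so a stuck request fails fast instead of hanging indefinitely. With the OpenAI client used elsewhere in this document (the 30-second value is arbitrary):

# Fail fast instead of hanging: set a per-request timeout (seconds) on the client
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:30000/v1",
    api_key="EMPTY",
    timeout=30.0,
)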