Profiling with SGLang

This document describes how to profile SGLang performance on Intel® Gaudi® AI accelerator to identify bottlenecks and optimize inference workloads.

Overview

SGLang profiling helps you:

  • Identify Performance Bottlenecks: Locate slow operations in your inference pipeline

  • Optimize Memory Usage: Understand memory allocation patterns and optimize utilization

  • Tune Parameters: Make data-driven decisions about configuration parameters

  • Monitor Production: Track performance metrics in production environments

  • Debug Issues: Diagnose performance problems and unexpected behavior

SGLang provides multiple profiling tools and integration with Gaudi’s native profiling capabilities.

Built-in Profiling

Basic Performance Metrics

Enable basic profiling to track key metrics:

python -m sglang.launch_server \
    --model-path meta-llama/Meta-Llama-3.1-8B \
    --enable-profiling \
    --profiling-interval 10 \
    --host 0.0.0.0 \
    --port 30000

Access metrics via the metrics endpoint:

curl http://localhost:30000/metrics

Example metrics output:

{
    "throughput": {
        "requests_per_second": 12.5,
        "tokens_per_second": 1250.0,
        "prefill_tokens_per_second": 2500.0,
        "decode_tokens_per_second": 1150.0
    },
    "latency": {
        "time_to_first_token_ms": 45.2,
        "inter_token_latency_ms": 8.1,
        "end_to_end_latency_ms": 892.3
    },
    "memory": {
        "hpu_memory_used_gb": 24.5,
        "hpu_memory_total_gb": 94.6,
        "kv_cache_usage_gb": 18.2,
        "model_memory_gb": 6.3
    },
    "requests": {
        "active_requests": 3,
        "queued_requests": 1,
        "completed_requests": 1247
    }
}
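
Individual fields from this response can be pulled out for quick checks or shell scripts; for example, assuming jq is available:

curl -s http://localhost:30000/metrics | jq '.latency.time_to_first_token_ms'
curl -s http://localhost:30000/metrics | jq '.memory.kv_cache_usage_gb'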

Gaudi Profiler Integration

Native Gaudi Profiling

Use Habana’s profiler for detailed device-level analysis:

# Set environment variables for profiling
export HABANA_PROFILE=profile_api_light
export SGLANG_TORCH_PROFILER_DIR=<path for profiling output>
python -m sglang.bench_offline_throughput \
    --dataset-name random \
    --dtype bfloat16 \
    --model-path /mnt/weka/data/pytorch/llama3.1/Meta-Llama-3.1-8B-Instruct \
    --num-prompts $NUM_PROMPTS \
    --random-input-len $INPUT_LEN \
    --random-output-len $OUTPUT_LEN \
    --device hpu \
    --page-size 128 \
    --disable-radix-cache \
    --max-prefill-tokens 2048 \
    --random-range-ratio 1.0 \
    --profile
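
The command above assumes the dataset shape variables have already been exported; the values below are only illustrative and should be adjusted to match your workload:

export NUM_PROMPTS=512
export INPUT_LEN=1024
export OUTPUT_LEN=512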

HPU Profiler with SGLang

Use the HPU profiler for kernel-level analysis:

import torch
import habana_frameworks.torch.core as htcore  # registers the HPU device with torch.profiler
import sglang as sgl

# Enable profiling of CPU and HPU activity; traces are written for TensorBoard
with torch.profiler.profile(
    activities=[
        torch.profiler.ProfilerActivity.CPU,
        torch.profiler.ProfilerActivity.HPU
    ],
    on_trace_ready=torch.profiler.tensorboard_trace_handler("./hpu_traces")
) as prof:

    # Set up SGLang backend
    backend = sgl.Runtime(model_path="meta-llama/Meta-Llama-3.1-8B")
    sgl.set_default_backend(backend)

    # Run inference with profiling
    @sgl.function
    def profiled_inference(s, prompt):
        s += sgl.user(prompt)
        s += sgl.assistant(sgl.gen("response", max_tokens=100))

    # Execute multiple requests for comprehensive profiling
    for i in range(10):
        result = profiled_inference.run(prompt=f"Test prompt {i}")
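
The resulting traces can then be inspected in TensorBoard, assuming the torch-tb-profiler plugin is installed:

pip install torch-tb-profiler
tensorboard --logdir ./hpu_traces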

Custom Profiling

Performance Benchmarking

Create custom benchmarks for your specific use case:

import time
import statistics
import sglang as sgl
from sglang import function, user, assistant, gen

class SGLangProfiler:
    def __init__(self, model_path, num_warmup=5, num_iterations=20):
        self.backend = sgl.Runtime(model_path=model_path)
        sgl.set_default_backend(self.backend)
        self.num_warmup = num_warmup
        self.num_iterations = num_iterations

    def benchmark_throughput(self, prompts, max_tokens=100):
        @function
        def inference_func(s, prompt):
            s += user(prompt)
            s += assistant(gen("response", max_tokens=max_tokens))

        # Warmup
        for _ in range(self.num_warmup):
            inference_func.run(prompt=prompts[0])

        # Actual benchmark
        latencies = []
        start_time = time.time()

        for i in range(self.num_iterations):
            prompt = prompts[i % len(prompts)]

            request_start = time.time()
            result = inference_func.run(prompt=prompt)
            request_end = time.time()

            latencies.append(request_end - request_start)

        total_time = time.time() - start_time

        return {
            "total_time": total_time,
            "requests_per_second": self.num_iterations / total_time,
            "avg_latency": statistics.mean(latencies),
            "p50_latency": statistics.median(latencies),
            "p95_latency": sorted(latencies)[int(0.95 * len(latencies))],
            "p99_latency": sorted(latencies)[int(0.99 * len(latencies))]
        }

# Usage
profiler = SGLangProfiler("meta-llama/Meta-Llama-3.1-8B")
test_prompts = [
    "Explain machine learning",
    "What is quantum computing?",
    "Describe neural networks",
    # Add more test prompts...
]

results = profiler.benchmark_throughput(test_prompts)
print(f"Throughput: {results['requests_per_second']:.2f} req/s")
print(f"P50 Latency: {results['p50_latency']:.3f}s")

Memory Profiling

Monitor memory usage patterns:

import psutil
import time
import sglang as sgl

class MemoryProfiler:
    def __init__(self):
        self.memory_snapshots = []

    def take_snapshot(self, label):
        # Get HPU memory info (requires habana-torch-plugin)
        try:
            import habana_frameworks.torch.hpu as hpu
            hpu_memory = hpu.memory_stats()
            hpu_allocated = hpu_memory.get('allocated_bytes.all.current', 0) / 1024**3
            hpu_reserved = hpu_memory.get('reserved_bytes.all.current', 0) / 1024**3
        except Exception:
            hpu_allocated = hpu_reserved = 0

        # Get system memory
        process = psutil.Process()
        memory_info = process.memory_info()

        snapshot = {
            'label': label,
            'timestamp': time.time(),
            'hpu_allocated_gb': hpu_allocated,
            'hpu_reserved_gb': hpu_reserved,
            'system_memory_gb': memory_info.rss / 1024**3,
            'virtual_memory_gb': memory_info.vms / 1024**3
        }

        self.memory_snapshots.append(snapshot)
        return snapshot

    def print_summary(self):
        print("Memory Usage Summary:")
        print("-" * 80)
        for snapshot in self.memory_snapshots:
            print(f"{snapshot['label']:20} | "
                  f"HPU: {snapshot['hpu_allocated_gb']:.2f}GB | "
                  f"System: {snapshot['system_memory_gb']:.2f}GB")

# Usage
memory_profiler = MemoryProfiler()

# Profile different stages
memory_profiler.take_snapshot("Initial")

backend = sgl.Runtime(model_path="meta-llama/Meta-Llama-3.1-8B")
memory_profiler.take_snapshot("After model load")

sgl.set_default_backend(backend)
memory_profiler.take_snapshot("After backend setup")

# Run inference
@sgl.function
def test_inference(s, prompt):
    s += sgl.user(prompt)
    s += sgl.assistant(sgl.gen("response", max_tokens=100))

result = test_inference.run(prompt="Hello, how are you?")
memory_profiler.take_snapshot("After inference")

memory_profiler.print_summary()

Production Monitoring

Continuous Monitoring Setup

Set up continuous monitoring for production deployments:

import time
import json
import requests
from datetime import datetime

class ProductionMonitor:
    def __init__(self, sglang_url="http://localhost:30000"):
        self.sglang_url = sglang_url
        self.metrics_history = []

    def collect_metrics(self):
        try:
            response = requests.get(f"{self.sglang_url}/metrics", timeout=5)
            metrics = response.json()

            # Add timestamp
            metrics['timestamp'] = datetime.now().isoformat()
            self.metrics_history.append(metrics)

            return metrics
        except Exception as e:
            print(f"Failed to collect metrics: {e}")
            return None

    def check_health(self):
        try:
            response = requests.get(f"{self.sglang_url}/health", timeout=5)
            return response.status_code == 200
        except requests.RequestException:
            return False

    def analyze_performance(self, window_minutes=10):
        cutoff_time = datetime.now().timestamp() - (window_minutes * 60)
        recent_metrics = [
            m for m in self.metrics_history
            if datetime.fromisoformat(m['timestamp']).timestamp() > cutoff_time
        ]

        if not recent_metrics:
            return None

        # Calculate averages
        avg_throughput = sum(m['throughput']['tokens_per_second']
                           for m in recent_metrics) / len(recent_metrics)
        avg_latency = sum(m['latency']['time_to_first_token_ms']
                        for m in recent_metrics) / len(recent_metrics)

        return {
            'avg_throughput_tps': avg_throughput,
            'avg_time_to_first_token_ms': avg_latency,
            'sample_count': len(recent_metrics)
        }

    def save_metrics(self, filename):
        with open(filename, 'w') as f:
            json.dump(self.metrics_history, f, indent=2)

# Usage for continuous monitoring
monitor = ProductionMonitor()

while True:
    if monitor.check_health():
        metrics = monitor.collect_metrics()
        if metrics:
            # Check for performance issues
            throughput = metrics['throughput']['tokens_per_second']
            if throughput < 100:  # Threshold
                print(f"WARNING: Low throughput detected: {throughput} TPS")

            memory_usage = metrics['memory']['hpu_memory_used_gb']
            total_memory = metrics['memory']['hpu_memory_total_gb']
            if memory_usage / total_memory > 0.95:  # 95% threshold
                print(f"WARNING: High memory usage: {memory_usage:.1f}GB / {total_memory:.1f}GB")
    else:
        print("ERROR: SGLang server health check failed")

    time.sleep(30)  # Check every 30 seconds
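
To keep the collected history for offline analysis, you can flush it to disk periodically from inside the loop above with the save_metrics method defined earlier; a minimal sketch:

# Write a timestamped snapshot after every 20 samples (about 10 minutes at a 30 s interval)
if monitor.metrics_history and len(monitor.metrics_history) % 20 == 0:
    monitor.save_metrics(f"sglang_metrics_{int(time.time())}.json")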

Profiling Different Workloads

Batch Processing Profiling

Profile batch inference performance:

import concurrent.futures
import time
import sglang as sgl

def profile_batch_inference(prompts, batch_size=4, max_workers=2):
    backend = sgl.Runtime(model_path="meta-llama/Meta-Llama-3.1-8B")
    sgl.set_default_backend(backend)

    @sgl.function
    def batch_inference(s, prompt):
        s += sgl.user(prompt)
        s += sgl.assistant(sgl.gen("response", max_tokens=100))

    start_time = time.time()

    # Process in batches
    results = []
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        for i in range(0, len(prompts), batch_size):
            batch = prompts[i:i+batch_size]
            futures = [executor.submit(batch_inference.run, prompt=p) for p in batch]
            batch_results = [f.result() for f in futures]
            results.extend(batch_results)

    total_time = time.time() - start_time

    return {
        'total_requests': len(prompts),
        'total_time': total_time,
        'requests_per_second': len(prompts) / total_time,
        'batch_size': batch_size,
        'max_workers': max_workers
    }
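
A usage sketch, with illustrative prompts and batch settings:

prompts = [f"Summarize topic {i} in two sentences." for i in range(32)]
stats = profile_batch_inference(prompts, batch_size=8, max_workers=4)
print(f"Batch throughput: {stats['requests_per_second']:.2f} req/s "
      f"({stats['total_requests']} requests in {stats['total_time']:.1f}s)")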

Streaming Performance Profiling

Profile streaming response performance:

import sglang as sgl
import time

def profile_streaming(prompt, max_tokens=200):
    backend = sgl.Runtime(model_path="meta-llama/Meta-Llama-3.1-8B")
    sgl.set_default_backend(backend)

    @sgl.function
    def streaming_inference(s, prompt):
        s += sgl.user(prompt)
        s += sgl.assistant(sgl.gen("response", max_tokens=max_tokens, stream=True))

    start_time = time.time()
    first_token_time = None
    token_times = []

    result = streaming_inference.run(prompt=prompt)

    # Process the stream, recording the arrival time of every token
    for token in result.stream():
        current_time = time.time()
        if first_token_time is None:
            first_token_time = current_time - start_time
        token_times.append(current_time)

    total_time = time.time() - start_time

    # Calculate inter-token latencies
    inter_token_latencies = []
    for i in range(1, len(token_times)):
        inter_token_latencies.append(token_times[i] - token_times[i-1])

    return {
        'time_to_first_token': first_token_time,
        'total_generation_time': total_time,
        'total_tokens': len(token_times),
        'avg_inter_token_latency': sum(inter_token_latencies) / len(inter_token_latencies) if inter_token_latencies else 0,
        'tokens_per_second': len(token_times) / total_time if total_time > 0 else 0
    }
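
A usage sketch that prints the headline streaming metrics:

stats = profile_streaming("Explain the difference between prefill and decode.")
print(f"TTFT: {stats['time_to_first_token']:.3f}s | "
      f"{stats['tokens_per_second']:.1f} tokens/s | "
      f"avg inter-token latency: {stats['avg_inter_token_latency'] * 1000:.1f}ms")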

Troubleshooting Performance Issues

Common Performance Problems

Low Throughput

  1. Check resource utilization:

    hl-smi  # Check HPU utilization
    htop    # Check CPU usage
    
  2. Profile memory usage:

    curl http://localhost:30000/metrics | jq '.memory'
    
  3. Analyze request queue:

    curl http://localhost:30000/debug/queue_status
    

High Latency

  1. Enable detailed tracing:

    --enable-request-tracing --trace-output-dir ./traces
    
  2. Check for memory pressure:

    # Look for memory allocation delays in logs
    grep "memory" sglang_server.log
    
  3. Profile warmup completeness:

    curl http://localhost:30000/warmup/status
    

Memory Issues

  1. Monitor memory patterns:

    # Use memory profiling script
    python memory_profiler.py
    
  2. Check for memory leaks:

    # Monitor memory over time
    watch -n 5 'hl-smi | grep Memory'
    

Best Practices

  1. Profile Regularly: Set up automated profiling for production systems

  2. Use Representative Workloads: Profile with realistic request patterns

  3. Monitor Key Metrics: Focus on throughput, latency, and memory usage

  4. Establish Baselines: Record performance before making changes

  5. Profile Different Scenarios: Test various batch sizes and sequence lengths

  6. Archive Results: Keep historical profiling data for comparison (see the sketch after this list)
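
To support the baseline and archiving practices above, benchmark results (for example from the SGLangProfiler class defined earlier) can be written to timestamped files for later comparison; a minimal sketch:

import json
import time

# Store the benchmark results from the Custom Profiling section as a timestamped baseline
results = profiler.benchmark_throughput(test_prompts)
baseline = {"timestamp": time.time(), "results": results}
with open(f"baseline_{int(baseline['timestamp'])}.json", "w") as f:
    json.dump(baseline, f, indent=2)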

For additional profiling techniques and troubleshooting, see SGLang with Gaudi FAQs and Inference Using SGLang.