Profiling with SGLang¶
This document describes how to profile SGLang performance on the Intel® Gaudi® AI accelerator to identify bottlenecks and optimize inference workloads.
Overview¶
SGLang profiling helps you:
Identify Performance Bottlenecks: Locate slow operations in your inference pipeline
Optimize Memory Usage: Understand memory allocation patterns and optimize utilization
Tune Parameters: Make data-driven decisions about configuration parameters
Monitor Production: Track performance metrics in production environments
Debug Issues: Diagnose performance problems and unexpected behavior
SGLang provides multiple profiling tools and integration with Gaudi’s native profiling capabilities.
Built-in Profiling¶
Basic Performance Metrics¶
Enable basic profiling to track key metrics:
python -m sglang.launch_server \
--model-path meta-llama/Meta-Llama-3.1-8B \
--enable-profiling \
--profiling-interval 10 \
--host 0.0.0.0 \
--port 30000
Access metrics via the metrics endpoint:
curl http://localhost:30000/metrics
Example metrics output:
{
"throughput": {
"requests_per_second": 12.5,
"tokens_per_second": 1250.0,
"prefill_tokens_per_second": 2500.0,
"decode_tokens_per_second": 1150.0
},
"latency": {
"time_to_first_token_ms": 45.2,
"inter_token_latency_ms": 8.1,
"end_to_end_latency_ms": 892.3
},
"memory": {
"hpu_memory_used_gb": 24.5,
"hpu_memory_total_gb": 94.6,
"kv_cache_usage_gb": 18.2,
"model_memory_gb": 6.3
},
"requests": {
"active_requests": 3,
"queued_requests": 1,
"completed_requests": 1247
}
}
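If the server reports metrics in the JSON layout shown above, a short script can poll the endpoint and print the fields you care about. This is a minimal sketch that assumes the /metrics endpoint and the JSON structure from the example output; adjust the keys to match what your server actually returns.
import requests

METRICS_URL = "http://localhost:30000/metrics"  # matches the launch example above

def print_key_metrics():
    # Fetch the metrics JSON (structure assumed to match the example output above)
    metrics = requests.get(METRICS_URL, timeout=5).json()
    throughput = metrics["throughput"]["tokens_per_second"]
    ttft = metrics["latency"]["time_to_first_token_ms"]
    kv_cache = metrics["memory"]["kv_cache_usage_gb"]
    print(f"Throughput: {throughput:.1f} tok/s | TTFT: {ttft:.1f} ms | KV cache: {kv_cache:.1f} GB")

print_key_metrics()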
Gaudi Profiler Integration¶
Native Gaudi Profiling¶
Use Habana’s profiler for detailed device-level analysis:
# Set environment variables for profiling
export HABANA_PROFILE=profile_api_light
export SGLANG_TORCH_PROFILER_DIR=<path for profiling output>
python -m sglang.bench_offline_throughput \
--dataset-name random \
--dtype bfloat16 \
--model-path /mnt/weka/data/pytorch/llama3.1/Meta-Llama-3.1-8B-Instruct \
--num-prompts $NUM_PROMPTS \
--random-input-len $INPUT_LEN \
--random-output-len $OUTPUT_LEN \
--device hpu \
--page-size 128 \
--disable-radix-cache \
--max-prefill-tokens 2048 \
--random-range-ratio 1.0 \
--profile
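After the run completes, the profiler output is written under the directory given by SGLANG_TORCH_PROFILER_DIR. The exact file names and extensions depend on the profiler configuration, so the snippet below is only a sketch for locating the most recent trace file so it can be opened in your preferred trace viewer; the "*.json" pattern is an assumption.
import glob
import os

# Directory set via SGLANG_TORCH_PROFILER_DIR in the example above
trace_dir = os.environ.get("SGLANG_TORCH_PROFILER_DIR", "./traces")

# File pattern is an assumption; adjust to whatever your profiler emits
trace_files = sorted(glob.glob(os.path.join(trace_dir, "*.json")),
                     key=os.path.getmtime)
if trace_files:
    print(f"Most recent trace: {trace_files[-1]}")
else:
    print(f"No trace files found in {trace_dir}")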
HPU Profiler with SGLang¶
Use the HPU profiler for kernel-level analysis:
import habana_frameworks.torch.hpu.profiler as hpu_profiler
import sglang as sgl
# Enable profiling
with hpu_profiler.profile(
activities=[
hpu_profiler.ProfilerActivity.HPU,
hpu_profiler.ProfilerActivity.CPU
],
output_dir="./hpu_traces"
) as prof:
# Set up SGLang backend
backend = sgl.Runtime(model_path="meta-llama/Meta-Llama-3.1-8B")
sgl.set_default_backend(backend)
# Run inference with profiling
@sgl.function
def profiled_inference(s, prompt):
s += sgl.user(prompt)
s += sgl.assistant(sgl.gen("response", max_tokens=100))
# Execute multiple requests for comprehensive profiling
for i in range(10):
result = profiled_inference.run(prompt=f"Test prompt {i}")
Custom Profiling¶
Performance Benchmarking¶
Create custom benchmarks for your specific use case:
import time
import statistics
import sglang as sgl
from sglang import function, user, assistant, gen
class SGLangProfiler:
def __init__(self, model_path, num_warmup=5, num_iterations=20):
self.backend = sgl.Runtime(model_path=model_path)
sgl.set_default_backend(self.backend)
self.num_warmup = num_warmup
self.num_iterations = num_iterations
def benchmark_throughput(self, prompts, max_tokens=100):
@function
def inference_func(s, prompt):
s += user(prompt)
s += assistant(gen("response", max_tokens=max_tokens))
# Warmup
for _ in range(self.num_warmup):
inference_func.run(prompt=prompts[0])
# Actual benchmark
latencies = []
start_time = time.time()
for i in range(self.num_iterations):
prompt = prompts[i % len(prompts)]
request_start = time.time()
result = inference_func.run(prompt=prompt)
request_end = time.time()
latencies.append(request_end - request_start)
total_time = time.time() - start_time
return {
"total_time": total_time,
"requests_per_second": self.num_iterations / total_time,
"avg_latency": statistics.mean(latencies),
"p50_latency": statistics.median(latencies),
"p95_latency": sorted(latencies)[int(0.95 * len(latencies))],
"p99_latency": sorted(latencies)[int(0.99 * len(latencies))]
}
# Usage
profiler = SGLangProfiler("meta-llama/Meta-Llama-3.1-8B")
test_prompts = [
"Explain machine learning",
"What is quantum computing?",
"Describe neural networks",
# Add more test prompts...
]
results = profiler.benchmark_throughput(test_prompts)
print(f"Throughput: {results['requests_per_second']:.2f} req/s")
print(f"P50 Latency: {results['p50_latency']:.3f}s")
Memory Profiling¶
Monitor memory usage patterns:
import psutil
import time
import sglang as sgl
class MemoryProfiler:
def __init__(self):
self.memory_snapshots = []
def take_snapshot(self, label):
# Get HPU memory info (requires habana-torch-plugin)
try:
import habana_frameworks.torch.hpu as hpu
hpu_memory = hpu.memory_stats()
hpu_allocated = hpu_memory.get('allocated_bytes.all.current', 0) / 1024**3
hpu_reserved = hpu_memory.get('reserved_bytes.all.current', 0) / 1024**3
except Exception:  # HPU stats unavailable (e.g. plugin not installed)
hpu_allocated = hpu_reserved = 0
# Get system memory
process = psutil.Process()
memory_info = process.memory_info()
snapshot = {
'label': label,
'timestamp': time.time(),
'hpu_allocated_gb': hpu_allocated,
'hpu_reserved_gb': hpu_reserved,
'system_memory_gb': memory_info.rss / 1024**3,
'virtual_memory_gb': memory_info.vms / 1024**3
}
self.memory_snapshots.append(snapshot)
return snapshot
def print_summary(self):
print("Memory Usage Summary:")
print("-" * 80)
for snapshot in self.memory_snapshots:
print(f"{snapshot['label']:20} | "
f"HPU: {snapshot['hpu_allocated_gb']:.2f}GB | "
f"System: {snapshot['system_memory_gb']:.2f}GB")
# Usage
memory_profiler = MemoryProfiler()
# Profile different stages
memory_profiler.take_snapshot("Initial")
backend = sgl.Runtime(model_path="meta-llama/Meta-Llama-3.1-8B")
memory_profiler.take_snapshot("After model load")
sgl.set_default_backend(backend)
memory_profiler.take_snapshot("After backend setup")
# Run inference
@sgl.function
def test_inference(s, prompt):
s += sgl.user(prompt)
s += sgl.assistant(sgl.gen("response", max_tokens=100))
result = test_inference.run(prompt="Hello, how are you?")
memory_profiler.take_snapshot("After inference")
memory_profiler.print_summary()
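To compare memory behavior across runs or code changes, the collected snapshots can also be persisted to disk. A minimal sketch (the file name is arbitrary):
import json

# Persist the collected snapshots for later comparison (file name is arbitrary)
with open("memory_snapshots.json", "w") as f:
    json.dump(memory_profiler.memory_snapshots, f, indent=2)
print(f"Saved {len(memory_profiler.memory_snapshots)} snapshots to memory_snapshots.json")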
Production Monitoring¶
Continuous Monitoring Setup¶
Set up continuous monitoring for production deployments:
import time
import json
import requests
from datetime import datetime
class ProductionMonitor:
def __init__(self, sglang_url="http://localhost:30000"):
self.sglang_url = sglang_url
self.metrics_history = []
def collect_metrics(self):
try:
response = requests.get(f"{self.sglang_url}/metrics")
metrics = response.json()
# Add timestamp
metrics['timestamp'] = datetime.now().isoformat()
self.metrics_history.append(metrics)
return metrics
except Exception as e:
print(f"Failed to collect metrics: {e}")
return None
def check_health(self):
try:
response = requests.get(f"{self.sglang_url}/health")
return response.status_code == 200
except requests.RequestException:
return False
def analyze_performance(self, window_minutes=10):
cutoff_time = datetime.now().timestamp() - (window_minutes * 60)
recent_metrics = [
m for m in self.metrics_history
if datetime.fromisoformat(m['timestamp']).timestamp() > cutoff_time
]
if not recent_metrics:
return None
# Calculate averages
avg_throughput = sum(m['throughput']['tokens_per_second']
for m in recent_metrics) / len(recent_metrics)
avg_latency = sum(m['latency']['time_to_first_token_ms']
for m in recent_metrics) / len(recent_metrics)
return {
'avg_throughput_tps': avg_throughput,
'avg_time_to_first_token_ms': avg_latency,
'sample_count': len(recent_metrics)
}
def save_metrics(self, filename):
with open(filename, 'w') as f:
json.dump(self.metrics_history, f, indent=2)
# Usage for continuous monitoring
monitor = ProductionMonitor()
while True:
if monitor.check_health():
metrics = monitor.collect_metrics()
if metrics:
# Check for performance issues
throughput = metrics['throughput']['tokens_per_second']
if throughput < 100: # Threshold
print(f"WARNING: Low throughput detected: {throughput} TPS")
memory_usage = metrics['memory']['hpu_memory_used_gb']
total_memory = metrics['memory']['hpu_memory_total_gb']
if memory_usage / total_memory > 0.95: # 95% threshold
print(f"WARNING: High memory usage: {memory_usage:.1f}GB / {total_memory:.1f}GB")
else:
print("ERROR: SGLang server health check failed")
time.sleep(30) # Check every 30 seconds
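The save_metrics helper defined above is not used in the loop; to keep a history on disk, it can be called periodically, for example from inside the monitoring loop. A small sketch (the timestamped naming scheme is arbitrary):
from datetime import datetime

# Persist whatever history has been collected so far; call this periodically,
# e.g. from inside the monitoring loop above (file naming is arbitrary)
filename = f"sglang_metrics_{datetime.now():%Y%m%d_%H%M%S}.json"
monitor.save_metrics(filename)
print(f"Saved {len(monitor.metrics_history)} samples to {filename}")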
Profiling Different Workloads¶
Batch Processing Profiling¶
Profile batch inference performance:
import concurrent.futures
import time
import sglang as sgl
def profile_batch_inference(prompts, batch_size=4, max_workers=2):
backend = sgl.Runtime(model_path="meta-llama/Meta-Llama-3.1-8B")
sgl.set_default_backend(backend)
@sgl.function
def batch_inference(s, prompt):
s += sgl.user(prompt)
s += sgl.assistant(sgl.gen("response", max_tokens=100))
start_time = time.time()
# Process in batches
results = []
with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
for i in range(0, len(prompts), batch_size):
batch = prompts[i:i+batch_size]
futures = [executor.submit(batch_inference.run, prompt=p) for p in batch]
batch_results = [f.result() for f in futures]
results.extend(batch_results)
total_time = time.time() - start_time
return {
'total_requests': len(prompts),
'total_time': total_time,
'requests_per_second': len(prompts) / total_time,
'batch_size': batch_size,
'max_workers': max_workers
}
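A usage sketch that compares a few batch sizes with the function above (the prompts are placeholders):
# Compare a few batch sizes using the helper above (prompts are placeholders)
test_prompts = [f"Summarize topic number {i}" for i in range(16)]

for batch_size in (2, 4, 8):
    stats = profile_batch_inference(test_prompts, batch_size=batch_size, max_workers=2)
    print(f"batch_size={stats['batch_size']} | "
          f"{stats['requests_per_second']:.2f} req/s | "
          f"total {stats['total_time']:.1f}s")
Note that profile_batch_inference creates a new Runtime on every call, so when sweeping several batch sizes you may want to hoist the backend setup out of the function.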
Streaming Performance Profiling¶
Profile streaming response performance:
import sglang as sgl
import time
def profile_streaming(prompt, max_tokens=200):
backend = sgl.Runtime(model_path="meta-llama/Meta-Llama-3.1-8B")
sgl.set_default_backend(backend)
@sgl.function
def streaming_inference(s, prompt):
s += sgl.user(prompt)
s += sgl.assistant(sgl.gen("response", max_tokens=max_tokens, stream=True))
start_time = time.time()
first_token_time = None
token_times = []
result = streaming_inference.run(prompt=prompt)
# Consume the stream and record a timestamp for every generated token
for token in result.stream():
    current_time = time.time()
    if first_token_time is None:
        first_token_time = current_time - start_time
    token_times.append(current_time)
total_time = time.time() - start_time
# Calculate inter-token latencies
inter_token_latencies = []
for i in range(1, len(token_times)):
inter_token_latencies.append(token_times[i] - token_times[i-1])
return {
'time_to_first_token': first_token_time,
'total_generation_time': total_time,
'total_tokens': len(token_times),
'avg_inter_token_latency': sum(inter_token_latencies) / len(inter_token_latencies) if inter_token_latencies else 0,
'tokens_per_second': len(token_times) / total_time if total_time > 0 else 0
}
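A usage sketch for the streaming profiler above:
stats = profile_streaming("Explain the attention mechanism in transformers.", max_tokens=200)
print(f"TTFT: {stats['time_to_first_token']:.3f}s | "
      f"{stats['tokens_per_second']:.1f} tok/s | "
      f"avg inter-token latency: {stats['avg_inter_token_latency'] * 1000:.1f} ms")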
Troubleshooting Performance Issues¶
Common Performance Problems¶
Low Throughput
Check resource utilization:
hl-smi  # Check HPU utilization
htop    # Check CPU usage
Profile memory usage:
curl http://localhost:30000/metrics | jq '.memory'
Analyze request queue:
curl http://localhost:30000/debug/queue_status
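The queue depth can also be read programmatically; this sketch assumes the JSON metrics layout from the example earlier in this document:
import requests

# Assumes the JSON metrics layout shown earlier in this document
metrics = requests.get("http://localhost:30000/metrics", timeout=5).json()
print(f"active={metrics['requests']['active_requests']}, "
      f"queued={metrics['requests']['queued_requests']}")
A persistently long queue while hl-smi shows low HPU utilization usually points to a batching or scheduling bottleneck rather than raw compute.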
High Latency
Enable detailed tracing:
--enable-request-tracing --trace-output-dir ./traces
Check for memory pressure:
# Look for memory allocation delays in logs
grep "memory" sglang_server.log
Profile warmup completeness:
curl http://localhost:30000/warmup/status
Memory Issues
Monitor memory patterns:
# Use memory profiling script
python memory_profiler.py
Check for memory leaks:
# Monitor memory over time
watch -n 5 'hl-smi | grep Memory'
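For longer-running leak checks, the same memory_stats() call used by the MemoryProfiler above can be sampled in a loop and logged. This is a sketch, and the stats keys are the ones assumed earlier in this document:
import time
import habana_frameworks.torch.hpu as hpu

# Sample HPU allocation once a minute; a steadily growing value under a constant
# workload is a hint of a leak (keys follow the earlier MemoryProfiler example)
for _ in range(60):
    stats = hpu.memory_stats()
    allocated_gb = stats.get('allocated_bytes.all.current', 0) / 1024**3
    print(f"{time.strftime('%H:%M:%S')}  allocated: {allocated_gb:.2f} GB")
    time.sleep(60)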
Best Practices¶
Profile Regularly: Set up automated profiling for production systems
Use Representative Workloads: Profile with realistic request patterns
Monitor Key Metrics: Focus on throughput, latency, and memory usage
Establish Baselines: Record performance before making changes (see the sketch after this list)
Profile Different Scenarios: Test various batch sizes and sequence lengths
Archive Results: Keep historical profiling data for comparison
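For the baseline recommendation above, a simple way to make regressions visible is to store one benchmark result as JSON and compare new runs against it. A sketch that reuses the result dictionary produced by SGLangProfiler.benchmark_throughput; the file name and threshold are arbitrary:
import json

BASELINE_FILE = "baseline_results.json"   # arbitrary file name
REGRESSION_TOLERANCE = 0.10               # flag >10% throughput drops; tune to taste

def compare_to_baseline(results):
    # 'results' is the dict returned by SGLangProfiler.benchmark_throughput above
    try:
        with open(BASELINE_FILE) as f:
            baseline = json.load(f)
    except FileNotFoundError:
        with open(BASELINE_FILE, "w") as f:
            json.dump(results, f, indent=2)
        print("No baseline found; current results saved as the new baseline.")
        return
    drop = 1 - results["requests_per_second"] / baseline["requests_per_second"]
    if drop > REGRESSION_TOLERANCE:
        print(f"WARNING: throughput dropped {drop:.0%} vs. baseline")
    else:
        print(f"Throughput within {REGRESSION_TOLERANCE:.0%} of baseline")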
For additional profiling techniques and troubleshooting, see SGLang with Gaudi FAQs and Inference Using SGLang.