vLLM Quick Start Guide
This guide shows how to quickly launch vLLM on Gaudi using a prebuilt Docker image with Docker Compose (supported on Ubuntu only). It supports model benchmarking, custom runtime parameters, and a selection of validated models, including the Llama, Mistral, and Qwen families. Advanced configuration is available via environment variables or YAML files.
Running vLLM on Gaudi with Docker Compose
Follow the steps below to run the vLLM server or launch benchmarks on Gaudi using Docker Compose.
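Before you begin, you can optionally confirm that the host sees the Gaudi devices and that the Habana container runtime is registered with Docker. A minimal sketch, assuming the Habana driver stack and container runtime are already installed on the host:
# List the Gaudi accelerators visible to the host
hl-smi
# Check that the habana runtime is registered with Docker
docker info | grep -i habana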
Clone the vLLM fork repository and navigate to the appropriate directory:
git clone https://github.com/HabanaAI/vllm-fork.git
cd vllm-fork/.cd/
This ensures you have the required files and Docker Compose configurations.
Set the following environment variables:
| Variable | Description |
|---|---|
| MODEL | Choose a model name from the Supported Models list. |
| HF_TOKEN | Your Hugging Face token (generate one at https://huggingface.co). |
| DOCKER_IMAGE | The Docker image name or URL for the vLLM Gaudi container. |
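If you prefer not to prefix every command, you can export these variables once per shell session. The values below are placeholders; substitute your own model, token, and image reference:
# Export once, then run the docker compose commands below without inline prefixes
export MODEL="Qwen/Qwen2.5-14B-Instruct"
export HF_TOKEN="<your huggingface token>"
export DOCKER_IMAGE="<docker image url>"
Docker Compose substitutes variables from the shell environment, so exported values behave the same as the inline assignments shown in the examples below.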
Run the vLLM server using Docker Compose:
MODEL="Qwen/Qwen2.5-14B-Instruct" \ HF_TOKEN="<your huggingface token>" \ DOCKER_IMAGE="<docker image url>" \ docker compose up
To automatically run benchmarking for a selected model using default settings, add the --profile benchmark option:
MODEL="Qwen/Qwen2.5-14B-Instruct" \
HF_TOKEN="<your huggingface token>" \
DOCKER_IMAGE="<docker image url>" \
docker compose --profile benchmark up
This command launches the vLLM server and runs the associated benchmark suite.
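Once the server is up, you can verify it with a quick request. The sketch below assumes the container exposes vLLM's OpenAI-compatible API on the default port 8000 (as mapped in the docker run example later in this guide) and that the model name matches the MODEL value used at startup:
# Send a small chat completion request to the running server
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "Qwen/Qwen2.5-14B-Instruct",
        "messages": [{"role": "user", "content": "Hello!"}],
        "max_tokens": 32
      }'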
Advanced Options
The following steps cover optional advanced configurations for running the vLLM server and benchmark. These allow you to fine-tune performance, memory usage, and request handling using additional environment variables or configuration files. For most users, the basic setup is sufficient, but advanced users may benefit from these customizations.
Run vLLM Using Docker Compose with Custom Parameters
To override default settings, you can provide additional environment variables when starting the server. This advanced method allows fine-tuning for performance and memory usage.
Environment variables:
| Variable | Description |
|---|---|
| PT_HPU_LAZY_MODE | Enables Lazy execution mode, potentially improving performance by batching operations. |
| VLLM_SKIP_WARMUP | Skips the model warmup phase to reduce startup time (may affect initial latency). |
| MAX_MODEL_LEN | Sets the maximum supported sequence length for the model. |
| MAX_NUM_SEQS | Specifies the maximum number of sequences processed concurrently. |
| TENSOR_PARALLEL_SIZE | Defines the degree of tensor parallelism. |
| VLLM_EXPONENTIAL_BUCKETING | Enables or disables exponential bucketing for the warmup strategy. |
| VLLM_DECODE_BLOCK_BUCKET_STEP | Configures the step size for decode block allocation, affecting memory granularity. |
| VLLM_DECODE_BS_BUCKET_STEP | Sets the batch size step for decode operations, impacting how decode batches are grouped. |
| VLLM_PROMPT_BS_BUCKET_STEP | Adjusts the batch size step for prompt processing. |
| VLLM_PROMPT_SEQ_BUCKET_STEP | Controls the step size for prompt sequence allocation. |
Example:
MODEL="Qwen/Qwen2.5-14B-Instruct" \ HF_TOKEN="<your huggingface token>" \ DOCKER_IMAGE="<docker image url>" \ TENSOR_PARALLEL_SIZE=1 \ MAX_MODEL_LEN=2048 \ docker compose up
Run vLLM and Benchmark with Custom Parameters
You can customize benchmark behavior by setting additional environment variables before running Docker Compose.
Benchmark parameters:
| Variable | Description |
|---|---|
| INPUT_TOK | Number of input tokens per prompt. |
| OUTPUT_TOK | Number of output tokens to generate per prompt. |
| CON_REQ | Number of concurrent requests during benchmarking. |
| NUM_PROMPTS | Total number of prompts to use in the benchmark. |
Example:
MODEL="Qwen/Qwen2.5-14B-Instruct" \ HF_TOKEN="<your huggingface token>" \ DOCKER_IMAGE="<docker image url>" \ INPUT_TOK=128 \ OUTPUT_TOK=128 \ CON_REQ=16 \ NUM_PROMPTS=64 \ docker compose --profile benchmark upThis launches the vLLM server and runs the benchmark using your specified parameters.
Run vLLM and Benchmark with Combined Custom Parameters
You can launch the vLLM server and benchmark together, providing any combination of server and benchmark-specific parameters.
Example:
MODEL="Qwen/Qwen2.5-14B-Instruct" \ HF_TOKEN="<your huggingface token>" \ DOCKER_IMAGE="<docker image url>" \ TENSOR_PARALLEL_SIZE=1 \ MAX_MODEL_LEN=2048 \ INPUT_TOK=128 \ OUTPUT_TOK=128 \ CON_REQ=16 \ NUM_PROMPTS=64 \ docker compose --profile benchmark upThis command starts the server and executes benchmarking with the provided configuration.
Run vLLM and Benchmark Using Configuration Files
You can also configure the server and benchmark via YAML configuration files. Set the following environment variables:
| Variable | Description |
|---|---|
| VLLM_SERVER_CONFIG_FILE | Path to the server config file inside the Docker container. |
| VLLM_SERVER_CONFIG_NAME | Name of the server config section. |
| VLLM_BENCHMARK_CONFIG_FILE | Path to the benchmark config file inside the container. |
| VLLM_BENCHMARK_CONFIG_NAME | Name of the benchmark config section. |
Example:
HF_TOKEN=<your huggingface token> \
VLLM_SERVER_CONFIG_FILE=server_configurations/server_text.yaml \
VLLM_SERVER_CONFIG_NAME=llama31_8b_instruct \
VLLM_BENCHMARK_CONFIG_FILE=benchmark_configurations/benchmark_text.yaml \
VLLM_BENCHMARK_CONFIG_NAME=llama31_8b_instruct \
docker compose --profile benchmark up
Note: When using configuration files, you do not need to set the MODEL variable, as the model details are included in the config files. However, the HF_TOKEN variable is still required.
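The exact file and section names depend on the configuration files shipped with your image or repository checkout. If the same YAML files are present in the cloned .cd directory (an assumption; the table above describes the paths as container-internal), you can inspect them to discover valid section names. A rough sketch, assuming each named configuration is a top-level YAML key:
# List available configuration files (paths are assumptions based on the example above)
ls server_configurations/ benchmark_configurations/
# Show top-level section names in a server config file
grep -E '^[A-Za-z0-9_-]+:' server_configurations/server_text.yaml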
Run vLLM Directly Using Docker
For maximum control, you can run the server directly using the docker run command, allowing full customization of Docker runtime settings.
Example:
docker run -it --rm \
  -e MODEL=$MODEL \
  -e HF_TOKEN=$HF_TOKEN \
  -e http_proxy=$http_proxy \
  -e https_proxy=$https_proxy \
  -e no_proxy=$no_proxy \
  --cap-add=sys_nice \
  --ipc=host \
  --runtime=habana \
  -e HABANA_VISIBLE_DEVICES=all \
  -p 8000:8000 \
  --name vllm-server \
  <docker image name>
This method provides full flexibility over how the vLLM server is executed within the container.
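Because the container is started with --name vllm-server, standard Docker commands can be used from another terminal to follow its logs or stop it; a minimal sketch:
# Stream the server logs
docker logs -f vllm-server
# Stop the server (the --rm flag removes the container once it stops)
docker stop vllm-server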
Supported Models
| Model Name | Validated TP Size |
|---|---|
| deepseek-ai/DeepSeek-R1-Distill-Llama-70B | 8 |
| meta-llama/Llama-3.1-70B-Instruct | 4 |
| meta-llama/Llama-3.1-405B-Instruct | 8 |
| meta-llama/Llama-3.1-8B-Instruct | 1 |
| meta-llama/Llama-3.2-1B-Instruct | 1 |
| meta-llama/Llama-3.2-3B-Instruct | 1 |
| meta-llama/Llama-3.3-70B-Instruct | 4 |
| mistralai/Mistral-7B-Instruct-v0.2 | 1 |
| mistralai/Mixtral-8x7B-Instruct-v0.1 | 2 |
| mistralai/Mixtral-8x22B-Instruct-v0.1 | 4 |
| Qwen/Qwen2.5-7B-Instruct | 1 |
| Qwen/Qwen2.5-VL-7B-Instruct | 1 |
| Qwen/Qwen2.5-14B-Instruct | 1 |
| Qwen/Qwen2.5-32B-Instruct | 1 |
| Qwen/Qwen2.5-72B-Instruct | 4 |
| meta-llama/Llama-3.2-11B-Vision-Instruct | 1 |
| meta-llama/Llama-3.2-90B-Vision-Instruct | 4 |
| ibm-granite/granite-8b-code-instruct-4k | 1 |
| ibm-granite/granite-20b-code-instruct-8k | 1 |