vLLM Quick Start Guide
This guide shows how to quickly launch vLLM on Gaudi using a prebuilt Docker image with Docker Compose (supported on Ubuntu only). It supports model benchmarking, custom runtime parameters, and a selection of validated models, including the Llama, Mistral, and Qwen families. Advanced configuration is available via environment variables or YAML files.
Running vLLM on Gaudi with Docker Compose
Follow the steps below to run the vLLM server or launch benchmarks on Gaudi using Docker Compose.
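Before you begin, you can optionally confirm that the host sees the Gaudi devices and that the Habana container runtime is registered with Docker. A minimal sketch, assuming the Habana driver stack and container runtime are already installed on the host:
# List the Gaudi accelerators visible to the host
hl-smi
# Check that the habana runtime is registered with Docker
docker info | grep -i habana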
Clone the vLLM fork repository and navigate to the appropriate directory:
git clone https://github.com/HabanaAI/vllm-fork.git
cd vllm-fork/.cd/
This ensures you have the required files and Docker Compose configurations.
Set the following environment variables:
| Variable | Description |
|---|---|
| MODEL | Choose a model name from the Supported Models list. |
| HF_TOKEN | Your Hugging Face token (generate one at https://huggingface.co). |
| DOCKER_IMAGE | The Docker image name or URL for the vLLM Gaudi container. |
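If you prefer not to prefix every command, you can export these variables once per shell session. The values below are placeholders; substitute your own model, token, and image reference:
# Export once, then run the docker compose commands below without inline prefixes
export MODEL="Qwen/Qwen2.5-14B-Instruct"
export HF_TOKEN="<your huggingface token>"
export DOCKER_IMAGE="<docker image url>"
Docker Compose substitutes variables from the shell environment, so exported values behave the same as the inline assignments shown in the examples below.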
Run the vLLM server using Docker Compose:
MODEL="Qwen/Qwen2.5-14B-Instruct" \ HF_TOKEN="<your huggingface token>" \ DOCKER_IMAGE="<docker image url>" \ docker compose up
To automatically run benchmarking for a selected model using default settings, add the --profile benchmark option:
MODEL="Qwen/Qwen2.5-14B-Instruct" \
HF_TOKEN="<your huggingface token>" \
DOCKER_IMAGE="<docker image url>" \
docker compose --profile benchmark up
This command launches the vLLM server and runs the associated benchmark suite.
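Once the server is up, you can verify it with a quick request. The sketch below assumes the container exposes vLLM's OpenAI-compatible API on the default port 8000 (as mapped in the docker run example later in this guide) and that the model name matches the MODEL value used at startup:
# Send a small chat completion request to the running server
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "Qwen/Qwen2.5-14B-Instruct",
        "messages": [{"role": "user", "content": "Hello!"}],
        "max_tokens": 32
      }'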
Advanced Options
The following steps cover optional advanced configurations for running the vLLM server and benchmark. These allow you to fine-tune performance, memory usage, and request handling using additional environment variables or configuration files. For most users, the basic setup is sufficient, but advanced users may benefit from these customizations.
Run vLLM Using Docker Compose with Custom Parameters
To override default settings, you can provide additional environment variables when starting the server. This advanced method allows fine-tuning for performance and memory usage.
Environment variables:
| Variable | Description |
|---|---|
| PT_HPU_LAZY_MODE | Enables Lazy execution mode, potentially improving performance by batching operations. |
| VLLM_SKIP_WARMUP | Skips the model warmup phase to reduce startup time (may affect initial latency). |
| MAX_MODEL_LEN | Sets the maximum supported sequence length for the model. |
| MAX_NUM_SEQS | Specifies the maximum number of sequences processed concurrently. |
| TENSOR_PARALLEL_SIZE | Defines the degree of tensor parallelism. |
| VLLM_EXPONENTIAL_BUCKETING | Enables or disables exponential bucketing for the warmup strategy. |
| VLLM_DECODE_BLOCK_BUCKET_STEP | Configures the step size for decode block allocation, affecting memory granularity. |
| VLLM_DECODE_BS_BUCKET_STEP | Sets the batch size step for decode operations, impacting how decode batches are grouped. |
| VLLM_PROMPT_BS_BUCKET_STEP | Adjusts the batch size step for prompt processing. |
| VLLM_PROMPT_SEQ_BUCKET_STEP | Controls the step size for prompt sequence allocation. |
Example:
MODEL="Qwen/Qwen2.5-14B-Instruct" \ HF_TOKEN="<your huggingface token>" \ DOCKER_IMAGE="<docker image url>" \ TENSOR_PARALLEL_SIZE=1 \ MAX_MODEL_LEN=2048 \ docker compose up
Run vLLM and Benchmark with Custom Parameters
You can customize benchmark behavior by setting additional environment variables before running Docker Compose.
Benchmark parameters:
| Variable | Description |
|---|---|
| INPUT_TOK | Number of input tokens per prompt. |
| OUTPUT_TOK | Number of output tokens to generate per prompt. |
| CON_REQ | Number of concurrent requests during benchmarking. |
| NUM_PROMPTS | Total number of prompts to use in the benchmark. |
Example:
MODEL="Qwen/Qwen2.5-14B-Instruct" \ HF_TOKEN="<your huggingface token>" \ DOCKER_IMAGE="<docker image url>" \ INPUT_TOK=128 \ OUTPUT_TOK=128 \ CON_REQ=16 \ NUM_PROMPTS=64 \ docker compose --profile benchmark upThis launches the vLLM server and runs the benchmark using your specified parameters.
Run vLLM and Benchmark with Combined Custom Parameters
You can launch the vLLM server and benchmark together, providing any combination of server and benchmark-specific parameters.
Example:
MODEL="Qwen/Qwen2.5-14B-Instruct" \ HF_TOKEN="<your huggingface token>" \ DOCKER_IMAGE="<docker image url>" \ TENSOR_PARALLEL_SIZE=1 \ MAX_MODEL_LEN=2048 \ INPUT_TOK=128 \ OUTPUT_TOK=128 \ CON_REQ=16 \ NUM_PROMPTS=64 \ docker compose --profile benchmark upThis command starts the server and executes benchmarking with the provided configuration.
Run vLLM and Benchmark Using Configuration Files
You can also configure the server and benchmark via YAML configuration files. Set the following environment variables:
| Variable | Description |
|---|---|
| VLLM_SERVER_CONFIG_FILE | Path to the server config file inside the Docker container. |
| VLLM_SERVER_CONFIG_NAME | Name of the server config section. |
| VLLM_BENCHMARK_CONFIG_FILE | Path to the benchmark config file inside the container. |
| VLLM_BENCHMARK_CONFIG_NAME | Name of the benchmark config section. |
Example:
HF_TOKEN=<your huggingface token> \
VLLM_SERVER_CONFIG_FILE=server_configurations/server_text.yaml \
VLLM_SERVER_CONFIG_NAME=llama31_8b_instruct \
VLLM_BENCHMARK_CONFIG_FILE=benchmark_configurations/benchmark_text.yaml \
VLLM_BENCHMARK_CONFIG_NAME=llama31_8b_instruct \
docker compose --profile benchmark up
Note: When using configuration files, you do not need to set the MODEL variable, as the model details are included in the config files. However, the HF_TOKEN variable is still required.
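The exact file and section names depend on the configuration files shipped with your image or repository checkout. If the same YAML files are present in the cloned .cd directory (an assumption; the table above describes the paths as container-internal), you can inspect them to discover valid section names. A rough sketch, assuming each named configuration is a top-level YAML key:
# List available configuration files (paths are assumptions based on the example above)
ls server_configurations/ benchmark_configurations/
# Show top-level section names in a server config file
grep -E '^[A-Za-z0-9_-]+:' server_configurations/server_text.yaml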
Run vLLM Directly Using Docker
For maximum control, you can run the server directly using the docker run command, allowing full customization of Docker runtime settings.
Example:
docker run -it --rm \
  -e MODEL=$MODEL \
  -e HF_TOKEN=$HF_TOKEN \
  -e http_proxy=$http_proxy \
  -e https_proxy=$https_proxy \
  -e no_proxy=$no_proxy \
  --cap-add=sys_nice \
  --ipc=host \
  --runtime=habana \
  -e HABANA_VISIBLE_DEVICES=all \
  -p 8000:8000 \
  --name vllm-server \
  <docker image name>
This method provides full flexibility over how the vLLM server is executed within the container.
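Because the container is started with --name vllm-server, standard Docker commands can be used from another terminal to follow its logs or stop it; a minimal sketch:
# Stream the server logs
docker logs -f vllm-server
# Stop the server (the --rm flag removes the container once it stops)
docker stop vllm-server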
Supported Models
| Model Name | Validated TP Size |
|---|---|
| deepseek-ai/DeepSeek-R1-Distill-Llama-70B | 8 |
| meta-llama/Llama-3.1-70B-Instruct | 4 |
| meta-llama/Llama-3.1-405B-Instruct | 8 |
| meta-llama/Llama-3.1-8B-Instruct | 1 |
| meta-llama/Llama-3.2-1B-Instruct | 1 |
| meta-llama/Llama-3.2-3B-Instruct | 1 |
| meta-llama/Llama-3.3-70B-Instruct | 4 |
| mistralai/Mistral-7B-Instruct-v0.2 | 1 |
| mistralai/Mixtral-8x7B-Instruct-v0.1 | 2 |
| mistralai/Mixtral-8x22B-Instruct-v0.1 | 4 |
| Qwen/Qwen2.5-7B-Instruct | 1 |
| Qwen/Qwen2.5-VL-7B-Instruct | 1 |
| Qwen/Qwen2.5-14B-Instruct | 1 |
| Qwen/Qwen2.5-32B-Instruct | 1 |
| Qwen/Qwen2.5-72B-Instruct | 4 |
| meta-llama/Llama-3.2-11B-Vision-Instruct | 1 |
| meta-llama/Llama-3.2-90B-Vision-Instruct | 4 |
| ibm-granite/granite-8b-code-instruct-4k | 1 |
| ibm-granite/granite-20b-code-instruct-8k | 1 |