vLLM Quick Start Guide

This guide shows how to quickly launch vLLM on Gaudi using a prebuilt Docker image with Docker Compose, which is supported on Ubuntu only. It supports model benchmarking, custom runtime parameters, and a selection of validated models, including the Llama, Mistral, and Qwen families. Advanced configuration is available via environment variables or YAML files.

Running vLLM on Gaudi with Docker Compose

Follow the steps below to run the vLLM server or launch benchmarks on Gaudi using Docker Compose.

  1. Clone the HabanaAI vLLM fork and navigate to its .cd directory:

    git clone https://github.com/HabanaAI/vllm-fork.git
    cd vllm-fork/.cd/
    

    This ensures you have the required files and Docker Compose configurations.

  2. Set the following environment variables:

    Variable        Description
    MODEL           Choose a model name from the Supported Models list below.
    HF_TOKEN        Your Hugging Face token (generate one at https://huggingface.co).
    DOCKER_IMAGE    The Docker image name or URL for the vLLM Gaudi container.
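
    For convenience, you can export these once per shell session instead of prefixing every command. The values below are placeholders:

    export MODEL="Qwen/Qwen2.5-14B-Instruct"
    export HF_TOKEN="<your huggingface token>"
    export DOCKER_IMAGE="<docker image url>"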

  3. Run the vLLM server using Docker Compose:

    MODEL="Qwen/Qwen2.5-14B-Instruct" \
    HF_TOKEN="<your huggingface token>" \
    DOCKER_IMAGE="<docker image url>" \
    docker compose up
    

    To automatically run benchmarking for a selected model using default settings, add the --profile benchmark option to the docker compose up command:

    MODEL="Qwen/Qwen2.5-14B-Instruct" \
    HF_TOKEN="<your huggingface token>" \
    DOCKER_IMAGE="<docker image url>" \
    docker compose --profile benchmark up
    

    This command launches the vLLM server and runs the associated benchmark suite.
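
    Once the server reports it is ready, you can send a test request to confirm it is serving. This sketch assumes vLLM's OpenAI-compatible API on its default port 8000; the Compose file may map a different host port, so adjust the URL if needed:

    curl http://localhost:8000/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{"model": "Qwen/Qwen2.5-14B-Instruct", "messages": [{"role": "user", "content": "Hello"}], "max_tokens": 32}'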

Advanced Options

This section covers optional advanced configuration for running the vLLM server and benchmark. It lets you fine-tune performance, memory usage, and request handling through additional environment variables or configuration files. For most users, the basic setup is sufficient, but advanced users may benefit from these customizations.
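
As a sketch, additional settings can be supplied the same way as the required variables. The exact variable names depend on the Compose files in .cd/, so TENSOR_PARALLEL_SIZE and MAX_MODEL_LEN below are illustrative assumptions, not confirmed settings; check the Compose files for the variables they actually read:

    # Illustrative only: variable names are assumptions; verify them in .cd/
    MODEL="meta-llama/Llama-3.1-70B-Instruct" \
    HF_TOKEN="<your huggingface token>" \
    DOCKER_IMAGE="<docker image url>" \
    TENSOR_PARALLEL_SIZE=4 \
    MAX_MODEL_LEN=8192 \
    docker compose up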

Supported Models

Model Name                                    Validated TP Size
deepseek-ai/DeepSeek-R1-Distill-Llama-70B     8
meta-llama/Llama-3.1-70B-Instruct             4
meta-llama/Llama-3.1-405B-Instruct            8
meta-llama/Llama-3.1-8B-Instruct              1
meta-llama/Llama-3.2-1B-Instruct              1
meta-llama/Llama-3.2-3B-Instruct              1
meta-llama/Llama-3.3-70B-Instruct             4
mistralai/Mistral-7B-Instruct-v0.2            1
mistralai/Mixtral-8x7B-Instruct-v0.1          2
mistralai/Mixtral-8x22B-Instruct-v0.1         4
Qwen/Qwen2.5-7B-Instruct                      1
Qwen/Qwen2.5-VL-7B-Instruct                   1
Qwen/Qwen2.5-14B-Instruct                     1
Qwen/Qwen2.5-32B-Instruct                     1
Qwen/Qwen2.5-72B-Instruct                     4
meta-llama/Llama-3.2-11B-Vision-Instruct      1
meta-llama/Llama-3.2-90B-Vision-Instruct      4
ibm-granite/granite-8b-code-instruct-4k       1
ibm-granite/granite-20b-code-instruct-8k      1