vLLM with Intel Gaudi FAQs¶
Prerequisites and System Requirements¶
What are the system requirements for running vLLM on Intel® Gaudi®?
Ubuntu 22.04 LTS OS.
Python 3.10.
Intel Gaudi 2 or Intel Gaudi 3 AI accelerator.
Intel Gaudi software version 1.18.0 and above.
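A quick way to confirm these prerequisites from a shell is sketched below; the commands are standard Ubuntu, Python, and Gaudi tools, and the reported versions will vary by system.
# Check the OS release (expecting Ubuntu 22.04 LTS)
lsb_release -a
# Check the Python interpreter version (expecting 3.10.x)
python3 --version
# Check the Gaudi driver and firmware stack (expecting version 1.18.0 or later)
hl-smi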
What is vLLM-fork for Gaudi and where can I find this GitHub repository?
Intel develops and maintains its own optimized fork of the original vLLM project, called vLLM-fork.
Although Intel Gaudi is already supported in the upstream vLLM project, upstream may not support every feature on the current Intel Gaudi drivers and firmware. The vLLM-fork offers new and experimental vLLM features that are aligned with, and guaranteed to run on, the latest Gaudi drivers and firmware.
How do I verify that the Intel Gaudi software is installed correctly?
Run hl-smi to check if Gaudi accelerators are visible. Refer to System Verifications and Final Tests for more details.
Run apt list --installed | grep habana to verify installed packages. The output should look similar to the below:
$ apt list --installed | grep habana
habanalabs-container-runtime
habanalabs-dkms
habanalabs-firmware-tools
habanalabs-graph
habanalabs-qual
habanalabs-rdma-core
habanalabs-thunk
habanatools
Check the installed Python packages by running pip list | grep habana and pip list | grep neural. The output should look similar to the below:
$ pip list | grep habana
habana_gpu_migration      1.19.0.561
habana-media-loader       1.19.0.561
habana-pyhlml             1.19.0.561
habana-torch-dataloader   1.19.0.561
habana-torch-plugin       1.19.0.561
lightning-habana          1.6.0
Pillow-SIMD               9.5.0.post20+habana
$ pip list | grep neural
neural_compressor_pt      3.2
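Beyond the package listings, you can also confirm that PyTorch can see the HPU devices. The snippet below is a minimal sketch assuming the habana-torch-plugin shown above is installed; verify the module and function names against your installed release.
# Ask the Habana PyTorch bridge whether HPUs are available and how many it sees
python3 -c "import habana_frameworks.torch.hpu as hthpu; print(hthpu.is_available(), hthpu.device_count())"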
How can I quickly set up the environment for vLLM using Docker?
Use the Dockerfile.hpu file provided in the HabanaAI/vLLM-fork GitHub repo to build and run a container with the latest Intel Gaudi software release.
For more details, see Quick Start Using Dockerfile.
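As a rough sketch of that flow, assuming the repository has been cloned and the Habana container runtime is configured on the host (the image name and run options are illustrative; follow the repo's instructions for the exact commands):
# Build an image from the Dockerfile.hpu shipped in the vLLM-fork repo
docker build -f Dockerfile.hpu -t vllm-hpu-env .
# Run the container with access to all Gaudi devices through the Habana runtime
docker run -it --runtime=habana -e HABANA_VISIBLE_DEVICES=all \
    --cap-add=sys_nice --net=host --rm vllm-hpu-env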
Building and Installing vLLM¶
How can I install vLLM on Intel Gaudi?
There are three different installation methods:
(Recommended) Install the stable version from the HabanaAI/vLLM-fork GitHub repo. This version is most suitable for production deployments.
Install the latest version from the HabanaAI/vLLM-fork GitHub repo. This version is suitable for developers who would like to work on experimental code and new features that are still being tested.
Install from the main vLLM source GitHub repo. This version is suitable for developers who would like to work with the official vLLM-project but may not have the latest Intel Gaudi features.
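For the second option, a rough sketch based on the steps documented in the HabanaAI/vLLM-fork README follows; branch and requirements file names can change between releases, so defer to the repo's instructions for your version.
# Clone the Gaudi-optimized fork and switch to its development branch
git clone https://github.com/HabanaAI/vllm-fork.git
cd vllm-fork
git checkout habana_main
# Install the HPU requirements and build vLLM in development mode
pip install -r requirements-hpu.txt
python setup.py develop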
Examples and Model Support¶
Which models and configurations have been validated on Gaudi 2 and Gaudi 3 devices?
Various Llama 2, Llama 3, and Llama 3.1 models (7B, 8B, and 70B versions). Refer to the Llama-3.1 Jupyter notebook example.
Mistral and Mixtral models.
Different tensor parallelism configurations (single HPU, 2x, and 8x HPU).
See Supported Configurations for more details.
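As an illustration of how a validated multi-card configuration is typically launched (the model name and values are examples, not an exhaustive list of supported settings):
# Serve a Llama 3.1 70B model sharded across 8 Gaudi cards via tensor parallelism
vllm serve meta-llama/Llama-3.1-70B-Instruct --tensor-parallel-size 8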
Features and Support¶
Which key features does vLLM support on Intel Gaudi?
Offline Batched Inference. See Gaudi-tutorials for a Jupyter notebook example.
OpenAI-Compatible Server. See Gaudi-tutorials for a Jupyter notebook example, and the server sketch after this list.
Paged KV cache optimized for Gaudi devices.
Speculative decoding (experimental).
Tensor parallel inference (single-node multi-HPU).
FP8 model and KV cache quantization and calibration with Intel Neural Compressor (INC).
See Supported Features for more details.
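As a hedged sketch of the OpenAI-compatible server on a single Gaudi card (the model name is only an example; the flags and endpoint are standard vLLM):
# Start an OpenAI-compatible server on one HPU
vllm serve meta-llama/Llama-3.1-8B-Instruct --tensor-parallel-size 1
# Query the completions endpoint from another shell
curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "meta-llama/Llama-3.1-8B-Instruct", "prompt": "Hello, Gaudi!", "max_tokens": 32}'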
Performance Tuning¶
Which execution modes does vLLM support on Intel Gaudi?
torch.compile (experimental).
PyTorch Eager mode (experimental).
HPU Graphs (recommended for best performance).
PyTorch Lazy mode.
See Execution Modes for more details.
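The mode is selected through the PT_HPU_LAZY_MODE environment variable together with vLLM's --enforce-eager flag; the combinations below are a sketch of the mapping described in the Execution Modes documentation and should be checked against your vLLM-fork version.
# HPU Graphs (recommended): lazy backend with graph capture during warm-up
PT_HPU_LAZY_MODE=1 vllm serve <model>
# PyTorch lazy mode: lazy backend without HPU Graphs
PT_HPU_LAZY_MODE=1 vllm serve <model> --enforce-eager
# torch.compile (experimental): non-lazy backend with compiled graphs
PT_HPU_LAZY_MODE=0 vllm serve <model>
# PyTorch eager mode (experimental): non-lazy backend, no graph capture
PT_HPU_LAZY_MODE=0 vllm serve <model> --enforce-eager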
How does the bucketing mechanism work in vLLM for Intel Gaudi?
The bucketing mechanism optimizes performance by grouping tensor shapes. This reduces the number of required graphs and minimizes compilations during server runtime.
Buckets are determined by parameters for batch size and sequence length.
See Bucketing Mechanism for more details.
What should I do if a request exceeds the maximum bucket size?
Consider increasing the upper bucket boundaries using environment variables to avoid potential latency increases due to graph compilation.
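Upper bucket boundaries are raised through environment variables read at server start-up. The names below follow the prompt/decode bucketing variables documented for the Gaudi backend, and the values are purely illustrative; check the Bucketing Mechanism documentation for the variables relevant to your workload.
# Raise the largest prompt sequence-length bucket and decode batch-size bucket
# before launching the server (example values only)
export VLLM_PROMPT_SEQ_BUCKET_MAX=4096
export VLLM_DECODE_BS_BUCKET_MAX=256
vllm serve <model>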
Troubleshooting¶
How to troubleshoot Out-of-Memory errors encountered while running vLLM on Intel Gaudi?
Increase --gpu-memory-utilization (default: 0.9). This addresses insufficient available memory per card.
Increase --tensor-parallel-size (default: 1). This shards the model weights across devices and can help load a model that is too big for a single card across multiple cards.
Disable HPU Graphs completely (switch to any other execution mode) to maximize the KV cache space allocation.
A combined example is sketched after this answer.
See Understanding vLLM on Gaudi tutorial for more details.
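A sketch combining these remedies (the model name and values are placeholders; --gpu-memory-utilization and --tensor-parallel-size are the flags discussed above, and --enforce-eager is one way to switch away from HPU Graphs):
# Give vLLM a larger share of each card's memory and shard weights over two HPUs
vllm serve <model> --gpu-memory-utilization 0.95 --tensor-parallel-size 2
# If memory is still tight, also disable HPU Graphs by changing the execution
# mode, e.g. enforcing eager execution, to leave more room for the KV cache
vllm serve <model> --gpu-memory-utilization 0.95 --tensor-parallel-size 2 --enforce-eager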