vLLM with Intel Gaudi FAQs

Prerequisites and System Requirements

What are the system requirements for running vLLM on Intel® Gaudi®?

  • Ubuntu 22.04 LTS OS.

  • Python 3.10.

  • Intel Gaudi 2 or Intel Gaudi 3 AI accelerator.

  • Intel Gaudi software version 1.18.0 and above.

What is vLLM-fork for Gaudi and where can I find this GitHub repository?

  • Intel develops and maintains its own optimized fork of the original vLLM project called vLLM-fork.

  • Although the upstream vLLM project already supports Intel Gaudi, it may not support every feature available with the current Intel Gaudi drivers and firmware. The vLLM-fork offers new and experimental vLLM features that are aligned with, and guaranteed to run on, the latest Gaudi drivers and firmware.

How do I verify that the Intel Gaudi software is installed correctly?

  • Run hl-smi to check if Gaudi accelerators are visible. Refer to System Verifications and Final Tests for more details.

  • Run apt list --installed | grep habana to verify installed packages. The output should look similar to the following:

    $ apt list --installed | grep habana
    habanalabs-container-runtime
    habanalabs-dkms
    habanalabs-firmware-tools
    habanalabs-graph
    habanalabs-qual
    habanalabs-rdma-core
    habanalabs-thunk
    habanatools
    
  • Check the installed Python packages by running pip list | grep habana and pip list | grep neural. The output should look similar to the following:

    $ pip list | grep habana
    habana_gpu_migration              1.19.0.561
    habana-media-loader               1.19.0.561
    habana-pyhlml                     1.19.0.561
    habana-torch-dataloader           1.19.0.561
    habana-torch-plugin               1.19.0.561
    lightning-habana                  1.6.0
    Pillow-SIMD                       9.5.0.post20+habana
    
    $ pip list | grep neural
    neural_compressor_pt              3.2
    

How can I quickly set up the environment for vLLM using Docker?

  • Use the Dockerfile.hpu file provided in the HabanaAI/vLLM-fork GitHub repo to build and run a container with the latest Intel Gaudi software release.

  • For more details, see Quick Start Using Dockerfile.
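
  • As a rough sketch (image tag and runtime flags are illustrative; follow the repo README for the exact invocation), the typical flow builds the image from Dockerfile.hpu and starts the container with the Habana container runtime:

    $ docker build -f Dockerfile.hpu -t vllm-hpu-env .
    $ docker run -it --runtime=habana -e HABANA_VISIBLE_DEVICES=all \
        --cap-add=sys_nice --net=host --rm vllm-hpu-env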

Building and Installing vLLM

How can I install vLLM on Intel Gaudi?

  • There are three different installation methods:

    • (Recommended) Install the stable version from the HabanaAI/vLLM-fork GitHub repo. This version is most suitable for production deployments (see the sketch after this list).

    • Install the latest version from the HabanaAI/vLLM-fork GitHub repo. This version is suitable for developers who would like to work on experimental code and new features that are still being tested.

    • Install from the main vLLM source GitHub repo. This version is suitable for developers who would like to work with the official vLLM-project but may not have the latest Intel Gaudi features.
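
  • As a sketch of the from-source path (branch and file names reflect the HabanaAI/vLLM-fork layout at the time of writing and may change between releases):

    $ git clone https://github.com/HabanaAI/vllm-fork.git
    $ cd vllm-fork
    $ git checkout habana_main    # or a stable release tag for production use
    $ pip install -r requirements-hpu.txt
    $ python setup.py develop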

Examples and Model Support

Which models and configurations have been validated on Gaudi 2 and Gaudi 3 devices?

  • Various Llama 2, Llama 3, and Llama 3.1 models (7B, 8B, and 70B versions). Refer to the Llama-3.1 Jupyter notebook example.

  • Mistral and Mixtral models.

  • Different tensor parallelism configurations (single HPU, 2x, and 8x HPU).

  • See Supported Configurations for more details.
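
  • For example, a hypothetical single-HPU launch and an 8x HPU tensor-parallel launch (model names and sizes are illustrative):

    $ # single HPU
    $ vllm serve meta-llama/Llama-3.1-8B-Instruct
    $ # 70B model sharded across 8 HPUs with tensor parallelism
    $ vllm serve meta-llama/Llama-3.1-70B-Instruct --tensor-parallel-size 8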

Features and Support

Which key features does vLLM support on Intel Gaudi?

Performance Tuning

Which execution modes does vLLM support on Intel Gaudi?

  • torch.compile (experimental).

  • PyTorch Eager mode (experimental).

  • HPU Graphs (recommended for best performance).

  • PyTorch Lazy mode.

  • See Execution Modes for more details.
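
  • As an illustration, the mode is typically selected through the PT_HPU_LAZY_MODE environment variable combined with the --enforce-eager flag (this mapping is a sketch based on the Execution Modes documentation; verify it against your software version):

    $ # HPU Graphs (recommended): lazy backend with HPU Graphs enabled
    $ PT_HPU_LAZY_MODE=1 vllm serve meta-llama/Llama-3.1-8B-Instruct
    $ # PyTorch Lazy mode: lazy backend, HPU Graphs disabled
    $ PT_HPU_LAZY_MODE=1 vllm serve meta-llama/Llama-3.1-8B-Instruct --enforce-eager
    $ # torch.compile (experimental): non-lazy backend
    $ PT_HPU_LAZY_MODE=0 vllm serve meta-llama/Llama-3.1-8B-Instruct
    $ # PyTorch Eager mode (experimental): non-lazy backend with --enforce-eager
    $ PT_HPU_LAZY_MODE=0 vllm serve meta-llama/Llama-3.1-8B-Instruct --enforce-eager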

How does the bucketing mechanism work in vLLM for Intel Gaudi?

  • The bucketing mechanism optimizes performance by grouping tensor shapes. This reduces the number of required graphs and minimizes compilations during server runtime.

  • Buckets are determined by parameters for batch size and sequence length.

  • See Bucketing Mechanism for more details.
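
  • For example, the prompt-phase buckets can be shaped with min/step/max environment variables such as the following (variable names follow the Gaudi bucketing documentation at the time of writing; the values are illustrative, not defaults):

    $ export VLLM_PROMPT_BS_BUCKET_MIN=1
    $ export VLLM_PROMPT_BS_BUCKET_STEP=32
    $ export VLLM_PROMPT_BS_BUCKET_MAX=64
    $ export VLLM_PROMPT_SEQ_BUCKET_MIN=128
    $ export VLLM_PROMPT_SEQ_BUCKET_STEP=128
    $ export VLLM_PROMPT_SEQ_BUCKET_MAX=1024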

What should I do if a request exceeds the maximum bucket size?

  • Consider increasing the upper bucket boundaries using environment variables to avoid potential latency increases due to graph compilation.
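
  • For example, if long prompts regularly overflow the largest prompt-sequence bucket, the upper boundary could be raised before launching the server (variable name follows the bucketing documentation; the value and model are illustrative):

    $ export VLLM_PROMPT_SEQ_BUCKET_MAX=4096
    $ vllm serve meta-llama/Llama-3.1-8B-Instruct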

Troubleshooting

How do I troubleshoot Out-of-Memory errors when running vLLM on Intel Gaudi?

  • Increase --gpu-memory-utilization (default: 0.9) - This allocates a larger share of device memory to vLLM on each card, addressing insufficient available memory.

  • Increase --tensor-parallel-size (default: 1) - This shards the model weights across multiple devices, which helps when the model is too large to fit on a single card.

  • Disable HPU Graphs completely (switch to any other execution mode) to maximize KV Cache space allocation.

  • See Understanding vLLM on Gaudi tutorial for more details.
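
  • Putting these together, a hypothetical launch that applies all three remedies might look like this (model name and values are illustrative):

    $ vllm serve meta-llama/Llama-3.1-70B-Instruct \
        --gpu-memory-utilization 0.95 \
        --tensor-parallel-size 8 \
        --enforce-eager    # runs without HPU Graphs, freeing memory for the KV cache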