vLLM Inference Server with Gaudi

This document provides instructions on deploying models using vLLM with the Intel® Gaudi® AI accelerator. It is based on the vLLM Inference Server README for Gaudi. The process involves:

  • Creating a vLLM Docker image for Gaudi

  • Building and installing the vLLM Inference Server

  • Sending an Inference Request

Creating a Docker Image for Gaudi

Since a vLLM server is launched within a Docker container, a Docker image tailored for Gaudi is needed. Follow the instructions in Run Docker Image.

Building and Installing vLLM for Gaudi

Follow the instructions in Build and Install vLLM.

Sending an Inference Request

There are two common methods to send inference requests to vLLM:

  • Offline batched inference - generate completions for a batch of prompts directly through the vLLM Python API

  • OpenAI-compatible server - start an API server and send HTTP requests to it

A practical Jupyter notebook demonstrating these methods is provided in Gaudi Tutorials.
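For illustration, the minimal sketches below show both approaches in Python. The model name matches the example used later in this document, the prompts and sampling values are placeholders, and the second sketch assumes an OpenAI-compatible server already running on the default localhost:8000; adjust these for your setup.

  # Offline batched inference: load the model in-process and generate
  # completions for a batch of prompts using the vLLM Python API.
  from vllm import LLM, SamplingParams

  prompts = ["Hello, my name is", "The future of AI is"]
  sampling_params = SamplingParams(temperature=0.7, max_tokens=64)

  llm = LLM(model="meta-llama/Meta-Llama-3.1-8B", dtype="bfloat16")
  for output in llm.generate(prompts, sampling_params):
      print(output.outputs[0].text)

For the OpenAI-compatible server, any HTTP client works; the sketch below uses the requests library:

  # OpenAI-compatible server: send a completion request over HTTP to a server
  # started as shown in the next section (default port 8000).
  import requests

  response = requests.post(
      "http://localhost:8000/v1/completions",
      json={
          "model": "meta-llama/Meta-Llama-3.1-8B",
          "prompt": "The capital of France is",
          "max_tokens": 32,
          "temperature": 0.0,
      },
  )
  print(response.json()["choices"][0]["text"])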

For a list of popular models and supported configurations validated on Gaudi, see the README for Gaudi.

Understanding vLLM Logs

This section uses a basic example that runs the Meta-Llama-3.1-8B model on a Gaudi vLLM server with default settings:

python -m vllm.entrypoints.openai.api_server --model="meta-llama/Meta-Llama-3.1-8B" --dtype=bfloat16 --block-size=128 --max-num-seqs=4 --tensor-parallel-size=1 --max-seq-len-to-capture=2048

The following shows the initial part of the server log:

  INFO 09-24 17:31:39 habana_model_runner.py:590] Pre-loading model weights on hpu:0 took 15.05 GiB of device memory (15.05 GiB/94.62 GiB used) and 1.067 GiB of host memory (8.199 GiB/108.2 GiB used)
  INFO 09-24 17:31:39 habana_model_runner.py:636] Wrapping in HPU Graph took 0 B of device memory (15.05 GiB/94.62 GiB used) and -3.469 MiB of host memory (8.187 GiB/108.2 GiB used)
  INFO 09-24 17:31:39 habana_model_runner.py:640] Loading model weights took in total 15.05 GiB of device memory (15.05 GiB/94.62 GiB used) and 1.056 GiB of host memory (8.188 GiB/108.2 GiB used)
  INFO 09-24 17:31:40 habana_worker.py:153] Model profiling run took 355 MiB of device memory (15.4 GiB/94.62 GiB used) and 131.4 MiB of host memory (8.316 GiB/108.2 GiB used)
  INFO 09-24 17:31:40 habana_worker.py:177] Free device memory: 79.22 GiB, 71.3 GiB usable (gpu_memory_utilization=0.9), 7.13 GiB reserved for HPUGraphs (VLLM_GRAPH_RESERVED_MEM=0.1), 64.17 GiB reserved for KV cache
  INFO 09-24 17:31:40 habana_executor.py:85] # HPU blocks: 4107, # CPU blocks: 256
  INFO 09-24 17:31:41 habana_worker.py:208] Initializing cache engine took 64.17 GiB of device memory (79.57 GiB/94.62 GiB used) and 1.015 GiB of host memory (9.329 GiB/108.2 GiB used)

This part of the log shows the memory consumption trend for the chosen model. It covers the device memory used to load the model weights, the memory consumed by a profiling run (performed with dummy data and without the KV cache), and the usable device memory that remains to be shared between HPU Graphs and the KV cache before the warm-up phase begins. This information can be used to determine the bucketing scheme to use for warmups.
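As a sanity check, the memory split reported above can be reproduced with simple arithmetic. The sketch below recomputes the figures from the log using the values in effect for this run (79.22 GiB free device memory, gpu_memory_utilization=0.9, VLLM_GRAPH_RESERVED_MEM=0.1, block size 128, 4107 HPU blocks); the numbers are specific to this example and will differ for other models and settings.

  # Reproduce the memory split reported in the server log above.
  free_device_mem_gib = 79.22      # free device memory after weight loading and profiling
  gpu_memory_utilization = 0.9     # fraction of free memory vLLM is allowed to use
  graph_reserved_mem = 0.1         # VLLM_GRAPH_RESERVED_MEM: share reserved for HPU Graphs
  block_size = 128                 # tokens per KV cache block (--block-size)
  hpu_blocks = 4107                # "# HPU blocks" reported by the server

  usable = free_device_mem_gib * gpu_memory_utilization    # ~71.3 GiB usable memory
  hpu_graphs = usable * graph_reserved_mem                  # ~7.13 GiB reserved for HPU Graphs
  kv_cache = usable - hpu_graphs                            # ~64.17 GiB reserved for KV cache
  kv_capacity_tokens = hpu_blocks * block_size              # 525,696 tokens of KV cache capacity

  print(f"usable={usable:.2f} GiB, graphs={hpu_graphs:.2f} GiB, "
        f"kv_cache={kv_cache:.2f} GiB, capacity={kv_capacity_tokens} tokens")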

The following shows the warm-up phase logs:

  INFO 09-24 17:32:13 habana_model_runner.py:1477] Graph/Prompt captured:24 (100.0%) used_mem:67.72 MiB buckets:[(1, 128), (1, 256), (1, 384), (1, 512), (1, 640), (1, 768), (1, 896), (1, 1024), (2, 128), (2, 256), (2, 384), (2, 512), (2, 640), (2, 768), (2, 896), (2, 1024), (4, 128), (4, 256), (4, 384), (4, 512), (4, 640), (4, 768), (4, 896), (4, 1024)]
  INFO 09-24 17:32:13 habana_model_runner.py:1477] Graph/Decode captured:1 (100.0%) used_mem:64 KiB buckets:[(4, 128)]
  INFO 09-24 17:32:13 habana_model_runner.py:1620] Warmup finished in 32 secs, allocated 92.77 MiB of device memory
  INFO 09-24 17:32:13 habana_executor.py:91] init_cache_engine took 64.26 GiB of device memory (79.66 GiB/94.62 GiB used) and 1.104 GiB of host memory (9.419 GiB/108.2 GiB used)

After analyzing this part of the warm-up phase logs, you should have a good idea of how much free device memory remains for overhead calculations and how much more could still be utilized by increasing gpu_memory_utilization. You are expected to balance the memory requirements of warmup bucketing, HPU Graphs, and the KV cache based on your unique needs.
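The 24 prompt buckets captured above are simply the cross product of the batch-size values and sequence-length values used for this run (batch sizes 1, 2, and 4, and sequence lengths from 128 to 1024 in steps of 128). The minimal sketch below reproduces the list; the ranges themselves are controlled by the bucketing environment variables described in the performance tuning guidelines.

  # Reconstruct the prompt buckets listed in the warmup log above:
  # (batch_size, seq_len) pairs over the ranges in effect for this run.
  batch_sizes = [1, 2, 4]
  seq_lens = range(128, 1024 + 1, 128)

  prompt_buckets = [(bs, seq) for bs in batch_sizes for seq in seq_lens]
  assert len(prompt_buckets) == 24   # matches "Graph/Prompt captured:24" in the log
  print(prompt_buckets)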

Basic Troubleshooting for Out of Memory Errors

During the development phase, when evaluating a model for inference on vLLM, you may skip the warm-up phase of the server. This shortens testing turnaround times and is controlled by the VLLM_SKIP_WARMUP environment variable:

export VLLM_SKIP_WARMUP="true"

Note

Disable warmup only for development; it is highly recommended to keep it enabled in production.
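When experimenting with the offline API, for example in a notebook, the same variable can be set from Python before the engine is created. A minimal sketch, assuming the model used elsewhere in this document:

  # Development only: skip the warm-up phase for faster iteration.
  # Keep warmup enabled in production deployments.
  import os
  os.environ["VLLM_SKIP_WARMUP"] = "true"   # must be set before the engine is created

  from vllm import LLM
  llm = LLM(model="meta-llama/Meta-Llama-3.1-8B", dtype="bfloat16")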

Due to various factors such as available device memory, model size, and the typical length of input sequences in your use case, the default inference command may not always work for your model and can fail with out-of-memory errors.

The following steps address out-of-memory errors; an illustrative configuration follows the list:

  • Increase gpu_memory_utilization - This addresses insufficient available memory. vLLM pre-allocates the KV cache using gpu_memory_utilization percent of the free device memory. Increasing this value provides more KV cache space.

  • Decrease max_num_seqs or max_num_batched_tokens - This reduces the number of concurrent requests in a batch and therefore the amount of KV cache space required.

  • Increase tensor_parallel_size - This shards the model weights across devices, so each device has more memory available for the KV cache.
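As an illustration, the same knobs can be passed when constructing an engine through the offline API; the values below are placeholders showing the direction of each adjustment, not recommendations:

  # Illustrative out-of-memory mitigations; tune the values for your model and workload.
  from vllm import LLM

  llm = LLM(
      model="meta-llama/Meta-Llama-3.1-8B",
      dtype="bfloat16",
      gpu_memory_utilization=0.95,   # give vLLM a larger share of device memory
      max_num_seqs=2,                # fewer concurrent sequences, smaller KV cache demand
      tensor_parallel_size=2,        # shard the weights across two Gaudi devices
  )

When launching the OpenAI-compatible server instead, the equivalent command-line flags are --gpu-memory-utilization, --max-num-seqs, and --tensor-parallel-size.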

Guidelines for Performance Tuning

Some general guidelines for tuning performance are noted below. Your results may vary:

  • Warmup should be enabled during deployment with an optimal number of buckets. Warmup time depends on many factors, such as input and output sequence lengths, batch size, number of buckets, and data type, and can take a couple of hours depending on the configuration.

  • Since HPU Graphs and the KV cache occupy the same memory pool (the “usable memory” determined by gpu_memory_utilization), a balance is required between the two. This balance can be managed with the VLLM_GRAPH_RESERVED_MEM environment variable, which defines the ratio of usable memory reserved for HPU Graphs versus the KV cache:

    • Maximizing the KV cache size helps accommodate bigger batches, resulting in increased overall throughput.

    • Enabling HPU Graphs reduces host overhead and can be useful for reducing latency.

  • For fine-grained control, the VLLM_GRAPH_PROMPT_RATIO environment variable determines the ratio of usable graph memory reserved for prefill versus decode graphs. Allocating more memory to one stage generally speeds up that stage.

  • Bucketing mechanisms can be tuned. The vLLM server is pre-configured for heavy-duty decoding and expects many pending requests (see the default maximum batch size strategy of VLLM_GRAPH_DECODE_STRATEGY). During lean periods when the request rate is low, this strategy may not be the best fit and can be adjusted for smaller batch sizes, for example by tweaking the decode batch-size bucket range through the VLLM_DECODE_BS_BUCKET_{param} variables, as sketched below. Refer to the performance tuning knobs - VLLM_{phase}_{dim}_BUCKET_{param} - for the list of 12 environment variables that configure the bucketing ranges.
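As a concrete example, the decode batch-size bucket range could be narrowed for a low-traffic deployment by setting the corresponding variables before the server starts. The sketch below assumes the MIN/STEP/MAX parameters documented in the Gaudi vLLM README; the values are illustrative only.

  # Narrow the decode batch-size bucket range for a low-request-rate deployment.
  # These variables must be set in the environment of the process that starts the
  # server (e.g. exported in the launching shell, or set here before creating an LLM).
  import os

  os.environ["VLLM_DECODE_BS_BUCKET_MIN"] = "1"
  os.environ["VLLM_DECODE_BS_BUCKET_STEP"] = "2"
  os.environ["VLLM_DECODE_BS_BUCKET_MAX"] = "4"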