Getting Started with vLLM¶
The following sections provide instructions on deploying models using vLLM with Intel® Gaudi® AI accelerator. The document is based on the vLLM Inference Server README for Gaudi. The process involves:
Creating a vLLM Docker image for Gaudi
Building and installing the vLLM Inference Server
Sending an Inference Request
For frequently asked questions about using Gaudi with vLLM, see vLLM with Intel Gaudi FAQs.
To enable FP8 calibration and inference on Gaudi via vLLM, see FP8 Calibration and Inference with vLLM.
Creating a Docker Image for Gaudi¶
Since a vLLM server is launched within a Docker container, a Docker image tailored for Gaudi is needed. Follow the instructions in Run Docker Image.
Building and Installing vLLM for Gaudi¶
Follow the instructions in Build and Install vLLM.
Sending an Inference Request¶
There are two common methods to send inference requests to vLLM: offline batched inference and online serving through the OpenAI-compatible server.
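For example, once a vLLM server is up (see the launch command later in this document), a completion request can be sent to the OpenAI-compatible endpoint. This is a minimal sketch assuming the default host and port (localhost:8000) and the example model used below; the prompt and sampling parameters are placeholders:

curl http://localhost:8000/v1/completions -H "Content-Type: application/json" -d '{"model": "meta-llama/Meta-Llama-3.1-8B", "prompt": "Intel Gaudi is", "max_tokens": 64, "temperature": 0}'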
A practical Jupyter notebook utilizing these methods is provided in Gaudi Tutorials.
For a list of popular models and supported configurations validated on Gaudi, see the README for Gaudi.
Understanding vLLM Logs¶
This section uses a basic example that runs the Meta-Llama-3.1-8B model on a Gaudi vLLM server with default settings:
python -m vllm.entrypoints.openai.api_server --model="meta-llama/Meta-Llama-3.1-8B" --dtype=bfloat16 --block-size=128 --max-num-seqs=4 --tensor-parallel-size=1 --max-seq-len-to-capture=2048
The following shows the initial part of the server log:
INFO 09-24 17:31:39 habana_model_runner.py:590] Pre-loading model weights on hpu:0 took 15.05 GiB of device memory (15.05 GiB/94.62 GiB used) and 1.067 GiB of host memory (8.199 GiB/108.2 GiB used)
INFO 09-24 17:31:39 habana_model_runner.py:636] Wrapping in HPU Graph took 0 B of device memory (15.05 GiB/94.62 GiB used) and -3.469 MiB of host memory (8.187 GiB/108.2 GiB used)
INFO 09-24 17:31:39 habana_model_runner.py:640] Loading model weights took in total 15.05 GiB of device memory (15.05 GiB/94.62 GiB used) and 1.056 GiB of host memory (8.188 GiB/108.2 GiB used)
INFO 09-24 17:31:40 habana_worker.py:153] Model profiling run took 355 MiB of device memory (15.4 GiB/94.62 GiB used) and 131.4 MiB of host memory (8.316 GiB/108.2 GiB used)
INFO 09-24 17:31:40 habana_worker.py:177] Free device memory: 79.22 GiB, 71.3 GiB usable (gpu_memory_utilization=0.9), 7.13 GiB reserved for HPUGraphs (VLLM_GRAPH_RESERVED_MEM=0.1), 64.17 GiB reserved for KV cache
INFO 09-24 17:31:40 habana_executor.py:85] # HPU blocks: 4107, # CPU blocks: 256
INFO 09-24 17:31:41 habana_worker.py:208] Initializing cache engine took 64.17 GiB of device memory (79.57 GiB/94.62 GiB used) and 1.015 GiB of host memory (9.329 GiB/108.2 GiB used)
This part of the log shows the memory consumption trends of the chosen model. It includes the device memory consumed by loading the model weights, by the profiling run (with dummy data and without KV cache), and the resulting usable device memory, which is shared between HPU Graphs and KV cache before the warmup phase begins. This information can be used to determine the bucketing scheme to use for warmups.
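As a rough illustration of how the numbers in the log above relate (the figures come from this example run and will differ for your model and configuration):

79.22 GiB free device memory x gpu_memory_utilization (0.9) ≈ 71.3 GiB usable memory
71.3 GiB usable memory x VLLM_GRAPH_RESERVED_MEM (0.1) ≈ 7.13 GiB reserved for HPU Graphs
71.3 GiB usable memory - 7.13 GiB reserved for HPU Graphs ≈ 64.17 GiB reserved for KV cache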
The following shows the warmup phase logs:
INFO 09-24 17:32:13 habana_model_runner.py:1477] Graph/Prompt captured:24 (100.0%) used_mem:67.72 MiB buckets:[(1, 128), (1, 256), (1, 384), (1, 512), (1, 640), (1, 768), (1, 896), (1, 1024), (2, 128), (2, 256), (2, 384), (2, 512), (2, 640), (2, 768), (2, 896), (2, 1024), (4, 128), (4, 256), (4, 384), (4, 512), (4, 640), (4, 768), (4, 896), (4, 1024)]
INFO 09-24 17:32:13 habana_model_runner.py:1477] Graph/Decode captured:1 (100.0%) used_mem:64 KiB buckets:[(4, 128)]
INFO 09-24 17:32:13 habana_model_runner.py:1620] Warmup finished in 32 secs, allocated 92.77 MiB of device memory
INFO 09-24 17:32:13 habana_executor.py:91] init_cache_engine took 64.26 GiB of device memory (79.66 GiB/94.62 GiB used) and 1.104 GiB of host memory (9.419 GiB/108.2 GiB used)
After analyzing this part of the warmup phase logs, you should have a good idea of how much free device memory remains for overhead calculations and how much more could still be utilized by increasing gpu_memory_utilization.
You are expected to balance the memory requirements for warmup bucketing, HPU Graphs, and KV cache based on your unique needs.
Basic Troubleshooting for OOM Errors¶
Due to various factors such as available GPU memory, model size, and input sequence length, the standard inference command may not always work for your model, potentially leading to OOM errors. The following steps help mitigate OOM errors:
Increase gpu_memory_utilization - This addresses insufficient available memory. vLLM pre-allocates the HPU cache using gpu_memory_utilization% of memory. Increasing this utilization provides more KV cache space.

Decrease max_num_seqs or max_num_batched_tokens - This may reduce the number of concurrent requests in a batch, thereby requiring less KV cache space.

Increase tensor_parallel_size - This approach shards the model weights, so each GPU has more memory available for KV cache.

An example command combining these adjustments is shown below.
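For illustration only, applying these adjustments to the example command used earlier might look as follows; the specific values are placeholders that must be tuned for your model and workload:

python -m vllm.entrypoints.openai.api_server --model="meta-llama/Meta-Llama-3.1-8B" --dtype=bfloat16 --block-size=128 --gpu-memory-utilization=0.95 --max-num-seqs=2 --tensor-parallel-size=2 --max-seq-len-to-capture=2048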
During the development phase, when evaluating a model for inference on vLLM, you may skip the warmup phase of the server to achieve faster testing turnaround times. This is controlled by the VLLM_SKIP_WARMUP=true environment variable.

Note

You may disable warmup only for development, but it is highly recommended to enable it in production.
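For example, warmup could be skipped for a quick development run by prefixing the launch command with the environment variable (shown here with the same example model as above):

VLLM_SKIP_WARMUP=true python -m vllm.entrypoints.openai.api_server --model="meta-llama/Meta-Llama-3.1-8B" --dtype=bfloat16 --block-size=128 --max-num-seqs=4 --tensor-parallel-size=1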
Guidelines for Performance Tuning¶
Some general guidelines for tweaking performance are noted below. Your results may vary:
Warmup should be enabled during deployment with an optimal number of buckets. Warmup time depends on many factors, e.g. input and output sequence length, batch size, number of buckets, and data type. Warmup can even take a couple of hours, depending on the configuration.
Since HPU Graphs and KV cache occupy the same memory pool ("usable memory" determined by gpu_memory_utilization), a balance is required between the two. This can be managed using the VLLM_GRAPH_RESERVED_MEM environment variable, which defines the ratio of memory reserved for HPU Graphs versus KV cache:

Maximizing the KV cache size helps to accommodate bigger batches, resulting in increased overall throughput.
Enabling HPU Graphs reduces host overhead times and can be useful for reducing latency.
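As a sketch of shifting this balance, reserving a larger share of the usable memory for HPU Graphs (at the expense of KV cache) could look like the following; 0.2 is an illustrative value, not a recommendation:

VLLM_GRAPH_RESERVED_MEM=0.2 python -m vllm.entrypoints.openai.api_server --model="meta-llama/Meta-Llama-3.1-8B" --dtype=bfloat16 --block-size=128 --max-num-seqs=4 --tensor-parallel-size=1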
For fine-grained control, the VLLM_GRAPH_PROMPT_RATIO environment variable determines the ratio of usable graph memory reserved for prefill and decode graphs. Allocating more memory for one stage generally means faster processing of that stage.

Bucketing mechanisms can be applied. The vLLM server is pre-configured optimally for heavy-duty decoding with lots of pending requests (see the VLLM_GRAPH_DECODE_STRATEGY default maximum batch size strategy). However, during lean periods when the request rate is low, this strategy may not be best suited and can be tuned to cater to lower batch sizes. As an example, tweaking the range of buckets using VLLM_DECODE_BS_BUCKET_{param} might prove useful (an example is shown at the end of this section). Refer to the performance tuning knobs - VLLM_{phase}_{dim}_BUCKET_{param} - for a list of 12 environment variables configuring the ranges of the bucketing mechanisms.

Using the FP8 data type for inference on large language models halves the required memory bandwidth compared to BF16. In addition, FP8 compute is twice as fast as BF16 compute, so even compute-bound workloads, such as offline inference on large batch sizes, benefit. For more details, see FP8 Calibration and Inference with vLLM.
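To illustrate the bucketing knobs mentioned above, a deployment that mostly serves small decode batches might narrow the decode batch-size bucket range. This assumes {param} takes the values MIN, STEP, and MAX as described in the Gaudi README; the numbers below are placeholders:

VLLM_DECODE_BS_BUCKET_MIN=1 VLLM_DECODE_BS_BUCKET_STEP=2 VLLM_DECODE_BS_BUCKET_MAX=4 python -m vllm.entrypoints.openai.api_server --model="meta-llama/Meta-Llama-3.1-8B" --dtype=bfloat16 --block-size=128 --max-num-seqs=4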