FP8 Calibration and Inference with vLLM¶
This section provides the steps required to enable FP8 calibration and inference on the Intel® Gaudi® AI accelerator via vLLM using the Intel® Neural Compressor (INC) package. For more details about FP8 inference and INC, refer to Run Inference Using FP8.
Calibrating a Model¶
Running inference via vLLM on Gaudi with FP8 precision is achieved using INC.
This approach requires a model calibration procedure to generate measurements, quantization files, and configurations. For more details, see Measurement and Quantization Mechanisms.
The vllm-hpu-extension repository provides the calibrate_model.sh script that utilizes INC to simplify the model calibration. For the script usage and options, refer to the section below.
Note
For a full calibration procedure with the Meta-Llama-3.1-70B-Instruct model, refer to the FP8 Quantization and Inference using Intel® Neural Compressor (INC) tutorial.
The calibration procedure works with any dataset that contains system_prompt and question fields. These fields are used to prepare a calibration dataset with prompts formatted specifically for the chosen model. It is recommended to use a public dataset utilized by MLCommons in the Llama2-70b inference submission.
Since measurements are device-dependent, scales collected on Gaudi 3 cannot be used on Gaudi 2 accelerators. This mismatch may lead to accuracy issues.
If the following error occurs, set a valid tensor parallelism value, e.g., -t 8:
RuntimeError: [Rank:0] FATAL ERROR :: MODULE:PT_DEVMEM Allocation failed for size::939524096 (896)MB
Options and Usage¶
To run the calibrate_model.sh script, follow the steps below:
Build and install vllm-fork as described in README for Gaudi.
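For reference, a typical installation flow is sketched below. This is a condensed, illustrative sequence; the authoritative steps, including which branch or tag to check out and any additional requirements, are in the Gaudi README.
# Illustrative installation sketch; follow the Gaudi README for the exact steps.
git clone https://github.com/HabanaAI/vllm-fork.git
cd vllm-fork
git checkout habana_main        # assumed branch; use the one recommended by the README
pip install -r requirements-hpu.txt
python setup.py develop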
Clone the vllm-hpu-extension repository and move to the calibration subdirectory:
cd /root
git clone https://github.com/HabanaAI/vllm-hpu-extension.git -b v1.20.0
cd vllm-hpu-extension/calibration
Download and process the dataset .pkl file by using the download_dataset.sh script.
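The flow is roughly the following; the invocation is illustrative, so check the script itself for its actual arguments and output location.
# Illustrative only: arguments and the output filename may differ.
./download_dataset.sh
ls *.pkl    # processed dataset file, passed to calibrate_model.sh via -d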
Run the calibrate_model.sh script. Refer to the script options and run examples below. The script generates the maxabs_quant_g3.json file, which is used for FP8 inference.
Options:
Option | Description
---|---
-h | Prints the help message.
-m | Sets the path to the model (if stored locally) or the model ID from the Hugging Face library.
-d | Sets the path to the dataset in .pkl format.
-o | Sets the path to the output directory.
-b | Optional. Sets the batch size used for running the measurements (default: 32).
-l | Optional. Sets the limit of the samples in the calibration dataset.
-t | Optional. Sets the tensor parallel size (default: 1).
Examples:
Calibrating the Meta-Llama-3.1-405B-Instruct model using a processed dataset:
./calibrate_model.sh -m /path/to/local/llama3.1/Meta-Llama-3.1-405B-Instruct/ -d dataset-processed.pkl -o /path/to/measurements/vllm-benchmarks/inc -b 128 -t 8 -l 4096
Calibrating the Hugging Face facebook/opt-125m model using a processed dataset:
./calibrate_model.sh -m facebook/opt-125m -d dataset-processed.pkl -o inc/
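After a successful run, the generated quantization file can be found under the output directory passed via -o. The exact subdirectory name depends on the calibrated model; the path below is illustrative.
# Locate the generated quantization config under the calibration output directory.
find /path/to/measurements/vllm-benchmarks/inc -name maxabs_quant_g3.json
# e.g. /path/to/measurements/vllm-benchmarks/inc/meta-llama-3.1-405b-instruct/maxabs_quant_g3.json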
Running FP8 Inference¶
Once the model calibration is completed and the measurements are collected, run FP8 inference with vLLM using the following command:
export QUANT_CONFIG=/path/to/quant/config/inc/meta-llama-3.1-405b-instruct/maxabs_quant_g3.json
vllm serve meta-llama/Llama-3.1-405B-Instruct --quantization inc --kv-cache-dtype fp8_inc --weights-load-device cpu --tensor-parallel-size 8
The following configurations are required for enabling FP8 inference:
The QUANT_CONFIG environment variable points to the measurements in the maxabs_quant_g3.json quantization file generated after the calibrate_model.sh execution.
The --quantization inc and --kv-cache-dtype fp8_inc parameters enable the FP8 quantization using INC and QUANT_CONFIG.
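Once the server is up, you can verify FP8 inference end to end by sending a request to the OpenAI-compatible endpoint. The example below assumes the server listens on the default port 8000.
# Illustrative request against the default OpenAI-compatible endpoint.
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "meta-llama/Llama-3.1-405B-Instruct", "prompt": "Explain FP8 quantization in one sentence.", "max_tokens": 64}'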
Note
For an example of running FP8 inference on the Meta-Llama-3.1-70B-Instruct model, refer to the FP8 Quantization and Inference using Intel® Neural Compressor (INC) tutorial.
Basic Troubleshooting for OOM Errors¶
During the development phase, when evaluating a model for FP8 inference on vLLM, you may skip the warmup phase of the server. This helps to achieve faster testing turnaround times and can be enabled using the VLLM_SKIP_WARMUP=true environment variable.
Note
Disable warmup only during development. It is highly recommended to keep it enabled in production.
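For example, a development launch that skips warmup might look like the following sketch; keep warmup enabled for production deployments.
# Development only: skip the warmup phase for faster iteration.
export QUANT_CONFIG=/path/to/quant/config/inc/meta-llama-3.1-405b-instruct/maxabs_quant_g3.json
VLLM_SKIP_WARMUP=true vllm serve meta-llama/Llama-3.1-405B-Instruct --quantization inc --kv-cache-dtype fp8_inc --tensor-parallel-size 8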
When using FP8 models, you may experience timeouts caused by the long compilation time of FP8 operations. To mitigate this, set the following environment variables:
VLLM_ENGINE_ITERATION_TIMEOUT_S to adjust the vLLM server timeout. The value is in seconds, e.g., 600 equals 10 minutes.
VLLM_RPC_TIMEOUT to adjust the RPC protocol timeout used by the OpenAI-compatible API. The value is in milliseconds, e.g., 600000 equals 10 minutes.
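For example, to allow up to 10 minutes before either timeout triggers:
export VLLM_ENGINE_ITERATION_TIMEOUT_S=600    # seconds (10 minutes)
export VLLM_RPC_TIMEOUT=600000                # milliseconds (10 minutes)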