FP8 Calibration and Inference with vLLM

This section provides the steps required to enable FP8 calibration and inference on the Intel® Gaudi® AI accelerator via vLLM using the Intel® Neural Compressor (INC) package. For more details about FP8 inference and INC, refer to Run Inference Using FP8.

Calibrating a Model

Running inference via vLLM on Gaudi with FP8 precision is achieved using INC. This approach requires a model calibration procedure that generates measurements, quantization files, and configurations. For more details, see Measurement and Quantization Mechanisms. The vllm-hpu-extension repository provides the calibrate_model.sh script, which uses INC to simplify model calibration. For the script usage and options, refer to the section below.

Note

  • For a full calibration procedure with the Meta-Llama-3.1-70B-Instruct model, refer to the FP8 Quantization and Inference using Intel® Neural Compressor (INC) tutorial.

  • The calibration procedure works with any dataset that contains system_prompt and question fields. These fields are used to prepare a calibration dataset with prompts formatted specifically for the chosen model. It is recommended to use the public dataset utilized by MLCommons in the Llama2-70b inference submission. A quick way to check these fields is shown after this note.

  • Since measurements are device-dependent, scales collected on Gaudi 3 cannot be used on Gaudi 2 accelerators. This mismatch may lead to accuracy issues.

  • If the following error occurs, set a valid tensor parallelism value, e.g., -t 8:

    RuntimeError: [Rank:0] FATAL ERROR :: MODULE:PT_DEVMEM Allocation failed for size::939524096 (896)MB
    

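To quickly verify that a processed dataset exposes the system_prompt and question fields mentioned above, you can inspect the .pkl file before running the calibration. This is only a sketch: it assumes the processed dataset is a pickled pandas DataFrame, as in the MLCommons Llama2-70b dataset, and that pandas is installed in the environment.

    # Sanity-check sketch: assumes the processed dataset is a pickled pandas
    # DataFrame and that pandas is available in the environment.
    python3 -c "import pandas as pd; df = pd.read_pickle('dataset-processed.pkl'); print(df.columns.tolist()); print('rows:', len(df))"

The printed column list should include system_prompt and question.
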
Options and Usage

To run the calibrate_model.sh script, follow the steps below:

  1. Build and install vllm-fork as described in the README for Gaudi.

  2. Clone the vllm-hpu-extension repository and move to the calibration subdirectory:

    cd /root
    git clone https://github.com/HabanaAI/vllm-hpu-extension.git -b v1.21.0
    cd vllm-hpu-extension/calibration
    
  3. Download and process the dataset .pkl file by using the download_dataset.sh script.

  4. Run the calibrate_model.sh script. Refer to the script options and run examples below. The script generates the maxabs_quant_g3.json file, which is used for FP8 inference.

Options:

Option         Description
-h             Prints the help message.
-m <path/ID>   Sets the path to the model (if stored locally) or the model ID from the Hugging Face library.
-d <path>      Sets the path to the dataset in .pkl format.
-o <path>      Sets the path to the output directory.
-b <size>      Optional. Sets the batch size used for running the measurements (default: 32).
-l <samples>   Optional. Sets the limit on the number of samples in the calibration dataset.
-t <size>      Optional. Sets the tensor parallel size (default: 1).

Examples:

  • Calibrating the Meta-Llama-3.1-405B-Instruct model using a processed dataset:

    ./calibrate_model.sh -m /path/to/local/llama3.1/Meta-Llama-3.1-405B-Instruct/ -d dataset-processed.pkl -o /path/to/measurements/vllm-benchmarks/inc -b 128 -t 8 -l 4096
    
  • Calibrating the Hugging Face facebook/opt-125m model using a processed dataset:

    ./calibrate_model.sh -m facebook/opt-125m -d dataset-processed.pkl -o inc/
    

Running FP8 Inference

Once the model calibration is completed and the measurements are collected, run FP8 inference with vLLM using the following command:

export QUANT_CONFIG=/path/to/quant/config/inc/meta-llama-3.1-405b-instruct/maxabs_quant_g3.json
vllm serve meta-llama/Llama-3.1-405B-Instruct --quantization inc --kv-cache-dtype fp8_inc --weights-load-device cpu --tensor-parallel-size 8

The following configurations are required for enabling FP8 inference:

  • The QUANT_CONFIG environment variable points to the measurements in the maxabs_quant_g3.json quantization file generated by the calibrate_model.sh execution.

  • The --quantization inc and --kv-cache-dtype fp8_inc parameters enable FP8 quantization using INC and QUANT_CONFIG.
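
Once the server is up, FP8 inference can be verified end to end by sending a request to the OpenAI-compatible endpoint. The sketch below assumes the default port 8000 and the model name used in the serve command above.

    # Simple end-to-end check against the OpenAI-compatible completions endpoint.
    curl http://localhost:8000/v1/completions \
        -H "Content-Type: application/json" \
        -d '{"model": "meta-llama/Llama-3.1-405B-Instruct", "prompt": "FP8 quantization is", "max_tokens": 32}'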

Note

For an example of running FP8 inference on the Meta-Llama-3.1-70B-Instruct model, refer to the FP8 Quantization and Inference using Intel® Neural Compressor (INC) tutorial.

Reducing vLLM FP8 Warmup Time

vLLM warmup time for FP8 models is significantly longer than for BF16 due to additional graph compilations triggered by varying constant scale values in quantized model layers.

FP8 warmup time can be reduced by setting the RUNTIME_SCALE_PATCHING=1 environment variable and selecting a hardware-aligned per-tensor scale_method in the INC JSON config, as shown in the sketch below.
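
For example, the serve command from the previous section could be started as follows. This is a sketch only: maxabs_hw is used here as an assumed example of a hardware-aligned per-tensor scale method; check the INC documentation for the scale methods available in your release.

    # Experimental: reduce FP8 warmup time (Lazy mode only).
    # The quantization config pointed to by QUANT_CONFIG is assumed to set a
    # hardware-aligned per-tensor scale method, e.g. "scale_method": "maxabs_hw".
    export PT_HPU_LAZY_MODE=1
    export RUNTIME_SCALE_PATCHING=1
    export QUANT_CONFIG=/path/to/quant/config/inc/meta-llama-3.1-405b-instruct/maxabs_quant_g3.json
    vllm serve meta-llama/Llama-3.1-405B-Instruct --quantization inc --kv-cache-dtype fp8_inc --weights-load-device cpu --tensor-parallel-size 8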

Note

  • This is an experimental feature. It reduces FP8 warmup time but may lower model throughput by 5-10%. Future releases will improve performance and extend support to more ops.

  • Available only with Lazy mode (PT_HPU_LAZY_MODE=1). Support for torch.compile will be added in subsequent releases.

  • Supports Llama workloads that use FP8 execution of Linear layers and casting ops between BF16 and FP8. MoE and Convolution ops are not yet supported.

  • FSDPA on Gaudi 2 has a known accuracy issue when used with vLLM for single-card Llama workloads. To bypass this issue:

    • Exclude fused_scaled_dot_product_attention from INC quantization by adding it to the blocklist as described in the INC JSON config. This forces FSDPA to run in higher precision. See the sketch after this note.

    • Set VLLM_PROMPT_USE_FUSEDSDPA=0 to use the standard scaled_dot_product_attention op instead, enabling quantized execution.
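
A minimal sketch of these workarounds is shown below. The blocklist field follows the INC FP8 JSON configuration format, but treat the exact key names, and whether the op belongs under types or names, as assumptions to be verified against the INC documentation for your release.

    # Option 1 (sketch): exclude FSDPA from quantization by extending the
    # blocklist in the generated quantization config, for example:
    #
    #   "blocklist": {"types": ["fused_scaled_dot_product_attention"], "names": []}
    #
    # The key names and whether the op goes under "types" or "names" are
    # assumptions; verify them against the INC documentation.
    #
    # Option 2: fall back to the standard SDPA op so attention stays quantized.
    export VLLM_PROMPT_USE_FUSEDSDPA=0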

Basic Troubleshooting for OOM Errors

  • During development, when evaluating a model for FP8 inference on vLLM, you can skip the server warmup phase to achieve faster testing turnaround times by setting the VLLM_SKIP_WARMUP=true environment variable.

    Note

    Disable warmup only during development; keeping it enabled in production is highly recommended.

  • When using FP8 models, you may experience timeouts caused by the long compilation time of FP8 operations. To mitigate this, set the following environment variables:

    • VLLM_ENGINE_ITERATION_TIMEOUT_S to adjust the vLLM server timeout. You can set the value in seconds, e.g., 600 equals 10 minutes.

    • VLLM_RPC_TIMEOUT to adjust the RPC protocol timeout used by the OpenAI-compatible API. This value is in milliseconds, e.g., 600000 equals 10 minutes. A combined example is shown below.
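
The settings above can be combined as follows before starting the server; the timeout values are the illustrative ones from the list above.

    # Development only: skip warmup for faster turnaround (keep warmup enabled in production).
    export VLLM_SKIP_WARMUP=true
    # Extend timeouts to accommodate long FP8 compilation times.
    export VLLM_ENGINE_ITERATION_TIMEOUT_S=600   # seconds (10 minutes)
    export VLLM_RPC_TIMEOUT=600000               # milliseconds (10 minutes)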