FP8 Calibration and Inference with vLLM

This section describes the steps required to enable FP8 calibration and inference on the Intel® Gaudi® AI accelerator via vLLM using the Intel® Neural Compressor (INC) package. For more details about FP8 inference and INC, refer to Run Inference Using FP8.

Calibrating a Model

Running inference via vLLM on Gaudi with FP8 precision is achieved using INC. This approach requires a model calibration procedure that generates measurements, quantization files, and configurations. For more details, see Measurement and Quantization Mechanisms. The vllm-hpu-extension repository provides the calibrate_model.sh script, which uses INC to simplify model calibration. For the script usage and options, refer to the section below.

Note

  • For a full calibration procedure with the Meta-Llama-3.1-70B-Instruct model, refer to the FP8 Quantization and Inference using Intel® Neural Compressor (INC) tutorial.

  • The calibration procedure works with any dataset that contains system_prompt and question fields, which are used to prepare a calibration dataset with prompts formatted specifically for the chosen model. It is recommended to use the public dataset utilized by MLCommons in the Llama2-70b inference submission.

  • Since measurements are device-dependent, scales collected on Gaudi 3 cannot be used on Gaudi 2 accelerators. This mismatch may lead to accuracy issues.

  • If the following error occurs, set a valid tensor parallelism value, e.g., -t 8:

    RuntimeError: [Rank:0] FATAL ERROR :: MODULE:PT_DEVMEM Allocation failed for size::939524096 (896)MB
    

Options and Usage

To run the calibrate_model.sh script, follow the steps below:

  1. Build and install vllm-fork as described in the README for Gaudi.

  2. Clone the vllm-hpu-extension repository and move to the calibration subdirectory:

    cd /root
    git clone https://github.com/HabanaAI/vllm-hpu-extension.git -b v1.20.0
    cd vllm-hpu-extension/calibration
    
  3. Download the dataset and process it into a .pkl file using the download_dataset.sh script. To check that the processed file exposes the expected fields, see the inspection sketch after this list.

  4. Run the calibrate_model.sh script. Refer to the script options and run examples below. The script generates the maxabs_quant_g3.json file, which is used for FP8 inference.
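
To confirm that a processed dataset exposes the system_prompt and question fields mentioned in the note above, you can inspect it before calibration. The snippet below is a minimal sketch that assumes the processed file is a pickled pandas DataFrame named dataset-processed.pkl and that pandas is installed in your environment:

# Illustrative only: inspect the processed calibration dataset.
python3 -c "import pandas as pd; df = pd.read_pickle('dataset-processed.pkl'); print(df.columns.tolist()); print(df.head(1))"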

Options:

Option            Description
-h                Prints the help message.
-m <path/ID>      Sets the path to the model (if stored locally) or the model ID from the Hugging Face Hub.
-d <path>         Sets the path to the dataset in .pkl format.
-o <path>         Sets the path to the output directory.
-b <size>         Optional. Sets the batch size used for running the measurements (default: 32).
-l <samples>      Optional. Limits the number of samples in the calibration dataset.
-t <size>         Optional. Sets the tensor parallel size (default: 1).

Examples:

  • Calibrating the Meta-Llama-3.1-405B-Instruct model using a processed dataset:

    ./calibrate_model.sh -m /path/to/local/llama3.1/Meta-Llama-3.1-405B-Instruct/ -d dataset-processed.pkl -o /path/to/measurements/vllm-benchmarks/inc -b 128 -t 8 -l 4096
    
  • Calibrating the Hugging Face facebook/opt-125m model using a processed dataset:

    ./calibrate_model.sh -m facebook/opt-125m -d dataset-processed.pkl -o inc/
    
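After a successful run, the directory passed with -o should contain the generated maxabs_quant_g3.json file together with the collected measurement files, typically under a model-specific subdirectory (as reflected in the QUANT_CONFIG path in the next section). The exact layout may vary; the check below is only a sketch:

# Illustrative: verify that calibration produced the quantization config.
# The model subdirectory name is an example and may differ on your setup.
ls inc/opt-125m/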

Running FP8 Inference

Once the model calibration is completed and the measurements are collected, run FP8 inference with vLLM using the following command:

export QUANT_CONFIG=/path/to/quant/config/inc/meta-llama-3.1-405b-instruct/maxabs_quant_g3.json
vllm serve meta-llama/Llama-3.1-405B-Instruct --quantization inc --kv-cache-dtype fp8_inc --weights-load-device cpu --tensor-parallel-size 8

The following configurations are required for enabling FP8 inference:

  • The QUANT_CONFIG environment variable points to the maxabs_quant_g3.json quantization file generated by the calibrate_model.sh run, which references the collected measurements.

  • The --quantization inc and --kv-cache-dtype fp8_inc parameters enable FP8 quantization using INC and the configuration provided via QUANT_CONFIG.
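
Once the server is running, you can verify the FP8 deployment end to end by sending a request to the OpenAI-compatible endpoint. The request below is a minimal sketch and assumes the server listens on localhost at the default port 8000:

# Illustrative request against vLLM's OpenAI-compatible completions endpoint.
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "meta-llama/Llama-3.1-405B-Instruct", "prompt": "San Francisco is a", "max_tokens": 32}'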

Note

For an example of running FP8 inference on the Meta-Llama-3.1-70B-Instruct model, refer to the FP8 Quantization and Inference using Intel® Neural Compressor (INC) tutorial.

Basic Troubleshooting for OOM Errors

  • During development, when evaluating a model for FP8 inference on vLLM, you can skip the server warmup phase to shorten testing turnaround times. To do so, set the VLLM_SKIP_WARMUP=true environment variable.

    Note

    Disable warmup only during development; keeping it enabled in production is highly recommended.

  • When using FP8 models, you may experience timeouts caused by the long compilation time of FP8 operations. To mitigate this, set the following environment variables, as shown in the sketch after this list:

    • VLLM_ENGINE_ITERATION_TIMEOUT_S to adjust the vLLM server timeout. The value is in seconds, e.g., 600 equals 10 minutes.

    • VLLM_RPC_TIMEOUT to adjust the RPC protocol timeout used by the OpenAI-compatible API. The value is in milliseconds, e.g., 600000 equals 10 minutes.
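
As a sketch, both timeouts can be extended to 10 minutes before starting the server; the values below are examples and should be tuned to your deployment:

# Illustrative timeout settings; adjust the values as needed.
export VLLM_ENGINE_ITERATION_TIMEOUT_S=600   # seconds (10 minutes)
export VLLM_RPC_TIMEOUT=600000               # milliseconds (10 minutes)
# During development only, warmup can also be skipped to shorten turnaround:
# export VLLM_SKIP_WARMUP=true
# Then start the server with the vllm serve command shown above.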