FP8 Calibration and Inference with vLLM

This section provides the steps required to enable FP8 calibration and inference on the Intel® Gaudi® AI accelerator via vLLM using the Intel® Neural Compressor (INC) package. For more details about FP8 inference and INC, refer to Run Inference Using FP8.

Calibrating a Model

Running inference via vLLM on Gaudi with FP8 precision is achieved using INC. This approach requires a model calibration procedure that generates measurements, quantization files, and configurations. For more details, see Measurement and Quantization Mechanisms. The vllm-hpu-extension repository provides the calibrate_model.sh script, which uses INC to simplify model calibration. For the script usage and options, refer to the section below.

Note

  • For a full calibration procedure with the Meta-Llama-3.1-70B-Instruct model, refer to the FP8 Quantization and Inference using Intel® Neural Compressor (INC) tutorial.

  • The calibration procedure works with any dataset that contains system_prompt and question fields. These fields are used to prepare a calibration dataset with prompts formatted specifically for the chosen model. It is recommended to use the public dataset utilized by MLCommons in the Llama2-70b inference submission. A quick way to check these fields is shown after this note.

  • Since measurements are device-dependent, scales collected on Gaudi 3 cannot be used on Gaudi 2 accelerators. This mismatch may lead to accuracy issues.

  • If the following error occurs, set a valid tensor parallelism value, e.g., -t 8:

    RuntimeError: [Rank:0] FATAL ERROR :: MODULE:PT_DEVMEM Allocation failed for size::939524096 (896)MB
    

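To quickly verify that a processed dataset exposes the system_prompt and question fields mentioned above, you can inspect the .pkl file before running the calibration. This is only a sketch: it assumes the processed dataset is a pickled pandas DataFrame, as in the MLCommons Llama2-70b dataset, and that pandas is installed in the environment.

    # Sanity-check sketch: assumes the processed dataset is a pickled pandas
    # DataFrame and that pandas is available in the environment.
    python3 -c "import pandas as pd; df = pd.read_pickle('dataset-processed.pkl'); print(df.columns.tolist()); print('rows:', len(df))"

The printed column list should include system_prompt and question.
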
Options and Usage

To run the calibrate_model.sh script, follow the steps below:

  1. Build and install vllm-fork as described in the README for Gaudi.

  2. Clone the vllm-hpu-extension repository and move to the calibration subdirectory:

    cd /root
    git clone https://github.com/HabanaAI/vllm-hpu-extension.git -b v1.21.0
    cd vllm-hpu-extension/calibration
    
  3. Download and process the dataset .pkl file by using the download_dataset.sh script.

  4. Run the calibrate_model.sh script. Refer to the script options and run examples below. The script generates the maxabs_quant_g3.json file, which is used for FP8 inference.

Options:

Option         Description
-h             Prints the help message.
-m <path/ID>   Sets the path to the model (if stored locally) or the model ID from the Hugging Face library.
-d <path>      Sets the path to the dataset in .pkl format.
-o <path>      Sets the path to the output directory.
-b <size>      Optional. Sets the batch size used for running the measurements (default: 32).
-l <samples>   Optional. Sets the limit on the number of samples in the calibration dataset.
-t <size>      Optional. Sets the tensor parallel size (default: 1).

Examples:

  • Calibrating the Meta-Llama-3.1-405B-Instruct model using a processed dataset:

    ./calibrate_model.sh -m /path/to/local/llama3.1/Meta-Llama-3.1-405B-Instruct/ -d dataset-processed.pkl -o /path/to/measurements/vllm-benchmarks/inc -b 128 -t 8 -l 4096
    
  • Calibrating the Hugging Face facebook/opt-125m model using a processed dataset:

    ./calibrate_model.sh -m facebook/opt-125m -d dataset-processed.pkl -o inc/
    

Running FP8 Inference

Once the model calibration is completed and the measurements are collected, run FP8 inference with vLLM using the following command:

export QUANT_CONFIG=/path/to/quant/config/inc/meta-llama-3.1-405b-instruct/maxabs_quant_g3.json
vllm serve meta-llama/Llama-3.1-405B-Instruct --quantization inc --kv-cache-dtype fp8_inc --weights-load-device cpu --tensor-parallel-size 8

The following configurations are required for enabling FP8 inference:

  • The QUANT_CONFIG environment variable points to the measurements in the maxabs_quant_g3.json quantization file generated by the calibrate_model.sh execution.

  • The --quantization inc and --kv-cache-dtype fp8_inc parameters enable FP8 quantization using INC and QUANT_CONFIG.
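
Once the server is up, FP8 inference can be verified end to end by sending a request to the OpenAI-compatible endpoint. The sketch below assumes the default port 8000 and the model name used in the serve command above.

    # Simple end-to-end check against the OpenAI-compatible completions endpoint.
    curl http://localhost:8000/v1/completions \
        -H "Content-Type: application/json" \
        -d '{"model": "meta-llama/Llama-3.1-405B-Instruct", "prompt": "FP8 quantization is", "max_tokens": 32}'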

Note

For an example of running FP8 inference on the Meta-Llama-3.1-70B-Instruct model, refer to the FP8 Quantization and Inference using Intel® Neural Compressor (INC) tutorial.

Reducing vLLM FP8 Warmup Time

vLLM warmup time for FP8 models is significantly longer than for BF16 due to additional graph compilations triggered by varying constant scale values in quantized model layers.

FP8 warmup time can be reduced by setting the RUNTIME_SCALE_PATCHING=1 environment variable and selecting a hardware-aligned per-tensor scale_method in the INC JSON config, as shown in the sketch below.
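
For example, the serve command from the previous section could be started as follows. This is a sketch only: maxabs_hw is used here as an assumed example of a hardware-aligned per-tensor scale method; check the INC documentation for the scale methods available in your release.

    # Experimental: reduce FP8 warmup time (Lazy mode only).
    # The quantization config pointed to by QUANT_CONFIG is assumed to set a
    # hardware-aligned per-tensor scale method, e.g. "scale_method": "maxabs_hw".
    export PT_HPU_LAZY_MODE=1
    export RUNTIME_SCALE_PATCHING=1
    export QUANT_CONFIG=/path/to/quant/config/inc/meta-llama-3.1-405b-instruct/maxabs_quant_g3.json
    vllm serve meta-llama/Llama-3.1-405B-Instruct --quantization inc --kv-cache-dtype fp8_inc --weights-load-device cpu --tensor-parallel-size 8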

Note

  • This is an experimental feature. It reduces FP8 warmup time but may lower model throughput by 5-10%. Future releases will improve performance and extend support to more ops.

  • Available only with Lazy mode (PT_HPU_LAZY_MODE=1). Support for torch.compile will be added in subsequent releases.

  • Supports Llama workloads that use FP8 execution of Linear layers and casting ops between BF16 and FP8. MoE and Convolution ops are not yet supported.

  • FSDPA on Gaudi 2 has a known accuracy issue when used with vLLM for single-card Llama workloads. To bypass this issue:

    • Exclude fused_scaled_dot_product_attention from INC quantization by adding it to the blocklist as described in the INC JSON config. This forces FSDPA to run in higher precision. See the sketch after this note.

    • Set VLLM_PROMPT_USE_FUSEDSDPA=0 to use the standard scaled_dot_product_attention op instead, enabling quantized execution.
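
A minimal sketch of these workarounds is shown below. The blocklist field follows the INC FP8 JSON configuration format, but treat the exact key names, and whether the op belongs under types or names, as assumptions to be verified against the INC documentation for your release.

    # Option 1 (sketch): exclude FSDPA from quantization by extending the
    # blocklist in the generated quantization config, for example:
    #
    #   "blocklist": {"types": ["fused_scaled_dot_product_attention"], "names": []}
    #
    # The key names and whether the op goes under "types" or "names" are
    # assumptions; verify them against the INC documentation.
    #
    # Option 2: fall back to the standard SDPA op so attention stays quantized.
    export VLLM_PROMPT_USE_FUSEDSDPA=0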

Basic Troubleshooting for OOM Errors

  • During development, when evaluating a model for FP8 inference on vLLM, you can skip the server warmup phase to achieve faster testing turnaround times by setting the VLLM_SKIP_WARMUP=true environment variable.

    Note

    Disable warmup only during development; keeping it enabled in production is highly recommended.

  • When using FP8 models, you may experience timeouts caused by the long compilation time of FP8 operations. To mitigate this, set the following environment variables:

    • VLLM_ENGINE_ITERATION_TIMEOUT_S to adjust the vLLM server timeout. You can set the value in seconds, e.g., 600 equals 10 minutes.

    • VLLM_RPC_TIMEOUT to adjust the RPC protocol timeout used by the OpenAI-compatible API. This value is in milliseconds, e.g., 600000 equals 10 minutes. A combined example is shown below.
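
The settings above can be combined as follows before starting the server; the timeout values are the illustrative ones from the list above.

    # Development only: skip warmup for faster turnaround (keep warmup enabled in production).
    export VLLM_SKIP_WARMUP=true
    # Extend timeouts to accommodate long FP8 compilation times.
    export VLLM_ENGINE_ITERATION_TIMEOUT_S=600   # seconds (10 minutes)
    export VLLM_RPC_TIMEOUT=600000               # milliseconds (10 minutes)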