Inference Using FP8

This guide provides the steps required to enable FP8 inference on your Intel® Gaudi® 2 AI accelerator. When running inference on large language models (LLMs), high memory usage is often the bottleneck. Using the FP8 data type for inference therefore halves the required memory bandwidth. In addition, FP8 compute is twice as fast as BF16 compute, so even compute-bound workloads, such as offline inference on large batch sizes, benefit.

Quantization Toolkit

Intel Gaudi provides a Quantization Toolkit (HQT) containing model measurement and quantization capabilities for FP8 in PyTorch models running on Gaudi 2. The toolkit provides these capabilities for models that include the modules listed in Supported Modules. The Quantization Toolkit is optimized for Gaudi 2 by:

  • Using PyTorch custom ops that allow fusion and optimizations at the Intel Gaudi software graph level.

  • Using specific scale values that have dedicated acceleration on Gaudi 2.

  • Using efficient memory loading to the device, loading the weights one by one and immediately converting them to FP8 so that large models fit on the device.

To enable and run the Quantization Toolkit:

  • Add the required code modifications to your model script.

  • Run HQT in measurement mode to measure statistics and calculate scales based on the measurements.

  • Run HQT in quantization mode to automatically quantize the model to FP8 where possible.

  • Enable verbose logging to print the patched modules.

Note

HQT also supports DeepSpeed models.

Measurement and Quantization Mechanisms

The Quantization Toolkit measures statistics, calculates scales based on the measurements, and quantizes the model to FP8 where possible. HQT runs in both measurement and quantization modes using the same API call, which requires minimal changes to your code. The QUANT_CONFIG environment variable controls HQT and determines which mode runs.

  • Measurement mode - In this mode, HQT measures the statistics of the data flowing through the model. This is achieved by replacing the Supported Modules in the model before it runs on the dataset. It measures data statistics, such as the maximum absolute value (max abs), and outputs these statistics into a file. The purpose of this mode is to measure and store the data statistics relevant for quantization mode.

  • Quantization mode - Once the data statistics have been measured and saved, the model can be quantized. In this mode, HQT prepares the model to run in FP8. This includes loading the measurements file, calculating the scale of each tensor from its measurement, and injecting scale and cast operations around the operations selected to run in FP8. The purpose of this mode is to modify the model so that it can run using the FP8 data type, which can improve its performance; a conceptual sketch of this scale-and-cast mechanism is shown after this list.
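
The following is a minimal, illustrative sketch of the per-tensor scale-and-cast mechanism described above, assuming a maxabs-based scale. It is not HQT's internal implementation; the FP8 full-scale constant and the helper names are assumptions for demonstration only.

import torch

FP8_FULL_SCALE = 240.0  # assumed FP8 full-scale value; for illustration only

def per_tensor_scale(max_abs: float) -> float:
    # Stretch/compress the measured max-abs statistic to the FP8 full scale.
    return max_abs / FP8_FULL_SCALE

def scale_and_cast(t: torch.Tensor, scale: float) -> torch.Tensor:
    # Mimics the scale and cast operations injected around an op running in FP8:
    # scale down, clamp to the representable range, then scale back up.
    q = torch.clamp(t / scale, min=-FP8_FULL_SCALE, max=FP8_FULL_SCALE)
    return q * scale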

Supported Modules

The Linear nn.Module is supported and replaced by HQT during quantization. In DeepSpeed, the LinearAllreduce, LinearLayer, and LmHeadLinearAllreduce modules are supported. In Transformers, FalconLinear is supported. In Diffusers, the LoRACompatibleLinear and LoRACompatibleConv modules are partially supported: they can be quantized only when their lora_layer member is not used. If your model contains these modules with lora_layer in use, add them to the blocklist field in the json config file.

Supported Functions

In addition to the above supported modules, torch.matmul and torch.nn.functional.softmax are supported and replaced by HQT during quantization. To enable this, wrap each function call in an nn.Module, as shown in the Matmul example below; a similar wrapper for softmax follows it:

import torch
import torch.nn as nn

# Wrapping torch.matmul in an nn.Module allows HQT to patch it during quantization.
class Matmul(nn.Module):
    def __init__(self):
        super().__init__()

    def forward(self, *args, **kwargs):
        return torch.matmul(*args, **kwargs)
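
Similarly, torch.nn.functional.softmax can be wrapped. The sketch below adds such a wrapper and a hypothetical module that calls the wrapped Matmul and Softmax in place of the bare functions so that HQT can patch them; the surrounding AttentionScores module and its shapes are illustrative assumptions, not code from the toolkit.

import torch
import torch.nn as nn
import torch.nn.functional as F

class Softmax(nn.Module):
    def __init__(self):
        super().__init__()

    def forward(self, x, dim=None, **kwargs):
        return F.softmax(x, dim=dim, **kwargs)

# Hypothetical usage: the wrapped modules are called instead of the bare
# functions so that HQT can replace them during measurement and quantization.
class AttentionScores(nn.Module):
    def __init__(self):
        super().__init__()
        self.matmul = Matmul()   # the Matmul wrapper defined above
        self.softmax = Softmax()

    def forward(self, q, k):
        scores = self.matmul(q, k.transpose(-2, -1))
        return self.softmax(scores, dim=-1)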

Custom Patched Modules

Custom modules, such as KVCache and ScopedLinearAllReduce, were added to model topologies, and corresponding quantized modules were added to HQT, allowing more of the code in the user script to be quantized, for example in the LLAMAv2 model in the Optimum Habana repo. In that model, KVCache and ScopedLinearAllReduce replace the existing code.
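
As an illustration of this pattern only (not the actual KVCache implementation from the Optimum Habana repo), wrapping the key-value cache update in a dedicated nn.Module gives HQT a module boundary that it can measure and quantize:

import torch
import torch.nn as nn

# Hypothetical sketch -- the real KVCache module in Optimum Habana differs.
class SimpleKVCache(nn.Module):
    def forward(self, past: torch.Tensor, current: torch.Tensor, dim: int = -2):
        # Append the current key/value states to the cached ones.
        return torch.cat((past, current), dim=dim)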

Enabling and Running HQT in PyTorch Models

The Quantization Toolkit, habana_quantization_toolkit, is installed with the Intel Gaudi PyTorch package. See the Installation Guide and On-Premise System Update for more details. See LLAMAv2 model in Optimum Habana repo for an example model using HQT.

Follow the below steps to prepare your model script using HQT; a combined end-to-end sketch is shown after the steps:

  1. Call hpu_set_env() to enable inference optimizations:

    import habana_frameworks.torch.core as htcore
    htcore.hpu_set_env()
    
  2. Call prep_model(model) after loading the model to set up HQT. prep_model(model) replaces the supported modules with measurement or quantized modules, depending on the mode:

    from quantization_toolkit import habana_quantization_toolkit
    habana_quantization_toolkit.prep_model(model)
    

    If DeepSpeed is used, call prep_model(model) after init_inference. The below shows a full initialization example with DeepSpeed:

    import habana_frameworks.torch.core as htcore
    htcore.hpu_set_env()
    
    model = deepspeed.init_inference(model, **ds_inference_kwargs)
    model = model.module
    
    from quantization_toolkit import habana_quantization_toolkit
    habana_quantization_toolkit.prep_model(model)
    
  3. The below call is also required in any inference scenario. It enables weights constant folding in the Intel Gaudi software:

    htcore.hpu_initialize(model)
    
  4. At the end of the run, call finish_measurements, which saves the measurements to a file. This call has no effect in quantization mode:

    habana_quantization_toolkit.finish_measurements(model)
    
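Putting the steps together, the following is a minimal end-to-end sketch of a run prepared for HQT. The model choice, loading via transformers, and the prompt are illustrative assumptions; the actual mode (measurement or quantization) is selected by the QUANT_CONFIG environment variable described in the next sections.

import torch
import habana_frameworks.torch.core as htcore

htcore.hpu_set_env()  # step 1: enable inference optimizations

# Hypothetical model loading via transformers -- any PyTorch model containing
# supported modules can be used instead.
from transformers import AutoModelForCausalLM, AutoTokenizer
model_name = "meta-llama/Llama-2-7b-hf"  # assumed model name for illustration
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = model.eval().to("hpu")

from quantization_toolkit import habana_quantization_toolkit
habana_quantization_toolkit.prep_model(model)  # step 2: patch supported modules

htcore.hpu_initialize(model)  # step 3: enable weights constant folding

inputs = tokenizer("Hello, my name is", return_tensors="pt").to("hpu")
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

habana_quantization_toolkit.finish_measurements(model)  # step 4: save measurements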

Running HQT in Measurement Mode

  1. Create a config json file for measurement, for example maxabs_measure.json. The json file is loaded by HQT and used to configure it. Refer to Supported JSON Config File Options for more information. Use the following json example:

    {
        "method": "HOOKS",
        "mode": "MEASURE",
        "observer": "maxabs",
        "allowlist": {"types": [], "names":  []},
        "blocklist": {"types": [], "names":  []},
        "dump_stats_path": "./hqt_output/measure",
        "dump_stats_xlsx_path": "./hqt_output/measure/fp8stats.xlsx"
    }
    
  2. Run measurement on your model by setting the QUANT_CONFIG environment variable with the json file path in the model run command:

    QUANT_CONFIG=maxabs_measure.json <model run command>
    

Running HQT in Quantization Mode

  1. Create a config json file for quantization, for example maxabs_quant.json. The json file is loaded by HQT and used to configure it. Refer to Supported JSON Config File Options for more information. Note that the dump_stats_path attribute must use the same path as in the measurement json file. Use the following json example for per-tensor quantization:

    {
        "method": "HOOKS",
        "mode": "QUANTIZE",
        "observer": "maxabs",
        "scale_method": "maxabs_hw",
        "allowlist": {"types": [], "names":  []},
        "blocklist": {"types": [], "names":  ["lm_head"]},
        "dump_stats_path": "./hqt_output/measure",
        "dump_stats_xlsx_path": "./hqt_output/measure/fp8stats.xlsx"
    }
    

    Alternatively, use this json example for weights-per-channel, activations-per-tensor quantization:

    {
        "method": "HOOKS",
        "mode": "QUANTIZE",
        "observer": "maxabs",
        "scale_method": "ACT_MAXABS_POW2_WEIGHTS_PCS_OPT_POW2",
        "whitelist": {"types": [], "names":  []},
        "blacklist": {"types": [], "names":  []},
        "dump_stats_path": "./hqt_output/measure",
        "dump_stats_xlsx_path": "./hqt_output/measure/fp8stats.xlsx"
    }
    
  2. Run your quantized model by setting the QUANT_CONFIG environment variable with the path to the json file in the model run command:

    QUANT_CONFIG=maxabs_quant.json <model run command>
    
  3. Set the QUANT_VERBOSE=1 environment variable to print the status of the modules that HQT patched, in addition to more debug prints. Search for “Patched modules” in the printed output.

Supported JSON Config File Options

The json config file supports the following attributes. Each attribute is described below along with its supported values and defaults:

Method

The mechanism to perform measurement and quantization. This is a mandatory attribute.

  • HOOKS - Replaces PyTorch nn.Modules with measurement modules or quantized modules.

Mode

The mode, measurement or quantization, in which HQT runs.

  • MEASURE - Measure statistics of all modules and emit the results to dump_stats_path.

  • QUANTIZE (default) - Quantize and run the model according to the provided measurements.

Observer

The observer to measure the statistics.

  • maxabs (default)

  • save - Saves all tensors to files.

Allowlist

List of nn.Module names or types to quantize. When an empty list is set, all supported modules are quantized by default. See Supported Modules. Not setting this attribute at all is not recommended, as it sets the allowlist to these modules only: torch.nn.Linear, torch.nn.Conv2d, and BMM.

Default = empty list

Blocklist

List of nn.Module names or types not to quantize. This defaults to an empty list, so you may omit it from the config file.

Default = empty list

dump_stats_path

The path for saving and loading the measurements. The directory portion of the path (up to the last “/”) is created if it does not exist. The string after the last “/” is used as a prefix for all the measurement files that are created.

Default = stats

dump_stats_xlsx_path

Path to dump an Excel file containing statistics for analysis. Relevant only in MEASURE mode.

Default = stats.xlsx

scale_method

The method for calculating the scale from the measurement.

  • without_scale (default) - Convert to/from FP8 without scaling.

  • unit_scale - Always use scale of 1.

  • maxabs_hw - Scale is calculated to stretch/compress the maxabs measurement to the full-scale of FP8 and then aligned to the corresponding HW accelerated scale.

  • maxabs_pow2 - Scale is calculated to stretch/compress the maxabs measurement to the full-scale of FP8 and then rounded to a power of 2.

  • maxabs_hw_opt_weight - Scale of model params (weights) is chosen, from all possible HW accelerated scales, as the scale that provides the minimal mean squared error between the quantized and non-quantized weights. Scale of activations is calculated the same as maxabs_hw.

  • act_maxabs_pow2_weights_pcs_opt_pow2 - Scale of model params (weights) is calculated per-channel of the params tensor. The scale per-channel is calculated the same as maxabs_hw_opt_weight. Scale of activations is calculated the same as maxabs_pow2.

  • act_maxabs_hw_weights_pcs_maxabs_pow2 - Scale of model params (weights) is calculated per-channel of the params tensor. The scale per-channel is calculated the same as maxabs_pow2. Scale of activations is calculated the same as maxabs_hw.

measure_exclude

If this attribute is not defined, the default is OUTPUT. Since most models do not require measuring output tensors, excluding them speeds up the measurement process. If you are using the Softmax module, make sure output measurement is enabled (set measure_exclude to NONE) to quantize your model. Otherwise, quantization will fail.

  • NONE - All tensors are measured.

  • OUTPUT (default) - Excludes measurement of output tensors.