Run Inference Using FP8¶

This guide provides the steps required to enable FP8 inference on your Intel® Gaudi® 2 and Intel® Gaudi® 3 AI accelerator.

Using FP8 data type for inference on large language models halves the required memory bandwidth compared to BF16. In addition, FP8 compute is twice as fast as BF16 compute, so even compute-bound workloads, such as offline inference on large batch sizes, benefit.

To see an example of FP8 inference with Hugging Face, refer to the Hugging Face text-generation example. Follow the full instructions to set up the Optimum for Intel Gaudi library and text-generation.

Note

Typical LLMs such as Llama2-70b, Llama2-7b, Llama3-70b, Llama3-8b, Mixtral-8x7B, Falcon-7B, Falcon-40B, Falcon-180B and Phi-2 have been validated with FP8 using the INC.

The Intel® Neural Compressor (INC) performs model measurement and quantization for FP8 in PyTorch models with Gaudi. It provides these capabilities for models that include the modules listed in Supported Modules.

To enable and run INC with Gaudi:

Modify your model script. For examples, refer to Enabling and Running INC in PyTorch Models section below.
Run INC in measurement mode to measure statistics and calculate scales based on the measurements.
Run INC in quantization mode to automatically quantize the model to FP8 where possible.

Note

INC also supports DeepSpeed models.

Measurement and Quantization Mechanisms¶

INC measures statistics, calculates scales based on the measurements, and quantizes the model to FP8 where possible. It runs in both measurement and quantization modes using the prepare and convert APIs:

Measurement mode - In this mode, INC measures the statistics of the data flowing through the model. This is achieved by replacing the Supported Modules in the model before it runs on the dataset. It measures data statistics, such as the maximum absolute value (max abs), and outputs these statistics into a file. The purpose of this mode is to measure and store the data statistics relevant for quantization mode.
Quantization mode - Once the data statistics have been measured and saved, the model can be quantized. In this mode, INC prepares the model to run in FP8. This includes loading the measurements file, calculating the scale of each tensor from its measurement, and injecting scale and cast operations into the model around operations that were selected to run in FP8. The purpose of this mode is to modify the model so that it can run using the FP8 data type, which can improve the performance of the model.

Enabling and Running INC in PyTorch Models¶

See the LlaMA 2 model for an example model quantization using INC. To run INC with your own pytorch model scripts, follow the steps below:

Call hpu_set_env() to enable inference optimizations:

import habana_frameworks.torch.core as htcore
htcore.hpu_set_env()

Call import FP8Config, convert, prepare after loading the model to set up INC. The prepare API prepares the model for measurement by replacing nn.Modules while the convert API quantizes the model. Load the INC JSON config by passing the path to FP8Config.from_json_file(args.config_file_path):
from neural_compressor.torch.quantization import FP8Config, convert, prepare config = FP8Config.from_json_file(args.config_file_path) if config.measure: model = prepare(model, config) elif config.quantize: model = convert(model, config)
Note

You should add code to parse the --config_file_path flag in your script and pass it to from_json_file(args.config_file_path) as shown above. The QUANT_CONFIG environment variable is no longer supported for loading the JSON config file, unless you parse it in your script and pass it to the from_json_file function.

If DeepSpeed is used, INC should be called after deepspeed.init_inference. The below shows a full initialization with DeepSpeed example:
import habana_frameworks.torch.core as htcore htcore.hpu_set_env() model = deepspeed.init_inference(model, **ds_inference_kwargs) model = model.module from neural_compressor.torch.quantization import FP8Config, convert, prepare config = FP8Config.from_json_file(args.config_file_path) if config.measure: model = prepare(model, config) elif config.quantize: model = convert(model, config)
The below call is also required in any inference scenario. It enables passing scales as constants to the Intel Gaudi software to allow compile-time optimizations:
htcore.hpu_initialize(model, mark_scales=True, mark_non_scales=False)
At the end of the run, call finalize_calibration which saves the measurements to file. This call does not affect quantization mode:
from neural_compressor.torch.quantization import finalize_calibration finalize_calibration(model)

Running INC in Measurement Mode¶

Create a config JSON file for measurement. The JSON is loaded by INC. Refer to Supported JSON Config File Options for more information. Use the following JSON example:
{ "mode": "MEASURE", "observer": "maxabs", "dump_stats_path": "./inc_output/measure" }
Run measurement on your model with the JSON file path in the model run command:
<model run command> --config_file_path <path to json file>

Running INC in Quantization Mode¶

Create a config JSON file for quantization. The JSON file is loaded by INC. Refer to Supported JSON Config File Options for more information. Note that the dump_stats_path attribute should have the same path used in the measurement JSON file. Use the following JSON example for per-tensor quantization:

{
    "mode": "QUANTIZE",
    "observer": "maxabs",
    "scale_method": "maxabs_hw",
    "blocklist": {"types": [], "names":  ["lm_head"]},
    "dump_stats_path": "./inc_output/measure"
}

Alternatively, use this JSON example for weights-per-channel, activations-per-tensor quantization:

{
    "mode": "QUANTIZE",
    "observer": "maxabs",
    "scale_method": "ACT_MAXABS_POW2_WEIGHTS_PCS_OPT_POW2",
    "dump_stats_path": "./inc_output/measure"
}

Run your quantized model with the path to the JSON file in the model run command. PT_HPU_WEIGHT_SHARING=0 is required to free the full precision weights from the device and ensure only the FP8 weights are stored:
PT_HPU_WEIGHT_SHARING=0 <model run command> --config_file_path <path to json file>
Set the LOG_LEVEL_INC=1 environment variable to print the status of the patched modules which INC replaced, in addition to more debug prints. Search for “Patched modules” in the printed output.

Supported Modules¶

Linear nn.Module is supported and replaced by INC during quantization. In DeepSpeed, LinearAllreduce, LinearLayer, and LmHeadLinearAllreduce modules are supported. In Transformers, FalconLinear is supported. In diffusers, LoRACompatibleLinear and LoRACompatibleConv modules are supported partially. They can be applied in quantization only when their lora_layer member is not used. In case your model contains those modules with lora_layer used, add them to blocklist field in the json config file.

Note

Supported modules should be wrapped in a top level module nn.Module:

class Model(nn.Module):
    def __init__(self):
        super().__init__()

    def forward(self, *args, **kwargs):
        return nn.Linear(*args, **kwargs)

Supported Functions¶

In addition to the above supported modules, torch.matmul and torch.nn.functional.softmax are supported and replaced by INC during quantization. You can simply wrap each function with an nn.module. See the below example:

class Matmul(nn.Module):
    def __init__(self):
        super().__init__()

    def forward(self, *args, **kwargs):
        return torch.matmul(*args, **kwargs)

torch.nn.functional.scaled_dot_product_attention() is also supported and can be replaced during quantization. You can simply wrap each function with ModuleFusedSDPA(torch.nn.Module). See ModuleFusedSDPA for code usage example.

Note

For FusedSDPA:

Recompute mode should be enabled as detailed in hpex/kernels/FusedSDPA.
In the quantization JSON config file, the supported scale method is maxabs_hw. For more details, see Supported JSON Config File Options.

Custom Patched Modules¶

Custom modules were added to topologies and corresponding quantized modules to INC allowing quantization of more code in the user script, for example in the LlaMA 2 model. The following replaces the existing code: KVCache and ScopedLinearAllReduce.

Supported JSON Config File Options¶

The following table summarizes the options for the JSON config file:

Attribute	Description	Values
Mode	The mode, measure or quantize, to run INC with.	MEASURE - Measure statistics of all modules and emit the results to `dump_stats_path`. QUANTIZE (default) - Quantize and run the model according to the provided measurements.
Observer	The observer to measure the statistics.	maxabs (default) save - Saves all tensors to files.
Allowlist	List of nn.Module names or types to quantize. When setting an empty list, all the supported modules will be quantized by default. See Supported Modules.	Default = empty list
Blocklist	List of nn.Module names or types not to quantize. Defaults to empty list, so you may omit it from the config file.	Default = empty list
dump_stats_path	The path to save and load the measurements. The path is created up until the level before last “/”. The string after the last / will be used as prefix to all the measurement files that will be created.	Default = stats
scale_method	The method for calculating the scale from the measurement.	`unit_scale` (default) - Always use scale of 1. `maxabs_arbitrary` - Scale is calculated to stretch/compress the maxabs measurement to the full-scale of FP8. `maxabs_hw` - Scale is calculated to stretch/compress the maxabs measurement to the full-scale of FP8 and then replace it by an appropriate HW accelerated scale. `device_for_scales` parameter determines the possible exponent-bias values. These values are automatically used instead of their corresponding scales to optimize timing. `maxabs_pow2` - Scale is calculated to stretch/compress the maxabs measurement to the full-scale of FP8 and then rounded to the power of 2. `maxabs_hw_opt_weight` - Scale of model params (weights) is chosen as the scale that provides minimal mean-square-error between quantized and non-quantized weights, from all possible HW accelerated scales. Scale of activations is calculated the same as `maxabs_hw`. `act_maxabs_pow2_weights_pcs_opt_pow2` - Scale of model params (weights) is calculated per-channel of the params tensor. The scale per-channel is calculated the same as `maxabs_hw_opt_weight`. Scale of activations is calculated the same as `maxabs_pow2`. `act_maxabs_hw_weights_pcs_maxabs_pow2` - Scale of model params (weights) is calculated per-channel of the params tensor. The scale per-channel is calculated the same as `maxabs_pow2`. Scale of activations is calculated the same as `maxabs_hw`.
measure_exclude	If this attribute is not defined, the default is `OUTPUT`. Since most models do not require measuring output tensors, you can exclude it to speed up the measurement process.	`NONE` - All tensors are measured. `OUTPUT` (default) - Excludes measurement of output tensors.
scale_format	The format of scales passed to custom PyTorch ops for quantization (such as `torch.ops.hpu.fp8_gemm_v2`). The default is `scalar`.	`const` - Scales passed as tensors. `scalar` - Scales passed as scalar values. This format enables compile time and throughput optimizations. See Compile Time and Throughput Optimization.
device_for_scales	Set the exponent-bias values of the device, that may be chosen in converting a high-precision tensor (FP32 or BF16) to FP8-143 tensor. By default, the parameter value is set to the device in use. This parameter is used exclusively for the `maxabs_hw` scale method.	`GAUDI3` - In Gaudi 3, the exponent-bias range is expanded to (0, 63) to enhance accuracy. `GAUDI2` - In Gaudi 2 there are 4 possible exponent-bias values (3, 7, 11, 15), where 7 is the default exponent bias. This option is utilized in the Gaudi 3 device to minimize compilation time by decreasing the number of distinct graphs created.

Configuring Backoff Factors¶

The maxabs measurement based scaling methods support configuring backoff factors input_backoff and weight_backoff to leave some margin in the conversion of inputs and weights to FP8, respectively. For example, to account for a case where an activation with a larger absolute value than in the calibration dataset is encountered. The maxabs measured value of an activation is scaled to input_backoff*FP8_143_FULLSCALE. Similarly, the maxabs value of a weight is scaled to weight_backoff*FP8_143_FULLSCALE. The default values are input_backoff=0.25 and weight_backoff=0.5. To change these values, add the following to the quantization configuration JSON file:

"scale_params": {"input_backoff": <INPUT_BACKOFF>, "weight_backoff": <WEIGHT_BACKOFF>},

Compile Time and Throughput Optimization¶

The "scale_format": "scalar" configuration setting activates the below optimizations:

Improves overall compile time in FP8 inference by reducing the number of recipes compiled.
Reduces host time spent on launch logic in FP8. This optimization improves throughput when device time is host bound (for example if batch size is low).

Note

Improving overall compile time depends on model properties such as the number of compiled recipes and the distribution of scale values.
This mode is not relevant for PCQ.

Running Quantized Models on a Small Amount of Cards¶

Due to common memory limitations, large BF16 models do not fit on one Gaudi card, however, using FP8 precision allows large models to fit on one card. Since measurement is calculated in BF16 precision, to be able to run FP8 model on less cards than the amount used during measurement of a BF16 model, you can use the unify_measurements script, located in the Optimum for Intel Gaudi GitHub repository. The script unifies the measurements according to the grouping of cards specified.

Measure the model on a number of cards that are enough for the model to fit in BF16.
Run unify_measurements script using the measurement files created after running first step. A unified measurement is then calculated:
python unify_measurements.py -g 01234567 -m *path_to_8x_measurements* -o *path_to_output_1x_measurement*
In the above example, the measurements of cards 0-7 will be unified to a single measurement. For example, if you specify -g 0123 4567, cards 0-3 and cards 4-7 will be unified in two different measurement files:
python unify_measurements.py -g 0123 4567 -m *path_to_8x_measurements* -o *path_to_output_1x_measurement*
All different group combinations are supported. For further details, use $ unify_measurements.py --help.
Run quantization as detailed in Running INC in Quantization Mode using the unified measurement file/s.

Note

If you run the model on more than one card, set LOCAL_RANK and WORLD_SIZE environment variables properly. The environment variables are already set in DeepSpeed.

Generating INC Log Files for Gaudi¶

Use the LOG_LEVEL_INC environment variable to run INC with logging. The default logging value for INC is Info (2). All log messages with logging value >= 2 are written to the log file.
Set the HABANA_LOGS environment variable to $HOME/.habana_logs directory. When running on one card, an inc_log.txt file is placed under this directory. When running on multiple cards, each card will have its own directory, represented by its number, containing the log file. To print log messages to the screen, set ENABLE_CONSOLE=true.
To change the logging value, use the LOG_LEVEL_INC environment variable with the desired logging level number, for example: LOG_LEVEL_INC=3. The possible logging values with their representative numbers are detailed in Using Log Levels. If LOG_LEVEL_INC is not set, the LOG_LEVEL_ALL environment variable will also generate INC logs.
To disable INC logging, use LOG_LEVEL_INC=6.

In case of massive INC logging, more INC log files will be generated once the maximum log file size is reached: inc_log.txt.1, inc_log.txt.2. The default log file size is 10MB (10 * 1024 * 1024). To change the default, set the size using the INC_LOG_FILE_SIZE environment variable. The default amount of backup log files is 5 (total of 6 log files). To change the default, set the amount using the INC_LOG_FILE_AMOUNT environment variable. The latest log messages will be shown in the inc_log.txt file, while previous log messages will be shown in inc_log.txt.1 and so on.

Reducing FP8 Warmup Time¶

Warmup time for FP8 models is significantly longer than for BF16 due to additional graph compilations triggered by varying constant scale values in quantized model layers.

FP8 warmup time can be reduced by setting the RUNTIME_SCALE_PATCHING=1 environment variable and selecting a hardware-aligned per-tensor scale_method provided by the INC JSON config.

Note

This is an experimental feature. It reduces FP8 warmup time but may lower model throughput by 5-10%. Future releases will improve performance and extend support to more ops.
Available only with Lazy mode (PT_HPU_LAZY_MODE=1). Support of torch.compile will be added in subsequent releases.
Supports Llama workloads using FP8 execution of Linear layers, and casting ops between BF16 and FP8. MoE and Convolution ops are not yet supported.
FSDPA on Gaudi 2 has a known accuracy issue when used with vLLM for single-card Llama workloads. For a workaround, see reducing-vLLM-fp8-warmup-time.

Gaudi Documentation 1.21.1 documentation

Run Inference Using FP8

On this Page

Run Inference Using FP8¶

Measurement and Quantization Mechanisms¶

Enabling and Running INC in PyTorch Models¶

Running INC in Measurement Mode¶

Running INC in Quantization Mode¶

Supported Modules¶

Supported Functions¶

Custom Patched Modules¶

Supported JSON Config File Options¶

Configuring Backoff Factors¶

Compile Time and Throughput Optimization¶

Running Quantized Models on a Small Amount of Cards¶

Generating INC Log Files for Gaudi¶

Reducing FP8 Warmup Time¶