Run Inference Using FP8¶
This guide provides the steps required to enable FP8 inference on your Gaudi2 processor. Using the FP8 data type for inference on large language models halves the required memory bandwidth, which is often the bottleneck in LLM inference. On Gaudi2, FP8 compute is twice as fast as BF16 compute, so even compute-bound workloads, such as offline inference with large batch sizes, benefit.
The following features are supported:

- Single card only.
- MME operations compute in FP8 and output in BF16.
- The exponent bias used for FP8 computations is 7.
Note
Running inference using FP8 is an experimental feature.
Enabling FP8 on HPU¶
To run inference using FP8, add the htcore.hpu_set_env() API call before device and model setup:

```python
if args.device == 'hpu':
    import habana_frameworks.torch.core as htcore
    htcore.hpu_set_env()
```
Add the following quantization code snippet after model setup:

```python
if args.device == 'hpu':
    from habana_frameworks.torch.core.quantization import _mark_params_as_const, _check_params_as_const
    _mark_params_as_const(model)
    _check_params_as_const(model)
    quant_dict_key = 'HB_QUANTIZATION'
    quant_key = 'quantization'
    model._buffers[quant_dict_key] = dict()
    model._non_persistent_buffers_set.add(quant_dict_key)
    model._buffers[quant_dict_key][quant_key] = True
    import habana_frameworks.torch.core as htcore
    htcore.hpu_initialize(model)
```
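The ordering of the two snippets matters: hpu_set_env() must run before the device and model are set up, and the quantization snippet runs after the model is built. The sketch below is a minimal, hypothetical example of that ordering, assuming a Hugging Face transformers causal LM; the model name (bigscience/bloom-7b1), the args parsing, and the generate() call are illustrative assumptions and not part of this guide.

```python
# Minimal end-to-end sketch of the ordering above (illustrative only).
# Assumes a Gaudi2 PyTorch environment and the Hugging Face transformers package;
# the model name and generation settings are placeholders, not from this guide.
import argparse
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

parser = argparse.ArgumentParser()
parser.add_argument('--device', default='hpu')
args = parser.parse_args()

# Step 1: enable inference mode before device and model setup.
if args.device == 'hpu':
    import habana_frameworks.torch.core as htcore
    htcore.hpu_set_env()

# Model setup.
tokenizer = AutoTokenizer.from_pretrained('bigscience/bloom-7b1')
model = AutoModelForCausalLM.from_pretrained('bigscience/bloom-7b1',
                                             torch_dtype=torch.bfloat16)
model = model.eval().to(args.device)

# Step 2: mark const params and initialize graph compilation after model setup.
if args.device == 'hpu':
    from habana_frameworks.torch.core.quantization import _mark_params_as_const, _check_params_as_const
    _mark_params_as_const(model)
    _check_params_as_const(model)
    quant_dict_key = 'HB_QUANTIZATION'
    model._buffers[quant_dict_key] = dict()
    model._non_persistent_buffers_set.add(quant_dict_key)
    model._buffers[quant_dict_key]['quantization'] = True
    import habana_frameworks.torch.core as htcore
    htcore.hpu_initialize(model)

# Step 3: run inference as usual.
inputs = tokenizer('FP8 inference on Gaudi2 is', return_tensors='pt').to(args.device)
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```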
Set the following environment variables when running the model script:

```bash
USE_DEFAULT_QUANT_PARAM=true UPDATE_GRAPH_OUTPUT_MME=false ENABLE_CALC_DYNAMIC_RANGE=false ENABLE_SYNAPSE_QUANTIZATION=false ENABLE_EXPERIMENTAL_FLAGS=true <script command>
```
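As a hedged alternative to prefixing the command, the same flags can in principle be set from inside the script with os.environ, provided this happens at the very top of the script before habana_frameworks is imported and before the first HPU graph is compiled. This pattern is an assumption rather than documented behavior, so the shell form above remains the reference.

```python
# Hedged alternative (assumption, not from this guide): set the FP8 flags from Python
# at the very top of the script, before importing habana_frameworks or building the
# model, so they are visible when the first HPU graph is compiled.
import os

os.environ.setdefault('USE_DEFAULT_QUANT_PARAM', 'true')
os.environ.setdefault('UPDATE_GRAPH_OUTPUT_MME', 'false')
os.environ.setdefault('ENABLE_CALC_DYNAMIC_RANGE', 'false')
os.environ.setdefault('ENABLE_SYNAPSE_QUANTIZATION', 'false')
os.environ.setdefault('ENABLE_EXPERIMENTAL_FLAGS', 'true')
```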
FP8 APIs¶
- htcore.hpu_set_env() - Sets inference mode.
- _mark_params_as_const(model) - Marks constant model params to allow constant-folding performance optimizations.
- _check_params_as_const(model) - (Optional) Validates the marking of const params.
- htcore.hpu_initialize(model) - Initializes graph compilation in inference mode.
You can find a usage example in the BLOOM7B model.
FP8 Environment Variables¶
| Flag | Value | Description |
|---|---|---|
| USE_DEFAULT_QUANT_PARAM | True | Sets the default quantization info for FP8 operations. The default quantization info is exponentBias = 7, per FP8 tensor. |
| ENABLE_CALC_DYNAMIC_RANGE | False | Disables heavy calculation done in the warmup stage, as the quantization info relevant to FP8 tensors is set to the default value. |
| UPDATE_GRAPH_OUTPUT_MME | False | Sets the MME that produces the model output to BF16 precision. |
| ENABLE_SYNAPSE_QUANTIZATION | False | Disables heavy compilation passes done in the warmup stage, as the quantization info relevant to FP8 tensors is set to the default value. |