Run Inference Using UINT4
This guide describes the steps required to enable UINT4 inference on the Intel® Gaudi® 2 AI accelerator. When running inference on large language models (LLMs), memory usage is often the bottleneck. Since UINT4 weights occupy half the bits of FP8, running inference in UINT4 halves the required memory bandwidth compared to running inference in FP8.
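The bandwidth claim above is simple arithmetic; the following sketch makes it concrete. The 7B parameter count is an illustrative assumption (matching the Llama 2 7B example used later), not a measured value.

```python
# Illustrative arithmetic: weight-memory footprint at different precisions.
# PARAMS is an assumed model size for illustration (roughly Llama 2 7B).
PARAMS = 7_000_000_000

def weight_bytes(num_params: int, bits_per_weight: int) -> int:
    """Bytes needed to store the model weights at the given precision."""
    return num_params * bits_per_weight // 8

fp8_gb = weight_bytes(PARAMS, 8) / 1e9    # 8-bit weights: 7.0 GB
uint4_gb = weight_bytes(PARAMS, 4) / 1e9  # 4-bit weights: 3.5 GB
print(fp8_gb, uint4_gb)
```

Halving the bytes read per weight halves the memory traffic needed to stream the weights, which is what dominates LLM inference time at small batch sizes.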
Note
The following is currently supported:

- GPTQ Weight-Only-Quantization (WOQ) method.
- nn.Linear module.
- Single device only.
- Lazy mode only (default).
- The pre-quantized model must be in BF16 only.
- Tested on Hugging Face Optimum for Intel Gaudi models only.
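To make the WOQ idea concrete, the sketch below shows the standard affine dequantization that GPTQ-style UINT4 checkpoints rely on: each unsigned 4-bit integer maps back to a real weight through a scale and zero point. The function name and values are illustrative, not the INC API.

```python
# Minimal sketch of weight-only dequantization for a UINT4 checkpoint.
# In GPTQ-style WOQ, only the weights are quantized; activations stay in
# higher precision, and each quantized weight q (0..15) is recovered as
# w ≈ scale * (q - zero_point). Names here are illustrative, not INC API.
def dequantize(q: int, scale: float, zero_point: int) -> float:
    """Map an unsigned 4-bit weight back to its approximate real value."""
    assert 0 <= q <= 15, "UINT4 values occupy the range 0..15"
    return scale * (q - zero_point)

# A weight stored as q=12 with scale=0.05 and zero_point=8
# dequantizes to 0.05 * (12 - 8) = 0.2.
print(dequantize(12, 0.05, 8))
```

In practice the scale and zero point are shared per group of weights, so the per-weight storage cost stays close to 4 bits.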
Enabling and Running UINT4 in PyTorch Models
Intel Gaudi uses the Intel Neural Compressor (INC) API to load models with 4-bit checkpoints and adjust them to run on Gaudi 2. INC supports models that were quantized to 4-bit using Weight-Only-Quantization (WOQ). See the Llama 2 7B model for an example model using UINT4:
Install the INC package, neural_compressor.torch.quantization, using the Intel Gaudi PyTorch package or Docker as detailed in the Installation Guide. You can also install INC from the Intel Gaudi Neural Compressor fork.

Import habana_frameworks.torch.core:

import habana_frameworks.torch.core as htcore
Call the INC load API and target the Gaudi device:

from neural_compressor.torch.quantization import load

model = load(
    model_name_or_path=args.model_name_or_path,
    format="huggingface",
    device="hpu",
    **model_kwargs,
)
Set the following when running your model. SRAM_SLICER_SHARED_MME_INPUT_EXPANSION_ENABLED=false is an experimental flag that yields better performance, and --load_cp invokes the INC load API:

SRAM_SLICER_SHARED_MME_INPUT_EXPANSION_ENABLED=false ENABLE_EXPERIMENTAL_FLAGS=true <model run command> --load_cp
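As an alternative to prefixing the command line, the same flags can be exported from Python before the model run. This assumes the flags are read at process startup, before the Habana frameworks are imported; the flag names are taken verbatim from the command above.

```python
# Set the experimental flags programmatically instead of on the command line.
# Assumption: this runs before habana_frameworks is imported, so the runtime
# picks the values up at startup.
import os

os.environ["SRAM_SLICER_SHARED_MME_INPUT_EXPANSION_ENABLED"] = "false"
os.environ["ENABLE_EXPERIMENTAL_FLAGS"] = "true"
```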
Note

The SRAM_SLICER_SHARED_MME_INPUT_EXPANSION_ENABLED flag will be removed in a future release.