Run Inference Using UINT4¶
This guide provides the steps required to enable UINT4 inference on your Intel® Gaudi® 3 and Intel® Gaudi® 2 AI accelerator.
When running inference on LLMs, high memory usage is often a bottleneck. Using the UINT4 data type for inference can significantly reduce memory bandwidth requirements compared to using FP8 or higher bit-width formats.
The following is currently supported:

- GPTQ - Weight-Only Quantization (WOQ) method.
- nn.Linear module.
- Single device only.
- Lazy mode only.
- The pre-quantized model should be in BF16 only.
- Tested on Hugging Face Optimum for Intel Gaudi models only.
Intel Gaudi uses the Intel® Neural Compressor (INC) API to load models with 4-bit checkpoints and adapt them for execution on Gaudi. INC supports models quantized to 4-bit using Weight-Only Quantization (WOQ).
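As background, WOQ keeps each weight as a 4-bit unsigned integer together with a per-group scale and zero point, and dequantizes back to a higher-precision dtype at compute time. The following standalone sketch illustrates the asymmetric, group-wise UINT4 scheme (group size 128, matching the quantization flags used below). It is illustrative only and is not the INC implementation:

import torch

def quantize_uint4_asym(w: torch.Tensor, group_size: int = 128):
    """Asymmetric, group-wise UINT4 quantization of a 2-D weight tensor."""
    out_features, in_features = w.shape
    wg = w.reshape(out_features, in_features // group_size, group_size)
    w_min = wg.amin(dim=-1, keepdim=True)
    w_max = wg.amax(dim=-1, keepdim=True)
    scale = (w_max - w_min).clamp(min=1e-8) / 15.0      # 4 bits -> 16 levels (0..15)
    zero_point = torch.round(-w_min / scale)
    q = torch.clamp(torch.round(wg / scale) + zero_point, 0, 15).to(torch.uint8)
    return q, scale, zero_point

def dequantize_uint4_asym(q, scale, zero_point, shape):
    """Reconstruct BF16 weights from UINT4 values, scales, and zero points."""
    w = (q.float() - zero_point) * scale
    return w.reshape(shape).to(torch.bfloat16)

w = torch.randn(32, 256)
q, scale, zero_point = quantize_uint4_asym(w, group_size=128)
w_hat = dequantize_uint4_asym(q, scale, zero_point, w.shape)
print((w - w_hat.float()).abs().max())   # small quantization error, bounded by the group scale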
Quantizing PyTorch Models to UINT4¶
Quantize the model with the run_clm_no_trainer.py script provided in the Neural Compressor GitHub repo for GPTQ quantization:
python -u run_clm_no_trainer.py \
    --model <model_name_or_path> \
    --dataset <DATASET_NAME> \
    --quantize \
    --output_dir <tuned_checkpoint> \
    --tasks "lambada_openai" \
    --batch_size <batch_size> \
    --woq_algo GPTQ \
    --woq_bits 4 \
    --woq_group_size 128 \
    --woq_scheme asym \
    --woq_use_mse_search \
    --gptq_use_max_length

Note

Typical LLMs such as meta-llama/Llama-2-7b-hf, EleutherAI/gpt-j-6B, and facebook/opt-125m have been validated with this script.

For more information on the GPTQ and WOQ config flags, refer to this code.
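For example, a quantization run for meta-llama/Llama-2-7b-hf could look like the following. The dataset, output directory, and batch size are illustrative values only, not requirements of the script:

python -u run_clm_no_trainer.py \
    --model meta-llama/Llama-2-7b-hf \
    --dataset NeelNanda/pile-10k \
    --quantize \
    --output_dir ./llama2-7b-gptq-uint4 \
    --tasks "lambada_openai" \
    --batch_size 8 \
    --woq_algo GPTQ \
    --woq_bits 4 \
    --woq_group_size 128 \
    --woq_scheme asym \
    --woq_use_mse_search \
    --gptq_use_max_length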
Loading a WOQ Checkpoint Saved by INC¶
You can load the checkpoint created in the previous section, or an existing checkpoint, using the steps below in your own script. See the Llama 2 7B model for an example of a model using UINT4:
Import habana_frameworks.torch.core:

import habana_frameworks.torch.core as htcore
Call the INC load API and target the Gaudi device:
from neural_compressor.torch.quantization import load
from transformers import AutoModelForCausalLM

org_model = AutoModelForCausalLM.from_pretrained(
    args.model_name_or_path,
    **model_kwargs,
)

model = load(
    model_name_or_path=args.local_quantized_inc_model_path,
    format="default",
    device="hpu",
    original_model=org_model,
    **model_kwargs,
)
--local_quantized_inc_model_path provides the path to the INC quantized model checkpoint files generated in the previous section and must be used together with the original --model_name_or_path:

<model run command> --model_name_or_path <model_name_or_path> --local_quantized_inc_model_path <tuned_checkpoint>
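Once load() returns, the model resides on the Gaudi device and can be used like any other transformers model. The following is a minimal generation sketch; the prompt, tokenizer usage, and generation settings are illustrative and are not part of the INC API:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(args.model_name_or_path)

# Move the inputs to the Gaudi device; the loaded model is already on "hpu".
inputs = tokenizer("Deep learning is", return_tensors="pt").to("hpu")

# Run generation with the UINT4 WOQ model (lazy mode).
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))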
Loading a Hugging Face WOQ Checkpoint Using INC¶
You can load a Hugging Face WOQ checkpoint using the Hugging Face scripts. See the Llama 2 7B model for an example of a model using UINT4:
Run the original Hugging Face model run command with --load_quantized_model_with_inc, which invokes the INC load API:

<model run command> --load_quantized_model_with_inc
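As an illustration, with the text-generation example script from Optimum for Intel Gaudi, such a run might look like the following. The script name, checkpoint name, and additional flags are assumptions based on the Optimum for Intel Gaudi examples rather than part of this guide:

python run_generation.py \
    --model_name_or_path TheBloke/Llama-2-7B-GPTQ \
    --load_quantized_model_with_inc \
    --use_kv_cache \
    --max_new_tokens 100 \
    --bf16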