Run Inference Using UINT4¶
This guide provides the steps required to enable UINT4 inference on your Intel® Gaudi® 2 and Intel® Gaudi® 3 AI accelerators.
When running inference on large language models (LLMs), high memory usage is often the bottleneck. Using the UINT4 data type for LLM inference therefore reduces the required memory bandwidth compared to running inference in FP8 or higher bit widths.
Note
The following is currently supported:
GPTQ - Weight-Only-Quantization (WOQ) method.
nn.Linear module.
Single device only.
Lazy mode only (default).
The pre-quantized model should be in BF16 only.
Tested on Hugging Face Optimum for Intel Gaudi models only.
Intel Gaudi uses the Intel® Neural Compressor (INC) API to load models with 4-bit checkpoints and adjust them to run on Gaudi. INC supports models that were quantized to 4-bit using Weight-Only-Quantization (WOQ).
Quantizing PyTorch Models to UINT4¶
Quantize the model with the run_clm_no_trainer.py script provided in the Neural Compressor GitHub repo for GPTQ quantization:
    python -u run_clm_no_trainer.py \
        --model <model_name_or_path> \
        --dataset <DATASET_NAME> \
        --quantize \
        --output_dir <tuned_checkpoint> \
        --tasks "lambada_openai" \
        --batch_size <batch_size> \
        --woq_algo GPTQ \
        --woq_bits 4 \
        --woq_group_size 128 \
        --woq_scheme asym \
        --woq_use_mse_search \
        --gptq_use_max_length

Note
Typical LLMs such as meta-llama/Llama-2-7b-hf, EleutherAI/gpt-j-6B, and facebook/opt-125m have been validated with this script. For more information on the GPTQ and WOQ config flags, refer to this code.
Loading a WOQ Checkpoint Saved by INC¶
You can load the checkpoint created in the previous section, or an existing checkpoint, using the steps below in your own script. See the Llama 2 7B model for an example model using UINT4:
Import habana_frameworks.torch.core:

    import habana_frameworks.torch.core as htcore
Call the INC load API and target the Gaudi device:

    from neural_compressor.torch.quantization import load
    from transformers import AutoModelForCausalLM

    org_model = AutoModelForCausalLM.from_pretrained(
        args.model_name_or_path,
        **model_kwargs,
    )

    model = load(
        model_name_or_path=args.local_quantized_inc_model_path,
        format="default",
        device="hpu",
        original_model=org_model,
        **model_kwargs,
    )
--local_quantized_inc_model_path provides the path to the INC-quantized model checkpoint files generated in the previous section and needs to be used along with the original --model_name_or_path:

    <model run command> --model_name_or_path <model_name_or_path> --local_quantized_inc_model_path <tuned_checkpoint>
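As an illustration, once the model has been returned by the INC load API in step 2 above, it can be used for generation like any other Hugging Face causal LM running on HPU. This is a minimal sketch; the tokenizer setup, prompt, and generation parameters are assumptions for the example and not part of the documented flow:

    import torch
    from transformers import AutoTokenizer

    # The tokenizer comes from the original (pre-quantized) model repository.
    tokenizer = AutoTokenizer.from_pretrained(args.model_name_or_path)

    model.eval()  # `model` is the object returned by load() above, already placed on HPU

    # Tokenize a prompt and move the input tensors to the Gaudi device.
    inputs = tokenizer("Explain UINT4 quantization in one sentence.", return_tensors="pt").to("hpu")

    with torch.no_grad():
        output_ids = model.generate(**inputs, max_new_tokens=32)

    print(tokenizer.decode(output_ids[0], skip_special_tokens=True))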
Loading a Hugging Face WOQ Checkpoint using INC¶
You can load a Hugging Face checkpoint using the Hugging Face scripts. See the Llama 2 7B model for an example model using UINT4:
Run the original Hugging Face model run command with --load_quantized_model_with_inc, which invokes the INC load API:

    <model run command> --load_quantized_model_with_inc
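For example, with the Optimum for Intel Gaudi text-generation example, such a run could look as follows. The script name and the flags other than --load_quantized_model_with_inc are illustrative and depend on the model example you are running:

    python run_generation.py \
        --model_name_or_path <model_name_or_path> \
        --bf16 \
        --use_hpu_graphs \
        --use_kv_cache \
        --max_new_tokens 100 \
        --load_quantized_model_with_inc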