Run Inference Using UINT4

This guide provides the steps required to enable UINT4 inference on your Intel® Gaudi® 3 and Intel® Gaudi® 2 AI accelerators.

When running inference on LLMs, high memory usage is often a bottleneck. Using the UINT4 data type for inference can significantly reduce memory bandwidth requirements compared to using FP8 or higher bit-width formats.

The following is currently supported:

  • GPTQ - Weight-Only-Quantization (WOQ) method.

  • nn.Linear module.

  • Single device only.

  • Lazy mode only.

  • The pre-quantized model should be in BF16 only.

  • Tested on Hugging Face Optimum for Intel Gaudi models only.

Intel Gaudi uses the Intel® Neural Compressor (INC) API to load models with 4-bit checkpoints and adapt them for execution on Gaudi. INC supports models quantized to 4 bits using Weight-Only Quantization (WOQ).

Quantizing PyTorch Models to UINT4

Quantize the model with the run_clm_no_trainer.py script provided in the Neural Compressor GitHub repo for GPTQ quantization:

python -u run_clm_no_trainer.py \
        --model <model_name_or_path> \
        --dataset <DATASET_NAME> \
        --quantize \
        --output_dir <tuned_checkpoint> \
        --tasks "lambada_openai" \
        --batch_size <batch_size> \
        --woq_algo GPTQ \
        --woq_bits 4 \
        --woq_group_size 128 \
        --woq_scheme asym \
        --woq_use_mse_search \
        --gptq_use_max_length

Note

  • Typical LLMs such as meta-llama/Llama-2-7b-hf, EleutherAI/gpt-j-6B, and facebook/opt-125m have been validated with this script.

  • For more information on the GPTQ and WOQ config flags, refer to this code.
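
If you prefer to perform the quantization from your own Python code instead of the script above, the following is a minimal sketch using the neural_compressor.torch.quantization prepare/convert flow with GPTQConfig. The config field names, the calibration function, and the model choice are assumptions mapped from the command-line flags; the script above additionally handles dataset loading, evaluation, and saving the tuned checkpoint.

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from neural_compressor.torch.quantization import GPTQConfig, prepare, convert

    # Hypothetical model choice -- any of the validated LLMs listed above works.
    model_name = "meta-llama/Llama-2-7b-hf"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

    # 4-bit asymmetric WOQ with group size 128, mirroring the flags used above
    # (field names are assumptions; check your INC version).
    quant_config = GPTQConfig(bits=4, group_size=128, use_sym=False, use_mse_search=True)

    def calib_fn(m):
        # GPTQ needs calibration data; a real run would iterate over a dataset
        # such as lambada_openai rather than a single prompt.
        sample = tokenizer("Calibration text for GPTQ.", return_tensors="pt")
        m(**sample)

    model = prepare(model, quant_config)   # attach GPTQ calibration hooks
    calib_fn(model)                        # run calibration batches
    model = convert(model)                 # produce the 4-bit WOQ model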

Loading a WOQ Checkpoint Saved by INC

You can load the checkpoint created in the previous section or an existing checkpoint using the steps below in your own script. See the Llama 2 7B model for an example model using UINT4:

  1. Import habana_frameworks.torch.core:

    import habana_frameworks.torch.core as htcore
    
  2. Call the INC load API and target the Gaudi device:

    from neural_compressor.torch.quantization import load
    from transformers import AutoModelForCausalLM
    org_model = AutoModelForCausalLM.from_pretrained(
        args.model_name_or_path,
        **model_kwargs,
    )
    
    model = load(
        model_name_or_path=args.local_quantized_inc_model_path,
        format="default",
        device="hpu",
        original_model=org_model,
        **model_kwargs,
    )
    
  3. --local_quantized_inc_model_path provides the path to the INC-quantized model checkpoint files generated in the previous section and must be used together with the original --model_name_or_path (a consolidated sketch of these steps is shown after the list):

    <model run command> --model_name_or_path <model_name_or_path> --local_quantized_inc_model_path <tuned_checkpoint>
    
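Putting these steps together, the following is a minimal end-to-end sketch that loads a UINT4 INC checkpoint and runs a single forward pass on Gaudi in lazy mode. The model name, checkpoint path, and prompt are placeholders; a real deployment would typically drive generation through Optimum for Intel Gaudi rather than a raw forward pass.

    import torch
    import habana_frameworks.torch.core as htcore  # registers the HPU device and lazy-mode runtime
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from neural_compressor.torch.quantization import load

    # Placeholder paths -- substitute your own model and INC checkpoint directory.
    model_name_or_path = "meta-llama/Llama-2-7b-hf"
    local_quantized_inc_model_path = "./tuned_checkpoint"

    tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)

    # The original model must be BF16 (see the supported-features list above).
    org_model = AutoModelForCausalLM.from_pretrained(
        model_name_or_path, torch_dtype=torch.bfloat16
    )

    # Swap the BF16 weights for the UINT4 WOQ checkpoint and target Gaudi.
    model = load(
        model_name_or_path=local_quantized_inc_model_path,
        format="default",
        device="hpu",
        original_model=org_model,
    )
    model = model.eval().to("hpu")

    # Single forward pass; in lazy mode, mark_step() triggers graph execution.
    inputs = tokenizer("UINT4 inference on Gaudi", return_tensors="pt").to("hpu")
    with torch.no_grad():
        logits = model(**inputs).logits
    htcore.mark_step()
    print(logits.shape)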

Loading a Hugging Face WOQ Checkpoint using INC

You can load a Hugging Face checkpoint using the Hugging Face scripts. See the Llama 2 7B model for an example model using UINT4:

Run the original Hugging Face model run command with the --load_quantized_model_with_inc flag, which invokes the INC load API:

<model run command> --load_quantized_model_with_inc
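
For reference, the flag corresponds roughly to calling the INC load API with the Hugging Face checkpoint format, as in the sketch below. The format value and the example checkpoint name are assumptions; check the INC documentation for the arguments supported by your version.

    import habana_frameworks.torch.core as htcore  # registers the HPU device
    from neural_compressor.torch.quantization import load

    # Hypothetical 4-bit (GPTQ) checkpoint from the Hugging Face Hub -- substitute your own.
    model = load(
        model_name_or_path="TheBloke/Llama-2-7B-GPTQ",
        format="huggingface",
        device="hpu",
    )
    model.eval()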