Run Inference Using UINT4

This guide provides the steps required to enable UINT4 inference on your Intel® Gaudi® 3 and Intel® Gaudi® 2 AI accelerators.

When running inference on LLMs, high memory usage is often a bottleneck. Using the UINT4 data type for inference can significantly reduce memory bandwidth requirements compared to using FP8 or higher bit-width formats.

The following is currently supported:

  • GPTQ - Weight-Only-Quantization (WOQ) method.

  • nn.Linear module.

  • Single device only.

  • Lazy mode only.

  • The pre-quantized model should be in BF16 only.

  • Tested on Hugging Face Optimum for Intel Gaudi models only.

Intel Gaudi uses the Intel® Neural Compressor (INC) API to load models with 4-bit checkpoints and adapt them for execution on Gaudi. INC supports models quantized to 4 bits using Weight-Only Quantization (WOQ).

Quantizing PyTorch Models to UINT4

Quantize the model with the run_clm_no_trainer.py script provided in the Neural Compressor GitHub repo for GPTQ quantization:

python -u run_clm_no_trainer.py \
        --model <model_name_or_path> \
        --dataset <DATASET_NAME> \
        --quantize \
        --output_dir <tuned_checkpoint> \
        --tasks "lambada_openai" \
        --batch_size <batch_size> \
        --woq_algo GPTQ \
        --woq_bits 4 \
        --woq_group_size 128 \
        --woq_scheme asym \
        --woq_use_mse_search \
        --gptq_use_max_length

Note

  • Typical LLMs such as meta-llama/Llama-2-7b-hf, EleutherAI/gpt-j-6B, and facebook/opt-125m have been validated with this script.

  • For more information on the GPTQ and WOQ config flags, refer to this code.
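
If you prefer to perform the quantization from your own Python code instead of the script above, the following is a minimal sketch using the neural_compressor.torch.quantization prepare/convert flow with GPTQConfig. The config field names, the calibration function, and the model choice are assumptions mapped from the command-line flags; the script above additionally handles dataset loading, evaluation, and saving the tuned checkpoint.

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from neural_compressor.torch.quantization import GPTQConfig, prepare, convert

    # Hypothetical model choice -- any of the validated LLMs listed above works.
    model_name = "meta-llama/Llama-2-7b-hf"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

    # 4-bit asymmetric WOQ with group size 128, mirroring the flags used above
    # (field names are assumptions; check your INC version).
    quant_config = GPTQConfig(bits=4, group_size=128, use_sym=False, use_mse_search=True)

    def calib_fn(m):
        # GPTQ needs calibration data; a real run would iterate over a dataset
        # such as lambada_openai rather than a single prompt.
        sample = tokenizer("Calibration text for GPTQ.", return_tensors="pt")
        m(**sample)

    model = prepare(model, quant_config)   # attach GPTQ calibration hooks
    calib_fn(model)                        # run calibration batches
    model = convert(model)                 # produce the 4-bit WOQ model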

Loading a WOQ Checkpoint Saved by INC

You can load the checkpoint created in the previous section or an existing checkpoint using the steps below in your own script. See the Llama 2 7B model for an example model using UINT4:

  1. Import habana_frameworks.torch.core:

    import habana_frameworks.torch.core as htcore
    
  2. Call the INC load API and target the Gaudi device:

    from neural_compressor.torch.quantization import load
    from transformers import AutoModelForCausalLM
    org_model = AutoModelForCausalLM.from_pretrained(
        args.model_name_or_path,
        **model_kwargs,
    )
    
    model = load(
        model_name_or_path=args.local_quantized_inc_model_path,
        format="default",
        device="hpu",
        original_model=org_model,
        **model_kwargs,
    )
    
  3. --local_quantized_inc_model_path provides the path to the INC-quantized model checkpoint files generated in the previous section and must be used together with the original --model_name_or_path (a consolidated sketch of these steps is shown after the list):

    <model run command> --model_name_or_path <model_name_or_path> --local_quantized_inc_model_path <tuned_checkpoint>
    
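Putting these steps together, the following is a minimal end-to-end sketch that loads a UINT4 INC checkpoint and runs a single forward pass on Gaudi in lazy mode. The model name, checkpoint path, and prompt are placeholders; a real deployment would typically drive generation through Optimum for Intel Gaudi rather than a raw forward pass.

    import torch
    import habana_frameworks.torch.core as htcore  # registers the HPU device and lazy-mode runtime
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from neural_compressor.torch.quantization import load

    # Placeholder paths -- substitute your own model and INC checkpoint directory.
    model_name_or_path = "meta-llama/Llama-2-7b-hf"
    local_quantized_inc_model_path = "./tuned_checkpoint"

    tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)

    # The original model must be BF16 (see the supported-features list above).
    org_model = AutoModelForCausalLM.from_pretrained(
        model_name_or_path, torch_dtype=torch.bfloat16
    )

    # Swap the BF16 weights for the UINT4 WOQ checkpoint and target Gaudi.
    model = load(
        model_name_or_path=local_quantized_inc_model_path,
        format="default",
        device="hpu",
        original_model=org_model,
    )
    model = model.eval().to("hpu")

    # Single forward pass; in lazy mode, mark_step() triggers graph execution.
    inputs = tokenizer("UINT4 inference on Gaudi", return_tensors="pt").to("hpu")
    with torch.no_grad():
        logits = model(**inputs).logits
    htcore.mark_step()
    print(logits.shape)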

Loading a Hugging Face WOQ Checkpoint using INC

You can load a Hugging Face checkpoint using the Hugging Face scripts. See the Llama 2 7B model for an example model using UINT4:

Run the original Hugging Face model run command with the --load_quantized_model_with_inc flag, which invokes the INC load API:

<model run command> --load_quantized_model_with_inc
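
For reference, the flag corresponds roughly to calling the INC load API with the Hugging Face checkpoint format, as in the sketch below. The format value and the example checkpoint name are assumptions; check the INC documentation for the arguments supported by your version.

    import habana_frameworks.torch.core as htcore  # registers the HPU device
    from neural_compressor.torch.quantization import load

    # Hypothetical 4-bit (GPTQ) checkpoint from the Hugging Face Hub -- substitute your own.
    model = load(
        model_name_or_path="TheBloke/Llama-2-7B-GPTQ",
        format="huggingface",
        device="hpu",
    )
    model.eval()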