Run Inference Using NF4
This guide provides the steps required to enable NF4 inference on your Intel® Gaudi® 2 and Intel® Gaudi® 3 AI accelerators.
When running inference on LLMs, high memory usage is often a bottleneck. Using NF4 data type for inference reduces the total device memory required. This can be done either by quantizing the LLM weights to NF4 when loading the model onto the device or by loading an LLM checkpoint with weights already quantized to NF4. During inference, the NF4 weights for each LLM layer are temporarily cast to a higher-precision data type (FP16, BF16, or FP32) before computation for that layer begins. The higher-precision copy is discarded immediately after the computation is completed.
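As a rough illustration of the savings, the back-of-the-envelope sketch below compares the weight footprint of BF16 and NF4 for a hypothetical 7B-parameter model. The parameter count and the per-block scale overhead are illustrative assumptions, not measurements of any specific model.

# Rough weight-memory estimate for a hypothetical 7B-parameter model.
params = 7e9
bf16_bytes = params * 2          # BF16: 2 bytes per parameter
nf4_bytes = params * 0.5         # NF4: 4 bits per parameter
nf4_bytes += params / 64 * 2     # plus roughly one scale value per 64-weight block (assumed overhead)
print(f"BF16 weights: ~{bf16_bytes / 2**30:.1f} GiB")  # ~13.0 GiB
print(f"NF4 weights:  ~{nf4_bytes / 2**30:.1f} GiB")   # ~3.5 GiB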
Intel Gaudi uses the bitsandbytes (BNB) and Hugging Face Transformers Python libraries to load LLMs, quantize their weights to NF4, and run inference.
The following is an example to help you get started with NF4 inference on Gaudi.
Install Python packages:
pip install -q -U git+https://github.com/bitsandbytes-foundation/bitsandbytes.git
pip install transformers
pip install accelerate
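Optionally, verify the installation and confirm that the HPU device is visible before running the example. This is a minimal sanity check; it assumes the Intel Gaudi PyTorch bridge (habana_frameworks) is already installed as part of the Gaudi software stack:

import torch
import habana_frameworks.torch.core as htcore  # registers the "hpu" device with PyTorch
import bitsandbytes, transformers, accelerate

print("bitsandbytes:", bitsandbytes.__version__)
print("transformers:", transformers.__version__)
print("accelerate:", accelerate.__version__)
print("HPU available:", torch.hpu.is_available())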
Load and quantize the model to NF4, then run inference:
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

model_id = "facebook/opt-350m"

# Quantize the linear-layer weights to NF4 while loading the model; each layer's
# weights are cast back to FP32 just before its computation runs.
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float32,
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quantization_config,
    device_map="auto",
    torch_dtype=torch.float32,
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Embedding weights are not quantized; only Linear layers are stored in NF4.
print(model.model.decoder.embed_tokens.weight)

text = "Hello my name is"
device = "hpu"
inputs = tokenizer(text, return_tensors="pt").to(device)

outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
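After quantization, you can optionally serialize the model so that later runs load the NF4 weights directly instead of re-quantizing at load time. The sketch below assumes a recent Transformers and bitsandbytes combination that supports saving 4-bit weights; the output directory name is illustrative:

# Save the NF4-quantized weights; the quantization config is recorded in config.json.
model.save_pretrained("opt-350m-nf4")
tokenizer.save_pretrained("opt-350m-nf4")

# Later, reload the already-quantized checkpoint; no re-quantization is needed.
from transformers import AutoModelForCausalLM
model_nf4 = AutoModelForCausalLM.from_pretrained("opt-350m-nf4", device_map="auto")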
It is recommended to use the Hugging Face Optimum-Habana Python library to further increase NF4 inference throughput for selected models (for example, Llama) through the Gaudi-specific optimizations provided in this library. Refer to the test cases provided in the Optimum-Habana library as examples of how to execute NF4 inference:
Load an NF4-quantized LLM checkpoint and run inference. See the NF4 Checkpoint Llama-70B test.
Load a regular LLM checkpoint, quantize it to NF4, and run inference. See the Regular checkpoint quantized to NF4 Llama-70B test.