Inference with Quantization

Quantization refers to optimization techniques that improve the performance and efficiency of large language models (LLMs) and other AI applications by representing tensors at reduced numerical precision. When running inference on LLMs, high memory usage is often the bottleneck. Quantization methods enable computations and tensor storage at lower bit widths than full floating-point precision, which reduces memory bandwidth requirements and increases computational throughput.
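
As a rough illustration of the memory savings from lower bit widths, the sketch below estimates weights-only memory for a hypothetical 70B-parameter model. The parameter count and byte widths are illustrative assumptions, not Gaudi-specific measurements, and activation and KV-cache memory are ignored.

    # Weights-only memory estimate: parameters x bytes per parameter.
    # Numbers are illustrative; real footprints also include activations,
    # KV cache, and framework overhead.
    params = 70e9
    for name, bytes_per_param in [("BF16", 2), ("FP8", 1), ("UINT4", 0.5)]:
        print(f"{name}: ~{params * bytes_per_param / 1e9:.0f} GB of weights")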

This guide outlines the steps required to enable inference with quantization on your Intel® Gaudi® 2 and Intel® Gaudi® 3 AI accelerators using the Intel® Neural Compressor (INC) package.

Prerequisites

The Intel® Neural Compressor (INC) library, neural_compressor.torch.quantization, is available in the Intel Gaudi PyTorch package, which comes with the Intel Gaudi Docker container as detailed in the Installation Guide. It can also be installed directly on a bare metal system.
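
To confirm that INC and its PyTorch quantization entry point are available in your environment, a quick check such as the following can be run. This is a minimal sketch; the printed version string varies by release.

    # Verify that INC and its PyTorch quantization API are importable in the
    # current environment (e.g., inside the Intel Gaudi Docker container).
    import neural_compressor
    from neural_compressor.torch import quantization as inc_quant

    print("neural_compressor version:", neural_compressor.__version__)
    print("torch quantization module:", inc_quant.__name__)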

Intel Gaudi uses the INC API to perform FP8 and UINT4 quantization (see the workflow sketch after the list below). INC is optimized for Gaudi by:

  • Using PyTorch custom ops that allow fusion and optimizations at the Intel Gaudi software graph level.

  • Using scale values chosen specifically to accelerate computation on Gaudi.

  • Loading weights to the device efficiently, one at a time, and immediately converting them to FP8/UINT4 so that large models fit in device memory.
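
The sketch below shows a typical FP8 workflow with the INC PyTorch API on Gaudi: a measurement (calibration) pass records scales, and a second pass converts the model to FP8 using those scales. The toy model, the calibration data, and the JSON config file names (maxabs_measure.json, maxabs_quant.json) are assumptions for illustration; refer to the sections that follow and the INC documentation for the exact configuration options for your model.

    import torch
    import torch.nn as nn
    import habana_frameworks.torch.core as htcore  # Gaudi PyTorch bridge
    from neural_compressor.torch.quantization import (
        FP8Config,
        prepare,
        convert,
        finalize_calibration,
    )

    # Toy model standing in for an LLM; any torch.nn.Module is handled the same way.
    model = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU(), nn.Linear(1024, 1024))
    model = model.eval().to("hpu")

    # --- Run 1: measurement (calibration) ---
    # "maxabs_measure.json" is a hypothetical config file that puts INC in
    # measurement mode; adjust the path and contents for your setup.
    measure_config = FP8Config.from_json_file("maxabs_measure.json")
    model = prepare(model, measure_config)

    with torch.no_grad():
        for _ in range(8):  # feed representative calibration data
            model(torch.randn(4, 1024).to("hpu"))
            htcore.mark_step()

    finalize_calibration(model)  # persists the measured scales

    # --- Run 2: quantization ---
    # In practice this is usually a separate process that reloads the original
    # model; "maxabs_quant.json" is likewise a hypothetical config file.
    quant_config = FP8Config.from_json_file("maxabs_quant.json")
    model = convert(model, quant_config)

    with torch.no_grad():
        out = model(torch.randn(4, 1024).to("hpu"))
        htcore.mark_step()
    print(out.shape)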

For more details on INC, refer to the original GitHub repo.

Supported Data Types

The following sections cover the types of quantization supported on Intel Gaudi with corresponding examples and code: