Inference on PyTorch

The following sections describe the inference capabilities of the Intel® Gaudi® AI accelerator. Support for inference capabilities and TorchServe will be expanded in upcoming releases.

Intel Gaudi’s PyTorch integration performs some computation on the host side in every iteration of the inference loop. To overlap this host time with the device time of previous iterations and increase throughput, invoke htcore.mark_step() after each loop iteration. If the resulting performance is satisfactory, running the model with the standard PyTorch model.eval() flow is sufficient.
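For example, a minimal inference loop that calls htcore.mark_step() after each iteration might look as follows. This is a sketch only; the model, batch size, and input shape are placeholders, not taken from this document.

```python
# Minimal sketch of an HPU inference loop with mark_step().
# The model and input shapes below are placeholders.
import torch
import habana_frameworks.torch.core as htcore

device = torch.device("hpu")
model = torch.nn.Linear(128, 10).to(device)
model.eval()

with torch.no_grad():
    for _ in range(100):
        inputs = torch.randn(32, 128, device=device)
        outputs = model(inputs)
        # Trigger execution of the accumulated graph on the device so that
        # host-side work for the next iteration overlaps with device execution.
        htcore.mark_step()
```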

However, if the application is latency-sensitive, or the host time exceeds the device time due to a small batch size, the HPU Graphs feature can minimize this host overhead. Using HPU Graphs requires minor modifications to the script and has some limitations, but when applicable it reduces inference latency.
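As a rough sketch of what those script modifications look like, the example below wraps a model’s forward pass in an HPU Graph so that subsequent calls replay the captured graph instead of re-incurring per-iteration host work. It assumes the wrap_in_hpu_graph helper exposed under habana_frameworks.torch.hpu; the model and input shape are placeholders.

```python
# Sketch of HPU Graphs usage for inference, assuming the
# habana_frameworks.torch.hpu.wrap_in_hpu_graph helper.
# The model and input shape below are placeholders.
import torch
import habana_frameworks.torch as ht

device = torch.device("hpu")
model = torch.nn.Linear(128, 10).to(device)
model.eval()

# Capture the forward pass as an HPU Graph; later calls with same-shaped
# inputs replay the captured graph, reducing host-side overhead per iteration.
model = ht.hpu.wrap_in_hpu_graph(model)

with torch.no_grad():
    inputs = torch.randn(32, 128, device=device)
    outputs = model(inputs)
```

Because the graph is captured for fixed input shapes, this approach works best when inference requests share a consistent shape; dynamic shapes are one of the limitations mentioned above.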