Inference on PyTorch

The following sections describe the inference capabilities of the Intel® Gaudi® AI accelerator. Support for inference capabilities and TorchServe will be expanded in upcoming releases.

Intel Gaudi’s PyTorch integration performs some computation on the host side in every iteration of the inference loop. To overlap this host time with the device time of previous iterations and increase throughput, invoke htcore.mark_step() after each loop iteration. If performance is satisfactory, running the model with PyTorch’s model.eval() should be sufficient.
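
For reference, below is a minimal sketch of such an inference loop, assuming a Gaudi system with the habana_frameworks PyTorch package installed. The model, input shapes, and batch source are placeholders for your own workload; the point being illustrated is the htcore.mark_step() call after each iteration.

    import torch
    import habana_frameworks.torch.core as htcore

    device = torch.device("hpu")

    model = torch.nn.Linear(128, 10)   # stand-in for your trained model
    model = model.to(device)
    model.eval()                       # inference mode: no dropout, frozen batchnorm stats

    # Stand-in input batches; replace with your DataLoader.
    batches = [torch.randn(32, 128) for _ in range(8)]

    with torch.no_grad():
        for batch in batches:
            batch = batch.to(device, non_blocking=True)
            output = model(batch)
            # Trigger execution on the device so the host can start preparing
            # the next iteration while the device computes the current one.
            htcore.mark_step()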

However, if the application is latency-sensitive, or if the host time exceeds the device time due to a low batch size, the HPU Graphs feature minimizes this host time. Using HPU Graphs requires minor modifications to the script and has some limitations, but where it applies it reduces inference latency. A sketch follows below.
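
The sketch below shows one way to wrap an inference model with HPU Graphs, assuming the wrap_in_hpu_graph helper exposed by the habana_frameworks.torch.hpu module; consult the HPU Graphs documentation of your installed release for the exact API. One common limitation is that the captured graph is replayed for later calls, so input shapes typically need to stay fixed between iterations.

    import torch
    import habana_frameworks.torch.hpu as ht_hpu  # assumed module path for wrap_in_hpu_graph

    device = torch.device("hpu")

    model = torch.nn.Linear(128, 10).to(device)   # stand-in for your trained model
    model.eval()

    # Wrap the model so its forward pass is captured once and replayed as an
    # HPU Graph, removing per-iteration host-side work from the critical path.
    model = ht_hpu.wrap_in_hpu_graph(model)

    with torch.no_grad():
        for _ in range(8):
            batch = torch.randn(32, 128).to(device)  # same shape every call
            # The first call captures the graph; subsequent calls with
            # identically shaped inputs replay it with low host overhead.
            output = model(batch)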