Inference on Gaudi

The following sections describe the inference capabilities on Gaudi. Support for inference capabilities and TorchServe will be expanded in upcoming releases.

Habana’s PyTorch integration performs some computation on the host in every iteration of the inference loop. To overlap this host time with the device time of previous iterations and increase throughput, invoke htcore.mark_step() at the end of each loop iteration. If the resulting performance is satisfactory, putting the model in evaluation mode with PyTorch's model.eval() is sufficient.
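The loop described above can be sketched as follows. This is a minimal illustration, not a reference implementation: run_inference and demo_on_hpu are hypothetical helper names, the Linear model and random batches are placeholders, and the demo assumes a Gaudi host with torch and the Habana PyTorch bridge (habana_frameworks) installed. The step callback is passed in as a parameter so that the loop itself can be exercised without Gaudi hardware.

```python
def run_inference(model, batches, mark_step):
    """Run `model` on every batch, calling `mark_step` after each iteration."""
    outputs = []
    for batch in batches:
        outputs.append(model(batch))
        # On Gaudi this callback is htcore.mark_step(), invoked once per
        # iteration so host work can overlap with earlier device work.
        mark_step()
    return outputs


def demo_on_hpu():
    """Hypothetical usage; requires a Gaudi host with the Habana bridge."""
    import torch
    import habana_frameworks.torch.core as htcore

    # Placeholder model and inputs, chosen only for illustration.
    model = torch.nn.Linear(16, 4).eval().to("hpu")
    batches = [torch.randn(8, 16).to("hpu") for _ in range(3)]

    with torch.no_grad():  # inference only, no autograd bookkeeping
        outputs = run_inference(model, batches, htcore.mark_step)
    return [out.to("cpu") for out in outputs]
```

Calling demo_on_hpu() on a Gaudi machine runs three batches through the placeholder model. On other machines, run_inference can still be called with any callable model and a no-op such as lambda: None in place of htcore.mark_step.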

However, if the application is latency sensitive, or the host time ends up greater than the device time because of a small batch size, the HPU Graphs feature can minimize this host time. Using HPU Graphs requires minor modifications to the script and comes with some limitations, but where it applies it reduces inference latency.
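As a rough sketch of what the "minor modifications" can look like, the example below wraps a model with ht.hpu.wrap_in_hpu_graph from the Habana PyTorch bridge. The helper names prepare_for_inference and demo_hpu_graph are hypothetical, the Linear model is a placeholder, and the exact HPU Graphs API and its limitations should be checked against the documentation for the installed release. The wrapper is passed in as an optional argument so that the same helper works with or without HPU Graphs.

```python
def prepare_for_inference(model, wrap_in_graph=None):
    """Put `model` in eval mode and optionally wrap it with a graph wrapper."""
    model.eval()
    if wrap_in_graph is not None:
        # On Gaudi, pass ht.hpu.wrap_in_hpu_graph here to enable HPU Graphs.
        model = wrap_in_graph(model)
    return model


def demo_hpu_graph():
    """Hypothetical usage; requires a Gaudi host with the Habana bridge."""
    import torch
    import habana_frameworks.torch as ht

    # Placeholder model, moved to the device before wrapping.
    model = torch.nn.Linear(16, 4).to("hpu")
    model = prepare_for_inference(model, ht.hpu.wrap_in_hpu_graph)

    with torch.no_grad():
        # A batch size of 1 is the kind of latency-bound case
        # that the text describes.
        out = model(torch.randn(1, 16).to("hpu"))
    return out.to("cpu")
```

Keeping the wrapper optional makes it easy to compare latency with and without HPU Graphs for a given model before committing to the change.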