Run Inference Using HPU Graphs

HPU Graphs capture a series of operations on an HPU stream and replay them. They require that device-side input and output addresses remain constant between invocations. For further details, see Stream APIs and HPU Graph APIs.
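To see why buffer addresses matter for replay, consider the following pure-Python toy model. It does not use the Intel Gaudi API: `ToyGraph`, `make_add_one`, `static_in`, and `static_out` are hypothetical names chosen for illustration. The recorded step holds references to specific buffers, so new inputs must be copied into the captured input buffer in place, and results always land in the same output buffer.

```python
from typing import Callable, List


class ToyGraph:
    """Toy stand-in for a captured graph: a fixed list of recorded steps."""

    def __init__(self) -> None:
        self._steps: List[Callable[[], None]] = []

    def capture(self, step: Callable[[], None]) -> None:
        # Record the step and run it once, as a capture pass would.
        self._steps.append(step)
        step()

    def replay(self) -> None:
        # Re-run the recorded steps against the same buffers.
        for step in self._steps:
            step()


def make_add_one(src: List[int], dst: List[int]) -> Callable[[], None]:
    # The returned step is bound to these exact list objects ("addresses").
    def step() -> None:
        for i, value in enumerate(src):
            dst[i] = value + 1  # write in place; dst keeps its identity
    return step


# Buffers allocated once, before capture.
static_in = [1, 1, 1, 1]
static_out = [0, 0, 0, 0]

graph = ToyGraph()
graph.capture(make_add_one(static_in, static_out))
first_result = list(static_out)

# Correct: copy new data INTO the captured input buffer (same object).
static_in[:] = [10, 20, 30, 40]
graph.replay()
second_result = list(static_out)

# Incorrect: a freshly allocated buffer is invisible to the recorded step.
unseen_in = [7, 7, 7, 7]
graph.replay()
stale_result = list(static_out)  # unchanged, still derived from static_in
```

In this sketch the last replay ignores `unseen_in` entirely, which mirrors why inputs to a captured graph are updated in place rather than replaced.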


  • HPU Graphs offer the best performance with minimal host overhead. However, their functionality is currently limited.

  • Only models that run completely on HPU have been tested. Models that contain CPU ops are not supported. If an op is not supported during HPU Graph capture, the following message appears: “… is not supported during HPU Graph capturing”.

  • HPU Graphs can only be used to capture and replay static graphs. Dynamic shapes are not supported.

  • Data-dependent dynamic control flow is not supported with HPU Graphs.

  • Capturing HPU Graphs on models containing in-place view updates is not supported.

  • Multi-card inference (DeepSpeed) with HPU Graphs is supported only when PT_HPU_ENABLE_LAZY_COLLECTIVES=true is set. For further details, refer to the Inference Using DeepSpeed Guide.

Please refer to the PyTorch Known Issues and Limitations section for a list of current limitations.
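The restriction on data-dependent control flow can be illustrated with a small pure-Python sketch. It is a conceptual model only, not the Intel Gaudi API; `capture_abs_scale`, `replay`, and `recorded_steps` are hypothetical names. A Python `if` on a data value is evaluated on the host once, while capturing, so only the branch taken at that moment is recorded.

```python
from typing import Callable, Dict, List

recorded_steps: List[Callable[[], None]] = []


def eager_abs_scale(buf: Dict[str, int]) -> None:
    # Eager execution: the branch is re-evaluated on every call.
    if buf["x"] > 0:
        buf["y"] = buf["x"] * 2
    else:
        buf["y"] = -buf["x"]


def capture_abs_scale(buf: Dict[str, int]) -> None:
    # The condition is checked once, here, during capture.
    if buf["x"] > 0:
        def step() -> None:
            buf["y"] = buf["x"] * 2
    else:
        def step() -> None:
            buf["y"] = -buf["x"]
    recorded_steps.append(step)  # only the chosen branch is recorded
    step()


def replay() -> None:
    for step in recorded_steps:
        step()


# Capture with a positive input: the "x * 2" branch is recorded.
graph_buf = {"x": 3, "y": 0}
capture_abs_scale(graph_buf)
captured_y = graph_buf["y"]

# Replay with a negative input: the recorded branch runs anyway.
graph_buf["x"] = -5
replay()
replayed_y = graph_buf["y"]

# Eager execution picks the other branch for the same input.
eager_buf = {"x": -5, "y": 0}
eager_abs_scale(eager_buf)
eager_y = eager_buf["y"]
```

Here the replayed result differs from the eager result for the same input, which is the kind of silent divergence that makes data-dependent flow unsuitable for capture.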

Reducing Host Overhead with HPU Graphs

Intel® Gaudi® performs best when running on graphs, because the Gaudi hardware is optimized for graph mode. Eager mode is supported but tends to be slower. Because PyTorch 1.x has no native graph support (torch.compile was introduced in PyTorch 2.0), the Intel Gaudi software relies on lazily accumulating the operations in a Python script and managing a cache. Without HPU Graphs, the Python interpreter encounters each operation and passes it to the PyTorch frontend, which then passes it to the Intel Gaudi backend. The Intel Gaudi software accumulates the ops and then hashes the series of ops to find the matching recipe in the cache. This process takes time, and because the host is slower than the device, a host bottleneck often occurs.

Using HPU Graphs allows you to avoid this overhead by managing the device-side command buffers associated with a sequence of operations. Instead of interpreting the PyTorch ops every time and checking whether the sequence has been encountered before, the recipe representing the series of operations is encapsulated in the HPU Graph. HPU Graphs minimize host time, so the workload becomes device-bound instead of leaving the device idle for extended periods while it waits for the host.
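The two dispatch paths can be contrasted with a toy host-side model. This is a sketch under stated assumptions, not the actual data structures of the Intel Gaudi software: `ToyBackend`, its counters, and the `recipe-N` strings are all hypothetical. It shows that the lazy path hashes the op sequence on every iteration, even on a cache hit, while the captured path resolves the recipe once and reuses the handle.

```python
import hashlib
from typing import Dict, List, Tuple

Op = Tuple[str, Tuple[int, ...]]  # (op name, input shape)


class ToyBackend:
    """Toy model of host-side dispatch with a recipe cache."""

    def __init__(self) -> None:
        self.recipe_cache: Dict[str, str] = {}
        self.hash_count = 0
        self.compile_count = 0
        self.replay_count = 0

    def run_lazy(self, ops: List[Op]) -> str:
        # Lazy path: hash the accumulated ops, then look up the recipe.
        self.hash_count += 1
        key = hashlib.sha256(repr(ops).encode()).hexdigest()
        if key not in self.recipe_cache:
            self.compile_count += 1  # cache miss: build a new recipe
            self.recipe_cache[key] = f"recipe-{len(self.recipe_cache)}"
        return self.recipe_cache[key]

    def capture(self, ops: List[Op]) -> str:
        # Capture path: resolve the recipe once and hand back a handle.
        return self.run_lazy(ops)

    def replay(self, handle: str) -> str:
        # Replay path: no op walking and no hashing, just reuse the handle.
        self.replay_count += 1
        return handle


ops: List[Op] = [("matmul", (8, 16)), ("relu", (8, 16)), ("add", (8, 16))]

lazy = ToyBackend()
for _ in range(100):
    lazy.run_lazy(ops)

graphed = ToyBackend()
handle = graphed.capture(ops)
for _ in range(100):
    graphed.replay(handle)
```

After both loops, `lazy.hash_count` is 100 while `graphed.hash_count` is 1. Both compile the recipe only once, so the difference is purely per-iteration host work.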

The benefits of running inference with HPU Graphs, compared to running without them, can be seen in the Profiler trace. The trace shows that the device-side work remains the same in both modes, but the dispatch latency from the host is greatly reduced with HPU Graphs.

You can find additional details on identifying and reducing host overhead in the Optimize Inference on PyTorch section.

Using HPU Graphs

Follow the steps in Importing PyTorch Models Manually to prepare the PyTorch model to run on Gaudi. Adding mark_step is not required with HPU Graphs as it is handled implicitly.

To run inference using HPU Graphs, create an example object to show the capture and replay of HPU Graphs:

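import torch
import habana_frameworks.torch as ht

device = torch.device('hpu')

class GraphTest(object):
    def __init__(self, size: int) -> None:
        self.g = ht.hpu.HPUGraph()
        self.s = ht.hpu.Stream()
        self.size = size

    # The following function shows the steps to implement the capture and replay on HPU Graphs
    def wrap_func(self, first: bool) -> None:
        if first:
            # First call: record the ops on the dedicated stream
            with ht.hpu.stream(self.s):
                self.g.capture_begin()
                # the following code snippet is for demonstration
                a = torch.full((self.size,), 1, device=device)
                b = a
                b = b + 1
                self.g.capture_end()
        else:
            # Later calls: replay the previously captured graph
            self.g.replay()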

Alternatively, use ht.hpu.wrap_in_hpu_graph to wrap the module forward function with HPU Graphs. This wrapper captures, caches, and replays the graph. To release cached output tensor memory after every replay, use ht.hpu.wrap_in_hpu_graph(model, disable_tensor_cache=True).

import torch
import habana_frameworks.torch as ht

# GetModel() is a placeholder for your own model constructor
model = GetModel()
model = ht.hpu.wrap_in_hpu_graph(model)
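A minimal end-to-end sketch of the wrapper in use might look like the following. The single `torch.nn.Linear` layer, the tensor shapes, and the helper names `hpu_stack_installed` and `run_wrapped_inference` are assumptions made for illustration. The function returns `None` when PyTorch or the Intel Gaudi PyTorch bridge is not installed, so the snippet can be imported safely on any machine.

```python
import importlib.util


def hpu_stack_installed() -> bool:
    # True only when both PyTorch and habana_frameworks are importable.
    return all(
        importlib.util.find_spec(name) is not None
        for name in ("torch", "habana_frameworks")
    )


def run_wrapped_inference(iterations: int = 3):
    """Run a small wrapped module a few times; return its output shape or None."""
    if not hpu_stack_installed():
        return None  # nothing to run on this machine

    import torch
    import habana_frameworks.torch as ht

    device = torch.device("hpu")
    # Stand-in model (assumption): one Linear layer with a fixed input shape.
    model = torch.nn.Linear(8, 4).eval().to(device)
    model = ht.hpu.wrap_in_hpu_graph(model)

    # Keep the input shape identical on every call so the graph stays static.
    x = torch.ones(2, 8).to(device)
    with torch.no_grad():
        for _ in range(iterations):
            out = model(x)  # the wrapper handles capture, caching, and replay
    return tuple(out.shape)


shape = run_wrapped_inference()
```

Passing `disable_tensor_cache=True` to the wrapper, as described above, would be the variation to try if cached output tensors use too much memory.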