Run Inference Using HPU Graphs

HPU Graphs can capture a series of operations using an HPU stream and replay them. They require the device-side input and output addresses of these operations to remain constant between invocations. For further details on Stream APIs and HPU Graph APIs, refer to Stream APIs and HPU Graph APIs.

Note

  • HPU Graphs offer the best performance with minimal host overhead. However, their functionality is currently limited.

  • Only models that run completely on HPU have been tested. Models that contain CPU ops are not supported. During HPU Graph capture, if an op is not supported, the following message appears: “… is not supported during HPU Graph capturing”.

  • HPU Graphs can only be used to capture and replay static graphs. Dynamic shapes are not supported.

  • Data-dependent dynamic control flow is not supported with HPU Graphs.

  • Capturing HPU Graphs on models containing in-place view updates is not supported.

  • Multi-card support for inference (DeepSpeed) using HPU Graphs is applicable only with PT_HPU_ENABLE_LAZY_COLLECTIVES=true. For further details, refer to Inference Using DeepSpeed Guide.

Please refer to the PyTorch Known Issues and Limitations section for a list of current limitations.

Reducing Host Overhead with HPU Graphs

Intel® Gaudi® performs best when running on graphs because the Gaudi hardware contains optimized support for graph mode. Eager mode is supported, but it tends to be slower since the hardware optimizations target graph execution. Because PyTorch 1.x does not have native graph support (torch.compile was introduced in PyTorch 2.x), the Intel Gaudi software relies on lazily accumulating operations from the Python script and managing a recipe cache. Without HPU Graphs, the Python interpreter encounters each operation and passes it to the PyTorch frontend, which then passes it to the Intel Gaudi backend. The Intel Gaudi software accumulates the ops and ultimately hashes the accumulated sequence to find the corresponding recipe in the cache. This process takes time, and since the host is slow compared to the device, you often end up with a host bottleneck.

Using HPU Graphs avoids this overhead by managing the device-side command buffers associated with a sequence of operations. Instead of interpreting the PyTorch ops every time and re-recognizing a previously seen sequence, the recipe representing the series of operations is encapsulated in the HPU Graph. HPU Graphs minimize host time, which means execution becomes device-bound rather than leaving the device idle for extended periods.

The benefits of running inference with HPU Graphs versus without them can be seen in the Profiler trace: the device-side work remains constant in both modes, but the dispatch latency from the host is greatly reduced with HPU Graphs.
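For a quick code-level check that complements the trace, you can measure average iteration latency around the replayed calls. The sketch below is illustrative only; model and x stand in for your own wrapped model and a static-shape input, the iteration count is arbitrary, and ht.hpu.synchronize() is called so the host clock stops only after the device work has completed:

import time
import torch
import habana_frameworks.torch as ht

def average_latency(model, x, iters=100):
    # Warm-up call: the first invocation captures/compiles the graph and is excluded from timing
    with torch.no_grad():
        model(x)
    ht.hpu.synchronize()

    start = time.perf_counter()
    with torch.no_grad():
        for _ in range(iters):
            model(x)
    ht.hpu.synchronize()  # wait for all replayed iterations to finish on the device
    return (time.perf_counter() - start) / iters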

You can find additional details on identifying and reducing host overhead in the Optimize Inference on PyTorch section.

Using HPU Graphs

Follow the steps in Importing PyTorch Models Manually to prepare the PyTorch model to run on Gaudi. Adding mark_step is not required with HPU Graphs as it is handled implicitly.
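As a minimal sketch of that preparation, assuming MyModel is a placeholder for your own module, the model is typically moved to the hpu device after loading the Intel Gaudi PyTorch bridge:

import torch
import habana_frameworks.torch.core as htcore  # loads the Intel Gaudi PyTorch bridge

model = MyModel()               # placeholder for your own model definition
model = model.eval().to("hpu")  # move the model to the Gaudi device for inference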

To run inference using HPU Graphs, you can capture and replay them explicitly. The following example class shows the capture and replay steps:

import torch
import habana_frameworks.torch as ht

device = torch.device('hpu')

class GraphTest(object):
    def __init__(self, size: int) -> None:
        self.g = ht.hpu.HPUGraph()
        self.s = ht.hpu.Stream()
        self.size = size

    # The following function shows the steps to implement the capture and replay of HPU Graphs
    def wrap_func(self, first: bool) -> None:
        if first:
            with ht.hpu.stream(self.s):
                self.g.capture_begin()
                # The following code snippet is for demonstration
                a = torch.full((self.size,), 1, device=device)
                b = a
                b = b + 1
                self.g.capture_end()
        else:
            self.g.replay()
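For illustration, the class above could be driven as follows; the size and iteration count are arbitrary choices:

graph_test = GraphTest(size=1000)
graph_test.wrap_func(first=True)        # first call captures the graph on the HPU stream
for _ in range(10):
    graph_test.wrap_func(first=False)   # subsequent calls replay the captured graph
ht.hpu.synchronize()                    # ensure the replayed work has completed on the device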

Alternatively, use ht.hpu.wrap_in_hpu_graph to wrap the module's forward function with HPU Graphs. This wrapper captures, caches, and replays the graph. ht.hpu.wrap_in_hpu_graph(model, disable_tensor_cache=True) can be used to release cached output tensor memory after every replay.

import torch
import habana_frameworks.torch as ht

model = GetModel()  # GetModel() is a placeholder for code that builds your model
model = ht.hpu.wrap_in_hpu_graph(model)
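As a usage sketch, assuming GetModel() returns a module and the input is an image-like tensor of fixed shape, inference with the wrapped model might look like this; the first call captures the graph and later calls replay it:

import torch
import habana_frameworks.torch as ht

model = GetModel().eval().to("hpu")      # placeholder model, moved to the Gaudi device
model = ht.hpu.wrap_in_hpu_graph(model)

# The input must keep a constant shape because HPU Graphs only support static shapes
x = torch.randn(8, 3, 224, 224, device="hpu")

with torch.no_grad():
    out = model(x)    # first call captures and caches the graph
    out = model(x)    # subsequent calls replay the cached graph
ht.hpu.synchronize()  # wait for the device to finish before reading the results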