Inference on Gaudi

This section describes preliminary inference capabilities on Gaudi. Some capabilities demonstrate functionality only and are currently not fully optimized. Support for inference capabilities and TorchServe will be expanded in upcoming releases.

There are two methods for running inference:

  • Run Inference Using Native PyTorch

  • Run Inference Using HPU Graphs

Note

Please refer to the PyTorch Known Issues and Limitations section for a list of current limitations.

Run Inference Using Native PyTorch

Follow the steps in the PyTorch Migration Guide to prepare the PyTorch model to run on Gaudi.

Use model.eval Mode

Run the forward pass (model inference) using model.eval mode. See the example below.

  1. Import the modules for PyTorch and HPU support for PyTorch:

import torch
import habana_frameworks.torch as ht
import habana_frameworks.torch.core as htcore

  2. Create the model and move it to the HPU:

# The following example uses an untrained model for demonstration purposes only.
# For inference, load a pretrained model or a checkpoint (see the Saving and Loading Models tutorial: https://pytorch.org/tutorials/beginner/saving_loading_models.html)
device = torch.device('hpu')
in_c, out_c = 3, 64
k_size = 7
stride = 2
conv = torch.nn.Conv2d(in_c, out_c, kernel_size=k_size, stride=stride, bias=True)
bn = torch.nn.BatchNorm2d(out_c, eps=0.001)
relu = torch.nn.ReLU()
model = torch.nn.Sequential(conv, bn, relu)
# set the model to evaluation mode for inference
model.eval()
# place the model on HPU
model = model.to(device)
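
As the comment above notes, a real inference run loads trained weights rather than using a freshly constructed model. The following is a minimal sketch, assuming a hypothetical checkpoint file model_checkpoint.pt that stores this model's state_dict:

# Hypothetical checkpoint path, shown for illustration only
checkpoint = torch.load("model_checkpoint.pt", map_location=torch.device('cpu'))
model.load_state_dict(checkpoint)
model.eval()
model = model.to(device)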

  3. Create the inputs and move them to the HPU to run model inference:

# Create inputs and move them to the HPU
N, H, W = 256, 224, 224
input = torch.randn((N,in_c,H,W),dtype=torch.float)
input_hpu = input.to(device)
# invoke the model
output = model(input_hpu)
# in lazy mode execution, mark_step() must be added after model inference
htcore.mark_step()

Note

This method is recommended for running inference on Gaudi as it supports most models. However, throughput may not be optimal if the output is copied back to the host in every iteration.

V-Diffusion is an example of a PyTorch model that runs inference on Gaudi using model.eval mode.
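
If per-iteration host copies limit throughput, one option is to keep the outputs on the HPU during the loop and copy them back to the host only once at the end. The following is a minimal sketch that reuses the model and shapes from the example above; the number of iterations is an arbitrary assumption:

outputs = []
with torch.no_grad():
    for _ in range(10):
        batch = torch.randn((N, in_c, H, W), dtype=torch.float).to(device)
        out = model(batch)
        # in lazy mode execution, mark_step() must be added after model inference
        htcore.mark_step()
        # keep the result on the HPU instead of copying it to the host here
        outputs.append(out)
# copy all results back to the host once, after the loop
outputs_cpu = [o.cpu() for o in outputs]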

Use torch.jit.trace Mode

Save and load models in JIT trace format using torch.jit.trace mode. For further details on the JIT format, refer to the TorchScript page.

  1. Import the modules for PyTorch and HPU support for PyTorch:

import torch
import habana_frameworks.torch as ht
import habana_frameworks.torch.core as htcore

  2. Create the model and move it to the HPU:

device = torch.device('hpu')
in_c, out_c = 3, 64
k_size = 7
stride = 2
conv = torch.nn.Conv2d(in_c, out_c, kernel_size=k_size, stride=stride, bias=True)
bn = torch.nn.BatchNorm2d(out_c, eps=0.001)
relu = torch.nn.ReLU()
model = torch.nn.Sequential(conv, bn, relu)
model.eval()
model = model.to(device)

  3. Save the model using torch.jit.trace:

N, H, W = 256, 224, 224
model_input = torch.randn((N,in_c,H,W), dtype=torch.float).to(device)
with torch.no_grad():
    trace_model = torch.jit.trace(model, (model_input), check_trace=False, strict=False)
    # Saving an HPU model with torch.jit.save is not currently supported.
    # This will be fixed in a future release.
    trace_model = trace_model.to(torch.device('cpu'))
    trace_model.save("trace_model.pt")

  4. Load the model using torch.jit.load:

# Specifying torch.device('hpu') as the map_location in torch.jit.load is not currently supported.
# This will be fixed in a future release.
model = torch.jit.load("trace_model.pt", map_location=torch.device('cpu'))
model = model.to(device)

  5. Create the inputs and move them to the HPU to run model inference:

# Create inputs and move them to the HPU
N, H, W = 256, 224, 224
input = torch.randn((N,in_c,H,W),dtype=torch.float)
input_hpu = input.to(device)
# invoke the model
output = model(input_hpu)
# in lazy mode execution, mark_step() must be added after model inference
htcore.mark_step()

Note

JIT format is functionally correct but not yet optimized. This will be supported in a future release.

Run Inference Using HPU Graphs

Note

Running inference using HPU Graphs is currently experimental.

HPU Graphs can capture a series of operations using an HPU stream and replay them. They require that the device-side input and output addresses remain constant between invocations. For further details, refer to Stream APIs and HPU Graph APIs.

Follow the steps in the PyTorch Migration Guide to prepare the PyTorch model to run on Gaudi. Adding mark_step is not required with HPU Graphs as it is handled implicitly.

To run inference using HPU Graphs, perform the following:

  1. Import the modules for PyTorch and HPU support for PyTorch:

import torch
import habana_frameworks.torch as ht

  2. Create an example object to show the capture and replay of HPU Graphs:

device = torch.device('hpu')

class GraphTest(object):
    def __init__(self, size: int) -> None:
        self.g = ht.hpu.HPUGraph()
        self.s = ht.hpu.Stream()
        self.size = size

    # The following function shows the steps to implement capture and replay of HPU Graphs
    def wrap_func(self, first: bool) -> None:
        if first:
            with ht.hpu.stream(self.s):
                self.g.capture_begin()
                # the following code snippet is for demonstration only
                a = torch.full((self.size,), 1, device=device)
                b = a
                b = b + 1
                self.g.capture_end()
        else:
            self.g.replay()

The following is an example test that uses the GraphTest class created in the above steps.

def test_graph_capture_simple():
    # The following example shows how multiple HPU Graphs can be initialized
    gt1 = GraphTest(size=1000)
    gt2 = GraphTest(size=2000)
    gt3 = GraphTest(size=3000)
    for i in range(10):
        if i == 0:
            # HPU graphs capture is done in the first iteration
            gt1.wrap_func(True)
            gt2.wrap_func(True)
            gt3.wrap_func(True)
        else:
            # The replay of the captured HPU Graphs is done in subsequent iterations
            gt3.wrap_func(False)
            gt2.wrap_func(False)
            gt1.wrap_func(False)
    ht.hpu.synchronize()

if __name__ == "__main__":
    test_graph_capture_simple()
    print("test ran")

Note

  • HPU Graphs offer the best performance with minimal host overhead. However, their functionality is currently limited. Only models that run completely on HPU have been tested. Models that run partially on CPU may not work.

  • HPU Graphs can only be used to capture and replay static graphs (see the sketch after this list).

  • Multi-graph and multi-model support is currently experimental.

  • Inference using HPU Graphs has been validated only on a single card.
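
To illustrate the static-graph note above, the following is a minimal sketch of applying the same capture-and-replay pattern to a model forward pass. It reuses only the APIs shown in this section; the model, shapes, and the use of copy_ to refresh the captured input are illustrative assumptions, not a validated recipe:

import torch
import habana_frameworks.torch as ht

device = torch.device('hpu')

# Illustrative model and static input; shapes are assumptions for this sketch
model = torch.nn.Linear(128, 64).eval().to(device)
static_input = torch.zeros((32, 128), dtype=torch.float, device=device)

g = ht.hpu.HPUGraph()
s = ht.hpu.Stream()

with torch.no_grad():
    # capture the forward pass once on an HPU stream
    with ht.hpu.stream(s):
        g.capture_begin()
        static_output = model(static_input)
        g.capture_end()

    # replay with new data: copy into the captured input tensor so its
    # device address stays constant, then replay the graph
    for _ in range(3):
        new_batch = torch.randn((32, 128), dtype=torch.float)
        static_input.copy_(new_batch.to(device))
        g.replay()
        result = static_output.cpu()
ht.hpu.synchronize()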