Getting Started with Inference on Intel Gaudi

This guide provides simple steps for preparing a PyTorch model to run inference on the Intel® Gaudi® AI accelerator.

Make sure to install the PyTorch packages provided by Intel Gaudi. To set up the PyTorch environment, refer to the Installation Guide. The supported PyTorch versions are listed in the Support Matrix.
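
Before running the examples, it can help to confirm that the required packages are importable in the active environment. The sketch below is a minimal check, assuming the top-level module names torch and habana_frameworks (the names imported by the examples in this guide). The helper name packages_available is hypothetical, and the check does not verify versions against the Support Matrix.

```python
import importlib.util


def packages_available(names):
    """Return a dict mapping each top-level module name to whether it can be found."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


if __name__ == "__main__":
    # Module names taken from the imports used in this guide's examples.
    for name, found in packages_available(["torch", "habana_frameworks"]).items():
        print(f"{name}: {'found' if found else 'missing'}")
```

If either module is reported as missing, revisit the Installation Guide before continuing.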

Once you are ready to migrate PyTorch models from a GPU-based architecture to Gaudi, you can use the GPU Migration Toolkit. The toolkit automates the migration by replacing Python API calls that depend on GPU libraries with Gaudi-specific API calls, so you can run your model with minimal modifications.
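
As a rough illustration of how the toolkit is typically switched on, the sketch below attempts to import it and reports whether the import succeeded. The module path habana_frameworks.torch.gpu_migration is an assumption based on the GPU Migration Toolkit documentation and should be verified against the release you have installed. The helper name enable_gpu_migration is hypothetical.

```python
import importlib


def enable_gpu_migration():
    """Try to import the GPU Migration Toolkit module; return True on success."""
    try:
        # Assumption: module path as described in the GPU Migration Toolkit docs.
        importlib.import_module("habana_frameworks.torch.gpu_migration")
        return True
    except ImportError:
        # The Intel Gaudi PyTorch packages are not installed in this environment.
        return False


print("GPU migration enabled:", enable_gpu_migration())
```

On a machine without the Intel Gaudi packages the function simply returns False, so the check is safe to run anywhere.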

Creating a Simple Inference Example Using model.eval

The following sections provide two inference examples: one using Eager mode with torch.compile and one using Lazy mode. For further details, refer to PyTorch Gaudi Theory of Operations.

Example with Eager Mode and torch.compile

Follow the steps below to run the inference example:

  1. Download the pre-trained weights for the model:

    wget https://vault.habana.ai/artifactory/misc/inference/mnist/mnist-epoch_20.pth
    
  2. Create a file named example_inference.py with the code below:

     1  import os
     2  import sys
     3  import torch
     4  import time
     5  from torch.utils.data import DataLoader
     6  from torchvision import transforms, datasets
     7  import torch.nn as nn
     8  import torch.nn.functional as F
     9  import habana_frameworks.torch.core as htcore
    10
    11  class Net(nn.Module):
    12      def __init__(self):
    13          super(Net, self).__init__()
    14          self.fc1   = nn.Linear(784, 256)
    15          self.fc2   = nn.Linear(256, 64)
    16          self.fc3   = nn.Linear(64, 10)
    17      def forward(self, x):
    18          out = x.view(-1,28*28)
    19          out = F.relu(self.fc1(out))
    20          out = F.relu(self.fc2(out))
    21          out = self.fc3(out)
    22          out = F.log_softmax(out, dim=1)
    23          return out
    24
    25  model = Net()
    26  checkpoint = torch.load('mnist-epoch_20.pth')
    27  model.load_state_dict(checkpoint)
    28
    29  model = model.eval()
    30  model = model.to("hpu")
    31
    32  model = torch.compile(model,backend="hpu_backend")
    33
    34  transform=transforms.Compose([
    35          transforms.ToTensor(),
    36          transforms.Normalize((0.1307,), (0.3081,))])
    37
    38  data_path = './data'
    39  test_dataset = datasets.MNIST(data_path, train=False, download=True, transform=transform)
    40  test_loader = torch.utils.data.DataLoader(test_dataset, batch_size=32)
    41
    42  correct = 0
    43  with torch.no_grad():
    44      for data, label in test_loader:
    45          data = data.to("hpu")
    46          label = label.to("hpu")
    47          output = model(data)
    48          correct += output.argmax(1).eq(label).sum().item()
    49
    50  accuracy = correct / len(test_loader.dataset) * 100
    51  print('Inference with torch.compile Completed. Accuracy: {:.2f}%'.format(accuracy))
    

The example_inference.py script is a basic PyTorch example that uses torch.compile. The Intel Gaudi-specific lines are explained below:

  • Line 9 - Import habana_frameworks.torch.core:

    import habana_frameworks.torch.core as htcore
    
  • Line 30 - Target the Gaudi device for model execution:

    model = model.to("hpu")
    
  • Line 32 - Wrap the model with the torch.compile function and set the backend to hpu_backend:

    model = torch.compile(model,backend="hpu_backend")
    

    If you want to run the model in Eager mode without torch.compile, comment out this line.

  • Lines 45 and 46 - Move each input batch and its labels to the Gaudi device:

    data = data.to("hpu")
    label = label.to("hpu")
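
Line 50 of the script divides the number of correct predictions by len(test_loader.dataset) rather than by the number of batches multiplied by the batch size. The pure-Python sketch below shows why this matters when the dataset size is not a multiple of the batch size. It needs neither PyTorch nor a Gaudi device. The helper batch_sizes is hypothetical, and it assumes a loader that keeps the short final batch, which is the DataLoader default.

```python
def batch_sizes(num_samples, batch_size):
    """Sizes of the batches a non-dropping loader yields for num_samples items."""
    full, remainder = divmod(num_samples, batch_size)
    return [batch_size] * full + ([remainder] if remainder else [])


num_samples, batch_size = 10_000, 32  # MNIST test set size and the batch size used above
sizes = batch_sizes(num_samples, batch_size)

num_batches = len(sizes)                 # what len(test_loader) would report
padded_total = num_batches * batch_size  # over-counts the short final batch
print(num_batches, sizes[-1], padded_total)  # 313 16 10016
```

Dividing by 10016 instead of 10000 would slightly understate the accuracy, so the dataset length is the safer denominator.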
    

Executing the Example

After creating example_inference.py, perform the following:

  1. Set PYTHON to the Python executable:

    export PYTHON=/usr/bin/python3.10
    

    Note

    The Python version depends on the operating system. Refer to the Support Matrix for a full list of supported operating systems and Python versions.

  2. Execute example_inference.py:

    $PYTHON example_inference.py
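
To confirm which interpreter and version $PYTHON resolves to before comparing it with the Support Matrix, a short script along the following lines can be run with $PYTHON. This is a minimal sketch that uses only the standard library. The helper name interpreter_summary is hypothetical.

```python
import sys


def interpreter_summary():
    """Return the running interpreter's path and its major.minor version string."""
    version = f"{sys.version_info.major}.{sys.version_info.minor}"
    return sys.executable, version


path, version = interpreter_summary()
print(f"Using {path} (Python {version})")
```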
    

Example with Lazy Mode

Follow the steps below to run the inference example:

  1. Download the pre-trained weights for the model:

    wget https://vault.habana.ai/artifactory/misc/inference/mnist/mnist-epoch_20.pth
    
  2. Create a file named example_inference_lazy.py with the code below:

     1  import os
     2  import sys
     3  import torch
     4  import time
     5  import habana_frameworks.torch.core as htcore
     6  from torch.utils.data import DataLoader
     7  from torchvision import transforms, datasets
     8  import torch.nn as nn
     9  import torch.nn.functional as F
    10
    11  class Net(nn.Module):
    12      def __init__(self):
    13          super(Net, self).__init__()
    14          self.fc1   = nn.Linear(784, 256)
    15          self.fc2   = nn.Linear(256, 64)
    16          self.fc3   = nn.Linear(64, 10)
    17      def forward(self, x):
    18          out = x.view(-1,28*28)
    19          out = F.relu(self.fc1(out))
    20          out = F.relu(self.fc2(out))
    21          out = self.fc3(out)
    22          out = F.log_softmax(out, dim=1)
    23          return out
    24
    25  model = Net()
    26  checkpoint = torch.load('mnist-epoch_20.pth')
    27  model.load_state_dict(checkpoint)
    28
    29  model = model.eval()
    30  model = model.to("hpu")
    31
    32  transform=transforms.Compose([
    33          transforms.ToTensor(),
    34          transforms.Normalize((0.1307,), (0.3081,))])
    35
    36  data_path = './data'
    37  test_kwargs = {'batch_size': 32}
    38  dataset1 = datasets.MNIST(data_path, train=False, download=True, transform=transform)
    39  test_loader = torch.utils.data.DataLoader(dataset1,**test_kwargs)
    40
    41  correct = 0
    42  for batch_idx, (data, label) in enumerate(test_loader):
    43      data = data.to("hpu")
    44      output = model(data)
    45      htcore.mark_step()
    46      correct += output.max(1)[1].cpu().eq(label).sum().item()
    47
    48  print('Accuracy: {:.2f}%'.format(100. * correct / len(test_loader.dataset)))
    

The example_inference_lazy.py script is a basic PyTorch example. The Intel Gaudi-specific lines are explained below:

  • Line 5 - Import habana_frameworks.torch.core:

    import habana_frameworks.torch.core as htcore
    
  • Line 30 - Target the Gaudi device for the model execution:

    model = model.to("hpu")
    
  • Line 43 - Move each input batch to the Gaudi device:

    data = data.to("hpu")
    
  • Line 45 - In Lazy mode, htcore.mark_step() must be added after the inference step output = model(data):

    htcore.mark_step()
    

    htcore.mark_step() marks the end of an inference iteration so that graph accumulation stops at that point. If htcore.mark_step() is not invoked, the operations from all iterations are merged into one graph, which increases memory consumption and prevents host and device time from overlapping. For further details on mark_step, refer to the mark_step section.
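
The accumulation behavior described above can be pictured with a toy model. The sketch below is plain Python and is not the Gaudi implementation: ToyLazyDevice and run are hypothetical names, operations are only recorded as strings, and other events that may trigger execution in a real lazy runtime are ignored. It only illustrates how the pending graph keeps growing across iterations when no step boundary is marked.

```python
class ToyLazyDevice:
    """Toy stand-in for lazy execution: ops are queued until mark_step() is called."""

    def __init__(self):
        self.pending = []          # ops accumulated into the current graph
        self.executed_graphs = []  # size of each graph that was flushed

    def op(self, name):
        self.pending.append(name)

    def mark_step(self):
        # Close the current graph and start a fresh one.
        if self.pending:
            self.executed_graphs.append(len(self.pending))
            self.pending = []


def run(iterations, call_mark_step):
    dev = ToyLazyDevice()
    for _ in range(iterations):
        for name in ("fc1", "relu", "fc2", "relu", "fc3", "log_softmax"):
            dev.op(name)
        if call_mark_step:
            dev.mark_step()
    return dev


with_step = run(3, call_mark_step=True)
without_step = run(3, call_mark_step=False)
print(with_step.executed_graphs, len(with_step.pending))        # [6, 6, 6] 0
print(without_step.executed_graphs, len(without_step.pending))  # [] 18
```

With a step boundary after each iteration, three small graphs of equal size are produced. Without it, all 18 operations pile up in a single pending graph.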

Executing the Example

After creating example_inference_lazy.py, perform the following:

  1. Set PYTHON to the Python executable:

    export PYTHON=/usr/bin/python3.10
    

    Note

    The Python version depends on the operating system. Refer to the Support Matrix for a full list of supported operating systems and Python versions.

  2. Execute example_inference_lazy.py:

    PT_HPU_LAZY_MODE=1 $PYTHON example_inference_lazy.py
    

    To use Lazy mode, the PT_HPU_LAZY_MODE=1 environment variable must be set, because Eager mode with torch.compile is the default mode. Refer to Runtime Environment Variables for more information.
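
A script can report which mode was requested by inspecting the same variable. The sketch below only mirrors the rule stated above (a value of 1 requests Lazy mode, anything else leaves the Eager default). The actual mode is selected by the Intel Gaudi software, not by this code. The helper name requested_mode is hypothetical.

```python
import os


def requested_mode(environ=None):
    """Report which mode the PT_HPU_LAZY_MODE setting asks for: 'lazy' or 'eager'."""
    environ = os.environ if environ is None else environ
    return "lazy" if environ.get("PT_HPU_LAZY_MODE") == "1" else "eager"


print("Requested mode:", requested_mode())
```

Logging this value at startup makes it easier to spot a run that was launched without the intended environment variable.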

Using torch.jit.trace Mode

Save and load models in JIT trace format using torch.jit.trace. For further details on the JIT format, refer to the TorchScript page.

  1. Create the model and move it to the HPU:

    import torch
    import habana_frameworks.torch.core as htcore

    device = torch.device('hpu')
    in_c, out_c = 3, 64
    k_size = 7
    stride = 2
    conv = torch.nn.Conv2d(in_c, out_c, kernel_size=k_size, stride=stride, bias=True)
    bn = torch.nn.BatchNorm2d(out_c, eps=0.001)
    relu = torch.nn.ReLU()
    model = torch.nn.Sequential(conv, bn, relu)
    model.eval()
    model = model.to(device)
    
  2. Trace the model with torch.jit.trace and save it:

    N, H, W = 256, 224, 224
    model_input = torch.randn((N,in_c,H,W), dtype=torch.float).to(device)
    with torch.no_grad():
        trace_model = torch.jit.trace(model, model_input, check_trace=False, strict=False)
        # Save the traced HPU model.
        trace_model.save("trace_model.pt")
    
  3. Load the model using torch.jit.load:

    # Load the model directly to HPU.
    model = torch.jit.load("trace_model.pt", map_location=torch.device('hpu'))
    
  4. Create the inputs and move them to the HPU to run model inference:

    # Create inputs and move them to the HPU.
    N, H, W = 256, 224, 224
    model_input = torch.randn((N,in_c,H,W),dtype=torch.float)
    input_hpu = model_input.to(device)
    # Invoke the model.
    output = model(input_hpu)
    # In Lazy mode execution, mark_step() must be added after model inference.
    htcore.mark_step()
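
To check the trace, save, and load round trip without a Gaudi device, the same flow can be run on the CPU with a smaller model. The sketch below is an illustration under that assumption, not part of the Gaudi workflow: it uses an in-memory buffer instead of trace_model.pt, and the layer sizes and input shape are arbitrary.

```python
import io

import torch

torch.manual_seed(0)

# Small Conv-BN-ReLU stack, in eval mode so BatchNorm uses its running statistics.
model = torch.nn.Sequential(
    torch.nn.Conv2d(3, 8, kernel_size=3, stride=2, bias=True),
    torch.nn.BatchNorm2d(8, eps=0.001),
    torch.nn.ReLU(),
).eval()

example = torch.randn(2, 3, 16, 16)
with torch.no_grad():
    expected = model(example)
    traced = torch.jit.trace(model, example)

# Round-trip through an in-memory buffer instead of a file on disk.
buffer = io.BytesIO()
torch.jit.save(traced, buffer)
buffer.seek(0)
reloaded = torch.jit.load(buffer, map_location="cpu")

with torch.no_grad():
    max_diff = (reloaded(example) - expected).abs().max().item()
print(f"max abs difference: {max_diff:.2e}")
```

A difference close to zero indicates that the reloaded module reproduces the original model's output for this input.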
    

Note

The JIT format is functionally correct but not yet optimized. Optimization will be supported in a future release.