Getting Started with Training on Intel Gaudi¶

This guide provides simple steps for preparing a PyTorch model to run training on Intel® Gaudi® AI accelerator.

Make sure to install the PyTorch packages provided by Intel Gaudi. To set up the PyTorch environment, refer to the Installation Guide. The supported PyTorch versions are listed in the Support Matrix.
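As a quick sanity check after installation, the sketch below reports which device a script would target. It assumes that the Intel Gaudi packages expose habana_frameworks.torch.hpu.is_available(), analogous to torch.cuda.is_available(). The select_device helper is a hypothetical name used for illustration, and the CPU fallback is only there so the snippet also runs on a machine without the Intel Gaudi software.

```python
import importlib.util


def select_device() -> str:
    """Return "hpu" when the Intel Gaudi PyTorch packages are usable, else "cpu"."""
    if importlib.util.find_spec("habana_frameworks") is None:
        return "cpu"  # Intel Gaudi PyTorch packages are not installed
    # Assumption: this module provides is_available(), mirroring torch.cuda.
    import habana_frameworks.torch.hpu as hthpu
    return "hpu" if hthpu.is_available() else "cpu"


print("Selected device:", select_device())
```

The returned string can be passed directly to torch.device(...).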

To migrate PyTorch models that run on GPU-based architectures to Gaudi, you can use the GPU Migration Toolkit. The toolkit automates the migration by replacing all Python API calls that depend on GPU libraries with Gaudi-specific API calls, so you can run your model with fewer modifications.
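A minimal sketch of how the toolkit might be switched on from inside a script, assuming it is activated by importing habana_frameworks.torch.gpu_migration before any GPU-specific code runs. Check the GPU Migration Toolkit documentation for the exact procedure in your release. The enable_gpu_migration helper is hypothetical, and the guard only exists so that the snippet is harmless on machines without the Intel Gaudi packages.

```python
import importlib.util


def enable_gpu_migration() -> bool:
    """Try to activate the GPU Migration Toolkit; return True on success."""
    if importlib.util.find_spec("habana_frameworks") is None:
        return False  # Intel Gaudi PyTorch packages are not installed
    # Assumption: importing this module is what activates the toolkit.
    import habana_frameworks.torch.gpu_migration  # noqa: F401
    return True


if enable_gpu_migration():
    print("GPU Migration Toolkit imported.")
else:
    print("Intel Gaudi packages not found; the script runs unmodified.")
```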

Note

Public PyTorch packages are supported only with Eager mode and torch.compile. For more details, see Public PyTorch Support.
Refer to the PyTorch Known Issues and Limitations section for a list of current limitations.

Creating a Simple Training Example¶

The following sections provide two training examples: one using Eager mode with torch.compile and one using Lazy mode. For more details, refer to PyTorch Gaudi Theory of Operations.

Note

For more detailed training examples, refer to the MNIST model or to the PyTorch Torchvision.

Example with Eager Mode and `torch.compile`¶

Create a file named torch_compile.py with the code below:

```
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torchvision import datasets, transforms
from torch.optim.lr_scheduler import StepLR
import os
import sys
import habana_frameworks.torch.core as htcore

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(1, 32, 3, 1)
        self.conv2 = nn.Conv2d(32, 64, 3, 1)
        self.dropout1 = nn.Dropout(0.25)
        self.dropout2 = nn.Dropout(0.5)
        self.fc1 = nn.Linear(7744, 128)
        self.fc2 = nn.Linear(128, 10)

    def forward(self, x):
        x = self.conv1(x)
        x = F.relu(x)
        x = self.conv2(x)
        x = F.relu(x)
        x = F.max_pool2d(x, 3, 2)
        x = self.dropout1(x)
        x = torch.flatten(x, 1)
        x = self.fc1(x)
        x = F.relu(x)
        x = self.dropout2(x)
        x = self.fc2(x)
        output = F.log_softmax(x, dim=1)
        return output

def train(model, device, train_loader, optimizer, epoch):
    model.train()
    model = torch.compile(model,backend="hpu_backend")

    def train_function(data, target):
        optimizer.zero_grad()
        output = model(data)
        loss = F.nll_loss(output, target)
        loss.backward()
        optimizer.step()
        return loss

    training_step = 0
    for batch_idx, (data, target) in enumerate(train_loader):
        data, target = data.to(device), target.to(device)
        loss = train_function(data, target)
        if batch_idx % 10 == 0:
            print('Train Epoch: {} [{}/{} ({:.0f}%)]\tLoss: {:.6f}'.format(
                epoch, batch_idx *
                len(data), len(train_loader.dataset),
                100. * batch_idx / len(train_loader), loss.item()))

def main():
    device = torch.device("hpu")

    model = Net().to(device)

    optimizer = optim.Adadelta(model.parameters(), lr=1.0)

    transform = transforms.Compose([
        transforms.ToTensor(),
        transforms.Normalize((0.1307,), (0.3081,))
    ])

    dataset = datasets.MNIST('./data', train=True, download=True, transform=transform)
    train_loader = torch.utils.data.DataLoader(dataset, batch_size=500)

    scheduler = StepLR(optimizer, step_size=1, gamma=0.7)
    for epoch in range(0,1):
        train(model, device, train_loader, optimizer, epoch)
        scheduler.step()

    print("torch.compile training completed.")

if __name__ == '__main__':
    main()
```

This is a simple PyTorch CNN model with torch.compile enabled. The Gaudi-specific lines are explained below.

Line 9 - Import the Intel Gaudi PyTorch framework:

```
import habana_frameworks.torch.core as htcore
```

Line 38 - Wrap the model in torch.compile function and set the backend to hpu_backend:
```
model = torch.compile(model,backend="hpu_backend")
```
If you want to run the model in Eager mode without torch.compile, comment out this line.

Line 59 - Target the Gaudi device:
```
device = torch.device("hpu")
```
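
Instead of commenting the torch.compile line out, the choice between plain Eager mode and torch.compile can be made at run time. The sketch below shows one way to do that. The maybe_compile helper and the USE_TORCH_COMPILE environment variable are hypothetical names introduced for this example and are not part of the Intel Gaudi software.

```python
import os


def maybe_compile(model, enabled: bool):
    """Return the model wrapped in torch.compile when enabled, else unchanged."""
    if not enabled:
        return model  # plain Eager mode: the model is used as-is
    import torch  # imported here so the helper has no import-time dependency
    return torch.compile(model, backend="hpu_backend")


# Hypothetical switch: USE_TORCH_COMPILE=0 selects plain Eager mode.
use_compile = os.environ.get("USE_TORCH_COMPILE", "1") == "1"
```

In torch_compile.py, the torch.compile call in train() would then read model = maybe_compile(model, use_compile).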

Executing the Example¶

After creating torch_compile.py, perform the following steps:

Set PYTHON to the Python executable:
```
export PYTHON=/usr/bin/python3.10
```
Note

The Python version depends on the operating system. Refer to the Support Matrix for a full list of supported operating systems and Python versions.

Execute torch_compile.py:
```
$PYTHON torch_compile.py
```

Example with Lazy Mode¶

Create a file named example.py with the code below:

```
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
import torchvision
import torchvision.transforms as transforms
import os

# Import Habana Torch Library
import habana_frameworks.torch.core as htcore

class SimpleModel(nn.Module):
    def __init__(self):
        super(SimpleModel, self).__init__()

        self.fc1   = nn.Linear(784, 256)
        self.fc2   = nn.Linear(256, 64)
        self.fc3   = nn.Linear(64, 10)

    def forward(self, x):

        out = x.view(-1,28*28)
        out = F.relu(self.fc1(out))
        out = F.relu(self.fc2(out))
        out = self.fc3(out)

        return out

def train(net,criterion,optimizer,trainloader,device):

    net.train()
    train_loss = 0.0
    correct = 0
    total = 0

    for batch_idx, (data, targets) in enumerate(trainloader):

        data, targets = data.to(device), targets.to(device)

        optimizer.zero_grad()
        outputs = net(data)
        loss = criterion(outputs, targets)

        loss.backward()

        # API call to trigger execution
        htcore.mark_step()

        optimizer.step()

        # API call to trigger execution
        htcore.mark_step()

        train_loss += loss.item()
        _, predicted = outputs.max(1)
        total += targets.size(0)
        correct += predicted.eq(targets).sum().item()

    train_loss = train_loss/(batch_idx+1)
    train_acc = 100.0*(correct/total)
    print("Training loss is {} and training accuracy is {}".format(train_loss,train_acc))

def test(net,criterion,testloader,device):

    net.eval()
    test_loss = 0
    correct = 0
    total = 0

    with torch.no_grad():

        for batch_idx, (data, targets) in enumerate(testloader):

            data, targets = data.to(device), targets.to(device)

            outputs = net(data)
            loss = criterion(outputs, targets)

            # API call to trigger execution
            htcore.mark_step()

            test_loss += loss.item()
            _, predicted = outputs.max(1)
            total += targets.size(0)
            correct += predicted.eq(targets).sum().item()

    test_loss = test_loss/(batch_idx+1)
    test_acc = 100.0*(correct/total)
    print("Testing loss is {} and testing accuracy is {}".format(test_loss,test_acc))

def main():

    epochs = 20
    batch_size = 128
    lr = 0.01
    milestones = [10,15]
    load_path = './data'
    save_path = './checkpoints'

    if(not os.path.exists(save_path)):
        os.makedirs(save_path)

    # Target the Gaudi HPU device
    device = torch.device("hpu")

    # Data
    transform = transforms.Compose([
        transforms.ToTensor(),
    ])

    trainset = torchvision.datasets.MNIST(root=load_path, train=True,
                                            download=True, transform=transform)
    trainloader = torch.utils.data.DataLoader(trainset, batch_size=batch_size,
                                            shuffle=True, num_workers=2)
    testset = torchvision.datasets.MNIST(root=load_path, train=False,
                                        download=True, transform=transform)
    testloader = torch.utils.data.DataLoader(testset, batch_size=batch_size,
                                            shuffle=False, num_workers=2)

    net = SimpleModel()
    net.to(device)

    criterion = nn.CrossEntropyLoss()
    optimizer = optim.SGD(net.parameters(), lr=lr,
                        momentum=0.9, weight_decay=5e-4)
    scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=milestones, gamma=0.1)

    for epoch in range(1, epochs+1):
        print("=====================================================================")
        print("Epoch : {}".format(epoch))
        train(net,criterion,optimizer,trainloader,device)
        test(net,criterion,testloader,device)

        torch.save(net.state_dict(), os.path.join(save_path,'epoch_{}.pth'.format(epoch)))

        scheduler.step()

if __name__ == '__main__':
    main()
```

example.py is a basic PyTorch training script. The Gaudi-specific lines are explained below.

Line 10 - Import habana_frameworks.torch.core:

```
import habana_frameworks.torch.core as htcore
```

Line 104 - Target the Gaudi device:
```
device = torch.device("hpu")
```
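Lines 47, 52, 80 - In Lazy mode, mark_step() must be added in all training scripts right after loss.backward() and optimizer.step() (lines 47 and 52). In this example it is also called in the evaluation loop after the loss is computed (line 80). For further details, refer to the mark_step section.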
```
htcore.mark_step()
```
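
The placement of mark_step() within one Lazy mode training iteration can be summarized in a single helper. This is a sketch, not Intel Gaudi API: lazy_train_step is a hypothetical function, and mark_step is passed in as a parameter (htcore.mark_step on Gaudi) so that the ordering is easy to see and to test without a device.

```python
def lazy_train_step(model, criterion, optimizer, data, targets, mark_step):
    """Run one training iteration, calling mark_step() at the two required points."""
    optimizer.zero_grad()
    loss = criterion(model(data), targets)
    loss.backward()
    mark_step()  # trigger execution right after the backward pass
    optimizer.step()
    mark_step()  # trigger execution right after the optimizer update
    return loss
```

On Gaudi, the call inside the training loop would look like lazy_train_step(net, criterion, optimizer, data, targets, htcore.mark_step).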

Executing the Example¶

After creating example.py, perform the following steps:

Set PYTHON to the Python executable:
```
export PYTHON=/usr/bin/python3.10
```
Note

The Python version depends on the operating system. Refer to the Support Matrix for a full list of supported operating systems and Python versions.

Execute example.py:
```
PT_HPU_LAZY_MODE=1 $PYTHON example.py
```
To use Lazy mode, the PT_HPU_LAZY_MODE=1 environment variable must be set since Eager mode with torch.compile is the default mode. Refer to Runtime Environment Variables for more information.
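
If the script is launched from another Python program rather than from the shell, the same variable can be passed through the child process environment. The following is a minimal sketch using only the standard library. lazy_mode_env and run_lazy are hypothetical helpers, and run_lazy is not called here because it needs example.py and a Gaudi device.

```python
import os
import subprocess
import sys


def lazy_mode_env(base=None):
    """Return a copy of the environment with Lazy mode requested."""
    env = dict(os.environ if base is None else base)
    env["PT_HPU_LAZY_MODE"] = "1"
    return env


def run_lazy(script):
    """Equivalent to: PT_HPU_LAZY_MODE=1 <python> <script>."""
    # sys.executable is the current interpreter; substitute another path if needed.
    return subprocess.run([sys.executable, script], env=lazy_mode_env(), check=True)
```

Calling run_lazy("example.py") would start the training script in Lazy mode and raise an exception if it exits with an error.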

When example.py runs, the following should appear as part of the output:

```
Epoch 1/5
469/469 [==============================] - 1s 3ms/step - loss: 1.2647 - accuracy: 0.7208
Epoch 2/5
469/469 [==============================] - 1s 2ms/step - loss: 0.7113 - accuracy: 0.8433
Epoch 3/5
469/469 [==============================] - 1s 2ms/step - loss: 0.5845 - accuracy: 0.8606
Epoch 4/5
469/469 [==============================] - 1s 2ms/step - loss: 0.5237 - accuracy: 0.8688
Epoch 5/5
469/469 [==============================] - 1s 2ms/step - loss: 0.4865 - accuracy: 0.8749
313/313 [==============================] - 1s 2ms/step - loss: 0.4482 - accuracy: 0.8869
```

Because the first iteration includes graph compilation time, it takes longer to run than later iterations. The software stack compiles the graph and saves the recipe to cache, so no recompilation is needed during training unless the graph changes or a new graph is introduced. Graph compilation typically happens at the beginning of training and at the beginning of evaluation.
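
To observe the compilation overhead described above, per-iteration wall-clock times can be recorded and the first iteration compared with the rest. The helpers below are a hypothetical sketch using only the standard library. step_fn stands for one training iteration and is assumed to end with a call that waits for the device, such as loss.item(), so that the measured time covers the actual execution.

```python
import time


def time_steps(step_fn, num_steps):
    """Call step_fn num_steps times and return the duration of each call in seconds."""
    durations = []
    for _ in range(num_steps):
        start = time.perf_counter()
        step_fn()
        durations.append(time.perf_counter() - start)
    return durations


def first_vs_steady(durations):
    """Return (first iteration time, mean time of the remaining iterations)."""
    first, rest = durations[0], durations[1:]
    steady = sum(rest) / len(rest) if rest else float("nan")
    return first, steady
```

A first value that is much larger than the steady value is consistent with graph compilation happening in the first iteration.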

Saving Model Checkpoints with torch.save¶

When working with convolutional neural networks, using torch.save to save an HPU model's state_dict or tensors may result in errors. This issue can occur for tensors with NCHW layout that are internally permuted on the device.

As a workaround, move the model or output tensors to CPU before calling torch.save:

```
# Move model to CPU before saving
torch.save(model.cpu().state_dict(), 'model_checkpoint.pth')

# Move tensor to CPU before saving
torch.save(output_tensor.cpu(), 'output_tensor.pth')
```
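
Note that model.cpu() moves the module itself to the CPU, so a script that continues training afterwards needs to move it back with model.to(device). One alternative, sketched below, copies only the state_dict tensors to the CPU and leaves the model on the device. to_cpu_state_dict and save_checkpoint are hypothetical helper names, and the approach assumes that copying each tensor with .cpu() is equivalent to the tensor workaround shown above.

```python
from collections import OrderedDict


def to_cpu_state_dict(state_dict):
    """Return a copy of a model state_dict whose tensors live on the CPU."""
    return OrderedDict(
        (name, tensor.detach().cpu()) for name, tensor in state_dict.items()
    )


def save_checkpoint(model, path):
    """Save the model weights from CPU copies; the model stays on its device."""
    import torch  # imported here so to_cpu_state_dict can be used on its own
    torch.save(to_cpu_state_dict(model.state_dict()), path)
```

For example, save_checkpoint(net, 'model_checkpoint.pth') could replace the torch.save call at the end of each epoch.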

Torch Multiprocessing for DataLoaders¶

If a training script uses multiprocessing with multiple workers for the PyTorch DataLoader, change the start method to spawn or forkserver using the PyTorch API multiprocessing.set_start_method(...). For example:

```
torch.multiprocessing.set_start_method('spawn')
```

The default start method is fork, which may result in undefined behavior.
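
The start method can be set only once per process; a second call to set_start_method raises a RuntimeError unless force=True is passed. The sketch below guards against that. It uses the standard-library multiprocessing module so that it runs without PyTorch; torch.multiprocessing is expected to behave the same way, since it wraps the standard module. ensure_start_method is a hypothetical helper name.

```python
import multiprocessing as mp


def ensure_start_method(method="spawn"):
    """Set the start method once; keep an existing choice if one was already made."""
    if mp.get_start_method(allow_none=True) is None:
        mp.set_start_method(method)
    return mp.get_start_method()


if __name__ == "__main__":
    # Call this before creating any DataLoader with num_workers > 0.
    print("Start method:", ensure_start_method("spawn"))
```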

Gaudi Documentation 1.21.1 documentation
