PyTorch Lightning User Guide

This document guides data scientists in running PyTorch models on the Habana® Gaudi® infrastructure using a simple PyTorch Lightning interface. It provides guidelines for modifying existing models to run on the platform and uses a basic example to demonstrate the functionality.

PyTorch Lightning Gaudi Integration Architecture

The PyTorch Habana bridge interfaces between the framework and SynapseAI software stack to enable the execution of deep learning models on the Habana Gaudi device.

PyTorch Lightning wraps PyTorch and the associated Habana bridge. Models can be built on top of the Lightning interface without the need to add backend-specific glue code.

The installation package provided by Habana comes with modifications on top of the PyTorch and PyTorch Lightning releases. PyTorch deep learning model training scripts need to load the PyTorch Habana plugin library and import the habana_frameworks.torch.core module to integrate with the Habana bridge.
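For example, the following imports at the top of a training script load the bridge (as in the full example later in this guide):

import torch
import habana_frameworks.torch.core as htcore  # loads the PyTorch Habana plugin library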

Further integration details can be found in the PyTorch Lightning Examples section.

PyTorch Lightning makes use of the default features supported by the Habana bridge. By default, the model executes in Lazy mode. For further details, please refer to the PyTorch User Guide.

The Habana PyTorch Lightning package is released as part of the SynapseAI release Docker images under the name pytorch-lightning, with its associated version as detailed in PyTorch Installation.

PyTorch Lightning Plugins

This section describes the plugins needed by the trainer class of PyTorch Lightning to make use of the Habana backend. The HPU-customized PyTorch Lightning wheel supports the training type plugin with distributed training support and mixed precision.

Figure 12: PyTorch Lightning with Habana Backend

Training Type Plugin

Habana accelerator support is enabled through the single_device plugin of PyTorch Lightning. All backend-specific glue code has been integrated into this plugin.

The hpus=1 parameter in the trainer class enables the Habana backend for single-card training.
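A minimal sketch of single-card training with the HPU-customized wheel, assuming model and train_loader are defined as in the full example later in this guide:

import pytorch_lightning as pl

# hpus=1 selects the Habana backend through the single_device training type plugin
trainer = pl.Trainer(hpus=1, max_epochs=1)
trainer.fit(model, train_dataloaders=train_loader)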

Mixed Precision Plugin

The precision=16 and hmp_params parameters in the trainer class enable the Habana plugin for mixed precision.

Habana PyTorch Lightning supports the mixed precision plugin and execution using the Habana Mixed Precision (HMP) package. Ops can be executed in either FP32 or BF16 precision. The HMP package modifies the Python operators to add the appropriate cast operations to their arguments before execution.
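For example, drawing on the full example later in this guide, the HMP parameters name the op lists that determine which operators run in BF16 and which remain in FP32:

hmp_params = {
    "level": "O1",                        # HMP optimization level
    "verbose": False,                     # disable verbose logging
    "bf16_ops": "./ops_bf16_mnist.txt",   # ops to execute in BF16
    "fp32_ops": "./ops_fp32_mnist.txt",   # ops to keep in FP32
}
trainer = pl.Trainer(hpus=1, precision=16, hmp_params=hmp_params)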

Refer to PyTorch Mixed Precision Training on Gaudi for more details.

Distributed Training Plugin

The hpus=8 parameter in the trainer class enables the Habana backend for distributed training with 8 cards.

Habana PyTorch Lightning implements the DDP plugin with the HCCL communication backend to support scale-up within a node and scale-out across multiple nodes.
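A sketch of the trainer call for an 8-card run, assuming the same model, dataloaders, and hmp_params as in the single-card example; apart from changing hpus, the script stays the same:

# hpus=8 enables distributed training across 8 Gaudi cards via the DDP plugin with HCCL
trainer = pl.Trainer(hpus=8, max_epochs=1, precision=16, hmp_params=hmp_params)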

Please refer to Distributed Training with PyTorch for more details.

Habana Data Loader Plugin

A datamodule plugin that enables the Habana data loader is currently not provided with PyTorch Lightning.

PyTorch Lightning Examples

This section describes how to train models using Habana PyTorch Lightning with Gaudi.

Run Lightning Models in Habana Model References

  1. Make sure the drivers and firmware required for Gaudi are installed. Refer to the Installation Guide for setup details, including Docker setup and installation.

  2. Clone the models from the Model References GitHub repository using git clone.

  3. Follow the README instructions on the PyTorch Model References GitHub page.

Note

Only Unet2D and Unet3D are supported with PyTorch Lightning.

Porting a PyTorch Lightning Model to Gaudi

The simple_example.py script below shows an example MNIST model using the Habana backend. The Habana-specific additions needed to run with PyTorch Lightning on Gaudi are called out in the code comments.

import os

import torch
from torch import nn
from torch.nn import functional as F
from torch.utils.data import DataLoader, random_split
from torchvision.datasets import MNIST
from torchvision import transforms
import pytorch_lightning as pl
import sys

import habana_frameworks.torch.core as htcore  # Habana-specific: load the Habana bridge to enable HPU execution

class LitClassifier(pl.LightningModule):

    def __init__(self):
        super(LitClassifier, self).__init__()

        self.l1 = torch.nn.Linear(28 * 28, 10)

    def forward(self, x):
        return torch.relu(self.l1(x.view(x.size(0), -1)))

    def training_step(self, batch, batch_idx):
        x, y = batch
        loss = F.cross_entropy(self(x), y)
        return loss

    def validation_step(self, batch, batch_idx):
        x, y = batch
        probs = self(x)
        acc = self.accuracy(probs, y)
        return acc

    def test_step(self, batch, batch_idx):
        x, y = batch
        logits = self(x)
        acc = self.accuracy(logits, y)
        return acc

    def accuracy(self, logits, y):
        acc = torch.sum(torch.eq(torch.argmax(logits, -1), y).to(torch.float32)) / len(y)
        return acc

    def validation_epoch_end(self, outputs) -> None:
        self.log("val_acc", torch.stack(outputs).mean(), prog_bar=True)

    def test_epoch_end(self, outputs) -> None:
        self.log("test_acc", torch.stack(outputs).mean())

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=0.02)

# Init our model
model = LitClassifier()

# Init DataLoader from MNIST Dataset
train_ds = MNIST(os.getcwd(), train=True, download=True, transform=transforms.ToTensor())
val_ds = MNIST(os.getcwd(), train=False, transform=transforms.ToTensor())
train_loader = DataLoader(train_ds, batch_size=32)
val_loader = DataLoader(val_ds, batch_size=16)

# Habana-specific: HMP parameters selecting which ops run in BF16 vs. FP32
hmp_keys = ["level", "verbose", "bf16_ops", "fp32_ops"]
hmp_params = dict.fromkeys(hmp_keys)
hmp_params["level"] = "O1"
hmp_params["verbose"] = False
hmp_params["bf16_ops"] = "./ops_bf16_mnist.txt"
hmp_params["fp32_ops"] = "./ops_fp32_mnist.txt"

# Initialize a trainer (Habana-specific: hpus=1 selects the Habana backend,
# precision=16 with hmp_params enables Habana Mixed Precision)
trainer = pl.Trainer(hpus=1, max_epochs=1, precision=16, hmp_params=hmp_params)

# Train the model ⚡
trainer.fit(model, train_dataloaders=train_loader, val_dataloaders=val_loader)
trainer.test(model, val_loader)
trainer.validate(model, dataloaders=val_loader)

Note

  • Only Unet2D and Unet3D have been validated with the released PyTorch Lightning package version 1.5.10.

  • Please refer to the PyTorch Known Issues and Limitations section for a list of current limitations.