Profiling with PyTorch

Habana provides PyTorch users with a near-GPU experience when profiling their models: in most cases, substituting the HPU device for the GPU device in your source code is all the migration required.

The easiest way to see what is needed to profile your model during training is to look at a code example:

import torch
from torch import nn
from torch.utils.data import DataLoader
from torchvision import datasets
from torchvision.transforms import ToTensor

import shutil
shutil.rmtree('runs', True)

#hpu specific
from habana_frameworks.torch.utils.library_loader import load_habana_module
load_habana_module()
habana_device = torch.device("hpu")

#general
class NeuralNetwork(nn.Module):
    def __init__(self):
        super(NeuralNetwork, self).__init__()
        self.flatten = nn.Flatten()
        self.linear_relu_stack = nn.Sequential(
            nn.Linear(28*28, 512),
            nn.ReLU(),
            nn.Linear(512, 512),
            nn.ReLU(),
            nn.Linear(512, 10)
        )

    def forward(self, x):
        x = self.flatten(x)
        logits = self.linear_relu_stack(x)
        return logits
model = NeuralNetwork().to(habana_device)
model.train()
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
training_data = datasets.FashionMNIST(
    root="data",
    train=True,
    download=True,
    transform=ToTensor(),
)
train_dataloader = DataLoader(training_data, batch_size=64)
activities = []
activities.append(torch.profiler.ProfilerActivity.CPU)

#hpu specific
activities.append(torch.profiler.ProfilerActivity.HPU)

#general
with torch.profiler.profile(
        activities=activities,
        schedule=torch.profiler.schedule(wait=0, warmup=20, active=5, repeat=1),
        on_trace_ready=torch.profiler.tensorboard_trace_handler('./runs/fashion_mnist_experiment_1/'),
        record_shapes=True,
        with_stack=True) as prof:
    for batch, (X, y) in enumerate(train_dataloader):
        X, y = X.to(habana_device), y.to(habana_device)
        pred = model(X)
        loss = loss_fn(pred, y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        prof.step()

Note that in the example above, the schedule skips no steps (wait=0), treats the first 20 steps as warmup, and records the following 5 steps, i.e. steps 21 to 25 counting from 1. Keep the number of recorded steps small: the buffer that collects the data in the SynapseAI Profiling Subsystem has limited capacity.
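The mapping from the schedule arguments to recorded steps can be sketched with a small stand-alone function. This is only an illustration of the documented wait/warmup/active/repeat semantics, not torch's actual implementation:

```python
# Illustration of torch.profiler.schedule(wait, warmup, active, repeat)
# semantics: for each step index (counting from 0), decide whether the
# profiler is idle, warming up, or recording. Not torch's real code.
def profiler_action(step, wait=0, warmup=20, active=5, repeat=1):
    cycle = wait + warmup + active
    if repeat and step >= cycle * repeat:
        return "none"            # all requested cycles are finished
    pos = step % cycle
    if pos < wait:
        return "none"            # idle phase, profiler off
    if pos < wait + warmup:
        return "warmup"          # profiler on, data discarded
    return "record"              # data collected

# With wait=0, warmup=20, active=5: 0-indexed steps 20-24 are recorded.
print([s for s in range(30) if profiler_action(s) == "record"])
# → [20, 21, 22, 23, 24]
```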

  1. Start the TensorBoard server in a dedicated terminal window:

$ tensorboard --logdir runs --bind_all --port=5990

In the example above, the listening port is set to 5990.

  2. Open a new tab in your browser and visit your TensorBoard site:

http://fq_domain_name:5990

Now you are ready to start your training.

TensorBoard provides two kinds of information:

  • While your workload is processed step by step (batch by batch), you can monitor the training process live on the dashboard by tracking your model's cost (loss) and accuracy.

  • As soon as the last requested profiling step completes, the collected profiling data is analyzed by TensorBoard and presented in your browser; there is no need to wait for the end of the training process.
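If you only want a quick summary in the terminal rather than the TensorBoard dashboard, the profiler results can also be inspected directly. The sketch below uses only the CPU activity so it runs on any machine; on an HPU system you would add `ProfilerActivity.HPU` exactly as in the main example above:

```python
# Sketch: inspecting profiler results directly, without TensorBoard.
# CPU-only here so it runs anywhere; add ProfilerActivity.HPU on Habana.
import torch

with torch.profiler.profile(
        activities=[torch.profiler.ProfilerActivity.CPU]) as prof:
    x = torch.randn(64, 784)
    w = torch.randn(784, 512)
    y = torch.relu(x @ w)

# key_averages() aggregates recorded events by operator name;
# table() renders them as a plain-text summary string.
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=5))
```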

Note

  • Carefully consider how many steps you really need to profile, keeping the limited buffer size in mind.

  • If you need to enlarge the buffer, consult the SynapseAI Profiler User Guide.

  • For the vast majority of use cases, the default settings are good enough, so no adjustment of internal parameters is needed.