Profiling with PyTorch

This section provides guidelines for profiling your model during training, and explains how to use TensorBoard to view Intel® Gaudi® AI accelerator specific information for performance profiling. These capabilities are enabled by the torch-tb-profiler TensorBoard plugin, which is included in the Intel Gaudi PyTorch installation. The plugin analyzes the following areas and provides guidance on them:

  • Performance:

    • Increase batch size to save graph build time and increase Gaudi HPU utilization.

    • Reduce the frequency of getting tensor values (e.g. loss prints).

    • Enable autocast mixed precision for better performance.

    • Break up graphs into smaller parts by using mark_step to trigger the execution of accumulated graphs and reduce Host-Device overlapping.

  • Dataloader optimization:

    • Reports the time spent on input data loading, and may recommend using habana_dataloader as well as indicate how the num_workers parameter impacts DataLoader time.

    • Set non_blocking=True in torch.Tensor.to and enable pin_memory in the DataLoader constructor to asynchronously copy CPU tensors in pinned memory to HPU tensors, as shown in the sketch after this list.
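The following is a minimal sketch of the DataLoader and mixed precision recommendations above, assuming a toy dataset and model; the shapes, batch size, and num_workers value are placeholders, not tuned recommendations:

import torch
from torch.utils.data import DataLoader, TensorDataset
import habana_frameworks.torch.core as htcore

device = torch.device('hpu')

# Placeholder dataset; pin_memory=True keeps batches in page-locked host memory
dataset = TensorDataset(torch.randn(1024, 10), torch.randn(1024, 1))
loader = DataLoader(dataset, batch_size=64, num_workers=2, pin_memory=True)

model = torch.nn.Linear(10, 1).to(device)

for data, target in loader:
    # non_blocking=True lets the host-to-HPU copy of pinned memory run asynchronously
    data = data.to(device, non_blocking=True)
    target = target.to(device, non_blocking=True)

    # bf16 autocast mixed precision on HPU
    with torch.autocast(device_type='hpu', dtype=torch.bfloat16):
        loss = torch.nn.functional.mse_loss(model(data), target)

    htcore.mark_step()  # trigger execution of the accumulated graph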

The below shows a usage example to enable TensorBoard logging in your model:

import torch
import habana_frameworks.torch.core as htcore

activities = [torch.profiler.ProfilerActivity.CPU]

# CUDA:
#device = torch.device('cuda:0')
#activities.append(torch.profiler.ProfilerActivity.CUDA)

# HPU:
device = torch.device('hpu')
activities.append(torch.profiler.ProfilerActivity.HPU)

with torch.profiler.profile(
    # Skip no steps, use the first 20 steps as warmup, then record 5 active steps
    schedule=torch.profiler.schedule(wait=0, warmup=20, active=5, repeat=1),
    activities=activities,
    # Write the trace into the 'logs' directory for TensorBoard to pick up
    on_trace_ready=torch.profiler.tensorboard_trace_handler('logs')) as profiler:
    for i in range(100):
        input = torch.tensor([[i]*10]*10, dtype=torch.float32, device=device)
        result = torch.matmul(input, input)
        result.to('cpu')    # copy back to host, forcing execution of the graph
        htcore.mark_step()  # trigger execution of the accumulated graph
        profiler.step()     # advance the profiler schedule

Note

In the example above, data is collected for steps 21 through 25: the first 20 iterations serve as warmup and the next 5 are actively recorded. Keep in mind that the buffer collecting this data within the Profiling Subsystem has limited capacity.

Setting up TensorBoard

  1. Start the TensorBoard server in a dedicated terminal window.

$ tensorboard --logdir logs --bind_all --port=5990

In the example above, the listening port is set to 5990.

  2. Open a new tab in your browser and navigate to the TensorBoard page:

http://fq_domain_name:5990

Here, fq_domain_name is the fully qualified domain name of the machine running TensorBoard. You are now ready to begin training.

Two types of information are produced by TensorBoard:

  • Model Performance Tracking - While your workload is processed batch by batch, you can follow the progress of training on the dashboard in real time by monitoring the model’s cost (loss) and accuracy (a logging sketch follows this list).

  • Profiling Analysis - As soon as the last requested step completes, TensorBoard analyzes the collected profiling data and presents it in your browser immediately; there is no need to wait for the training process to finish.
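For the model performance tracking view, scalar metrics such as loss and accuracy can be logged with the standard torch.utils.tensorboard.SummaryWriter. The following is a minimal sketch, assuming the same logs directory used above; the metric names and values are placeholders:

import torch
from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter('logs')

for step in range(100):
    # Placeholder metrics; in a real run these come from the training loop
    loss = 1.0 / (step + 1)
    accuracy = 1.0 - loss

    writer.add_scalar('Loss/train', loss, step)
    writer.add_scalar('Accuracy/train', accuracy, step)

writer.close()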

Note

  • Carefully consider the size of your buffer and the number of steps you actually need to profile.

  • If you need to extend the buffer, refer to the Profiling section.

  • In most use cases, the default settings are sufficient, and there is no need for any internal parameter adjustments.

HPU Overview

When using the TensorBoard profiler, the initial view includes a comprehensive summary of the Gaudi HPU, showing both Gaudi device execution information and host CPU information. You can see the utilization of both host and device, and the debug guidance at the bottom of the section offers suggestions for performance optimization.

[Figure: TensorBoard HPU overview (../_images/tensorboard_overview.jpg)]

HPU Kernel View

The HPU Kernel view provides specific details on the Gaudi HPU kernels, showing utilization of the Tensor Processing Cores (TPC) and the matrix multiplication engine (MME).

[Figure: HPU Kernel view (../_images/Kernel_View.jpg)]

Memory Profiling

To monitor HPU memory during training, set the profile_memory argument to True in the torch.profiler.profile function.

The below shows a usage example:

with torch.profiler.profile(
    schedule=torch.profiler.schedule(wait=0, warmup=20, active=5, repeat=1),
    activities=activities,
    on_trace_ready=torch.profiler.tensorboard_trace_handler('logs'),
    profile_memory=True) as profiler:  # profile_memory enables HPU memory profiling
    ...  # training loop as in the previous example, calling profiler.step()

As a result, an additional view named “Memory View” will appear in TensorBoard.

[Figure: Memory view (../_images/Memory_view.jpg)]

If you do not want to run the TensorBoard UI, you can run habana_perf_tool on the same .json log files; it parses the existing .json file and provides the same recommendations for performance enhancements, as sketched below. Refer to Perf Tool and TensorBoard Model Scanning for more information.
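The command-line flow is roughly as follows. This is a sketch: the trace file name is a placeholder for the .json file written under logs, and the --trace option is the commonly documented invocation, so verify the exact flags with your installed version:

$ habana_perf_tool --trace logs/<trace_file>.json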