Profiling with PyTorch

This section provides simple guidelines for profiling your model during training. It also explains how to use TensorBoard to view Intel® Gaudi® AI accelerator specific information for performance profiling. These capabilities are enabled by the torch-tb-profiler TensorBoard plugin, which is included in the Intel Gaudi PyTorch package. To install the package, see the Installation Guide and On-Premise System Update. The plugin analyzes the following components and provides guidance for each:

Performance

  • Increase batch size to save graph build time and increase HPU utilization.

  • Reduce the frequency of getting tensor values (e.g. loss prints).

  • Enable autocast mixed precision for better performance.

  • Break up large graphs by calling mark_step to trigger execution of the accumulated graphs and improve Host-Device execution overlap.

Dataloader optimization

  • Reports the time spent on input data loading. The plugin may recommend using habana_dataloader and indicates how the num_workers parameter impacts DataLoader time.

  • Set non_blocking=True in torch.Tensor.to and enable pinned memory in the DataLoader constructor to asynchronously copy CPU tensors in pinned memory to HPU tensors (see the sketch after this list).
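
The following minimal sketch ties several of these recommendations together: pinned memory with non-blocking copies, bfloat16 autocast on HPU, and mark_step calls around the backward and optimizer steps. The toy dataset, model, and hyperparameters are placeholders for illustration only:

    import torch
    import torch.nn as nn
    from torch.utils.data import DataLoader, TensorDataset
    import habana_frameworks.torch.core as htcore

    device = torch.device('hpu')

    # Toy data and model, used only to illustrate the knobs above.
    dataset = TensorDataset(torch.randn(1024, 10), torch.randn(1024, 1))
    loader = DataLoader(dataset,
                        batch_size=256,   # larger batches amortize graph build time
                        num_workers=4,    # tune to hide input-pipeline latency
                        pin_memory=True)  # needed for truly asynchronous copies

    model = nn.Linear(10, 1).to(device)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = nn.MSELoss()

    for x, y in loader:
        # non_blocking=True overlaps the pinned-memory copy with computation
        x = x.to(device, non_blocking=True)
        y = y.to(device, non_blocking=True)
        optimizer.zero_grad()
        # bfloat16 autocast enables mixed precision on HPU
        with torch.autocast(device_type='hpu', dtype=torch.bfloat16):
            loss = loss_fn(model(x), y)
        loss.backward()
        htcore.mark_step()  # trigger execution of the accumulated graph
        optimizer.step()
        htcore.mark_step()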

Setting up TensorBoard

  1. Enable TensorBoard logging in your model as shown in the example below. For demo purposes, and to examine the trace file, you can run this code as a standalone script by pasting it into a Python file (e.g. profiling.py) and running python profiling.py:

    import torch
    import habana_frameworks.torch.core as htcore
    
    activities = [torch.profiler.ProfilerActivity.CPU]
    
    #CUDA:
    #device = torch.device('cuda:0')
    #activities.append(torch.profiler.ProfilerActivity.CUDA)
    
    #HPU:
    device = torch.device('hpu')
    activities.append(torch.profiler.ProfilerActivity.HPU)
    
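    # Profile with no initial wait: warm up for 20 steps, then record 5 active
    # steps in a single cycle; completed traces are written by the trace handler.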
    with torch.profiler.profile(
        schedule=torch.profiler.schedule(wait=0, warmup=20, active=5, repeat=1),
        activities=activities,
        on_trace_ready=torch.profiler.tensorboard_trace_handler('./profile_logs')) as profiler:
        for i in range(100):
            input = torch.tensor([[i]*10]*10, dtype=torch.float32, device=device)
            result = torch.matmul(input, input)
            result.to('cpu')
            htcore.mark_step()
            profiler.step()
    

    Note

    In the example above, data is collected during steps 21 to 25 and the trace files are saved in the profile_logs folder. Keep in mind that the buffer that collects data within the Profiling subsystem has limited capacity.
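
    To check how the schedule parameters map to profiler actions, you can call the schedule object directly. This is a quick standalone check, not part of the training script:

        import torch

        sched = torch.profiler.schedule(wait=0, warmup=20, active=5, repeat=1)
        # Steps 0-19 -> WARMUP; steps 20-24 -> RECORD (the last step also saves
        # the trace); from step 25 on -> NONE, since the cycle repeats only once.
        for step in (0, 19, 20, 24, 25):
            print(step, sched(step))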

  2. Start the TensorBoard server in a dedicated terminal window. In the example below, the listening port is set to 5990 and the log directory points to the ./profile_logs folder:

    $ tensorboard --logdir ./profile_logs --bind_all --port=5990
    
  3. Open a new tab in your browser and navigate to your TensorBoard page:

    http://fq_domain_name:5990

    or:

    http://localhost:5990

You are now ready to begin training.

Viewing TensorBoard

The following types of information are produced by TensorBoard:

  • Model performance tracking - While your workload is processed in batches, you can track the progress of training on the dashboard in real time by monitoring the model’s cost (loss) and accuracy.

  • Profiling analysis - As soon as the last requested step completes, TensorBoard analyzes the collected profiling data and displays it in your browser immediately, with no need to wait for the training process to finish.

Note

  • Carefully consider the size of your buffer and the number of steps you actually need to profile.

  • If you require a larger buffer, refer to the Profiling with Intel Gaudi Software section.

  • In most use cases, the default settings are sufficient, and there is no need for any internal parameter adjustments.

HPU Overview

The initial TensorBoard profiling view includes a comprehensive summary of Gaudi, showing both Gaudi device execution information and host CPU information. You can see the utilization of both host and device, along with debug guidance at the bottom of the section that offers suggestions for performance optimization.

[Image: tensorboard_overview.jpg]

HPU Kernel View

The HPU Kernel view provides specific details, showing HPU kernel utilization in the Tensor Processor Cores (TPC) and the Matrix Multiplication Engine (MME).

[Image: Kernel_View.jpg]

Memory Profiling

To monitor HPU memory during training, set the profile_memory argument to True in torch.profiler.profile. The example below shows the usage:

with torch.profiler.profile(
    schedule=torch.profiler.schedule(wait=0, warmup=20, active=5, repeat=1),
    activities=activities,
    on_trace_ready=torch.profiler.tensorboard_trace_handler('./profile_logs'),
    profile_memory=True) as profiler:
    ...  # training loop as in the earlier example, calling profiler.step() each iteration

As a result, an additional view named “Memory View” will appear in TensorBoard.

[Image: Memory_view.jpg]

Perf Tool

If you do not want to run the TensorBoard UI, you can analyze the same JSON trace files with habana_perf_tool. The tool parses an existing JSON file and provides the same performance-enhancement recommendations. For more details, see Perf Tool and TensorBoard Model Scanning.
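
A typical invocation might look like the following. The --trace flag and the trace file name pattern produced by tensorboard_trace_handler are assumptions here; run habana_perf_tool --help to confirm the exact options on your setup:

    $ habana_perf_tool --trace ./profile_logs/<worker>.<timestamp>.pt.trace.json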