Profiling with PyTorch¶
This section provides simple guidelines for profiling your model during training. It also explains how to use TensorBoard to view Gaudi-specific information for performance profiling. These capabilities are enabled by the torch-tb-profiler TensorBoard plugin, which is included in the SynapseAI software stack.
The plugin analyzes the following performance aspects and provides guidance on each:

- Increasing the batch size to save graph build time and increase Gaudi HPU utilization.
- Reducing the frequency of retrieving tensor values (e.g. loss prints).
- Enabling autocast mixed precision for better performance.
- Breaking up graphs into smaller parts by using mark_step to trigger the execution of accumulated graphs and reduce Host-Device overlapping.
- Time spent on input dataloading; the plugin may recommend using Habana's DataLoader and indicate how the num_workers variable impacts dataloading time.
- Setting non_blocking in torch.Tensor.to and pinning memory in the DataLoader's construction to asynchronously convert a CPU tensor with pinned memory to an HPU tensor.
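The last two items above can be sketched as follows. This is a minimal, hypothetical example with a synthetic dataset; it falls back to the CPU device when no HPU stack is installed, so the pinning and asynchronous copy only take effect on real accelerator hardware:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Synthetic dataset, for illustration only.
dataset = TensorDataset(torch.randn(64, 10), torch.randn(64, 1))

# pin_memory=True allocates batches in page-locked host memory,
# which enables asynchronous host-to-device copies. num_workers
# controls how many worker processes load data in parallel.
loader = DataLoader(dataset, batch_size=16, num_workers=2, pin_memory=True)

# Use the HPU when available; fall back to CPU so this sketch runs anywhere.
try:
    import habana_frameworks.torch.core  # noqa: F401
    device = torch.device("hpu")
except ImportError:
    device = torch.device("cpu")

for inputs, targets in loader:
    # non_blocking=True lets the copy overlap with host-side work
    # when the source tensor is pinned.
    inputs = inputs.to(device, non_blocking=True)
    targets = targets.to(device, non_blocking=True)
```

The try/except fallback is only there to keep the sketch self-contained; in a real Gaudi training script the habana_frameworks import is expected to succeed.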
The following example shows how to enable TensorBoard logging in your model:

```python
import torch
import habana_frameworks.torch.core as htcore

activities = [torch.profiler.ProfilerActivity.CPU]

# CUDA:
# device = torch.device('cuda:0')
# activities.append(torch.profiler.ProfilerActivity.CUDA)

# HPU:
device = torch.device('hpu')
activities.append(torch.profiler.ProfilerActivity.HPU)

with torch.profiler.profile(
        schedule=torch.profiler.schedule(wait=0, warmup=20, active=5, repeat=1),
        activities=activities,
        on_trace_ready=torch.profiler.tensorboard_trace_handler('logs')) as profiler:
    for i in range(100):
        input = torch.tensor([[i]*10]*10, dtype=torch.float32, device=device)
        result = torch.matmul(input, input)
        result.to('cpu')
        htcore.mark_step()
        profiler.step()
```
In the example above, data is collected from step 21 to step 25 (20 warmup steps followed by 5 active steps). Keep in mind the limited capacity of the buffer that collects profiling data within the SynapseAI Profiling Subsystem.
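To see how the schedule parameters determine which steps are recorded, the cycle logic can be sketched in plain Python. This is a simplified model of torch.profiler.schedule for illustration, not the actual implementation:

```python
def recorded_steps(wait, warmup, active, repeat, total_steps):
    """Return the (0-indexed) step numbers captured by a profiler schedule.

    Each cycle consists of `wait` idle steps, then `warmup` steps
    (profiler on, results discarded), then `active` steps that are
    actually recorded. With repeat > 0, only that many cycles run.
    """
    cycle = wait + warmup + active
    recorded = []
    for step in range(total_steps):
        if repeat and step >= cycle * repeat:
            break  # all requested cycles are done
        if step % cycle >= wait + warmup:
            recorded.append(step)
    return recorded

# wait=0, warmup=20, active=5, repeat=1 over 100 steps records
# steps 20..24, i.e. the 21st through 25th iterations.
print(recorded_steps(0, 20, 5, 1, 100))
```

This matches the example above: with wait=0, warmup=20, active=5, and repeat=1, only five of the 100 iterations are captured, which keeps the profiling buffer small.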
Setting up TensorBoard¶
Start the TensorBoard server in a dedicated terminal window.
$ tensorboard --logdir logs --bind_all --port=5990
In the example above, the listening port is set to 5990.
Open a new tab in your browser and navigate to the TensorBoard page. You are now ready to begin training.
TensorBoard produces two types of information:

- Model Performance Tracking - While your workload is processed in batches, you can track the progress of training on the dashboard in real time by monitoring the model's cost (loss) and accuracy.
- Profiling Analysis - As soon as the last requested step completes, TensorBoard analyzes the collected profiling data and immediately delivers it to your browser; there is no need to wait for the training process to finish.
Carefully consider the size of your buffer and the number of steps you actually need to profile.
If you require a larger buffer, refer to the SynapseAI Profiler User Guide.
In most use cases, the default settings are sufficient, and there is no need for any internal parameter adjustments.
When using the TensorBoard profiler, the initial view presents a comprehensive summary of the Gaudi HPU, showing both the Gaudi Device execution information and the Host CPU information. You can see the utilization of both Host and Device, with debug guidance at the bottom of the section offering suggestions for performance optimization.
HPU Kernel View¶
The HPU Kernel view provides specific details into the Gaudi HPU kernel, showing the utilization in the Tensor Processing Cores (TPC) and the matrix multiplication engine (MME).
To monitor HPU memory during training, set the profile_memory argument to True in the torch.profiler.profile call. The following shows a usage example:
```python
with torch.profiler.profile(
        schedule=torch.profiler.schedule(wait=0, warmup=20, active=5, repeat=1),
        activities=activities,
        on_trace_ready=torch.profiler.tensorboard_trace_handler('logs'),
        profile_memory=True) as profiler:
```
As a result, an additional view named “Memory View” will appear in TensorBoard.
If you do not want to run the TensorBoard UI, you can pass the same .json log files to habana_perf_tool, which parses the existing .json file and provides the same performance recommendations. Refer to the Habana Perf Tool section for more information.