Profiling Workflow
Intel Gaudi provides two methods of profiling:
PyTorch-based profiling using TensorBoard - A high-level profiling system that presents host and device information as a summary:
A beginner-friendly method for viewing the big picture before narrowing down on specific issues.
Offers suggestions and guidance on basic optimizations that improve performance.
Intel Gaudi Profiling subsystem - A low-level profiling method that offers fine-grained details of the Gaudi device:
Recommended for advanced users who are already familiar with the Gaudi architecture.
Useful for examining specific compute core utilization and performing time-slice analysis of each operation running on the device.
For details on the high-level and low-level profiling architecture, refer to Profiling Architecture.
Running Profiling
Run your application with high-level profiling enabled first:
Gather host and device execution and memory consumption summaries, and determine whether processes are host-bound or device-bound (refer to HPU Overview).
Review the recommendations that the Intel Gaudi TensorBoard plugin displays in the TensorBoard dashboard.
Based on these recommendations, take steps to mitigate host-boundedness, such as optimizing the dataloader, reducing host-device memory transfers, and minimizing the size of accumulated graphs.
Improve performance by maximizing batch size, enabling mixed precision, and applying the other techniques described in the Model Optimization Checklist.
Upload the trace profile .json file to https://perfetto.habana.ai and identify bottlenecks, which appear as large gaps between the time slices of operations in the trace profile viewer.
Once you are satisfied with the host-level optimizations, move on to low-level profiling using the Intel Gaudi Profiler:
Capture Intel Gaudi software API calls and hardware traces.
Upload trace profile .hltv files to https://perfetto.habana.ai and identify bottlenecks as mentioned above.
Use the built-in Trace Analyzer tool to analyze the overall duration of operations and core utilization.
For larger traces, consider using the Offline Trace Parser for detailed timeline analysis.
Refer to Profiling Tips and Tricks for further understanding of each executed node.
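The gap inspection described in the workflow above can also be approximated offline. The following is a minimal sketch, not part of the Intel Gaudi tooling. It assumes the exported .json trace follows the Chrome Trace Event format, with complete events marked `"ph": "X"` and `ts`/`dur` given in microseconds. The helper names `load_trace` and `find_large_gaps` and the 1000 µs threshold are hypothetical choices made for illustration.

```python
import json


def load_trace(path):
    """Load a trace profile .json file from disk (hypothetical helper)."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)


def find_large_gaps(trace, min_gap_us=1000.0):
    """Return idle gaps of at least min_gap_us between events on each track.

    Assumes Chrome Trace Event format: complete events have "ph" == "X",
    with "ts" (start) and "dur" (duration) in microseconds.
    """
    # Accept either {"traceEvents": [...]} or a bare list of events.
    events = trace["traceEvents"] if isinstance(trace, dict) else trace

    # Group complete events by (pid, tid) so gaps are measured per track.
    tracks = {}
    for ev in events:
        if ev.get("ph") == "X" and "ts" in ev and "dur" in ev:
            tracks.setdefault((ev.get("pid"), ev.get("tid")), []).append(ev)

    gaps = []
    for track, evs in tracks.items():
        evs.sort(key=lambda e: e["ts"])
        # busy_until is the latest end time seen so far, so nested or
        # overlapping events are not mistaken for idle time.
        busy_until = evs[0]["ts"] + evs[0]["dur"]
        prev_name = evs[0].get("name", "?")
        for ev in evs[1:]:
            gap = ev["ts"] - busy_until
            if gap >= min_gap_us:
                gaps.append({
                    "track": track,
                    "after": prev_name,
                    "before": ev.get("name", "?"),
                    "gap_us": gap,
                })
            end = ev["ts"] + ev["dur"]
            if end > busy_until:
                busy_until = end
                prev_name = ev.get("name", "?")

    # Largest gaps first, since those are the most likely bottlenecks.
    gaps.sort(key=lambda g: g["gap_us"], reverse=True)
    return gaps


# Tiny synthetic trace for illustration; real traces come from load_trace(path).
example_trace = {"traceEvents": [
    {"ph": "X", "pid": 1, "tid": 1, "name": "matmul", "ts": 0, "dur": 100},
    {"ph": "X", "pid": 1, "tid": 1, "name": "relu", "ts": 110, "dur": 50},
    {"ph": "X", "pid": 1, "tid": 1, "name": "softmax", "ts": 5160, "dur": 40},
    {"ph": "M", "pid": 1, "name": "process_name", "args": {"name": "device"}},
]}

gaps = find_large_gaps(example_trace, min_gap_us=1000)
for g in gaps:
    print(f"{g['after']} -> {g['before']}: {g['gap_us']} us idle")
```

A large gap on a device track between two operations often suggests that the device was waiting on the host. The trace viewer remains the better tool for confirming the cause.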
Examples
To see how the profiling process applies in different real-world scenarios, refer to the Profiling Real-World Examples. For a more hands-on introduction to profiling, refer to the Profiling and Optimization Tutorial.