Profiling Workflow
Intel Gaudi provides two methods of profiling:
- PyTorch-based profiling using TensorBoard: a high-level profiling system where host and device information is presented as a summary.
  - A beginner-friendly method for viewing the big picture before narrowing down on issues.
  - Offers suggestions and guidance for improving performance and applying basic optimizations.

- Intel Gaudi Profiling subsystem: a low-level profiling method that offers fine-grained details of the Gaudi device.
  - Recommended for advanced users who are more familiar with the Gaudi architecture.
  - Useful for examining the utilization of specific compute cores and performing time-slice analysis of each operation running on the device.
 
For details on the high-level and low-level profiling architecture, refer to Profiling Architecture.
Running Profiling
- Run your application with high-level profiling enabled first:
  - Gather host and device execution and memory consumption summaries, and determine whether processes are host-bound or device-bound (refer to HPU Overview).
  - Review the recommendations that the Intel Gaudi TensorBoard plugin shows in the TensorBoard dashboard.
  - Based on these recommendations, take steps to mitigate host-boundedness, such as optimizing the dataloader, reducing host-device memory transfers, and minimizing the size of accumulated graphs.
  - Improve performance by maximizing batch size, enabling mixed precision, and applying the other techniques listed in the Model Optimization Checklist.
  - Upload the trace profile .json file to https://perfetto.habana.ai and identify bottlenecks, which appear as large gaps between operation time slices in the trace profile viewer.
 
- Once satisfied with the host-level optimizations, move on to low-level profiling with the Intel Gaudi Profiler:
  - Capture Intel Gaudi software API calls and hardware traces.
  - Upload the trace profile .hltv files to https://perfetto.habana.ai and identify bottlenecks in the same way, by looking for large gaps between operation time slices.
  - Use the built-in Trace Analyzer tool to analyze the overall duration of operations and core utilization.
  - For larger traces, consider using the Offline Trace Parser for detailed timeline analysis.
  - Refer to Profiling Tips and Tricks for a further understanding of each executed node.
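
The search for large gaps between operations can also be approximated offline before opening the viewer. The following is a minimal sketch, not part of any Intel Gaudi tool. It assumes the exported .json file follows the Chrome trace-event layout: a top-level "traceEvents" list whose complete events carry "ph": "X", with "ts" and "dur" in microseconds. The names load_trace and find_large_gaps and the threshold value are hypothetical. The sketch treats all events as a single timeline, so on a real trace you would likely filter by process or thread track first.

```python
import json


def load_trace(path):
    """Load a trace file (assumed to be Chrome trace-event JSON)."""
    with open(path, "r", encoding="utf-8") as handle:
        return json.load(handle)


def find_large_gaps(trace, min_gap_us):
    """Return (previous_op, next_op, gap_us) for idle gaps >= min_gap_us."""
    # Keep only complete events, which carry both a start time and a duration.
    events = [
        e for e in trace.get("traceEvents", [])
        if e.get("ph") == "X" and "ts" in e and "dur" in e
    ]
    events.sort(key=lambda e: e["ts"])

    gaps = []
    busy_until = None  # latest end time seen so far
    last_name = None   # name of the event that ends at busy_until
    for event in events:
        name = event.get("name", "<unnamed>")
        if busy_until is not None and event["ts"] - busy_until >= min_gap_us:
            gaps.append((last_name, name, event["ts"] - busy_until))
        end = event["ts"] + event["dur"]
        # Track the furthest end time so overlapping events do not create false gaps.
        if busy_until is None or end > busy_until:
            busy_until, last_name = end, name
    return gaps


# Tiny hand-written trace standing in for a real profiler export.
sample = {
    "traceEvents": [
        {"name": "matmul", "ph": "X", "ts": 0, "dur": 100},
        {"name": "relu", "ph": "X", "ts": 110, "dur": 20},
        {"name": "softmax", "ph": "X", "ts": 900, "dur": 50},
    ]
}
gaps = find_large_gaps(sample, min_gap_us=100)
print(gaps)  # [('relu', 'softmax', 770)]
```

With a real export, `find_large_gaps(load_trace("trace.json"), min_gap_us=1000)` would list candidate idle periods to inspect in the trace viewer; the file name and threshold here are placeholders.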
 
Examples
To gain a sound understanding of how profiling is applied in different real-world scenarios, see the Profiling Real-World Examples. For a more hands-on approach to profiling, refer to the Profiling and Optimization Tutorial.
