Profiling Workflow

Intel Gaudi provides two methods of profiling:

PyTorch-based profiling using TensorBoard - A high-level profiling method that presents host and device information as a summary:
- Beginner-friendly; use it to view the big picture before narrowing down on specific issues.
- Offers suggestions and guidance for basic optimizations that improve performance.
Intel Gaudi profiling subsystem - A low-level profiling method that provides fine-grained details of the Gaudi device:
- Recommended for advanced users who are familiar with the Gaudi architecture.
- Useful for examining the utilization of specific compute cores and performing time-slice analysis of each operation running on the device.

For details on the high-level and low-level profiling architecture, refer to Profiling Architecture.
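As a rough illustration of the high-level method, the sketch below wraps a training loop in the standard `torch.profiler` API and writes traces that TensorBoard can read. It is a minimal sketch under stated assumptions, not the documented Gaudi procedure: `profile_training_steps`, `step_fn`, `log_dir`, and the schedule values are hypothetical, and `ProfilerActivity.HPU` is assumed to be available in the PyTorch build installed for Gaudi.

```python
def profile_training_steps(step_fn, num_steps, log_dir="./profile_logs", use_hpu=False):
    """Run step_fn() num_steps times under torch.profiler and write trace files.

    step_fn is a hypothetical user callable that executes one training step.
    """
    # Imported lazily so this module can be loaded on machines without PyTorch.
    import torch

    activities = [torch.profiler.ProfilerActivity.CPU]
    if use_hpu:
        # Assumption: the installed PyTorch build exposes ProfilerActivity.HPU.
        activities.append(torch.profiler.ProfilerActivity.HPU)

    # Skip nothing, warm up for 2 steps, then record 3 steps, once.
    schedule = torch.profiler.schedule(wait=0, warmup=2, active=3, repeat=1)

    with torch.profiler.profile(
        activities=activities,
        schedule=schedule,
        on_trace_ready=torch.profiler.tensorboard_trace_handler(log_dir),
        record_shapes=True,
        with_stack=True,
    ) as prof:
        for _ in range(num_steps):
            step_fn()
            prof.step()  # tell the profiler that one step has finished
    return log_dir
```

With the schedule above, `num_steps` needs to be at least 5 for the recording window to complete. The resulting directory can then be opened with `tensorboard --logdir ./profile_logs`.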

Running Profiling

Run your application with high-level profiling enabled first:
- Gather host and device execution and memory consumption summaries, and determine whether processes are host-bound or device-bound (refer to HPU Overview).
- Review the recommendations from the Intel Gaudi TensorBoard plugin in the TensorBoard dashboard.
- Based on these recommendations, mitigate host-boundedness, for example by optimizing the dataloader, reducing host-device memory transfers, and minimizing the size of accumulated graphs.
- Improve performance by maximizing batch size, enabling mixed precision, and applying the other techniques listed in the Model Optimization Checklist.
- Upload the trace profile .json file to https://perfetto.habana.ai and identify bottlenecks, which appear as large gaps between the time slices of operations in the trace profile viewer.
Once satisfied with the host-level optimizations, move on to low-level profiling using the Intel Gaudi Profiler:
- Capture Intel Gaudi software API calls and hardware traces.
- Upload the trace profile .hltv files to https://perfetto.habana.ai and identify bottlenecks in the same way, by looking for large gaps between time slices.
- Use the built-in Trace Analyzer tool to analyze the overall duration of operations and core utilization.
- For larger traces, consider using the Offline Trace Parser for detailed timeline analysis.
- Refer to Profiling Tips and Tricks for a further understanding of each executed node.
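The gap inspection and per-operation duration checks described above can also be approximated offline with a short script. The sketch below assumes the exported .json file follows the Chrome Trace Event Format (a `traceEvents` list of complete events with `ph == "X"`, `ts`, and `dur` in microseconds); it does not apply to .hltv files and is not a replacement for the Trace Analyzer. The function names and the sample trace are hypothetical.

```python
from collections import defaultdict


def load_complete_events(trace):
    """Return (name, start_us, end_us) tuples for complete ("X") events, sorted by start."""
    events = trace["traceEvents"] if isinstance(trace, dict) else trace
    spans = []
    for ev in events:
        if ev.get("ph") == "X" and "ts" in ev and "dur" in ev:
            spans.append((ev.get("name", "?"), ev["ts"], ev["ts"] + ev["dur"]))
    return sorted(spans, key=lambda s: s[1])


def find_gaps(spans, min_gap_us):
    """Return (previous_op, next_op, gap_us) for idle periods of at least min_gap_us."""
    gaps = []
    if not spans:
        return gaps
    last_name, _, busy_until = spans[0]
    for name, start, end in spans[1:]:
        if start - busy_until >= min_gap_us:
            gaps.append((last_name, name, start - busy_until))
        # Track the latest end time so overlapping events do not create false gaps.
        if end > busy_until:
            busy_until, last_name = end, name
    return gaps


def total_duration_by_name(spans):
    """Return total duration in microseconds per operation name, largest first."""
    totals = defaultdict(float)
    for name, start, end in spans:
        totals[name] += end - start
    return dict(sorted(totals.items(), key=lambda kv: kv[1], reverse=True))


# Hypothetical trace; for a real file use json.load(open(path)) instead.
sample_trace = {
    "traceEvents": [
        {"ph": "M", "name": "process_name", "pid": 1},
        {"ph": "X", "name": "matmul", "ts": 0, "dur": 100, "pid": 1, "tid": 1},
        {"ph": "X", "name": "relu", "ts": 100, "dur": 20, "pid": 1, "tid": 1},
        {"ph": "X", "name": "matmul", "ts": 500, "dur": 100, "pid": 1, "tid": 1},
    ]
}
spans = load_complete_events(sample_trace)
print(find_gaps(spans, min_gap_us=50))
print(total_duration_by_name(spans))
```

The sketch treats all events as a single timeline, so it reports only periods where nothing at all is running. For a real trace, it is usually more informative to filter the events to the device tracks of interest (by `pid` and `tid`) before calling `find_gaps`.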

Examples

To see how the profiling process is applied in different real-world scenarios, refer to the Profiling Real-World Examples. For a more hands-on introduction to profiling, refer to the Profiling and Optimization Tutorial.