Profiling Workflow

Intel Gaudi provides two methods of profiling:

  • PyTorch-based profiling using TensorBoard - A high-level profiling system where host and device information is presented as a summary:

    • Beginner-friendly method that can be used to view the big picture before narrowing down on issues.

    • Offers suggestions and guidance for basic optimizations and performance improvements.

  • Intel Gaudi Profiling subsystem - A low-level profiling method that offers fine-grained details of the Gaudi device:

    • Recommended for advanced users who are familiar with the Gaudi architecture.

    • Useful for examining specific compute core utilization and performing time-slice analysis of each operation running on the device.

For details on the high-level and low-level profiling architecture, refer to Profiling Architecture.
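As a rough illustration of the high-level method, the sketch below wraps a few toy steps in the standard `torch.profiler` API and writes a trace that TensorBoard can read. It assumes that a Gaudi installation provides the `habana_frameworks.torch.core` module, the `hpu` device, and `ProfilerActivity.HPU`; when these are not available it falls back to CPU-only profiling. The function name `profile_steps`, the matmul workload, and the schedule values are placeholders chosen for this example, not recommended settings.

```python
import os

import torch

try:
    # Assumption: importing this module registers the "hpu" device on Gaudi systems.
    import habana_frameworks.torch.core  # noqa: F401
    HAS_HPU = True
except ImportError:
    HAS_HPU = False


def profile_steps(log_dir, steps=4):
    """Profile a few toy matmul steps and return the trace files written to log_dir."""
    activities = [torch.profiler.ProfilerActivity.CPU]
    device = torch.device("cpu")
    # Add device-side tracing only when the Gaudi pieces are actually present.
    if HAS_HPU and hasattr(torch.profiler.ProfilerActivity, "HPU"):
        activities.append(torch.profiler.ProfilerActivity.HPU)
        device = torch.device("hpu")

    with torch.profiler.profile(
        activities=activities,
        # Skip nothing, warm up for 1 step, record 2 steps, and do this once.
        schedule=torch.profiler.schedule(wait=0, warmup=1, active=2, repeat=1),
        # Write the trace in a layout that the TensorBoard profiler plugin reads.
        on_trace_ready=torch.profiler.tensorboard_trace_handler(log_dir),
    ) as prof:
        for _ in range(steps):
            x = torch.randn(64, 64, device=device)
            y = torch.matmul(x, x)
            y.to("cpu")   # force a device-to-host copy so it shows up in the trace
            prof.step()   # tell the profiler that one step has finished

    return sorted(f for f in os.listdir(log_dir) if f.endswith(".json"))
```

Calling `profile_steps("logs")` and then pointing TensorBoard at the `logs` directory should display the collected summary. In a real model, the toy matmul would be replaced by the training or inference step.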

Running Profiling

  • Run your application with high-level profiling enabled first:

    • Gather host and device execution and memory consumption summaries, and determine whether processes are host-bound or device-bound (refer to HPU Overview).

    • Review the recommendations that the Intel Gaudi TensorBoard plugin shows in the TensorBoard dashboard.

    • Based on these recommendations, take steps to mitigate host-boundedness, such as optimizing the dataloader, reducing host-device memory transfers, and minimizing the size of accumulated graphs.

    • Improve performance by maximizing batch size, enabling mixed precision, and applying the other techniques listed in the Model Optimization Checklist.

    • Upload the trace profile .json file to https://perfetto.habana.ai and identify bottlenecks, which appear as large gaps between the time-slices of operations in the trace profile viewer.

  • Once satisfied with the host-level optimizations, move on to low-level profiling using the Intel Gaudi Profiler:

    • Capture Intel Gaudi software API calls and hardware traces.

    • Upload the trace profile .hltv files to https://perfetto.habana.ai and identify bottlenecks in the same way as described above.

    • Use the built-in Trace Analyzer tool to analyze the overall duration of operations and core utilization.

    • For larger traces, consider using the Offline Trace Parser for detailed timeline analysis.

    • Refer to Profiling Tips and Tricks for further understanding of each executed node.
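The gap-hunting step in the workflow above can also be approximated offline. The following is a minimal sketch that assumes the .json trace follows the Chrome trace event format, where complete events carry `"ph": "X"` with `ts` and `dur` in microseconds. The helper names `load_events` and `find_gaps` and the 100 µs threshold are illustrative choices, not part of any Intel Gaudi tool. The sketch treats all events as one timeline, so for a real trace you would probably first filter the events to a single device lane.

```python
import json


def load_events(path):
    """Load trace events from a Chrome-trace-style JSON file."""
    with open(path) as f:
        data = json.load(f)
    # The format allows either a bare event list or an object with "traceEvents".
    return data["traceEvents"] if isinstance(data, dict) else data


def find_gaps(events, min_gap_us=100.0):
    """Return (gap_us, previous_op, next_op) tuples for idle gaps, largest first."""
    # Keep only complete ("X") events that have a start time and a duration.
    spans = sorted(
        (e for e in events if e.get("ph") == "X" and "ts" in e and "dur" in e),
        key=lambda e: e["ts"],
    )
    gaps = []
    if not spans:
        return gaps

    busy_until = spans[0]["ts"] + spans[0]["dur"]  # latest end time seen so far
    last = spans[0]                                # event that set busy_until
    for e in spans[1:]:
        gap = e["ts"] - busy_until
        if gap > min_gap_us:
            gaps.append((gap, last.get("name", "?"), e.get("name", "?")))
        end = e["ts"] + e["dur"]
        # Nested or overlapping events must not shorten the busy interval.
        if end >= busy_until:
            busy_until = end
            last = e
    return sorted(gaps, reverse=True)
```

For example, `find_gaps(load_events("trace.json"))` lists the longest idle periods together with the operations on either side of them, which can help decide where to zoom in within the trace viewer. The file name `trace.json` is a placeholder.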

Examples

To gain a sound understanding of how profiling is applied in different real-world scenarios, see the Profiling Real-World Examples. To learn about profiling with a more hands-on approach, refer to the Profiling and Optimization Tutorial.