Runtime

Enabling the Profiler

The profiler can be enabled in either of two modes:

  • Set the environment variable HABANA_PROFILE=1, with or without HABANA_PROF_CONFIG:

export HABANA_PROFILE=1
  • Or, set the environment variable HABANA_PROFILE=<template_name>, with or without HABANA_PROF_CONFIG:

export HABANA_PROFILE=<template_name>

Setting this environment variable causes the SynapseAI run-time library to enable the profiling library during initialization. The profiling library engages both the hardware instrumentation and the application API software instrumentation, which by default enables API call profiling and traces from the hardware. It can use a configuration file from ~/.habana or a pre-defined template.

  • To view a list of supported pre-defined templates:

hl-prof-config --list-templates
  • When HABANA_PROFILE selects either the default configuration or a pre-defined template, you can merge an existing configuration file on top of that selection by also setting HABANA_PROF_CONFIG. Alternatively, set the environment variable HABANA_PROF_CONFIG=<prof_config.json> on its own:

export HABANA_PROF_CONFIG=<prof_config.json>

Setting this environment variable without HABANA_PROFILE=1 loads the given configuration file and enables only the plugins specified in it.
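Taken together, a minimal shell sketch of these configurations (the template name and configuration-file path below are illustrative assumptions, not prescribed values):

```shell
# Mode 1: enable the profiler with the default configuration.
export HABANA_PROFILE=1

# Mode 2: enable the profiler with a pre-defined template
# (template name is illustrative; list the real ones with
# `hl-prof-config --list-templates`).
export HABANA_PROFILE=profile_api_light

# Either mode: merge a user configuration file on top of the
# selection above (path is illustrative).
export HABANA_PROF_CONFIG="$HOME/.habana/prof_config.json"

echo "HABANA_PROFILE=$HABANA_PROFILE"
echo "HABANA_PROF_CONFIG=$HABANA_PROF_CONFIG"
```

The variables must be set in the environment of the profiled application before the SynapseAI runtime initializes.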

Note

For TensorFlow Keras: to ensure the profiler data is created correctly, add a call to keras.backend.clear_session() at the end of the model script. Note that profiler post-processing requires some time at the end of the model execution.

An example of configuring the profiler to capture the 1st through 100th enqueues:

hl-prof-config -gaudi -e off -g 1-100

Parameters:

  • -gaudi - Target architecture is Habana® Gaudi®.

  • -e off - Overwrites the existing hl-prof configuration file, so that the profiler configuration includes only what is set by this command.

  • -g 1-100 - Profiles the 1st through 100th enqueues. Note that the more enqueues are profiled, the longer profiler post-processing takes and the more storage the profiling file occupies.
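The configuration step and the profiled run can be sketched end to end as follows (the workload script name is hypothetical, and the `command -v` guard is only there to keep the sketch runnable on machines without the SynapseAI tools installed):

```shell
# 1. Write the profiler configuration: Gaudi target, overwrite any
#    existing configuration (-e off), capture enqueues 1-100.
if command -v hl-prof-config >/dev/null 2>&1; then
    hl-prof-config -gaudi -e off -g 1-100
fi

# 2. Enable the profiler for the run.
export HABANA_PROFILE=1

# 3. Launch the workload; trace files are written and post-processed
#    when the run ends. (Script name is hypothetical.)
# python train.py
echo "HABANA_PROFILE=$HABANA_PROFILE"
```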

Note

At the end of the run, you might experience a long wait while the profiler post-processes the collected data. If the wait is too long, try reducing the profiling span.

Effect on Performance

You can enable profiling for the device and/or host:

  • Host profiling has negligible impact on the overall application performance, and no impact on device performance.

  • Device profiling may add to the overall run-time, as detailed below.

Device Profiling Prolog and Epilog

The hardware trace components themselves are almost completely non-intrusive. However, enabling and disabling them and collecting their data add some host CPU overhead to the overall run-time. The overall wall-clock time of the application can therefore be expected to increase, although the performance of the device components will not be affected, or will be only slightly affected in certain scenarios.

DRAM Bandwidth

The profiling tool consumes a small amount of DRAM bandwidth, which can slow down topologies that depend heavily on DRAM bandwidth. The theoretical worst case is a 12.5% slowdown; in practice, a 0-5% slowdown was observed, depending on the workload.

Enqueue Pipelining

When using automatic instrumentation, the profiler is enabled and disabled around each enqueue (launch), so each enqueue is executed in isolation. Certain parallelization that would otherwise be achieved by pipelining enqueues is therefore lost. Profiling multiple pipelined enqueues is possible in manual instrumentation mode, by surrounding the relevant user code with profiler start and stop API calls.