Enabling the Profiler
The profiler can be enabled in one of two ways:
Set the environment variable
Or, set the environment variable
Setting this environment variable lets the SynapseAI runtime library
enable the profiling library during initialization. The profiling
library engages both the hardware instrumentation and the application API
software instrumentation, so API call profiling and hardware traces are enabled by default.
It can use a configuration file from
~/.habana or a pre-defined template.
To view a list of supported pre-defined templates:
When HABANA_PROFILE selects either a template or the default configuration,
you can merge an existing configuration, given in HABANA_PROF_CONFIG, on top of it.
Set the environment variable
Setting this environment variable without
HABANA_PROFILE=1 loads the given configuration file
and enables only the plugins specified in that file.
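As a minimal sketch of these two combinations, the Python helper below builds the environment for a profiled run without launching anything. Only the variable names HABANA_PROFILE and HABANA_PROF_CONFIG and the value HABANA_PROFILE=1 come from the text above; the helper name, the value `my_config.json`, and the assumption that HABANA_PROF_CONFIG names a configuration file are hypothetical.

```python
import os


def profiler_env(base=None, enable_default=True, config=None):
    """Return a copy of `base` with the profiler variables applied.

    enable_default: set HABANA_PROFILE=1 (default configuration).
    config: value for HABANA_PROF_CONFIG (assumed here to name a
            configuration file; the exact format is not stated above).
    """
    env = dict(os.environ if base is None else base)
    # Drop any stale profiler settings before applying the new ones.
    env.pop("HABANA_PROFILE", None)
    env.pop("HABANA_PROF_CONFIG", None)
    if enable_default:
        env["HABANA_PROFILE"] = "1"
    if config is not None:
        env["HABANA_PROF_CONFIG"] = str(config)
    return env


# Default configuration with a user configuration merged on top.
merged = profiler_env(base={}, config="my_config.json")  # hypothetical file

# Only the plugins named in the configuration file.
config_only = profiler_env(base={}, enable_default=False, config="my_config.json")
```

The resulting dictionary can be passed as the `env` argument of `subprocess.run` when launching the workload.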
For TensorFlow Keras: To ensure the profiler data is created correctly,
make sure to add the code at the end of the model
Note that profiler post-processing takes some time at the end of the model run.
An example of configuring the profiler to capture enqueues 1 through 100:
hl-prof-config -gaudi -e off -g 1-100
-gaudi: The target architecture is Habana® Gaudi®.
-e off: Overwrites the hl-prof config file so that the profiler configuration includes only what this command sets.
-g 1-100: Profiles enqueues 1 through 100. The more enqueues are profiled, the longer profiler post-processing takes and the more storage the profiling file uses.
At the end of the run, you may experience long waits while the profiler post-processes the data. If the wait is too long, try reducing the profiling span.
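Because post-processing time grows with the number of profiled enqueues, it can help to generate the command for a narrower range programmatically. The sketch below only assembles the argument list from the flags shown above (-gaudi, -e off, -g); the function name is hypothetical, and actually running the command requires hl-prof-config to be installed.

```python
def prof_config_args(first, last):
    """Build the hl-prof-config argument list for enqueues first..last."""
    if last < first:
        raise ValueError("expected first <= last")
    return ["hl-prof-config", "-gaudi", "-e", "off", "-g", f"{first}-{last}"]


# The example above, and a shorter span to reduce post-processing time.
full_span = prof_config_args(1, 100)
short_span = prof_config_args(1, 10)
```

Either list can be handed to `subprocess.run(args, check=True)` before starting the workload.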
Effect on Performance
You can enable profiling for the device and/or host:
Host profiling has negligible impact on the overall application performance, and no impact on device performance.
Device profiling may add to run-time in the aspects detailed below.
Device Profiling Prolog and Epilog
The hardware trace components are almost completely non-intrusive. However, enabling and disabling them and collecting their data add some host CPU overhead. The overall run time of the application can therefore be expected to increase, while the performance of the device components is unaffected, or only slightly affected in certain scenarios.
The profiling tool uses a small amount of DRAM bandwidth, which can slow down topologies that depend heavily on DRAM bandwidth. The theoretical worst case is a 12.5% slowdown; in practice, slowdowns of 0-5% were observed, depending on the workload.
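To put these figures in perspective, the sketch below converts the percentages into run-time estimates for a DRAM-bound workload. It assumes that a slowdown of p% means the run time grows by a factor of 1 + p/100, which the text does not state explicitly; the function name and the 200-second baseline are illustrative only.

```python
def profiled_runtime(baseline_s, slowdown_pct):
    """Estimated run time when profiling slows the workload by slowdown_pct."""
    return baseline_s * (1 + slowdown_pct / 100)


baseline_s = 200.0  # illustrative unprofiled device time, in seconds
observed_high = profiled_runtime(baseline_s, 5.0)   # top of the observed 0-5% range
worst_case = profiled_runtime(baseline_s, 12.5)     # theoretical worst case
```

Under this assumption, the gap between the observed range and the theoretical bound is what a heavily DRAM-bound topology should budget for when planning a profiled run.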
With automatic instrumentation, the profiler is enabled and disabled for each enqueue (launch), so each enqueue executes in isolation. As a result, the parallelism that pipelining enqueues would otherwise provide is lost. To profile multiple pipelined enqueues, use the manual instrumentation mode and surround the relevant user code with profiling start and stop API calls.