Runtime¶
Enabling the Profiler¶
The profiler can be enabled in two modes:

Set the environment variable HABANA_PROFILE=1, with or without HABANA_PROF_CONFIG:

export HABANA_PROFILE=1

Or, set the environment variable HABANA_PROFILE=<template_name>, with or without HABANA_PROF_CONFIG:

export HABANA_PROFILE=<template_name>
Setting this environment variable allows the SynapseAI runtime library to enable the profiling library during initialization. The profiling library engages the hardware instrumentation and the application API software instrumentation, which by default enables API call profiling and traces from the hardware. It can use a configuration file from ~/.habana, or a pre-defined template.
To view a list of supported pre-defined templates:
hl-prof-config --list-templates
Whether you use a template or the default configuration with HABANA_PROFILE, you can merge an existing configuration on top of the default configuration or the template specified in HABANA_PROFILE by also setting HABANA_PROF_CONFIG. Alternatively, set only the environment variable HABANA_PROF_CONFIG=<prof_config.json>:

export HABANA_PROF_CONFIG=<prof_config.json>

Setting this environment variable without HABANA_PROFILE=1 loads the given configuration file and enables only the plugins specified in it.
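These environment variables must be in place before the SynapseAI runtime initializes. As a minimal sketch (the configuration file path is a hypothetical example), they can also be set programmatically at the top of a script, before importing the framework that loads the runtime:

```python
import os

# Enable the profiler with the default configuration.
# Must happen before the SynapseAI runtime initializes, i.e. before
# importing the framework that loads it.
os.environ["HABANA_PROFILE"] = "1"

# Optionally merge a custom configuration on top of the default
# (hypothetical path; generate the file with hl-prof-config).
os.environ["HABANA_PROF_CONFIG"] = os.path.expanduser("~/prof_config.json")
```

This mirrors the shell `export` commands above; it is only a convenience for scripts that cannot rely on the launching shell's environment.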
Note
For TensorFlow Keras: To ensure the profiler data is created correctly, add keras.backend.clear_session() at the end of the model code.
You will notice that the profiler post-processing requires some time at the end of the model execution.
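The note above can be sketched as follows. The model-building code is a hypothetical placeholder, and the TensorFlow import is guarded so the sketch stands on its own even without TensorFlow installed:

```python
# Hedged sketch: clear the Keras session at the end of the run so the
# profiler data is flushed correctly. The import guard keeps the sketch
# self-contained when TensorFlow is not available.
try:
    from tensorflow import keras
except ImportError:
    keras = None

def run_model():
    if keras is None:
        return  # TensorFlow not available; nothing to profile
    # ... build, compile, and fit the model here (placeholder) ...
    keras.backend.clear_session()  # flush profiler data at end of run
```

Calling clear_session() last ensures the profiler's post-processing runs after the model has finished, which is why some wait time is expected at this point.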
An example of configuring the profiler to capture the 1st through 100th enqueues:

hl-prof-config -gaudi -e off -g 1-100

Parameters:

-gaudi - Target architecture is Habana® Gaudi®.

-e off - Indicates that the hl-prof config file will be overwritten, so the profiler configuration will include only what is configured by this command.

-g 1-100 - Profiles the 1st through 100th enqueues. Note that the more enqueues are profiled, the longer the profiler post-processing takes, and the more storage the profiling file occupies.
Note
At the end of the run, you might experience long wait times while the profiler post-processes the data. If these wait times are too long, try reducing the profiling span.
Effect on Performance¶
You can enable profiling for the device and/or host:
Host profiling has negligible impact on the overall application performance, and no impact on device performance.
Device profiling may add run-time overhead in the aspects detailed below.
Device Profiling Prolog and Epilog¶
The hardware trace components are almost completely non-intrusive. However, the enabling, disabling and collection of data adds some host CPU overhead to the overall run-time. This means that the overall time of the application can be expected to increase, although the performance of the device components will not be affected, or only slightly affected in certain scenarios.
DRAM Bandwidth¶
The profiling tool utilizes a small amount of DRAM bandwidth, which can slow down topologies that depend heavily on DRAM bandwidth. The worst case is a theoretical 12.5% slowdown, and in practice 0-5% was observed, depending on the workload.
Enqueue Pipelining¶
When using automatic instrumentation, the profiler is enabled and disabled for each enqueue (launch). In this case, each enqueue is executed in isolation. Therefore, certain parallelization which can be achieved by pipelining enqueues is disabled. Profiling multiple pipelined enqueues is possible using the manual instrumentation mode while surrounding the relevant user code with profiling start and stop API calls.
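The manual-instrumentation pattern described above can be sketched as a small context manager. The profiler's actual start and stop entry points are not named in this section, so `start` and `stop` below are hypothetical placeholders for the profiling start and stop API calls:

```python
from contextlib import contextmanager

@contextmanager
def profiled_region(start, stop):
    """Surround a block of enqueues with profiler start/stop calls.

    `start` and `stop` are placeholders for the profiler's start and
    stop API calls (consult the SynapseAI profiler API for the real
    entry points).
    """
    start()
    try:
        yield
    finally:
        stop()  # always stop, even if an enqueue raises

# Usage sketch: several pipelined enqueues inside one profiled region,
# so pipelining across them is preserved instead of profiling each
# enqueue in isolation.
events = []
with profiled_region(lambda: events.append("start"),
                     lambda: events.append("stop")):
    events.append("enqueue-1")  # stand-in for a real enqueue
    events.append("enqueue-2")
```

Because the start/stop calls bracket the whole region rather than each individual enqueue, the device can keep overlapping work across enqueues while still being profiled.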