5. Profiler User Guide

5.1. Overview

This document describes the SynapseAI Profiling Subsystem included in the SynapseAI software release.

The Synapse Profiling Subsystem is designed to facilitate the instrumentation of Habana hardware and software systems. The subsystem generates diagnostic information about core utilization, enabling performance analysis and optimization.

The profiler functions in three stages:

  • Configuration

  • Run-time

  • Analysis

This document provides a detailed description of the operation of each stage.

5.2. Synapse Profiling Subsystem

5.2.1. Configuration

No configuration is required when using the default profiling settings. To change the settings, use the profiling configuration tool included in the SynapseAI installation.

Pre-configured Instrumentation

To configure the profiler settings, run the following application:

hl-prof-config

This tool functions in two modes: command line interface (CLI) and graphical user interface (GUI). Using the GUI requires Java. The tool enables adjusting the software and hardware settings of the profiling subsystem, including changing the session name, output directory, output formats and basic hardware settings of the instrumentation.

To use the Java GUI, simply invoke the tool without any parameters. When parameters are added to the call, the tool operates in CLI mode. For example:

hl-prof-config -h

The above command displays the usage message in the terminal. To see architecture-specific settings, add the desired architecture. For example:

hl-prof-config -h -gaudi

The configuration file is stored in a hidden folder called '.habana' located in your home directory. All subsequent profiling sessions will use these settings. The settings can be reset using the tool, or by deleting the configuration.json files located in the '.habana' directory.
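The manual reset described above can be scripted. The following is a minimal Python sketch, assuming that the configuration files in the '.habana' folder have names ending in configuration.json (an assumption; check the folder contents first). The function name and the dry_run parameter are hypothetical and not part of the SynapseAI tools.

```python
from pathlib import Path


def reset_profiler_config(habana_dir=None, dry_run=True):
    """List (and optionally delete) profiler configuration files.

    Assumption: configuration files have names ending in
    "configuration.json"; verify this against your own '.habana' folder.
    """
    directory = Path(habana_dir) if habana_dir else Path.home() / ".habana"
    if not directory.is_dir():
        return []
    targets = sorted(directory.glob("*configuration.json"))
    if not dry_run:
        for path in targets:
            path.unlink()  # after removal, the default settings apply again
    return targets
```

With the default dry_run=True, the helper only reports which files it would remove, which is a safer first step than deleting immediately.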


Figure 5.1 Profiler Configuration Tool Main Window

Source Code Instrumentation

For application-level activation of the profiler in user code, select the API controlled option in the main window. This can be set separately for device instrumentation and host instrumentation.

For device instrumentation, select the API controlled option under General settings -> Phase settings in the Select Trace Units window (see Figure 5.2). This disables all automatic profiling of API calls. After saving this configuration, place calls to the synProfilerStart() and synProfilerStop() APIs at the relevant points in the code. Each time a pair of start and stop calls is encountered, the profiling subsystem generates a new trace buffer. The trace buffer may then be retrieved from memory using the synProfilerGetTrace() API, and it remains available for retrieval until a new one is generated. In this mode, the other phases and the max invocations setting are irrelevant. All other settings apply to both manual and automatic profiling. Run-time usage is also the same for both, as detailed in the following section.

The trace buffer may be retrieved in memory as a C/C++ struct, and parsed in the application, or written to the file system in JSON format. If using the latter, the file will be saved according to the output settings. Further details are available in the SynapseAI API documentation.

The following is an example of API-controlled instrumentation:

// Start profiling (enables the device trace modules)
status = synProfilerStart(synTraceDevice, deviceId);

// Do something here, e.g., enqueue
// ...
// Wait on some handle for completion

// Stop profiling (disables the device trace modules)
status = synProfilerStop(synTraceDevice, deviceId);

// Output to the file system by passing buffer=nullptr and size=nullptr
status = synProfilerGetTrace(synTraceDevice,
                             deviceId, synTraceFormatTEF, nullptr, nullptr);

Figure 5.2 Profiler Configuration Tool - Select Trace Units Window

5.2.2. Run-time Usage

To enable the profiler, run the following:

export HABANA_PROFILE=1

Setting this environment variable allows the Synapse run-time library to enable the profiling library during initialization. The profiling library engages the hardware instrumentation and the application API software instrumentation.
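Because the variable is consulted when the Synapse run-time library initializes, it needs to be present in the environment before the application starts. The following is a minimal Python sketch of a launcher that sets HABANA_PROFILE only for a child process. The helper names are hypothetical, and the child command shown is a stand-in for a real workload.

```python
import os
import subprocess
import sys


def profiling_env(base_env=None):
    """Return a copy of the environment with HABANA_PROFILE=1 set."""
    env = dict(os.environ if base_env is None else base_env)
    env["HABANA_PROFILE"] = "1"
    return env


def run_with_profiling(command):
    """Run `command` (a list of arguments) with profiling enabled."""
    return subprocess.run(command, env=profiling_env(), check=False)


if __name__ == "__main__":
    # Stand-in workload: a child Python process that echoes the variable.
    child = [sys.executable, "-c",
             "import os; print(os.environ.get('HABANA_PROFILE'))"]
    run_with_profiling(child)
```

Setting the variable per process in this way leaves the parent shell unchanged, so other runs are not profiled by accident.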

Note for TensorFlow Keras usage: To ensure that the profiler data is created correctly, add the line of code below at the end of the model. Expect the profiler post-processing to take some time at the end of the model execution.


Additionally, a simple example of configuring the profiler to capture enqueues 1 through 100 is as follows:

hl-prof-config -gaudi -e off -g 1-100


  • -gaudi - Target architecture is Habana Gaudi.

  • -e off - Overwrites the hl-prof config file, so that the profiler configuration includes only what this command sets.

  • -g 1-100 - Profiles enqueues 1 through 100. Note that the more enqueues are profiled, the longer profiler post-processing takes and the more storage the profiling file requires.
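When experimenting with different capture spans, it can help to generate the command rather than retype it. The following minimal Python sketch builds the argument list from the flags shown above only. It assumes that ranges other than 1-100 use the same first-last syntax and that enqueue numbering starts at 1; both are assumptions based on the example. The function name is hypothetical.

```python
import shlex


def capture_range_command(first, last, arch="-gaudi"):
    """Build the hl-prof-config call that profiles enqueues first..last."""
    # Assumption: enqueues are numbered from 1, as in the "1-100" example.
    if first < 1 or last < first:
        raise ValueError("expected 1 <= first <= last")
    return ["hl-prof-config", arch, "-e", "off", "-g", f"{first}-{last}"]


# Print the command for review instead of executing it.
print(shlex.join(capture_range_command(1, 100)))
```

A narrower range, such as capture_range_command(1, 10), is one way to shorten post-processing and reduce the size of the output.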


At the end of the run, you may experience long wait times while the profiler post-processes the data. If the wait is too long, try reducing the profiling span.

Effect on Performance

You can enable profiling for the device and/or host:

  • Host profiling has negligible impact on the overall application performance, and no impact on device performance.

  • Device profiling may add to run-time in the aspects detailed below.

Device Profiling Prolog and Epilog

The hardware trace components are almost completely non-intrusive. However, the enabling, disabling and collection of data adds some host CPU overhead to the overall run-time. This means that the overall time of the application can be expected to increase, although the performance of the device components will not be affected, or only slightly affected in certain scenarios.

DRAM Bandwidth

The profiling tool utilizes a small amount of DRAM bandwidth, which can slow down topologies that depend heavily on DRAM bandwidth. The worst case is a theoretical 12.5% slowdown, and in practice 0-5% was observed, depending on the workload.

Enqueue Pipelining

When using automatic instrumentation, the profiler is enabled and disabled for each enqueue (launch), so each enqueue is executed in isolation. As a result, the parallelism that pipelining enqueues would otherwise provide is lost. To profile multiple pipelined enqueues, use the manual instrumentation mode and surround the relevant user code with profiling start and stop API calls.
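The control flow of the manual mode can be sketched as a single start/stop pair wrapped around several enqueues. Python is used here only to illustrate the pattern. The start and stop callables are placeholders, and a real application would call synProfilerStart() and synProfilerStop() from its own code, as in the API example earlier in this guide. The name profiled_region is hypothetical.

```python
from contextlib import contextmanager


@contextmanager
def profiled_region(start, stop):
    """Call start() on entry and stop() on exit, even if the body raises."""
    start()
    try:
        yield
    finally:
        stop()


# Placeholder callables that only record the order of operations.
events = []
with profiled_region(lambda: events.append("start"),
                     lambda: events.append("stop")):
    for step in range(3):
        events.append(f"enqueue {step}")  # stand-in for pipelined launches

print(events)
# ['start', 'enqueue 0', 'enqueue 1', 'enqueue 2', 'stop']
```

Wrapping the region in try/finally ensures the stop call is made even when the profiled code fails.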

5.2.3. Analysis

Output Products

The default profiler output file is default_profiling.json. This is a parsed JSON file of device trace and host API function calls, which can be viewed in the HLTV viewer. The following output files are not written by default:

  • default_profiling_[<serial#>].json - Per-synLaunch (enqueue) parsed JSON file of device trace, for viewing in the HLTV viewer. One file per synLaunch (enqueue) is generated when host profiling is disabled.

  • default_profiling_host.json - JSON representation of the API function calls for host application profiling, generated when device profiling is disabled.


Full output file name format is: <sessionName>[_<timestamp>][_deviceId#][_<serial#>]

  • sessionName - Session name is default_profiling, unless otherwise configured.

  • _<timestamp> - Timestamp appears if the character ‘#’ is included in the session name. Timestamp format is YYYYMMDD_hh-mm-ss.

  • _deviceId# - The device ID is included if the device profiled is not identified as hl0, e.g. hl1, hl2, hl3 etc. In the case of device 0, the deviceId will not be included in the output filename.

  • _<serial#> - Serial number is the number of the invocation in the current session. By default, profiling is enabled for the first two Synapse synLaunch API calls committed by the application. Subsequent calls will not be traced.

Viewing Instructions

To view the profiling graph:

  1. Open Google Chrome or another Chromium-based web browser.

  2. Type https://hltv.habana.ai in the address bar.

  3. Drag and drop or load the generated JSON file.
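When a run produces several JSON files, it can help to identify the session, device and invocation of each file before loading it. The following minimal Python sketch splits a file name according to the output file name format described under Output Products. It assumes that the device ID is rendered as hl<N>, that the serial number consists of digits only, and that the file has a .json extension. These renderings are assumptions, and the function name is hypothetical.

```python
import re

# Assumptions: device ID rendered as "hl<N>", serial number as plain digits,
# and a ".json" extension. Adjust the pattern if your files differ.
NAME_PATTERN = re.compile(
    r"^(?P<session>.+?)"
    r"(?:_(?P<timestamp>\d{8}_\d{2}-\d{2}-\d{2}))?"
    r"(?:_(?P<device>hl\d+))?"
    r"(?:_(?P<serial>\d+))?"
    r"\.json$"
)


def parse_output_name(file_name):
    """Split a profiler output file name into its documented components."""
    match = NAME_PATTERN.match(file_name)
    if match is None:
        raise ValueError(f"unrecognized profiler file name: {file_name}")
    parts = match.groupdict()
    parts["device"] = parts["device"] or "hl0"  # hl0 is omitted from names
    parts["serial"] = int(parts["serial"]) if parts["serial"] else None
    return parts


print(parse_output_name("default_profiling_2.json"))
```

Suffixes that are not part of the documented format, such as _host, stay in the session field. A session name that itself ends in an underscore followed by digits cannot be told apart from a serial number.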

HLTV (Habana Labs Trace Viewer) is a web service based on the chrome://tracing mechanism but with specific functionality added for Habana Labs tracing. The trace data is rendered inside HLTV on the client side, so no data is uploaded to the web server. It is also possible to install it as a PWA (Progressive Web App) on the client by pressing the small installation icon in the browser’s address bar.

Using the default configuration, the profiling results are divided into three processes: DMA, MME and TPC. The DMA shows bus monitors, while the MME and TPC show contexts. Together, this data provides a view of the execution of a recipe on the hardware and enables the viewer to quickly ascertain the cause of bottlenecks, slow performance, etc.

The DMA contains results from six bus monitors: DDR0 read and write, DDR1 read and write, and SRAM read and write. Each of the six bus monitors tracks bandwidth (in percentage units), latency (in cycle units), and outstanding transactions (in number of transactions) counters. Each counter shows the minimum, average and maximum values for the monitored time window. By default, the window is set to 2000 cycles.
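To make the meaning of the per-window counters concrete, the following minimal Python sketch reduces a sequence of samples to minimum, average and maximum values per window. The aggregation itself is performed by the hardware monitors. The per-cycle sampling and the function name used here are assumptions for illustration only.

```python
def window_stats(samples, window=2000):
    """Summarize samples as (min, avg, max) tuples, one per window."""
    if window <= 0:
        raise ValueError("window must be positive")
    stats = []
    for start in range(0, len(samples), window):
        chunk = samples[start:start + window]
        stats.append((min(chunk), sum(chunk) / len(chunk), max(chunk)))
    return stats


# Toy data: outstanding-transaction counts over 6 cycles, window of 3 cycles.
print(window_stats([2, 4, 6, 1, 1, 7], window=3))
# [(2, 4.0, 6), (1, 3.0, 7)]
```

A wide gap between the average and the maximum within one window suggests bursty traffic that the average alone would hide.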

The MME and TPC show workloads based on timestamped hardware events demarcating the beginning and end of each context. Clicking on a context shows additional information regarding the context, including the user node name, the operation kernel name, and the data type. Figure 5.3 shows an example of a topology view in chrome://tracing. Figure 5.4 shows an example of the host API calls view in chrome://tracing.


Figure 5.3 Full Application View with Multiple Profiled Iterations


Figure 5.4 Zoom in on Device Profiling View

The graphical interface is powered by the Trace Event Profiling Tool Chromium Project and the Trace-Viewer frontend for Chrome.

The viewing features are clearly documented and accessible by clicking the question mark in the top right corner. Figure 5.5 shows a snapshot of the help screen.


Figure 5.5 Chrome Tracing Help Screen

One of the most useful viewing tools is the Timing Mode, enabled by pressing ‘4’. This mode allows selection by dragging the mouse from one point to another and then displays the exact time between the beginning and end of selection. Figure 5.6 shows an example of a time slice selection.


Figure 5.6 Timing Selection Example

Trace Analyzer

The trace analyzer is a built-in feature in HLTV that is meant to reduce the amount of time spent on analyzing large traces. In the bottom panel, a tab called “Trace Analyzer” contains aggregate data per operation including total duration, MME utilization and additional information. Double-clicking on a specific row switches to the “Analyzed Nodes” tab and filters it for the chosen operation.


The “Analyzed Nodes” tab contains additional information for each node in the executed graph. A filter option in the top left corner of the tab is available and can filter rows by node name as well as by operation. It is also possible to control the order of the columns through a drag & drop mechanism located on the left side of the tab.
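A similar per-operation aggregation can be approximated offline. The following minimal Python sketch sums event durations per name and supports a simple name filter. It assumes that the profiler JSON follows the Chrome Trace Event Format, with complete ("X") events that carry name and dur fields. Whether the generated files use this event type is an assumption, and the sketch does not reproduce HLTV metrics such as MME utilization. The function name and sample data are hypothetical.

```python
import json
from collections import defaultdict


def total_duration_by_name(trace, name_filter=""):
    """Sum the durations of complete ("X") events per event name."""
    # Assumption: Chrome Trace Event Format, given either as a bare list
    # of events or as an object with a "traceEvents" list.
    events = trace["traceEvents"] if isinstance(trace, dict) else trace
    totals = defaultdict(float)
    for event in events:
        name = event.get("name")
        if event.get("ph") != "X" or name is None or name_filter not in name:
            continue
        totals[name] += event.get("dur", 0)
    # Longest total first, similar to sorting an aggregate table by duration.
    return dict(sorted(totals.items(), key=lambda item: item[1], reverse=True))


sample = json.loads("""
{"traceEvents": [
  {"name": "conv", "ph": "X", "ts": 0,  "dur": 40, "pid": 1, "tid": 1},
  {"name": "relu", "ph": "X", "ts": 40, "dur": 5,  "pid": 2, "tid": 1},
  {"name": "conv", "ph": "X", "ts": 50, "dur": 35, "pid": 1, "tid": 1}
]}
""")
print(total_duration_by_name(sample))
# {'conv': 75.0, 'relu': 5.0}
```

To analyze a real file, replace the inline sample with json.load() on the generated JSON file.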