# 7. Debugging Guide¶

## 7.1. General Recommendations¶

Below are suggested courses of action to try if you encounter various issues while training on Habana® Gaudi®. These can be tried in combination with data science best practices.

Note

Habana’s integration with TensorFlow does not currently support eager mode. Eager-mode functions such as tf.config.experimental_run_functions_eagerly will not work as expected.

Note

Habana’s integration with PyTorch supports eager mode, which is the default execution mode.
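For instance, a minimal eager-mode sketch might look like the following; it assumes the habana_frameworks PyTorch package is installed and that importing habana_frameworks.torch.core is what registers the hpu device in your release (adjust the import if your installation differs):

import torch
import habana_frameworks.torch.core  # assumed import that registers the "hpu" device
# In eager mode each operation executes as soon as it is called.
x = torch.randn(2, 3, device="hpu")
y = torch.ones(2, 3, device="hpu")
print((x + y).to("cpu"))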

## 7.2. TensorBoard Usage¶

### 7.2.1. Visualization¶

SynapseAI® Software can generate data representing HPU clusters for visualization in TensorBoard. When TensorBoard visualization is enabled, SynapseAI adds a tag, post_optimization_graph, visualizing the clustered TF graph. Furthermore, if the environment variable GRAPH_VISUALIZATION=1 is set, additional tags are created for each Habana Op cluster, visualizing the cluster’s Synapse graphs before and after compilation by the Graph Compiler.
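As an illustrative sketch only, the tags can be inspected by training a small Keras model with the standard TensorBoard callback while GRAPH_VISUALIZATION=1 is exported; the log directory, toy model, and load_habana_module() call are assumptions about a typical Habana TensorFlow setup rather than values prescribed here:

import os
os.environ["GRAPH_VISUALIZATION"] = "1"  # set before any graphs are compiled
import tensorflow as tf
from habana_frameworks.tensorflow import load_habana_module  # assumed entry point
load_habana_module()
model = tf.keras.Sequential([tf.keras.layers.Dense(16, activation="relu"),
                             tf.keras.layers.Dense(1)])
model.compile(optimizer="sgd", loss="mse")
# write_graph=True emits graph data for TensorBoard's Graphs tab, where the
# post_optimization_graph tag (and the per-cluster tags) should appear.
tb_cb = tf.keras.callbacks.TensorBoard(log_dir="./tb_logs", write_graph=True)
model.fit(tf.random.normal([64, 8]), tf.random.normal([64, 1]),
          epochs=1, callbacks=[tb_cb])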

### 7.2.2. Trace Viewer¶

The SynapseAI® Profiling Subsystem trace buffer (see the Profiler User Guide) can be displayed in the trace viewer tool of TensorBoard Profile. When TensorBoard profiling is enabled and the environment variable HABANA_PROFILE=1 is set, the SynapseAI profiling library engages the hardware instrumentation, and the generated trace buffer is displayed in the trace viewer.

Due to TensorBoard limitations, the HPU device is currently displayed as TPU.
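For example, a hedged sketch of collecting a trace through the Keras TensorBoard callback’s profile_batch option with HABANA_PROFILE=1 set; the batch range, log directory, toy model, and load_habana_module() call are placeholders:

import os
os.environ["HABANA_PROFILE"] = "1"  # engage the SynapseAI profiling library
import tensorflow as tf
from habana_frameworks.tensorflow import load_habana_module  # assumed entry point
load_habana_module()
model = tf.keras.Sequential([tf.keras.layers.Dense(16, activation="relu"),
                             tf.keras.layers.Dense(1)])
model.compile(optimizer="sgd", loss="mse")
# Profile batches 5-10; the resulting trace appears in the Profile tab's trace
# viewer (with the HPU listed as TPU, per the limitation noted above).
tb_cb = tf.keras.callbacks.TensorBoard(log_dir="./tb_profile", profile_batch=(5, 10))
model.fit(tf.random.normal([512, 8]), tf.random.normal([512, 1]),
          epochs=1, batch_size=32, callbacks=[tb_cb])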

## 7.3. Generate Logs¶

If you encounter problems while training a model on Gaudi, it is frequently useful to generate and inspect your log files. By inspecting log files, you can pinpoint where a model failure is occurring, and alter your model or training script to resolve or work around defects.

The generation of logging information and the location of logged information are controlled by environment variables. For example, if you set the following environment variables before training your model, a large amount of information will be generated under ~/.habana_logs/:

$ export HABANA_LOGS=~/.habana_logs
$ export LOG_LEVEL_ALL=0
$ # Train your model as usual


The sections below detail the various environment variables and describe their values.

### 7.3.1. Location of Log Files¶

ENABLE_CONSOLE=true outputs the logs to the console. If ENABLE_CONSOLE is not set at all or not set to true, logs are output in the directory specified by HABANA_LOGS. For example, if you set the following environment variables, all SynapseAI errors will be logged to the console:

$ export ENABLE_CONSOLE=true
$ export LOG_LEVEL_ALL=4
$ # Train your model as usual


### 7.3.2. Log Levels¶

| Value | Log Level | Description |
|-------|-----------|-------------|
| 0 | Trace | Log everything, including traces of progress |
| 1 | Debug | Log all errors, warnings and all information useful for debugging |
| 2 | Info | Log errors, warnings and some informative messages |
| 3 | Warning | Log all errors and warnings |
| 4 | Error | Log all errors |
| 5 | Critical | Log only critical errors |
| 6 | Off | Log nothing |

### 7.3.3. Component-Level Logs¶

The value of LOG_LEVEL_ALL=[log level] sets the logging level for all components. However, it is sometimes useful to view detailed information for a single component.

To specify the log level for a particular component, append the name of the component to LOG_LEVEL_.

For example, if you set the following environment variable, all components will log only critical errors (set with LOG_LEVEL_ALL=5) except for the Synapse API (set with LOG_LEVEL_SYN_API=3), which will log all errors and warnings:

$ export HABANA_LOGS=~/.habana_logs
$ export LOG_LEVEL_ALL=5
$ export LOG_LEVEL_SYN_API=3
$ # Train your model as usual


### 7.3.4. Names of Components that Produce Logs¶

| Component | Log Component Name |
|-----------|--------------------|
| The Synapse API | SYN_API |
| The profiling subsystem | SYN_PROF, PROF_hl[0-7] and HLPROF |
| The graph compiler | PARSER, GC, and GRAPH_DATA |
| The Habana performance library | PERF_LIB |
| The Habana Communication Library | HCL and HCL_SUBMISSIONS |

## 7.4. Framework/Bridge Logs¶

### 7.4.1. TensorFlow¶

You can set the following environment variables to obtain TensorFlow Habana Bridge level logs:

$ export TF_CPP_MIN_LOG_LEVEL=xxxx
$ export TF_CPP_MIN_VLOG_LEVEL=yyyy


Please refer to the Runtime Environment Variables section for a description of the above environment variables.

### 7.4.2. PyTorch¶

You can set the following environment variables to obtain PyTorch Habana Bridge level logs:

$ export PT_HPU_LOG_MOD_MASK=xxxx
$ export PT_HPU_LOG_TYPE_MASK=yyyy


Please refer to the Runtime Flags section for a description of the above environment variables.

## 7.5. Log TensorFlow Operation Assignments¶

tf.debugging.set_log_device_placement(True) prints assignments of tensors and operations to devices.
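A minimal sketch (the tensors are arbitrary examples):

import tensorflow as tf
# Enable placement logging before any operations are created.
tf.debugging.set_log_device_placement(True)
a = tf.constant([[1.0, 2.0], [3.0, 4.0]])
b = tf.constant([[1.0, 0.0], [1.0, 1.0]])
c = tf.matmul(a, b)  # the log line reports which device executed MatMul
print(c)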

## 7.6. Move TensorFlow Operators¶

Under certain circumstances, a Habana operator may not support a tensor with a given shape or data type. If the tensor cannot be reshaped or cast to a supported type, it is useful in TensorFlow models to use the tf.device context manager to schedule the operator for execution on the CPU.
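A hedged sketch of pinning a single operator to the CPU while leaving the rest of the computation on the default (HPU) device; the segment-sum op below simply stands in for whichever operator is unsupported in your case:

import tensorflow as tf
@tf.function
def forward(x, segment_ids):
    h = tf.nn.relu(tf.matmul(x, tf.random.normal([8, 8])))  # runs on the default device
    with tf.device("/device:CPU:0"):
        # Pin only the problematic operator to the CPU.
        pooled = tf.math.segment_sum(h, segment_ids)
    return tf.nn.relu(pooled)  # back on the default device
out = forward(tf.random.normal([4, 8]), tf.constant([0, 0, 1, 1]))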

## 7.7. Let TensorFlow Choose the Device¶

Adding tf.config.set_soft_device_placement(True) may prevent some compilation errors.
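For example:

import tensorflow as tf
# Allow TensorFlow to fall back to the CPU when an operator has no HPU
# implementation for the given shape or data type, instead of raising an error.
tf.config.set_soft_device_placement(True)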

## 7.8. TPC Fuser¶

The TPC Fuser is an optimizing compiler for TPC kernels. It can be enabled by exporting the environment variable RUN_TPC_FUSER=true:

$ export RUN_TPC_FUSER=true
$ # Train your model as usual


Enabling the TPC Fuser can result in numerical issues such as NaN loss. Under such circumstances, it can be helpful to turn off the TPC Fuser:

$ export RUN_TPC_FUSER=false
$ # Train your model as usual


## 7.10. Other Data Types¶

If a model converges on CPU but fails to converge on Gaudi, it is useful to experiment with other data types. For example, use the FP32 data type instead of BF16.

### 7.10.1. PyTorch¶

Once the FP32-based model converges, you may want to experiment with different mixed precision configurations to arrive at a model with the optimal performance/accuracy trade-off. Please refer to the PyTorch Mixed Precision Training on Gaudi section for more details on the configuration procedure and debugging.
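As a rough, hedged illustration, and assuming a PyTorch/SynapseAI release in which torch.autocast accepts device_type="hpu" (older releases use the Habana Mixed Precision package instead; the section referenced above describes the supported mechanism), toggling between an FP32 baseline and BF16 autocast for a single training step might look like this:

import torch
def train_step(model, batch, target, loss_fn, optimizer, use_bf16=False):
    optimizer.zero_grad()
    if use_bf16:
        # BF16 mixed precision (assumes autocast support for the "hpu" device type).
        with torch.autocast(device_type="hpu", dtype=torch.bfloat16):
            loss = loss_fn(model(batch), target)
    else:
        # Plain FP32 baseline when debugging convergence.
        loss = loss_fn(model(batch), target)
    loss.backward()
    optimizer.step()
    return loss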

## 7.11. Framework Version¶

If your model converges successfully on CPU or GPU but fails to converge on Gaudi, check for a mismatch between the version of the framework you are running on Gaudi and the version of the framework you are running on the CPU or GPU.

For TensorFlow, use:

import tensorflow as tf
print(tf.__version__)


or

import tensorflow as tf
print(tf.version.VERSION)


For PyTorch, use:

import torch
print(torch.__version__)


### 7.13.1. TensorFlow¶

$ export GRAPH_VISUALIZATION=1
$ # Train your model as usual


TensorFlow graphs will be written to the current directory.

### 7.13.2. PyTorch¶

$ export GRAPH_VISUALIZATION=1
$ # Train your model as usual


Synapse graphs will be written to the .graph_dumps directory as *.pbtx files.

## 7.14. Error Codes¶

When making calls directly to the Synapse API, it is useful to check the return codes against the following symbolic or integer values to understand the outcome of the operation.

| Return Code | Value | Description |
|-------------|-------|-------------|
| synSuccess | 0 | The operation succeeded |
| synInvalidArgument | 1 | An argument was invalid |
| synCbFull | 2 | The command buffer is full |
| synOutOfHostMemory | 3 | Out of host memory |
| synOutOfDeviceMemory | 4 | Out of device memory |
| synObjectAlreadyInitialized | 5 | The object being initialized is already initialized |
| synObjectNotInitialized | 6 | The object must be initialized before the operation can be performed |
| synCommandSubmissionFailure | 7 | The command buffer could not be submitted |
| synNoDeviceFound | 8 | No Habana device was found |
| synDeviceTypeMismatch | 9 | The operation is for the wrong device type |
| synFailedToInitializeCb | 10 | The command buffer failed to initialize |
| synFailedToFreeCb | 11 | The command buffer could not be freed |
| synFailedToMapCb | 12 | The command buffer could not be mapped |
| synFailedToUnmapCb | 13 | The command buffer could not be unmapped |
| synFailedToAllocateDeviceMemory | 14 | Device memory could not be allocated |
| synFailedToFreeDeviceMemory | 15 | Device memory could not be freed |
| synFailedNotEnoughDevicesFound | 16 | A free device could not be found |
| synDeviceReset | 17 | The operation failed because the device is being reset |
| synUnsupported | 18 | The requested operation is not supported |
| synWrongParamsFile | 19 | While loading a recipe, the binary parameters file failed to load |
| synDeviceAlreadyAcquired | 20 | The referenced device is already occupied |
| synNameIsAlreadyUsed | 21 | A tensor with the same name has already been created |
| synBusy | 22 | The operation failed to complete within the timeout period |
| synAllResourcesTaken | 23 | The event could not be created due to lack of resources |
| synUnavailable | 24 | The time an event finished could not be retrieved |
| synFail | 25 | The operation failed |