7. Debugging Guide

7.1. General Recommendations

Below are suggested courses of action to try if you encounter various issues while training on Habana® Gaudi®. They can be applied in combination with data science best practices.

Note

Habana’s integration with TensorFlow does not currently support eager mode. Eager mode APIs such as tf.config.experimental_run_functions_eagerly will not behave as expected.

Note

Habana’s integration with PyTorch supports eager mode, which is the default execution mode.

7.1.2. If Your Model Diverges

7.1.3. If Your Model Converges Slowly

Note

TensorBoard visualization has limited support in Habana’s integration with TensorFlow.

7.2. TensorBoard Usage

SynapseAI® Software can generate data representing HPU clusters to be visualized by TensorBoard. When TensorBoard visualization is enabled through the TFv2 native APIs (trace_on/trace_export), SynapseAI adds a tag, post_optimization_graph, visualizing the clustered TF graph. Furthermore, if the environment variable GRAPH_VISUALIZATION=1 is set, additional tags are created for each Habana Op cluster, visualizing the cluster’s Synapse graphs before and after Graph Compiler compilation.
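
The following is a minimal sketch of the TFv2 native trace APIs mentioned above; the log directory, the traced function, and the step value are illustrative and not part of SynapseAI itself:

import tensorflow as tf

# Illustrative sketch: enable TF2 graph tracing so that, when running on HPU,
# SynapseAI can attach its post_optimization_graph tag to the exported trace.
writer = tf.summary.create_file_writer("./tb_logs")

@tf.function
def train_step(x):
    return tf.math.square(x) + 1.0

tf.summary.trace_on(graph=True)
train_step(tf.constant(2.0))            # run once so a traced graph exists

with writer.as_default():
    tf.summary.trace_export(name="hpu_graph", step=0)

With GRAPH_VISUALIZATION=1 exported in the environment, the additional per-cluster tags described above are generated as well.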

For known issues and limitations refer to the Release Notes.

7.3. Generate Logs

If you encounter problems while training a model on Gaudi, it is often useful to generate and inspect log files. They can help you pinpoint where a model failure occurs, so you can alter your model or training script to resolve or work around the defect.

The generation of logging information and the location of the log files are controlled by environment variables. For example, if you set the following environment variables before training your model, a large amount of information will be generated under ~/.habana_logs/:

$ export HABANA_LOGS=~/.habana_logs
$ export LOG_LEVEL_ALL=0
$ # Train your model as usual

The following sections describe the relevant environment variables and their values.

7.3.1. Location of Log Files

ENABLE_CONSOLE=true outputs the logs to the console. If ENABLE_CONSOLE is unset or set to any value other than true, logs are written to the directory specified by HABANA_LOGS. For example, if you set the following environment variables, all SynapseAI errors will be logged to the console:

$ export ENABLE_CONSOLE=true
$ export LOG_LEVEL_ALL=4
$ # Train your model as usual

7.3.2. Log Levels

Value   Level      Description
0       Trace      Log everything, including traces of progress
1       Debug      Log all errors, warnings, and all information useful for debugging
2       Info       Log errors, warnings, and some informative messages
3       Warning    Log all errors and warnings
4       Error      Log all errors
5       Critical   Log only critical errors
6       Off        Log nothing

7.3.3. Component-Level Logs

LOG_LEVEL_ALL=[log level] sets the logging level for all components. However, it is sometimes useful to view more detailed information for a single component.

To specify the log level for a particular component, append the name of the component to LOG_LEVEL_.

For example, if you set the following environment variable, all components will log only critical errors (set with LOG_LEVEL_ALL=5) except for the Synapse API (set with LOG_LEVEL_SYN_API=3), which will log all errors and warnings:

$ export HABANA_LOGS=~/.habana_logs
$ export LOG_LEVEL_ALL=5
$ export LOG_LEVEL_SYN_API=3
$ # Train your model as usual

7.3.4. Names of Components that Produce Logs

Component                           Logger names
The Synapse API                     SYN_API
The profiling subsystem             SYN_PROF, PROF_hl[0-7], and HLPROF
The graph compiler                  PARSER, GC, and GRAPH_DATA
The Habana performance library      PERF_LIB
The Habana Communication Library    HCL and HCL_SUBMISSIONS

7.4. Framework/Bridge Logs

7.4.1. TensorFlow

You can set the following environment variables to obtain TensorFlow Habana Bridge-level logs:

$ export TF_CPP_MIN_LOG_LEVEL=xxxx
$ export TF_CPP_MIN_VLOG_LEVEL=yyyy

Please refer to the Runtime Flags section for a description of the above environment variables.

7.4.2. PyTorch

You can set the following environment variables to obtain PyTorch Habana Bridge-level logs:

$ export PT_HPU_LOG_MOD_MASK=xxxx
$ export PT_HPU_LOG_TYPE_MASK=yyyy

Please refer to the Runtime Flags section for a description of the above environment variables.

7.5. Log TensorFlow Operation Assignments

tf.debugging.set_log_device_placement(True) prints assignments of tensors and operations to devices.
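
A short sketch follows; the call should be made before any operations are created so that every placement decision is reported, and the constants below are placeholders:

import tensorflow as tf

# Enable device-placement logging before building the model so every
# tensor/op assignment (for example HPU vs. CPU) is printed.
tf.debugging.set_log_device_placement(True)

a = tf.constant([1.0, 2.0])
b = tf.constant([3.0, 4.0])
print(a + b)    # the device chosen for the addition is logged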

7.6. Move TensorFlow Operators

Under certain circumstances, a Habana operator may not support a tensor with a given shape or data type. If the tensor cannot be reshaped or cast to a supported type, for TensorFlow models it is useful to place the operator inside a tf.device scope so that it executes on the CPU, as shown in the sketch below.
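
A minimal sketch, assuming a hypothetical model in which one operation must run on the CPU; the chosen op and device string are illustrative:

import tensorflow as tf

# Pin a single problematic operation to the CPU while the rest of the
# graph follows normal (HPU) placement.
def forward(x):
    with tf.device("/device:CPU:0"):
        y = tf.math.cumsum(x)       # example op forced onto the CPU
    return tf.nn.relu(y)

print(forward(tf.constant([1.0, 2.0, 3.0])))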

7.7. Let TensorFlow Choose the Device

Adding tf.config.set_soft_device_placement(True) may prevent some compilation errors by allowing TensorFlow to fall back to another device when an operator cannot be placed on the requested one.

7.8. TPC Fuser

The TPC Fuser is an optimizing compiler for TPC kernels. It can be enabled by exporting the environment variable RUN_TPC_FUSER=true:

$ export RUN_TPC_FUSER=true
$ # Train your model as usual

Enabling the TPC Fuser can result in numerical issues such as a NaN loss. Under such circumstances, it can be helpful to turn off the TPC Fuser:

$ export RUN_TPC_FUSER=false
$ # Train your model as usual

7.10. Other Data Types

If a model converges on CPU but fails to converge on Gaudi, it is useful to experiment with other data types. For example, use the FP32 data type instead of BF16.

7.10.1. PyTorch

Once the FP32-based model converges, you may want to experiment with different mixed precision configurations to arrive at the best performance/accuracy trade-off. Please refer to the PyTorch Mixed Precision Training on Gaudi section for more details on the configuration procedure and debugging.
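
As a starting point, below is a minimal FP32 sketch for convergence debugging. The tiny model, random data, and learning rate are placeholders, and the "hpu" device string assumes the Habana PyTorch bridge is loaded and a Gaudi device is available (use "cpu" to compare behavior):

import torch

device = "hpu"   # assumes the Habana PyTorch bridge is available; try "cpu" to compare

# Keep weights, inputs, and targets in FP32 while debugging convergence.
model = torch.nn.Linear(16, 1).float().to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

x = torch.randn(32, 16, dtype=torch.float32).to(device)
y = torch.randn(32, 1, dtype=torch.float32).to(device)

loss = torch.nn.functional.mse_loss(model(x), y)
loss.backward()
optimizer.step()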

7.11. Framework Version

If your model converges successfully on CPU or GPU but fails to converge on Gaudi, check for a mismatch between the version of the framework you are running on Gaudi and the version of the framework you are running on the CPU or GPU.

For TensorFlow, use:

import tensorflow as tf
print(tf.__version__)

or

import tensorflow as tf
print(tf.version.VERSION)

For PyTorch, use:

$ python3 -c "import torch; print(torch.__version__)"

If you train using an existing checkpoint and your model fails to converge on Gaudi, check for a mismatch between the version of the framework you are running and the Gaudi framework version that generated the checkpoint.

7.12. hl-smi

The hl-smi tool is included in the Habana software distribution. It displays a variety of information about the Gaudi hardware and can be run repeatedly to observe changes over time. Run hl-smi -h for additional documentation.
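
For example (the refresh interval below is arbitrary, and watch is a standard Linux utility rather than part of hl-smi):

$ hl-smi                 # one-shot snapshot of device status
$ watch -n 5 hl-smi      # refresh the snapshot every 5 seconds
$ hl-smi -h              # full list of options for your release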

7.13. Model Graph

Set the following environment variables to generate a dump of the training graph:

7.13.1. TensorFlow

$ export LOG_LEVEL_GRAPH_DATA=0 GRAPH_VISUALIZATION=1 HBN_TF_GRAPH_DUMP=2
$ # Train your model as usual

TensorFlow graphs will be written to the current directory.

7.13.2. PyTorch

$ export GRAPH_VISUALIZATION=1
$ # Train your model as usual

Synapse graphs will be written to the .graph_dumps directory as *.pbtx files.

7.14. Error Codes

When making calls directly to the Synapse API, it is useful to check the return codes against the following symbolic or integer values to understand the outcome of the operation.

Status                              Value   Description
synSuccess                          0       The operation succeeded
synInvalidArgument                  1       An argument was invalid
synCbFull                           2       The command buffer is full
synOutOfHostMemory                  3       Out of host memory
synOutOfDeviceMemory                4       Out of device memory
synObjectAlreadyInitialized         5       The object being initialized is already initialized
synObjectNotInitialized             6       The object must be initialized before the operation can be performed
synCommandSubmissionFailure         7       The command buffer could not be submitted
synNoDeviceFound                    8       No Habana device was found
synDeviceTypeMismatch               9       The operation is for the wrong device type
synFailedToInitializeCb             10      The command buffer failed to initialize
synFailedToFreeCb                   11      The command buffer could not be freed
synFailedToMapCb                    12      The command buffer could not be mapped
synFailedToUnmapCb                  13      The command buffer could not be unmapped
synFailedToAllocateDeviceMemory     14      Device memory could not be allocated
synFailedToFreeDeviceMemory         15      Device memory could not be freed
synFailedNotEnoughDevicesFound      16      A free device could not be found
synDeviceReset                      17      The operation failed because the device is being reset
synUnsupported                      18      The requested operation is not supported
synWrongParamsFile                  19      While loading a recipe, the binary parameters file failed to load
synDeviceAlreadyAcquired            20      The referenced device is already occupied
synNameIsAlreadyUsed                21      A tensor with the same name has already been created
synBusy                             22      The operation failed to complete within the timeout period
synAllResourcesTaken                23      The event could not be created due to lack of resources
synUnavailable                      24      The time an event finished could not be retrieved
synFail                             25      The operation failed
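
If a Synapse return code only appears as a raw integer (for example, in a log line), a small lookup such as the following hypothetical Python helper can translate it; the mapping simply mirrors the table above:

# Hypothetical helper: translate a numeric Synapse return code into its
# symbolic name, using the table above.
SYN_STATUS = {
    0: "synSuccess",
    1: "synInvalidArgument",
    2: "synCbFull",
    3: "synOutOfHostMemory",
    4: "synOutOfDeviceMemory",
    5: "synObjectAlreadyInitialized",
    6: "synObjectNotInitialized",
    7: "synCommandSubmissionFailure",
    8: "synNoDeviceFound",
    9: "synDeviceTypeMismatch",
    10: "synFailedToInitializeCb",
    11: "synFailedToFreeCb",
    12: "synFailedToMapCb",
    13: "synFailedToUnmapCb",
    14: "synFailedToAllocateDeviceMemory",
    15: "synFailedToFreeDeviceMemory",
    16: "synFailedNotEnoughDevicesFound",
    17: "synDeviceReset",
    18: "synUnsupported",
    19: "synWrongParamsFile",
    20: "synDeviceAlreadyAcquired",
    21: "synNameIsAlreadyUsed",
    22: "synBusy",
    23: "synAllResourcesTaken",
    24: "synUnavailable",
    25: "synFail",
}

print(SYN_STATUS.get(8, "unknown status"))   # -> synNoDeviceFound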