Debugging Possible Model Errors

Generate Logs

If you encounter problems while training a model on the Intel® Gaudi® AI accelerator, it is often useful to generate and inspect log files. The logs can help you pinpoint where a model failure occurs so that you can alter your model or training script to resolve or work around the defect.

The generation of logging information and the location of the log files are controlled by environment variables. For example, if you set the following environment variables before training your model, a large amount of information will be generated under ~/.habana_logs/:

$ export HABANA_LOGS=~/.habana_logs
$ export LOG_LEVEL_ALL=0
$ # Train your model as usual

The sections below detail the relevant environment variables and describe their values.

Location of Log Files

Setting ENABLE_CONSOLE=true outputs the logs to the console. If ENABLE_CONSOLE is unset or set to any value other than true, logs are written to the directory specified by HABANA_LOGS. For example, if you set the following environment variables, all errors will be logged to the console:

$ export ENABLE_CONSOLE=true
$ export LOG_LEVEL_ALL=4
$ # Train your model as usual

Log Levels

Value  Level     Description
0      Trace     Log everything including traces of progress
1      Debug     Log all errors, warnings and all information useful for debugging
2      Info      Log errors, warnings and some informative messages
3      Warning   Log all errors and warnings
4      Error     Log all errors
5      Critical  Log only critical errors
6      Off       Log nothing

Component-Level Logs

The value of LOG_LEVEL_ALL=[log level] sets the logging level for all components. However, it is sometimes useful to view detailed information for a single component.

To specify the log level for a particular component, append the name of the component to LOG_LEVEL_.

For example, if you set the following environment variable, all components will log only critical errors (set with LOG_LEVEL_ALL=5) except for SYN_API (set with LOG_LEVEL_SYN_API=3), which will log all errors and warnings:

$ export HABANA_LOGS=~/.habana_logs
$ export LOG_LEVEL_ALL=5
$ export LOG_LEVEL_SYN_API=3
$ # Train your model as usual

Names of Components that Produce Logs

Component                        Log Component Names
Intel Gaudi Software API         SYN_API
Profiling Subsystem              SYN_PROF, PROF_hl[0-7] and HLPROF
Graph Compiler                   PARSER, GC, and GRAPH_DATA
Intel Gaudi Performance Library  PERF_LIB
Habana Communication Library     HCL and HCL_SUBMISSIONS

Generate PyTorch Logs

You can set the following environment variable to obtain Intel Gaudi PyTorch Bridge level logs:

$ export LOG_LEVEL_ALL_PT=[log level]

If LOG_LEVEL_ALL_PT is not set, the value of LOG_LEVEL_ALL is used instead.

Please refer to the Runtime Flags section for a description of the above environment variables.

Error Codes

When making calls directly to SYN_API, it is useful to check the return codes against the following symbolic or integer values to understand the outcome of the operation.

Return Code                      Value  Description
synSuccess                       0      The operation succeeded
synInvalidArgument               1      An argument was invalid
synCbFull                        2      The command buffer is full
synOutOfHostMemory               3      Out of host memory
synOutOfDeviceMemory             4      Out of device memory
synObjectAlreadyInitialized      5      The object being initialized is already initialized
synObjectNotInitialized          6      The object must be initialized before the operation can be performed
synCommandSubmissionFailure      7      The command buffer could not be submitted
synNoDeviceFound                 8      No Intel Gaudi device was found
synDeviceTypeMismatch            9      The operation is for the wrong device type
synFailedToInitializeCb          10     The command buffer failed to initialize
synFailedToFreeCb                11     The command buffer could not be freed
synFailedToMapCb                 12     The command buffer could not be mapped
synFailedToUnmapCb               13     The command buffer could not be unmapped
synFailedToAllocateDeviceMemory  14     Device memory could not be allocated
synFailedToFreeDeviceMemory      15     Device memory could not be freed
synFailedNotEnoughDevicesFound   16     A free device could not be found
synDeviceReset                   17     The operation failed because the device is being reset
synUnsupported                   18     The requested operation is not supported
synWrongParamsFile               19     While loading a recipe, the binary parameters file failed to load
synDeviceAlreadyAcquired         20     The referenced device is already occupied
synNameIsAlreadyUsed             21     A tensor with the same name has already been created
synBusy                          22     The operation failed to complete within the timeout period
synAllResourcesTaken             23     The event could not be created due to lack of resources
synUnavailable                   24     The time at which an event finished could not be retrieved
synInvalidTensorDimensions       25     A high-rank tensor is attached to a node that does not support it
synFail                          26     The operation failed
synOutOfResources                27     The operation failed due to lack of software memory
synUninitialized                 28     The Intel Gaudi software library was not initialized before being accessed
synAlreadyInitialized            29     Initialization failed because it had already been performed
synFailedSectionValidation       30     The launch operation failed because of a section mismatch
synSynapseTerminated             31     Intel Gaudi software cannot process the operation because it is being terminated
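
For example, a minimal sketch of this return-code-checking pattern is shown below. It assumes the synStatus return type and the synInitialize() and synDestroy() entry points declared in synapse_api.h; verify the exact names and signatures against the headers shipped with your Intel Gaudi software installation.

#include <stdio.h>
#include <synapse_api.h>

int main(void)
{
    /* Check every SYN_API return code against the symbolic values above. */
    synStatus status = synInitialize();
    if (status != synSuccess) {
        /* Compare the numeric value with the table above to diagnose the failure. */
        fprintf(stderr, "synInitialize failed with status %d\n", (int)status);
        return 1;
    }

    /* ... acquire a device, compile graphs, and launch recipes here,
       checking each returned synStatus in the same way ... */

    status = synDestroy();
    if (status != synSuccess) {
        fprintf(stderr, "synDestroy failed with status %d\n", (int)status);
        return 1;
    }
    return 0;
}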