Debugging Possible Model Errors

Generate Intel Gaudi Logs

If you encounter problems while training a model on an Intel® Gaudi® AI accelerator, generate and inspect log files frequently. The logs help you pinpoint where a model failure occurs, so that you can alter your model or training script to resolve or work around the defect.

Environment variables control whether logging information is generated and where it is written. For example, if you set the following environment variables before training your model, a large amount of information is generated under ~/.habana_logs/:

$ export HABANA_LOGS=~/.habana_logs
$ export LOG_LEVEL_ALL=0
$ # Train your model as usual
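When training is launched from a Python wrapper rather than from a shell, the same two variables can be passed through the child process environment. The following is a minimal sketch, not part of the Intel Gaudi tooling: build_logging_env is a hypothetical helper, and train.py in the usage note stands in for your own training script.

```python
import os


def build_logging_env(log_dir="~/.habana_logs", level=0, base_env=None):
    """Return a copy of the environment with Gaudi logging variables set.

    Hypothetical helper: it only sets HABANA_LOGS and LOG_LEVEL_ALL,
    the two variables used in the shell example above.
    """
    if not isinstance(level, int) or not 0 <= level <= 6:
        raise ValueError("log level must be an integer from 0 (Trace) to 6 (Off)")
    env = dict(os.environ if base_env is None else base_env)
    # The shell expands "~" before export; do the same here explicitly.
    env["HABANA_LOGS"] = os.path.expanduser(log_dir)
    env["LOG_LEVEL_ALL"] = str(level)
    return env
```

For example, subprocess.run([sys.executable, "train.py"], env=build_logging_env(), check=True) would start the script with trace-level logging directed to ~/.habana_logs/.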

The following sections describe these environment variables and their values.

Location of Log Files

ENABLE_CONSOLE=true sends the logs to the console. If ENABLE_CONSOLE is unset or set to any value other than true, logs are written to the directory specified by HABANA_LOGS. For example, if you set the following environment variables, all errors are logged to the console:

$ export ENABLE_CONSOLE=true
$ export LOG_LEVEL_ALL=4
$ # Train your model as usual
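The rule for where logs end up can be expressed as a small check, which is handy when a script needs to know where to look before a run. This is an illustrative sketch only: log_destination is a hypothetical helper, it assumes that only the literal lowercase value true enables console output, and it does not model what happens when HABANA_LOGS is unset.

```python
import os


def log_destination(env=None):
    """Describe where logs go according to the rule stated above.

    Returns "console" when ENABLE_CONSOLE is exactly "true"; otherwise
    returns the expanded HABANA_LOGS path, or None if HABANA_LOGS is
    unset (the default location is not covered by this sketch).
    """
    env = os.environ if env is None else env
    # Assumption: only the literal lowercase value "true" enables console output.
    if env.get("ENABLE_CONSOLE") == "true":
        return "console"
    log_dir = env.get("HABANA_LOGS")
    return os.path.expanduser(log_dir) if log_dir else None
```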

Log Levels

Value  Level     Description
-----  --------  ------------------------------------------------------------------
0      Trace     Log everything including traces of progress.
1      Debug     Log all errors, warnings and all information useful for debugging.
2      Info      Log errors, warnings and some informative messages.
3      Warning   Log all errors and warnings.
4      Error     Log all errors.
5      Critical  Log only critical errors.
6      Off       Log nothing.
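Because lower numbers are more verbose, the level acts as a threshold: a message is kept when its severity is at least the configured level. The sketch below models that reading of the log levels listed above. LogLevel and is_emitted are hypothetical names used for illustration and are not part of the Intel Gaudi software.

```python
from enum import IntEnum


class LogLevel(IntEnum):
    """Numeric log levels as listed above (lower is more verbose)."""
    TRACE = 0
    DEBUG = 1
    INFO = 2
    WARNING = 3
    ERROR = 4
    CRITICAL = 5
    OFF = 6


def is_emitted(message_level, configured_level):
    """Return True if a message of message_level passes configured_level.

    Illustrative model only: a message is kept when its severity is at
    least the configured threshold, and OFF suppresses everything.
    """
    configured = LogLevel(configured_level)
    if configured is LogLevel.OFF:
        return False
    return LogLevel(message_level) >= configured
```

Under this model, is_emitted(LogLevel.ERROR, 3) is true and is_emitted(LogLevel.INFO, 3) is false, which matches the description of level 3 as logging all errors and warnings.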

Component-Level Logs

The value of LOG_LEVEL_ALL=[log level] sets the logging level for all components.

To specify the log level for a particular component, append the name of the component to LOG_LEVEL_.

If you set LOG_LEVEL_ALL=5, all components log only critical errors. If you also set LOG_LEVEL_SYN_API=3, the SYN_API component logs all errors and warnings, while every other component continues to log only critical errors.

For example:

$ export HABANA_LOGS=~/.habana_logs
$ export LOG_LEVEL_ALL=5
$ export LOG_LEVEL_SYN_API=3
$ # Train your model as usual
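The naming rule (LOG_LEVEL_ followed by the component name) lends itself to a small builder when several components need different levels. The following is a sketch under that assumption. component_log_env is a hypothetical helper, and it does not check that the component names are ones the software recognizes.

```python
def component_log_env(default_level, overrides=None):
    """Build LOG_LEVEL_* variables: one default plus per-component overrides."""

    def check(level):
        # Levels 0 (Trace) through 6 (Off) are the documented values.
        if not isinstance(level, int) or not 0 <= level <= 6:
            raise ValueError(f"invalid log level: {level!r}")
        return str(level)

    env = {"LOG_LEVEL_ALL": check(default_level)}
    for component, level in (overrides or {}).items():
        # Append the component name to the LOG_LEVEL_ prefix,
        # for example LOG_LEVEL_SYN_API.
        env[f"LOG_LEVEL_{component}"] = check(level)
    return env
```

Calling component_log_env(5, {"SYN_API": 3}) yields the same two LOG_LEVEL variables as the shell example above. The result can be merged into the environment passed to the training process.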

Names of Components that Produce Logs

Component                        Log component names
-------------------------------  ---------------------------------
Intel Gaudi Software API         SYN_API
Profiling Subsystem              SYN_PROF, PROF_hl[0-7] and HLPROF
Graph Compiler                   PARSER, GC, and GRAPH_DATA
Intel Gaudi Performance Library  PERF_LIB
Habana Communication Library     HCL and HCL_SUBMISSIONS

Generate PyTorch Logs

You can set the following environment variable to obtain Intel Gaudi PyTorch bridge level logs:

$ export LOG_LEVEL_ALL_PT=[log level]

If LOG_LEVEL_ALL_PT is not set, LOG_LEVEL_ALL is used instead.
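The fallback can be checked from a script before a run, for instance to confirm which level the PyTorch bridge is expected to pick up. This is a minimal sketch: effective_pt_log_level is a hypothetical helper, it treats a variable set to an empty string as set, and it does not model what the bridge does when neither variable is present.

```python
import os


def effective_pt_log_level(env=None):
    """Return the level string the PyTorch bridge is expected to use, or None.

    Mirrors the fallback described above: LOG_LEVEL_ALL_PT if set,
    otherwise LOG_LEVEL_ALL. The case where neither is set is not modeled.
    """
    env = os.environ if env is None else env
    return env.get("LOG_LEVEL_ALL_PT", env.get("LOG_LEVEL_ALL"))
```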

Refer to the Runtime Flags section for a full description of the PyTorch environment variables.

Error Codes

When making calls directly to SYN_API, it is useful to check the return codes against the following symbolic or integer values to understand the outcome of the operation.

Symbolic name                    Value  Description
-------------------------------  -----  -------------------------------------------------------------------------------------
synSuccess                       0      The operation succeeded.
synInvalidArgument               1      An argument was invalid.
synCbFull                        2      The command buffer is full.
synOutOfHostMemory               3      Out of host memory.
synOutOfDeviceMemory             4      Out of device memory.
synObjectAlreadyInitialized      5      The object being initialized is already initialized.
synObjectNotInitialized          6      The object must be initialized before the operation can be performed.
synCommandSubmissionFailure      7      The command buffer could not be submitted.
synNoDeviceFound                 8      No Intel Gaudi device was found.
synDeviceTypeMismatch            9      The operation is for the wrong device type.
synFailedToInitializeCb          10     The command buffer failed to initialize.
synFailedToFreeCb                11     The command buffer could not be freed.
synFailedToMapCb                 12     The command buffer could not be mapped.
synFailedToUnmapCb               13     The command buffer could not be unmapped.
synFailedToAllocateDeviceMemory  14     Device memory could not be allocated.
synFailedToFreeDeviceMemory      15     Device memory could not be freed.
synFailedNotEnoughDevicesFound   16     A free device could not be found.
synDeviceReset                   17     The operation failed because the device is being reset.
synUnsupported                   18     The requested operation is not supported.
synWrongParamsFile               19     While loading a recipe, the binary parameters file failed to load.
synDeviceAlreadyAcquired         20     The referenced device is already occupied.
synNameIsAlreadyUsed             21     A tensor with the same name has already been created.
synBusy                          22     The operation failed to complete within the timeout period.
synAllResourcesTaken             23     The event could not be created due to lack of resources.
synUnavailable                   24     The time an event finished could not be retrieved.
synInvalidTensorDimensions       25     High-rank tensor is attached to a node that does not support it.
synFail                          26     The operation failed.
synOutOfResources                27     The operation failed due to lack of software memory.
synUninitialized                 28     Intel Gaudi software library was not initialized before accessing it.
synAlreadyInitialized            29     The initialization failed because it was already initialized.
synFailedSectionValidation       30     The Launch operation failed because of section mismatch.
synSynapseTerminated             31     Intel Gaudi software cannot process the operation since it is going to be terminated.
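When an integer status shows up in a log line or is returned through a binding, translating it back to its symbolic name speeds up diagnosis. The sketch below is a hypothetical Python mirror of the error codes listed above. SynStatus and describe_status are illustrative names and are not provided by the Intel Gaudi software.

```python
from enum import IntEnum


class SynStatus(IntEnum):
    """Hypothetical Python mirror of the error codes listed above."""
    synSuccess = 0
    synInvalidArgument = 1
    synCbFull = 2
    synOutOfHostMemory = 3
    synOutOfDeviceMemory = 4
    synObjectAlreadyInitialized = 5
    synObjectNotInitialized = 6
    synCommandSubmissionFailure = 7
    synNoDeviceFound = 8
    synDeviceTypeMismatch = 9
    synFailedToInitializeCb = 10
    synFailedToFreeCb = 11
    synFailedToMapCb = 12
    synFailedToUnmapCb = 13
    synFailedToAllocateDeviceMemory = 14
    synFailedToFreeDeviceMemory = 15
    synFailedNotEnoughDevicesFound = 16
    synDeviceReset = 17
    synUnsupported = 18
    synWrongParamsFile = 19
    synDeviceAlreadyAcquired = 20
    synNameIsAlreadyUsed = 21
    synBusy = 22
    synAllResourcesTaken = 23
    synUnavailable = 24
    synInvalidTensorDimensions = 25
    synFail = 26
    synOutOfResources = 27
    synUninitialized = 28
    synAlreadyInitialized = 29
    synFailedSectionValidation = 30
    synSynapseTerminated = 31


def describe_status(code):
    """Return 'name (value)' for a known status code, else a fallback string."""
    try:
        return f"{SynStatus(code).name} ({int(code)})"
    except ValueError:
        # Codes outside the listed range are reported rather than raised.
        return f"unknown status ({code})"
```

For example, describe_status(4) returns "synOutOfDeviceMemory (4)", which points directly at the matching entry in the list above.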