Debugging Possible Model Errors
Generate Logs¶
If you encounter problems while training a model on Gaudi, it is frequently useful to generate and inspect your log files. By inspecting log files, you can pinpoint where a model failure is occurring, and alter your model or training script to resolve or work around defects.
The generation of logging information and the location of logged information are
controlled by environment variables. For example, if you set the following
environment variables before training your model, a large amount of information
will be generated under ~/.habana_logs/:
$ export HABANA_LOGS=~/.habana_logs
$ export LOG_LEVEL_ALL=0
$ # Train your model as usual
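Once a run has finished, one way to narrow down which log files deserve a closer look is to search the log directory for error text. The following is a minimal sketch and not part of SynapseAI: find_error_logs is a hypothetical helper, and the case-insensitive search for the word "error" is an assumption, since the names and line format of the log files are not described here.

```shell
# Hypothetical helper (not a SynapseAI tool): print the names of files under
# the log directory that contain the word "error" in any letter case.
# The search pattern is an assumption; adapt it to what your logs contain.
find_error_logs() {
    # Use the first argument, else $HABANA_LOGS, else the directory used above.
    log_dir="${1:-${HABANA_LOGS:-$HOME/.habana_logs}}"
    # -r: recurse into subdirectories, -i: ignore case, -l: file names only.
    # grep exits non-zero when nothing matches, so "|| true" keeps the helper
    # usable in scripts that run with "set -e".
    grep -ril "error" "$log_dir" 2>/dev/null || true
}
```

Calling find_error_logs with no argument searches the directory named by HABANA_LOGS, so it can be run directly after training.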
The following sections describe these environment variables and their values.
Location of Log Files¶
ENABLE_CONSOLE=true outputs the logs to the console. If ENABLE_CONSOLE is not set at all or not set to true, logs are output in the directory specified by HABANA_LOGS. For example, if you set the following environment variables, all SynapseAI errors will be logged to the console:
$ export ENABLE_CONSOLE=true
$ export LOG_LEVEL_ALL=4
$ # Train your model as usual
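Console output is lost when the terminal closes, so it can help to keep a copy in a file while still watching it live. The sketch below is one way to do that with the standard tee utility. run_with_console_logs is a hypothetical helper name, and merging stderr into stdout is an assumption, because this page does not say which stream the console logs use.

```shell
# Hypothetical wrapper: run a command with console logging enabled and keep
# a copy of everything it prints in a file.
run_with_console_logs() {
    log_file="$1"
    shift
    # The subshell keeps the exported variables from leaking into the caller.
    # 2>&1 merges stderr into stdout because this sketch does not assume
    # which stream the logs are written to.
    (
        export ENABLE_CONSOLE=true
        export LOG_LEVEL_ALL=4
        "$@"
    ) 2>&1 | tee "$log_file"
}
```

For example, run_with_console_logs console.log python train.py, where train.py stands in for your own training script. Note that the function returns the exit status of tee, not of the training command.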
Log Levels¶
| Value | Level | Description |
| --- | --- | --- |
| 0 | Trace | Log everything including traces of progress |
| 1 | Debug | Log all errors, warnings and all information useful for debugging |
| 2 | Info | Log errors, warnings and some informative messages |
| 3 | Warning | Log all errors and warnings |
| 4 | Error | Log all errors |
| 5 | Critical | Log only critical errors |
| 6 | Off | Log nothing |
Component-Level Logs¶
The value of LOG_LEVEL_ALL=[log level] sets the logging level for all
components. However, it is sometimes useful to view detailed information for
a single component.
To specify the log level for a particular component, append the name of the
component to LOG_LEVEL_.
For example, if you set the following environment variables, all
components will log only critical errors (set with LOG_LEVEL_ALL=5) except for the Synapse API (set with LOG_LEVEL_SYN_API=3),
which will log all errors and warnings:
$ export HABANA_LOGS=~/.habana_logs
$ export LOG_LEVEL_ALL=5
$ export LOG_LEVEL_SYN_API=3
$ # Train your model as usual
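Numeric levels are easy to mix up, so a small wrapper that accepts the level names listed under Log Levels can make scripts more readable. This is a minimal sketch: set_log_level is a hypothetical helper, not something SynapseAI provides, and it only builds the LOG_LEVEL_ variable name as described above.

```shell
# Hypothetical helper: export LOG_LEVEL_<COMPONENT> using a level name from
# the Log Levels section instead of its number.
set_log_level() {
    component="$1"   # e.g. ALL, SYN_API, GC
    case "$2" in
        trace)    level=0 ;;
        debug)    level=1 ;;
        info)     level=2 ;;
        warning)  level=3 ;;
        error)    level=4 ;;
        critical) level=5 ;;
        off)      level=6 ;;
        *) echo "unknown log level: $2" >&2; return 1 ;;
    esac
    # Append the component name to LOG_LEVEL_ to form the variable name.
    export "LOG_LEVEL_${component}=${level}"
}

# Same configuration as the example above.
set_log_level ALL critical
set_log_level SYN_API warning
```

The helper does not check the component name, so a misspelled component produces a variable that is silently ignored.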
Names of Components that Produce Logs¶
| Component | Log names |
| --- | --- |
| The Synapse API | SYN_API |
| The profiling subsystem | SYN_PROF, PROF_hl[0-7] and HLPROF |
| The graph compiler | PARSER, GC, and GRAPH_DATA |
| The Habana performance library | PERF_LIB |
| The Habana Communication Library | HCL and HCL_SUBMISSIONS |
Generate PyTorch Logs¶
You can set the following environment variable to obtain PyTorch Habana Bridge level logs:
$ export LOG_LEVEL_ALL_PT=[log level]
If LOG_LEVEL_ALL_PT is not set, LOG_LEVEL_ALL will be used instead.
Please refer to the Runtime Flags section for a description of the above environment variables.
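The fallback described above can be mirrored in a launch script, for example to record which level the bridge is expected to use for a given run. The sketch below is illustrative only: effective_pt_log_level is a hypothetical helper, and treating an empty LOG_LEVEL_ALL_PT as "set" is an assumption, since this page does not say how the bridge handles an empty value.

```shell
# Sketch of the documented fallback: use LOG_LEVEL_ALL_PT when it is set,
# otherwise fall back to LOG_LEVEL_ALL.
effective_pt_log_level() {
    # ${VAR-default} substitutes the default only when VAR is unset.
    # Treating an empty value as "set" is an assumption of this sketch.
    echo "${LOG_LEVEL_ALL_PT-${LOG_LEVEL_ALL-}}"
}
```

If neither variable is set, the helper prints an empty line; it does not guess a default level.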
Error Codes¶
When making calls directly to the SynapseAI API, it is useful to check the return codes against the following symbolic or integer values to understand the outcome of the operation.
| Return code | Value | Description |
| --- | --- | --- |
| synSuccess | 0 | The operation succeeded |
| synInvalidArgument | 1 | An argument was invalid |
| synCbFull | 2 | The command buffer is full |
| synOutOfHostMemory | 3 | Out of host memory |
| synOutOfDeviceMemory | 4 | Out of device memory |
| synObjectAlreadyInitialized | 5 | The object being initialized is already initialized |
| synObjectNotInitialized | 6 | The object must be initialized before the operation can be performed |
| synCommandSubmissionFailure | 7 | The command buffer could not be submitted |
| synNoDeviceFound | 8 | No Habana device was found |
| synDeviceTypeMismatch | 9 | The operation is for the wrong device type |
| synFailedToInitializeCb | 10 | The command buffer failed to initialize |
| synFailedToFreeCb | 11 | The command buffer could not be freed |
| synFailedToMapCb | 12 | The command buffer could not be mapped |
| synFailedToUnmapCb | 13 | The command buffer could not be unmapped |
| synFailedToAllocateDeviceMemory | 14 | Device memory could not be allocated |
| synFailedToFreeDeviceMemory | 15 | Device memory could not be freed |
| synFailedNotEnoughDevicesFound | 16 | A free device could not be found |
| synDeviceReset | 17 | The operation failed because the device is being reset |
| synUnsupported | 18 | The requested operation is not supported |
| synWrongParamsFile | 19 | While loading a recipe, the binary parameters file failed to load |
| synDeviceAlreadyAcquired | 20 | The referenced device is already occupied |
| synNameIsAlreadyUsed | 21 | A tensor with the same name has already been created |
| synBusy | 22 | The operation failed to complete within the timeout period |
| synAllResourcesTaken | 23 | The event could not be created due to lack of resources |
| synUnavailable | 24 | The time an event finished could not be retrieved |
| synInvalidTensorDimensions | 25 | High-rank tensor is attached to a node that does not support it |
| synFail | 26 | The operation failed |
| synOutOfResources | 27 | The operation failed due to lack of SynapseAI memory |
| synUninitialized | 28 | SynapseAI Library was not initialized before accessing it |
| synAlreadyInitialized | 29 | SynapseAI initialize failed because it was already initialized |
| synFailedSectionValidation | 30 | The Launch operation failed because of section mismatch |
| synSynapseTerminated | 31 | SynapseAI cannot process the operation since it is going to be terminated |
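When a tool or log reports only the integer value, a small lookup can translate it back to the symbolic name. The following is a sketch built solely from the values listed above. syn_status_name is a hypothetical helper, not part of the SynapseAI API, and the mapping should be checked against the headers of the SynapseAI version you have installed.

```shell
# Hypothetical helper: translate an integer SynapseAI return code into its
# symbolic name, using the values listed in this section.
syn_status_name() {
    case "$1" in
        0)  echo synSuccess ;;
        1)  echo synInvalidArgument ;;
        2)  echo synCbFull ;;
        3)  echo synOutOfHostMemory ;;
        4)  echo synOutOfDeviceMemory ;;
        5)  echo synObjectAlreadyInitialized ;;
        6)  echo synObjectNotInitialized ;;
        7)  echo synCommandSubmissionFailure ;;
        8)  echo synNoDeviceFound ;;
        9)  echo synDeviceTypeMismatch ;;
        10) echo synFailedToInitializeCb ;;
        11) echo synFailedToFreeCb ;;
        12) echo synFailedToMapCb ;;
        13) echo synFailedToUnmapCb ;;
        14) echo synFailedToAllocateDeviceMemory ;;
        15) echo synFailedToFreeDeviceMemory ;;
        16) echo synFailedNotEnoughDevicesFound ;;
        17) echo synDeviceReset ;;
        18) echo synUnsupported ;;
        19) echo synWrongParamsFile ;;
        20) echo synDeviceAlreadyAcquired ;;
        21) echo synNameIsAlreadyUsed ;;
        22) echo synBusy ;;
        23) echo synAllResourcesTaken ;;
        24) echo synUnavailable ;;
        25) echo synInvalidTensorDimensions ;;
        26) echo synFail ;;
        27) echo synOutOfResources ;;
        28) echo synUninitialized ;;
        29) echo synAlreadyInitialized ;;
        30) echo synFailedSectionValidation ;;
        31) echo synSynapseTerminated ;;
        # Anything else is not listed in this section.
        *)  echo "unknown ($1)"; return 1 ;;
    esac
}
```

For example, syn_status_name 4 prints synOutOfDeviceMemory. Values outside the listed range print "unknown" and return a non-zero status so that scripts can detect them.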