Debugging with Intel Gaudi Logs
If you encounter problems while training a model on the Intel® Gaudi® AI accelerator, it is useful to frequently generate and inspect your log files. By inspecting the logs, you can pinpoint where a model failure occurs and alter your model or training script to resolve or work around the defect.
Generating Intel Gaudi Logs
Generate Intel Gaudi logs by following the simple steps below. If you want to report a model error, add the generated log files to a tar file and share it with Intel Gaudi support.
The generation of logging information and the location of logged information is controlled by environment variables. For the various environment variables and their values description, refer to Using Logs Environment Variables.
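For example, a minimal sketch of packaging the logs for a support ticket, assuming the default bare-metal location (adjust the path if HABANA_LOGS points elsewhere):

$ # Package the entire logs directory into a single compressed tar file
$ tar -czf habana_logs.tar.gz -C ~ .habana_logs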
Generating Logs on Single-Server Setup
1. Set the logging location by using the HABANA_LOGS environment variable. When running from bare metal, the default logs location is ~/.habana_logs/. When running inside a container, the default logs location is /var/log/habana_logs/. If you want to output the logs to the console, use ENABLE_CONSOLE=true. Refer to Runtime Flags for more details.
2. Set the logging level and component level. Refer to Using Log Levels and Using Component-level Logs for more details.
3. Run the workload.
$ export HABANA_LOGS=~/.habana_logs
$ export LOG_LEVEL_ALL=0
$ # Train your model as usual
If you are generating logs inside a container, use the example below:
$ export HABANA_LOGS=~/.habana_logs
$ echo $HABANA_LOGS
/root/.habana_logs/   # Logs are saved under the home directory
$ export LOG_LEVEL_ALL=0
$ # Train your model as usual
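If you prefer to stream logs to the terminal instead of writing files, a minimal sketch using the ENABLE_CONSOLE flag mentioned in step 1 (see Runtime Flags for its full behavior):

$ export ENABLE_CONSOLE=true
$ export LOG_LEVEL_ALL=3    # warnings and errors only, to keep console output manageable
$ # Train your model as usual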
Output example when running on one card:
$ ls -la ~/.habana_logs/
total 1808980
-rw-r--r-- 1 root root 3772 Jul 7 01:22 gcfg_log.txt
-rw-r--r-- 1 root root 3636387 Jul 7 01:22 scal_log.txt
-rw-r--r-- 1 root root 50180166 Jul 7 01:22 synapse_runtime.log
-rw-r--r-- 1 root root 191689 Jul 7 01:22 synapse_utils_log.txt
-rw-r--r-- 1 root root 4190520 Jul 7 01:22 pytorch_log.txt
-rw-r--r-- 1 root root 4190 Jul 7 01:22 shim_core_log.txt
-rw-r--r-- 1 root root 4517508 Jul 7 01:22 shim_plugins_log.txt
-rw-r--r-- 1 root root 1584609719 Jul 7 01:22 graph_compiler.log
-rw-r--r-- 1 root root 20903 Jul 7 01:22 perf_measure.log
-rw-r--r-- 1 root root 1273330 Jul 7 01:22 synapse_log.txt
-rw-r--r-- 1 root root 667814 Jul 7 01:22 hcl.log
-rw-r--r-- 1 root root 5051346 Jul 7 01:22 pytorch_log.1.txt
-rw-r--r-- 1 root root 5051346 Jul 7 01:22 pytorch_log.2.txt
-rw-r--r-- 1 root root 5051346 Jul 7 01:22 pytorch_log.3.txt
-rw-r--r-- 1 root root 5051345 Jul 7 01:22 pytorch_log.4.txt
-rw-r--r-- 1 root root 5051344 Jul 7 01:22 pytorch_log.5.txt
-rw-r--r-- 1 root root 104857588 Jul 7 01:22 synapse_runtime.1.log
-rw-r--r-- 1 root root 35011 Jul 7 01:22 recipe_stats.log
-rw-r--r-- 1 root root 2901887 Jul 7 01:22 perf_lib_log.txt
-rw-r--r-- 1 root root 267494 Jul 7 01:22 fuser_lib_log.txt
-rw-r--r-- 1 root root 804231 Jul 7 01:22 perf_lib_suggest_log.txt
-rw-r--r-- 1 root root 186550 Jul 7 01:22 complex_guid_log.txt
-rw-r--r-- 1 root root 10485737 Jul 7 01:22 synapse_log.1.txt
-rw-r--r-- 1 root root 1048540 Jul 7 01:22 complex_guid_log.1.txt
-rw-r--r-- 1 root root 10485744 Jul 7 01:22 synapse_log.2.txt
-rw-r--r-- 1 root root 10485630 Jul 7 01:22 synapse_log.3.txt
-rw-r--r-- 1 root root 10485606 Jul 7 01:22 synapse_log.4.txt
-rw-r--r-- 1 root root 10485614 Jul 7 01:22 synapse_log.5.txt
-rw-r--r-- 1 root root 9998897 Jul 7 01:22 perf_lib_log.1.txt
-rw-r--r-- 1 root root 1048506 Jul 7 01:22 fuser_lib_log.1.txt
-rw-r--r-- 1 root root 1048563 Jul 7 01:22 complex_guid_log.2.txt
-rw-r--r-- 1 root root 1048513 Jul 7 01:22 complex_guid_log.3.txt
-rw-r--r-- 1 root root 1048510 Jul 7 01:22 complex_guid_log.4.txt
-rw-r--r-- 1 root root 1048535 Jul 7 01:22 complex_guid_log.5.txt
-rw-r--r-- 1 root root 238 Jul 7 01:22 synprof_parser_log.txt
-rw-r--r-- 1 root root 89 Jul 7 01:22 event_triggered_logger.log
-rw-r--r-- 1 root root 312 Jul 7 01:22 shared_lib_log.txt
-rw-r--r-- 1 root root 0 Jul 7 01:22 perf_lib_sif_log.txt
-rw-r--r-- 1 root root 0 Jul 7 01:21 graph_compiler_perf.log
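To watch a specific log while the workload runs, you can tail it from another terminal. A sketch using one of the file names from the listing above:

$ # Follow the runtime log as new entries are written
$ tail -f ~/.habana_logs/synapse_runtime.log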
In a multi-card case, logs for each card are saved in a separate numbered subdirectory under the habana_logs directory. Output example when running on two cards:
$ ls -la ~/.habana_logs/0/
total 16
-rw-r--r-- 1 root root 0 Jul 7 02:28 graph_compiler.log
-rw-r--r-- 1 root root 0 Jul 7 02:28 graph_compiler_perf.log
-rw-r--r-- 1 root root 70 Jul 7 02:28 hcl.log
-rw-r--r-- 1 root root 203 Jul 7 02:29 hcl_coordinator.log
-rw-r--r-- 1 root root 514 Jul 7 02:28 scal_log.txt
-rw-r--r-- 1 root root 0 Jul 7 02:28 shared_lib_log.txt
-rw-r--r-- 1 root root 0 Jul 7 02:28 shim_core_log.txt
-rw-r--r-- 1 root root 0 Jul 7 02:28 shim_plugins_log.txt
-rw-r--r-- 1 root root 0 Jul 7 02:28 synapse_log.txt
-rw-r--r-- 1 root root 1125 Jul 7 02:28 synapse_runtime.log
-rw-r--r-- 1 root root 0 Jul 7 02:28 synapse_utils_log.txt
$ ls -la ~/.habana_logs/1/
total 12
-rw-r--r-- 1 root root 0 Jul 7 02:28 graph_compiler.log
-rw-r--r-- 1 root root 0 Jul 7 02:28 graph_compiler_perf.log
-rw-r--r-- 1 root root 70 Jul 7 02:28 hcl.log
-rw-r--r-- 1 root root 514 Jul 7 02:28 scal_log.txt
-rw-r--r-- 1 root root 0 Jul 7 02:28 shared_lib_log.txt
-rw-r--r-- 1 root root 0 Jul 7 02:28 shim_core_log.txt
-rw-r--r-- 1 root root 0 Jul 7 02:28 shim_plugins_log.txt
-rw-r--r-- 1 root root 0 Jul 7 02:28 synapse_log.txt
-rw-r--r-- 1 root root 1125 Jul 7 02:28 synapse_runtime.log
-rw-r--r-- 1 root root 0 Jul 7 02:28 synapse_utils_log.txt
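A quick way to check whether any card reported problems is to search all per-card subdirectories at once, for example:

$ # List log files under each card's subdirectory that mention an error
$ grep -ril "error" ~/.habana_logs/*/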
Generating Logs on Multi-Server Setup
When running the workloads on multiple nodes, some files might be overwritten by different nodes when they share the home file system. Therefore, it is recommended to set the logging location to a node’s local disk as shown in the example below:
$ export HABANA_LOGS=<path-to-node-local-disk>/.habana_logs
$ export LOG_LEVEL_ALL=0
$ # Train your model as usual
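After the run, you can copy each node's local logs back to a single location for inspection. A sketch, assuming hypothetical host names node-0 and node-1 and the local-disk path used above:

$ # Hypothetical host names; adjust to your cluster
$ for host in node-0 node-1; do scp -r "${host}:<path-to-node-local-disk>/.habana_logs" "./habana_logs_${host}"; done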
Using Logs Environment Variables
The following sections provide details on using Intel Gaudi logs environment variables. These variables can be used to configure the logging information and set its output location.
Using Log Levels

You can generate Intel Gaudi logs for different system levels and components by using the LOG_LEVEL_ALL environment variable. Setting LOG_LEVEL_ALL=[log level] applies the logging level to all components, including the Intel Gaudi PyTorch bridge.
The log levels are outlined in the table below:

Value | Level    | Description
----- | -------- | -------------------------------------------------------------------
0     | Trace    | Log everything including traces of progress.
1     | Debug    | Log all errors, warnings and all information useful for debugging.
2     | Info     | Log errors, warnings and some informative messages.
3     | Warning  | Log all errors and warnings.
4     | Error    | Log all errors.
5     | Critical | Log only critical errors.
6     | Off      | Log nothing.
Example:
$ export HABANA_LOGS=~/.habana_logs
$ export LOG_LEVEL_ALL=5 # all components log only critical errors
$ # Train your model as usual
Using Component-level Logs

The value of LOG_LEVEL_[component]=[log level] sets the logging level for a specific component.

The component-level logs are outlined in the table below:

Component                       | [component]
------------------------------- | -----------
Intel Gaudi Software API        | SYN_API
Profiling Subsystem             |
Graph Compiler                  |
Intel Gaudi Performance Library | PERF_LIB
Habana Communication Library    | HCL
Example:
$ export HABANA_LOGS=~/.habana_logs
$ export LOG_LEVEL_ALL=5 # all components log only critical errors
$ export LOG_LEVEL_SYN_API=3 # all errors and warnings are logged for SYN_API
$ # Train your model as usual
Generating PyTorch Logs

You can set the LOG_LEVEL_ALL_PT environment variable to obtain Intel Gaudi PyTorch bridge logs. If LOG_LEVEL_ALL_PT is not set, LOG_LEVEL_ALL is used instead.
Example:
$ export HABANA_LOGS=~/.habana_logs
$ export LOG_LEVEL_ALL_PT=[log level]
$ # Train your model as usual
Refer to the Runtime Flags section for a full description of the PyTorch environment variables.
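For instance, to collect debug-level logs from the PyTorch bridge while keeping all other components at warnings and above (levels taken from the table in Using Log Levels):

$ export HABANA_LOGS=~/.habana_logs
$ export LOG_LEVEL_ALL=3       # warnings and errors for all components
$ export LOG_LEVEL_ALL_PT=1    # debug-level logs for the PyTorch bridge
$ # Train your model as usual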
SYN_API Error Codes

When making calls directly to SYN_API, it is useful to check the return codes against the following symbolic or integer values to understand the outcome of the operation.
Status                           | Value | Description
-------------------------------- | ----- | --------------------------------------------------------------------------------------------
synSuccess                       | 0     | The operation succeeded.
synInvalidArgument               | 1     | An argument was invalid.
synCbFull                        | 2     | The command buffer is full.
synOutOfHostMemory               | 3     | Out of host memory.
synOutOfDeviceMemory             | 4     | Out of device memory.
synObjectAlreadyInitialized      | 5     | The object being initialized is already initialized.
synObjectNotInitialized          | 6     | The object must be initialized before the operation can be performed.
synCommandSubmissionFailure      | 7     | The command buffer could not be submitted.
synNoDeviceFound                 | 8     | No Intel Gaudi device was found.
synDeviceTypeMismatch            | 9     | The operation is for the wrong device type.
synFailedToInitializeCb          | 10    | The command buffer failed to initialize.
synFailedToFreeCb                | 11    | The command buffer could not be freed.
synFailedToMapCb                 | 12    | The command buffer could not be mapped.
synFailedToUnmapCb               | 13    | The command buffer could not be unmapped.
synFailedToAllocateDeviceMemory  | 14    | Device memory could not be allocated.
synFailedToFreeDeviceMemory      | 15    | Device memory could not be freed.
synFailedNotEnoughDevicesFound   | 16    | A free device could not be found.
synDeviceReset                   | 17    | The operation failed because the device is being reset.
synUnsupported                   | 18    | The requested operation is not supported.
synWrongParamsFile               | 19    | While loading a recipe, the binary parameters file failed to load.
synDeviceAlreadyAcquired         | 20    | The referenced device is already occupied.
synNameIsAlreadyUsed             | 21    | A tensor with the same name has already been created.
synBusy                          | 22    | The operation failed to complete within the timeout period.
synAllResourcesTaken             | 23    | The event could not be created due to lack of resources.
synUnavailable                   | 24    | The time an event finished could not be retrieved.
synInvalidTensorDimensions       | 25    | A high-rank tensor is attached to a node that does not support it.
synFail                          | 26    | The operation failed.
synOutOfResources                | 27    | The operation failed due to lack of software memory.
synUninitialized                 | 28    | The Intel Gaudi software library was not initialized before being accessed.
synAlreadyInitialized            | 29    | The initialization failed because it was already initialized.
synFailedSectionValidation       | 30    | The Launch operation failed because of a section mismatch.
synSynapseTerminated             | 31    | The Intel Gaudi software cannot process the operation because it is about to be terminated.
synAssertAsync                   | 32    | The operation failed due to an assert-async event.
synInvalidEventHandle            | 33    | Invalid event handle.
synMappingNotFound               | 34    | MMU mapping for the given address is missing.
synFailedDynamicPatching         | 35    | Dynamic patching failed.
synFailedStaticPatching          | 36    | Static patching failed.
synFailedToSubmitWorkload        | 37    | The workload could not be submitted.
synInvalidSectionsDefinition     | 38    | Invalid tensor addresses or missing section addresses.
synInvalidTensorProperties       | 39    | Invalid tensor properties.
synFailHccl                      | 40    | HCCL failed.
synFailedToCollectTime           | 41    | Time collection failed.
synTimeout                       | 42    | The operation failed due to a driver timeout.
synResourceBadUsage              | 43    | The resource is not used properly.