Debugging Possible Model Errors

Generate Logs

If you encounter problems while training a model on the Intel® Gaudi® AI accelerator, it is often useful to generate and inspect log files. The logs can help you pinpoint where a model failure occurs so that you can alter your model or training script to resolve or work around the defect.

The generation of logging information and the location of the log files are controlled by environment variables. For example, if you set the following environment variables before training your model, a large amount of information will be generated under ~/.habana_logs/:

$ export HABANA_LOGS=~/.habana_logs
$ export LOG_LEVEL_ALL=0
$ # Train your model as usual

The sections below detail the relevant environment variables and describe their values.

Location of Log Files

Setting ENABLE_CONSOLE=true outputs the logs to the console. If ENABLE_CONSOLE is unset or set to any value other than true, logs are written to the directory specified by HABANA_LOGS. For example, if you set the following environment variables, all errors will be logged to the console:

$ export ENABLE_CONSOLE=true
$ export LOG_LEVEL_ALL=4
$ # Train your model as usual

Log Levels

Value  Level     Description
0      Trace     Log everything including traces of progress
1      Debug     Log all errors, warnings and all information useful for debugging
2      Info      Log errors, warnings and some informative messages
3      Warning   Log all errors and warnings
4      Error     Log all errors
5      Critical  Log only critical errors
6      Off       Log nothing

Component-Level Logs

The value of LOG_LEVEL_ALL=[log level] sets the logging level for all components. However, it is sometimes useful to view detailed information for a single component.

To specify the log level for a particular component, append the name of the component to LOG_LEVEL_.

For example, if you set the following environment variable, all components will log only critical errors (set with LOG_LEVEL_ALL=5) except for SYN_API (set with LOG_LEVEL_SYN_API=3), which will log all errors and warnings:

$ export HABANA_LOGS=~/.habana_logs
$ export LOG_LEVEL_ALL=5
$ export LOG_LEVEL_SYN_API=3
$ # Train your model as usual

Names of Components that Produce Logs

Component                        Log Component Names
Intel Gaudi Software API         SYN_API
Profiling Subsystem              SYN_PROF, PROF_hl[0-7] and HLPROF
Graph Compiler                   PARSER, GC, and GRAPH_DATA
Intel Gaudi Performance Library  PERF_LIB
Habana Communication Library     HCL and HCL_SUBMISSIONS

Generate PyTorch Logs

You can set the following environment variable to obtain Intel Gaudi PyTorch Bridge level logs:

$ export LOG_LEVEL_ALL_PT=[log level]

If LOG_LEVEL_ALL_PT is not set, the value of LOG_LEVEL_ALL is used instead.

Please refer to the Runtime Flags section for a description of the above environment variables.

Error Codes

When making calls directly to SYN_API, it is useful to check the return codes against the following symbolic or integer values to understand the outcome of the operation.

Return Code                      Value  Description
synSuccess                       0      The operation succeeded
synInvalidArgument               1      An argument was invalid
synCbFull                        2      The command buffer is full
synOutOfHostMemory               3      Out of host memory
synOutOfDeviceMemory             4      Out of device memory
synObjectAlreadyInitialized      5      The object being initialized is already initialized
synObjectNotInitialized          6      The object must be initialized before the operation can be performed
synCommandSubmissionFailure      7      The command buffer could not be submitted
synNoDeviceFound                 8      No Intel Gaudi device was found
synDeviceTypeMismatch            9      The operation is for the wrong device type
synFailedToInitializeCb          10     The command buffer failed to initialize
synFailedToFreeCb                11     The command buffer could not be freed
synFailedToMapCb                 12     The command buffer could not be mapped
synFailedToUnmapCb               13     The command buffer could not be unmapped
synFailedToAllocateDeviceMemory  14     Device memory could not be allocated
synFailedToFreeDeviceMemory      15     Device memory could not be freed
synFailedNotEnoughDevicesFound   16     A free device could not be found
synDeviceReset                   17     The operation failed because the device is being reset
synUnsupported                   18     The requested operation is not supported
synWrongParamsFile               19     While loading a recipe, the binary parameters file failed to load
synDeviceAlreadyAcquired         20     The referenced device is already occupied
synNameIsAlreadyUsed             21     A tensor with the same name has already been created
synBusy                          22     The operation failed to complete within the timeout period
synAllResourcesTaken             23     The event could not be created due to lack of resources
synUnavailable                   24     The time at which an event finished could not be retrieved
synInvalidTensorDimensions       25     A high-rank tensor is attached to a node that does not support it
synFail                          26     The operation failed
synOutOfResources                27     The operation failed due to lack of software memory
synUninitialized                 28     The Intel Gaudi software library was not initialized before being accessed
synAlreadyInitialized            29     Initialization failed because it had already been performed
synFailedSectionValidation       30     The launch operation failed because of a section mismatch
synSynapseTerminated             31     Intel Gaudi software cannot process the operation because it is being terminated
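
For example, a minimal sketch of this return-code-checking pattern is shown below. It assumes the synStatus return type and the synInitialize() and synDestroy() entry points declared in synapse_api.h; verify the exact names and signatures against the headers shipped with your Intel Gaudi software installation.

#include <stdio.h>
#include <synapse_api.h>

int main(void)
{
    /* Check every SYN_API return code against the symbolic values above. */
    synStatus status = synInitialize();
    if (status != synSuccess) {
        /* Compare the numeric value with the table above to diagnose the failure. */
        fprintf(stderr, "synInitialize failed with status %d\n", (int)status);
        return 1;
    }

    /* ... acquire a device, compile graphs, and launch recipes here,
       checking each returned synStatus in the same way ... */

    status = synDestroy();
    if (status != synSuccess) {
        fprintf(stderr, "synDestroy failed with status %d\n", (int)status);
        return 1;
    }
    return 0;
}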