Debugging with Intel Gaudi Logs

If you encounter problems while training a model on the Intel® Gaudi® AI accelerator, it is useful to generate and inspect your log files regularly. By inspecting the log files, you can pinpoint where a model failure occurs and alter your model or training script to resolve or work around the defect.

Generating Intel Gaudi Logs

Generate Intel Gaudi logs by following the simple steps below. If you want to report a model error, add the generated log files to a tar file and share it with Intel Gaudi support.
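
For example, one simple way to package the default log directory for sharing (assuming the logs were written to ~/.habana_logs, as in the examples below):

$ tar -czf habana_logs.tar.gz -C ~ .habana_logs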

Log generation and the location of the log files are controlled by environment variables. For a description of these environment variables and their values, refer to Using Logs Environment Variables.

Generating Logs on Single-Server Setup

  1. Set the logging location by using the HABANA_LOGS environment variable. When running on bare metal, the default logs location is ~/.habana_logs/. When running inside a container, the default logs location is /var/log/habana_logs/. If you want to output the logs to the console instead, set ENABLE_CONSOLE=true (a console example follows the steps below). Refer to Runtime Flags for more details.

  2. Set the logging level and, if needed, component-specific logging levels. Refer to Using Log Levels and Using Component-level Logs for more details.

  3. Run the workload.

    $ export HABANA_LOGS=~/.habana_logs
    $ export LOG_LEVEL_ALL=0
    $ # Train your model as usual
    

    If you are generating logs inside a container, use the example below:

    $ export HABANA_LOGS=~/.habana_logs
    $ echo $HABANA_LOGS
    /root/.habana_logs/  # logs are saved under the home directory
    $ export LOG_LEVEL_ALL=0
    $ # Train your model as usual
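
If you prefer to stream the logs to the console instead of writing them to files (see step 1), a minimal sketch is shown below. ENABLE_CONSOLE is described in Runtime Flags:

$ export ENABLE_CONSOLE=true   # print logs to the console instead of log files
$ export LOG_LEVEL_ALL=3       # log errors and warnings only
$ # Train your model as usual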
    

Output example when running on one card:

$ ls -la ~/.habana_logs/
total 1808980
-rw-r--r-- 1 root root       3772 Jul  7 01:22 gcfg_log.txt
-rw-r--r-- 1 root root    3636387 Jul  7 01:22 scal_log.txt
-rw-r--r-- 1 root root   50180166 Jul  7 01:22 synapse_runtime.log
-rw-r--r-- 1 root root     191689 Jul  7 01:22 synapse_utils_log.txt
-rw-r--r-- 1 root root    4190520 Jul  7 01:22 pytorch_log.txt
-rw-r--r-- 1 root root       4190 Jul  7 01:22 shim_core_log.txt
-rw-r--r-- 1 root root    4517508 Jul  7 01:22 shim_plugins_log.txt
-rw-r--r-- 1 root root 1584609719 Jul  7 01:22 graph_compiler.log
-rw-r--r-- 1 root root      20903 Jul  7 01:22 perf_measure.log
-rw-r--r-- 1 root root    1273330 Jul  7 01:22 synapse_log.txt
-rw-r--r-- 1 root root     667814 Jul  7 01:22 hcl.log
-rw-r--r-- 1 root root    5051346 Jul  7 01:22 pytorch_log.1.txt
-rw-r--r-- 1 root root    5051346 Jul  7 01:22 pytorch_log.2.txt
-rw-r--r-- 1 root root    5051346 Jul  7 01:22 pytorch_log.3.txt
-rw-r--r-- 1 root root    5051345 Jul  7 01:22 pytorch_log.4.txt
-rw-r--r-- 1 root root    5051344 Jul  7 01:22 pytorch_log.5.txt
-rw-r--r-- 1 root root  104857588 Jul  7 01:22 synapse_runtime.1.log
-rw-r--r-- 1 root root      35011 Jul  7 01:22 recipe_stats.log
-rw-r--r-- 1 root root    2901887 Jul  7 01:22 perf_lib_log.txt
-rw-r--r-- 1 root root     267494 Jul  7 01:22 fuser_lib_log.txt
-rw-r--r-- 1 root root     804231 Jul  7 01:22 perf_lib_suggest_log.txt
-rw-r--r-- 1 root root     186550 Jul  7 01:22 complex_guid_log.txt
-rw-r--r-- 1 root root   10485737 Jul  7 01:22 synapse_log.1.txt
-rw-r--r-- 1 root root    1048540 Jul  7 01:22 complex_guid_log.1.txt
-rw-r--r-- 1 root root   10485744 Jul  7 01:22 synapse_log.2.txt
-rw-r--r-- 1 root root   10485630 Jul  7 01:22 synapse_log.3.txt
-rw-r--r-- 1 root root   10485606 Jul  7 01:22 synapse_log.4.txt
-rw-r--r-- 1 root root   10485614 Jul  7 01:22 synapse_log.5.txt
-rw-r--r-- 1 root root    9998897 Jul  7 01:22 perf_lib_log.1.txt
-rw-r--r-- 1 root root    1048506 Jul  7 01:22 fuser_lib_log.1.txt
-rw-r--r-- 1 root root    1048563 Jul  7 01:22 complex_guid_log.2.txt
-rw-r--r-- 1 root root    1048513 Jul  7 01:22 complex_guid_log.3.txt
-rw-r--r-- 1 root root    1048510 Jul  7 01:22 complex_guid_log.4.txt
-rw-r--r-- 1 root root    1048535 Jul  7 01:22 complex_guid_log.5.txt
-rw-r--r-- 1 root root        238 Jul  7 01:22 synprof_parser_log.txt
-rw-r--r-- 1 root root         89 Jul  7 01:22 event_triggered_logger.log
-rw-r--r-- 1 root root        312 Jul  7 01:22 shared_lib_log.txt
-rw-r--r-- 1 root root          0 Jul  7 01:22 perf_lib_sif_log.txt
-rw-r--r-- 1 root root          0 Jul  7 01:21 graph_compiler_perf.log

In a multi-card case, the logs for each card are saved in a separate numbered subdirectory of the habana_logs directory. Output example when running on two cards:

$ ls -la ~/.habana_logs/0/
total 16
-rw-r--r-- 1 root root    0 Jul  7 02:28 graph_compiler.log
-rw-r--r-- 1 root root    0 Jul  7 02:28 graph_compiler_perf.log
-rw-r--r-- 1 root root   70 Jul  7 02:28 hcl.log
-rw-r--r-- 1 root root  203 Jul  7 02:29 hcl_coordinator.log
-rw-r--r-- 1 root root  514 Jul  7 02:28 scal_log.txt
-rw-r--r-- 1 root root    0 Jul  7 02:28 shared_lib_log.txt
-rw-r--r-- 1 root root    0 Jul  7 02:28 shim_core_log.txt
-rw-r--r-- 1 root root    0 Jul  7 02:28 shim_plugins_log.txt
-rw-r--r-- 1 root root    0 Jul  7 02:28 synapse_log.txt
-rw-r--r-- 1 root root 1125 Jul  7 02:28 synapse_runtime.log
-rw-r--r-- 1 root root    0 Jul  7 02:28 synapse_utils_log.txt

$ ls -la ~/.habana_logs/1/
total 12
-rw-r--r-- 1 root root    0 Jul  7 02:28 graph_compiler.log
-rw-r--r-- 1 root root    0 Jul  7 02:28 graph_compiler_perf.log
-rw-r--r-- 1 root root   70 Jul  7 02:28 hcl.log
-rw-r--r-- 1 root root  514 Jul  7 02:28 scal_log.txt
-rw-r--r-- 1 root root    0 Jul  7 02:28 shared_lib_log.txt
-rw-r--r-- 1 root root    0 Jul  7 02:28 shim_core_log.txt
-rw-r--r-- 1 root root    0 Jul  7 02:28 shim_plugins_log.txt
-rw-r--r-- 1 root root    0 Jul  7 02:28 synapse_log.txt
-rw-r--r-- 1 root root 1125 Jul  7 02:28 synapse_runtime.log
-rw-r--r-- 1 root root    0 Jul  7 02:28 synapse_utils_log.txt

Generating Logs on Multi-Server Setup

When running workloads on multiple nodes that share a home file system, log files written by one node might be overwritten by another. Therefore, it is recommended to set the logging location to each node's local disk, as shown in the example below:

$ export HABANA_LOGS=<path-to-node-local-disk>/.habana_logs
$ export LOG_LEVEL_ALL=0
$ # Train your model as usual
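
To collect the node-local logs afterwards, for inspection or for a support report, one possible approach is to copy them from each node (the hostnames node1 and node2 below are placeholders for your own nodes):

$ mkdir -p all_logs
$ scp -r node1:<path-to-node-local-disk>/.habana_logs all_logs/node1
$ scp -r node2:<path-to-node-local-disk>/.habana_logs all_logs/node2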

Using Logs Environment Variables

The following sections describe the Intel Gaudi logging environment variables. These variables configure what is logged and where the output is written.

Using Log Levels

You can generate Intel Gaudi logs at different verbosity levels for all system components by using the LOG_LEVEL_ALL environment variable.

Setting LOG_LEVEL_ALL=[log level] sets the logging level for all components, including the Intel Gaudi PyTorch bridge. The log levels are outlined in the table below:

Value    Level      Description
0        Trace      Log everything including traces of progress.
1        Debug      Log all errors, warnings and all information useful for debugging.
2        Info       Log errors, warnings and some informative messages.
3        Warning    Log all errors and warnings.
4        Error      Log all errors.
5        Critical   Log only critical errors.
6        Off        Log nothing.

Example:

$ export HABANA_LOGS=~/.habana_logs
$ export LOG_LEVEL_ALL=5 # all components log only critical errors
$ # Train your model as usual

Using Component-level Logs

Setting LOG_LEVEL_[component]=[log level] sets the logging level for a specific component. The available components are outlined in the table below:

Component                          [component] values
Intel Gaudi Software API           SYN_API
Profiling Subsystem                SYN_PROF, PROF_hl[0-7], HLPROF
Graph Compiler                     PARSER, GC, GRAPH_DATA
Intel Gaudi Performance Library    PERF_LIB
Habana Communication Library       HCL, HCL_SUBMISSIONS

Example:

$ export HABANA_LOGS=~/.habana_logs
$ export LOG_LEVEL_ALL=5 # all components log only critical errors
$ export LOG_LEVEL_SYN_API=3 # all errors and warnings are logged for SYN_API
$ # Train your model as usual

Generating PyTorch Logs

You can set the LOG_LEVEL_ALL_PT environment variable to obtain Intel Gaudi PyTorch bridge logs. If LOG_LEVEL_ALL_PT is not set, LOG_LEVEL_ALL is used instead.

Example:

$ export HABANA_LOGS=~/.habana_logs
$ export LOG_LEVEL_ALL_PT=[log level]
$ # Train your model as usual

Refer to the Runtime Flags section for a full description of the PyTorch environment variables.

SYN_API Error Codes

When making calls directly to SYN_API, it is useful to check the return codes against the following symbolic or integer values to understand the outcome of the operation.
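
If a workload fails with one of these codes, searching the generated logs for the status name or for error-level messages can help locate the failing call. A minimal sketch, assuming the default log location and that the status string appears in the Synapse log:

$ grep -n "synOutOfDeviceMemory" ~/.habana_logs/synapse_log.txt
$ grep -inE "error|fail" ~/.habana_logs/synapse_log.txt | tail -n 20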

Status                            Value   Description
synSuccess                        0       The operation succeeded.
synInvalidArgument                1       An argument was invalid.
synCbFull                         2       The command buffer is full.
synOutOfHostMemory                3       Out of host memory.
synOutOfDeviceMemory              4       Out of device memory.
synObjectAlreadyInitialized       5       The object being initialized is already initialized.
synObjectNotInitialized           6       The object must be initialized before the operation can be performed.
synCommandSubmissionFailure       7       The command buffer could not be submitted.
synNoDeviceFound                  8       No Intel Gaudi device was found.
synDeviceTypeMismatch             9       The operation is for the wrong device type.
synFailedToInitializeCb           10      The command buffer failed to initialize.
synFailedToFreeCb                 11      The command buffer could not be freed.
synFailedToMapCb                  12      The command buffer could not be mapped.
synFailedToUnmapCb                13      The command buffer could not be unmapped.
synFailedToAllocateDeviceMemory   14      Device memory could not be allocated.
synFailedToFreeDeviceMemory       15      Device memory could not be freed.
synFailedNotEnoughDevicesFound    16      A free device could not be found.
synDeviceReset                    17      The operation failed because the device is being reset.
synUnsupported                    18      The requested operation is not supported.
synWrongParamsFile                19      While loading a recipe, the binary parameters file failed to load.
synDeviceAlreadyAcquired          20      The referenced device is already occupied.
synNameIsAlreadyUsed              21      A tensor with the same name has already been created.
synBusy                           22      The operation failed to complete within the timeout period.
synAllResourcesTaken              23      The event could not be created due to lack of resources.
synUnavailable                    24      The time an event finished could not be retrieved.
synInvalidTensorDimensions        25      A high-rank tensor is attached to a node that does not support it.
synFail                           26      The operation failed.
synOutOfResources                 27      The operation failed due to lack of software memory.
synUninitialized                  28      The Intel Gaudi software library was not initialized before accessing it.
synAlreadyInitialized             29      The initialization failed because it was already initialized.
synFailedSectionValidation        30      The Launch operation failed because of a section mismatch.
synSynapseTerminated              31      Intel Gaudi software cannot process the operation because it is being terminated.
synAssertAsync                    32      The operation failed due to an assert-async event.
synInvalidEventHandle             33      Invalid event handle.
synMappingNotFound                34      MMU mapping for the given address is missing.
synFailedDynamicPatching          35      Dynamic patching failed.
synFailedStaticPatching           36      Static patching failed.
synFailedToSubmitWorkload         37      The workload could not be submitted.
synInvalidSectionsDefinition      38      Invalid tensor addresses or a missing section address.
synInvalidTensorProperties        39      Invalid tensor properties.
synFailHccl                       40      HCCL failed.
synFailedToCollectTime            41      Time collection failed.
synTimeout                        42      The operation failed due to a driver timeout.
synResourceBadUsage               43      The resource is not used properly.