hl_qual Expected Output and Failure Debug

Expected Output

hl_qual generates a test report, which can be printed to the screen or written to a log file. The log file is named ServerName_hl_qual_report_TIMESTAMP.log (for example, k24-u18-60a_hl_qual_report_Mon_Dec_4_21-44-01_2020.log). Fig. 15 shows an example of an hl_qual report.

../../_images/hl_qual_report_example.jpg

Figure 15 hl_qual Test Report

For more information on the hl_qual test report, refer to hl_qual Report Structure.
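When many reports are collected from several servers, it can help to split a report file name back into its server name and timestamp. The following is a minimal sketch based only on the naming convention above; the helper name parse_report_name is hypothetical and is not part of hl_qual.

```shell
#!/bin/sh
# Sketch: split "ServerName_hl_qual_report_TIMESTAMP.log" into its two parts.
# parse_report_name is an illustrative helper, not an hl_qual feature.
parse_report_name() {
    name="${1##*/}"                      # drop any leading directory
    case "$name" in
        *_hl_qual_report_*.log) ;;       # looks like an hl_qual report name
        *) return 1 ;;                   # anything else is rejected
    esac
    server="${name%%_hl_qual_report_*}"  # text before the fixed middle part
    stamp="${name#*_hl_qual_report_}"    # text after the fixed middle part
    stamp="${stamp%.log}"                # drop the .log extension
    printf '%s %s\n' "$server" "$stamp"
}
```

For example, `parse_report_name k24-u18-60a_hl_qual_report_Mon_Dec_4_21-44-01_2020.log` prints `k24-u18-60a Mon_Dec_4_21-44-01_2020`.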

Failure Debug

Because server systems are complex, a malfunction in one HW module can affect the performance of several tests. Test basic HW components, such as PCI or Serdes, first, and then run more complex tests such as power stress and functional tests.

Fig. 16 shows the recommended test plan using the hl_qual tool.

../../_images/hl_qual_test_plan.jpg

Figure 16 hl_qual Test Plan

It is recommended to execute long test runs, especially for the power stress, EDP, and functional tests (including all sub-tests). Running these tests for more than 20 minutes may expose cooling problems and overheating issues.

Note

In case of test failures, generating log files is recommended.

Generating Log Files

As part of the test report, you can generate the reports listed below and send them to Intel Gaudi for further support:

Log File

Description

dmesg report

Running hl_qual with the -dmesg switch appends the dmesg output collected throughout the duration of the test to the hl_qual report:

./hl_qual -gaudi -c all -rmod serial -t 5 -p -b -dmesg

hl-smi

To generate the hl-smi report, run the following command:

hl-smi -q > hl-smi.log

Test plugin log (hl_qual.log)

To enable hl_qual.log, use the following environment variables:

ENABLE_CONSOLE=true LOG_LEVEL_QUAL=0 ./hl_qual -gaudi -c all -rmod parallel -t 30 -f -serdes_type allgather | tee hl_qual.log

The ENABLE_CONSOLE variable sends the log printout to standard output. When it is set to false, the log is written to a log file under the hl_qual folder.

hl_qual has 5 different log levels:

  • NO_LOG = 4

  • ERROR = 3

  • WARNING = 2

  • DEBUG = 1

  • INFO = 0

Synapse logs (hl_qual.log)

The software code layer is one of the major building blocks used by hl_qual, so if errors are reported, it is highly recommended to run hl_qual with logs enabled. To enable hl_qual.log, use the following environment variables:

  • ENABLE_CONSOLE=true - Log printout is sent to the screen. If not used, the log will be redirected to the $HOME/.habana_logs folder.

  • LOG_LEVEL_ALL=X - Where X is an integer from 0 to 5. Using a log level below 4 could cause a significant performance drop in hl_qual.

ENABLE_CONSOLE=true LOG_LEVEL_ALL=4 ./hl_qual -gaudi -c all -rmod parallel -t 30 -f -serdes_type allgather | tee hl_qual.log

lspci report

To generate the lspci report, run the following command:

lspci -vvnn
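The hl-smi and lspci reports above can be gathered into one directory before sending them for support. The following is a minimal sketch that assumes both tools are in PATH; the function name collect_reports, the default directory name, and the lspci.log file name are illustrative assumptions, while the two commands are the ones listed in the table.

```shell
#!/bin/sh
# Sketch: gather the hl-smi and lspci reports into one directory.
# collect_reports, the default directory, and lspci.log are assumptions;
# the commands themselves are the ones documented above.
collect_reports() {
    out_dir="${1:-hl_qual_support_logs}"
    mkdir -p "$out_dir" || return 1

    # hl-smi report (skipped with a warning if the tool is not in PATH).
    if command -v hl-smi >/dev/null 2>&1; then
        hl-smi -q > "$out_dir/hl-smi.log"
    else
        echo "hl-smi not found, skipping" >&2
    fi

    # lspci report (skipped with a warning if the tool is not in PATH).
    if command -v lspci >/dev/null 2>&1; then
        lspci -vvnn > "$out_dir/lspci.log"
    else
        echo "lspci not found, skipping" >&2
    fi
}
```

Calling `collect_reports ./support_logs` writes both files under ./support_logs. Note that lspci typically hides device capability details from non-root users, so running the collection as root usually gives a more complete report.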

Debugging Specific Issues

Test Plugin

Description

PCI bandwidth

  • Verify that the path between host and device, including PCI bridges, is Gen-3 with x16 width.

  • Verify the correct NUMA node assignment in the hl_qual report.

  • Verify correct setup of PCI retimers between host and device.

Serdes base

  • Check that all links are up (link status appears in dmesg).

  • Check for any assertions due to wait loops.

Power stress

  • Check whether clock throttling is followed by a device reset.

  • Check the reported temperature and compare it with the allowed maximum temperature.
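For the PCI bandwidth check, the negotiated link speed and width of each device on the path can be read from the LnkSta line of the lspci -vv output. The following is a minimal sketch, not an hl_qual feature: the function name check_pcie_links is hypothetical, and it assumes the usual lspci text layout in which each device starts on a non-indented line and LnkSta reports "Speed ...GT/s" and "Width x...". Gen-3 corresponds to a link speed of 8GT/s.

```shell
#!/bin/sh
# Sketch: read `lspci -vv` output on stdin and flag links that are not at
# the expected speed and width. Defaults are 8GT/s (PCIe Gen-3) and x16.
# check_pcie_links is an illustrative helper, not part of hl_qual.
check_pcie_links() {
    awk -v want_speed="${1:-8GT/s}" -v want_width="${2:-x16}" '
        # A device header starts at column 0 with its bus address.
        /^[0-9a-fA-F]/ { dev = $1 }
        # LnkSta holds the negotiated (current) link speed and width.
        /LnkSta:/ {
            speed = ""; width = ""
            if (match($0, /Speed [0-9.]+GT\/s/))
                speed = substr($0, RSTART + 6, RLENGTH - 6)
            if (match($0, /Width x[0-9]+/))
                width = substr($0, RSTART + 6, RLENGTH - 6)
            status = (speed == want_speed && width == want_width) ? "OK" : "CHECK"
            print status, dev, speed, width
        }
    '
}
```

A possible use is `lspci -vv | check_pcie_links`, or `lspci -vv -s <address> | check_pcie_links` for a single device or bridge, where <address> is the device's bus address. Each output line has the form "OK 3b:00.0 8GT/s x16" or "CHECK 3c:00.0 5GT/s x8"; lines marked CHECK point to a device or bridge whose link should be inspected.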