hl_qual Expected Output and Failure Debug

Expected Output

hl_qual generates a test report that can be printed to the screen and written to a log file. The report log file is named ServerName_hl_qual_report_TIMESTAMP.log (for example, k24-u18-60a_hl_qual_report_Mon_Dec_4_21-44-01_2020.log). Fig. 33 shows an example of an hl_qual report.

../../_images/hl_qual_report_example.jpg

Figure 33 hl_qual Test Report

For more information about the hl_qual test report, refer to hl_qual Report Structure.
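When several runs have been made on one server, the naming convention above makes it straightforward to pick out the most recent report. The following is a minimal sketch, assuming the reports were written to a directory that you pass in; the function name latest_hl_qual_report is hypothetical and is not part of hl_qual.

```shell
#!/bin/sh
# Hypothetical helper (not part of hl_qual): print the most recently
# modified file that matches ServerName_hl_qual_report_TIMESTAMP.log.
latest_hl_qual_report() {
    dir=${1:-.}
    # ls -t sorts by modification time, newest first.
    # If nothing matches, ls fails quietly and nothing is printed.
    ls -t -- "$dir"/*_hl_qual_report_*.log 2>/dev/null | head -n 1
}
```

Calling latest_hl_qual_report with no argument searches the current directory.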

Failure Debug

Due to the complexity of server systems, a malfunction in a single HW module can affect the results of several tests. It is recommended to follow a test plan that first tests the basic HW components, such as PCI or Serdes, using the simplified test, and only then moves on to more complex tests such as power stress and functional tests.

Fig. 34 shows the recommended test plan using the hl_qual tool.

../../_images/hl_qual_test_plan.jpg

Figure 34 hl_qual Test Plan

Habana recommends executing long test runs, especially when using power stress, EDP, and functional tests (including all sub-tests). Running these tests for 12 hours can expose cooling problems and overheating issues.

In case of test failures, generating log files is recommended.

Generating Log Files

As part of the test report, you can generate the reports listed below and send them to Habana for further support.

Log File

Description

dmesg report

Running hl_qual with the -dmesg switch appends the dmesg output collected during the test to the hl_qual report:

./hl_qual -gaudi -c all -rmod serial -t 5 -p -b -dmesg

hl-smi

To generate the report, run the following command:

hl-smi -q > hl-smi.log

Test plugin log (hl_qual log)

To enable the hl_qual logger, set the following environment variables:

ENABLE_CONSOLE=true LOG_LEVEL_QUAL=0 ./hl_qual -gaudi -c all -rmod parallel -t 30 -f -serdes_type allgather | tee hl_qual.log

Setting ENABLE_CONSOLE=true prints the log to standard output. When it is set to false, the log is written to a log file under the hl_qual folder.

hl_qual has 5 different log levels:

  • NO_LOG = 4

  • ERROR = 3

  • WARNING = 2

  • DEBUG = 1

  • INFO = 0
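The numeric values are easy to mix up when scripting runs, so a small lookup can help. This is an illustrative sketch only: the function name qual_log_level is hypothetical, and the only part taken from this document is the name-to-number mapping listed above.

```shell
#!/bin/sh
# Hypothetical helper: translate an hl_qual log level name into the
# number expected by LOG_LEVEL_QUAL, using the mapping listed above.
qual_log_level() {
    case "$1" in
        INFO)    echo 0 ;;
        DEBUG)   echo 1 ;;
        WARNING) echo 2 ;;
        ERROR)   echo 3 ;;
        NO_LOG)  echo 4 ;;
        *)       echo "unknown log level: $1" >&2; return 1 ;;
    esac
}

# Example (commented out because it needs hl_qual and a device):
# LOG_LEVEL_QUAL=$(qual_log_level ERROR) ENABLE_CONSOLE=true ./hl_qual -gaudi -c all -rmod parallel -t 30 -f -serdes_type allgather
```

An unrecognized name returns a non-zero status, so a calling script can stop before it starts a long run with an unintended level.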

Synapse logs

Because the SynapseAI code layer is one of the major building blocks used by hl_qual, it is highly recommended to run hl_qual with SynapseAI logs enabled when errors are reported. To enable the SynapseAI logs, use the following environment variables:

  • ENABLE_CONSOLE=true - Log printout is sent to the screen. If not set, the log is written to the $HOME/.habana_logs folder.

  • LOG_LEVEL_ALL=X - Where X is an integer from 0 to 5. Using a log level below 4 can cause a significant performance drop in hl_qual.

ENABLE_CONSOLE=true LOG_LEVEL_ALL=4 ./hl_qual -gaudi -c all -rmod parallel -t 30 -f -serdes_type allgather | tee hl_qual.log

lspci report

Use the following command:

lspci -vvnn
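The system-level reports above (dmesg, hl-smi, and lspci) can be gathered in one pass before contacting support. The following is a minimal sketch, not an official Habana tool: the function name collect_support_logs and the output file names dmesg.log and lspci.log are assumptions, and any command that is not installed is skipped.

```shell
#!/bin/sh
# Hypothetical helper: collect dmesg, hl-smi and lspci output into one
# directory. Prints the directory name on success.
collect_support_logs() {
    out_dir=${1:-hl_qual_support_logs}
    mkdir -p "$out_dir" || return 1

    # Kernel messages; may need root on systems that restrict dmesg.
    dmesg > "$out_dir/dmesg.log" 2>&1 ||
        echo "dmesg failed; see $out_dir/dmesg.log" >&2

    if command -v hl-smi >/dev/null 2>&1; then
        hl-smi -q > "$out_dir/hl-smi.log" 2>&1 || true
    else
        echo "hl-smi not found, skipping" >&2
    fi

    if command -v lspci >/dev/null 2>&1; then
        lspci -vvnn > "$out_dir/lspci.log" 2>&1 || true
    else
        echo "lspci not found, skipping" >&2
    fi

    echo "$out_dir"
}
```

The hl_qual and SynapseAI logs are not collected here because they are produced by the hl_qual run itself, as described above.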

Debugging Specific Issues

  • PCI bandwidth issues:

    • Verify that every link on the path between host and device, including PCI bridges, is Gen3 with x16 width.

    • Verify the correct NUMA node assignment in the hl_qual report.

    • Verify correct setup of PCI retimers between host and device.

  • Serdes issues:

    • Check that all links are up (link status appears in the dmesg output).

    • Check for any assertions due to wait loops.

  • Power stress issues:

    • Check for clock throttling followed by a device reset.

    • Check reported temperature and compare to allowed max temperature.
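For the PCI bandwidth check, the negotiated link speed and width appear on the LnkSta line of the lspci -vv output, where Gen3 corresponds to 8GT/s. The following sketch flags any LnkSta line that is not Gen3 x16. The function name check_gen3_x16 is hypothetical, and the pattern assumes the usual pciutils layout of that line (Speed ..., Width ...), which may differ between pciutils versions.

```shell
#!/bin/sh
# Hypothetical helper: read lspci -vv output on stdin and print every
# LnkSta line that is not 8GT/s (Gen3) with x16 width.
# Returns 0 if all LnkSta lines conform (or none are present), else 1.
check_gen3_x16() {
    bad=$(grep 'LnkSta:' |
          grep -v -E 'Speed 8GT/s.*Width x16([^0-9]|$)' || true)
    if [ -n "$bad" ]; then
        printf '%s\n' "$bad"
        return 1
    fi
    return 0
}

# Example (the device address is a placeholder; substitute your own):
# sudo lspci -vv -s <bus:dev.fn> | check_gen3_x16
```

Run the check for the device and for each bridge on the path to the host, since a single narrower or slower link limits the bandwidth of the whole path.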