hl_qual Expected Output and Failure Debug

Expected Output

hl_qual generates a test report, which is printed to the screen and written to a log file. The log file naming convention is ServerName_hl_qual_report_TIMESTAMP.log, for example, k24-u18-60a_hl_qual_report_Mon_Dec_4_21-44-01_2020.log.
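Because the report name follows a fixed pattern, the most recent report in a directory can be located with a small shell helper. The sketch below is illustrative only: the function name latest_hl_qual_report is hypothetical, and the directory that holds the reports is an assumption that must be supplied by the caller.

```shell
# Hypothetical helper: print the newest hl_qual report found in a directory.
# Assumes reports follow ServerName_hl_qual_report_TIMESTAMP.log and that the
# directory holding them is passed as $1 (defaults to the current directory).
latest_hl_qual_report() {
    dir="${1:-.}"
    # ls -t sorts by modification time, newest first; errors from an
    # unmatched pattern are discarded so an empty directory prints nothing.
    ls -t "$dir"/*_hl_qual_report_*.log 2>/dev/null | head -n 1
}
```

For example, less "$(latest_hl_qual_report .)" opens the latest report in the current directory.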


Figure 34 hl_qual Test Report

Fig. 34 shows an example of an hl_qual report.

For more information about the hl_qual report, refer to hl_qual Report Structure.

Failure Debug

Due to the complexity of server systems, a malfunction in one HW module can affect the results of several tests. It is recommended to follow a test plan that first tests the basic HW components, such as PCI and SERDES, using the simplified test, and only then moves on to more complex tests such as power stress and functional tests.

Fig. 35 shows the recommended test plan using the hl_qual tool.


Figure 35 hl_qual Test Plan

Habana recommends executing long test runs, especially when using power stress, EDP, and functional tests (including all sub-tests). Running these tests for 12 hours could expose cooling problems and overheating issues.

In case of test failures, generating log files is recommended.

Generating Log Files

As part of the test report, you can generate the reports listed below and send them to Habana for further support.

Dmesg report

  • dmesg report - Running hl_qual with the -dmesg switch appends the dmesg output collected during the test to the hl_qual report:

./hl_qual -gaudi -c all -rmod serial -t 5 -p -b -dmesg

hl-smi log

  • hl-smi log - To generate the report, run the following command:

hl-smi -q > hl-smi.log

hl_qual log

  • Test plugin log - To enable the hl_qual logger, use the following environment variables:

ENABLE_CONSOLE=true LOG_LEVEL_QUAL=0 ./hl_qual -gaudi -c all -rmod parallel -t 30 -f -serdes_type allgather | tee hl_qual.log

The ENABLE_CONSOLE variable enables log printout to the standard output. If it is set to false, the log is written to a log file under the hl_qual folder.

hl_qual has 5 different log levels:

  • NO_LOG = 4

  • ERROR = 3

  • WARNING = 2

  • DEBUG = 1

  • INFO = 0
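Since LOG_LEVEL_QUAL takes the numeric value, a small lookup can make command lines easier to read. The following is a minimal sketch; the function name qual_log_level is hypothetical, and the mapping is copied from the list above.

```shell
# Hypothetical helper: translate an hl_qual log level name to its number.
# The name-to-number mapping is taken from the list of log levels above.
qual_log_level() {
    case "$1" in
        INFO)    echo 0 ;;
        DEBUG)   echo 1 ;;
        WARNING) echo 2 ;;
        ERROR)   echo 3 ;;
        NO_LOG)  echo 4 ;;
        *)       echo "unknown log level: $1" >&2; return 1 ;;
    esac
}
```

The result can then be substituted into the command shown earlier, for example by setting LOG_LEVEL_QUAL="$(qual_log_level INFO)" in place of LOG_LEVEL_QUAL=0.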

Synapse log

  • Synapse logs - The SynapseAI code layer is one of the major building blocks used by hl_qual. If an error is reported, it is advised to run hl_qual with SynapseAI logs enabled. This is done using the following environment variables:

    • ENABLE_CONSOLE=true - Log printout is sent to the screen. If not used, the log is redirected to the $HOME/.habana_logs folder.

    • LOG_LEVEL_ALL=X - X is an integer between 0 and 5. Using a log level below 4 could cause a significant performance drop in hl_qual.

ENABLE_CONSOLE=true LOG_LEVEL_ALL=4 ./hl_qual -gaudi -c all -rmod parallel -t 30 -f -serdes_type allgather | tee hl_qual.log
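Because a low LOG_LEVEL_ALL value can slow hl_qual down, it may help to validate the value before starting a long run. The guard below is a sketch under that assumption; check_synapse_log_level is a hypothetical name and is not part of hl_qual or SynapseAI.

```shell
# Hypothetical guard: validate a LOG_LEVEL_ALL value before a run.
# Accepts a single integer 0..5 and warns on stderr when it is below 4,
# the range described above as potentially costly for performance.
check_synapse_log_level() {
    case "$1" in
        [0-5]) ;;
        *) echo "LOG_LEVEL_ALL must be an integer in 0..5, got: $1" >&2
           return 1 ;;
    esac
    if [ "$1" -lt 4 ]; then
        echo "warning: LOG_LEVEL_ALL=$1 may reduce hl_qual performance" >&2
    fi
    return 0
}
```

A typical use is to call check_synapse_log_level 4 and launch the command above only if the function succeeds.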

lspci report

  • lspci report - Use the following command line:

lspci -vvnn
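The system-level reports above can be gathered into one directory before contacting support. The sketch below is one possible way to do this; the function names and the default directory hl_support_logs are hypothetical. It uses only the hl-smi and lspci commands shown in this section plus the standard dmesg command, which captures the current kernel log rather than only the messages from a test run, so it does not replace the -dmesg switch.

```shell
# Hypothetical helper: run a command and save its output to a file.
# Usage: collect_log OUTPUT_FILE COMMAND [ARGS...]
collect_log() {
    out="$1"; shift
    if command -v "$1" >/dev/null 2>&1; then
        "$@" > "$out" 2>&1
    else
        # Record the gap instead of aborting the whole collection.
        echo "skipped: $1 not found" > "$out"
    fi
}

# Collect the reports described in this section into one directory.
# $1 is the output directory (hypothetical default: hl_support_logs).
collect_support_logs() {
    dir="${1:-hl_support_logs}"
    mkdir -p "$dir" || return 1
    collect_log "$dir/hl-smi.log" hl-smi -q
    collect_log "$dir/lspci.log" lspci -vvnn
    collect_log "$dir/dmesg.log" dmesg
}
```

Running collect_support_logs with no arguments writes the three files under hl_support_logs in the current directory. The hl_qual report and logger output still need to be added separately.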

Debugging Specific Issues

  • PCI bandwidth issues:

    • Verify that the path between host and device, including PCI bridges, is Gen3 with x16 width.

    • Verify the correct NUMA node assignment in the hl_qual report.

    • Verify correct setup of PCI retimers between host and device.

  • Serdes issues:

    • Check that all links are up (link status is reported in dmesg).

    • Check for any assertions due to wait loops.

  • Power stress issues:

    • Check for clock throttling followed by a device reset.

    • Check reported temperature and compare to allowed max temperature.
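For the PCI bandwidth check, the negotiated link speed and width appear on the LnkSta line of verbose lspci output, where Gen3 corresponds to 8GT/s. The filter below is a rough sketch, not a Habana tool: check_pcie_links is a hypothetical name, and it flags every device in the input, so the output still has to be narrowed by hand to the bridges and devices on the path of interest.

```shell
# Hypothetical check: read `lspci -vv` output on stdin and report, for each
# LnkSta line, whether the link shows Gen3 speed (8GT/s) and x16 width.
# Exits non-zero if any listed link is below that.
check_pcie_links() {
    awk '
        # Device header lines start in column 0 with the PCI address.
        /^[0-9a-fA-F]/ { dev = $1 }
        /LnkSta:/ {
            ok = ($0 ~ /Speed 8GT\/s/ && $0 ~ /Width x16/)
            printf "%s %s\n", dev, (ok ? "OK" : "DEGRADED")
            if (!ok) bad = 1
        }
        END { exit bad + 0 }
    '
}
```

A possible invocation is sudo lspci -vv | check_pcie_links. Root privileges are generally needed for lspci to print the link capability fields.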