hl_qual Expected Output and Failure Debug¶
Expected Output¶
hl_qual generates a test report that can be printed to the screen and written to a log file.
The test report log file naming convention is ServerName_hl_qual_report_TIMESTAMP.log (for example, k24-u18-60a_hl_qual_report_Mon_Dec_4_21-44-01_2020.log). Fig. 33 shows an example of an hl_qual report.

Figure 33 hl_qual Test Report¶
For more information about the hl_qual test report, refer to hl_qual Report Structure.
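Because the server name and timestamp are embedded in the report file name, reports gathered from several servers can be sorted with plain shell string handling. The following is a minimal sketch, assuming the naming convention above holds exactly. The function names report_server_name and report_timestamp are hypothetical helpers and are not part of hl_qual.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not part of hl_qual) that split a report file name
# of the form ServerName_hl_qual_report_TIMESTAMP.log into its parts.

# Print the server name: everything before "_hl_qual_report_".
report_server_name() {
    local file
    file=$(basename "$1")
    printf '%s\n' "${file%%_hl_qual_report_*}"
}

# Print the timestamp: everything after "_hl_qual_report_", minus ".log".
report_timestamp() {
    local file
    file=$(basename "$1" .log)
    printf '%s\n' "${file#*_hl_qual_report_}"
}
```

For instance, calling report_server_name on the example file name above prints k24-u18-60a.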
Failure Debug¶
Due to the complexity of server systems, a malfunction in one HW module can affect the results of several tests. It is recommended to follow a test plan that first tests the basic HW components, such as PCI or Serdes, using the simplified tests, and only then moves on to more complex tests such as power stress and functional tests.
Fig. 34 shows the recommended test plan using the hl_qual tool.

Figure 34 hl_qual Test Plan¶
Habana recommends executing long test runs, especially when using power stress, EDP and functional tests (including all sub-tests). Running these tests for 12 hours could expose cooling problems and overheating issues.
In case of test failures, generating log files is recommended.
Generating Log Files¶
In addition to the test report, you can generate the log files listed below and send them to Habana for further support.
Log File | Description |
---|---|
dmesg | Run hl_qual with the -dmesg option: ./hl_qual -gaudi -c all -rmod serial -t 5 -p -b -dmesg |
hl-smi | To generate the report, run the following command: hl-smi -q > hl-smi.log |
hl_qual log | To enable the hl_qual logger, use the following environment variables: ENABLE_CONSOLE=true LOG_LEVEL_QUAL=0 ./hl_qual -gaudi -c all -rmod parallel -t 30 -f -serdes_type allgather \| tee hl_qual.log The LOG_LEVEL_QUAL variable selects one of 5 log levels. |
SynapseAI log | As the SynapseAI code layer is one of the major building blocks used by hl_qual, it is highly recommended to run hl_qual with SynapseAI logs enabled whenever errors are reported. To enable SynapseAI logs, use the following environment variables: ENABLE_CONSOLE=true LOG_LEVEL_ALL=4 ./hl_qual -gaudi -c all -rmod parallel -t 30 -f -serdes_type allgather \| tee hl_qual.log |
lspci | Run the following command: lspci -vvnn |
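The supporting logs from the table can be gathered in one pass before contacting support. The sketch below is one possible way to do it, not an official tool: the function name collect_support_logs and the output directory layout are assumptions. Only hl-smi -q and lspci -vvnn come from the table; plain dmesg is the standard Linux command, used here to snapshot the kernel log. Tools that are missing on the host are skipped rather than treated as errors.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of hl_qual): gather supporting logs
# into a single directory and print that directory's path.
# Usage: collect_support_logs [output_dir]
collect_support_logs() {
    local out_dir="${1:-hl_qual_support_$(date +%Y%m%d_%H%M%S)}"
    mkdir -p "$out_dir" || return 1

    # Device status report, as described in the table.
    if command -v hl-smi >/dev/null 2>&1; then
        hl-smi -q > "$out_dir/hl-smi.log" 2>&1
    else
        echo "hl-smi not found, skipping" >&2
    fi

    # Verbose PCI listing; full detail may require root privileges.
    if command -v lspci >/dev/null 2>&1; then
        lspci -vvnn > "$out_dir/lspci.log" 2>&1
    else
        echo "lspci not found, skipping" >&2
    fi

    # Kernel log snapshot; reading it may be restricted for non-root users.
    if command -v dmesg >/dev/null 2>&1; then
        dmesg > "$out_dir/dmesg.log" 2>&1 || echo "dmesg could not be read" >&2
    fi

    # Include any hl_qual logs captured with tee in the current directory.
    cp -- ./hl_qual*.log "$out_dir"/ 2>/dev/null || true

    printf '%s\n' "$out_dir"
}
```

Running collect_support_logs from the directory where hl_qual.log was captured leaves all files in one place, ready to be archived and sent.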
Debugging Specific Issues¶
PCI bandwidth issues:
Verify that the path between host and device, including PCI bridges, is Gen3 with x16 width.
Verify that the NUMA node assignment shown in the hl_qual report is correct.
Verify correct setup of PCI retimers between host and device.
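The Gen3 x16 check can be done from lspci -vvnn output, where each PCIe function reports its negotiated link on a LnkSta line and Gen3 corresponds to a speed of 8GT/s. The sketch below is a hypothetical helper, not part of hl_qual: the name check_pcie_links is an assumption, it reads lspci text from standard input, and it flags every device in that input that is not at 8GT/s x16, so the input should be limited to the devices on the host-to-device path.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read `lspci -vv` text on stdin and report any
# device whose negotiated link (LnkSta) is not 8GT/s (Gen3) at x16.
# Exit status is 1 if at least one such link is found, 0 otherwise.
check_pcie_links() {
    awk '
        # Device header lines start in column 1; remember the address.
        /^[0-9a-fA-F]/ { dev = $1 }
        # LnkSta: lines carry the negotiated speed and width.
        /LnkSta:/ {
            if ($0 !~ /Speed 8GT\/s/ || $0 !~ /Width x16([^0-9]|$)/) {
                sub(/^[ \t]+/, "")
                print dev ": " $0
                bad = 1
            }
        }
        END { exit (bad ? 1 : 0) }
    '
}
```

A typical invocation would be lspci -vvnn -s BUS:DEV.FN | check_pcie_links for each device and bridge on the path, run as root so that the link capabilities are visible. On Linux, the NUMA node of a PCI device can also be read from /sys/bus/pci/devices/DOMAIN:BUS:DEV.FN/numa_node and compared with the value in the hl_qual report.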
Serdes issues:
Check that all links are up (link status appears in the dmesg output).
Check for any assertions due to wait loops.
Power stress issues:
Check for clock throttling followed by a device reset.
Check the reported temperature and compare it to the allowed maximum temperature.