hl_qual Expected Output and Failure Debug
Expected Output
hl_qual generates a test report that can be printed to the screen and written to a log file.
The report log file is named ServerName_hl_qual_report_TIMESTAMP.log (for example, k24-u18-60a_hl_qual_report_Mon_Dec_4_21-44-01_2020.log). Fig. 19 shows an example of an hl_qual report.

Figure 19 hl_qual Test Report
For more information about the hl_qual test report, refer to hl_qual Report Structure.
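The naming convention above makes report files easy to handle in scripts. As a minimal sketch, the helper below recovers the server name from a report file name. The function name report_server_name is hypothetical (it is not part of hl_qual), and the sketch assumes the server name itself does not contain the string _hl_qual_report_.

```shell
# Hypothetical helper, not part of hl_qual: extract the server name from a
# report file name of the form ServerName_hl_qual_report_TIMESTAMP.log.
report_server_name() {
    file_name=$(basename "$1")
    case "$file_name" in
        *_hl_qual_report_*.log)
            # Remove the longest suffix that starts at "_hl_qual_report_".
            printf '%s\n' "${file_name%%_hl_qual_report_*}"
            ;;
        *)
            echo "not an hl_qual report: $file_name" >&2
            return 1
            ;;
    esac
}

# Prints: k24-u18-60a
report_server_name "k24-u18-60a_hl_qual_report_Mon_Dec_4_21-44-01_2020.log"
```

The same pattern can be used in a shell glob (for example, *_hl_qual_report_*.log) to collect all reports in a directory.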
Failure Debug
Because server systems are complex, a malfunction in one HW module can affect several tests. Test the basic HW components, such as PCI or Serdes, first, and then run the more complex tests, such as power stress and functional tests.
Fig. 20 shows the recommended test plan using the hl_qual tool.

Figure 20 hl_qual Test Plan
Long test runs are recommended, especially for the power stress, EDP, and functional tests (including all sub-tests). Running these tests for more than 20 minutes may expose cooling problems and overheating issues.
Note
In case of test failures, generating log files is recommended.
Generating Log Files
As part of the test report, you can generate the logs listed below and send them to Habana for further support.
Log File | Description |
---|---|
dmesg log | Run hl_qual with ./hl_qual -gaudi -c all -rmod serial -t 5 -p -b -dmesg |
hl-smi log | To generate the report, run the following command: hl-smi -q > hl-smi.log |
hl_qual log | To enable the hl_qual logger, use the following environment variables (LOG_LEVEL_QUAL has 5 different log levels): ENABLE_CONSOLE=true LOG_LEVEL_QUAL=0 ./hl_qual -gaudi -c all -rmod parallel -t 30 -f -serdes_type allgather \| tee hl_qual.log |
SynapseAI log | The SynapseAI code layer is one of the major building blocks used by hl_qual, so if errors are reported, it is highly recommended to run hl_qual with SynapseAI logs enabled. To enable logging, use the following environment variables: ENABLE_CONSOLE=true LOG_LEVEL_ALL=4 ./hl_qual -gaudi -c all -rmod parallel -t 30 -f -serdes_type allgather \| tee hl_qual.log |
lspci output | lspci -vvnn |
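The quick-to-produce logs from the table can be gathered into one directory before sending them for support. The following is a minimal sketch, not an official tool: the function name collect_support_logs, the default directory name, and the output file name lspci.log are assumptions made for illustration. The hl_qual runs from the table are left out because they take longer and are started separately.

```shell
# Sketch: gather the hl-smi and lspci logs from the table into one directory.
# The function name and default directory name are illustrative assumptions.
collect_support_logs() {
    out_dir="${1:-hl_support_logs}"
    mkdir -p "$out_dir" || return 1

    # Device status report, using the hl-smi command from the table.
    if command -v hl-smi >/dev/null 2>&1; then
        hl-smi -q > "$out_dir/hl-smi.log" 2>&1 || echo "hl-smi -q failed" >&2
    else
        echo "hl-smi not found, skipping" >&2
    fi

    # Verbose PCI listing; the file name lspci.log is an assumption.
    if command -v lspci >/dev/null 2>&1; then
        lspci -vvnn > "$out_dir/lspci.log" 2>&1 || echo "lspci failed" >&2
    else
        echo "lspci not found, skipping" >&2
    fi

    echo "logs written to $out_dir"
}
```

Calling collect_support_logs ./support_logs writes the available logs to ./support_logs and skips any tool that is not installed on the host.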
Debugging Specific Issues
PCI bandwidth issues:
Verify that the path between host and device, including PCI bridges, is Gen3 with x16 width.
Verify the correct NUMA node assignment in the hl_qual report.
Verify the correct setup of PCI retimers between host and device.
Serdes issues:
Check that all links are up (reported in dmesg).
Check for any assertions due to wait loops.
Power stress issues:
Check whether clock throttling is followed by a device reset.
Check the reported temperature and compare it to the maximum allowed temperature.
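For the PCI bandwidth checks above, the negotiated link speed and width appear on the LnkSta lines of verbose lspci output. The following is a minimal sketch under stated assumptions: the function name check_pcie_link is hypothetical, it reads lspci output from standard input, and it assumes a Gen3 link is reported as 8GT/s (the PCIe 3.0 per-lane rate). It flags every LnkSta line that does not match the expected speed and width.

```shell
# Hypothetical helper: read `lspci -vv` style output on stdin and report any
# LnkSta line whose speed or width differs from the expected values.
# Defaults assume Gen3 (8GT/s) and x16, as recommended for the host-device path.
check_pcie_link() {
    want_speed="${1:-8GT/s}"
    want_width="${2:-x16}"
    awk -v want_speed="$want_speed" -v want_width="$want_width" '
        BEGIN { bad = 0 }
        /LnkSta:/ {
            speed = ""; width = ""
            # The value follows the "Speed" and "Width" keywords.
            for (i = 1; i < NF; i++) {
                if ($i == "Speed") speed = $(i + 1)
                if ($i == "Width") width = $(i + 1)
            }
            # Older lspci versions attach a comma to the value.
            gsub(/,/, "", speed)
            gsub(/,/, "", width)
            if (speed != want_speed || width != want_width) {
                printf "unexpected link: speed=%s width=%s\n", speed, width
                bad = 1
            }
        }
        END { exit bad }
    '
}
```

A possible use is lspci -vvnn -s <bus:device.function> | check_pcie_link for the device and for each bridge on the path. The function exits with a non-zero status if any link differs from the expected values. Root privileges are typically needed for lspci to display the link status.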