hl_qual Expected Output and Failure Debug
hl_qual generates a test report that can be printed to the screen and written to a log file.
The test report log file naming convention is ServerName_hl_qual_report_TIMESTAMP.log (for example, k24-u18-60a_hl_qual_report_Mon_Dec_4_21-44-01_2020.log). Fig. 19 shows an example of an hl_qual report.
For more information about the hl_qual test report, refer to hl_qual Report Structure.
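Because the report file name starts with the server name, the most recent report for a given machine can be located with a glob on that prefix. The following is a minimal sketch and not part of hl_qual: the function name latest_hl_qual_report is hypothetical, and it assumes that the reports sit in a directory you pass in and that file modification times reflect the order of the runs.

```shell
# Hypothetical helper, not part of hl_qual.
# Print the newest report for a server, or return 1 if none is found.
# Arguments: $1 = directory to search (assumed location), $2 = server name
latest_hl_qual_report() {
    local dir="$1" server="$2" newest="" f
    for f in "$dir"/"${server}"_hl_qual_report_*.log; do
        # Skip the unexpanded pattern when nothing matches.
        [ -e "$f" ] || continue
        # Keep the candidate with the most recent modification time.
        if [ -z "$newest" ] || [ "$f" -nt "$newest" ]; then
            newest="$f"
        fi
    done
    [ -n "$newest" ] && printf '%s\n' "$newest"
}
```

For example, latest_hl_qual_report . "$(hostname)" would print the newest report in the current directory, assuming the ServerName part of the file name matches the output of hostname, which this page does not state.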
Because server systems are complex, a malfunction in one HW module can affect several tests. Test the basic HW components, such as PCI or Serdes, first, and then run the more complex tests, such as power stress and functional tests.
Fig. 20 shows the recommended test plan using the hl_qual tool.
It is recommended to execute long test runs, especially when using the power stress, EDP, and functional tests (including all sub-tests). Running these tests for more than 20 minutes may expose cooling problems and overheating issues.
In case of test failures, generating log files is recommended.
Generating Log Files¶
As part of the test report, you can generate the reports listed below and send them to Habana for further support.
Run hl_qual with the -dmesg switch:
./hl_qual -gaudi -c all -rmod serial -t 5 -p -b -dmesg
To generate the hl-smi query report, run the following command:
hl-smi -q > hl-smi.log
To enable the hl_qual logger, use the following environment variables:
ENABLE_CONSOLE=true LOG_LEVEL_QUAL=0 ./hl_qual -gaudi -c all -rmod parallel -t 30 -f -serdes_type allgather | tee hl_qual.log
hl_qual has 5 different log levels:
Because the SynapseAI code layer is one of the major building blocks used by hl_qual, it is highly recommended to run hl_qual with the SynapseAI logs enabled when errors are reported. To enable the SynapseAI logs, use the following environment variables:
ENABLE_CONSOLE=true LOG_LEVEL_ALL=4 ./hl_qual -gaudi -c all -rmod parallel -t 30 -f -serdes_type allgather | tee hl_qual.log
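The tee pipelines above save the console output, but a pipeline normally reports the exit status of its last command (tee), so a failing run can look successful to a calling script. The sketch below wraps any command with the hl_qual logger variables shown above, writes a copy of the output to a log file, and returns the wrapped command's own exit status. The function name run_with_qual_logs and the temporary status file are illustrative. Whether hl_qual signals test failures through its exit status is an assumption that this page does not confirm.

```shell
# Hypothetical helper, not part of hl_qual.
# Usage: run_with_qual_logs LOGFILE COMMAND [ARGS...]
run_with_qual_logs() {
    local log="$1" status_file rc=0
    shift
    status_file=$(mktemp) || return 1
    {
        # Logger variables copied from the hl_qual logger example.
        ENABLE_CONSOLE=true LOG_LEVEL_QUAL=0 "$@" 2>&1 || rc=$?
        # Save the command's status; the pipeline would otherwise hide it.
        echo "$rc" > "$status_file"
    } | tee "$log"
    rc=$(cat "$status_file")
    rm -f "$status_file"
    return "$rc"
}
```

The status is passed through a temporary file because the left side of a pipeline runs in a subshell, so a variable set there is not visible after the pipeline ends. To capture the SynapseAI logs instead, the same pattern would apply with LOG_LEVEL_ALL=4 in place of LOG_LEVEL_QUAL=0.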
Debugging Specific Issues¶
PCI bandwidth issues:
Verify that the entire path between host and device, including PCI bridges, is Gen3 with x16 width.
Verify the correct NUMA node assignment in the hl_qual report.
Verify correct setup of PCI retimers between host and device.
Check that all links are up (appears in the
Check for any assertions due to wait loops.
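The Gen3 x16 check in the list above can be scripted against the output of lspci -vv, where a PCIe Gen3 link reports a speed of 8GT/s on its LnkSta line. The following sketch is not part of hl_qual, and the function name check_gen3_x16 is hypothetical. It reads lspci -vv text on standard input and fails if any LnkSta line shows a different speed or width, or if no LnkSta line is present.

```shell
# Hypothetical helper, not part of hl_qual.
# Reads `lspci -vv` output on stdin; succeeds only if every LnkSta line
# reports 8GT/s (PCIe Gen3) and x16 width, and at least one such line exists.
check_gen3_x16() {
    awk '
        /LnkSta:/ {
            seen = 1
            if ($0 !~ /Speed 8GT\/s/ || $0 !~ /Width x16/) {
                print "degraded link: " $0
                bad = 1
            }
        }
        END { if (bad || !seen) exit 1 }
    '
}
```

For example, run sudo lspci -vv -s BDF | check_gen3_x16 for the device and again for each bridge on the path, where BDF is a placeholder for the PCI address of the component being checked. Root privileges are usually needed for lspci to show the link status.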
Power stress issues:
Check whether clock throttling is followed by a device reset.
Check the reported temperature and compare it to the maximum allowed temperature.