hl_qual Expected Output and Failure Debug
On this Page
hl_qual Expected Output and Failure Debug¶
Expected Output¶
hl_qual generates a test report which can be printed to the screen or in a log file.
The test report log file naming convention is ServerName_hl_qual_report_TIMESTAMP.log
(for example,
k24-u18-60a_hl_qual_report_Mon_Dec_4_21-44-01_2020.log
). Fig. 15 shows an example of an hl_qual report.
For more information on the hl_qual test report, refer to hl_qual Report Structure.
Failure Debug¶
Due to the complexity of server systems, a malfunction of some HW modules could influence the performance of several tests. Testing basic HW components, such as PCI or Serdes, should be done first, followed by more complex tests such as power stress and functional tests.
Fig. 16 shows the recommended test plan using the hl_qual tool.
It is recommended to execute long test runs especially when using power stress, EDP and functional tests (including all sub-tests). Running these tests for more than 20 minutes may expose cooling problems and overheating issues.
Note
In case of test failures, generating log files is recommended.
Generating Log Files¶
As part of the test report, you can generate the reports listed below and send the reports to Intel Gaudi for further support:
Log File |
Description |
---|---|
|
Running hl_qual with ./hl_qual -gaudi -c all -rmod serial -t 5 -p -b -dmesg
|
|
To generate the hl-smi -q > hl-smi.log
|
Test plugin log
( |
To enable ENABLE_CONSOLE=true LOG_LEVEL_QUAL=0 ./hl_qual -gaudi -c all
-rmod parallel -t 30 -f -serdes_type allgather
| tee hl_qual.log
The variable hl_qual has 5 different log levels:
|
Synapse logs
( |
As the software code layer is one of the major building blocks used
by hl_qual, in case of reported errors, it is highly recommended
to run hl_qual with the enabled logs. To enable
ENABLE_CONSOLE=true LOG_LEVEL_ALL=4 ./hl_qual -gaudi -c all
-rmod parallel -t 30 -f -serdes_type
allgather | tee hl_qual.log
|
|
lspci -vvnn
|
Debugging Specific Issues¶
Test Plugin |
Description |
---|---|
PCI bandwidth |
|
Serdes base |
|
Power stress |
|