hl_qual Report Structure

hl_qual generates a test report composed of sub-reports. The name of the report file, for example k501-u18-001-dev_hl_qual_report_Sat_Dec_4_09-15-16_2021.log, includes the tested server name, the string hl_qual_report and a timestamp with the date and time.
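As an illustration of the naming scheme, the following minimal Python sketch splits a report file name into the server name and the timestamp. The pattern is inferred from the single example above, so the strptime format string and the helper name parse_report_name are assumptions, not a documented hl_qual interface.

```python
from datetime import datetime
from pathlib import Path

# Fixed string that separates the server name from the timestamp.
MARKER = "_hl_qual_report_"

def parse_report_name(filename):
    """Return (server_name, timestamp) for an hl_qual report file name.

    Assumes the layout <server>_hl_qual_report_<Day>_<Mon>_<D>_<HH-MM-SS>_<YYYY>.log,
    as inferred from the example file name in the text.
    """
    stem = Path(filename).stem  # drop the .log extension
    server, marker, stamp = stem.partition(MARKER)
    if not marker:
        raise ValueError(f"not an hl_qual report name: {filename!r}")
    # Day and month names are parsed with the current (default C) locale.
    timestamp = datetime.strptime(stamp, "%a_%b_%d_%H-%M-%S_%Y")
    return server, timestamp

server, when = parse_report_name(
    "k501-u18-001-dev_hl_qual_report_Sat_Dec_4_09-15-16_2021.log"
)
print(server, when.isoformat())
```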

hl_qual writes its reports and log files to the $HABANA_LOGS/qual directory, where the base path is taken from the HABANA_LOGS environment variable. If HABANA_LOGS is not defined, hl_qual sets it to /var/log/habana_logs and writes the files to /var/log/habana_logs/qual.
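The directory lookup described above can be sketched in a few lines of Python, for example to locate the reports from a post-processing script. This is only an approximation of the documented behavior: the helper name qual_log_dir is hypothetical, and treating an empty HABANA_LOGS value the same as an undefined one is an assumption.

```python
import os
from pathlib import PurePosixPath

# Fallback base directory named in the text.
DEFAULT_HABANA_LOGS = "/var/log/habana_logs"

def qual_log_dir(env=None):
    """Return the directory where hl_qual reports are expected.

    env is a mapping of environment variables; it defaults to os.environ.
    Assumption: an empty HABANA_LOGS is handled like an undefined one.
    """
    env = os.environ if env is None else env
    base = env.get("HABANA_LOGS") or DEFAULT_HABANA_LOGS
    return PurePosixPath(base) / "qual"

print(qual_log_dir({}))                            # HABANA_LOGS not defined
print(qual_log_dir({"HABANA_LOGS": "/tmp/logs"}))  # HABANA_LOGS defined
```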

Device Identification Report

This report contains the PCI bus ID of all devices identified according to the device identification switch entered (for example, -gaudi). It also contains device status reports that verify whether each device is in an operational state. If hl_qual finds that a device is not in an operational state, the test will not be executed.

../../_images/device_indentification_report.PNG

Figure 8 Device Identification Report

hl-smi Short Report

The hl-smi report provides an identification card for all available devices including their bus_id, serial number, device index, module ID and device type.

../../_images/hl_smi_short_report.JPG

Figure 9 hl-smi Short Report

Operational Status Report

The operational status report contains the results of the operational test conducted on all detected Gaudi devices within the system. A device fails the test if either of the following conditions is true:

  • Memory usage exceeds the idle time memory usage threshold.

  • The operational indication, as set by the Intel Gaudi Linux kernel driver, is either unavailable or indicates that the device is not operational.
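The two conditions listed above can be expressed as a small decision function. The sketch below is illustrative only: the DeviceState fields, the megabyte units and the threshold value are hypothetical, and hl_qual's internal check is not documented here.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class DeviceState:
    """Hypothetical snapshot of one device, used for illustration only."""
    bus_id: str
    memory_used_mb: float
    # True/False as reported by the kernel driver, None if unavailable.
    operational: Optional[bool]

def passes_operational_check(state, idle_threshold_mb):
    """Return True only if neither failure condition applies."""
    # Condition 1: memory usage exceeds the idle-time threshold.
    if state.memory_used_mb > idle_threshold_mb:
        return False
    # Condition 2: the driver indication is missing or reports not operational.
    if state.operational is not True:
        return False
    return True

idle = DeviceState("0000:19:00.0", memory_used_mb=512.0, operational=True)
busy = DeviceState("0000:1a:00.0", memory_used_mb=9000.0, operational=True)
print(passes_operational_check(idle, idle_threshold_mb=1024.0))
print(passes_operational_check(busy, idle_threshold_mb=1024.0))
```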

../../_images/operatinal_status_report.JPG

Figure 10 Operational Status Report

NUMA Node Report

The report contains the identified NUMA nodes, their CPU sets and the allocation of Gaudi devices per NUMA node. If the tested server contains a single NUMA node, NUMA considerations do not apply to the CPU-to-device allocation.

Note

When running on a virtual machine, the NUMA node data seen in the VM usually does not correctly reflect the NUMA layout of the bare metal machine.
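To cross-check the device-to-node allocation independently of hl_qual, the Linux sysfs attribute numa_node of each PCI device can be read. The Python sketch below groups PCI bus IDs by that value. It is not how hl_qual builds its report, the bus IDs shown are placeholders, and the helper name group_by_numa_node is hypothetical. The kernel reports -1 when it has no NUMA information for a device, which is common inside virtual machines.

```python
from collections import defaultdict
from pathlib import Path

def group_by_numa_node(bus_ids, sysfs_root="/sys/bus/pci/devices"):
    """Group PCI bus IDs by the NUMA node reported in sysfs.

    Devices whose numa_node file is missing or unreadable are grouped
    under -1, the same value the kernel uses for "no NUMA information".
    """
    groups = defaultdict(list)
    for bus_id in bus_ids:
        node_file = Path(sysfs_root) / bus_id / "numa_node"
        try:
            node = int(node_file.read_text().strip())
        except (OSError, ValueError):
            node = -1
        groups[node].append(bus_id)
    return dict(groups)

# Placeholder bus IDs; substitute the IDs listed in the device identification report.
print(group_by_numa_node(["0000:19:00.0", "0000:b3:00.0"]))
```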

../../_images/numa_node_device_allocation.PNG

Figure 11 NUMA Node Report

hl_qual Version and Command Line Report

Reports the hl_qual package version and specifies the command line used.

../../_images/command_line_report.JPG

Figure 12 Command Line Report

Tested Device Report

The report contains the following information:

  • The specific data of the device: serial number, PCB assembly version, device name.

  • The time the test starts and stops.

  • Internal test plugin data accumulated during the test run, such as pass/fail data and general test stages.

../../_images/device_test_data.JPG

Figure 13 Tested Device Report

Closing Report

The report contains the following items:

  • General statistics and metrics gathered during the test, such as power usage, clocks and temperature.

  • Pass/fail report per tested device.

  • General pass/fail report. For an overall pass result, all tests should pass on all devices.

../../_images/closing_report.JPG

Figure 14 Closing Report