Qual Package Installation Validator

The Qual package installation validator checks that the Qual package is correctly installed and ready to run the Qual tests. Note that this script validates the Qual package only.

Validation Areas

The Qual package depends on the installation and configuration of several components. The list below outlines the different parts validated by this script:

  • Installed Intel SW package - Verifies that the following components are installed and that they all have the same version:

    • Firmware - Reads the device’s flashed firmware and compares it to the version installed on the host.

    • Firmware-ODM

    • Firmware tools

    • Linux Kernel Mode Driver

    • RDMA_CORE

    • HL-THUNK

    • Graph compiler

    • Qual-Workloads

    • Qual

  • External libraries:

    • MPIRUN

    • LS-SENSORS

  • Dynamically Linked Library Integrity - Verifies that all required shared object (SO) files for Qual can be loaded and ensures there are no missing files.

  • Environment Variables - Ensures all necessary environment variables are defined to support Qual’s operation.

  • Host Memory Configuration - Checks the amount of host memory, huge pages allocation, and shared memory status, confirming there are no residual artifacts from previous runs.

  • Host CPU Governor Status - Verifies the status of the host CPU governor.

  • Basic Device Health and Operational Status - Assesses the health and operational status of essential devices.
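Two of the checks listed above, version consistency and huge pages allocation, can be illustrated with a short Python sketch. This is not the validator's actual implementation. The function names are hypothetical, the package list is assumed to have the same shape as the 'packages-versions' entries in the report, and the huge pages count is assumed to come from the HugePages_Total field of /proc/meminfo.

```python
from collections import Counter


def find_version_mismatches(packages):
    """Return the packages whose version differs from the most common one.

    `packages` is a list of dicts with 'package_name' and 'version' keys
    (an assumed shape, modeled on the validator's report).
    """
    if not packages:
        return []
    counts = Counter(pkg["version"] for pkg in packages)
    expected_version, _ = counts.most_common(1)[0]
    return [pkg for pkg in packages if pkg["version"] != expected_version]


def parse_hugepages_total(meminfo_text):
    """Return HugePages_Total from the text of /proc/meminfo, or None."""
    for line in meminfo_text.splitlines():
        if line.startswith("HugePages_Total:"):
            return int(line.split(":", 1)[1].strip())
    return None


if __name__ == "__main__":
    installed = [
        {"package_name": "habanalabs-firmware", "version": "1.20.1-69"},
        {"package_name": "habanalabs-qual", "version": "1.20.1-69"},
        # Hypothetical mismatching version, for illustration only.
        {"package_name": "habanalabs-thunk", "version": "0.0.0-0"},
    ]
    print(find_version_mismatches(installed))

    with open("/proc/meminfo") as meminfo:
        print(parse_hugepages_total(meminfo.read()))
```

Treating the most common version as the expected one is a design choice of this sketch. A stricter check could compare every package against a single reference version instead.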

Switches and Usage

The qual_pckg_validator.py script can be found under /opt/habanalabs/qual/diag_tool/automation. The following is a run command example:

 python qual_pckg_validator.py --core_type <gaudi2 | gaudi3> --system_type <server | standalone> --num_of_devices <0..7> --output <path of the report>


 options:
   -h, --help            show this help message and exit
   --core_type CORE_TYPE
                         Core type (e.g., gaudi2, gaudi3)
   --system_type SYSTEM_TYPE
                         System type (e.g., server, standalone)
   --num_of_devices NUM_OF_DEVICES
                         Number of devices
   -o OUTPUT, --output OUTPUT
                         Output file path for the environment report

Options:

  • --core_type <core type> - Core type. Can be gaudi2 or gaudi3.

  • --system_type <system> - System type. Can be server or standalone. A server contains up to 8 cards with internal SERDES interconnects, while a standalone system is usually a collection of cards without any connectivity between them.

  • --num_of_devices <n> - Number of devices in the system. This value is needed to ensure that all devices are operational.

  • --output <path> - Full path of the report, including the file name. This is used as the report name.
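When the validator is called from automation, the options described above can be assembled programmatically. The following is a minimal sketch, assuming the script sits at the documented location. The helper name build_validator_command, the device count, and the report path /tmp/qual_env_report.txt are hypothetical and chosen for illustration only.

```python
import shlex

# Script location as documented above.
SCRIPT_PATH = "/opt/habanalabs/qual/diag_tool/automation/qual_pckg_validator.py"

VALID_CORE_TYPES = ("gaudi2", "gaudi3")
VALID_SYSTEM_TYPES = ("server", "standalone")


def build_validator_command(core_type, system_type, num_of_devices, output_path):
    """Build the argument list for one validator run (hypothetical helper)."""
    if core_type not in VALID_CORE_TYPES:
        raise ValueError(f"core_type must be one of {VALID_CORE_TYPES}")
    if system_type not in VALID_SYSTEM_TYPES:
        raise ValueError(f"system_type must be one of {VALID_SYSTEM_TYPES}")
    return [
        "python", SCRIPT_PATH,
        "--core_type", core_type,
        "--system_type", system_type,
        "--num_of_devices", str(num_of_devices),
        "--output", output_path,
    ]


if __name__ == "__main__":
    # Illustrative values only; adjust them to the system under test.
    command = build_validator_command(
        "gaudi3", "standalone", 2, "/tmp/qual_env_report.txt"
    )
    # Print the command instead of running it; pass `command` to
    # subprocess.run(command, check=True) to execute it.
    print(shlex.join(command))
```

Building the command as a list rather than a single string avoids shell quoting problems if the report path contains spaces.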

After running the script, a textual report will appear in your terminal.

Example textual report:

---
'package-test-status': 'passed'
'lib-dependency-test-status': 'passed'
'env-vars-test-status': 'passed'
'python-libs-test-status': 'passed'
'host-mem-test-status': 'failed'
'shared-mem-test-status': 'passed'
'operational-test-status': 'Passed'
'cpu-governor-test-status': 'passed'
'reports':
- 'missing-packages-status': 'passed'
'version-test-status': 'passed'
'packages-versions':
- 'package_name': 'habanalabs-container-runtime'
    'version': '1.20.1-69'
- 'package_name': 'habanalabs-dkms'
    'version': '1.20.1-69'
- 'package_name': 'habanalabs-firmware-odm'
    'version': '1.20.1-69'
- 'package_name': 'habanalabs-firmware-tools'
    'version': '1.20.1-69'
- 'package_name': 'habanalabs-firmware'
    'version': '1.20.1-69'
- 'package_name': 'habanalabs-graph'
    'version': '1.20.1-69'
- 'package_name': 'habanalabs-qual-workloads'
    'version': '1.20.1-69'
- 'package_name': 'habanalabs-qual'
    'version': '1.20.1-69'
- 'package_name': 'habanalabs-rdma-core'
    'version': '1.20.1-69'
- 'package_name': 'habanalabs-tests'
    'version': '1.20.1-69'
- 'package_name': 'habanalabs-thunk'
    'version': '1.20.1-69'
- 'package_name': 'habanatools'
    'version': '1.20.1-69'
- 'lib-dependency-test': 'passed'
'bin-path': '/opt/habanalabs/qual/gaudi3/bin'
'tested-files':
- 'file-name': '/opt/habanalabs/qual/gaudi3/bin/libconcurrency_edp.so'
- 'file-name': '/opt/habanalabs/qual/gaudi3/bin/hbm_inject_error'
- 'file-name': '/opt/habanalabs/qual/gaudi3/bin/backtrace_debug'
- 'file-name': '/opt/habanalabs/qual/gaudi3/bin/libhbm_plugin_gaudi2.so'
- 'file-name': '/opt/habanalabs/qual/gaudi3/bin/libconcurrency_e2e.so'
- 'file-name': '/opt/habanalabs/qual/gaudi3/bin/libpci_bw_plugin.so'
- 'file-name': '/opt/habanalabs/qual/gaudi3/bin/read_nics_status'
- 'file-name': '/opt/habanalabs/qual/gaudi3/bin/libconcurrency_powertest.so'
- 'file-name': '/opt/habanalabs/qual/gaudi3/bin/runner'
- 'file-name': '/opt/habanalabs/qual/gaudi3/bin/libfunctional_test_plugin.so'
- 'file-name': '/opt/habanalabs/qual/gaudi3/bin/libNIC_basetest_plugin.so'
- 'file-name': '/opt/habanalabs/qual/gaudi3/bin/libmemory_bw_plugin.so'
- 'file-name': '/opt/habanalabs/qual/gaudi3/bin/libtraining_plugin.so'
- 'file-name': '/opt/habanalabs/qual/gaudi3/bin/iocache_loader'
- 'file-name': '/opt/habanalabs/qual/gaudi3/bin/extractApp'
- 'file-name': '/opt/habanalabs/qual/gaudi3/bin/hl_qual'
- 'file-name': '/opt/habanalabs/qual/gaudi3/bin/hbm_interrupts'
- 'file-name': '/opt/habanalabs/qual/gaudi3/bin/pcie_aer_detector'
- 'file-name': '/opt/habanalabs/qual/gaudi3/bin/libser_plugin.so'
- 'file-name': '/opt/habanalabs/qual/gaudi3/lib/libcoral_core_gaudi3.so'
- 'file-name': '/opt/habanalabs/qual/gaudi3/lib/libhost2_bmon_parser_lib.so'
- 'file-name': '/opt/habanalabs/qual/gaudi3/lib/libarc_core_g3.so'
- 'file-name': '/opt/habanalabs/qual/gaudi3/lib/libhost2_host_pcie_driver.so'
- 'file-name': '/opt/habanalabs/qual/gaudi3/lib/libcoral_infra.so'
- 'file-name': '/opt/habanalabs/qual/gaudi3/lib/libhost2_DMATests.so'
- 'file-name': '/opt/habanalabs/qual/gaudi3/lib/libhost2_pcie_driver.so'
- 'file-name': '/opt/habanalabs/qual/gaudi3/lib/libhost2_SivalTpcElfReader.so'
- 'file-name': '/opt/habanalabs/qual/gaudi3/lib/libhost2_test_core_for_device_runtime.so'
- 'file-name': '/opt/habanalabs/qual/gaudi3/lib/libcoral_user_gaudi3.so'
- 'file-name': '/opt/habanalabs/qual/gaudi3/lib/libhost2_sival_tpc_kernels.so'
- 'file-name': '/opt/habanalabs/qual/gaudi3/lib/libSynapseMmeReference.so'
- 'file-name': '/opt/habanalabs/qual/gaudi3/lib/libhost2_NICTests.so'
- 'file-name': '/opt/habanalabs/qual/gaudi3/lib/libarcbp.so'
- 'file-name': '/opt/habanalabs/qual/gaudi3/lib/libhost2_sival_gaudi3_mme_lib.so'
- 'file-name': '/opt/habanalabs/qual/gaudi3/lib/libpsoc_g3.so'
- 'file-name': '/opt/habanalabs/qual/gaudi3/lib/libhost2_sival_concurrency_lib.so_BCK'
- 'file-name': '/opt/habanalabs/qual/gaudi3/lib/libhost2_rottweiler_testlib.so'
- 'file-name': '/opt/habanalabs/qual/gaudi3/lib/libmme_test_gaudi3_lib.so'
- 'file-name': '/opt/habanalabs/qual/gaudi3/lib/libhost2_logger.so'
- 'file-name': '/opt/habanalabs/qual/gaudi3/lib/libhost2_device_runtime.so'
- 'file-name': '/opt/habanalabs/qual/gaudi3/lib/libhost2_sival_concurrency_lib.so'
- 'file-name': '/opt/habanalabs/qual/gaudi3/lib/libhost2_sival_tpc_tests_core.so'
- 'file-name': '/opt/habanalabs/qual/gaudi3/lib/libhost2_sival_tpc_tests_core_ext.so'
- 'file-name': '/opt/habanalabs/qual/lib/libdevice_runtime.so'
- 'file-name': '/opt/habanalabs/qual/lib/libsival_tpc_kernels.so'
- 'file-name': '/opt/habanalabs/qual/lib/libTpcElfReader.so'
- 'file-name': '/opt/habanalabs/qual/lib/liblogger.so'
- 'file-name': '/opt/habanalabs/qual/lib/libbmon_parser_lib.so'
- 'file-name': '/opt/habanalabs/qual/lib/librottweiler_testlib.so'
- 'file-name': '/opt/habanalabs/qual/lib/libsival_tpc_tests_core_ext.so'
- 'file-name': '/opt/habanalabs/qual/lib/libtpc_tests_core_ext.so'
- 'file-name': '/opt/habanalabs/qual/lib/libDMATests.so'
- 'file-name': '/opt/habanalabs/qual/lib/libtpc_numerics.so'
- 'file-name': '/opt/habanalabs/qual/lib/libtest_core_for_device_runtime.so'
- 'file-name': '/opt/habanalabs/qual/lib/libsival_concurrency_lib.so'
- 'file-name': '/opt/habanalabs/qual/lib/libtpcsim_shared.so'
- 'file-name': '/opt/habanalabs/qual/lib/libtpc_tests_lib.so'
- 'file-name': '/opt/habanalabs/qual/lib/libSivalTpcElfReader.so'
- 'file-name': '/opt/habanalabs/qual/lib/libpcie_driver.so'
- 'file-name': '/opt/habanalabs/qual/lib/libNICTests.so'
- 'file-name': '/opt/habanalabs/qual/lib/libsival_tpc_tests_core.so'
- 'file-name': '/opt/habanalabs/qual/lib/libhost_pcie_driver.so'
- 'env-test': 'passed'
'test-vars':
- 'name': 'HABANALABS_HLTHUNK_TESTS_BIN_PATH'
    'status': 'passed'
    'reason': 'points to: /opt/habanalabs/src/hl-thunk/tests'
- 'name': 'HABANA_LOGS'
    'status': 'passed'
    'reason': 'fully writable by user space'
- 'name': 'RDMA_CORE_LIB'
    'status': 'passed'
    'reason': 'points to  /opt/habanalabs/rdma-core/src/build/lib'
- 'name': 'GC_KERNEL_PATH'
    'status': 'passed'
    'reason': 'points to  /usr/lib/habanalabs/libtpc_kernels.so'
- 'name': 'HABANA_SCAL_BIN_PATH'
    'status': 'passed'
    'reason': 'points to  /opt/habanalabs/engines_fw'
- 'python-libs-test': 'passed'
- &id001
'host-mem-status': 'passed'
'host-hugepages-status': 'failed'
'host-mem-size': '2113232068'
'host-hugepages-num': '24641'
- &id002
'shared-mem-status': 'passed'
- &id003
'device-identification-test':
    'status': 'Passed'
    'operational-devices':
    - '0000:19:00.0': 'Operational'
    - '0000:9b:00.0': 'Operational'
    - '0000:bb:00.0': 'Operational'
    - '0000:3b:00.0': 'Operational'
    - '0000:cb:00.0': 'Operational'
    - '0000:4c:00.0': 'Operational'
    - '0000:db:00.0': 'Operational'
    - '0000:5d:00.0': 'Operational'
'device-serial-report':
    'status': 'Passed'
'device-mem-report':
    'status': 'Passed'
'device-power-report':
    'status': 'Passed'
'device-clock-report':
    'status': 'Passed'
'device-fw-report':
    'status': 'Passed'
- 'cpu-governor-test': 'passed'
...
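A report of this size is easier to triage with a small script that lists only the checks that did not pass. The sketch below is one possible approach, not part of the Qual package. It scans the report line by line rather than relying on a YAML parser, and it assumes that every key ending in "status" holds a result and that any value other than "passed" (in any letter case) deserves attention. The function name find_failed_checks is hypothetical.

```python
import re

# Matches lines such as:  'host-mem-test-status': 'failed'
# Leading indentation and a "- " list marker are allowed.
STATUS_LINE = re.compile(r"^\s*(?:-\s+)?'([^']*status)':\s*'([^']*)'\s*$")


def find_failed_checks(report_text):
    """Return (key, value) pairs for every status line that is not 'passed'."""
    failed = []
    for line in report_text.splitlines():
        match = STATUS_LINE.match(line)
        if match and match.group(2).lower() != "passed":
            failed.append((match.group(1), match.group(2)))
    return failed


if __name__ == "__main__":
    # A few lines taken from the example report above.
    sample = """\
'package-test-status': 'passed'
'host-mem-test-status': 'failed'
'operational-test-status': 'Passed'
- &id001
'host-mem-status': 'passed'
'host-hugepages-status': 'failed'
"""
    for key, value in find_failed_checks(sample):
        print(f"{key}: {value}")
```

For the sample lines, the script prints host-mem-test-status and host-hugepages-status, which matches the example report: the host memory test failed because of the huge pages allocation. Nested 'status' keys are reported without their parent entry, so the surrounding lines of the report still need to be consulted for context.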