4. Qualification Library Guide (hl_qual Tool)

4.1. Overview

This document provides information on the Habana Labs Qualification Tool (hl_qual) for Gaudi.

The hl_qual tools package provides the tools required to qualify the usage and integration of Gaudi hardware platforms in your server design.

4.1.1. Applicable Documentation

For the full Gaudi installation procedure, please refer to the Installation Guide.

4.2. Package Content

The hl_qual tools package contains the applications (plugins) listed in the table below:

| Module | Description | Type |
| --- | --- | --- |
| hl_qual | Glue logic application that runs all test plugins located in /opt/habanalabs/qual/gaudi. See hl_qual Design. | Application |
| monitor | Standalone application that monitors execution metrics of multiple devices (device power usage, device clock, and device temperature). See hl_qual Monitor Textual UI. | Application |
| libfunctional_test_plugin.so | Dynamically linked library implementing the functional test plugin. See Functional Test. | Dynamically linked lib |
| libNIC_basetest_plugin.so | Dynamically linked library implementing the NIC base tests plugin. See Serdes Base Test. | Dynamically linked lib |
| libNIC_loopback_plugin.so | Dynamically linked library implementing the Loopback test plugin and HCL simple API test. See Serdes Loopback Test. | Dynamically linked lib |
| libpci_bw_plugin.so | Dynamically linked library implementing the PCI bandwidth test plugin. See PCI Bandwidth Test Design Considerations and Requirements. | Dynamically linked lib |
| libpower_stress_plugin.so | Dynamically linked library implementing the Power stress and EDP test plugin. See Power_stress Plugin Design Consideration and Responsibilities. | Dynamically linked lib |
| libmemory_bw_plugin.so | Dynamically linked library implementing the Memory bandwidth plugin. See Memory Bandwidth Plugin Design Consideration and Responsibilities. | Dynamically linked lib |
| libhbm_stress_plugin.so | Dynamically linked library implementing an HBM stress plugin. See HBM Stress Plugin Design Consideration and Responsibilities. | Dynamically linked lib |
| libtraining_plugin.so | Dynamically linked library implementing a complete ResNet-50 training stress test. See ResNet-50 Training Stress Test Plugin Design Consideration and Responsibilities. | Dynamically linked lib |
| training64.json, training256.json, validation64.json, validation256.json | ResNet-50 training stress test plugin configuration JSON files for the Aeon library data loader. | Aeon config file |
| hls1.json | HCL library configuration JSON file for the HLS1 server system. | HCL config file |
| hl_qual.ini | hl_qual INI configuration file that defines which test plugins are loaded and the basic behavior of the monitor sampling. | INI format config file |
| config.ini | Test plugin configuration INI file, provided as a usage example. Users may customize this file or generate a new one. See hl_qual ini Configuration File. | INI format config file |
| monitor.ini | INI configuration file that controls the sampling of the standalone monitor application and the hl_qual monitor add-on. | INI format config file |
| device.ini | INI configuration file that controls Habanalabs device setup. See device ini Configuration File. | INI format config file |

Figure 4.1 Qualification Package Content

4.3. hl_qual Design

This section describes the general design of the hl_qual package and the responsibilities of each software component. Throughout this section, the term plugin describes a testing library intended to perform a specific testing task, for example, the power stress test plugin. Each plugin conforms to specific design considerations as further described in this section.

All plugins are implemented as dynamically linked libraries which have a common API.

As shown in Package Content, the hl_qual package is composed of the hl_qual application and the plugin libraries shown in Figure 4.2. The hl_qual application is the portal for running all test plugins: it forks one process per test plugin and per device.

Figure 4.2 The Running Hierarchy between hl_qual and Plugins

Both hl_qual and the test plugins run as separate processes. Only the pass/fail indication of a specific test is passed from the test plugin to the hl_qual when the plugin ends its execution of the test.

The hl_qual can run the test plugins on multiple Gaudi devices using one of two running modes:

  • Serial mode – In this mode, the test plugins run one by one, with each test plugin starting once the test on the previously tested device is complete. A single Gaudi device is tested at a time. This running mode is recommended when running the PCI bandwidth test to calculate the peak achievable PCI bandwidth between the host and the device.

  • Parallel mode – In this mode, all test plugin processes are forked simultaneously and run in parallel, with each test plugin running on the Gaudi device allocated to it. The execution of the test plugins is not synchronized, which means each plugin may start and end at different times. The hl_qual application considers testing complete once the last running test plugin ends.
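The difference between the two running modes can be illustrated with a minimal Python sketch. It is not the hl_qual implementation: the child process is a stand-in that simply exits with status 0, and the function names run_plugin, run_serial and run_parallel are hypothetical. It only shows one process per device, with nothing but a pass/fail indication returned to the caller.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def run_plugin(device: str) -> bool:
    """Run one stand-in test process for a device; True means pass."""
    # Stand-in child process (assumption): a real test plugin would run here.
    cmd = [sys.executable, "-c", "import sys; sys.exit(0)", device]
    return subprocess.run(cmd).returncode == 0


def run_serial(devices):
    # One device at a time: the next process starts only after the previous one ends.
    return {dev: run_plugin(dev) for dev in devices}


def run_parallel(devices):
    # All processes are started together; the run is complete when the last one ends.
    with ThreadPoolExecutor(max_workers=max(1, len(devices))) as pool:
        return dict(zip(devices, pool.map(run_plugin, devices)))


if __name__ == "__main__":
    ids = ["0000:07:00.0", "0000:08:00.0"]
    print(run_serial(ids))
    print(run_parallel(ids))
```

In both functions the caller sees only a boolean per device, which mirrors the design point that the interpretation of test results stays inside each plugin.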

4.3.1. Pass/Fail Criteria Considerations

Each test plugin is responsible for setting and checking the pass/fail criteria. The hl_qual application receives a pass/fail indication from each test plugin but is not responsible for the interpretation of the test results. This allows the hl_qual application to have a similar interface to all the test plugins.

4.3.2. hl_qual Design and Responsibilities

The hl_qual application is the glue logic responsible for running multiple tests on multiple devices. The following list describes the tasks that the hl_qual executes while running these tests:

  • Identification of available Gaudi devices in the system.

  • Generation of a command line for the test plugins.

  • Running the test plugins on all available devices under the running modes described in this section.

  • Capturing system parameters for all available devices (temperature, clock, and power usage) and displaying them on screen using the monitor add-on.

  • Waiting for test results from all running test plugins (Pass/Fail indications).

4.3.3. HBM Stress Plugin Design Consideration and Responsibilities

The HBM stress plugin is a stress test based on memory transfers using DMA. The test includes resetting the Gaudi cards between iterations and testing for any interrupts during data transfers.

To use the plugin, run the test with superuser permissions and load the habanalabs driver with the required flags, as shown in HBM Stress Test Plugin Switches and Parameters.

4.3.4. ResNet-50 Training Stress Test Plugin Design Consideration and Responsibilities

The ResNet-50 training stress test plugin runs a functional ResNet-50 training test.

4.3.4.1. ResNet-50 Training Stress Test Plugin Testing Modes

The ResNet-50 training stress test plugin has two testing modes, each with different batch size options:

  • Training tests:

    • 64 batch size

    • 256 batch size

  • Validation tests:

    • 64 batch size

    • 256 batch size

4.3.4.2. Pass/Fail Criteria

The test fails if data is not received as expected; otherwise, the test passes.

4.3.5. Memory Bandwidth Plugin Design Consideration and Responsibilities

The memory bandwidth test plugin is an hlthunk-based DMA bandwidth measurement test. The test includes PCI bandwidth tests and a variety of DRAM and SRAM memory transfer tests. The tests are built on top of the hlthunk API, a lower-level API wrapping the Habana driver.

4.3.5.1. Memory Bandwidth Testing Modes

  • PCI tests:

    • HOST ==> DRAM

    • DRAM ==> HOST

    • DRAM <==> HOST

  • Device memory tests:

    • SRAM ==> DRAM

    • DRAM ==> SRAM

    • DRAM ==> DRAM

4.3.5.2. Pass/fail Criteria

The PCI sub-test uses pass/fail criteria similar to those of the Synapse PCI load test:

  • Unidirectional download from host to device with an expected bandwidth of 11.6 GB/s, assuming a CPU with a Gen-3 PCI link.

  • Unidirectional upload from device to host with an expected bandwidth of 12.9 GB/s, assuming a CPU with a Gen-3 PCI link.

  • Bidirectional test which calculates the bandwidth on a simultaneous upload and download with an expected bandwidth of 19.9 GB/s.

The calculated pass/fail thresholds are theoretical; hence, the PCI test allows a 10% degradation.
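As a worked illustration of the 10% allowance, the following Python sketch derives a minimum passing bandwidth from each expected value. It assumes the allowance is applied as expected bandwidth multiplied by 0.9; the guide does not spell out the exact formula the plugin uses, and the names below are hypothetical.

```python
# Expected bandwidths in GB/s, taken from the criteria listed above.
EXPECTED_GBPS = {"download": 11.6, "upload": 12.9, "bidirectional": 19.9}
ALLOWED_DEGRADATION = 0.10  # 10% allowable degradation


def min_passing_bandwidth(expected_gbps, degradation=ALLOWED_DEGRADATION):
    """Lowest measured bandwidth that still passes (assumed formula)."""
    return round(expected_gbps * (1.0 - degradation), 2)


def passes(sub_test, measured_gbps):
    """Compare a measured bandwidth against the degraded threshold."""
    return measured_gbps >= min_passing_bandwidth(EXPECTED_GBPS[sub_test])


if __name__ == "__main__":
    for name, expected in EXPECTED_GBPS.items():
        print(name, min_passing_bandwidth(expected))
```

Under this assumption, a measured upload bandwidth of 12.0 GB/s passes, while a measured download bandwidth of 10.0 GB/s does not.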

4.3.6. Power_stress Plugin Design Consideration and Responsibilities

The power stress plugin does the following:

  • Conducts multi-level power stress test.

  • Conducts multi-level power EDP test.

The power levels for both the power stress and EDP tests are configurable via command line settings and correspond to the following levels:

  1. Extreme – measured power level: 340 [watt]

  2. Mid – measured power level: 260 [watt]

  3. Low – measured power level: 140 [watt]

Note

The above numbers were achieved on an HL-205 device running at a max frequency of 1.95 GHz.

4.3.6.1. Power Stress Test

When running a power stress test, the power stress plugin puts the device under a constant, uniform power load. The test can run for many hours and exercises the following device functionality:

  • Thermal stress – Cooling system functionality, heat dissipation, and thermal protection mechanisms can be checked by running the power stress plugin at the extreme power level.

  • Power limiter and clock relaxation mechanisms – The power limiter is a mechanism that limits the power usage below 300 [watts]. When the power limit is reached, the device clocks are lowered. To test the power limiter mechanism, the plugin must run at the extreme power level.

  • Long work periods in typical power workloads (extreme, low).

When running this test, reaching a specific power load depends on external system conditions such as the number of devices used, ambient system temperature, cooling system design, and device placement within the rack.

Note

Reaching a stable power level may take about 30 seconds.

4.3.6.2. EDP Test

The EDP test verifies the functionality of the Gaudi power supply by generating a fast power usage transient from low power to high power and vice versa.

The power cycles repeat throughout the test’s execution time.

Figure 4.3 EDP Test Power Cycles

The test’s configurable parameters are the different power levels. The power cycle consists of high-power level usage and low-power level usage (idle state) as shown in Figure 4.3. All configurable parameters are listed in EDP Stress Test Plugin Switches and Parameters.

4.3.6.3. Pass/fail Criteria

Both EDP and power stress tests must run until completion without overheating or power supply failures.

4.3.7. PCI Bandwidth Test Design Considerations and Requirements

The PCI bandwidth test plugin measures the PCI bandwidth when moving data between the host and the device HBM memory. The hl_qual can run this test using either of the two running modes specified in hl_qual Design.

When running this test in serial mode, each device should achieve maximal bandwidth.

The PCI load plugin pass/fail threshold for this mode is 11.04 GB/s, assuming the host CPU port is PCIe Gen3 x16.

In parallel mode, there is no pass/fail criteria verification as test results may vary according to the customer’s platform design. The benchmark only reports achievable bandwidth per device.

The test run is partitioned into three sub-tests: upload, download, and bidirectional. This affects the total PCI test duration. For example, if you run the test for 20 seconds in serial mode on a machine with 8 Gaudi devices, the actual test duration will be:

  • Total test duration = 20 * 3 * 8 = 480 seconds.

To average the bandwidth calculation, 10-20 seconds is sufficient (about 200 GB for upload/download). Setting the duration beyond 1 minute is not needed for averaging and only lengthens the total test duration.
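The duration arithmetic above can be sketched as a small Python helper. The function name and parameters are hypothetical, and the parallel branch assumes that all devices run at the same time, so the device count does not multiply the duration in that mode.

```python
def total_pci_test_seconds(seconds_per_sub_test, devices, sub_tests=3, serial=True):
    """Estimate the wall-clock duration of a PCI bandwidth run.

    seconds_per_sub_test: the value given with -t
    devices: number of Gaudi devices under test
    sub_tests: upload, download and bidirectional by default
    serial: True for serial mode; False assumes all devices run simultaneously
    """
    device_factor = devices if serial else 1
    return seconds_per_sub_test * sub_tests * device_factor


if __name__ == "__main__":
    # 20 seconds per sub-test, 3 sub-tests, 8 devices tested one after another.
    print(total_pci_test_seconds(20, 8))
```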

4.3.7.1. Pass/fail Criteria

The PCI test plugin runs multiple tests to check the download and upload bandwidth with the following pass/fail criteria:

  • Unidirectional download from host to device with an expected bandwidth of 11.6 GB/s, assuming a CPU with a Gen-3 PCI link.

  • Unidirectional upload from device to host with an expected bandwidth of 12.9 GB/s, assuming a CPU with a Gen-3 PCI link.

  • Bidirectional test which calculates the bandwidth on a simultaneous upload and download with an expected bandwidth of 19.9 GB/s.

The calculated pass/fail thresholds are theoretical; hence, the PCI test allows a 10% degradation.

4.3.8. Serdes Base Test

The Serdes base test performs basic sanity tests on the Serdes, ensuring the following parts are tested:

  • Port connectivity

  • RX/TX data integrity

RX/TX data integrity is checked by transmitting pre-calculated random data and comparing the received data against reference data. The default transmitted buffer size is 128 MB to ensure a variety of transmitted data. During the test, the TX buffer is transmitted multiple times according to the test execution time you set. The test can check inter-box connectivity between the different devices using a pairs test. A pairs test takes all device pairs and runs a data integrity test on each pair of devices.

4.3.8.1. Serdes Base Test Testing Modes

The test plugin consists of the below sub-testing modes:

  1. Loopback test - Transmits and receives from the same HCL rank (device) by using a loopback dongle connected to all ports.

  2. External loopback - This test should be used when the port configuration has a few internal ports used to connect the different devices within the server box and external ports used to connect devices from different boxes. To run this test, the external ports must be fitted with loopback dongles while the internal ports should be disabled using the config INI. For further details, refer to hl_qual and Test Plugin Configuration Files.

  3. Pair test - This test runs over all available device pairs and tests connectivity. The test plugin receives an HCL library configuration JSON file that maps port connectivity between devices. See HCL JSON Config File Format.

  4. All-Gather - This test runs over all available devices and computes the bandwidth of all-gather functionality. The test plugin receives an HCL library configuration JSON file that maps port connectivity between devices. See HCL JSON Config File Format.

  5. All-Reduce - This test runs over all available devices and computes the bandwidth of all-reduce functionality. The test plugin receives an HCL library configuration JSON file that maps port connectivity between devices. See HCL JSON Config File Format.

4.3.8.2. Pass/fail Criteria

The pass/fail criteria consist of two sub-criteria:

  • Connectivity - The test fails if the destination rank does not respond.

  • Data integrity - The test fails if data is not received as expected, compared with the reference data.

4.3.9. Serdes Loopback Test

The Serdes loopback test verifies the Serdes functionality of each Gaudi device tested. The test closes a loop by using a terminating dongle.

The test can run in parallel on all available Gaudi devices in the system.

The simple loopback test is a non-failing test. It only reports the achievable bandwidth using the different HCL library interfaces - read/write vs send/receive. This test also enables the usage of stream synchronization. See Habana Communication Library (HCL) API Reference.

Note

The loopback test must be used with the loopback dongle, otherwise the test will fail.

4.3.10. Functional Test

The functional test runs all available hardware components on the Gaudi SOC to test the functionality and the interaction between the different units during parallel execution. When using parallel execution, the test plugin will run on all hardware components simultaneously.

The following are the tested units:

  • PCI links

  • DMA engines – moving data between:

    • PCI ==> HBM, HBM ==> PCI

    • HBM ==> SRAM, SRAM ==> HBM

  • MME engines

  • TPC engines

  • Serdes

4.3.10.1. Functional Test Testing Modes

The functional test contains the following sub-test modes:

  1. Simple mode – The test runs a topology which checks the PCI, DMA, MME and TPC units. Serdes communication is not tested.

  2. Loopback – On top of the simple topology, a Serdes loopback communication test is added. This test includes a verification topology that is executed on the device. When running this test mode, the device must be connected (RX to TX) using a loopback dongle.

  3. AllGather - This mode is built on top of the simple functional test. It enables an advanced AllGather Serdes test. During this test, each Gaudi device sends data to all other available devices in the server. The received data from all the Gaudi devices is verified using a predefined topology and compared with expected data to verify RX/TX integrity.

  4. AllReduce - This mode is built on top of the simple functional test. It enables an advanced AllReduce Serdes test. During this test, each Gaudi device sends data to all other available devices in the server. Upon receiving the messages from all other Gaudi devices, each device performs a reduction by summing all messages and placing the result into the RX matrix. The results matrix is verified using a verification topology running TPC code. The final result is sent to the host for verification.

The functional test plugin builds a test topology composed of the following computation nodes:

  • Large GEMM nodes

  • Large Matrix add nodes

  • Embedding sum nodes

  • Large sub nodes

  • Reduce_L1_norm nodes

On each test iteration, the test application runs the topology with pre-calculated inputs and compares the output against a pre-calculated reference.

4.3.10.2. Pass/fail Criteria

The pass/fail criteria consist of three sub-criteria:

  • The calculated value of each topology launch must be identical to a pre-calculated reference.

  • All RX/TX transmissions received from other devices must be bit-exact to the expected matrix values.

  • The execution throughput [executions/second] must not fall below a predefined threshold.

4.4. hl_qual Monitor Textual UI

The monitor is a textual UI that enables the monitoring of Gaudi run parameters such as temperature, power usage, clock, ECC errors and more. The monitor also shows the test progress via a progress bar as well as the expected test completion time.

Figure 4.4 Monitor Textual UI Interface

You can disable the monitor screen printout by using the -dis_mon switch. This option is important when you run the hl_qual in a scripting environment.

Note

Disabling the monitor does not stop parameter collection, as the parameters are needed for the hl_qual’s final test report. You may choose which parameters are collected by editing a monitor INI configuration file. For more information about the monitor configuration file, refer to monitor ini Configuration File.

4.5. Running hl_qual for Gaudi

The hl_qual is a command line based tool where each test variant is run from a command line terminal. There are two means of passing parameters to the hl_qual and the test plugins:

  • Command line switches and parameters.

  • Configuration file - The hl_qual SW package uses INI configuration files.

Whenever specific parameters can be delivered from both the command line and configuration file, the command line always overrides the configuration file if both options exist in that specific execution run.
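The precedence rule can be pictured with the following Python sketch, in which values read from an INI file are overridden by any value supplied on the command line. The section name "settings", the key names, and the function name are hypothetical and are not taken from the hl_qual configuration files.

```python
import configparser


def effective_settings(ini_text, cli_overrides, section="settings"):
    """Merge INI values with command line values; the command line wins."""
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    merged = dict(parser[section]) if parser.has_section(section) else {}
    # Only options actually given on the command line (not None) override the file.
    merged.update({k: str(v) for k, v in cli_overrides.items() if v is not None})
    return merged


if __name__ == "__main__":
    ini = "[settings]\ntest_time = 40\nbuffer_size = 200\n"
    # test_time is given on the command line, buffer_size is not.
    print(effective_settings(ini, {"test_time": 20, "buffer_size": None}))
```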

4.5.1. hl_qual and Test Plugins Command Line Options

The hl_qual and test plugins switches and parameters are partitioned into three groups:

  • hl_qual switches and parameters - These parameters are consumed by the hl_qual process only and do not affect the test plugins.

  • Test plugin switches and parameters - These are consumed by the test plugin only and do not affect the hl_qual process.

  • Common parameters - These parameters and switches are consumed by both the hl_qual process and test plugin execution, for example -t.

The below sections further describe all switches and parameters according to the test plugins outlined in hl_qual Design.

4.5.1.1. Getting Help

The following command line prints out a usage help message on screen:

./hl_qual -h

The message includes specific hl_qual switches as well as the switches of all available and loaded plugins.

Figure 4.5 hl_qual and Plugin Usage Printout

All applicable switches are shown in the example below. Optional switches and parameters are placed within square brackets.

hl_qual -gaudi -c <pci bus id> [-t <time in seconds>]  -rmod <serial | parallel>  [-dis_mon] [-mon_cfg <monitor INI path>]
         <-f | -p | -e | -s | -slb | -nic_base | -trainingApp | -hbm_stress>  [-m] [-sz <size in bytes>] [-wr] [-b] [-n] [-size <size in bytes> ]
         [-serdes_type <none | loopback | allgather| allreduce>] [-j <allgather/allreduce json config path>] [-pl_cfg <plugin INI config path>]
         [-tw <time in milliseconds>] [-ts <time in milliseconds>] [-l <extreme | low>] <loopback | loopback_ext | pairs> [-l <extreme | mid | low>] [-enable_serr]

For optional switches and parameters, the sections below state the default values when these switches are not specified in the command line.

4.5.1.2. hl_qual Switches and Parameters

The following lists the hl_qual switches and parameters:

  • -gaudi - This switch indicates that a Gaudi device should be detected and used for testing.

./hl_qual -gaudi -c all -rmod parallel -t 20 -f

Note

IMPORTANT: This is a mandatory switch. hl_qual will issue an error if this switch is missing.

  • -dis_mon - This switch stops the monitor printout to the screen. This option is useful when running the hl_qual inside a script.

./hl_qual -gaudi -dis_mon -c all -rmod parallel -t 20 -f

Note

IMPORTANT: Values such as power, clock and temperature are still sampled but are not displayed on the screen.

  • -mon_cfg <path to monitor config file> - This switch enables using a different monitor configuration file instead of the default monitor.ini.

./hl_qual -gaudi -c all -mon_cfg my_mon.ini -rmod serial -t 20 -p -b

  • -c <pci bus id> - This switch specifies the bus IDs of the Gaudi devices under test. There are two applicable formats:

    • all - for example:

    ./hl_qual -gaudi -c all -rmod serial -t 20 -b -p
    
    • comma delimited bus ID list - for example:

    ./hl_qual -gaudi -c 0000:07:00.0,0000:08:00.0 -rmod serial -t 20 -p -b
    

Note

IMPORTANT: In the comma delimited bus ID list, do not insert spaces between the commas and the bus ID strings.
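When the bus ID list is generated by a script, a small helper can strip stray whitespace and reject malformed entries before the value is passed to -c. The Python sketch below is one way to do this; it assumes the domain:bus:device.function form seen in the example above, and the helper name is hypothetical.

```python
import re

# Assumed bus ID shape: 4 hex digits, 2 hex digits, 2 hex digits, function 0-7.
BUS_ID = re.compile(r"^[0-9a-fA-F]{4}:[0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-7]$")


def bus_id_argument(bus_ids):
    """Return a comma delimited bus ID string with no spaces."""
    cleaned = [bus_id.strip() for bus_id in bus_ids]
    for bus_id in cleaned:
        if not BUS_ID.match(bus_id):
            raise ValueError(f"not a PCI bus ID: {bus_id!r}")
    return ",".join(cleaned)


if __name__ == "__main__":
    print(bus_id_argument([" 0000:07:00.0", "0000:08:00.0 "]))
```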

  • -t <time in seconds> - This switch specifies the test duration in seconds. This switch is consumed by the hl_qual and is also passed to the plugin under test. When this switch is omitted, the default value is set by the plugin under test.

  • -rmod <running mode> - This switch specifies the running mode of the plugin under test on the available Gaudi devices. There are two applicable modes:

    • parallel - The plugin under test will be executed on all available devices at the same time. For example:

    ./hl_qual -gaudi -c all -rmod parallel -t 20 -f
    
    • serial - The plugin under test will be run on one device at a time. For example:

    ./hl_qual -gaudi -c all -rmod serial -t 20 -f
    
  • -enable_serr - This switch enables the hl_qual SERR counter check, which verifies that no single ECC error occurs while running the plugins. The hl_qual reads the SERR counter data from the journalctl logs. Since the journal is not accessible from a Docker environment, this switch can only be used on a bare-metal machine or a VM.

    ./hl_qual -gaudi -dis_mon -c all -rmod parallel -t 20 -f -enable_serr
    

4.5.1.3. HBM Stress Test Plugin Switches and Parameters

The following lists the HBM stress test plugin switches and parameters:

sudo rmmod habanalabs

sudo modprobe habanalabs timeout_locked=500

sudo -E hl_qual -c <pci bus id>  -rmod <parallel>  [-dis_mon] [-mon_cfg <monitor INI path>]
        -hbm_stress  [-i <number of iterations>] [-pl_cfg <plugin INI config path>]

  • -hbm_stress - HBM Stress test selector.

  • -i - Number of iterations for the test. Each iteration takes about 6 minutes.

  • -pl_cfg <plugin INI config path> - Specify test INI configuration file path.

The example command below runs the test for 2 iterations, which takes about 12 minutes.

sudo rmmod habanalabs

sudo modprobe habanalabs timeout_locked=500

sudo -E ./hl_qual -gaudi -c all -rmod parallel -i 2 -hbm_stress

Note

IMPORTANT: Expected execution time depends on the number of iterations specified using the -i switch. Each iteration can take up to 6 minutes.

4.5.1.4. ResNet-50 Training Stress Test Plugin Switches and Parameters

The following lists the training test plugin switches and parameters:

hl_qual -c <pci bus id> -rmod <serial | parallel>  [-dis_mon] [-mon_cfg <monitor INI path>]
        -trainingApp [-bs <batch size 64 | 256>] [-type <application type training | validation>] [-epoch <number of epochs>] [-pl_cfg <plugin INI config path>]

  • -trainingApp - Training test plugin selector.

  • -bs <64 | 256> - Defines the training batch size.

    • 64 - Batch size 64

    • 256 - Batch size 256

If the value is not specified, the default value is 256.

  • -type <training | validation> - Defines the type of data set.

    • training - training set

    • validation - validation set

If the value is not specified, the default value is training.

  • -epoch <number of epochs> - Defines the epoch count.

  • -log - Writes statistics to file.

  • -pl_cfg <INI config file path> - Enables specifying an INI configuration file to configure the training test plugin.

4.5.1.5. ResNet-50 Training Stress Test Plugin Configuration Files and Requirements

Before using the plugin, make sure to perform the following:

  • Install the AEON package available under ~/habanalabs/demos/aeon.

  • Download the training/validation data set from Imagenet.

  • Run the preparation script - prepare.sh. This will untar the Imagenet tar file and generate a training list file.

For each test case, the hl_qual repository includes a matching configuration file (for example, type=training and bs=64 -> training64.json). Edit the following attributes in the configuration file:

  • manifest_filename - Path to train_list.txt or val_list.txt file.

  • manifest_root - Path to the folder that contains the training directory (output of prepare.sh script).

The available configuration files are listed below. When editing the below configuration files, absolute paths must be used:

  • training256.json - Configuration file for training run with batch size 256.

  • training64.json - Configuration file for training run with batch size 64.

  • validation256.json - Configuration file for validation run with batch size 256.

  • validation64.json - Configuration file for validation run with batch size 64.
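The naming pattern of these files follows the -type and -bs values. The Python sketch below maps the two values to a file name; the helper name is hypothetical, and the defaults mirror the documented defaults of the two switches (training, 256).

```python
def aeon_config_name(app_type="training", batch_size=256):
    """Return the configuration file name matching a -type / -bs combination."""
    if app_type not in ("training", "validation"):
        raise ValueError(f"unsupported type: {app_type!r}")
    if batch_size not in (64, 256):
        raise ValueError(f"unsupported batch size: {batch_size!r}")
    return f"{app_type}{batch_size}.json"


if __name__ == "__main__":
    print(aeon_config_name("training", 64))
    print(aeon_config_name("validation", 256))
```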

Note

IMPORTANT: Expected execution time depends on the number of epochs configured for the test run. Each epoch can take up to 18 minutes.

4.5.1.6. PCI Bandwidth Test Plugin Switches and Parameters

hl_qual -c <pci bus id> [-t <time in seconds>]  -rmod <serial | parallel>  [-dis_mon] [-mon_cfg <monitor INI path>]
        -p [-b] [-n] [-size <size in bytes> ] [-pl_cfg <plugin INI config path>]

  • -p - PCI test plugin selector.

  • -b - Enables the bidirectional PCI test, a simultaneous upload and download test. When this switch is omitted, the test plugin skips the bidirectional test and performs only the upload and download bandwidth tests.

  • -n - Disables the bandwidth pass/fail checks. This option is useful when measuring bandwidth in parallel mode, where all devices are tested simultaneously. When this switch is omitted, the PCI bandwidth plugin conducts a bandwidth validation check.

  • -size <buffer size in bytes> - Upload/Download buffer size specification. The minimal buffer size must be 536870912. If this switch is omitted, the default upload/download buffer size is 200MB.

    ./hl_qual -gaudi -c all -rmod serial -t 20 -p -b -size 102400000
    
  • -gen <gen modifier> - Specifies the expected PCI device generation. There are two applicable modifiers:

    • gen3 - The test is running on a Gen-3 PCI system (Host + Habana device).

    • gen4 - The test is running on a Gen-4 PCI system (Host + Habana device).

    ./hl_qual -gaudi -c all -rmod serial -t 20 -p -b -gen gen3
    ./hl_qual -gaudi -c all -rmod serial -t 20 -p -b -gen gen4
    

    If this switch is omitted, it is assumed that the system under test (HOST + Habana device) is a Gen-3 PCI data path.

  • -t - PCI test duration in seconds. If this switch is omitted, the default value is 40 seconds.

Note

Since the PCI bandwidth test plugin conducts up to 3 sub-tests (upload, download and bidirectional), the actual run time is up to 3 times the duration given in the command line.

  • -pl_cfg <INI config file path> - Enables you to specify an INI configuration file to configure the PCI test plugin.

./hl_qual -gaudi -c all -rmod serial -t 20 -p -b -pl_cfg config.ini

4.5.1.7. Power Stress Test Plugin Switches and Parameters

hl_qual -c <pci bus id> [-t <time in seconds>]  -rmod <serial | parallel>  [-dis_mon] [-mon_cfg <monitor INI path>]
         -s [-l <extreme | low>] [-pl_cfg <plugin INI config path>]

  • -s - Power stress test selector.

  • -l <extreme | low> - Power level selector:

    • extreme - 340 [w] measured on HL-205 running at 1.95 GHz

    • mid - 260 [w] measured on HL-205 running at 1.95 GHz

    • low - 140 [w] measured on HL-205 running at 1.95 GHz

    If the value is not specified, the default value is low.

  • -pl_cfg <plugin INI config path> - Path to an INI configuration file.

4.5.1.8. Memory Bandwidth Test Plugin Switches and Parameters

hl_qual -c <pci bus id> [-t <time in seconds>]  -rmod <serial | parallel>  [-dis_mon] [-mon_cfg <monitor INI path>]
      [-pl_cfg <plugin INI config path>]  [-n] [-b]  [-memOnly | -pciOnly] -mb

  • -mb - Memory bandwidth test selector.

  • -n - Disables the pass/fail check on the measured bandwidth.

  • -b - Activates the bidirectional bandwidth tests.

  • -memOnly - Only activates device memory tests:

    • SRAM ==> DRAM

    • DRAM ==> SRAM

    • DRAM ==> DRAM

  • -pciOnly - Only activates PCI tests:

    • HOST ==> DRAM

    • DRAM ==> HOST

    • DRAM <==> HOST [only active with -b flag]

Note

-pciOnly and -memOnly are optional and cannot be active at the same time.

  • -pl_cfg <plugin INI config path> - Path to an INI configuration file.

./hl_qual -gaudi -c all -rmod parallel -dis_mon -b -mb

The above command line executes the memory bandwidth plugin with all tests, including the bidirectional test, and with the pass/fail criteria for transfer speed enabled.
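The relationship between the memory bandwidth switches and the sub-tests they select can be sketched with Python's argparse, which supports mutually exclusive option groups. This only illustrates the documented rules (-memOnly and -pciOnly cannot be combined, and the bidirectional PCI test needs -b); it is not how the plugin parses its arguments, and the program and function names are hypothetical.

```python
import argparse

PCI_TESTS = ["HOST ==> DRAM", "DRAM ==> HOST"]
MEM_TESTS = ["SRAM ==> DRAM", "DRAM ==> SRAM", "DRAM ==> DRAM"]


def build_parser():
    parser = argparse.ArgumentParser(prog="mem_bw_sketch")
    parser.add_argument("-mb", action="store_true", help="memory bandwidth test selector")
    parser.add_argument("-n", action="store_true", help="disable the pass/fail check")
    parser.add_argument("-b", action="store_true", help="activate bidirectional tests")
    # -memOnly and -pciOnly cannot be active at the same time.
    scope = parser.add_mutually_exclusive_group()
    scope.add_argument("-memOnly", action="store_true")
    scope.add_argument("-pciOnly", action="store_true")
    return parser


def selected_tests(args):
    """List the sub-tests implied by the parsed switches."""
    pci = PCI_TESTS + (["DRAM <==> HOST"] if args.b else [])
    if args.memOnly:
        return MEM_TESTS
    if args.pciOnly:
        return pci
    return pci + MEM_TESTS


if __name__ == "__main__":
    print(selected_tests(build_parser().parse_args(["-mb", "-b"])))
```

Passing both -memOnly and -pciOnly makes argparse exit with a usage error, which matches the restriction in the note above.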

4.5.1.9. EDP Stress Test Plugin Switches and Parameters

hl_qual -c <pci bus id> [-t <time in seconds>]  -rmod <serial | parallel>  [-dis_mon] [-mon_cfg <monitor INI path>]
        -e  [-pl_cfg <plugin INI config path>]  [-tw <time in milliseconds>] [-ts <time in milliseconds>] [-l <extreme | low>]

  • -e - EDP test selector.

  • -l <extreme | low> - Power level selector:

    • extreme - 340 [w] measured on HL-205 running at 1.95 GHz

    • mid - 260 [w] measured on HL-205 running at 1.95 GHz

    • low - 140 [w] measured on HL-205 running at 1.95 GHz

    If the value is not specified, the default value is low.

  • -tw <time in milliseconds> - Time duration of high power usage in the EDP test power cycle. The default value when this switch is not specified is 1000 [ms].

  • -ts <time in milliseconds> - Time duration of low power usage (idle mode) in the EDP test power cycle. The default value when this switch is not used is 1000 [ms].

Note

tw + ts must be smaller than the test execution time.

  • -pl_cfg <plugin INI config path> - Path to an INI configuration file.

./hl_qual -gaudi -c all -rmod parallel -t 40 -e -l extreme -tw 2000 -ts 2000

The above command line executes the EDP test for 40 seconds and runs 10 power cycles. Each power cycle runs 2 seconds of high power usage and 2 seconds of low power usage.
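The cycle count in the example follows directly from the three time values. As a minimal sketch, the Python function below converts the test time to milliseconds, enforces the rule that tw + ts must be smaller than the test execution time, and returns the number of complete power cycles. The name edp_power_cycles is hypothetical, and how hl_qual handles a partial cycle at the end of a run is not stated in this guide, so the sketch simply counts whole cycles.

```python
def edp_power_cycles(test_seconds, tw_ms=1000, ts_ms=1000):
    """Return the number of complete EDP power cycles in a test run.

    test_seconds corresponds to -t, tw_ms to -tw and ts_ms to -ts.
    Counting only whole cycles is an assumption made for this sketch.
    """
    cycle_ms = tw_ms + ts_ms
    test_ms = test_seconds * 1000
    if cycle_ms >= test_ms:
        raise ValueError("tw + ts must be smaller than the test execution time")
    return test_ms // cycle_ms


# Values from the example command line: -t 40 -tw 2000 -ts 2000
print(edp_power_cycles(40, tw_ms=2000, ts_ms=2000))  # prints 10
```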

4.5.1.10. Standalone Loopback Test Plugin Switches and Parameters

hl_qual -c <pci bus id> [-t <time in seconds>]  -rmod <serial | parallel>  [-dis_mon] [-mon_cfg <monitor INI path>]
        -slb  [-m] [-sz <size in bytes>] [-wr] [-pl_cfg <plugin INI config path>]
  • -slb - Standalone Serdes loopback test selector.

  • -m - Use Synapse stream for send/receive synchronization operations.

  • -wr - Use write/read API instead of send/receive API. The write/read API enables direct memory write to the neighboring Gaudi HBM.

  • -sz <size in bytes> - Send/receive buffer size in bytes.

  • -pl_cfg <plugin INI config path> - Specify test INI configuration file path.

./hl_qual -gaudi -c all -rmod parallel -t 40 -slb -m

The above command line runs the test for 40 seconds using stream synchronization.

4.5.1.11. Serdes Base Test Plugin Switches and Parameters

hl_qual -c <pci bus id> [-t <time in seconds>] [-i <inner loop iterations count>]  -rmod <serial | parallel>  [-dis_mon] [-mon_cfg <monitor INI path>]
        -nic_base  -test_type <loopback, ext_loopback, pairs, ext_pairs> [-pl_cfg <plugin INI config path>] [-sz <size in bytes>] -j <HCL json config file> [-seed <seed>]
  • -nic_base - Serdes base test selector.

  • -test_type <test type> - Defines the test type and port configuration to be used in the user system. The different test type variants are:

    • loopback - Loopback test. All device ports must be fitted with a loopback dongle.

    • ext_loopback - In this test, all internal ports interconnected between Gaudi devices are disabled. The external ports going outside of the server box are fitted with loopback dongles. To disable the internal ports, update the config.ini file. See Serdes Base Test Plugin Configuration Syntax for further details.

    • pairs - This test checks the internal port connectivity.

    • allgather - This test computes the all-gather bandwidth.

    • allreduce - This test computes the all-reduce bandwidth.

  • -j <json config file> - Specifies the HCL library configuration file, which defines the port connectivity within the server box and which ports are external.

  • -sz - Send/receive buffer size in bytes.

  • -i - Iteration count for bandwidth computation (all-reduce and all-gather).

  • -pl_cfg <plugin INI config path> - Specifies test INI configuration file path.

  • -seed <seed> - Seed value for generating the transmit patterns. This is a 32-bit number written as eight hexadecimal digits, xxxxxxxx (for example, abc123da). The default pattern is 5a5a5a5a.

./hl_qual -gaudi -c all -rmod parallel -t 40 -nic_base -test_type pairs -j hls1.json

The above command line runs the Serdes base test for 40 seconds. The HCL communication library uses the hls1.json file for port configuration.
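A seed value can be checked against the documented 32-bit hexadecimal pattern before it is passed to the test. The Python sketch below accepts exactly eight hexadecimal digits, matching the xxxxxxxx pattern described above. The name parse_seed is hypothetical, and whether hl_qual itself also accepts shorter values or a 0x prefix is not stated in this guide, so the strict check is an assumption.

```python
import re

# Exactly eight hexadecimal digits, i.e. a 32-bit value (assumed strict form).
SEED_PATTERN = re.compile(r"[0-9a-fA-F]{8}")


def parse_seed(text):
    """Convert an 8-digit hexadecimal seed string to a 32-bit integer."""
    if SEED_PATTERN.fullmatch(text) is None:
        raise ValueError(f"seed must be exactly 8 hexadecimal digits: {text!r}")
    return int(text, 16)


print(hex(parse_seed("abc123da")))  # prints 0xabc123da
```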

4.5.1.12. Functional Test Plugin Switches and Parameters

hl_qual -c <pci bus id> [-t <time in seconds>]  -rmod <serial | parallel>  [-dis_mon] [-mon_cfg <monitor INI path>]
        -f  [-serdes_type <none | loopback | allgather | allreduce>] [-j <allgather/allreduce json config path>] [-pl_cfg <plugin INI config path>]
  • -f - Functional test plugin selector.

  • -serdes_type <serdes test definition> - Selects the Serdes sub-test mode. Four modes are available:

    • none - Runs the simplified functional test without Serdes testing.

    ./hl_qual -gaudi -c all -rmod parallel -f -serdes_type none
    
    • loopback - Runs the simplified functional test including Serdes loopback test. This option may also be tested on external ports. You must connect loopback dongles to external ports and disable all internal ports. To disable internal ports, see Functional Test Plugin Configuration Syntax.

    ./hl_qual -gaudi -c all -rmod parallel -f -serdes_type loopback
    

    Note

    When running this test, all ports of the Gaudi device must be connected with a loopback dongle to close the RX/TX loop.

    • allgather - Runs the simplified functional test including the Serdes Allgather test. During each test iteration, each Gaudi device sends a data buffer of 16MB to all other devices participating in the test. All Gaudi devices check the received transmission against an expected reference input buffer.

    • allreduce - Runs the simplified functional test including the Serdes allreduce test. During each test iteration, each Gaudi device sends a data buffer of 16MB to all other devices participating in the test. All Gaudi devices check the received transmission against an expected reference input buffer.

    ./hl_qual -gaudi -c all -rmod parallel -f -serdes_type allgather
    

    Note

    This test can be performed only on HLS-1 systems.

When serdes_type is not specified in the command line or configuration file, the default is none, which means that only the simplified functional test runs.

  • -pl_cfg <INI config file path> - This switch allows specifying the path to a configuration file for test customization.

./hl_qual -gaudi -c all -rmod parallel -f -pl_cfg config.ini

4.5.2. hl_qual and Test Plugin Configuration Files

Configuration files let you customize testing with predefined test configurations and run your own test plans. Use a configuration file when a test needs more parameters than can reasonably be passed on the command line. The following rules apply when using configuration files:

  • hl_qual.ini - This is the hl_qual configuration file. Do not edit this file or delete it from the hl_qual installation.

  • device.ini - Habana device setup configuration file. This file cannot be edited or deleted.

  • monitor.ini - The monitor configuration file. You can edit this file to customize the monitor sampling and display.

  • When the same test parameter is available from both a config file and the command line, the command line value always overrides the value from the config file.

  • A config file that resides in a different folder must be specified with its full path.

  • Config data for different plugins can be placed in one config file or split into a separate file for each plugin.

  • Config file format naming conventions must be used as specified in the following sections.

  • The test plugins configuration file, config.ini, is not used by default. To enforce this configuration file, use the -pl_cfg switch.
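The override rule in the list above can be pictured as a simple merge in which command line values are applied last. The Python sketch below is an illustration only: it does not reflect how hl_qual is implemented, the function name effective_params is hypothetical, and the mapping of the -t switch to the TEST_DURATION key is an assumption. The sample values mirror the [PCI_BW] section described later in this guide.

```python
import configparser


def effective_params(ini_text, section, cli_overrides):
    """Merge plugin INI values with command line values; the command line wins."""
    parser = configparser.ConfigParser()
    parser.optionxform = str  # keep key case, e.g. TEST_DURATION
    parser.read_string(ini_text)
    params = dict(parser[section]) if parser.has_section(section) else {}
    # Only switches that were actually given on the command line override.
    params.update({k: str(v) for k, v in cli_overrides.items() if v is not None})
    return params


INI_TEXT = """
[PCI_BW]
BIDIRECTIONAL=true
TEST_DURATION=20
"""

# Assumption: "-t 40" on the command line corresponds to TEST_DURATION.
print(effective_params(INI_TEXT, "PCI_BW", {"TEST_DURATION": 40}))
```

Values that appear only in the config file, such as BIDIRECTIONAL here, are kept unchanged.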

4.5.2.1. hl_qual ini Configuration File

The hl_qual.ini file is supplied in the hl_qual installation package. The file controls which plugin is loaded by the hl_qual application. You must not change the content of the file. Any change to this file may lead to unpredictable behavior of the hl_qual package.

4.5.2.2. device ini Configuration File

The device.ini file is supplied in the hl_qual installation package. The file controls Habana Labs device setup. You must not change the content of the file. Any change to this file may lead to unpredictable behavior of the hl_qual package.

4.5.2.3. Plugin ini Configuration File

The plugin INI file controls the behavior of the test plugins. Some of the available parameters can also be passed on the command line. Command line parameters always override configuration file parameters when both are used in the same test plugin run.

The Habana Labs hl_qual SW package contains a default plugin configuration file, config.ini, that includes parameters for all available test plugins. It is strongly recommended not to edit this file. Instead, create your own configurations based on it.
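One way to derive a custom configuration without touching the default file is to load its text, change the values of interest and write the result to a new file. The Python sketch below does this with the standard configparser module. The function name derive_config is hypothetical, and the embedded text is a shortened copy of the [POWER_STRESS] sample shown later in this guide rather than the real config.ini content.

```python
import configparser
import io

# Shortened stand-in for the default config.ini content (illustrative only).
DEFAULT_INI = """
[POWER_STRESS]
TEST_DURATION=30
TYPE=stress
LEVEL=low
TW=1000
TS=1000
"""


def derive_config(default_text, section, changes):
    """Return new INI text based on default_text with selected keys changed."""
    parser = configparser.ConfigParser()
    parser.optionxform = str  # preserve upper-case keys
    parser.read_string(default_text)
    for key, value in changes.items():
        parser.set(section, key, str(value))
    buffer = io.StringIO()
    parser.write(buffer, space_around_delimiters=False)
    return buffer.getvalue()


custom = derive_config(DEFAULT_INI, "POWER_STRESS",
                       {"TYPE": "edp", "LEVEL": "extreme", "TW": 2000, "TS": 2000})
print(custom)
```

The resulting text can be saved under a new file name and passed to hl_qual with the -pl_cfg switch. Note that configparser drops comment lines when it writes a file, so comments from the default file are not carried over by this approach.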

The following format constructs should be noted:

  • plugin parameters section - This is denoted with square brackets. Each section corresponds to one of the available test plugins. The following sections are available:

    1. [STANDALONE_LOOPBACK]

    2. [NIC_BASETEST]

    3. [FUNCTIONAL_TEST]

    4. [POWER_STRESS]

    5. [PCI_BW]

  • comments - You can add comments on new lines using ; or #.

4.5.2.3.1. Functional Test Plugin Configuration Syntax
[FUNCTIONAL_TEST]
json=hls1.json
;none, loopback, allgather, allreduce
test_type=none
TEST_DURATION=20
#AVAILABLE ==> INFO,DEBUG,WARNING,ERROR,NONE
LOG_LEVEL=NONE
#AVAILABLE ==> FILE,SCREEN,FILE_SCREEN
DEST_DEST=FILE
DISBALED_PORTS=[]
  • [FUNCTIONAL_TEST] - Section header that introduces the functional test parameters.

  • json - Path to the HCL JSON configuration file for the functional test. This is needed only for allgather tests.

  • test_type - Serdes test type selector [none, loopback, allgather, allreduce].

  • TEST_DURATION - Test duration in seconds.

  • LOG_LEVEL - Enables the internal plugin logger. Available levels: NONE, ERROR, WARNING, DEBUG, INFO.

  • DEST_DEST - Log destination: FILE, SCREEN or FILE_SCREEN.

  • DISBALED_PORTS - Disables ports during Serdes tests. For example, to disable ports 0,2,3,4,5,6,7 use DISABLED_PORTS=[0,2,3,4,5,6,7]. Port disabling is useful when using a loopback dongle on external ports.
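The port list value uses a bracketed, comma-separated form such as [0,2,3,4,5,6,7]. As a minimal sketch, the Python helpers below parse such a value into integers and format a list back into the same form, which can be handy when generating configuration files for different dongle setups. The function names are hypothetical, and the helpers work on the value only, not on the key name. Treating the value as JSON is an assumption that happens to fit the syntax shown above.

```python
import json


def parse_port_list(value):
    """Parse a port list such as "[0,2,3]" into a sorted list of unique ints."""
    ports = json.loads(value)
    valid = isinstance(ports, list) and all(
        isinstance(p, int) and not isinstance(p, bool) and p >= 0 for p in ports
    )
    if not valid:
        raise ValueError(f"not a list of non-negative port numbers: {value!r}")
    return sorted(set(ports))


def format_port_list(ports):
    """Format port numbers in the bracketed form used in the INI samples."""
    return "[" + ",".join(str(p) for p in sorted(set(ports))) + "]"


print(parse_port_list("[0,2,3,4,5,6,7]"))
print(format_port_list([7, 0, 2]))
```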

4.5.2.3.2. Standalone Loopback Test Plugin Configuration Syntax
[STANDALONE_LOOPBACK]
TEST_DURATION=20
WR_ENABLE=true
STREAMS_ENABLE=true
BUFF_SIZE=52428800
#AVAILABLE ==> INFO,DEBUG,WARNING,ERROR,NONE
LOG_LEVEL=NONE
#AVAILABLE ==> FILE,SCREEN,FILE_SCREEN
DEST_DEST=FILE
DISBALED_PORTS=[]
  • [STANDALONE_LOOPBACK] - Section header that introduces the standalone loopback test parameters.

  • WR_ENABLE - Enables usage of the write/read API instead of the send/receive API.

  • STREAMS_ENABLE - Enables stream synchronization.

  • TEST_DURATION - Test duration in seconds.

  • BUFF_SIZE - Send/receive buffer size.

  • LOG_LEVEL - Enables the internal plugin logger. Available levels: NONE, ERROR, WARNING, DEBUG, INFO.

  • DEST_DEST - Log destination: FILE, SCREEN or FILE_SCREEN.

  • DISBALED_PORTS - Disables ports during Serdes tests. For example, to disable ports 0,2,3,4,5,6,7 use DISABLED_PORTS=[0,2,3,4,5,6,7]. Port disabling is useful when using a loopback dongle on external ports.

4.5.2.3.3. Power Stress and EDP Test Plugin Configuration Syntax
[POWER_STRESS]
;Power stress or EDP test duration in seconds
TEST_DURATION=30
;TYPE ==>stress or edp
TYPE=stress
;LEVEL ==> extreme or low
LEVEL=low
; TW high power usage duration in the EDP power cycles
TW=1000
; TS low power usage (device in IDLE) duration in the EDP power cycles
TS=1000
#AVAILABLE ==> INFO,DEBUG,WARNING,ERROR,NONE
LOG_LEVEL=NONE
#AVAILABLE ==> FILE,SCREEN,FILE_SCREEN
DEST_DEST=FILE
  • [POWER_STRESS] - Section header that introduces the power stress and EDP test parameters.

  • TYPE - Test type: stress or edp.

  • LEVEL - Test power level: extreme or low.

  • TEST_DURATION - Test duration in seconds.

  • TW - Power cycle high power duration in milliseconds.

  • TS - Power cycle low power duration in milliseconds.

  • LOG_LEVEL - Enables the internal plugin logger. Available levels: NONE, ERROR, WARNING, DEBUG, INFO.

  • DEST_DEST - Log destination: FILE, SCREEN or FILE_SCREEN.

  • DISBALED_PORTS - Disables ports during Serdes tests. For example, to disable ports 0,2,3,4,5,6,7 use DISABLED_PORTS=[0,2,3,4,5,6,7]. Port disabling is useful when using a loopback dongle on external ports.

4.5.2.3.4. PCI Bandwidth Test Plugin Configuration Syntax
[PCI_BW]
BIDIRECTIONAL=true
TEST_DURATION=20
DISABLE_VALIDATION=true
;the bandwidth limits presented here assume Gen3 x16 devices
PCI_GEN=gen3
;upload and download host buffer size
BUFF_SIZE=204800000
  • [PCI_BW] - Section header that introduces the PCI bandwidth test parameters.

  • BIDIRECTIONAL - Enables the bidirectional PCI bandwidth test: true or false.

  • DISABLE_VALIDATION - Disables bandwidth validation: true or false.

  • TEST_DURATION - Test duration in seconds.

  • PCI_GEN - PCI generation of the system, gen3 or gen4. This applies to the whole PCI data path between host and device, including all bridges.

  • BUFF_SIZE - Upload and download buffer size in bytes.

4.5.2.3.5. Serdes Base Test Plugin Configuration Syntax
[NIC_BASETEST]
TEST_DURATION=20
#BUFF_SIZE=52428800
#SUB TEST TYPE: loopback, ext_loopback, pairs, ext_pairs
SUB_TEST=pairs
#DISABLED PORT EXAMPLE: [0,1,2]
DISBALED_PORTS=[]
#AVAILABLE ==> INFO,DEBUG,WARNING,ERROR,NONE
LOG_LEVEL=NONE
#AVAILABLE ==> FILE,SCREEN,FILE_SCREEN
DEST_DEST=FILE
  • [NIC_BASETEST] - Section header that introduces the Serdes base test parameters.

  • SUB_TEST - Selects a specific sub-test. Available options: loopback, ext_loopback, pairs, ext_pairs.

  • TEST_DURATION - Test duration in seconds.

  • BUFF_SIZE - Send/receive buffer size.

  • LOG_LEVEL - Enables the internal plugin logger. Available levels: NONE, ERROR, WARNING, DEBUG, INFO.

  • DEST_DEST - Log destination: FILE, SCREEN or FILE_SCREEN.

  • DISBALED_PORTS - Disables ports during Serdes tests. For example, to disable ports 0,2,3,4,5,6,7 use DISABLED_PORTS=[0,2,3,4,5,6,7]. Port disabling is useful when using a loopback dongle on external ports.

4.5.2.4. monitor ini Configuration File

The following sections are allowed:

  • [TEMP_MON] - Temperature monitoring parameter section.

  • [POWER_MON] - Power usage monitoring parameter section.

  • [CLOCK_MON] - Clock monitoring parameter section.

  • [MEM_MON] - Memory usage monitoring parameter section.

  • [ECC_MON] - ECC errors monitoring parameter section.

  • [PCI_REPLAY_MON] - PCI replay monitoring parameter section.

  • [PCI_BW_MON] - PCI bandwidth monitoring parameter section.

The following INI snippets show the allowable fields:

[TEMP_MON]
enable=true
LOW=15
HIGH=75
[POWER_MON]
enable=true
LOW=45
HIGH=340
[CLOCK_MON]
enable=true
LOW=1850
HIGH=1950
[MEM_MON]
enable=false
HIGH=30720
[ECC_MON]
enable=true
[PCI_REPLAY_MON]
enable=false
HIGH=1000000000
[PCI_BW_MON]
enable=false
LOW=4
HIGH=10
  • enable - Enables or disables monitoring a specific value.

  • LOW - States the specific allowable low value for the monitored parameter. If the measured value is below that threshold, the monitor marks it in red on the monitoring UI.

  • HIGH - States the specific allowable high value for the monitored parameter. If the measured value is above that threshold, the monitor marks it in red on the monitoring UI.

Note

Disabling the monitoring on specific values will make the sampling process work faster and improve the monitor UI refresh rate, especially when the system contains multiple Gaudi devices.
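The LOW and HIGH fields describe a simple range check on each sampled value. The Python sketch below illustrates that rule: it reads a few monitor sections, skips disabled ones, and reports whether a sample falls outside its allowed range. It is not the monitor's actual implementation. The function names are hypothetical, and treating a missing LOW or HIGH field as "no limit on that side" is an assumption based on sections such as [MEM_MON], which only define HIGH.

```python
import configparser

MONITOR_INI = """
[TEMP_MON]
enable=true
LOW=15
HIGH=75
[MEM_MON]
enable=false
HIGH=30720
[ECC_MON]
enable=true
"""


def load_limits(ini_text):
    """Return {section: (low, high)} for every enabled monitor section."""
    parser = configparser.ConfigParser()
    parser.optionxform = str  # keys are case sensitive here: enable, LOW, HIGH
    parser.read_string(ini_text)
    limits = {}
    for section in parser.sections():
        if not parser.getboolean(section, "enable", fallback=False):
            continue  # disabled sections are not sampled
        low = parser.getfloat(section, "LOW", fallback=None)
        high = parser.getfloat(section, "HIGH", fallback=None)
        limits[section] = (low, high)
    return limits


def is_out_of_range(value, low, high):
    """True when value is below LOW or above HIGH (the 'red' condition)."""
    return (low is not None and value < low) or (high is not None and value > high)


limits = load_limits(MONITOR_INI)
low, high = limits["TEMP_MON"]
print(is_out_of_range(80, low, high))
```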

4.6. hl_qual Expected Output and Failure Debug

4.6.1. hl_qual Expected Output

hl_qual generates a test report, which is printed to the screen and written to a log file. The log file naming convention is ServerName_hl_qual_report_TIMESTAMP.log. For example, k24-u18-60a_hl_qual_report_Mon_Dec_4_21-44-01_2020.log.
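When reports from several servers are collected in one place, the naming convention makes it possible to recover the server name and timestamp from each file name. The Python sketch below splits a report file name on the fixed _hl_qual_report_ marker. The function name split_report_name is hypothetical, and the timestamp is returned as plain text because its exact format is known here only from the single example above.

```python
REPORT_MARKER = "_hl_qual_report_"


def split_report_name(filename):
    """Split ServerName_hl_qual_report_TIMESTAMP.log into (server, timestamp)."""
    stem, _, extension = filename.rpartition(".")
    if extension != "log" or REPORT_MARKER not in stem:
        raise ValueError(f"not an hl_qual report file name: {filename!r}")
    server, _, timestamp = stem.partition(REPORT_MARKER)
    return server, timestamp


print(split_report_name("k24-u18-60a_hl_qual_report_Mon_Dec_4_21-44-01_2020.log"))
```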

Figure 4.6 hl_qual Test Report

Figure 4.6 shows an example of an hl_qual report. This report shows a single device test and includes the following:

  • Detected device report.

  • NUMA node and CPU to device allocation report.

  • Command line report.

  • Test report - This includes a test report on each tested device with device information, plugin report and error printout as well as test results outcome when applicable.

  • Device measurement report - This includes min/max values of measurement metrics like temperature, power usage, clock data and ECC errors.

  • Test report summary of the tests’ results - This includes failing and passing results on specific devices.

All the sections of the report are identical between the different tests except for the plugin test results report.

4.6.2. hl_qual Failure Debug

Due to the complexity of server systems, a malfunction in one HW module can affect the results of several tests. It is recommended to follow a test plan that first tests the basic HW components, such as PCI and Serdes, using the simplified tests, and only then moves on to more complex tests such as power stress and functional tests.

Figure 4.7 shows the recommended test plan using the hl_qual tool.

Figure 4.7 hl_qual Test Plan

Habana recommends executing long test runs, especially for the power stress, EDP and functional tests (including all sub-tests). Running these tests for 12 hours could expose cooling problems and overheating issues.

In case of test failures, generating log files is recommended.

4.6.2.1. Generating Log Files

As part of the test report, you can generate the reports listed below and send the reports to Habana for further support.

  • dmesg report - Clear the dmesg log, run the test, and then collect the dmesg log:

sudo dmesg -C
./hl_qual -gaudi -c all -rmod serial -t 5 -p -b
dmesg -HT > dmesg.log
  • hl-smi - To generate the report, run the following command:

hl-smi -q > hl-smi.log
  • Test plugin log - Enable the test plugin log in the config.ini file and run the test. Note that hl_qual must be run with the -pl_cfg switch. For example:

./hl_qual -gaudi -c all -rmod parallel -t 30 -f -serdes_type allgather -j hls1.json -pl_cfg config.ini
  • lspci report - Use the following command line:

lspci -vvnn
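The collection commands above can be gathered into one step that is run after a failing test. The Python sketch below runs the three reporting commands listed in this section and stores each output in its own file. The function name collect_reports and the output file names are illustrative choices and are not defined by hl_qual. Clearing the dmesg log and running the test are expected to have been done beforehand.

```python
import subprocess
from pathlib import Path

# Commands taken from the list above; output file names are illustrative.
REPORT_COMMANDS = {
    "dmesg.log": ["dmesg", "-HT"],
    "hl-smi.log": ["hl-smi", "-q"],
    "lspci.log": ["lspci", "-vvnn"],
}


def collect_reports(out_dir, run=subprocess.run):
    """Run each reporting command and write its standard output to out_dir.

    The run parameter defaults to subprocess.run and can be replaced,
    for example in a dry run on a machine without Gaudi devices.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for filename, command in REPORT_COMMANDS.items():
        result = run(command, capture_output=True, text=True, check=False)
        path = out_dir / filename
        path.write_text(result.stdout)
        written.append(path)
    return written
```

Calling collect_reports("hl_qual_logs") on the server under test would produce one directory that can be sent to Habana together with the hl_qual report and the test plugin log.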

4.6.2.2. Debugging Specific Issues

  • PCI bandwidth issues:

    • Verify that the entire path between host and device, including PCI bridges, is Gen3 with x16 width.

    • Verify the correct NUMA node assignment in the hl_qual report.

    • Verify correct setup of PCI retimers between host and device.

  • Serdes issues:

    • Check that all links are up (link status is reported in dmesg).

    • Check for any assertions due to HCL wait loops.

  • Power stress issues:

    • Check for clock throttling followed by a device reset.

    • Check reported temperature and compare to allowed max temperature.

4.7. Abbreviations and Commonly Used Terms

Abbreviation/ Term

Description

Plugin

A test library run by hl_qual to perform a specific testing task. All plugins are implemented as dynamically linked libraries.

Plugin under test

The plugin chosen to run, either through the command line options or via a configuration file.

HLML

Management library for accessing Habana devices. It enables reading operating condition parameters such as temperature, clocks, power usage and error conditions encountered by the HW.

Internal port

Ports that connect devices within the server box.

External port

External ports or scale-out ports are ports that connect a Gaudi port to a Gaudi residing in a different server box or to an external switch.