Memory Stress Test Plugins Design, Switches and Parameters

This section describes the plugin-specific switches. The common switches are not covered here, although they appear in the command examples for completeness. For the common plugin switches and parameters, refer to hl_qual Common Plugin Switches and Parameters.

First-gen Gaudi HBM Stress Plugin Design Consideration and Responsibilities

The HBM stress plugin is a stress test based on memory transfers using DMA. Each test iteration consists of the following stages:

  • Run the DMA all2all HBM super stress.

  • Read out the SERR/DERR counters to check whether a memory error occurred during the HBM test.

  • Perform a device reset. Each testing process will reset just the device it is testing. This stage can be skipped by the user.

Use the -i switch to change the number of times the above stages are executed.
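The iteration flow above can be sketched as a small planning helper. This is an illustrative Python sketch, not hl_qual's implementation: the function name plan_stages and the stage labels are hypothetical, and the two skip flags mirror the -skip_val and -skip_rst switches described later in this section.

```python
def plan_stages(iterations, skip_val=False, skip_rst=False):
    """Return the ordered (iteration, stage) pairs for one run.

    Stage labels are hypothetical names for the three stages listed above.
    """
    if iterations < 1:
        raise ValueError("iterations must be at least 1")
    plan = []
    for i in range(1, iterations + 1):
        plan.append((i, "dma_all2all_stress"))      # always runs
        if not skip_val:
            plan.append((i, "serr_derr_readout"))   # omitted with -skip_val
        if not skip_rst:
            plan.append((i, "device_reset"))        # omitted with -skip_rst
    return plan
```

For example, plan_stages(2, skip_rst=True) yields four entries: a stress stage and a readout stage for each of the two iterations.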

Note

The test must be run with superuser permissions, and the habanalabs driver must be loaded with the correct flags in order to use the plugin. The duration of each iteration (including the test run, error readout, and reset) is approximately 5-6 minutes.

HBM Stress Test - Pass/Fail Criteria

The pass/fail criteria are as follows:

  • The DMA all2all super stress must finish successfully.

  • No SERR/DERR indication should be encountered during the DMA test.

  • The device reset process (when applicable) must finish with success.

Gaudi2 HBM Stress Plugin Design Consideration and Responsibilities

The HBM stress plugin verifies memory integrity under memory-transfer stress using one of three sub-tests. Only one sub-test type can be selected per run; combining sub-tests in a single run is not allowed and returns an error.

  • HBM_DMA_STRESS - the memory transactions (read/write) are executed using the DMA engines.

  • HBM_TPC_STRESS - the memory transactions (read/write) are executed using TPC engine load/store commands.

  • HBM_FULL_DATA_CHECK - the plugin performs a read and write pass over all HBM memory locations.

For hbm_tpc_stress, four sub-modes are available: read, write, read_write, and full_rw, which runs the other three sub-modes in the same test.
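The sub-mode rules can be illustrated with a short Python sketch. The helper name resolve_tpc_sub_modes is hypothetical, the read_write default follows the switch description later in this section, and the order in which full_rw runs the other sub-modes is an assumption, since the text does not state it.

```python
TPC_SUB_MODES = ("read", "write", "read_write", "full_rw")


def resolve_tpc_sub_modes(sub_mode="read_write"):
    """Return the list of sub-modes a single hbm_tpc_stress run would cover."""
    if sub_mode not in TPC_SUB_MODES:
        raise ValueError(f"unknown sub-mode: {sub_mode!r}")
    if sub_mode == "full_rw":
        # full_rw covers the three other sub-modes; this ordering is assumed.
        return ["read", "write", "read_write"]
    return [sub_mode]
```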

The test can be run in an iterative manner where each iteration includes the following stages:

  • Run one of the sub-tests described above.

  • Perform an ECC error readout (single correctable errors and double uncorrectable errors).

  • Perform a device reset. Each testing process will reset just the device it is testing. This stage can be skipped by the user.

The number of iterations is not bounded, but note that a large number of iterations can result in a very long run time.

Use the -i switch to change the number of times the above stages are executed.

Note

The test must be run with superuser permissions, and the habanalabs driver must be loaded with the correct flags in order to use the plugin. The duration of each iteration (including the test run, error readout, and reset) is approximately 5-6 minutes.
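Because the iteration count is unbounded, it can help to estimate the total run time before choosing a value for -i. The following Python sketch simply multiplies the 5-6 minute per-iteration figure quoted in the note; the helper name is hypothetical, and actual durations may vary by sub-test and system.

```python
def estimate_runtime_minutes(iterations, per_iteration=(5, 6)):
    """Return a (low, high) estimate, in minutes, for the whole run."""
    if iterations < 1:
        raise ValueError("iterations must be at least 1")
    low, high = per_iteration
    return iterations * low, iterations * high
```

For example, estimate_runtime_minutes(2) returns (10, 12), and estimate_runtime_minutes(100) returns (500, 600), which is more than eight hours.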

HBM Stress Test - Pass/Fail Criteria

The pass/fail criteria, which are verified per test iteration, are as follows:

  • The chosen HBM test passed (HBM_DMA_STRESS, HBM_TPC_STRESS or HBM_FULL_DATA_CHECK).

  • No ECC error detected during the test.

  • The device reset process (when applicable) must finish with success.

A failure in any test iteration causes the full test to fail.
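The per-iteration criteria and the overall verdict can be expressed as a short Python sketch. It is an illustration of the rules above, not the plugin's code: IterationResult and both functions are hypothetical names, and the sketch assumes that a skipped ECC check or a skipped reset is simply ignored when judging an iteration.

```python
from dataclasses import dataclass


@dataclass
class IterationResult:
    test_passed: bool        # outcome of the chosen HBM sub-test
    ecc_errors: int = 0      # SERR + DERR count read after the test
    reset_ok: bool = True    # outcome of the device reset


def iteration_passed(result, skip_val=False, skip_rst=False):
    """Apply the three per-iteration criteria, honoring skipped stages."""
    if not result.test_passed:
        return False
    if not skip_val and result.ecc_errors > 0:
        return False
    if not skip_rst and not result.reset_ok:
        return False
    return True


def first_failed_iteration(results, skip_val=False, skip_rst=False):
    """Return the 1-based index of the first failing iteration, or None."""
    for index, result in enumerate(results, start=1):
        if not iteration_passed(result, skip_val, skip_rst):
            return index
    return None
```

The full test passes only when first_failed_iteration returns None.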

HBM Stress Test Plugin Switches and Parameters

Some of the HBM stress tests have preconditions that must be met before running the test. The preconditions differ between first-gen Gaudi and Gaudi2 devices.

First-gen Gaudi:

  • Habanalabs driver must be loaded with timeout_locked=500

    sudo rmmod habanalabs
    
    sudo modprobe habanalabs timeout_locked=500
    
  • The test must be run with sudo privileges.

Gaudi2:

  • Habanalabs driver must be loaded with timeout_locked=0

    sudo rmmod habanalabs
    
    sudo modprobe habanalabs timeout_locked=0
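Before starting a long run, it may be worth confirming that the driver was actually reloaded with the intended value. The Python sketch below reads the parameter from the standard Linux sysfs location for module parameters; whether the habanalabs driver exposes timeout_locked there as a readable file is an assumption, and both function names are hypothetical.

```python
from pathlib import Path

# Standard sysfs location for a loaded module's parameters (assumed to apply here).
DEFAULT_PARAM_DIR = "/sys/module/habanalabs/parameters"


def read_timeout_locked(param_dir=DEFAULT_PARAM_DIR):
    """Return the loaded driver's timeout_locked value, or None if unreadable."""
    try:
        return int((Path(param_dir) / "timeout_locked").read_text().strip())
    except (OSError, ValueError):
        return None


def driver_ready(expected, param_dir=DEFAULT_PARAM_DIR):
    """True when the driver is loaded with the expected timeout_locked value."""
    return read_timeout_locked(param_dir) == expected
```

Pass the value required for the device under test (500 or 0, as listed above). A return value of None from read_timeout_locked means the file could not be read, for example because the driver is not loaded.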
    

First-gen Gaudi and Gaudi2 test variants differ in the test selector switches as shown in the table below:

Device Type        Applicable Selector Switch
First-gen Gaudi    hbm_stress
Gaudi2             hbm_dma_stress, hbm_tpc_stress, full_hbm_data_check_test
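The device-to-selector mapping in the table can be captured in a small lookup, sketched here in Python. The dictionary keys reuse the -gaudi and -gaudi2 device flags from the command examples in this section; the function name check_selector is hypothetical.

```python
# Selector switches per device type, as listed in the table above.
SELECTORS = {
    "gaudi": ("hbm_stress",),
    "gaudi2": ("hbm_dma_stress", "hbm_tpc_stress", "full_hbm_data_check_test"),
}


def check_selector(device, selector):
    """Return the command-line switch if the selector applies to the device."""
    if device not in SELECTORS:
        raise ValueError(f"unknown device type: {device!r}")
    if selector not in SELECTORS[device]:
        raise ValueError(f"{selector!r} is not applicable to {device!r}")
    return "-" + selector
```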

For first-gen Gaudi, the command syntax is:

sudo -E hl_qual -gaudi -c <pci bus id> -rmod <parallel> [-dis_mon] [-mon_cfg <monitor INI path>]
                -hbm_stress [-i <number of iterations>] [-skip_rst] [-skip_val]

Switches and Parameters    Description
-hbm_stress                HBM stress test selector.
-skip_rst                  Skips device reset after completing the HBM super stress test.
-skip_val                  Skips SERR/DERR check after completing the HBM super stress test.
-i                         Number of iterations for the test. Each iteration takes about 6 minutes.

The following example commands run the test for 2 iterations (about 12 minutes). The second command also skips the device reset:

sudo -E  ./hl_qual -gaudi -c all -rmod parallel -i 2 -hbm_stress
sudo -E  ./hl_qual -gaudi -c all -rmod parallel -i 2 -hbm_stress -skip_rst

For Gaudi2, the command syntax is:

hl_qual -gaudi2 -c <pci bus id> -rmod <parallel> [-dis_mon] [-mon_cfg <monitor INI path>]
                -hbm_dma_stress | -hbm_tpc_stress <read | write | read_write | full_rw> | -full_hbm_data_check_test [-i <number of iterations>] [-skip_rst] [-skip_val]

Switches and Parameters      Description
-hbm_dma_stress              HBM DMA stress test selector.
-hbm_tpc_stress              HBM TPC engine stress test selector. Four sub-modes are available; choose one: read, write, read_write, or full_rw. The default sub-mode is read_write. The full_rw sub-mode runs all applicable tests one after the other and only then performs the reset sequence.
-full_hbm_data_check_test    HBM full data memory check test selector.
-skip_rst                    Skips device reset after completing the HBM super stress test.
-skip_val                    Skips SERR/DERR check after completing the HBM super stress test.
-i                           Number of iterations for the test. Each iteration takes about 6 minutes.

The following example commands each run the test for 2 iterations:

./hl_qual -gaudi2 -c all -rmod parallel -i 2 -hbm_dma_stress
./hl_qual -gaudi2 -c all -rmod parallel -i 2 -hbm_dma_stress -skip_rst
./hl_qual -gaudi2 -c all -rmod parallel -i 2 -hbm_tpc_stress
./hl_qual -gaudi2 -c all -rmod parallel -i 2 -hbm_tpc_stress full_rw
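When such commands are generated from a script, the one-selector-per-run rule can be enforced before hl_qual is invoked. The following Python sketch assembles an argument list that mirrors the examples above. The function name, its parameters, and the ./hl_qual default path are illustrative assumptions, and only switches documented in this section are emitted.

```python
GAUDI2_SELECTORS = ("hbm_dma_stress", "hbm_tpc_stress", "full_hbm_data_check_test")
TPC_SUB_MODES = ("read", "write", "read_write", "full_rw")


def build_gaudi2_command(selector, sub_mode=None, iterations=None, devices="all",
                         skip_rst=False, skip_val=False, binary="./hl_qual"):
    """Build an hl_qual argument list for exactly one Gaudi2 HBM sub-test."""
    if selector not in GAUDI2_SELECTORS:
        raise ValueError(f"unknown selector: {selector!r}")
    if sub_mode is not None:
        if selector != "hbm_tpc_stress":
            raise ValueError("sub-modes apply only to hbm_tpc_stress")
        if sub_mode not in TPC_SUB_MODES:
            raise ValueError(f"unknown sub-mode: {sub_mode!r}")
    if iterations is not None and iterations < 1:
        raise ValueError("iterations must be at least 1")

    argv = [binary, "-gaudi2", "-c", devices, "-rmod", "parallel"]
    if iterations is not None:
        argv += ["-i", str(iterations)]
    argv.append("-" + selector)
    if sub_mode is not None:
        argv.append(sub_mode)       # sub-mode follows the selector switch
    if skip_rst:
        argv.append("-skip_rst")
    if skip_val:
        argv.append("-skip_val")
    return argv
```

Joining the result of build_gaudi2_command("hbm_tpc_stress", "full_rw", iterations=2) with spaces reproduces the last example command above. The list can also be passed to subprocess.run.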