Memory Stress Test Plugins Design, Switches and Parameters

This section describes plugin-specific switches. Common switches are not covered in detail here, although they appear in the command examples for completeness. For the common plugin switches and parameters, refer to hl_qual Common Plugin Switches and Parameters.

Gaudi 2 HBM Stress Plugin Design Consideration and Responsibilities

The HBM stress plugin verifies memory integrity under the stress of memory transfers. The plugin provides the following sub-tests:

  • HBM_DMA_STRESS - memory transactions (read/write) are executed using the DMA engines.

  • HBM_TPC_STRESS - memory transactions (read/write) are executed using TPC engine load/store commands. You can run this sub-test in one of four sub-modes: read, write, read_write, or full_rw, which runs the other three sub-modes in the same test. The default sub-mode is read_write.

  • HBM_FULL_DATA_CHECK - the plugin performs a read and write pass over all HBM memory locations.

Note

You can choose only one sub-test type per run. Combining multiple sub-tests in one run is not allowed and returns an error.
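The one-selector rule can be checked before a run is launched. The following is a minimal sketch, assuming a hypothetical wrapper script: the function names count_hbm_selectors and check_single_selector are illustrative and are not part of hl_qual. The selector switches are the ones listed in the command syntax in this section.

```shell
# Hypothetical helper: count how many HBM sub-test selectors appear in an argument list.
count_hbm_selectors() {
    count=0
    for arg in "$@"; do
        case "$arg" in
            -hbm_dma_stress|-hbm_tpc_stress|-full_hbm_data_check_test)
                count=$((count + 1)) ;;
        esac
    done
    echo "$count"
}

# Succeeds only when exactly one selector is present.
check_single_selector() {
    n=$(count_hbm_selectors "$@")
    if [ "$n" -ne 1 ]; then
        echo "error: expected exactly one HBM sub-test selector, got $n" >&2
        return 1
    fi
}

# This argument list mixes two selectors, so the check rejects it
# before hl_qual would be invoked.
check_single_selector -gaudi2 -c all -hbm_dma_stress -hbm_tpc_stress || echo "rejected"
```

The check only mirrors the documented rule on the caller's side; hl_qual itself still reports the error if the rule is violated.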

You can run the HBM stress test iteratively, with each iteration comprising the following stages:

  • Running one of the sub-tests described above.

  • Performing ECC error readout, which includes the identification of single correctable errors and double non-correctable errors. This stage can be skipped using the -skip_val switch.

  • Performing a device reset. Each testing process resets only the device being tested. This stage can be skipped using the -skip_rst switch.

The number of iterations is not bounded; however, a large number of iterations can result in a very long running time.

Use the -i switch to set the number of iterations.

Note

The duration of each iteration (including the test run, error readout, and reset) is approximately 5-6 minutes.
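Because the iteration count is unbounded, it can help to estimate the total running time before choosing a value for -i. The following is a rough sketch that assumes the 5-6 minutes per iteration quoted in the note; the function name estimate_minutes is illustrative.

```shell
# Rough wall-clock estimate, assuming 5-6 minutes per iteration.
# Prints a "low-high" range in minutes for the given iteration count.
estimate_minutes() {
    iterations=$1
    echo "$((iterations * 5))-$((iterations * 6))"
}

echo "2 iterations: about $(estimate_minutes 2) minutes"
```

Actual durations depend on the system, so treat the result as a planning figure only.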

Precondition

The habanalabs driver must be loaded with the timeout_locked=0 parameter to use the plugin:

sudo rmmod habanalabs

sudo modprobe habanalabs timeout_locked=0
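After reloading the driver, the active value can be verified. The following sketch assumes that the driver exposes the parameter under /sys/module/habanalabs/parameters/timeout_locked, which is the usual location for readable kernel module parameters but is not confirmed by this document. The function name check_timeout_locked is illustrative, and the path can be overridden with a second argument.

```shell
# Assumption: the loaded driver exposes timeout_locked through sysfs.
# Usage: check_timeout_locked <expected value> [parameter file]
check_timeout_locked() {
    expected=$1
    param_file=${2:-/sys/module/habanalabs/parameters/timeout_locked}
    if [ ! -r "$param_file" ]; then
        echo "cannot read $param_file (is the driver loaded?)" >&2
        return 2
    fi
    actual=$(cat "$param_file")
    if [ "$actual" = "$expected" ]; then
        echo "timeout_locked=$actual (ok)"
    else
        echo "timeout_locked=$actual, expected $expected" >&2
        return 1
    fi
}
```

For a Gaudi 2 run, calling check_timeout_locked 0 before starting hl_qual would confirm that the precondition holds.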

HBM Stress Test - Pass/Fail Criteria

The following pass/fail criteria are verified per test iteration:

  • The chosen HBM sub-test (HBM_DMA_STRESS, HBM_TPC_STRESS or HBM_FULL_DATA_CHECK) passed.

  • No ECC error was detected during the test.

  • The device reset process (when applicable) finished successfully.

A failure in any test iteration causes the full test to fail.
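The way the per-iteration criteria combine into an overall verdict can be modeled in a few lines. This is an illustrative sketch of the logic only, not hl_qual's implementation; the function names and the pass/fail/skipped status strings are assumptions made for the example.

```shell
# Illustrative model of the criteria; not hl_qual's actual implementation.
# Arguments: sub-test status (pass|fail), ECC error count, reset status (ok|fail|skipped).
iteration_result() {
    if [ "$1" = "pass" ] && [ "$2" -eq 0 ] && [ "$3" != "fail" ]; then
        echo PASS
    else
        echo FAIL
    fi
}

# The full test fails as soon as any single iteration has failed.
overall_result() {
    for r in "$@"; do
        [ "$r" = "PASS" ] || { echo FAIL; return 0; }
    done
    echo PASS
}

first=$(iteration_result pass 0 ok)
second=$(iteration_result pass 1 skipped)   # one ECC error was detected
overall_result "$first" "$second"           # prints FAIL
```

A skipped reset (-skip_rst) does not count against the iteration, which matches the "when applicable" wording of the reset criterion.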

First-gen Gaudi HBM Stress Plugin Design Consideration and Responsibilities

The HBM stress plugin is a stress test based on memory transfers using DMA. Each test iteration consists of the following stages:

  • Running the DMA all2all HBM super stress.

  • Performing SERR/DERR counters readout to check whether a memory error occurred during the HBM test. This stage can be skipped using the -skip_val switch.

  • Performing a device reset. Each testing process resets only the device being tested. This stage can be skipped using the -skip_rst switch.

Use the -i switch to set the number of iterations.

Note

The duration of each iteration (including the test run, error readout, and reset) is approximately 5-6 minutes.

Preconditions

  • The habanalabs driver must be loaded with the timeout_locked=500 parameter to use the plugin:

    sudo rmmod habanalabs
    
    sudo modprobe habanalabs timeout_locked=500
    
  • The test must be run with sudo privileges.
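The two device generations require different timeout_locked values (0 for Gaudi 2, 500 for first-gen Gaudi), so a small lookup can prevent loading the driver with the wrong one. The following is a dry-run sketch: it prints the reload commands rather than executing them. The function names are illustrative, and the gaudi and gaudi2 keys simply mirror the -gaudi and -gaudi2 switches.

```shell
# Values taken from the Precondition sections: 0 for Gaudi 2, 500 for first-gen Gaudi.
timeout_locked_for() {
    case "$1" in
        gaudi2) echo 0 ;;
        gaudi)  echo 500 ;;
        *) echo "unknown device type: $1" >&2; return 1 ;;
    esac
}

# Dry run: print the driver reload commands instead of executing them.
print_reload_cmds() {
    value=$(timeout_locked_for "$1") || return 1
    echo "sudo rmmod habanalabs"
    echo "sudo modprobe habanalabs timeout_locked=$value"
}

print_reload_cmds gaudi
```

Printing the commands keeps the sketch safe to run on any machine; the output can be reviewed and then executed manually.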

HBM Stress Test - Pass/Fail Criteria

The pass/fail criteria consist of the following:

  • The DMA all2all super stress must finish successfully.

  • No SERR/DERR indication should be encountered during the DMA test.

  • The device reset process (when applicable) must finish successfully.

HBM Stress Test Plugin Switches and Parameters

For Gaudi 2:

hl_qual -gaudi2 -c <pci bus id> -rmod <parallel> [-dis_mon] [-mon_cfg <monitor INI path>]
                -hbm_dma_stress | -hbm_tpc_stress <read | write | read_write | full_rw> | -full_hbm_data_check_test [-i <number of iterations>] [-skip_rst] [-skip_val]

Switches and Parameters        Description
-hbm_dma_stress                HBM DMA stress test selector.
-hbm_tpc_stress                HBM TPC Stress engine test selector. Four sub-modes are available. Choose one sub-mode: read, write, read_write, full_rw. The default sub-mode is read_write. The sub-mode full_rw runs all applicable tests one after the other and only then performs the reset sequence.
-full_hbm_data_check_test      HBM full data memory check test selector.
-skip_rst                      Skips device reset after completing the HBM super stress test.
-skip_val                      Skips ECC check after completing the HBM super stress test.
-i                             Number of iterations for the test. Each iteration takes about 6 minutes.
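The sub-mode rules for -hbm_tpc_stress can be sketched as a small dry-run helper that validates the sub-mode and prints the resulting command line without running it. The function name build_tpc_stress_cmd is hypothetical, and the fixed -c all -rmod parallel arguments are simply borrowed from the examples in this section.

```shell
# Hypothetical dry-run helper: prints the hl_qual command line instead of running it.
# Usage: build_tpc_stress_cmd <iterations> [sub-mode]
build_tpc_stress_cmd() {
    iterations=$1
    submode=${2:-read_write}   # read_write is the documented default sub-mode
    case "$submode" in
        read|write|read_write|full_rw) ;;
        *) echo "unknown sub-mode: $submode" >&2; return 1 ;;
    esac
    echo "./hl_qual -gaudi2 -c all -rmod parallel -i $iterations -hbm_tpc_stress $submode"
}

build_tpc_stress_cmd 2 full_rw
```

When no sub-mode is given, the helper spells out read_write explicitly, which makes the default visible in logs and scripts.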

Each of the following example commands runs the test for 2 iterations:

./hl_qual -gaudi2 -c all -rmod parallel -i 2 -hbm_dma_stress
./hl_qual -gaudi2 -c all -rmod parallel -i 2 -hbm_dma_stress -skip_rst
./hl_qual -gaudi2 -c all -rmod parallel -i 2 -hbm_tpc_stress
./hl_qual -gaudi2 -c all -rmod parallel -i 2 -hbm_tpc_stress full_rw

For first-gen Gaudi:

sudo -E hl_qual -gaudi -c <pci bus id> -rmod <parallel> [-dis_mon] [-mon_cfg <monitor INI path>]
                 -hbm_stress [-i <number of iterations>] [-skip_rst] [-skip_val]

Switches and Parameters        Description
-hbm_stress                    HBM stress test selector.
-skip_rst                      Skips device reset after completing the HBM super stress test.
-skip_val                      Skips SERR/DERR check after completing the HBM super stress test.
-i                             Number of iterations for the test. Each iteration takes about 6 minutes.

Each of the following example commands runs the test for 2 iterations, which takes about 12 minutes:

sudo -E  ./hl_qual -gaudi -c all -rmod parallel -i 2 -hbm_stress
sudo -E  ./hl_qual -gaudi -c all -rmod parallel -i 2 -hbm_stress -skip_rst