Memory Stress Test Plugins Design, Switches and Parameters

This section describes plugin-specific switches only. Common plugin switches and parameters are described in hl_qual Common Plugin Switches and Parameters.

Gaudi 3/2 HBM Stress Plugin Design Consideration and Responsibilities

The HBM stress plugin verifies memory integrity under memory-transfer stress. The plugin consists of the following sub-test modes:

  • HBM_DMA_STRESS - Memory transactions (read/write) are executed using the DMA engines.

  • HBM_TPC_STRESS - Memory transactions (read/write) are executed using TPC engine load/store commands. You can run this sub-test in four sub-modes: read, write, read_write, and full_rw, which runs the other three sub-modes in the same test. The default sub-mode is read_write.

  • HBM_FULL_DATA_CHECK - The plugin performs a read and write pass over all HBM memory locations.

Note

  • You can choose only one sub-test type per run. Combining multiple sub-tests in a single run is not supported and returns an error.

  • (Gaudi 3) The HBM stress tests are low-level tests that use a special mode that ensures low latency and fast execution. This mode leaves no trace in the utilization calculation, so the hl-smi tool reports 0% utilization.
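The one-sub-test-per-run rule can be checked before starting a long run. The following is a minimal sketch of a hypothetical pre-flight check: the function names count_hbm_selectors and check_single_selector are illustrative and not part of hl_qual. It only inspects an argument list for the three selector switches documented for this plugin and never invokes hl_qual itself.

```shell
#!/bin/sh
# Hypothetical pre-flight check (not part of hl_qual): count the HBM
# sub-test selectors in an argument list and accept it only when exactly
# one selector is present.

count_hbm_selectors() {
    count=0
    for arg in "$@"; do
        case "$arg" in
            -hbm_dma_stress|-hbm_tpc_stress|-full_hbm_data_check_test)
                count=$((count + 1)) ;;
        esac
    done
    echo "$count"
}

check_single_selector() {
    n=$(count_hbm_selectors "$@")
    if [ "$n" -ne 1 ]; then
        echo "error: expected exactly one HBM sub-test selector, got $n" >&2
        return 1
    fi
    return 0
}

# Validate an argument list without running the test.
if check_single_selector -gaudi3 -c all -rmod parallel -i 2 -hbm_tpc_stress full_rw; then
    echo "arguments contain exactly one sub-test selector"
fi
```

A wrapper script could call check_single_selector "$@" first and pass the arguments on to hl_qual only when the check succeeds.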

The HBM stress test runs iteratively, with each iteration comprising the following stages:

  • Running one of the sub-modes described above.

  • Performing ECC error readout, which includes the identification of single correctable errors and double non-correctable errors. This stage can be skipped using the -skip_val switch.

  • Performing a device reset. Each testing process resets only the device being tested. This stage can be skipped using the -skip_rst switch.

You can set the number of iterations to execute using the -i switch. Each iteration (including the test run, error readout, and reset) takes approximately 5-6 minutes.

Note

The number of iterations is unbounded; however, a large number of iterations increases the total running time.
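To get a feel for how the iteration count translates into wall-clock time, the per-iteration figure quoted above can be multiplied out. This is a rough sketch only: the helper name estimate_minutes is hypothetical, and actual run times depend on the system.

```shell
#!/bin/sh
# Rough run-time estimate from the "approximately 5-6 minutes per
# iteration" figure in this section. estimate_minutes is an illustrative
# helper, not an hl_qual feature.

estimate_minutes() {
    iterations=$1
    echo "$((iterations * 5))-$((iterations * 6))"
}

echo "2 iterations: about $(estimate_minutes 2) minutes"
echo "10 iterations: about $(estimate_minutes 10) minutes"
```

For example, 10 iterations come out at roughly 50-60 minutes under this estimate.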

Prerequisites

Before running this test on Gaudi 2, the habanalabs driver must be loaded with the timeout_locked=0 flag:

sudo rmmod habanalabs

sudo modprobe habanalabs timeout_locked=0
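After reloading the driver, it can be useful to confirm that the option took effect. The sketch below assumes that the habanalabs module exposes timeout_locked as a readable parameter under the standard Linux sysfs location /sys/module/<module>/parameters/; that is an assumption, not something this document confirms. If the file is not readable, the check says so rather than guessing. The function name check_timeout_locked is hypothetical.

```shell
#!/bin/sh
# Assumption: the driver exposes timeout_locked as a readable sysfs module
# parameter at the path below. If it does not, verify the driver options
# by other means.
PARAM_FILE=${PARAM_FILE:-/sys/module/habanalabs/parameters/timeout_locked}

# check_timeout_locked EXPECTED [FILE]
# Returns 0 on match, 1 on mismatch, 2 when the file cannot be read.
check_timeout_locked() {
    expected=$1
    file=${2:-$PARAM_FILE}
    if [ ! -r "$file" ]; then
        echo "cannot read $file; verify the driver options manually" >&2
        return 2
    fi
    actual=$(cat "$file")
    if [ "$actual" = "$expected" ]; then
        echo "timeout_locked=$actual (ok)"
        return 0
    fi
    echo "timeout_locked=$actual, expected $expected" >&2
    return 1
}

# Gaudi 2 expects 0 according to the prerequisite above.
check_timeout_locked 0 || echo "check the driver options before running the test"
```

The expected value is passed as an argument so the same check can be reused with a different value where the prerequisites call for one.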

HBM Stress Test - Pass/Fail Criteria

The pass/fail criteria are verified per test iteration and consist of the following:

  • The selected HBM sub-test passed (HBM_DMA_STRESS, HBM_TPC_STRESS or HBM_FULL_DATA_CHECK).

  • No ECC error is detected during the test.

  • The device reset process, when applicable, completes successfully.

If any test iteration fails, the full plugin test run fails.
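The criteria above amount to a logical AND within each iteration and across all iterations. The following sketch models that logic for illustration only; it is not hl_qual's internal implementation. The function names, the "subtest:ecc:reset" result encoding, and the use of the word skip for stages disabled with -skip_val or -skip_rst are all hypothetical.

```shell
#!/bin/sh
# Illustrative model of the pass/fail rules (not hl_qual's own logic).
#
# iteration_passed SUBTEST_RC ECC_ERRORS RESET_RC
#   SUBTEST_RC  exit status of the selected sub-test (0 = passed)
#   ECC_ERRORS  number of ECC errors read out, or "skip"
#   RESET_RC    exit status of the device reset, or "skip"
iteration_passed() {
    subtest_rc=$1
    ecc_errors=$2
    reset_rc=$3
    [ "$subtest_rc" -eq 0 ] || return 1
    if [ "$ecc_errors" != "skip" ] && [ "$ecc_errors" -ne 0 ]; then
        return 1
    fi
    if [ "$reset_rc" != "skip" ] && [ "$reset_rc" -ne 0 ]; then
        return 1
    fi
    return 0
}

# run_passed RESULT...
# Each RESULT is "subtest_rc:ecc_errors:reset_rc" for one iteration.
# The run passes only if every iteration passes.
run_passed() {
    for result in "$@"; do
        subtest=${result%%:*}
        rest=${result#*:}
        ecc=${rest%%:*}
        reset=${rest#*:}
        iteration_passed "$subtest" "$ecc" "$reset" || return 1
    done
    return 0
}

# Two clean iterations (the second with the reset skipped): prints PASS.
if run_passed "0:0:0" "0:0:skip"; then echo "PASS"; else echo "FAIL"; fi
# One ECC error in the second iteration fails the whole run: prints FAIL.
if run_passed "0:0:0" "0:1:0"; then echo "PASS"; else echo "FAIL"; fi
```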

First-gen Gaudi HBM Stress Plugin Design Consideration and Responsibilities

The HBM stress plugin is a stress test based on memory transfers using DMA. Each test iteration comprises the following stages:

  • Running the DMA all2all HBM super stress.

  • Performing SERR/DERR counters readout to check if a memory error happened during the HBM test. This stage can be skipped using the -skip_val switch.

  • Performing a device reset. Each testing process resets only the device being tested. This stage can be skipped using the -skip_rst switch.

You can set the number of iterations to execute using the -i switch. Each iteration (including the test run, error readout, and reset) takes approximately 5-6 minutes.

Note

The number of iterations is unbounded; however, a large number of iterations increases the total running time.

Prerequisites

  • The habanalabs driver must be loaded with the timeout_locked=500 flag:

    sudo rmmod habanalabs
    
    sudo modprobe habanalabs timeout_locked=500
    
  • The test must be run with sudo privileges.

HBM Stress Test - Pass/Fail Criteria

The pass/fail criteria consist of the following:

  • The DMA all2all super stress must finish successfully.

  • No SERR/DERR indication should be encountered during the DMA test.

  • The device reset process, when applicable, completes successfully.

HBM Stress Test Plugin Switches and Parameters

hl_qual -gaudi3 | -gaudi2 -c <pci bus id> -rmod <parallel> [-dis_mon] [-mon_cfg <monitor INI path>]
         -hbm_dma_stress | -hbm_tpc_stress <read | write | read_write | full_rw> | -full_hbm_data_check_test [-i <number of iterations>] [-skip_rst] [-skip_val]

Switches and Parameters     Description

-hbm_dma_stress             HBM DMA stress test selector.

-hbm_tpc_stress             HBM TPC stress test selector. Four sub-modes are available: read, write, read_write, and full_rw. The default sub-mode is read_write. The full_rw sub-mode runs all applicable tests one after the other and only then performs the reset sequence.

-full_hbm_data_check_test   HBM full data memory check test selector.

-skip_rst                   Skips device reset after completing the HBM super stress test.

-skip_val                   Skips ECC check after completing the HBM super stress test.

-i                          Number of iterations for the test. Each iteration takes about 6 minutes.

The following examples each run the test for two iterations:

./hl_qual -gaudi3 -c all -rmod parallel -i 2 -hbm_dma_stress
./hl_qual -gaudi3 -c all -rmod parallel -i 2 -hbm_dma_stress -skip_rst
./hl_qual -gaudi3 -c all -rmod parallel -i 2 -hbm_tpc_stress
./hl_qual -gaudi3 -c all -rmod parallel -i 2 -hbm_tpc_stress full_rw
./hl_qual -gaudi2 -c all -rmod parallel -i 2 -hbm_dma_stress
./hl_qual -gaudi2 -c all -rmod parallel -i 2 -hbm_dma_stress -skip_rst
./hl_qual -gaudi2 -c all -rmod parallel -i 2 -hbm_tpc_stress
./hl_qual -gaudi2 -c all -rmod parallel -i 2 -hbm_tpc_stress full_rw

For first-gen Gaudi, the plugin uses the following syntax:

sudo -E hl_qual -gaudi -c <pci bus id> -rmod <parallel> [-dis_mon] [-mon_cfg <monitor INI path>]
        -hbm_stress [-i <number of iterations>] [-skip_rst] [-skip_val]

Switches and Parameters     Description

-hbm_stress                 HBM stress test selector.

-skip_rst                   Skips device reset after completing the HBM super stress test.

-skip_val                   Skips SERR/DERR check after completing the HBM super stress test.

-i                          Number of iterations for the test. Each iteration takes about 6 minutes.

The following example commands each run the test for two iterations, which takes about 12 minutes:

sudo -E ./hl_qual -gaudi -c all -rmod parallel -i 2 -hbm_stress
sudo -E ./hl_qual -gaudi -c all -rmod parallel -i 2 -hbm_stress -skip_rst