Memory Stress Test Plugins Design, Switches and Parameters

This section describes plugin-specific switches only. Common plugin switches and parameters are described in hl_qual Common Plugin Switches and Parameters.

Gaudi 3/2 HBM Stress Plugin Design Consideration and Responsibilities

The HBM stress plugin verifies memory integrity under memory-transfer stress. The plugin consists of the following sub-test modes:

  • HBM_DMA_STRESS - Memory transactions (read/write) are executed using the DMA engines.

  • HBM_TPC_STRESS - Memory transactions (read/write) are executed using TPC engine load/store commands. You can run this sub-test in one of four sub-modes: read, write, read_write, or full_rw, which runs the other three sub-modes in the same test. The default sub-mode is read_write.

  • HBM_FULL_DATA_CHECK - The plugin performs a read and write pass over all HBM memory locations.

Note

  • You can choose only one sub-test type per run. Combining multiple sub-tests in a single run is not supported and returns an error.

  • (Gaudi 3) The HBM stress tests are low-level tests that use a special mode to ensure low latency and fast execution. This mode is not reflected in the utilization calculation of the hl-smi tool, so the tool reports 0% utilization.
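The one-sub-test-per-run rule can be checked before a run is launched. The following is a minimal sketch, not part of hl_qual: the helper names count_hbm_selectors and check_single_hbm_selector are hypothetical, and the selector names are taken from the usage synopsis later in this section.

```shell
# Count how many HBM sub-test selectors appear in an argument list.
# Hypothetical helper; selector names come from the usage synopsis.
count_hbm_selectors() {
    count=0
    for arg in "$@"; do
        case "$arg" in
            -hbm_dma_stress|-hbm_tpc_stress|-full_hbm_data_check_test)
                count=$((count + 1)) ;;
        esac
    done
    echo "$count"
}

# Succeed (exit status 0) only when exactly one selector is present.
check_single_hbm_selector() {
    [ "$(count_hbm_selectors "$@")" -eq 1 ]
}
```

A wrapper script could call check_single_hbm_selector "$@" and stop early with its own message instead of relying on the error returned by the tool.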

The HBM stress test runs iteratively, with each iteration comprising the following stages:

  • Running one of the sub-modes described above.

  • Performing ECC error readout, which identifies single (correctable) errors and double (uncorrectable) errors. This stage can be skipped using the -skip_val switch.

  • Performing a device reset. Each testing process resets only the device being tested. This stage can be skipped using the -skip_rst switch.
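To make the effect of the two skip switches concrete, the sketch below prints the stages one iteration would go through for a given set of switches. It only models the description above; hbm_iteration_stages is a hypothetical helper and does not call hl_qual.

```shell
# Print the stages of one iteration for a given set of switches.
# Hypothetical model of the stage list above; nothing is executed on a device.
hbm_iteration_stages() {
    skip_val=0
    skip_rst=0
    for arg in "$@"; do
        case "$arg" in
            -skip_val) skip_val=1 ;;
            -skip_rst) skip_rst=1 ;;
        esac
    done
    echo "run sub-test"
    [ "$skip_val" -eq 1 ] || echo "ECC error readout"
    [ "$skip_rst" -eq 1 ] || echo "device reset"
}

# Example: with -skip_rst, the reset stage is omitted.
hbm_iteration_stages -i 2 -hbm_dma_stress -skip_rst
```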

You can set the number of iterations to execute using the -i switch. For Gaudi 2, each iteration (including the test run, error readout, and reset) takes approximately 5-6 minutes. For Gaudi 3, the duration depends on the testing mode. See HBM Stress Test Plugin Switches and Parameters.

Note

The number of iterations is not bounded; however, a large number of iterations increases the total running time.

Prerequisites

For Gaudi 2 only: Before running this test, load the driver:

sudo modprobe habanalabs timeout_locked=0
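After loading the driver, it can be useful to confirm which timeout_locked value is in effect. The sketch below assumes the driver exposes the parameter through the standard Linux location /sys/module/habanalabs/parameters/timeout_locked; this path is an assumption and may not exist on every driver build, in which case the helper prints "unknown". The helper name read_module_param is hypothetical.

```shell
# Print the value stored in a sysfs-style parameter file, or "unknown" if unreadable.
# Hypothetical helper; it only reads a file and changes nothing.
read_module_param() {
    param_file=$1
    if [ -r "$param_file" ]; then
        cat "$param_file"
    else
        echo "unknown"
    fi
}

# Assumed location of the parameter; verify it on your system.
PARAM_FILE=/sys/module/habanalabs/parameters/timeout_locked
echo "timeout_locked: $(read_module_param "$PARAM_FILE")"
```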

HBM Stress Test - Pass/Fail Criteria

The pass/fail criteria are verified per test iteration and consist of the following:

  • The selected HBM sub-test passed (HBM_DMA_STRESS, HBM_TPC_STRESS or HBM_FULL_DATA_CHECK).

  • No ECC error was detected during the test.

  • The device reset process, when applicable, completes successfully.

If any of the test iterations fail, the full plugin test run fails.
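The criteria above amount to a simple conjunction per iteration, followed by an all-must-pass rule for the run. The sketch below models that logic only; the function names and the status labels ("ok", "failed", "skipped", "pass", "fail") are hypothetical and are not hl_qual output.

```shell
# iteration_passed <subtest_status> <ecc_error_count> <reset_status>
# Status labels are hypothetical; "skipped" models a run with -skip_rst.
iteration_passed() {
    [ "$1" = "ok" ] || return 1                          # selected sub-test passed
    [ "$2" -eq 0 ] || return 1                           # no ECC error detected
    [ "$3" = "ok" ] || [ "$3" = "skipped" ] || return 1  # reset succeeded or was skipped
    return 0
}

# run_passed <verdict>...
# Each verdict is "pass" or "fail"; a single "fail" fails the whole run.
run_passed() {
    for verdict in "$@"; do
        [ "$verdict" = "pass" ] || return 1
    done
    return 0
}
```

For example, run_passed pass fail pass returns a non-zero status, matching the rule that one failed iteration fails the full plugin run.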

First-gen Gaudi HBM Stress Plugin Design Consideration and Responsibilities

The HBM stress plugin is a stress test based on memory transfers using DMA. Each test iteration includes the following stages:

  • Running the DMA all2all HBM super stress.

  • Performing SERR/DERR counters readout to check if a memory error happened during the HBM test. This stage can be skipped using the -skip_val switch.

  • Performing a device reset. Each testing process resets only the device being tested. This stage can be skipped using the -skip_rst switch.

You can set the number of iterations to execute using the -i switch. Each iteration (including the test run, error readout, and reset) takes approximately 5-6 minutes.

Note

The number of iterations is not bounded; however, a large number of iterations increases the total running time.

Prerequisites

  • The habanalabs driver must be loaded with the timeout_locked=500 flag:

    sudo rmmod habanalabs
    
    sudo modprobe habanalabs timeout_locked=500
    
  • The test must be run with sudo privileges.

HBM Stress Test - Pass/Fail Criteria

The pass/fail criteria consist of the following:

  • The DMA all2all super stress must finish successfully.

  • No SERR/DERR indication should be encountered during the DMA test.

  • The device reset process, when applicable, completes successfully.

HBM Stress Test Plugin Switches and Parameters

Gaudi 3

hl_qual -gaudi3 -c <pci bus id> -rmod <parallel> [-dis_mon] [-mon_cfg <monitor INI path>]
         -hbm_dma_stress | -hbm_tpc_stress <read | write | read_write | full_rw> | -full_hbm_data_check_test [-i <number of iterations>] [-skip_rst] [-skip_val]

Switches and Parameters

Description

-hbm_dma_stress

HBM DMA stress test selector.

-hbm_tpc_stress

HBM TPC stress engine test selector.

Four sub-modes are available to select: read, write, read_write, full_rw. The default sub-mode is read_write. The sub-mode full_rw runs all applicable tests one after the other and only then performs the reset sequence.

-full_hbm_data_check_test

HBM full data memory check test selector.

-skip_rst

Skips device reset after completing the HBM super stress test.

-skip_val

Skips ECC check after completing the HBM super stress test.

-i

Number of iterations for the test. The duration is as follows:

  • HBM DMA stress test - 80 seconds per iteration.

  • HBM TPC stress engine test - 30 seconds per iteration.
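The per-iteration figures above give a rough lower bound on total run time. The sketch below multiplies them by the iteration count; estimate_gaudi3_seconds is a hypothetical helper, and the source does not state whether the figures include the readout and reset stages or how full_rw compares, so treat the result as approximate.

```shell
# estimate_gaudi3_seconds <dma|tpc> <iterations>
# Uses the per-iteration figures listed above (80 s for DMA, 30 s for TPC).
# Hypothetical helper; the result is a rough estimate only.
estimate_gaudi3_seconds() {
    case "$1" in
        dma) per_iter=80 ;;
        tpc) per_iter=30 ;;
        *) echo "unknown mode: $1" >&2; return 1 ;;
    esac
    echo $(( per_iter * $2 ))
}

estimate_gaudi3_seconds dma 2   # prints 160
```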

The following examples run the test for two iterations:

./hl_qual -gaudi3 -c all -rmod parallel -i 2 -hbm_dma_stress
./hl_qual -gaudi3 -c all -rmod parallel -i 2 -hbm_dma_stress -skip_rst
./hl_qual -gaudi3 -c all -rmod parallel -i 2 -hbm_tpc_stress
./hl_qual -gaudi3 -c all -rmod parallel -i 2 -hbm_tpc_stress full_rw
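As an alternative to full_rw, the three individual TPC sub-modes can be run as separate invocations. The sketch below only prints the command lines (a dry run) so they can be reviewed first; print_tpc_commands is a hypothetical helper, and the switches are copied from the examples above.

```shell
# Print one hl_qual command line per TPC sub-mode.
# Dry run: the commands are echoed, not executed.
print_tpc_commands() {
    iterations=$1
    for mode in read write read_write; do
        echo "./hl_qual -gaudi3 -c all -rmod parallel -i $iterations -hbm_tpc_stress $mode"
    done
}

print_tpc_commands 2
```

Note the difference in behavior: as described above, full_rw runs the sub-modes one after the other and only then performs the reset sequence, whereas separate invocations each go through their own readout and reset stages unless -skip_val or -skip_rst is added.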

Gaudi 2

hl_qual -gaudi2 -c <pci bus id> -rmod <parallel> [-dis_mon] [-mon_cfg <monitor INI path>]
         -hbm_dma_stress | -hbm_tpc_stress <read | write | read_write | full_rw> | -full_hbm_data_check_test [-i <number of iterations>] [-skip_rst] [-skip_val]

Switches and Parameters

Description

-hbm_dma_stress

HBM DMA stress test selector.

-hbm_tpc_stress

HBM TPC stress engine test selector.

Four sub-modes are available to select: read, write, read_write, full_rw. The default sub-mode is read_write. The sub-mode full_rw runs all applicable tests one after the other and only then performs the reset sequence.

-full_hbm_data_check_test

HBM full data memory check test selector.

-skip_rst

Skips device reset after completing the HBM super stress test.

-skip_val

Skips ECC check after completing the HBM super stress test.

-i

Number of iterations for the test. Each iteration takes about 6 minutes.

The following examples run the test for two iterations:

./hl_qual -gaudi2 -c all -rmod parallel -i 2 -hbm_dma_stress
./hl_qual -gaudi2 -c all -rmod parallel -i 2 -hbm_dma_stress -skip_rst
./hl_qual -gaudi2 -c all -rmod parallel -i 2 -hbm_tpc_stress
./hl_qual -gaudi2 -c all -rmod parallel -i 2 -hbm_tpc_stress full_rw

First-gen Gaudi

sudo -E hl_qual -gaudi -c <pci bus id> -rmod <parallel> [-dis_mon] [-mon_cfg <monitor INI path>]
        -hbm_stress [-i <number of iterations>] [-skip_rst] [-skip_val]

Switches and Parameters

Description

-hbm_stress

HBM stress test selector.

-skip_rst

Skips device reset after completing the HBM super stress test.

-skip_val

Skips SERR/DERR check after completing the HBM super stress test.

-i

Number of iterations for the test. Each iteration takes about 6 minutes.

The following examples run the test for two iterations:

sudo -E ./hl_qual -gaudi -c all -rmod parallel -i 2 -hbm_stress
sudo -E ./hl_qual -gaudi -c all -rmod parallel -i 2 -hbm_stress -skip_rst