Memory Stress Test Plugins Design, Switches and Parameters

This section describes the plugin-specific switches. The common switches are not covered in detail, although they appear in the command examples for completeness. For the common plugin switches and parameters, refer to hl_qual Common Plugin Switches and Parameters.

First-gen Gaudi HBM Stress Plugin Design Consideration and Responsibilities

The HBM stress plugin is a stress test based on memory transfers using DMA. Each test iteration includes the following stages:

  • Run the DMA all2all HBM super stress.

  • Read out the SERR/DERR counters to check whether a memory error occurred during the HBM test.

  • Perform a device reset. Each testing process will reset just the device it is testing. This stage can be skipped by the user.

Use the -i switch to set how many times the above iteration is executed.

Note

The test must be run with superuser permissions, and the habanalabs driver must be loaded with the correct flags in order to use the plugin. Each iteration (including the test run, error readout and reset) takes approximately 5-6 minutes.

Pass/fail Criteria

The pass/fail criteria are as follows:

  • The DMA all2all super stress must finish successfully.

  • No SERR/DERR indication should be encountered during the DMA test.

  • The device reset process (when applicable) must finish with success.

Gaudi2 HBM Stress Plugin Design Consideration and Responsibilities

The HBM stress plugin verifies memory integrity under memory transfer stress using three sub tests:

  • HBM_DMA_STRESS - the memory transactions (read/write) are executed using the DMA engines.

  • HBM_TPC_STRESS - the memory transactions (read/write) are executed using TPC engine load/store commands.

  • HBM_FULL_DATA_CHECK - the plugin performs a read and write pass on all HBM memory locations.

Only one sub test type may be selected per run. Combining several sub tests in one run is not allowed and returns an error indication.
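A wrapper script can reject a multi-selector invocation before hl_qual is started. The following is a minimal sketch in bash. The helper names count_selectors and has_single_selector are hypothetical; only the three selector switch names come from this section.

```shell
#!/usr/bin/env bash
# Hypothetical helper: count how many Gaudi2 HBM selector switches
# appear in an argument list. The switch names come from this section.
count_selectors() {
    local count=0 arg
    for arg in "$@"; do
        case "$arg" in
            -hbm_dma_stress|-hbm_tpc_stress|-full_hbm_data_check_test)
                count=$((count + 1))
                ;;
        esac
    done
    echo "$count"
}

# Hypothetical helper: succeed only when exactly one selector is present.
has_single_selector() {
    [ "$(count_selectors "$@")" -eq 1 ]
}
```

For example, has_single_selector -gaudi2 -c all -hbm_dma_stress succeeds, while passing both -hbm_dma_stress and -hbm_tpc_stress makes it return a non-zero status.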

The test may be run in an iterative manner where each iteration includes the following stages:

  • Run one of the sub tests described above.

  • Perform ECC error readout (single correctable errors, double non-correctable errors).

  • Perform a device reset. Each testing process will reset just the device it is testing. This stage can be skipped by the user.

The number of iterations is not bounded, but note that a large number of iterations can result in a very long running time.

Use the -i switch to set how many times the above iteration is executed.

Note

The test must be run with superuser permissions, and the habanalabs driver must be loaded with the correct flags in order to use the plugin. Each iteration (including the test run, error readout and reset) takes approximately 5-6 minutes.
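Because the run time grows linearly with the iteration count, it can help to estimate it before choosing a value for -i. The sketch below is a hypothetical bash helper (estimate_minutes is not part of hl_qual) that multiplies the iteration count by the approximate 5-6 minutes per iteration quoted in this section.

```shell
#!/usr/bin/env bash
# Hypothetical helper: estimate the total run time in minutes for a given
# iteration count, assuming roughly 5-6 minutes per iteration as quoted
# in this section. Prints a "low-high" range.
estimate_minutes() {
    local iterations=$1
    echo "$((iterations * 5))-$((iterations * 6))"
}
```

With these assumptions, estimate_minutes 2 prints 10-12, which matches the roughly 12 minutes quoted for a 2-iteration run.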

Pass/fail Criteria

The pass/fail criteria, which are verified per test iteration, are as follows:

  • The chosen HBM test (HBM_DMA_STRESS, HBM_TPC_STRESS or HBM_FULL_DATA_CHECK) passes.

  • No ECC error is detected during the test.

  • The device reset process (when applicable) must finish with success.

A failure in any test iteration causes the full test to fail.
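The aggregation rule can be illustrated with a short bash sketch. The helper overall_result is hypothetical and does not reflect how hl_qual reports results internally. It takes one "pass" or "fail" value per iteration and, as an added assumption, treats an empty result list as a failure.

```shell
#!/usr/bin/env bash
# Hypothetical helper: combine per-iteration results ("pass" or "fail")
# into one verdict. Any value other than "pass" fails the whole test.
overall_result() {
    local result
    if [ "$#" -eq 0 ]; then
        echo "fail"   # assumption: no iterations means nothing was verified
        return 1
    fi
    for result in "$@"; do
        if [ "$result" != "pass" ]; then
            echo "fail"
            return 1
        fi
    done
    echo "pass"
}
```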

HBM Stress Test Plugin Switches and Parameters

Some of the HBM stress tests have preconditions that must be met before running the test. The preconditions differ between first-gen Gaudi and Gaudi2 devices:

  • Habanalabs driver must be loaded with timeout_locked=500

    sudo rmmod habanalabs
    
    sudo modprobe habanalabs timeout_locked=500
    
  • The test must be run with sudo privileges.

  • Habanalabs driver must be loaded with timeout_locked=0

    sudo rmmod habanalabs
    
    sudo modprobe habanalabs timeout_locked=0
    
  • The test must be run with sudo privileges.
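Before starting a long run, a script can confirm that the driver was loaded with the expected timeout_locked value. The sketch below is a hypothetical bash helper, not part of hl_qual. It assumes the parameter is readable under /sys/module/habanalabs/parameters/, which is where Linux exposes readable module parameters. If your driver build does not expose it there, pass a different file as the second argument.

```shell
#!/usr/bin/env bash
# Hypothetical helper: compare the driver's timeout_locked value with the
# value expected for the device under test.
# Assumption: the parameter is readable under /sys/module/habanalabs/parameters/;
# pass another file as the second argument if it is not.
check_timeout_locked() {
    local expected=$1
    local param_file=${2:-/sys/module/habanalabs/parameters/timeout_locked}
    local actual
    if [ ! -r "$param_file" ]; then
        echo "cannot read $param_file (is the driver loaded?)" >&2
        return 2
    fi
    actual=$(cat "$param_file")
    if [ "$actual" != "$expected" ]; then
        echo "timeout_locked is $actual, expected $expected" >&2
        return 1
    fi
}
```

The expected value is an argument because it depends on the device type. For example, check_timeout_locked 500 returns zero only when the loaded driver reports 500.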

First-gen Gaudi and Gaudi2 test variants differ in the test selector switches as shown in the table below:

Device Type        Applicable Selector Switch
First-gen Gaudi    hbm_stress
Gaudi2             hbm_dma_stress, hbm_tpc_stress, full_hbm_data_check_test

    sudo -E hl_qual -gaudi -c <pci bus id> -rmod <parallel> [-dis_mon] [-mon_cfg <monitor INI path>]
                    -hbm_stress [-i <number of iterations>] [-pl_cfg <plugin INI config path>] [-skip_rst] [-skip_val]

  • -hbm_stress - HBM stress test selector.

The example commands below run the test for 2 iterations, which takes about 12 minutes. The second command skips the device reset:

    sudo -E ./hl_qual -gaudi -c all -rmod parallel -i 2 -hbm_stress
    sudo -E ./hl_qual -gaudi -c all -rmod parallel -i 2 -hbm_stress -skip_rst

    sudo -E hl_qual -gaudi2 -c <pci bus id> -rmod <parallel> [-dis_mon] [-mon_cfg <monitor INI path>]
                    -hbm_dma_stress|-hbm_tpc_stress|-full_hbm_data_check_test [-i <number of iterations>] [-skip_rst] [-skip_val]

  • -hbm_dma_stress - HBM DMA stress engine test selector.

  • -hbm_tpc_stress - HBM TPC stress engine test selector.

  • -full_hbm_data_check_test - HBM full data memory check test selector.

The example commands below run the test for 2 iterations:

    sudo -E ./hl_qual -gaudi2 -c all -rmod parallel -i 2 -hbm_dma_stress
    sudo -E ./hl_qual -gaudi2 -c all -rmod parallel -i 2 -hbm_dma_stress -skip_rst
    sudo -E ./hl_qual -gaudi2 -c all -rmod parallel -i 2 -hbm_tpc_stress

The following switches appear in both the first-gen Gaudi and Gaudi2 command syntax:

  • -skip_rst - Skips the device reset after completing the HBM super stress test.

  • -skip_val - Skips the SERR/DERR check after completing the HBM super stress test.

  • -i - Number of iterations for the test. Each iteration takes about 6 minutes.
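To show how the device flag, selector switch and optional switches fit together, the sketch below assembles a command line in the same shape as the examples in this section and prints it without running it. The helper build_hbm_cmd is hypothetical. It hard-codes -c all and -rmod parallel as in the examples and rejects selector and device pairs that the selector table does not list.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print (without running) an hl_qual HBM stress command.
# Arguments: device flag (gaudi or gaudi2), selector switch name, iteration
# count, and an optional "skip_rst" to add the -skip_rst switch.
build_hbm_cmd() {
    local device=$1 selector=$2 iterations=$3 skip=${4:-}
    # Accept only the device/selector pairs listed in this section.
    case "$device:$selector" in
        gaudi:hbm_stress) ;;
        gaudi2:hbm_dma_stress|gaudi2:hbm_tpc_stress|gaudi2:full_hbm_data_check_test) ;;
        *)
            echo "selector $selector is not valid for $device" >&2
            return 1
            ;;
    esac
    local cmd="sudo -E ./hl_qual -$device -c all -rmod parallel -i $iterations -$selector"
    if [ "$skip" = "skip_rst" ]; then
        cmd="$cmd -skip_rst"
    fi
    echo "$cmd"
}
```

Printing the command first makes it easy to review the switches before committing to a run that takes several minutes per iteration.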