Memory Stress Test Plugins Design, Switches and Parameters
This section describes the plugin-specific switches. Common switches are not covered in detail here, although they appear in the command examples for completeness. For the common plugin switches and parameters, refer to hl_qual Common Plugin Switches and Parameters.
First-gen Gaudi HBM Stress Plugin Design Consideration and Responsibilities¶
The HBM stress plugin is a stress test based on memory transfers using DMA. Each test iteration consists of the following stages:
Run the DMA all2all HBM super stress.
Read out the SERR/DERR counters to check whether a memory error occurred during the HBM test.
Perform a device reset. Each testing process resets only the device it is testing. This stage can be skipped by the user.
You can change how many times the above iteration is executed by using the -i switch.
Note
The test must be run with superuser permissions, and the habanalabs driver must be loaded with the correct flags in order to use the plugin. The duration of each iteration (including the test run, error readout and reset) is approximately 5-6 minutes.
HBM Stress Test - Pass/Fail Criteria¶
The pass/fail criteria are as follows:
The DMA all2all super stress must finish successfully.
No SERR/DERR indication should be encountered during the DMA test.
The device reset process (when applicable) must finish with success.
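The three criteria above can be expressed as a small check. This is a minimal sketch and not part of hl_qual: the IterationResult record and its field names are hypothetical and only mirror the criteria listed here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class IterationResult:
    """Hypothetical record of one first-gen Gaudi HBM stress iteration."""
    dma_stress_ok: bool              # DMA all2all super stress finished successfully
    serr_count: int                  # SERR indications read out after the run
    derr_count: int                  # DERR indications read out after the run
    reset_ok: Optional[bool] = None  # None means the reset stage was skipped


def iteration_passed(result: IterationResult) -> bool:
    """Apply the three pass/fail criteria to one iteration."""
    if not result.dma_stress_ok:
        return False
    if result.serr_count or result.derr_count:
        return False
    # The reset criterion applies only when a reset was actually performed.
    return result.reset_ok is not False
```

A skipped reset is modeled as None so that it neither passes nor fails the reset criterion.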
Gaudi2 HBM Stress Plugin Design Consideration and Responsibilities¶
The HBM stress plugin verifies memory integrity under memory-transfer stress using three sub-tests. You can choose only one sub-test type per run. Combining multiple sub-tests in a single run is not allowed and returns an error indication.
HBM_DMA_STRESS - the memory transactions (read/write) are executed using the DMA engines.
HBM_TPC_STRESS - the memory transactions (read/write) are executed using TPC engine load/store commands.
HBM_FULL_DATA_CHECK - the plugin performs a read and write pass on all HBM memory locations.
With hbm_tpc_stress, four sub-modes are available: read, write, read_write, and full_rw, which runs the three other sub-modes in the same test.
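The selection rules above (exactly one sub-test per run, and one TPC sub-mode) can be sketched as a pre-flight check. This is an illustrative helper, not hl_qual code: the function names are hypothetical, the selector names follow the command syntax shown later on this page, and the order in which full_rw runs its sub-modes is an assumption.

```python
SUB_TESTS = ("hbm_dma_stress", "hbm_tpc_stress", "full_hbm_data_check_test")
TPC_SUB_MODES = ("read", "write", "read_write", "full_rw")


def choose_sub_test(requested):
    """Return the single requested sub-test, or raise if the choice is invalid."""
    requested = list(requested)
    if len(requested) != 1:
        raise ValueError("exactly one sub-test must be selected per run")
    if requested[0] not in SUB_TESTS:
        raise ValueError(f"unknown sub-test: {requested[0]}")
    return requested[0]


def expand_tpc_sub_mode(mode="read_write"):
    """List the TPC sub-modes that a given selection runs."""
    if mode not in TPC_SUB_MODES:
        raise ValueError(f"unknown TPC sub-mode: {mode}")
    if mode == "full_rw":
        # Execution order is an assumption; the text only says all three run.
        return ["read", "write", "read_write"]
    return [mode]
```

Rejecting a combined selection before launching mirrors the error indication the plugin is described as returning.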
The test can be run in an iterative manner where each iteration includes the following stages:
Run one of the sub-tests described above.
Perform an ECC error readout (single correctable errors and double non-correctable errors).
Perform a device reset. Each testing process will reset just the device it is testing. This stage can be skipped by the user.
The number of iterations is not bounded, but note that a large number of iterations can result in a very long running time.
You can change how many times the above iteration is executed by using the -i switch.
Note
The test must be run with superuser permissions, and the habanalabs driver must be loaded with the correct flags in order to use the plugin. The duration of each iteration (including the test run, error readout and reset) is approximately 5-6 minutes.
HBM Stress Test - Pass/Fail Criteria¶
The pass/fail criteria, which are verified per test iteration, are as follows:
The chosen HBM test passed (HBM_DMA_STRESS, HBM_TPC_STRESS or HBM_FULL_DATA_CHECK).
No ECC error was detected during the test.
The device reset process (when applicable) finished successfully.
A failure in any test iteration causes the full test to fail.
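The aggregation rule, where any failed iteration fails the full test, can be sketched as follows. The function name and the list-of-booleans input are hypothetical. The sketch only computes the overall verdict and makes no claim about whether hl_qual stops at the first failure.

```python
def summarize_run(iteration_results):
    """Return (passed, first_failed_iteration) for per-iteration verdicts.

    iteration_results is a list of booleans, one per iteration, in run order.
    Iterations are numbered from 1; the second element is None if all passed.
    """
    for index, passed in enumerate(iteration_results, start=1):
        if not passed:
            return False, index
    return True, None
```

For example, summarize_run([True, False, True]) reports a failed run and points at iteration 2.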
HBM Stress Test Plugin Switches and Parameters¶
Some of the HBM stress tests have preconditions that must be met before running the test. The preconditions differ between first-gen Gaudi and Gaudi2 devices:
First-gen Gaudi:
The habanalabs driver must be loaded with timeout_locked=500
sudo rmmod habanalabs
sudo modprobe habanalabs timeout_locked=500
The test must be run with sudo privileges.
Gaudi2:
The habanalabs driver must be loaded with timeout_locked=0
sudo rmmod habanalabs
sudo modprobe habanalabs timeout_locked=0
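The driver reload can be scripted by building the two commands as argument lists. This is a minimal sketch: the helper name is hypothetical, and the device-to-value mapping (500 for first-gen Gaudi, 0 for Gaudi2) is an assumption inferred from the order in which the preconditions are listed above.

```python
# Assumed mapping, inferred from the order in which the values are listed above.
TIMEOUT_LOCKED = {"gaudi": 500, "gaudi2": 0}


def driver_reload_commands(device):
    """Return the rmmod/modprobe argument lists for the given device type."""
    value = TIMEOUT_LOCKED[device]
    return [
        ["sudo", "rmmod", "habanalabs"],
        ["sudo", "modprobe", "habanalabs", f"timeout_locked={value}"],
    ]
```

Each returned list could be passed to subprocess.run(argv, check=True) so that a failed unload stops the sequence before the reload.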
First-gen Gaudi and Gaudi2 test variants differ in the test selector switches as shown in the table below:
Device Type | Applicable Selector Switch |
---|---|
First-gen Gaudi | hbm_stress |
Gaudi2 | hbm_dma_stress, hbm_tpc_stress, full_hbm_data_check_test |
sudo -E hl_qual -gaudi -c <pci bus id> -rmod <parallel> [-dis_mon] [-mon_cfg <monitor INI path>]
-hbm_stress [-i <number of iterations>] [-skip_rst] [-skip_val]
Switches and Parameters | Description |
---|---|
-hbm_stress | HBM stress test selector. |
-skip_rst | Skips device reset after completing the HBM super stress test. |
-skip_val | Skips the SERR/DERR check after completing the HBM super stress test. |
-i | Number of iterations for the test. Each iteration takes about 6 minutes. |
The following example commands run the test for 2 iterations, which takes about 12 minutes. The second command also skips the device reset:
sudo -E ./hl_qual -gaudi -c all -rmod parallel -i 2 -hbm_stress
sudo -E ./hl_qual -gaudi -c all -rmod parallel -i 2 -hbm_stress -skip_rst
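The run time scales linearly with the -i value. As a rough planning aid, the sketch below multiplies the iteration count by the 5-6 minutes per iteration quoted in this section. The function name is hypothetical and actual durations may vary.

```python
def estimated_minutes(iterations, low_per_iteration=5, high_per_iteration=6):
    """Return a (low, high) run-time estimate in minutes for a given -i value."""
    if iterations < 1:
        raise ValueError("iterations must be at least 1")
    return iterations * low_per_iteration, iterations * high_per_iteration
```

With 2 iterations the upper bound is 12 minutes, matching the figure given for the example commands.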
hl_qual -gaudi2 -c <pci bus id> -rmod <parallel> [-dis_mon] [-mon_cfg <monitor INI path>]
-hbm_dma_stress | -hbm_tpc_stress <read | write | read_write | full_rw> | -full_hbm_data_check_test [-i <number of iterations>] [-skip_rst] [-skip_val]
Switches and Parameters | Description |
---|---|
-hbm_dma_stress | HBM DMA stress test selector. |
-hbm_tpc_stress | HBM TPC stress engine test selector. Four sub-modes are available; choose one: read, write, read_write, full_rw. The default sub-mode is read_write. The full_rw sub-mode runs all applicable tests one after the other and only then performs the reset sequence. |
-full_hbm_data_check_test | HBM full data memory check test selector. |
-skip_rst | Skips device reset after completing the HBM super stress test. |
-skip_val | Skips the SERR/DERR check after completing the HBM super stress test. |
-i | Number of iterations for the test. Each iteration takes about 6 minutes. |
The following example commands each run the test for 2 iterations:
./hl_qual -gaudi2 -c all -rmod parallel -i 2 -hbm_dma_stress
./hl_qual -gaudi2 -c all -rmod parallel -i 2 -hbm_dma_stress -skip_rst
./hl_qual -gaudi2 -c all -rmod parallel -i 2 -hbm_tpc_stress
./hl_qual -gaudi2 -c all -rmod parallel -i 2 -hbm_tpc_stress full_rw
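When these commands are generated from a script, a small builder keeps the selector rules in one place. This is a sketch under stated assumptions: the function name and its parameters are hypothetical, it covers only the switches used in the examples above, and it follows the switch order of those examples.

```python
SELECTORS = ("hbm_dma_stress", "hbm_tpc_stress", "full_hbm_data_check_test")
TPC_SUB_MODES = ("read", "write", "read_write", "full_rw")


def build_gaudi2_command(selector, iterations=1, tpc_sub_mode=None,
                         skip_reset=False, devices="all"):
    """Assemble an hl_qual argument list for one Gaudi2 HBM stress run."""
    if selector not in SELECTORS:
        raise ValueError(f"unknown selector: {selector}")
    if tpc_sub_mode is not None:
        if selector != "hbm_tpc_stress":
            raise ValueError("a sub-mode applies only to hbm_tpc_stress")
        if tpc_sub_mode not in TPC_SUB_MODES:
            raise ValueError(f"unknown TPC sub-mode: {tpc_sub_mode}")
    argv = ["./hl_qual", "-gaudi2", "-c", devices, "-rmod", "parallel",
            "-i", str(iterations), f"-{selector}"]
    if tpc_sub_mode is not None:
        # The sub-mode follows the selector directly, as in the examples.
        argv.append(tpc_sub_mode)
    if skip_reset:
        argv.append("-skip_rst")
    return argv
```

Joining the result of build_gaudi2_command("hbm_tpc_stress", 2, tpc_sub_mode="full_rw") with spaces reproduces the last example command above.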