Memory Stress Test Plugins Design, Switches and Parameters
This section describes plugin-specific switches only. Common plugin switches and parameters are described in hl_qual Common Plugin Switches and Parameters.
Gaudi 3/2 HBM Stress Plugin Design Consideration and Responsibilities¶
The HBM stress plugin verifies memory integrity under memory-transfer stress. The plugin consists of the following sub-test modes:
HBM_DMA_STRESS
- Memory transactions (read/write) are executed using the DMA engines.
HBM_TPC_STRESS
- Memory transactions (read/write) are executed using TPC engine load/store commands. You can run this sub-test in one of four sub-modes: read, write, read_write, or full_rw, which runs the other three sub-modes in the same test. The default sub-mode is read_write.
HBM_FULL_DATA_CHECK
- The plugin performs a read and write pass on all HBM memory locations.
Note
You can choose only one sub-test type per run. Combining sub-tests in a single run is not supported and returns an error.
(Gaudi 3) The HBM stress tests are low-level tests that use a special mode to ensure low latency and fast execution. This mode is not reflected in the utilization calculation, so the hl-smi tool reports 0% utilization while the test runs.
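The one-sub-test-per-run rule can be checked before launching a long run. The following is a minimal sketch, not part of hl_qual: the function names `count_hbm_selectors` and `has_single_hbm_selector` are hypothetical, and the selector switches are the ones listed in the command synopsis later in this section.

```shell
# Count how many HBM sub-test selectors appear in an argument list.
# The selector names come from the command synopsis in this section;
# the function names are hypothetical helpers, not hl_qual features.
count_hbm_selectors() {
    count=0
    for arg in "$@"; do
        case "$arg" in
            -hbm_dma_stress|-hbm_tpc_stress|-full_hbm_data_check_test)
                count=$((count + 1)) ;;
        esac
    done
    echo "$count"
}

# Succeed only when exactly one selector is present.
has_single_hbm_selector() {
    [ "$(count_hbm_selectors "$@")" -eq 1 ]
}
```

For example, `has_single_hbm_selector -gaudi3 -c all -i 2 -hbm_dma_stress` succeeds, while passing both `-hbm_dma_stress` and `-hbm_tpc_stress` fails.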
The HBM stress test runs iteratively. Each iteration comprises the following stages:
1. Running one of the sub-test modes described above.
2. Performing ECC error readout, which identifies single correctable errors and double non-correctable errors. This stage can be skipped using the -skip_val switch.
3. Performing a device reset. Each testing process resets only the device being tested. This stage can be skipped using the -skip_rst switch.
You can set the number of iterations to execute using the -i switch.
The duration of each iteration (including the test run, error readout, and reset) is approximately 5-6 minutes.
Note
The number of iterations is unbounded; however, the total running time grows with the number of iterations.
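Since the run time scales with the iteration count, a rough estimate helps when planning a run. The sketch below assumes the 5-6 minutes per iteration stated above and ignores any startup overhead; `estimate_hbm_minutes` is a hypothetical helper name.

```shell
# Print the estimated minimum and maximum total run time in minutes,
# assuming 5 to 6 minutes per iteration (the figure quoted in this section).
# estimate_hbm_minutes is a hypothetical helper, not an hl_qual feature.
estimate_hbm_minutes() {
    iterations=$1
    case "$iterations" in
        ''|*[!0-9]*)
            echo "iterations must be a non-negative integer" >&2
            return 1 ;;
    esac
    echo "$((iterations * 5)) $((iterations * 6))"
}
```

Calling `estimate_hbm_minutes 2` prints `10 12`, that is, between 10 and 12 minutes for a two-iteration run.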
Prerequisites¶
Before running this test on Gaudi 2, the habanalabs driver must be reloaded with the timeout_locked=0 flag:
sudo rmmod habanalabs
sudo modprobe habanalabs timeout_locked=0
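After reloading the driver, it can be useful to confirm that the parameter took effect. The sketch below assumes the driver exposes the value at `/sys/module/habanalabs/parameters/timeout_locked`, which is the usual sysfs location for readable module parameters but is not confirmed by this page. The function name `check_timeout_locked` is hypothetical, and the path can be overridden with a second argument.

```shell
# Compare the driver's current timeout_locked value with the expected one.
# The default sysfs path is an assumption; pass another file as the
# second argument if your system exposes the value elsewhere.
check_timeout_locked() {
    expected=$1
    param_file=${2:-/sys/module/habanalabs/parameters/timeout_locked}
    if [ ! -r "$param_file" ]; then
        echo "cannot read $param_file (is the habanalabs driver loaded?)" >&2
        return 2
    fi
    actual=$(cat "$param_file")
    if [ "$actual" = "$expected" ]; then
        echo "timeout_locked=$actual (ok)"
    else
        echo "timeout_locked=$actual, expected $expected" >&2
        return 1
    fi
}
```

On Gaudi 2, `check_timeout_locked 0` would be the matching check for the prerequisite above.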
HBM Stress Test - Pass/Fail Criteria¶
The pass/fail criteria are verified per test iteration and consist of the following:
- The selected HBM sub-test (HBM_DMA_STRESS, HBM_TPC_STRESS, or HBM_FULL_DATA_CHECK) passes.
- No ECC error is detected during the test.
- The device reset process, when applicable, completes successfully.
If any test iteration fails, the full plugin test run fails.
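The aggregation rule above (one failed iteration fails the whole run) can be expressed as a short sketch. This only illustrates the rule and does not reflect how hl_qual reports results; `overall_hbm_result` and the `pass`/`fail` tokens are hypothetical.

```shell
# Combine per-iteration results into one overall result.
# Each argument is "pass" or "fail"; any other value counts as a failure.
# This illustrates the rule only; it does not parse real hl_qual output.
overall_hbm_result() {
    if [ "$#" -eq 0 ]; then
        echo "fail"   # no iterations ran, so nothing was verified
        return 1
    fi
    for result in "$@"; do
        if [ "$result" != "pass" ]; then
            echo "fail"
            return 1
        fi
    done
    echo "pass"
}
```

For example, `overall_hbm_result pass fail pass` prints `fail`, while `overall_hbm_result pass pass` prints `pass`.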
First-gen Gaudi HBM Stress Plugin Design Consideration and Responsibilities¶
The HBM stress plugin is a stress test based on memory transfers using DMA. Each test iteration includes the following stages:
1. Running the DMA all2all HBM super stress.
2. Performing SERR/DERR counters readout to check whether a memory error occurred during the HBM test. This stage can be skipped using the -skip_val switch.
3. Performing a device reset. Each testing process resets only the device being tested. This stage can be skipped using the -skip_rst switch.
You can set the number of iterations to execute using the -i switch.
The duration of each iteration (including the test run, error readout, and reset) is approximately 5-6 minutes.
Note
The number of iterations is unbounded; however, the total running time grows with the number of iterations.
Prerequisites¶
The habanalabs driver must be reloaded with the timeout_locked=500 flag:
sudo rmmod habanalabs
sudo modprobe habanalabs timeout_locked=500
The test must be run with sudo privileges.
HBM Stress Test - Pass/Fail Criteria¶
The pass/fail criteria consist of the following:
The DMA all2all super stress must finish successfully.
No SERR/DERR indication should be encountered during the DMA test.
The device reset process, when applicable, completes successfully.
HBM Stress Test Plugin Switches and Parameters¶
hl_qual -gaudi3 | -gaudi2 -c <pci bus id> -rmod <parallel> [-dis_mon] [-mon_cfg <monitor INI path>]
-hbm_dma_stress | -hbm_tpc_stress [read | write | read_write | full_rw] | -full_hbm_data_check_test [-i <number of iterations>] [-skip_rst] [-skip_val]
Switches and Parameters | Description |
---|---|
-hbm_dma_stress | HBM DMA stress test selector. |
-hbm_tpc_stress | HBM TPC stress test selector. Four sub-modes are available: read, write, read_write (default), and full_rw. |
-full_hbm_data_check_test | HBM full data memory check test selector. |
-skip_rst | Skips device reset after completing the HBM super stress test. |
-skip_val | Skips ECC check after completing the HBM super stress test. |
-i | Number of iterations for the test. Each iteration takes about 6 minutes. |
Each of the following example commands runs the test for two iterations:
./hl_qual -gaudi3 -c all -rmod parallel -i 2 -hbm_dma_stress
./hl_qual -gaudi3 -c all -rmod parallel -i 2 -hbm_dma_stress -skip_rst
./hl_qual -gaudi3 -c all -rmod parallel -i 2 -hbm_tpc_stress
./hl_qual -gaudi3 -c all -rmod parallel -i 2 -hbm_tpc_stress full_rw
./hl_qual -gaudi2 -c all -rmod parallel -i 2 -hbm_dma_stress
./hl_qual -gaudi2 -c all -rmod parallel -i 2 -hbm_dma_stress -skip_rst
./hl_qual -gaudi2 -c all -rmod parallel -i 2 -hbm_tpc_stress
./hl_qual -gaudi2 -c all -rmod parallel -i 2 -hbm_tpc_stress full_rw
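The example commands above share one shape, which a small helper can generate for scripted runs. The sketch below only prints the command line and does not execute it. `build_hbm_cmd` is a hypothetical name, and it hard-codes the `-c all -rmod parallel` options used in the examples.

```shell
# Print an hl_qual HBM stress command line without running it.
# Usage: build_hbm_cmd <device flag> <selector> <iterations> [extra switches...]
# build_hbm_cmd is a hypothetical helper; "-c all -rmod parallel" is copied
# from the example commands in this section.
build_hbm_cmd() {
    if [ "$#" -lt 3 ]; then
        echo "usage: build_hbm_cmd <device flag> <selector> <iterations> [extra...]" >&2
        return 1
    fi
    device=$1
    selector=$2
    iterations=$3
    shift 3
    case "$device" in
        -gaudi3|-gaudi2) ;;
        *) echo "unsupported device flag: $device" >&2; return 1 ;;
    esac
    case "$selector" in
        -hbm_dma_stress|-hbm_tpc_stress|-full_hbm_data_check_test) ;;
        *) echo "unknown sub-test selector: $selector" >&2; return 1 ;;
    esac
    cmd="./hl_qual $device -c all -rmod parallel -i $iterations $selector"
    # Append optional extras such as a TPC sub-mode or -skip_rst.
    for extra in "$@"; do
        cmd="$cmd $extra"
    done
    echo "$cmd"
}
```

For example, `build_hbm_cmd -gaudi3 -hbm_tpc_stress 2 full_rw` prints `./hl_qual -gaudi3 -c all -rmod parallel -i 2 -hbm_tpc_stress full_rw`, matching one of the examples above.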
sudo -E hl_qual -gaudi -c <pci bus id> -rmod <parallel> [-dis_mon] [-mon_cfg <monitor INI path>]
-hbm_stress [-i <number of iterations>] [-skip_rst] [-skip_val]
Switches and Parameters | Description |
---|---|
-hbm_stress | HBM stress test selector. |
-skip_rst | Skips device reset after completing the HBM super stress test. |
-skip_val | Skips SERR/DERR check after completing the HBM super stress test. |
-i | Number of iterations for the test. Each iteration takes about 6 minutes. |
Each of the following example commands runs the test for 2 iterations, which takes about 12 minutes:
sudo -E ./hl_qual -gaudi -c all -rmod parallel -i 2 -hbm_stress
sudo -E ./hl_qual -gaudi -c all -rmod parallel -i 2 -hbm_stress -skip_rst