Memory Stress Test Plugins Design, Switches and Parameters¶
This section describes plugin specific switches only. Common plugin switches and parameters are described in hl_qual Common Plugin Switches and Parameters.
Gaudi 3/2 HBM Stress Plugin Design Consideration and Responsibilities¶
The HBM stress plugin verifies memory integrity under memory-transfer stress. The plugin provides the following sub-test modes:
HBM_DMA_STRESS
- The memory transactions (read/write) are executed using the DMA engines.
HBM_TPC_STRESS
- The memory transactions (read/write) are executed using TPC engine load/store commands. You can run this sub-test in four sub-modes: read, write, read_write, and full_rw, which runs the other three sub-modes in the same test. The default sub-mode is read_write.
HBM_FULL_DATA_CHECK
- The plugin performs a read and write pass on all HBM memory locations.
Note
You can choose only one sub-test type per run. Combining several sub-tests in a single run is not supported and returns an error.
(Gaudi 3) The HBM stress tests are low-level tests that use a special mode that ensures low latency and fast execution. This mode does not register in the utilization calculation, so the hl-smi tool reports 0% utilization while the test runs.
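Because a run that combines sub-tests returns an error, a wrapper script can catch the mistake before launching the test. The following is a minimal sketch and not part of hl_qual: the function names count_hbm_selectors and has_single_hbm_selector are hypothetical, and the selector names are taken from the command syntax in HBM Stress Test Plugin Switches and Parameters.

```shell
# Hypothetical pre-flight helper (not an hl_qual feature): counts how many
# HBM sub-test selectors appear in an argument list.
count_hbm_selectors() {
    count=0
    for arg in "$@"; do
        case "$arg" in
            -hbm_dma_stress|-hbm_tpc_stress|-full_hbm_data_check_test)
                count=$((count + 1)) ;;
        esac
    done
    echo "$count"
}

# Succeeds (exit status 0) only when exactly one selector is present.
has_single_hbm_selector() {
    [ "$(count_hbm_selectors "$@")" -eq 1 ]
}
```

A wrapper could then guard the real invocation, for example: has_single_hbm_selector "$@" && ./hl_qual "$@"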
The HBM stress test runs iteratively, with each iteration comprising the following stages:
Running one of the sub-modes described above.
Performing ECC error readout, which includes the identification of single correctable errors and double non-correctable errors. This stage can be skipped using the -skip_val switch.
Performing a device reset. Each testing process resets only the device being tested. This stage can be skipped using the -skip_rst switch.
You can set the number of iterations to execute using the -i switch.
The duration of each iteration (including test run, error readout, and reset) is approximately 5-6 minutes for Gaudi 2.
For Gaudi 3, the duration depends on the testing mode. See HBM Stress Test Plugin Switches and Parameters.
Note
The number of iterations is not bound; however, a large number of iterations increases the total running time.
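To gauge how the iteration count affects total running time on Gaudi 2, the 5-6 minutes per iteration quoted above can be multiplied out. The helper below is a rough sketch: estimate_hbm_minutes is a hypothetical name, and the check assumes the 1-1024 range listed for -i in the switch tables. It does not apply to Gaudi 3, where the duration depends on the testing mode.

```shell
# Hypothetical helper (not an hl_qual feature): prints a "min-max" estimate,
# in minutes, for a Gaudi 2 run of the given number of iterations.
estimate_hbm_minutes() {
    iterations=$1
    # Accept only plain positive integers without leading zeros.
    case "$iterations" in
        ''|0*|*[!0-9]*)
            echo "iterations must be a positive integer" >&2
            return 1 ;;
    esac
    # Assumption: 1-1024 is the applicable range for the -i switch.
    if [ "$iterations" -gt 1024 ]; then
        echo "iterations must be in the range 1-1024" >&2
        return 1
    fi
    # 5 to 6 minutes per iteration, covering test run, readout and reset.
    echo "$((iterations * 5))-$((iterations * 6))"
}
```

For example, estimate_hbm_minutes 2 prints 10-12, that is, roughly 10 to 12 minutes for a two-iteration run.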
Prerequisites¶
For Gaudi 2 only: Before running this test, load the driver:
sudo modprobe habanalabs timeout_locked=0
HBM Stress Test - Pass/Fail Criteria¶
The pass/fail criteria are verified per test iteration and consist of the following:
The selected HBM sub-test passed (HBM_DMA_STRESS, HBM_TPC_STRESS, or HBM_FULL_DATA_CHECK).
No ECC error was detected during the test.
The device reset process, when applicable, completed successfully.
If any of the test iterations fail, the full plugin test run fails.
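The criteria above can be modeled as two small predicates. This is an illustrative sketch only: the function names, argument order, and the "pass"/"fail"/"skipped" values are hypothetical and do not reflect hl_qual's actual output format.

```shell
# Hypothetical model of the per-iteration criteria.
# $1: sub-test outcome ("pass" or "fail")
# $2: number of ECC errors detected (0 when the readout is skipped)
# $3: reset outcome ("pass", "fail", or "skipped" when no reset is performed)
iteration_passes() {
    [ "$1" = "pass" ] && [ "$2" -eq 0 ] && [ "$3" != "fail" ]
}

# Takes one "pass"/"fail" verdict per iteration; a single failing
# iteration fails the whole run.
run_passes() {
    for verdict in "$@"; do
        [ "$verdict" = "pass" ] || return 1
    done
    return 0
}
```

Both functions report their result through the exit status, so they can be used directly in an if statement.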
HBM Stress Test Plugin Switches and Parameters¶
hl_qual -gaudi3 -c <pci bus id> -rmod <parallel> [-dis_mon] [-mon_cfg <monitor INI path>]
-hbm_dma_stress | -hbm_tpc_stress <read | write | read_write | full_rw> | -full_hbm_data_check_test [-i <number of iterations>] [-skip_rst] [-skip_val]
Switches and Parameters | Description |
---|---|
-hbm_dma_stress | HBM DMA stress test selector. |
-hbm_tpc_stress | HBM TPC stress engine test selector. Four sub-modes are available to select: read, write, read_write, and full_rw. |
-full_hbm_data_check_test | HBM full data memory check test selector. |
-skip_rst | Skips device reset after completing the HBM super stress test. |
-skip_val | Skips ECC check after completing the HBM super stress test. |
-i | Number of iterations for the test. Applicable range: 1-1024. The duration of each iteration depends on the testing mode. |
The examples below run each test for two iterations:
./hl_qual -gaudi3 -c all -rmod parallel -i 2 -hbm_dma_stress
./hl_qual -gaudi3 -c all -rmod parallel -i 2 -hbm_dma_stress -skip_rst
./hl_qual -gaudi3 -c all -rmod parallel -i 2 -hbm_tpc_stress
./hl_qual -gaudi3 -c all -rmod parallel -i 2 -hbm_tpc_stress full_rw
hl_qual -gaudi2 -c <pci bus id> -rmod <parallel> [-dis_mon] [-mon_cfg <monitor INI path>]
-hbm_dma_stress | -hbm_tpc_stress <read | write | read_write | full_rw> | -full_hbm_data_check_test [-i <number of iterations>] [-skip_rst] [-skip_val]
Switches and Parameters | Description |
---|---|
-hbm_dma_stress | HBM DMA stress test selector. |
-hbm_tpc_stress | HBM TPC stress engine test selector. Four sub-modes are available to select: read, write, read_write, and full_rw. |
-full_hbm_data_check_test | HBM full data memory check test selector. |
-skip_rst | Skips device reset after completing the HBM super stress test. |
-skip_val | Skips ECC check after completing the HBM super stress test. |
-i | Number of iterations for the test. Applicable range: 1-1024. Each iteration takes about 6 minutes. |
The examples below run each test for two iterations:
./hl_qual -gaudi2 -c all -rmod parallel -i 2 -hbm_dma_stress
./hl_qual -gaudi2 -c all -rmod parallel -i 2 -hbm_dma_stress -skip_rst
./hl_qual -gaudi2 -c all -rmod parallel -i 2 -hbm_tpc_stress
./hl_qual -gaudi2 -c all -rmod parallel -i 2 -hbm_tpc_stress full_rw
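To exercise each TPC sub-mode in its own run, the command lines can be generated in a loop. The sketch below only prints the commands so they can be reviewed first; print_tpc_stress_commands is a hypothetical helper name, and the switches are the ones used in the examples above.

```shell
# Hypothetical helper: prints (does not execute) one hl_qual command line
# per TPC sub-mode.
# $1: device generation switch, for example -gaudi3 or -gaudi2
# $2: number of iterations passed to -i
print_tpc_stress_commands() {
    generation=$1
    iterations=$2
    for mode in read write read_write full_rw; do
        echo "./hl_qual $generation -c all -rmod parallel -i $iterations -hbm_tpc_stress $mode"
    done
}
```

Calling print_tpc_stress_commands -gaudi2 2 prints four command lines, the last of which matches the full_rw example above.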