Connectivity Serdes Test Plugins Design, Switches and Parameters
This section describes plugin-specific switches only. Common plugin switches and parameters are described in hl_qual Common Plugin Switches and Parameters.
Serdes Base Test Design Consideration and Responsibilities¶
The Serdes base test performs basic sanity tests on the Serdes, ensuring the following parts are tested:
Internal and external port connectivity.
RX/TX data integrity - Pre-calculated random data is transmitted, and the received data is compared against reference data. The default transmit buffer size is 128 MB to ensure variety in the transmitted data. During the test, the TX buffer is transmitted multiple times, according to the configured test execution time.
Note
The Serdes base test can only be run in parallel mode.
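The data integrity check described above can be illustrated with a minimal sketch. It assumes the reference buffer is derived from a seed with a generic pseudo-random generator; the helper names `make_reference` and `first_mismatch` are hypothetical, and the generator actually used by hl_qual is not documented here.

```python
import random


def make_reference(seed: int, size: int) -> bytes:
    """Build a reproducible pseudo-random buffer from a seed (illustrative only)."""
    rng = random.Random(seed)
    return bytes(rng.getrandbits(8) for _ in range(size))


def first_mismatch(received: bytes, reference: bytes) -> int:
    """Return the index of the first differing byte, or -1 if the buffers match."""
    for i, (a, b) in enumerate(zip(received, reference)):
        if a != b:
            return i
    if len(received) != len(reference):
        # One buffer is a strict prefix of the other.
        return min(len(received), len(reference))
    return -1


SEED = 0xABC123DA                       # example seed value from the -seed description
reference = make_reference(SEED, 1024)  # small buffer for illustration, not 128 MB
received = bytearray(reference)
received[100] ^= 0x01                   # simulate a single corrupted bit on the wire
print(first_mismatch(bytes(received), reference))
```

Because the buffer depends only on the seed, a sender and a receiver can build the same reference independently and compare without exchanging it.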
Serdes Base Test Testing Modes¶
The test plugin provides the following testing modes on Gaudi 3 and Gaudi 2:
- pairs - Runs over all available device pairs and tests the internal port connectivity. In this mode, the test runs a data integrity test on each pair of devices.
- allgather - Runs over all available devices and computes the bandwidth of allgather functionality.
- allreduce - Runs over all available devices and computes the bandwidth of allreduce functionality.
- loopback - Runs over all available devices and tests the external port connectivity. In this mode, the external ports must be fitted with loopback dongles. Calculates bandwidth only.
- dir_bw (direct bandwidth) - Runs over all available devices. In each iteration, one device is the sender and the rest are receivers. Calculates bandwidth only.
The allgather and allreduce variants also calculate the bandwidth for those network operations. You can run the test for a specific number of epochs, and iterations per epoch, using the -ep and -i switches. This generates the maximum, minimum, and average bandwidth per epoch, which may expose bandwidth instabilities over the duration of the test.
Note
The allgather, allreduce, loopback, and dir_bw variants calculate bandwidth only. They do not validate NIC port integrity.
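As an illustration of how the pairs and dir_bw modes walk over the devices, the sketch below enumerates device pairs and per-iteration sender/receiver roles. It assumes unordered pairs and a simple one-sender-per-iteration order; the function names are hypothetical and the actual scheduling inside hl_qual may differ.

```python
from itertools import combinations


def pair_schedule(devices):
    """All unordered device pairs, as the pairs mode conceptually visits them."""
    return list(combinations(devices, 2))


def dir_bw_schedule(devices):
    """One (sender, receivers) entry per iteration: each device sends once."""
    return [(sender, [d for d in devices if d != sender]) for sender in devices]


devices = list(range(8))  # assumed 8-device server, for illustration only
print(len(pair_schedule(devices)), "device pairs")
for sender, receivers in dir_bw_schedule(devices)[:2]:
    print("sender", sender, "-> receivers", receivers)
```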
On first-gen Gaudi, the following testing modes are available:
- pairs - Runs over all available device pairs and tests the internal port connectivity. In this mode, the test runs a data integrity test on each pair of devices.
- allgather - Runs over all available devices and computes the bandwidth of allgather functionality.
- allreduce - Runs over all available devices and computes the bandwidth of allreduce functionality.
Note
The allgather and allreduce variants calculate bandwidth only. They do not validate NIC port integrity.
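The per-epoch MAX, MIN and average figures mentioned above can be reproduced from a list of per-iteration bandwidth samples. The following is a minimal sketch with made-up sample values; `epoch_stats` is a hypothetical helper and does not parse real hl_qual output.

```python
from statistics import mean


def epoch_stats(samples_gbps, iterations_per_epoch):
    """Group per-iteration bandwidth samples into epochs and summarize each epoch."""
    stats = []
    for start in range(0, len(samples_gbps), iterations_per_epoch):
        epoch = samples_gbps[start:start + iterations_per_epoch]
        stats.append({"min": min(epoch), "max": max(epoch), "avg": mean(epoch)})
    return stats


# Hypothetical samples: two epochs of three iterations each.
samples = [100.0, 102.0, 98.0, 100.0, 60.0, 101.0]
for index, summary in enumerate(epoch_stats(samples, 3)):
    print(index, summary)
```

A large gap between MIN and MAX within one epoch, as in the second epoch here, is the kind of instability that a single averaged figure would hide.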
Serdes Base - Pass/Fail Criteria¶
The pass/fail criteria consist of two sub-criteria:
The test fails if the destination rank does not respond.
The test fails if data is not received as expected compared with the reference data.
Serdes Base Test Plugin Switches and Parameters¶
hl_qual -gaudi3 -c <pci bus id> [-i <inner loop iterations count>] [-ep <epoch count>] -rmod parallel [-dis_mon] [-mon_cfg <monitor INI path>]
-nic_base -test_type <pairs | allreduce | allgather | loopback | dir_bw> [-sz <size in bytes>] [-seed <seed>] [-enable_externals] [-toggle] [-enable_ports_check <all | int>]
Switches and Parameters | Description |
---|---|
-test_type | Defines the test type and port configuration to be used in your system. The test type variants are pairs, allreduce, allgather, loopback, and dir_bw. |
-nic_base | Serdes base test selector. |
-sz <size in bytes> | Send/receive buffer size in bytes. |
-i <inner loop iterations count> | Iteration count for bandwidth computation. |
-ep <epoch count> | Epoch count for bandwidth computation. Note: This switch enables multiple epochs that consist of iterations. You can change the number of iterations using the -i switch. |
-seed <seed> | Seed value for generating the transmit patterns. This is a 32-bit number with the following hexadecimal pattern xxxxxxxx (for example, abc123da). The default pattern is 5a5a5a5a5a. |
-enable_externals | Enables the external ports check. The ports must be fitted with loopback dongles. |
-toggle | Enables the port toggle check and counts the number of port toggles during the test. |
-enable_ports_check | Indicates whether the ports are UP or DOWN. If the ports are DOWN, the test fails. Accepted values: all, int. |
./hl_qual -gaudi3 -c all -rmod parallel -i 50 -nic_base -test_type pairs
./hl_qual -gaudi3 -c all -rmod parallel -i 50 -ep 100 -nic_base -test_type allreduce
hl_qual -gaudi2 -c <pci bus id> [-i <inner loop iterations count>] [-ep <epoch count>] -rmod parallel [-dis_mon] [-mon_cfg <monitor INI path>]
-nic_base -test_type <pairs | allreduce | allgather | loopback | dir_bw> [-sz <size in bytes>] [-disable_ports <port list>] [-seed <seed>] [-enable_ports_check <int | all>] [-toggle]
Switches and Parameters | Description |
---|---|
-test_type | Defines the test type and port configuration to be used in your system. The test type variants are pairs, allreduce, allgather, loopback, and dir_bw. |
-nic_base | Serdes base test selector. |
-sz <size in bytes> | Send/receive buffer size in bytes. |
-i <inner loop iterations count> | Iteration count for bandwidth computation. |
-ep <epoch count> | Epoch count for bandwidth computation. Note: This switch enables multiple epochs that consist of iterations. You can change the number of iterations using the -i switch. |
-seed <seed> | Seed value for generating the transmit patterns. This is a 32-bit number with the following hexadecimal pattern xxxxxxxx (for example, abc123da). The default pattern is 5a5a5a5a5a. |
-disable_ports <port list> | Specifies which ports to disable. |
-toggle | Enables the port toggle check and counts the number of port toggles during the test. |
-enable_ports_check | Indicates whether the ports are UP or DOWN. If the ports are DOWN, the test fails. Accepted values: int, all. |
./hl_qual -gaudi2 -c all -rmod parallel -i 50 -nic_base -test_type pairs
./hl_qual -gaudi2 -c all -rmod parallel -i 50 -ep 100 -nic_base -test_type allreduce
hl_qual -gaudi -c <pci bus id> [-i <inner loop iterations count>] [-ep <epoch count>] -rmod parallel [-dis_mon] [-mon_cfg <monitor INI path>]
-nic_base -test_type <pairs | allreduce | allgather> [-sz <size in bytes>] [-disable_ports <port list>] [-seed <seed>] [-enable_ports_check <int | all>]
Switches and Parameters | Description |
---|---|
-test_type | Defines the test type and port configuration to be used in your system. The test type variants are pairs, allreduce, and allgather. |
-nic_base | Serdes base test selector. |
-sz <size in bytes> | Send/receive buffer size in bytes. |
-i <inner loop iterations count> | Iteration count for bandwidth computation. |
-ep <epoch count> | Epoch count for bandwidth computation. Note: This switch enables multiple epochs that consist of iterations. You can change the number of iterations using the -i switch. |
-seed <seed> | Seed value for generating the transmit patterns. This is a 32-bit number with the following hexadecimal pattern xxxxxxxx (for example, abc123da). The default pattern is 5a5a5a5a5a. |
-disable_ports <port list> | Specifies which ports to disable. |
-enable_ports_check | Indicates whether the ports are UP or DOWN. If the ports are DOWN, the test fails. Accepted values: int, all. |
./hl_qual -gaudi -c all -rmod parallel -i 20 -nic_base -test_type pairs
./hl_qual -gaudi -c all -rmod parallel -i 20 -ep 20 -nic_base -test_type allreduce
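When the Serdes base test is launched from a script, it can help to reject a test type that the selected device generation does not list before invoking the tool. The sketch below builds an argument list using only switches that appear in the syntax lines above; the helper name `serdes_base_cmd` and the lookup table are hypothetical and are not part of hl_qual.

```python
# Test types per device generation, copied from the syntax lines in this section.
SERDES_TEST_TYPES = {
    "gaudi3": {"pairs", "allreduce", "allgather", "loopback", "dir_bw"},
    "gaudi2": {"pairs", "allreduce", "allgather", "loopback", "dir_bw"},
    "gaudi": {"pairs", "allreduce", "allgather"},
}


def serdes_base_cmd(device, test_type, pci="all", iterations=None, epochs=None):
    """Return an argv list for a Serdes base run (always in parallel mode)."""
    if device not in SERDES_TEST_TYPES:
        raise ValueError(f"unknown device generation: {device}")
    if test_type not in SERDES_TEST_TYPES[device]:
        raise ValueError(f"{test_type} is not listed for -{device}")
    argv = ["./hl_qual", f"-{device}", "-c", pci, "-rmod", "parallel"]
    if iterations is not None:
        argv += ["-i", str(iterations)]
    if epochs is not None:
        argv += ["-ep", str(epochs)]
    argv += ["-nic_base", "-test_type", test_type]
    return argv


print(" ".join(serdes_base_cmd("gaudi2", "allreduce", iterations=50, epochs=100)))
```

The returned list can be passed to `subprocess.run` as is; the sketch itself never executes the tool.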
Gaudi 3/2 E2E Concurrency Test Plugin Design Consideration and Responsibilities¶
The E2E concurrency test plugin is a network bandwidth measurement test that calculates bandwidth per port and verifies transmission integrity. Because all verification runs per port, port connectivity issues are easy to identify. The test verifies all ports and reports the calculated bandwidth for each one, which is an advantage over the Serdes base test, which tests port pairs.
Note
(Gaudi 3) The E2E concurrency test is a low-level test that uses a special mode that ensures low latency and fast execution. This mode does not leave a trace on the utilization calculation when running the hl-smi tool. The tool outputs 0% utilization.
Prerequisites¶
For Gaudi 2 only: Before running this test, load the driver:
sudo modprobe habanalabs timeout_locked=0
E2E Concurrency Test - Pass/Fail Criteria¶
The pass/fail criteria depend on the calculated bandwidth and on data correctness. The bandwidth criteria are:
Gaudi 3 - 190 Gbps
Gaudi 2 - 97 Gbps
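A minimal sketch of applying these figures to per-port results follows. It assumes the values are minimum per-port bandwidths and that a port exactly at the threshold passes; both points are assumptions, and the helper name `failing_ports` and the sample numbers are hypothetical.

```python
# Thresholds taken from the criteria above; treated here as per-port minimums (assumption).
MIN_BANDWIDTH_GBPS = {"gaudi3": 190.0, "gaudi2": 97.0}


def failing_ports(device, port_bandwidth_gbps):
    """Return the sorted list of ports whose bandwidth is below the threshold."""
    threshold = MIN_BANDWIDTH_GBPS[device]
    return sorted(port for port, bw in port_bandwidth_gbps.items() if bw < threshold)


# Hypothetical measurements for three ports of one Gaudi 2 device.
measured = {0: 98.2, 1: 96.5, 2: 97.0}
print(failing_ports("gaudi2", measured))
```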
Gaudi 3/2 E2E Concurrency Test Plugin Switches and Parameters¶
hl_qual -gaudi3 -c <pci bus id> -rmod <parallel> [-t <execution time in seconds>] [-dis_mon] [-mon_cfg <monitor INI path>] -e2e_concurrency [-enable_externals]
[-toggle] [-enable_ports_check <int | all>]
Switches and Parameters | Description |
---|---|
-e2e_concurrency | Test selector. |
-t <execution time in seconds> | Test execution time in seconds. |
-enable_externals | Enables external ports checking. If not used, the external ports are not tested. |
-toggle | Enables the port toggle check and counts the number of port toggles during the test. |
-enable_ports_check | Indicates whether the ports are UP or DOWN. If the ports are DOWN, the test fails. Accepted values: int, all. |
./hl_qual -gaudi3 -c all -rmod parallel -dis_mon -e2e_concurrency
./hl_qual -gaudi3 -c all -rmod parallel -t 30 -dis_mon -e2e_concurrency -enable_ports_check int
hl_qual -gaudi2 -c <pci bus id> -rmod <parallel> [-t <execution time in seconds>] [-dis_mon] [-mon_cfg <monitor INI path>] -e2e_concurrency [-disable_ports <port list>] [-enable_ports_check <all | int>] [-toggle]
Switches and Parameters | Description |
---|---|
-e2e_concurrency | Test selector. |
-t <execution time in seconds> | Test execution time in seconds. |
-disable_ports | Specifies which ports to disable. For example, -disable_ports 8,22,23. |
-toggle | Enables the port toggle check and counts the number of port toggles during the test. |
-enable_ports_check | Indicates whether the ports are UP or DOWN. If the ports are DOWN, the test fails. Accepted values: all, int. |
./hl_qual -gaudi2 -c all -rmod parallel -dis_mon -e2e_concurrency
./hl_qual -gaudi2 -c all -rmod parallel -t 30 -dis_mon -e2e_concurrency -disable_ports 8,22,23 -enable_ports_check int
First-gen Gaudi E2E Serdes Test Plugin Design Consideration and Responsibilities¶
The E2E Serdes test is a multi-purpose test that can check any machine topology using a connectivity JSON file. The file describes the machine's unique connectivity, and the test verifies both connectivity and bandwidth. The test runs on multiple Gaudi cards, testing the ports simultaneously to report the collective bandwidth.
Note
To use your own port connectivity configuration, contact Intel Gaudi support.
E2E Serdes Test Testing Modes¶
- external-lpbk - A Serdes loopback communication test. This testing mode runs on all external ports of the NICs and sends data through a loopback dongle from each external port back to itself. When running this test mode, the external ports must be connected (RX to TX) using a loopback dongle.
- internal - This mode sends data from each internal port to its connected port on a different first-gen Gaudi device.
- all-lpbk - This mode runs the plugin on all internal and external ports simultaneously. The external ports are tested in loopback, so all external ports must be connected (RX to TX) using a loopback dongle.
E2E Serdes Test - Pass/Fail Criteria¶
The pass/fail criteria consist of the following:
There is an E2E connection between ports.
There is no loopback.
The collective bandwidth is printed at the end of the test.
When the test runs in Sanity mode using the -executeSanity flag, it fails if the bit error rate exceeds 0.1%.
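To make the 0.1% limit concrete, the sketch below counts flipped bits between a sent and a received buffer and expresses them as a percentage of all transferred bits. This interpretation of the limit is an assumption, and the helpers `bit_error_percent` and `sanity_passes` are hypothetical; the way hl_qual counts bit errors is not described here.

```python
SANITY_LIMIT_PERCENT = 0.1  # limit quoted for Sanity mode


def bit_error_percent(sent: bytes, received: bytes) -> float:
    """Percentage of bits that differ between two equal-length, non-empty buffers."""
    if not sent or len(sent) != len(received):
        raise ValueError("buffers must be non-empty and of equal length")
    flipped = sum(bin(a ^ b).count("1") for a, b in zip(sent, received))
    return 100.0 * flipped / (8 * len(sent))


def sanity_passes(sent: bytes, received: bytes) -> bool:
    """True when the bit error percentage does not exceed the limit."""
    return bit_error_percent(sent, received) <= SANITY_LIMIT_PERCENT


sent = bytes(1000)             # 8000 zero bits
slightly_bad = bytearray(sent)
slightly_bad[0] = 0x0F         # 4 flipped bits
very_bad = bytearray(sent)
very_bad[0] = 0xFF             # 16 flipped bits in total
very_bad[1] = 0xFF
print(sanity_passes(sent, bytes(slightly_bad)), sanity_passes(sent, bytes(very_bad)))
```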
First-gen Gaudi E2E Serdes Test Plugin Switches and Parameters¶
sudo hl_qual -gaudi -e2e -port_map <json name> -test_mode <all-lpbk | internal | external-lpbk> -c <pci bus id> -rmod <serial | parallel> [-dis_mon] [-mon_cfg <monitor INI path>] [-enable_ports_check <all | int>]
Switches and Parameters | Description |
---|---|
-e2e | E2E Serdes test selector. |
-port_map <json name> | A JSON file that defines the port connectivity within the server box and which ports are external. |
-test_mode | Defines the test type and port configuration to be used in your system. The test type variants are all-lpbk, internal, and external-lpbk. |
-enable_ports_check | Indicates whether the ports are UP or DOWN. If the ports are DOWN, the test fails. Accepted values: all, int. |
The following command runs the E2E test on both the external ports (with loopback dongles) and the internal ports. The network topology is defined by hls1.json:
sudo ./hl_qual -c all -test_mode all-lpbk -rmod parallel -e2e -port_map hls1.json -gaudi
Gaudi 3/2 SER Test Plugin Design Consideration and Responsibilities¶
The SER test plugin is a Symbol Error Rate measurement test that calculates the pre-FEC and post-FEC SER for each port.
The SER test execution time can be lengthy and depends on the state of the Intel Gaudi Linux kernel driver. During this test, the devices undergo hard resets to perform various stages of the test. The test plugin needs to wait for the device to reach a stable state of operation after each hard reset.
On Gaudi 2, you can enable Bit Error Rate (BER) measurement using the -ber_enable switch. The BER test calculates the port and lane bit error rate.
Note
(Gaudi 3) The SER test is a low-level test that uses a special mode that ensures low latency and fast execution. This mode does not leave a trace on the utilization calculation when running the hl-smi tool. The tool outputs 0% utilization.
SER Test - Pass/Fail Criteria¶
The pass/fail criterion depends on the calculated post-FEC SER. The maximum acceptable value is 1.0E-6. If the measured post-FEC SER exceeds this threshold, the test fails.
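The threshold comparison can be sketched as follows, assuming SER is the ratio of errored symbols to total symbols observed on a port. The counter values and the helper names `symbol_error_rate` and `ser_passes` are hypothetical; the counters actually read by the plugin are not described here.

```python
MAX_POST_FEC_SER = 1.0e-6  # maximum acceptable post-FEC SER from the criteria above


def symbol_error_rate(symbol_errors: int, total_symbols: int) -> float:
    """Ratio of errored symbols to total symbols (assumed definition of SER)."""
    if total_symbols <= 0:
        raise ValueError("total_symbols must be positive")
    return symbol_errors / total_symbols


def ser_passes(post_fec_ser: float) -> bool:
    """True when the post-FEC SER does not exceed the maximum acceptable value."""
    return post_fec_ser <= MAX_POST_FEC_SER


# Hypothetical counters for two ports.
good_port = symbol_error_rate(5, 10**7)  # 5e-07
bad_port = symbol_error_rate(3, 10**6)   # 3e-06
print(ser_passes(good_port), ser_passes(bad_port))
```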
Gaudi 3/2 SER Test Plugin Switches and Parameters¶
hl_qual -gaudi3 -c <pci bus id> -rmod <serial | parallel> [-dis_mon] [-mon_cfg <monitor INI path>] -ser [-enable_ext]
Switches and Parameters | Description |
---|---|
-ser | SER test selector. |
-enable_ext | Enables external ports checking. |
./hl_qual -gaudi3 -c all -rmod parallel -dis_mon -ser
./hl_qual -gaudi3 -c all -rmod parallel -ser -enable_ext
hl_qual -gaudi2 -c <pci bus id> -rmod <serial | parallel> [-dis_mon] [-mon_cfg <monitor INI path>] -ser [-enable_ext] [-ber_enable] [-ckeck-all]
Switches and Parameters | Description |
---|---|
-ser | SER test selector. |
-ber_enable | Enables BER check. |
-enable_ext | Enables external ports checking. |
-ckeck-all | Checks pre-FEC and post-FEC SER and BER. |
./hl_qual -gaudi2 -c all -rmod parallel -dis_mon -ser
./hl_qual -gaudi2 -c all -rmod parallel -ser -enable_ext