Connectivity Serdes Test Plugins Design, Switches and Parameters

This section describes the plugin-specific switches. It does not focus on the common switches, although they appear in the command examples for completeness. For the common plugin switches and parameters, refer to hl_qual Common Plugin Switches and Parameters.

Serdes Base Test Design Consideration and Responsibilities

The Serdes base test performs basic sanity tests on the Serdes, ensuring the following parts are tested:

  • Internal port connectivity

  • RX/TX data integrity - Checked by transmitting pre-calculated random data and comparing the received data against reference data. The default transmit buffer size is 128MB to ensure a variety of transmitted data. During the test, the TX buffer is transmitted multiple times, according to the test execution time you set.
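The data-integrity check can be pictured with a minimal Python sketch. It is not the hl_qual implementation: the helper names, the use of Python's `random` module as the pattern generator, and the 1 KB buffer are assumptions made purely for illustration.

```python
import random

def make_reference(seed: int, size: int) -> bytes:
    """Pre-calculate a reproducible pseudo-random buffer (hypothetical helper)."""
    return random.Random(seed).randbytes(size)

def first_mismatch(received: bytes, reference: bytes) -> int:
    """Return the offset of the first differing byte, or -1 if the buffers match."""
    if received == reference:
        return -1
    for offset, (got, expected) in enumerate(zip(received, reference)):
        if got != expected:
            return offset
    # One buffer is a prefix of the other, so only the lengths differ.
    return min(len(received), len(reference))

# 1 KB for the sketch; the real test defaults to a 128MB buffer.
reference = make_reference(seed=0xABC123DA, size=1024)

# Simulate a single corrupted byte on the receive side.
corrupted = bytearray(reference)
corrupted[100] ^= 0xFF
```

Because the reference is derived only from the seed, both ends of a link can regenerate it independently, which is why a fixed seed makes a failing pattern reproducible.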

Serdes Base Test Testing Modes

The test plugin consists of the below sub-testing modes:

  • Pair test - Runs over all available device pairs and tests the internal port connectivity. In this mode, the test runs a data integrity test on each pair of devices.

  • All-Gather - Runs over all available devices and computes the bandwidth of all-gather functionality.

  • All-Reduce - Runs over all available devices and computes the bandwidth of all-reduce functionality.

The All-Gather and All-Reduce variants also calculate the bandwidth for those network operations. You may run the test for a specific number of epochs by using the -ep switch, and for a specific number of iterations per epoch by using the -i switch. The test then reports the MAX, MIN and Average bandwidth per epoch, which may expose bandwidth instabilities over the duration of the test.
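As an illustration of the per-epoch report, the following Python sketch derives MAX, MIN and Average bandwidth from per-iteration timings. How hl_qual measures and aggregates bandwidth internally is not documented here, so the formula (bytes moved divided by elapsed time), the function name, and the sample timings are all assumptions.

```python
from statistics import mean

def epoch_bandwidth_stats(bytes_per_iteration, iteration_seconds_per_epoch):
    """Return a (max, min, average) bandwidth tuple in GB/s for each epoch."""
    stats = []
    for durations in iteration_seconds_per_epoch:
        # Assumed definition: bandwidth of one iteration = bytes moved / elapsed time.
        bandwidths = [bytes_per_iteration / seconds / 1e9 for seconds in durations]
        stats.append((max(bandwidths), min(bandwidths), mean(bandwidths)))
    return stats

# Two epochs of three iterations each, every iteration moving 1 GB (made-up timings).
timings = [[0.010, 0.010, 0.010], [0.010, 0.020, 0.010]]
for epoch, (bw_max, bw_min, bw_avg) in enumerate(epoch_bandwidth_stats(1e9, timings)):
    print(f"epoch {epoch}: max={bw_max:.1f} min={bw_min:.1f} avg={bw_avg:.1f} GB/s")
```

An epoch whose MIN falls well below its MAX, like the second one in the sample data, is the kind of instability the per-epoch breakdown is meant to surface.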

Note

Both All-Gather and All-Reduce only calculate bandwidth. They do not validate NIC port integrity.

Serdes Base - Pass/Fail Criteria

The pass/fail criteria consist of two sub-criteria:

  • Connectivity - The test fails if the destination rank does not respond.

  • Data integrity - The test fails if the received data does not match the reference data.

Serdes Base Test Plugin Switches and Parameters

First-gen Gaudi and Gaudi 2 test variants differ in the test capabilities as demonstrated in the command line below:

hl_qual -gaudi|gaudi2 -c <pci bus id> [-i <inner loop iterations count>] [-ep <epoch count>] -rmod <serial | parallel> [-dis_mon] [-mon_cfg <monitor INI path>]
-nic_base -test_type <pairs | allreduce | allgather> [-sz <size in bytes>] [-disable_ports <port list>] [-seed <seed>] [-enable_ports_check <all | int>]

Switches and Parameters

Description

-test_type <test type>

Defines the test type and port configuration to be used in your system. The different test type variants are:

  • pairs - Checks the internal port connectivity.

  • allgather - Computes the all-gather bandwidth.

  • allreduce - Computes the all-reduce bandwidth.

-nic_base

Serdes base test selector.

-sz

Send/receive buffer size in bytes.

-i

Iteration count for bandwidth computation.

-ep

Epoch count for bandwidth computation.

Note: This switch enables multiple epochs, each consisting of a number of iterations. You can change the number of iterations per epoch by using the -i switch. The -ep switch applies only to the allgather and allreduce test modes.

-seed <seed>

Seed value for generating the transmit patterns. This is a 32-bit number written as eight hexadecimal digits, xxxxxxxx (for example, abc123da). The default pattern is 5a5a5a5a.

-disable_ports

Specifies which ports to disable. Example: -disable_ports [1,2,3].

-enable_ports_check <all | int>

Checks whether the ports are UP or DOWN. The test fails if any checked port is DOWN.

  • all - Checks all the external and internal ports.

  • int - Checks the internal ports only.

Examples:

-  ./hl_qual -gaudi2 -c all -rmod parallel -i 50 -nic_base -test_type pairs
-  ./hl_qual -gaudi2 -c all -rmod parallel -i 50 -ep 100 -nic_base -test_type allreduce
-  ./hl_qual -gaudi -c all -rmod parallel -i 20 -nic_base -test_type pairs
-  ./hl_qual -gaudi -c all -rmod parallel -i 20 -ep 20 -nic_base -test_type allreduce
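When these commands are generated by a script, it can help to validate the arguments before launching hl_qual. The Python sketch below assembles an argument list from the switches documented above and rejects combinations the text rules out. The function name and the validation rules are assumptions drawn from this section (for example, that -seed takes exactly eight hexadecimal digits and uses the single-dash form shown in the synopsis); hl_qual itself may accept or reject input differently.

```python
import re

VALID_DEVICES = {"gaudi", "gaudi2"}
VALID_TEST_TYPES = {"pairs", "allreduce", "allgather"}

def build_nic_base_command(device, test_type, iterations=None, epochs=None,
                           seed=None, pci="all"):
    """Assemble an hl_qual -nic_base argument list (hypothetical helper)."""
    if device not in VALID_DEVICES:
        raise ValueError(f"unknown device: {device}")
    if test_type not in VALID_TEST_TYPES:
        raise ValueError(f"unknown test type: {test_type}")
    # Per the -ep note, epochs apply only to allgather and allreduce.
    if epochs is not None and test_type == "pairs":
        raise ValueError("-ep applies only to allgather and allreduce")
    # Assumed rule: a 32-bit seed is written as exactly eight hex digits.
    if seed is not None and not re.fullmatch(r"[0-9a-fA-F]{8}", seed):
        raise ValueError("seed must be eight hexadecimal digits")

    args = ["./hl_qual", f"-{device}", "-c", pci, "-rmod", "parallel"]
    if iterations is not None:
        args += ["-i", str(iterations)]
    if epochs is not None:
        args += ["-ep", str(epochs)]
    args += ["-nic_base", "-test_type", test_type]
    if seed is not None:
        args += ["-seed", seed]
    return args

command = build_nic_base_command("gaudi2", "allreduce", iterations=50, epochs=100)
print(" ".join(command))
```

The returned list can be passed directly to `subprocess.run`, which avoids shell quoting issues.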

Gaudi 2 E2E Concurrency Test Plugin Design Consideration and Responsibilities

The E2E concurrency test plugin is a network bandwidth measurement test. The test calculates bandwidth per port and verifies transmission integrity. Because all verification runs per port, port connectivity issues can be traced to a specific port. This per-port coverage is the main benefit over the Serdes Base test, which tests devices in pairs.

E2E Concurrency Test - Pass/fail Criteria

The pass/fail criteria depend on the calculated bandwidth and on data correctness. The bandwidth criterion is 97 GB/s.

Gaudi 2 E2E Concurrency Test Plugin Switches and Parameters

hl_qual -gaudi2 -c <pci bus id> -rmod <parallel> [-t <execution time in seconds>] [-dis_mon] [-mon_cfg <monitor INI path>] -e2e_concurrency [-disable_ports <port list>] [-enable_ports_check <all | int>]

Switches and Parameters

Description

-e2e_concurrency

Test selector

-t

Test duration in seconds.

-disable_ports

Specifies which ports to disable. Example: -disable_ports [1,2,3].

-enable_ports_check <all | int>

Checks whether the ports are UP or DOWN. The test fails if any checked port is DOWN.

  • all - Checks all the external and internal ports.

  • int - Checks the internal ports only.

./hl_qual -gaudi2 -c all -rmod parallel -dis_mon -e2e_concurrency
./hl_qual -gaudi2 -c all -rmod parallel -t 30 -dis_mon -e2e_concurrency -disable_ports 8,22,23 -enable_ports_check int

First-gen Gaudi E2E Serdes Test Plugin Design Consideration and Responsibilities

The E2E serdes test is a multi-purpose test that can check any machine topology. It uses a connectivity JSON file that describes the machine’s unique connectivity, and it verifies both connectivity and bandwidth. The test runs on multiple Gaudi cards, tests their ports simultaneously, and reports the collective bandwidth.

Note

If you use your own port connectivity configuration, contact us for support.

E2E Serdes Test Testing Modes

  • External-lpbk - A serdes loopback communication test. This testing mode runs on all external ports of the NICs and sends data through a loopback dongle from each external port back to itself. When running this test mode, the external ports must be connected (RX to TX) using a loopback dongle.

  • Internal - This mode sends data from each internal port to its connected port on a different first-gen Gaudi device.

  • All-lpbk - This mode runs the plugin on all internal and external ports simultaneously. The test of the external ports is loopback and so all external ports must be connected (RX to TX) using a loopback dongle.

    Note

    The External-lpbk and All-lpbk testing modes can be used only if the external ports are equipped with loopback dongles.

E2E Serdes Test - Pass/fail Criteria

The pass/fail criteria consist of the following:

  • There is an E2E connection between ports.

  • There is no loopback.

  • The collective bandwidth is printed at the end of the test.

  • When the test runs in Sanity mode using the -executeSanity flag, it fails if the bit error exceeds 0.1%.

First-gen Gaudi E2E Serdes Test Plugin Switches and Parameters

sudo hl_qual -gaudi -e2e -port_map <json name> -test_mode <all-lpbk | internal | external-lpbk> -c <pci bus id> -rmod <serial | parallel>  [-dis_mon] [-mon_cfg <monitor INI path>] [-enable_ports_check <all | int>]

Switches and Parameters

Description

-e2e

E2E Serdes test selector.

-port_map <json name>

A JSON file that defines the port connectivity within the server box and which ports are external.

-test_mode <test type>

Defines the test type and port configuration to be used in your system. The different test type variants are:

  • all-lpbk - Executes both internal and external tests. This mode only runs with -rmod parallel.

  • internal - Checks the internal port connectivity. This mode only runs with -rmod parallel.

  • external-lpbk - Tests all external ports, and assumes loopback dongle on all external ports.

-enable_ports_check <all | int>

Checks whether the ports are UP or DOWN. The test fails if any checked port is DOWN.

  • all - Checks all the external and internal ports.

  • int - Checks the internal ports only.

sudo ./hl_qual -c all -test_mode all-lpbk -rmod parallel -e2e -port_map hls1.json -gaudi

The above command line runs the E2E test on the internal ports and on the external ports, which must be fitted with loopback dongles. The network topology is defined by hls1.json.

Gaudi 2 SER Test Plugin Design Consideration and Responsibilities

The SER test plugin is a Symbol Error Rate measurement test. It calculates the pre-FEC and post-FEC SER for each port.

The plugin can optionally measure BER (bit error rate) as well. The BER test calculates the bit error rate per port and per lane. To run it, FEC must be disabled in the Serdes PHY retimers, which requires access to the system’s BMC. Each OEM vendor has its own procedures for accessing the BMC and disabling FEC. Test execution can be lengthy and depends on the state of the Intel Gaudi Linux kernel driver: during the test, the devices undergo hard resets between stages, and the plugin must wait for each device to reach a stable state of operation after every hard reset.

To specify the ports for testing on an HLS2 server, you can modify the default file named hls2-ber_test_config.json. Enabling a port requires setting its value to 1, while disabling it requires setting its value to 0. You can also supply your own configuration file to accommodate custom configurations.

The following shows the default contents of the hls2-ber_test_config.json file, which describes a machine with 8 cards and 24 ports per card.

{
   "0" : [1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0],
   "1" : [1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0],
   "2" : [1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0],
   "3" : [1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0],
   "4" : [1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0],
   "5" : [1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0],
   "6" : [1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0],
   "7" : [1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
}

The following line indicates that card 5 has a total of 24 ports. Ports 0-7 and 9-21 are set to 1, so they are tested. Ports 8, 22, and 23 are set to 0, so they are not tested.

"5" : [1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0],

After running this test, make sure to re-enable FEC for normal operation.

SER test - Pass/fail Criteria

The pass/fail criteria depend on the calculated post-FEC SER. The SER plugin pass/fail threshold for this mode is 1.0E-6.
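For reference, a symbol error rate is the number of symbol errors divided by the number of symbols transferred. The Python sketch below applies that definition to the 1.0E-6 threshold. The function names are hypothetical, and the comparison direction (a port passes when its post-FEC SER does not exceed the threshold) is an assumption, since the text does not spell it out.

```python
SER_THRESHOLD = 1.0e-6  # threshold stated for the SER plugin

def post_fec_ser(symbol_errors: int, total_symbols: int) -> float:
    """Symbol error rate: erroneous symbols divided by symbols transferred."""
    if total_symbols <= 0:
        raise ValueError("total_symbols must be positive")
    return symbol_errors / total_symbols

def ser_passes(symbol_errors: int, total_symbols: int,
               threshold: float = SER_THRESHOLD) -> bool:
    """Assumption: a port passes when its post-FEC SER is at or below the threshold."""
    return post_fec_ser(symbol_errors, total_symbols) <= threshold

# Made-up counters for two ports, each over one billion symbols.
print(ser_passes(symbol_errors=5, total_symbols=10**9))
print(ser_passes(symbol_errors=5000, total_symbols=10**9))
```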

Gaudi 2 SER Test Plugin Switches and Parameters

hl_qual -gaudi2 -c <pci bus id> -rmod <serial | parallel> [-dis_mon] [-waitTime <time in seconds> ] [-mon_cfg <monitor INI path>] -ser [-config_file <path to configuration file>]

Switches and Parameters

Description

-ser

SER test selector

-ber_enable

Enables the BER check.

-config_file

Provides a custom configuration file to specify the ports that will be tested.

-waitTime <time in seconds>

Time to wait, in seconds, after a hard reset so that the device can stabilize. To achieve stable BER results, a delay of 200 seconds is recommended.

./hl_qual -gaudi2 -c all -rmod parallel -dis_mon -ser
./hl_qual -gaudi2 -c all -rmod parallel -dis_mon -ser -config_file filepath.json