Connectivity Serdes Test Plugins Design, Switches and Parameters

This section describes the plugin-specific switches. It does not focus on the common switches, although they appear here for completeness of the command examples. To see the common plugin switches and parameters, refer to hl_qual Common Plugin Switches and Parameters.

Serdes Base Test Design Consideration and Responsibilities

The Serdes base test performs basic sanity tests on the Serdes, ensuring the following parts are tested:

  • Port connectivity

  • RX/TX data integrity

RX/TX data integrity is checked by transmitting pre-calculated random data and comparing the received data against reference data. The default transmit buffer size is 128 MB, which ensures a variety of transmitted data. During the test, the TX buffer is transmitted multiple times, according to the test execution time you set. The test can also check inter-box connectivity between devices using a pairs test, which takes all device pairs and runs a data integrity test on each pair.
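The integrity check described above can be pictured with a short Python sketch. It is illustrative only and is not how hl_qual is implemented: the function names `make_reference` and `first_mismatch` are hypothetical, the buffer is far smaller than the 128 MB default, and the seed is simply the example value quoted later in this section.

```python
import random


def make_reference(seed: int, size: int) -> bytes:
    """Return `size` pseudo-random bytes derived deterministically from `seed`."""
    rng = random.Random(seed)
    return bytes(rng.getrandbits(8) for _ in range(size))


def first_mismatch(received: bytes, reference: bytes) -> int:
    """Return the offset of the first differing byte, or -1 if the buffers match."""
    if received == reference:
        return -1
    for offset, (got, want) in enumerate(zip(received, reference)):
        if got != want:
            return offset
    # No differing byte in the common prefix: the lengths must differ.
    return min(len(received), len(reference))


# Hypothetical run: the "received" buffer has a single flipped bit at offset 100.
reference = make_reference(seed=0xABC123DA, size=4096)
corrupted = bytearray(reference)
corrupted[100] ^= 0x01

print(first_mismatch(reference, reference))         # -1, buffers match
print(first_mismatch(bytes(corrupted), reference))  # 100
```

Because the reference is regenerated from the seed, both ends of a link can agree on the expected data without exchanging it first.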

Serdes Base Test Testing Modes

The test plugin provides the following testing modes:

  • Loopback test - Transmits and receives from the same rank (device) by using a loopback dongle connected to all ports.

  • External loopback - Use this test when the port configuration has internal ports, which connect the devices within the server box, and external ports, which connect devices in different boxes. To run this test, the external ports must be fitted with loopback dongles, and the internal ports should be disabled using the -disable_ports switch.

Note

Loopback and External loopback tests are applicable only for first-gen Gaudi devices. Gaudi2 is not supported.

  • Pair test - This test runs over all available device pairs and tests connectivity.

  • ext_pairs - This test runs on the external ports, when applicable. These ports are used to scale out between different boxes. In this mode, they should be connected to an external switch.

  • All-Gather - This test runs over all available devices and computes the bandwidth of all-gather functionality.

  • All-Reduce - This test runs over all available devices and computes the bandwidth of all-reduce functionality.
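As a rough illustration of what "all available device pairs" means in the pair test mode, the sketch below enumerates the pairs for a set of device indices. It assumes each unordered pair is tested once, which the text does not state explicitly, and the function name `device_pairs` is hypothetical.

```python
from itertools import combinations


def device_pairs(device_ids):
    """Return every unordered pair of distinct devices, in sorted order."""
    return list(combinations(sorted(device_ids), 2))


# Hypothetical box with 8 devices, indexed 0..7.
pairs = device_pairs(range(8))
print(len(pairs))  # 28
print(pairs[:3])   # [(0, 1), (0, 2), (0, 3)]
```

The pair count grows quadratically with the device count (n(n-1)/2 under the unordered assumption).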

Pass/fail Criteria

The pass/fail criteria consist of two sub-criteria:

  • Connectivity - The test fails if the destination rank does not respond.

  • Data integrity - The test fails if the received data does not match the reference data.

The All-Gather and All-Reduce variants also calculate the BW for those network operations. You may run the test for a specific number of epochs and a specific number of iterations per epoch. This generates a MAX, MIN and Average BW per epoch, which may expose BW instabilities over the duration of the test.
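The per-epoch summary can be sketched as follows. This is an illustration of the arithmetic only, assuming one bandwidth sample per iteration. The function name `epoch_stats` and the sample values are hypothetical and are not hl_qual output.

```python
from statistics import mean


def epoch_stats(samples, iterations_per_epoch):
    """Group per-iteration bandwidth samples into epochs and summarize each epoch."""
    if iterations_per_epoch <= 0:
        raise ValueError("iterations_per_epoch must be positive")
    stats = []
    for start in range(0, len(samples), iterations_per_epoch):
        epoch = samples[start:start + iterations_per_epoch]
        stats.append({"min": min(epoch), "max": max(epoch), "avg": mean(epoch)})
    return stats


# Hypothetical samples: two epochs of three iterations each.
samples = [100.0, 102.0, 98.0, 60.0, 101.0, 99.0]
for index, summary in enumerate(epoch_stats(samples, iterations_per_epoch=3)):
    print(index, summary)
```

In this made-up data the second epoch has a much lower minimum than the first, which is the kind of instability a per-epoch MIN/MAX/Average report makes visible.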

Note

Both All-Gather and All-Reduce calculate BW only. They do not validate NIC port integrity. To validate NIC port integrity, use the pairs or ext_pairs test.

Serdes Base Test Plugin Switches and Parameters

The first-gen Gaudi and Gaudi2 test variants differ in their test capabilities, as shown in the command lines below:

hl_qual -gaudi -c <pci bus id> [-i <inner loop iterations count>] [-ep <epoch count>] -rmod <serial | parallel>  [-dis_mon] [-mon_cfg <monitor INI path>]
      -nic_base  -test_type <loopback, ext_loopback, pairs, ext_pairs, allreduce, allgather> [-pl_cfg <plugin INI config path>] [-sz <size in bytes>] [-disable_ports <port list>] [-seed <seed>]
  • -test_type <test type> - Defines the test type and port configuration to be used in the user system. The different test type variants are:

    • loopback - Loopback test. All device ports must be fitted with a loopback dongle (all Gaudi NICs are external, for example HLS1-H server topology).

    • ext_loopback - In this test, all internal ports interconnected between first-gen Gaudi devices are disabled. The external ports going outside of the server box are fitted with loopback dongles. To disable the internal ports, use -disable_ports.

    • pairs - This test checks the internal port connectivity.

    • ext_pairs - This test checks the external port connectivity. In this test case, all internal ports interconnected between first-gen Gaudi devices are disabled.

    • allgather - This test computes the all-gather bandwidth.

    • allreduce - This test computes the all-reduce bandwidth.

  • -pl_cfg <plugin INI config path> - Specifies the path to the test INI configuration file.

hl_qual -gaudi2 -c <pci bus id> [-i <inner loop iterations count>] [-ep <epoch count>] -rmod <serial | parallel>  [-dis_mon] [-mon_cfg <monitor INI path>]
      -nic_base  -test_type <pairs, ext_pairs, allreduce, allgather> [-pl_cfg <plugin INI config path>] [-sz <size in bytes>] [-disable_ports <port list>] [-seed <seed>]
  • -test_type <test type> - Defines the test type and port configuration to be used in the user system. The different test type variants are:

    • pairs - This test checks the internal port connectivity.

    • ext_pairs - This test checks the external port connectivity. In this test case, all internal ports interconnected between the devices are disabled.

    • allgather - This test computes the all-gather bandwidth.

    • allreduce - This test computes the all-reduce bandwidth.

  • -nic_base - Serdes base test selector.

  • -sz - Send/receive buffer size in bytes.

  • -t - NIC_BASE test duration in seconds. If this switch is omitted, the default value is 20 seconds. -t is applicable only to the pair and loopback tests.

  • -i - Iteration count for bandwidth computation.

  • -ep - Epoch count for bandwidth computation.

Note

The -ep switch enables multiple epochs, where each epoch consists of the number of iterations specified with the -i switch. This enables bandwidth calculation over long runs, where the BW of the iterations is averaged over each epoch. The -ep switch is applicable only to the allgather and allreduce test modes.

  • -seed <seed> - Seed value for generating the transmit patterns. This is a 32-bit number written as eight hexadecimal digits, xxxxxxxx (for example, abc123da). The default pattern is 5a5a5a5a.

  • -disable_ports <port list> - Specifies which ports to disable. Example: -disable_ports [1,2,3].
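When a script generates these arguments, checking them before launching a run can save time. The sketch below is illustrative only: it assumes the seed must be exactly eight hexadecimal digits, as the description of the seed switch suggests, and it formats a port list in the bracketed form shown in the -disable_ports example. The names `parse_seed` and `format_port_list` are hypothetical helpers, not part of hl_qual.

```python
import re

# Assumption: a valid seed is exactly eight hexadecimal digits (32 bits).
_SEED_RE = re.compile(r"[0-9a-fA-F]{8}")


def parse_seed(text: str) -> int:
    """Validate an eight-hex-digit seed string and return it as an integer."""
    if not _SEED_RE.fullmatch(text):
        raise ValueError(f"seed must be eight hex digits, got {text!r}")
    return int(text, 16)


def format_port_list(ports) -> str:
    """Format port numbers in the bracketed, comma-separated form, e.g. [1,2,3]."""
    return "[" + ",".join(str(port) for port in sorted(set(ports))) + "]"


print(hex(parse_seed("abc123da")))  # 0xabc123da
print(format_port_list([3, 1, 2]))  # [1,2,3]
```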

Examples:

-  ./hl_qual -gaudi -c all -rmod parallel -i 20 -nic_base -test_type pairs
-  ./hl_qual -gaudi -c all -rmod parallel -i 20 -ep 20 -nic_base -test_type allreduce
-  ./hl_qual -gaudi2 -c all -rmod parallel -i 50 -nic_base -test_type pairs
-  ./hl_qual -gaudi2 -c all -rmod parallel -i 50 -ep 100 -nic_base -test_type allreduce
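For scripted runs, the difference between the first-gen Gaudi and Gaudi2 test types can be captured in a small lookup. The following Python sketch assembles an argument list using only the switches listed in this section. The helper `nic_base_command` and the `TEST_TYPES` table are hypothetical, and the check on -ep reflects the note that it applies only to the allgather and allreduce modes.

```python
import shlex

# Test types per device family, taken from the two command synopses above.
TEST_TYPES = {
    "gaudi": ("loopback", "ext_loopback", "pairs", "ext_pairs", "allreduce", "allgather"),
    "gaudi2": ("pairs", "ext_pairs", "allreduce", "allgather"),
}


def nic_base_command(device, test_type, pci="all", rmod="parallel",
                     iterations=None, epochs=None):
    """Build an hl_qual Serdes base test argument list (hypothetical helper)."""
    if device not in TEST_TYPES:
        raise ValueError(f"unknown device family: {device}")
    if test_type not in TEST_TYPES[device]:
        raise ValueError(f"{test_type} is not available on {device}")
    if epochs is not None and test_type not in ("allreduce", "allgather"):
        raise ValueError("-ep applies only to allreduce and allgather")
    args = ["./hl_qual", f"-{device}", "-c", pci, "-rmod", rmod]
    if iterations is not None:
        args += ["-i", str(iterations)]
    if epochs is not None:
        args += ["-ep", str(epochs)]
    args += ["-nic_base", "-test_type", test_type]
    return args


# Reproduces the last example above.
print(shlex.join(nic_base_command("gaudi2", "allreduce", iterations=50, epochs=100)))
```

The returned list can be passed to `subprocess.run` on a machine where hl_qual is installed. The sketch itself only builds the arguments and does not run anything.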

E2E Serdes Test Plugin Design Consideration and Responsibilities

Note

The E2E Serdes test plugin is applicable for both first-gen Gaudi and Gaudi2.

The E2E Serdes test is a multi-purpose test that can check any machine topology. It uses a connectivity JSON file that describes the machine’s unique connectivity, and it verifies both connectivity and bandwidth. The test runs on multiple Habana cards, testing the ports simultaneously and reporting the collective bandwidth.

E2E Serdes Test Testing Modes

  • LOOPBACK - A serdes loopback communication test. This test runs on all external ports of the NICs and sends data through a loopback dongle from each external port back to itself. When running this test mode, the external ports must be connected (RX to TX) using a loopback dongle.

  • Internal - This mode sends data from each internal port to its connected port on a different first-gen Gaudi device.

  • All-lpbk - This mode simultaneously runs both the test on all the Internal ports and the test on all External ports. The test of the external ports is loopback and so all external ports must be connected (RX to TX) using a loopback dongle.

Pass/fail Criteria

The test currently fails only when there is no E2E connection between ports or through the loopback. The collective bandwidth is printed at the end of the test. When the test runs in Sanity mode using the -executeSanity flag, it also fails if the bit error rate is higher than 0.1%.
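To make the 0.1% threshold concrete, the sketch below computes a bit error rate as the number of differing bits divided by the total number of bits, then compares it with the limit. How hl_qual actually measures the bit error rate is not described here, so this shows the arithmetic only, and the function names are hypothetical.

```python
BER_LIMIT = 0.001  # 0.1 %, the Sanity mode threshold quoted above


def bit_error_rate(sent: bytes, received: bytes) -> float:
    """Return the fraction of bits that differ between two equal-length buffers."""
    if not sent or len(sent) != len(received):
        raise ValueError("buffers must be non-empty and of equal length")
    errors = sum(bin(a ^ b).count("1") for a, b in zip(sent, received))
    return errors / (8 * len(sent))


def sanity_passes(sent: bytes, received: bytes, limit: float = BER_LIMIT) -> bool:
    """Pass unless the bit error rate is higher than the limit."""
    return bit_error_rate(sent, received) <= limit


# Hypothetical 1000-byte buffer (8000 bits).
sent = bytes(1000)
one_bit_off = bytearray(sent)
one_bit_off[0] = 0x01    # 1 flipped bit  -> BER 0.000125
nine_bits_off = bytearray(sent)
nine_bits_off[0] = 0xFF  # 8 flipped bits
nine_bits_off[1] = 0x01  # 9 in total     -> BER 0.001125

print(sanity_passes(sent, bytes(one_bit_off)))    # True
print(sanity_passes(sent, bytes(nine_bits_off)))  # False
```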

E2E Serdes Test Plugin Switches and Parameters

sudo hl_qual -gaudi|-gaudi2 -e2e -port_map <json name> -test_mode <all-lpbk | internal | external-lpbk> -c <pci bus id> -rmod <serial | parallel>  [-dis_mon] [-mon_cfg <monitor INI path>]
  • -e2e - E2E Serdes test selector.

  • -port_map <json name> - A JSON file that defines the port connectivity within the server box and which ports are external.

  • -test_mode <test type> - Defines the test type and port configuration to be used in the user system. The different test type variants are:

    • all-lpbk - Executes both internal and external tests. This mode only runs with -rmod parallel.

    • internal - Checks the internal port connectivity. This mode only runs with -rmod parallel.

    • external-lpbk - Tests all external ports, and assumes loopback dongle on all external ports.

sudo ./hl_qual -c all -test_mode all-lpbk -rmod parallel -e2e -port_map hls1.json -gaudi

The above command line runs the E2E test on both the external ports, fitted with loopback dongles, and the internal ports. The network topology is defined by hls1.json.
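The restriction that all-lpbk and internal run only with -rmod parallel is easy to miss in scripts. A minimal Python sketch of a pre-flight check, using only the switches listed in this section, might look like this. The helper `e2e_command` and the constants are hypothetical and not part of hl_qual.

```python
E2E_TEST_MODES = ("all-lpbk", "internal", "external-lpbk")
PARALLEL_ONLY = ("all-lpbk", "internal")  # per the -test_mode descriptions above


def e2e_command(device, test_mode, port_map, pci="all", rmod="parallel"):
    """Build an hl_qual E2E Serdes test argument list (hypothetical helper)."""
    if device not in ("gaudi", "gaudi2"):
        raise ValueError(f"unknown device family: {device}")
    if test_mode not in E2E_TEST_MODES:
        raise ValueError(f"unknown test mode: {test_mode}")
    if rmod not in ("serial", "parallel"):
        raise ValueError(f"unknown run mode: {rmod}")
    if test_mode in PARALLEL_ONLY and rmod != "parallel":
        raise ValueError(f"{test_mode} only runs with -rmod parallel")
    return ["sudo", "./hl_qual", "-c", pci, "-test_mode", test_mode,
            "-rmod", rmod, "-e2e", "-port_map", port_map, f"-{device}"]


# Reproduces the example command above.
print(" ".join(e2e_command("gaudi", "all-lpbk", "hls1.json")))
```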