Bandwidth Test Plugins Design, Switches and Parameters

This section describes the plugin-specific switches. It does not focus on the common switches, although they appear here for completeness in the command examples. To see the common plugin switches and parameters, refer to hl_qual Common Plugin Switches and Parameters.

Memory Bandwidth Plugin Design Consideration and Responsibilities

Note

The memory bandwidth test plugin is applicable for both first-gen Gaudi and Gaudi2.

The memory bandwidth test plugin is an hlthunk-based DMA bandwidth measurement test. The test includes PCI bandwidth tests and a variety of DRAM and SRAM memory transfer tests. The tests are built on top of the hlthunk API, a lower-level API wrapping the Habana driver.

Memory Bandwidth Testing Modes

  • PCI tests:

    • HOST ==> HBM

    • HBM ==> HOST

    • HBM <==> HOST - bidirectional test

    • HOST ==> SRAM

    • SRAM ==> HOST

    • SRAM <==> HOST - bidirectional test

  • Device memory tests:

    • SRAM ==> HBM, using single and multiple DMA channels

    • HBM ==> SRAM, using single and multiple DMA channels

    • HBM ==> HBM, using single and multiple DMA channels

Serial and Parallel Running Mode Considerations

As the PCI link is shared by all devices, running the PCI HOST/DEVICE variants of the test on multiple devices in parallel mode will cause the test to fail. The failure occurs because the PCI bandwidth is split between the devices, which yields low bandwidth results. Hence, the PCI test variants must be run in serial mode, while the memory bandwidth tests that do not involve host-to-device communication should be run in parallel mode. Running the memory tests in parallel matters because they have a longer running time, and running them on multiple devices at once shortens the total execution time.

Split the test into two:

  • Running memory test in parallel mode example:

./hl_qual -c all -rmod parallel -mb -memOnly -gaudi2
  • Running PCI test in serial mode example:

./hl_qual -c all -rmod serial -mb -b -pciOnly -sramOnly -gaudi2

Memory Bandwidth Test - Pass/fail Criteria

The pass/fail criteria depend on the sub-test type executed by the user. Note that memory-to-memory tests are pre-calibrated, hence their pass/fail criteria cannot be changed and are reported in the hl_qual report.

The PCI sub-test pass/fail criteria are similar to those of the SynapseAI PCI load test. The following lists some assumptions made in the test plugin code:

  • The full PCI path between the HOST CPU and the device uses a predefined GEN3 setup. This can be changed by the user.

  • The full PCI path between the HOST CPU and the device is composed of 16 lanes. This can be changed by the user.

A user conducting the test on a different host setup can change these assumptions by using the applicable test switches, which will cause the test plugin to change the pass/fail criteria accordingly.

The pass criteria below are precalculated for a setup with 16 lanes and GEN3 compatibility:

  • Unidirectional download from host to device with an expected bandwidth of 11.9GB/s, assuming CPU with Gen-3 PCI link.

  • Unidirectional upload from device to host with an expected bandwidth of 12.9GB/s, assuming CPU with Gen-3 PCI link.

  • Bidirectional test which calculates the bandwidth on a simultaneous upload and download with an expected bandwidth of 19.5GB/s.

The calculated pass/fail thresholds are theoretical, hence the PCI test allows a 10% degradation. Below that, the plugin will fail the test run.
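The 10% allowance above can be illustrated with a short Python sketch. This is not the plugin's code: the function names are hypothetical, the expected values are the Gen3 x16 figures listed above, and it is an assumption that the allowance is applied as "measured bandwidth must be at least 90% of the expected bandwidth".

```python
# Expected PCI bandwidths in GB/s for a Gen3 x16 setup, taken from the list above.
EXPECTED_GBPS = {"download": 11.9, "upload": 12.9, "bidirectional": 19.5}

# Assumption: the 10% allowable degradation is applied to the expected value.
ALLOWED_DEGRADATION = 0.10


def pci_threshold(sub_test):
    """Return the lowest bandwidth (GB/s) still treated as a pass in this sketch."""
    return EXPECTED_GBPS[sub_test] * (1.0 - ALLOWED_DEGRADATION)


def pci_passes(sub_test, measured_gbps):
    """Return True when the measured bandwidth is within the allowed degradation."""
    return measured_gbps >= pci_threshold(sub_test)


if __name__ == "__main__":
    for name in EXPECTED_GBPS:
        print(name, round(pci_threshold(name), 2))
```

Under these assumptions, a download measurement just under 10.71GB/s would fail, while anything at or above it would pass.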

Memory Bandwidth Test Plugin Switches and Parameters

hl_qual -gaudi|-gaudi2  -c <pci bus id> -rmod <serial | parallel>  [-dis_mon] [-mon_cfg <monitor INI path>]
         [-n] [-b]  [-memOnly | -pciOnly | -sramOnly] -mb
  • -mb - Memory bandwidth test selector.

  • -n - Cancels checking pass fail criteria for bandwidth speed.

  • -b - Activates bidirectional bandwidth tests.

  • -memOnly - Only activates device memory tests:

    • SRAM ==> DRAM

    • DRAM ==> SRAM

    • DRAM ==> DRAM

  • -pciOnly - Only activates PCI tests to and from DRAM:

    • HOST ==> DRAM

    • DRAM ==> HOST

    • DRAM <==> HOST [only active with -b flag]

  • -sramOnly - Only activates PCI tests to and from SRAM:

    • HOST ==> SRAM

    • SRAM ==> HOST

    • SRAM <==> HOST [only active with -b flag]

./hl_qual -gaudi -c all -rmod serial -dis_mon -b -mb

The above command line executes the memory bandwidth test with all sub-tests, including the bidirectional ones, and applies the pass/fail criteria for transfer speed.

Note

  • -pciOnly, -memOnly and -sramOnly are optional. When these switches are not specified, a full test run including all available tests is executed. When a full mode run is executed, the test must run in serial mode.

  • -sramOnly and -pciOnly must be run in serial mode (-rmod serial).

  • -memOnly can be run in serial or parallel mode, but any combination of -memOnly that also includes -sramOnly or -pciOnly must be run in serial mode.

Valid examples:

./hl_qual -c all -rmod parallel -mb -memOnly -gaudi2

./hl_qual -c all -rmod serial -mb -gaudi2

./hl_qual -c all -rmod serial -mb -b -pciOnly -sramOnly -gaudi2

Invalid examples:

./hl_qual -c all -rmod parallel -mb -gaudi2

./hl_qual -c all -rmod parallel -mb -b -pciOnly -gaudi2

./hl_qual -c all -rmod parallel -mb -b -pciOnly -sramOnly -gaudi2
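The run-mode rules in the note above can be expressed as a small pre-flight check. The following Python sketch is only an illustration: `rmod_is_valid` is a hypothetical helper that is not part of hl_qual, and it encodes nothing beyond the three rules stated in the note.

```python
import shlex

# Selectors that involve host <-> device transfers and therefore need serial mode.
PCI_SELECTORS = {"-pciOnly", "-sramOnly"}
ALL_SELECTORS = PCI_SELECTORS | {"-memOnly"}


def rmod_is_valid(command):
    """Check a memory bandwidth (-mb) command line against the run-mode rules."""
    args = shlex.split(command)
    if "-mb" not in args:
        raise ValueError("not a memory bandwidth (-mb) command")
    rmod = args[args.index("-rmod") + 1]
    selected = ALL_SELECTORS.intersection(args)
    if rmod == "serial":
        # Serial mode is accepted for every selector combination.
        return True
    # Parallel mode is only accepted when -memOnly is the sole selector;
    # a full run (no selector) or any PCI selector requires serial mode.
    return rmod == "parallel" and selected == {"-memOnly"}


if __name__ == "__main__":
    print(rmod_is_valid("./hl_qual -c all -rmod parallel -mb -memOnly -gaudi2"))
    print(rmod_is_valid("./hl_qual -c all -rmod parallel -mb -gaudi2"))
```

Running the check against the examples above classifies the first three commands as valid and the last three as invalid.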

PCI Bandwidth Test Design Considerations and Requirements

Note

The PCI bandwidth test plugin is applicable for both first-gen Gaudi and Gaudi2.

The PCI bandwidth test plugin measures the PCI bandwidth when moving data between the host and the device HBM memory. hl_qual can run this test using the two running modes specified in hl_qual Design.

When running this test in serial mode, each device should achieve maximal bandwidth.

The PCI load plugin pass/fail criterion for this mode is 11.04GB/s, assuming the host CPU port is PCIe Gen3 x16.

The test run is partitioned into three sub-tests: upload, download and bidirectional. This affects the total PCI test duration. For example, if you run the test for 20 seconds in serial mode on a machine with 8 first-gen Gaudi devices, the actual test duration will be:

  • Total_test duration = 20 * 3 * 8 = 480 seconds.

To average the bandwidth calculation, 10-20 seconds is sufficient (about 200 GB for upload/download). Running each sub-test for more than 1 minute only lengthens the total test duration.
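The serial-mode duration arithmetic can be sketched in Python as follows. The function name is hypothetical, and the sketch assumes that each device runs every sub-test for the full -t value and that omitting -b reduces the run to the upload and download sub-tests.

```python
def serial_pci_test_duration(t_seconds, num_devices, bidirectional=True):
    """Estimate the total serial-mode run time in seconds.

    Assumption: upload and download always run, and the bidirectional
    sub-test runs only when -b is given.
    """
    sub_tests = 3 if bidirectional else 2
    return t_seconds * sub_tests * num_devices


if __name__ == "__main__":
    # 20 seconds per sub-test, 3 sub-tests, 8 devices tested one after another.
    print(serial_pci_test_duration(20, 8))
```

Actual run times can differ from this estimate, for example on systems where the PCI link underperforms.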

Parallel Mode

In this mode the pass/fail criteria is determined as follows:

  • For each path <P> to Habana devices <D1, D2, …> in the PCIe tree:

    1. Extract the bottle-neck <B> switch/bridge/device in P.

    2. If ( BandWidth(B) <= Sum(Test(D1) + Test(D2) + …) ) == True, then PASS else FAIL.

For example:

Figure 37 Example: PCIe tree

  • Path A : Pass if BandWidth(Second Bridge) <= Sum(Test(D1) + Test(D2)).

  • Path B : Pass if BandWidth(first Bridge) <= Sum(Test(D1)).

  • Path C : Pass if BandWidth(D1) <= Sum(Test(D1)).

  • Path D : Pass if BandWidth(Second Switch) <= Sum(Test(D1) + Test(D2) + Test(D3) + Test(D4)).

Define BandWidth = (width / 8.0) * speed
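The BandWidth definition and the per-path criterion can be illustrated with a minimal Python sketch. The function names are hypothetical. The sketch assumes that width is the lane count and speed is the link rate in GT/s, as in the lspci-style values used elsewhere in this section, and that the device results are expressed in the same unit as the computed bandwidth.

```python
def link_bandwidth(width_lanes, speed_gt_per_s):
    """BandWidth = (width / 8.0) * speed, as defined above."""
    return (width_lanes / 8.0) * speed_gt_per_s


def path_passes(bottleneck_bandwidth, device_results):
    """Apply the criterion above: PASS if BandWidth(B) <= Sum(Test(D1) + Test(D2) + ...)."""
    return bottleneck_bandwidth <= sum(device_results)


if __name__ == "__main__":
    # Hypothetical path: an x16 bottleneck at 8 GT/s shared by two devices.
    bottleneck = link_bandwidth(16, 8)
    print(bottleneck, path_passes(bottleneck, [8.5, 8.2]))
```

The measured values in the usage lines are made up for illustration only.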

This feature is used in one of two ways, depending on whether you are using a Virtual Machine or a Bare Metal setup.

Bare Metal

Use the Linux PCIe information. In this scenario, the test will compare the cards' performance against the switch/bridge bottleneck in the cards' connectivity paths.

Virtual Machine

Two files are required:

  • A TXT file that contains the lspci tree of the bare-metal machine the VM runs on. Use:

sudo lspci -vt > <<PATH-TO-QUAL-FOLDER>>/gaudi/bin/pci-tree.txt
  • A JSON file that contains the speed and width information of all switches and bridges in the Habana cards PCIe paths (including the actual Habana cards). For example:

{
   "switches": [
     {
       "address": "5d:00.0", "Speed": "8GT/s", "Width": "x16"
     },
     {
       "address": "5e:00.0", "Speed": "8GT/s", "Width": "x16"
     },
     {
       "address": "5f:00.0", "Speed": "16GT/s", "Width": "x16"
     },
     {
       "address": "60:00.0", "Speed": "16GT/s", "Width": "x16"
     },
     {
       "address": "61:00.0", "Speed": "16GT/s", "Width": "x16"
     }
   ]
}

Note

For virtual machines, the user must supply the above JSON file, as the PCI data is currently not reflected between the bare-metal and virtual machines.

Name the file switches-db.json and place it in the qual binaries folder.

In this scenario, the test will use the user-supplied files to compare the cards' performance against the switch/bridge bottleneck in the cards' connectivity paths.
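Before handing a switches-db.json file to the test, it can be useful to sanity-check it. The Python sketch below is an illustration only, not how hl_qual reads the file. It parses content in the format shown above and applies BandWidth = (width / 8.0) * speed to each entry. The helper names and the shortened sample are hypothetical.

```python
import json


def parse_speed(text):
    """Convert a speed string such as "8GT/s" to a float in GT/s."""
    if not text.endswith("GT/s"):
        raise ValueError("unexpected speed format: " + text)
    return float(text[:-len("GT/s")])


def parse_width(text):
    """Convert a width string such as "x16" to a lane count."""
    if not text.startswith("x"):
        raise ValueError("unexpected width format: " + text)
    return int(text[1:])


def switch_bandwidths(db_text):
    """Map each switch address to (width / 8.0) * speed."""
    db = json.loads(db_text)
    return {
        entry["address"]: (parse_width(entry["Width"]) / 8.0) * parse_speed(entry["Speed"])
        for entry in db["switches"]
    }


# Shortened sample in the same format as the example above.
SAMPLE = """
{
   "switches": [
     {"address": "5d:00.0", "Speed": "8GT/s", "Width": "x16"},
     {"address": "5f:00.0", "Speed": "16GT/s", "Width": "x16"}
   ]
}
"""

if __name__ == "__main__":
    for address, bandwidth in switch_bandwidths(SAMPLE).items():
        print(address, bandwidth)
```

A malformed entry, such as a missing "Speed" key or an unexpected unit, raises an exception instead of silently producing a wrong bandwidth.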

SynapseAI PCI test - Pass/fail Criteria

The PCI test plugin runs multiple tests to check the download and upload bandwidth with the following pass/fail criteria:

  • Unidirectional download from host to device with an expected bandwidth of 11.6GB/s assuming CPU with Gen-3 PCI link.

  • Unidirectional upload from device to host with an expected bandwidth of 12.9GB/s assuming CPU with Gen-3 PCI link.

  • Bidirectional test which calculates the bandwidth on a simultaneous upload and download with an expected bandwidth of 19.9GB/s.

The calculated pass/fail thresholds are theoretical, hence the PCI test allows a 10% degradation.

PCI Bandwidth Test Plugin Switches and Parameters

hl_qual -gaudi|-gaudi2 -c <pci bus id> [-t <time in seconds>]  -rmod <serial | parallel>  [-dis_mon] [-mon_cfg <monitor INI path>]
      -p [-b] [-n] [-size <size in bytes> ] [-gen gen_specifier] [-lanes lane_specifier]
  • -p - PCI test plugin selector.

  • -b - Enables the bidirectional PCI test, a simultaneous upload and download test. When this switch is omitted, the test plugin skips the bidirectional test and performs only the upload and download bandwidth tests.

  • -n - Disables bandwidth checks. This option is useful when bandwidth calculation in parallel mode is required where all devices are simultaneously being tested. When this switch is omitted, the PCI bandwidth plugin will conduct a bandwidth validation check.

  • -size <buffer size in bytes> - Upload/Download buffer size specification. The minimal buffer size must be 536870912. If this switch is omitted, the default upload/download buffer size is 200MB.

    ./hl_qual -gaudi -c all -rmod serial -t 20 -p -b -size 102400000
    
  • -gen <gen modifier> - Specifies the expected PCI device generation. There are two applicable modifiers:

    • gen3 - The test is running on a Gen-3 PCI system (Host + Habana device).

    • gen4 - The test is running on a Gen-4 PCI system (Host + Habana device).

    ./hl_qual -gaudi -c all -rmod serial -t 20 -p -b -gen gen3
    ./hl_qual -gaudi -c all -rmod serial -t 20 -p -b -gen gen4
    ./hl_qual -gaudi2 -c all -rmod serial -t 20 -p -b -gen gen4
    

    If this switch is omitted, it is assumed that the system under test (HOST + Habana device) is a Gen-3 PCI data path.

  • -t - PCI test duration in seconds. If this switch is omitted, the default value is 40 seconds.

Note

Since the PCI bandwidth test plugin conducts up to 3 sub-tests (upload, download and bidirectional), the actual run time is up to 3 times the duration given in the command line.

  • serial mode:

./hl_qual -c all -rmod serial -t 20 -p -b -pl_cfg config.ini
  • parallel mode:

./hl_qual -c all -rmod parallel -t 20 -p -b -n -pl_cfg config.ini

Note

For a system with a malfunctioning PCI link (such as low BW), the test duration will deviate from the one specified by the -t option, as the PCI plugin calculates the number of test iterations according to the expected BW.