Functional Test Plugins Design, Switches and Parameters

This section describes the plugin-specific switches. It does not focus on the common switches, although they appear here for completeness in the command examples. To see the common plugin switches and parameters, refer to hl_qual Common Plugin Switches and Parameters.

Functional tests verify full chip functionality by running several chip hardware modules in parallel and in a synchronized manner. All the tests described in this section verify calculation accuracy alongside performance, which is measured in frames per second (FPS). The accuracy check, even when executed on the host, does not affect the FPS measurement of the test.

ResNet-50 Training Stress Test Plugin Design Consideration and Responsibilities

Note

The ResNet-50 training stress test plugin is applicable for both first-gen Gaudi and Gaudi2.

The ResNet-50 training stress test plugin runs a functional ResNet-50 training test as a real-life training scenario. The test verifies accuracy and performance. To enable the accuracy check, the user must supply the full ImageNet training data set.

ResNet-50 Training Stress Test Plugin Testing Modes

  1. The ResNet-50 training stress test plugin has two testing modes, each with a different batch size:

    • 64 batch size

    • 256 batch size

  2. Random Data vs ImageNet verification:

    • Random data - The test can run on random data tensors. In this mode the accuracy check is skipped and only the achieved FPS counts toward the pass/fail criteria. This enables a pure FPS test that does not depend on the image augmenter (AEON).

    • ImageNet - When the ImageNet dataset is applied, the test evaluates both accuracy and FPS. The FPS can be influenced by the image augmenter (AEON), which runs image preprocessing on the host. To prevent performance degradation, refer to the note in the Test Differences Between First-gen Gaudi and Gaudi2 section.

The suggested number of training epochs should not exceed 90. The ResNet-50 training app should converge within that range. A 90-epoch run takes approximately 20-21 hours.

Test Differences Between First-gen Gaudi and Gaudi2

The purpose of this test is the same for first-gen Gaudi and Gaudi2. The test execution differs, however, because Gaudi2 contains an H.264/JPEG decoder accelerator. The main test differences are:

  • The Gaudi2 test variant needs to test the decoder HW path.

  • The augmentation and image preprocessing differ between Gaudi2 and first-gen Gaudi:

    • First-gen Gaudi - Uses the AEON augmenter, which runs on the host CPU.

    • Gaudi2 - Uses the Habana media pipeline, meaning JPEG decoding and image preprocessing are done on the Gaudi2 device.

As a result, the Gaudi2 test is less dependent on PCI link BW, because it sends compressed images when running on multiple devices, and is less dependent on host CPU resources.

Pass/Fail Criteria

The performance and accuracy test evaluates the loss function received from the device. If the loss shows unexpected behavior, the test fails. The test plugin verifies that the loss decreases through the epochs and converges at the expected rate for the ResNet-50 training process. It also verifies that there are no sharp jumps between iterations.
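
The two loss checks described above can be pictured with a minimal Python sketch. The function name, the use of per-epoch mean loss, and the max_jump tolerance are illustrative assumptions; the plugin's actual thresholds and expected convergence rate are not given in this section.

```python
from statistics import mean


def loss_looks_healthy(epoch_losses, max_jump=0.5):
    """Return True if the loss trends down across epochs with no sharp jumps.

    epoch_losses: list of lists, one inner list of per-iteration losses per epoch.
    max_jump: hypothetical limit on the loss increase between two
              consecutive iterations (not a documented hl_qual value).
    """
    # Check 1: the mean loss of each epoch is lower than the previous epoch's.
    epoch_means = [mean(losses) for losses in epoch_losses]
    decreasing = all(
        later < earlier for earlier, later in zip(epoch_means, epoch_means[1:])
    )

    # Check 2: no iteration-to-iteration increase larger than max_jump.
    flat = [loss for losses in epoch_losses for loss in losses]
    no_sharp_jumps = all(b - a <= max_jump for a, b in zip(flat, flat[1:]))

    return decreasing and no_sharp_jumps
```

A run whose loss rises between epochs, or spikes between two consecutive iterations, would be flagged by this sketch.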

The performance [images/sec] is calculated per training epoch. The expected results per core:

  • First-gen Gaudi - FPS: 1750 images/sec, epoch runtime: 13.3 minutes

  • Gaudi2 - FPS: 5850 images/sec, epoch runtime: 3.6 minutes

Expected accuracy for a 90-epoch run: 0.743
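
As a rough cross-check of the figures above, the total run time follows from the epoch count and the per-epoch runtime. The sketch below is plain arithmetic on the quoted values; the function name is illustrative.

```python
def total_run_hours(epochs, epoch_minutes):
    """Total training time in hours for a given per-epoch runtime."""
    return epochs * epoch_minutes / 60.0


# Epoch runtimes quoted above (13.3 and 3.6 minutes), for a 90-epoch run.
slower = total_run_hours(90, 13.3)  # about 19.95 hours
faster = total_run_hours(90, 3.6)   # about 5.4 hours
```

The slower figure is roughly in line with the 20-21 hours quoted for a 90-epoch run.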

Note

The trainingApp test uses the AEON augmenter, which can limit the achievable FPS when running on multiple devices. To enable the test to run on all 8 devices, the user’s host machine should include:

  • Two NUMA nodes

  • 96 CPU cores (Minimum), evenly distributed between the NUMA nodes.

  • 384 GB of RAM.

For running on smaller sets of devices, the above numbers can be reduced.
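
The host requirements for an 8-device run can be written as a simple check. This is a sketch only: the function name and its inputs are hypothetical, the caller is expected to supply the host description, and the sketch assumes that exactly two NUMA nodes are required and that 384 GB is a minimum.

```python
def host_meets_8_device_minimum(cores_per_numa_node, ram_gb):
    """Check a host description against the 8-device requirements listed above.

    cores_per_numa_node: list with the CPU core count of each NUMA node.
    ram_gb: installed RAM in GB.
    """
    two_numa_nodes = len(cores_per_numa_node) == 2
    enough_cores = sum(cores_per_numa_node) >= 96
    # "Evenly distributed" is read here as the same core count on every node.
    evenly_split = len(set(cores_per_numa_node)) == 1
    enough_ram = ram_gb >= 384  # assumption: 384 GB is treated as a minimum
    return two_numa_nodes and enough_cores and evenly_split and enough_ram
```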

ResNet-50 Training Stress Test Plugin Switches and Parameters

The following lists the training test plugin switches and parameters:

hl_qual -gaudi|-gaudi2 -c <pci bus id> -rmod <serial | parallel> [-dis_mon] [-mon_cfg <monitor INI path>]
      -trainingApp [-bs <batch size 64 | 256>] [-type <application type training | validation>] [-epoch <number of epochs>] [-n <number of iterations>] [-rand] [-pl_cfg <plugin INI config path>]
  • -trainingApp - Training test plugin selector.

  • -bs <64 | 256> - Defines the training batch size.

    • 64 - Batch size 64

    • 256 - Batch size 256

If the value is not specified, the default value is 256.

  • -type <training | validation> - Defines the type of data set.

    • training - training set

    • validation - validation set

If the value is not specified, the default value is training.

  • -epoch <number of epochs> - Defines the epoch count.

  • -n <number of iterations> - Defines the iteration count. You must provide the -epoch flag with this flag. Example: -epoch 1 -n 1000.

  • -rand - Random input generation. This mode disables accuracy and loss validation. Only FPS is calculated. The test uses a preset of 1500 iterations.

  • -log - Writes statistics to a file.

  • -pl_cfg <INI config file path> - Enables specifying an INI configuration file to configure the training test plugin.

./hl_qual -gaudi -c all -rmod parallel -trainingApp -bs 256 -rand
./hl_qual -gaudi -c all -rmod parallel -trainingApp -bs 256 -epoch 3 -type training
./hl_qual -gaudi2 -c all -rmod parallel -trainingApp -bs 256 -epoch 3 -type training
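
For scripted runs, the switch rules above can be enforced before launching hl_qual. The following Python sketch builds an argument list using only the switches documented in this section. The function name and its defaults are illustrative, and omitting -epoch, -n and -type when -rand is set simply mirrors the first example above; whether hl_qual accepts them together is not stated here.

```python
def training_app_command(device, pci_bus_id="all", rmod="parallel", bs=256,
                         data_type="training", epoch=None, n=None, rand=False):
    """Build an hl_qual argument list for the -trainingApp plugin."""
    if device not in ("-gaudi", "-gaudi2"):
        raise ValueError("device must be -gaudi or -gaudi2")
    if bs not in (64, 256):
        raise ValueError("batch size must be 64 or 256")
    if data_type not in ("training", "validation"):
        raise ValueError("type must be training or validation")
    if n is not None and epoch is None:
        raise ValueError("-n must be used together with -epoch")

    cmd = ["./hl_qual", device, "-c", pci_bus_id, "-rmod", rmod,
           "-trainingApp", "-bs", str(bs)]
    if rand:
        # Random-data mode: FPS only, no dataset-related switches (assumption).
        cmd.append("-rand")
        return cmd
    if epoch is not None:
        cmd += ["-epoch", str(epoch)]
    if n is not None:
        cmd += ["-n", str(n)]
    cmd += ["-type", data_type]
    return cmd
```

The returned list can be passed to subprocess.run() from the hl_qual bin directory.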

ResNet-50 Training Stress Test Plugin Configuration Files and Requirements

Before using the plugin, make sure to perform the following:

  • Download the training/validation dataset from ImageNet.

  • Download the labels file from https://www.kaggle.com/c/imagenet-object-localization-challenge/data?select=LOC_synset_mapping.txt

  • Untar the ImageNet tar files (ILSVRC2012_img_train.tar, ILSVRC2012_img_val.tar).

  • Change your current directory to the hl_qual bin directory.

  • Run the preparation script, prepare.sh (the script is included in the package). This untars all the tar files (ILSVRC2012_img_train.tar) and generates the training list file (train_list.txt).

./prepare.sh -m MODE -d EXTRACTED_DIR -f LABEL_FILE [-h]

Parameters:

  • MODE - ‘train’ or ‘val’.

  • EXTRACTED_DIR - The path to the directory that contains the untarred files from the ILSVRC2012_img_train.tar file.

  • LABEL_FILE - The path to LOC_synset_mapping.txt file.

Note

IMPORTANT: Expected execution time depends on the number of epochs configured for the test run. Each epoch can take up to 18 minutes.

Make a copy of the following files. The copies can be reused after installing a new package on the same setup:

  • train_list.txt

  • training256.json

  • training64.json

  • validation256.json

  • validation64.json

Reinstalling the package or relaunching the script overwrites these files.
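
One way to keep these files safe is to copy them to a separate directory before reinstalling. The sketch below is a minimal example; the function name is hypothetical, and both directory arguments are supplied by the caller because the location of the generated files is not stated here.

```python
import shutil
from pathlib import Path

# File names taken from the list above.
GENERATED_FILES = [
    "train_list.txt",
    "training256.json",
    "training64.json",
    "validation256.json",
    "validation64.json",
]


def backup_generated_files(source_dir, backup_dir):
    """Copy the generated files that exist in source_dir into backup_dir.

    Returns the names of the files that were copied.
    """
    source = Path(source_dir)
    target = Path(backup_dir)
    target.mkdir(parents=True, exist_ok=True)
    copied = []
    for name in GENERATED_FILES:
        candidate = source / name
        if candidate.is_file():
            # copy2 also preserves timestamps where the platform allows it.
            shutil.copy2(candidate, target / name)
            copied.append(name)
    return copied
```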

First-gen Gaudi Functional Test Plugin Design Consideration and Responsibilities

Note

The functional test plugin is applicable only for first-gen Gaudi devices. Gaudi2 is not supported.

The functional test runs all available hardware components on the first-gen Gaudi SOC to test the functionality and the interaction between the different units during parallel execution. When using parallel execution, the test plugin will run on all hardware components simultaneously. The following are the tested units:

  • PCI links

  • DMA engines – moving data between:

    • PCI ==> HBM, HBM ==> PCI

    • HBM ==> SRAM, SRAM ==> HBM

  • MME engines

  • TPC engines

  • Serdes

Functional Test Testing Modes

The functional test contains the following sub-test modes:

  • Simple mode – The test runs a topology which checks the PCI, DMA, MME and TPC units. Serdes communication is not tested.

  • LOOPBACK – On top of the simple topology, a Serdes loopback communication test is added. This test includes a verification topology that is executed on the device. When running this test mode, the device must be connected (RX to TX) using a loopback dongle.

  • AllGather - This mode is built on top of the simple functional test. It enables an advanced AllGather Serdes test. During this test, each first-gen Gaudi device sends data to all other available devices in the server. The received data from all the first-gen Gaudi devices is verified using a predefined topology and compared with expected data to verify RX/TX integrity.

  • AllReduce - This mode is built on top of the simple functional test. It enables an advanced AllReduce Serdes test. During this test, each first-gen Gaudi device sends data to all other available devices in the server. After receiving the messages from all other first-gen Gaudi devices, each device performs a reduction by summing all the messages and placing the result into the RX matrix. The result matrix is verified using a verification topology running TPC code. The final result is sent to the host for verification.
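
The difference between the AllGather and AllReduce modes can be illustrated with plain Python lists standing in for device buffers. This is a conceptual sketch of the two collective operations, not the plugin's implementation, and it assumes that each device's own buffer takes part in the result.

```python
def all_gather(buffers):
    """Each device ends up with a copy of every device's buffer, in device order."""
    return [[list(buf) for buf in buffers] for _ in buffers]


def all_reduce_sum(buffers):
    """Each device ends up with the element-wise sum of all devices' buffers."""
    summed = [sum(values) for values in zip(*buffers)]
    return [list(summed) for _ in buffers]


# Three simulated devices, each holding a two-element buffer.
device_buffers = [[1, 2], [10, 20], [100, 200]]
gathered = all_gather(device_buffers)     # every device holds all three buffers
reduced = all_reduce_sum(device_buffers)  # every device holds [111, 222]
```

In both cases every device should end with the same data, which is what makes comparison against an expected reference possible.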

The functional test plugin builds a test topology composed of the following computation nodes:

  • Large GEMM nodes

  • Large Matrix add nodes

  • Embedding sum nodes

  • Large sub nodes

  • Reduce_L1_norm nodes

The test application runs the topology on each test iteration by injecting pre-calculated inputs and compares the output against a pre-calculated reference.

Pass/Fail Criteria

The pass/fail criteria consist of three sub-criteria:

  • The calculated value of each topology launch must be identical to a pre-calculated reference.

  • All RX/TX transmissions received from other devices must be bit-exact with the expected matrix values.

  • The execution throughput [executions/second] must not fall below a predefined threshold.
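
The three sub-criteria combine into a single verdict, as in the following sketch. All names are illustrative and buffers are modeled as bytes objects; the actual threshold value used by the plugin is not given in this section.

```python
def functional_test_passes(outputs, reference, rx_buffers, expected_rx,
                           executions, elapsed_seconds,
                           min_executions_per_second):
    """Combine the three sub-criteria into a single pass/fail verdict."""
    # 1. Every topology launch must reproduce the pre-calculated reference exactly.
    outputs_ok = all(out == reference for out in outputs)

    # 2. Every buffer received from another device must be bit-exact.
    rx_ok = len(rx_buffers) == len(expected_rx) and all(
        rx == exp for rx, exp in zip(rx_buffers, expected_rx)
    )

    # 3. Throughput [executions/second] must not fall below the threshold.
    throughput_ok = executions / elapsed_seconds >= min_executions_per_second

    return outputs_ok and rx_ok and throughput_ok
```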

First-gen Gaudi Functional Test Plugin Switches and Parameters

Note

The functional test plugin is applicable only for first-gen Gaudi based devices. Gaudi2 is not supported.

hl_qual -gaudi -c <pci bus id> [-t <time in seconds>]  -rmod <serial | parallel>  [-dis_mon] [-mon_cfg <monitor INI path>]
      -f  [-serdes_type <none | loopback | allgather | allreduce>] [-disable_ports <port list>] [-pl_cfg <plugin INI config path>]
  • -f - Functional test plugin selector.

  • -serdes_type <serdes test definition> - Selects one of the following four sub-test modes:

    • none - Runs the simplified functional test without Serdes testing.

    ./hl_qual -gaudi -c all -rmod parallel -f -serdes_type none
    
    • loopback - Runs the simplified functional test including the Serdes loopback test. This option may also be tested on external ports. You must connect loopback dongles to the external ports and disable all internal ports. To disable internal ports, use -disable_ports.

    ./hl_qual -gaudi -c all -rmod parallel -f -serdes_type loopback
    

    Note

    When running this test, all ports of the first-gen Gaudi device must be connected with a loopback dongle to close the RX/TX loop.

    • allgather - Runs the simplified functional test including the Serdes AllGather test. During each test iteration, each first-gen Gaudi device sends a 16 MB data buffer to all other devices participating in the test. All first-gen Gaudi devices check the received transmission against an expected reference input buffer.

    • allreduce - Runs the simplified functional test including the Serdes AllReduce test. During each test iteration, each first-gen Gaudi device sends a 16 MB data buffer to all other devices participating in the test. All first-gen Gaudi devices check the received transmission against an expected reference input buffer.

    ./hl_qual -gaudi -c all -rmod parallel -f -serdes_type allgather
    

    Note

    This test can be performed only on HLS-1 systems.

When -serdes_type is not specified on the command line or in the configuration file, the default is none, which means only the simplified functional test runs.

  • -pl_cfg <INI config file path> - This switch allows specifying the path to a configuration file for test customization.

  • -disable_ports <port list> - Specifies which ports to disable. Example: -disable_ports [1,2,3].

./hl_qual -gaudi -c all -rmod parallel -f -pl_cfg config.ini
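
When the functional test is launched from a script, the -serdes_type value and the bracketed port list can be assembled programmatically. The sketch below uses only the switches documented above; the function name and the fixed "-c all -rmod parallel" prefix are illustrative choices copied from the examples.

```python
SERDES_TYPES = ("none", "loopback", "allgather", "allreduce")


def functional_test_command(serdes_type=None, disable_ports=None, pl_cfg=None):
    """Build an hl_qual argument list for the first-gen Gaudi -f plugin."""
    cmd = ["./hl_qual", "-gaudi", "-c", "all", "-rmod", "parallel", "-f"]
    if serdes_type is not None:
        if serdes_type not in SERDES_TYPES:
            raise ValueError("unknown serdes_type: " + serdes_type)
        cmd += ["-serdes_type", serdes_type]
    if disable_ports:
        # Format the ports as in the documented example: [1,2,3]
        port_list = "[" + ",".join(str(port) for port in disable_ports) + "]"
        cmd += ["-disable_ports", port_list]
    if pl_cfg is not None:
        cmd += ["-pl_cfg", pl_cfg]
    return cmd
```

Passing the resulting list to subprocess.run() avoids shell interpretation of the square brackets, which a shell could otherwise treat as a filename pattern.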

Functional Test 2 Plugin Design Consideration and Responsibilities

Note

Functional test 2 plugin is applicable for both first-gen Gaudi and Gaudi2.

Functional test 2 runs all available hardware components on the device SOC to test the functionality and the interaction between the different units during parallel execution. When using parallel execution, the test plugin runs on all hardware components simultaneously.

As opposed to the original functional test, this test is more aggressive and can therefore be used as a power stress test.

The test can run for long hours and tests the following device functionalities:

  • Thermal stress - Cooling system functionality, temperature dissipation and thermal protection mechanisms can be checked while running the power stress plugin under extreme load.

  • PID and clock relaxation mechanisms verification

  • Long work periods in typical power workloads (extreme, high)

  • Full bit-exact calculation

Tested units:

  • PCI links

  • DMA engines – moving data between:

    • PCI ==> HBM, HBM ==> PCI

    • HBM ==> SRAM, SRAM ==> HBM

  • MME engines

  • TPC engines

  • Serdes connectivity - only when using -serdes switch

Functional Test 2 Testing Modes

The purpose of functional test 2 is to drive high power consumption while verifying the calculation result of each topology execution (all execution steps are verified).

Functional test 2 contains the following sub-test modes:

First-gen Gaudi:

  1. Extreme - measured power level: 360 [watt]

  2. High – measured power level: 340-355 [watt]

Gaudi2:

  1. Extreme - measured power level: 550 [watt]

  2. High – measured power level: 510-540 [watt]

The functional test 2 plugin builds a test topology that includes large tensors and multiple operators (Conv, Batchnorm). When the -serdes switch is applied, the topology also includes a full Serdes receive-tensor verification graph, including sub and L1 norm nodes.

The test application runs the topology on each test iteration by injecting pre-calculated inputs and compares the output against a pre-calculated reference for each topology execution on the device.

Note

The initialization stage can take up to 100 seconds. This time is required to recalculate and generate the expected reference output tensors, compile the test topology, and calibrate the test runtime execution. The initialization time is not included in the test duration specified by the user with the -t switch.

Pass/Fail Criteria

The pass/fail criteria consist of the following:

  • The calculated value of each topology launch must be identical to a pre-calculated reference.

  • The execution throughput [executions/second] must not fall below a predefined threshold.

First-gen Gaudi thresholds:

  1. Extreme - FPS 208 [Frame/Sec]

  2. High – FPS 256 [Frame/Sec]

Gaudi2 thresholds:

  1. Extreme - FPS 665 [Frame/Sec]

  2. High – FPS 738 [Frame/Sec]
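
The throughput criterion amounts to a table lookup followed by a comparison. In the sketch below, the assignment of the lower pair of thresholds (208, 256) to first-gen Gaudi and the higher pair (665, 738) to Gaudi2 is an assumption, as are the dictionary keys and the function name.

```python
# Threshold values from the list above; the device assignment is an assumption.
FPS_THRESHOLDS = {
    ("gaudi", "extreme"): 208,
    ("gaudi", "high"): 256,
    ("gaudi2", "extreme"): 665,
    ("gaudi2", "high"): 738,
}


def meets_fps_threshold(device, level, measured_fps):
    """Return True if the measured FPS does not fall below the threshold."""
    return measured_fps >= FPS_THRESHOLDS[(device, level)]
```

Note that the extreme level has the lower FPS threshold in both pairs.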

Functional Test 2 Plugin Switches and Parameters

hl_qual -gaudi|-gaudi2 -c <pci bus id> [-t <time in seconds>]  -rmod <serial | parallel>  [-dis_mon] [-mon_cfg <monitor INI path>]
      -f2 -l <extreme | high> [-d] [-pl_cfg <plugin INI config path>]
  • -f2 - Functional test 2 plugin selector.

  • -d - Download-once option. The input tensors are downloaded to the device at the beginning of the test and reused for all test iterations. This switch is useful when the user suspects that functional test performance degradation is caused by low PCI BW.

  • -l <extreme | high> - Power level selector.

    Measured on HL-205:

    • extreme - 350-360 [W]

    • high - 340-350 [W]

    Measured on HL-225:

    • extreme - 600-620 [W]

    • high - 580-600 [W]

  • -pl_cfg <INI config file path> - This switch allows specifying the path to a configuration file for test customization.

./hl_qual -gaudi -c all -rmod parallel -f2 -d -l high