Functional Test Plugins Design, Switches and Parameters

This section describes plugin-specific switches. The common switches are not covered in detail, although they appear in the command examples for completeness. For the common plugin switches and parameters, refer to hl_qual Common Plugin Switches and Parameters.

Functional tests verify the full chip functionality while running several chip hardware modules in parallel and in a synchronized manner. All the tests described in this section verify calculation accuracy alongside a performance metric, measured in frames per second (FPS). The accuracy check is executed on the host and does not affect the FPS measurement of the test.

ResNet-50 Training Stress Test Plugin Design Consideration and Responsibilities

Note

  • The ResNet-50 training stress test plugin is applicable for both first-gen Gaudi and Gaudi2.

  • Before running this plugin, make sure to set the following environment variable: export __python_cmd=python3

The ResNet-50 training stress test plugin runs a functional ResNet-50 training test as a real-life training scenario. The test verifies accuracy and performance. To enable the accuracy check, the user must supply the full ImageNet training dataset.

ResNet-50 Training Stress Test Plugin Testing Modes

  1. Batch size: the ResNet-50 training stress test plugin can run with one of two batch sizes:

    • 64 batch size

    • 256 batch size

  2. Random Data vs ImageNet verification:

    • Random Data - The test can run on random data tensors. In this mode the accuracy check is skipped, and only the achievable FPS is taken into account in the pass/fail criteria. This enables a pure FPS test that does not depend on the image augmenter. This test is limited to 1000 iterations.

    • ImageNet - When using the ImageNet dataset, the test evaluates both accuracy and FPS. The FPS can be influenced by the image augmenter: AEON in the case of first-gen Gaudi, and Habana Media Pipe in the case of Gaudi2. To prevent performance degradation, refer to the note in the Test Differences Between First-gen Gaudi and Gaudi2 section.

The number of training epochs should not exceed 90; the ResNet-50 training application should converge within that range. A 90-epoch run takes approximately 20-21 hours.

Test Differences Between First-gen Gaudi and Gaudi2

The purpose of this test is the same for first-gen Gaudi and Gaudi2. However, the test execution differs because Gaudi2 contains an H.264/JPEG decoder accelerator. The main test differences are:

  • The Gaudi2 test variant also needs to test the decoder hardware path.

  • The augmentation and image preprocessing between Gaudi2 and first-gen Gaudi are different:

    • First-gen Gaudi - Uses the AEON augmenter, which runs on the host CPU.

    • Gaudi2 - Uses the Habana media pipeline, meaning JPEG decoding and image preprocessing are done on the Gaudi2 device.

As a result, the Gaudi2 test is less dependent on PCI link bandwidth, because it sends compressed images when running on multiple devices, and is less dependent on host CPU resources.

ResNet-50 and Linux File Cache Considerations

Linux has a file cache that accelerates IO operations. When a file is read for the first time, it is loaded into the file cache allocated in host RAM so that subsequent reads of this file are faster. As a result, when running the training plugin with the full ImageNet dataset, which is about 140GB in size, the performance of the first training epoch could be lower than expected. To address this issue, perform one of the following:

  • Run a dummy training run for a full epoch and ignore any failure it reports. This dummy run loads the full ImageNet dataset into the Linux cache so that the following run has the expected performance.

  • Use the iocache_loader application supplied in the hl_qual installation folder to load the ImageNet dataset into the Linux cache.

Note

Rebooting the host clears the Linux file cache, so one of the above options must be executed again after a reboot to achieve the expected performance.
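To illustrate what warming the file cache involves, the following is a minimal Python sketch that reads every file under a dataset directory with a pool of threads, so that the kernel can keep the file contents in its cache. It is not the iocache_loader implementation: the function names, the chunk size, and the way results are reported are assumptions made for illustration. The default of 20 threads simply mirrors the iocache_loader default.

```python
import os
from concurrent.futures import ThreadPoolExecutor

CHUNK_SIZE = 1 << 20  # read in 1 MiB chunks (arbitrary choice for this sketch)


def read_file(path):
    """Read a file to the end, discard the data, and return the bytes read."""
    total = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(CHUNK_SIZE)
            if not chunk:
                break
            total += len(chunk)
    return total


def warm_file_cache(dataset_dir, num_threads=20):
    """Read every file under dataset_dir; return (file count, bytes read)."""
    paths = [
        os.path.join(root, name)
        for root, _dirs, files in os.walk(dataset_dir)
        for name in files
    ]
    # Reading is IO bound, so threads overlap the waits on storage.
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        sizes = list(pool.map(read_file, paths))
    return len(paths), sum(sizes)
```

Calling warm_file_cache on the training dataset directory before the real run plays the same role as the dummy epoch described above.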

Running iocache_loader Application

The following shows the iocache_loader command line interface. Running the application without command line parameters prints the applicable switches.

$ ./iocache_loader
path to training dataset must be supplied
./iocache_loader -p <path> -t <num of threads> -e
-p <path> - path to dataset - this is mandatory
-t <num of threads> - number of threads default 20
-e enable output printouts

For example:

./iocache_loader -p /user/imagenet/train -t 40

ResNet-50 training - Pass/Fail Criteria

The performance and accuracy test evaluates the loss function received from the device. If the loss function shows unexpected behavior, the test fails. The test plugin verifies that the loss decreases through the epochs and converges at the rate expected for the ResNet-50 training process. It also verifies that there are no sharp jumps between iterations.
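The kind of check described above can be sketched in Python as follows. This is an illustration only, not the plugin's actual algorithm: the function name, the use of per-epoch mean losses, and the max_jump_ratio threshold are hypothetical, since the source does not state the thresholds the plugin applies. The sketch covers the decreasing-loss and sharp-jump checks, not the convergence-rate check.

```python
def check_loss_trend(epoch_losses, iteration_losses, max_jump_ratio=1.5):
    """Return (passed, reason) for recorded training losses.

    epoch_losses: mean loss of each epoch, in order.
    iteration_losses: loss of each iteration, in order.
    max_jump_ratio: hypothetical upper limit on loss[i] / loss[i - 1].
    """
    # The loss must decrease from one epoch to the next.
    for prev, cur in zip(epoch_losses, epoch_losses[1:]):
        if cur >= prev:
            return False, f"epoch loss did not decrease: {prev} -> {cur}"

    # No sharp upward jump is allowed between consecutive iterations.
    for prev, cur in zip(iteration_losses, iteration_losses[1:]):
        if prev > 0 and cur / prev > max_jump_ratio:
            return False, f"sharp jump between iterations: {prev} -> {cur}"

    return True, "ok"
```

Small fluctuations between iterations are tolerated here; only a jump above the chosen ratio, or an epoch whose mean loss does not improve, is reported as a failure.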

The performance [images/sec] is calculated per training epoch. The expected results per core:

  • First-gen Gaudi - FPS: 1580 images/sec, epoch runtime: 13.3 minutes

  • Gaudi2 - FPS: 5750 images/sec, epoch runtime: 3.6 minutes

Expected accuracy for a 90-epoch run: 0.743.
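As a quick consistency check of these figures, the total run time is the per-epoch runtime multiplied by the number of epochs. The sketch below is plain arithmetic on the numbers quoted in this section; the function name is illustrative only.

```python
def total_runtime_hours(epoch_minutes, epochs):
    """Total run time in hours for a given per-epoch runtime in minutes."""
    return epoch_minutes * epochs / 60.0


# 90 epochs at the two expected per-epoch runtimes quoted above.
print(round(total_runtime_hours(13.3, 90), 2))  # 19.95
print(round(total_runtime_hours(3.6, 90), 2))   # 5.4
```

The first result, about 20 hours, is in line with the 20-21 hours quoted earlier for a 90-epoch run.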

Note

The trainingApp test uses the AEON augmenter, which can be a limiting factor on the achievable FPS when running on multiple devices. To enable the test to run on all 8 devices, the user’s host machine should include:

  • Two NUMA nodes

  • 96 CPU cores (Minimum), evenly distributed between the NUMA nodes.

  • 384 GB of RAM.

When running on fewer devices, the above numbers can be reduced.

ResNet-50 Training Stress Test Plugin Switches and Parameters

The following lists the training test plugin switches and parameters:

hl_qual -gaudi|-gaudi2 -c <pci bus id> [-rmod <serial | parallel>] [-dis_mon] [-mon_cfg <monitor INI path>]
      -trainingApp [-bs <batch size 64 | 256>] [-epoch <number of epochs>] [-n <number of iterations>] [-rand]

Switches and Parameters - Description

  • -trainingApp - Training test plugin selector.

  • -bs <64 | 256> - Defines the training batch size. If the value is not specified, the default value is 256.

    • 64 - Batch size 64

    • 256 - Batch size 256

  • -epoch - Defines the epoch count.

  • -n - Defines the iteration count. This flag must be used together with the -epoch flag. Example: -epoch 1 -n 1000.

  • -rand - Random input generation. This mode disables accuracy and loss validation; only FPS is calculated. The test uses a preset of 1500 iterations.

  • -log - Writes statistics to a file.

./hl_qual -gaudi -c all -rmod parallel -trainingApp -bs 256 -rand
./hl_qual -gaudi -c all -rmod parallel -trainingApp -bs 256 -epoch 3
./hl_qual -gaudi2 -c all -rmod parallel -trainingApp -bs 256 -epoch 3
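When the plugin is launched from automation scripts, the command line can be assembled programmatically. The following Python sketch builds a trainingApp command using only the switches listed in this section and enforces two rules stated above: the batch size must be 64 or 256, and -n must be accompanied by -epoch. The helper name build_training_cmd and its parameter names are hypothetical; they are not part of hl_qual.

```python
import shlex


def build_training_cmd(device, pci_bus_id="all", rmod="parallel",
                       batch_size=256, epochs=None, iterations=None,
                       random_data=False):
    """Return the hl_qual trainingApp command as a list of arguments."""
    if device not in ("gaudi", "gaudi2"):
        raise ValueError("device must be 'gaudi' or 'gaudi2'")
    if rmod not in ("serial", "parallel"):
        raise ValueError("rmod must be 'serial' or 'parallel'")
    if batch_size not in (64, 256):
        raise ValueError("batch size must be 64 or 256")
    if iterations is not None and epochs is None:
        raise ValueError("-n must be used together with -epoch")

    cmd = ["./hl_qual", f"-{device}", "-c", str(pci_bus_id),
           "-rmod", rmod, "-trainingApp", "-bs", str(batch_size)]
    if epochs is not None:
        cmd += ["-epoch", str(epochs)]
    if iterations is not None:
        cmd += ["-n", str(iterations)]
    if random_data:
        cmd.append("-rand")
    return cmd


print(shlex.join(build_training_cmd("gaudi2", epochs=3)))
# ./hl_qual -gaudi2 -c all -rmod parallel -trainingApp -bs 256 -epoch 3
```

Returning a list rather than a single string lets the caller pass the arguments directly to a process launcher without shell quoting concerns.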

ResNet-50 Training Stress Test Plugin Configuration Files and Requirements

Before using the plugin, make sure to perform the following:

  • Download the training/validation dataset from ImageNet.

  • Download the labels file from https://www.kaggle.com/c/imagenet-object-localization-challenge/data?select=LOC_synset_mapping.txt

  • Untar the ImageNet tar files (ILSVRC2012_img_train.tar, ILSVRC2012_img_val.tar).

  • Change your current directory to the hl_qual bin directory.

  • Run the preparation script, prepare.sh (the script is included in the package). The script untars all the tar files (ILSVRC2012_img_train.tar file) and generates the training list file (train_list.txt).

./prepare.sh -m MODE -d EXTRACTED_DIR -f LABEL_FILE [-h]

Parameters:

  • MODE - ‘train’ or ‘val’.

  • EXTRACTED_DIR - The path to the directory that contains the untared files from ILSVRC2012_img_train.tar file.

  • LABEL_FILE - The path to LOC_synset_mapping.txt file.

Note

IMPORTANT: Expected execution time depends on the number of epochs configured for the test run. Each epoch can take up to 18 minutes.

Make a copy of the following files so that they can be reused after installing a new package on the same setup:

  • train_list.txt

  • training256.json

  • training64.json

Reinstalling the package or relaunching the script overwrites these files.

Functional Test 2 Plugin Design Consideration and Responsibilities

Note

Functional test 2 plugin is applicable for both first-gen Gaudi and Gaudi2.

The functional test 2 runs all available hardware components on the first-gen Gaudi and Gaudi2 SOC to test the functionality and the interaction between the different units during parallel execution. When using parallel execution, the test plugin will run on all hardware components simultaneously.

The functional test uses a synthetic topology with multiple operations that exercise all computational units and all available memories while drawing high power. The output of each topology run is verified against a pre-calculated reference to confirm bit-exact results.

The test can run for long hours and test the following device functionalities:

  • Thermal stress test: cooling system functionality, heat dissipation and thermal protection mechanisms can be checked while running the power stress plugin under extreme load.

  • PID and clock relaxation mechanisms verification

  • Long work periods in typical power workloads (extreme, high)

  • Full bit-exact calculation

Tested units:

  • PCI links

  • DMA engines – moving data between:

    • PCI ==> HBM, HBM ==> PCI

    • HBM ==> SRAM, SRAM ==> HBM

  • MME engines

  • TPC engines

  • Serdes connectivity - only when using -serdes switch

Functional Test 2 Testing Modes

The purpose of the functional test is to drive high power consumption while verifying the calculation result of each topology execution (all execution steps are verified).

The functional test contains the following sub-test modes:

  • First-gen Gaudi:

    1. Extreme - measured power level: 345-355 [watt]

    2. High - measured power level: 200-230 [watt]

  • Gaudi2:

    1. Extreme - measured power level for 54V power supply: 530-560 [watt]

    2. High - measured power level for 54V power supply: 370-420 [watt]

The measurements above were recorded from a 4-minute run. They can change depending on the environmental status of the system (fan speed, server box configuration and ambient temperatures).

The functional test 2 plugin builds a test topology that includes large tensors and multiple operators (Conv, Batchnorm). When the -serdes switch is applied, the topology also includes a full serdes receive tensor verification graph, including sub and L1 norm operations.

The test application runs the topology on each test iteration by injecting pre-calculated inputs and compares the output against a pre-calculated reference for each topology execution on the device.

Note

The initialization stage can take up to 170 seconds. This time is required to calculate and generate the expected reference output tensors, compile the test topology, and calibrate the test runtime execution. The initialization time is not included in the test running duration specified by the user with the -t switch.

Functional Test 2 - Pass/Fail Criteria

The pass/fail criteria are as follows:

  • The calculated value of each topology launch must be identical to a pre-calculated reference.

  • The execution throughput [executions/second] must not fall below a predefined threshold:

    • First-gen Gaudi:

      1. Extreme - FPS 260 [Frame/Sec]

      2. High - FPS 270 [Frame/Sec]

    • Gaudi2:

      1. Extreme - FPS 750 [Frame/Sec], measured on HLS2 server

      2. High - FPS 880 [Frame/Sec], measured on HLS2 server

The measurements above were recorded from a 4-minute run.
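The two criteria can be combined as in the following Python sketch. It is an illustration only, not the plugin's code: the function and table names are hypothetical, the outputs and reference are modeled as bytes objects, and the thresholds are copied from the lists above under the assumption that the first pair applies to first-gen Gaudi and the second pair (measured on an HLS2 server) applies to Gaudi2.

```python
# Minimum throughput in executions per second, keyed by (device, level).
# Assumption: first pair = first-gen Gaudi, second pair = Gaudi2.
MIN_FPS = {
    ("gaudi", "extreme"): 260,
    ("gaudi", "high"): 270,
    ("gaudi2", "extreme"): 750,
    ("gaudi2", "high"): 880,
}


def f2_passed(device, level, outputs, reference, executions, seconds):
    """Return True if every output is bit-exact and throughput meets the threshold."""
    # Criterion 1: each topology launch must match the reference exactly.
    bit_exact = all(out == reference for out in outputs)
    # Criterion 2: measured throughput must not fall below the threshold.
    fps = executions / seconds
    return bit_exact and fps >= MIN_FPS[(device, level)]
```

A single mismatching output fails the run regardless of throughput, and a bit-exact run still fails if its throughput is below the threshold for the selected device and power level.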

Functional Test 2 Plugin Switches and Parameters

hl_qual -gaudi|-gaudi2 -c <pci bus id> [-t <time in seconds>]  -rmod <serial | parallel>  [-dis_mon] [-mon_cfg <monitor INI path>]
      -f2 -l <extreme | high> [-d] [-dis_val] [-serdes]

Switches and Parameters - Description

  • -f2 - Functional test 2 plugin selector.

  • -d - Download once option. The input tensors are downloaded to the device at the beginning of the test and reused for all test iterations. This switch is useful when the user suspects that functional test performance degradation is due to low PCI bandwidth.

  • -dis_val - Disables output tensor validation. The test will not fail on the bit-exact check, but may still fail on low FPS. When using this switch, the test performance is higher because the data is not uploaded to the host for verification.

  • -serdes - Enables running an allreduce collective operation to test the NIC in parallel to the regular functional test.

  • -l <extreme | high> - Power level selector.

    • First-gen Gaudi power levels:

      • extreme - 345-355 [w] measured on HL-205

      • high - 200-230 [w] measured on HL-205

    • Gaudi2 power levels for 54V power supply:

      • extreme - 530-560 [w] measured on HL-225H

      • high - 370-420 [w] measured on HL-225H

./hl_qual -gaudi -c all -rmod parallel -f2 -d -l high
./hl_qual -gaudi2 -c all -rmod parallel -f2 -l extreme -t 450
./hl_qual -gaudi2 -c all -rmod parallel -f2 -l extreme -t 450 -serdes