Functional Test Plugins Design, Switches and Parameters

This section describes plugin-specific switches only. Common plugin switches and parameters are described in hl_qual Common Plugin Switches and Parameters.

Functional tests verify full chip functionality by running several chip hardware modules in parallel and in a synchronized manner. All the tests described in this section verify calculation accuracy alongside performance, which is measured in frames per second (FPS). The accuracy check is executed on the host and does not affect the FPS measurement of the test.

ResNet-50 Training Stress Test Plugin Design Consideration and Responsibilities

Note

Before running this plugin, set the __python_cmd environment variable by running export __python_cmd=python3.

The ResNet-50 training stress test plugin runs a functional ResNet-50 training test as a real-life training scenario. The test verifies accuracy and performance. To enable an accuracy check, you must supply a full ImageNet training dataset.

ResNet-50 Training Stress Test Plugin Testing Modes

  • The ResNet-50 training stress test plugin supports a single batch size of 256 images per batch.

  • (First-gen Gaudi only) Random Data - The test can run on random data tensors. In this mode, the accuracy check is skipped and only the achievable FPS is taken into account in the pass/fail criteria. This enables a pure FPS test that does not depend on the image augmenter. This test is limited to 1000 iterations.

  • ImageNet - When using the ImageNet dataset, the test evaluates accuracy and FPS. The FPS can be influenced by the image augmenter: AEON on first-gen Gaudi, and Intel Gaudi Media Pipe on Gaudi 3 and Gaudi 2. For more details, refer to Accuracy and FPS Evaluation.

Note

For first-gen Gaudi, running the test with the ImageNet dataset is the default mode. To run with random data, use the -rand switch.

The suggested number of training epochs should not exceed 90. The ResNet-50 training app should converge within that range. A 90-epoch run takes 20-21 hours.
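As a rough cross-check of the duration quoted above, total run time can be estimated as the number of epochs multiplied by the per-epoch runtime. The sketch below is illustrative only: the function name is hypothetical, and the 13.3-minute epoch runtime is one of the figures listed in the expected results later in this section.

```python
def estimated_run_hours(epochs: int, epoch_minutes: float) -> float:
    """Estimate total training time in hours from a per-epoch runtime."""
    if epochs < 1:
        raise ValueError("epochs must be at least 1")
    return epochs * epoch_minutes / 60.0


# 90 epochs at 13.3 minutes per epoch comes to just under 20 hours.
print(round(estimated_run_hours(90, 13.3), 2))
```

The estimate ignores initialization and any first-epoch slowdown caused by a cold file cache.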

Prerequisites

Make sure to install habanalabs-qual-workloads as shown in Driver and Software Installation.

The trainingApp test uses the AEON augmenter, which can limit the achievable FPS when running on multiple devices. Before running this testing mode with the ImageNet dataset on first-gen Gaudi, make sure you have the following:

  • 96 CPU cores per server (minimum), evenly distributed between the NUMA nodes.

  • A successful pass of the Memory Bandwidth Test and the PCI Bandwidth Test.

  • 512 GB of RAM.
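The core-count and memory prerequisites above can be checked before a run with a few lines of Python. This is a minimal sketch under stated assumptions: the function names are hypothetical, "512 GB" is treated as a minimum in decimal gigabytes, and os.cpu_count() reports logical CPUs, which may exceed the number of physical cores. It does not check NUMA distribution or the bandwidth test results.

```python
import os
from typing import Tuple

MIN_CPU_CORES = 96            # from the prerequisites above
MIN_RAM_BYTES = 512 * 10**9   # assumption: "512 GB" read as decimal gigabytes


def meets_host_prerequisites(cpu_cores: int, ram_bytes: int) -> bool:
    """Return True when both the core count and the RAM size are sufficient."""
    return cpu_cores >= MIN_CPU_CORES and ram_bytes >= MIN_RAM_BYTES


def read_host_resources() -> Tuple[int, int]:
    """Return (logical CPU count, physical RAM in bytes) on a Linux host."""
    cores = os.cpu_count() or 0
    ram = os.sysconf("SC_PHYS_PAGES") * os.sysconf("SC_PAGE_SIZE")
    return cores, ram


if __name__ == "__main__":
    cores, ram = read_host_resources()
    ok = meets_host_prerequisites(cores, ram)
    print(f"cores={cores} ram_gb={ram / 10**9:.0f} ok={ok}")
```

Because the operating system reserves part of the installed memory, a host fitted with exactly 512 GB may report slightly less; loosen the threshold if that applies to your system.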

Accuracy and FPS Evaluation

When running this test in ImageNet testing mode, the FPS results may be influenced by the different augmentation and image preprocessing mechanisms used for first-gen Gaudi, Gaudi 2, and Gaudi 3:

  • First-gen Gaudi - Uses the AEON augmenter, which utilizes the host CPU.

  • Gaudi 3 and Gaudi 2 - Use the Intel Gaudi Media Pipe for JPEG decoding and image preprocessing. The preprocessing of the ImageNet dataset is done on the decoder HW path and does not require CPU resources.

ResNet-50 Training Stress Test Plugin Configuration Files and Requirements

Before using the plugin, make sure to perform the following:

  • Download the training/validation dataset from ImageNet.

  • Download the labels file from the ImageNet Object Localization Challenge page.

  • Untar the ImageNet tar files (ILSVRC2012_img_train.tar, ILSVRC2012_img_val.tar). Use the following commands:

    • mkdir tar_folder

    • cd tar_folder

    • tar -xvf ../ILSVRC2012_img_train.tar

The above instructions assume that the ILSVRC2012_img_train.tar file resides at the same folder level as tar_folder.

  • Change your current directory to the hl_qual bin directory.

  • Run the preparation script, prepare.sh (included in the package). This script untars all the tar files extracted from ILSVRC2012_img_train.tar and generates the training list file (train_list.txt).

    ./prepare.sh -m MODE -d EXTRACTED_DIR -f LABEL_FILE [-h]
    

Parameters:

  • MODE - ‘train’ or ‘val’.

  • EXTRACTED_DIR - The path to the directory that contains the untarred files from the ILSVRC2012_img_train.tar file.

  • LABEL_FILE - The path to the LOC_synset_mapping.txt file.

The script generates and updates the following items:

  • train_list.txt - A file containing the location of all JPEG files and the label associated with each image. Each line contains one image path and one label. For Gaudi 3 and Gaudi 2, the path to train_list.txt should be specified using the -train_list switch. For example:

    @FILE   STRING
    train/n01582220/n01582220_8497.JPEG     18
    train/n01582220/n01582220_11460.JPEG    18
    train/n01582220/n01582220_20482.JPEG    18
    train/n01582220/n01582220_37512.JPEG    18
    train/n01582220/n01582220_5317.JPEG     18
    train/n01582220/n01582220_4839.JPEG     18
    
  • train folder - This folder contains 1000 sub-folders, each containing the images associated with the same class (label).

  • (For first-gen Gaudi) training256.json - The following entries are updated to include the full path to the train folder and the train_list file, as generated by the prepare.sh script:

    • "manifest_filename": "/home/labuser/builds/qual_release_build/gaudi/bin/train_list.txt",

    • "manifest_root": "/home/labuser/builds/qual_release_build/gaudi/bin",
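To sanity-check a generated train_list.txt, the format shown above (a header line starting with @, then one image path and one integer label per line) can be parsed with a few lines of Python. This is a minimal sketch; parse_train_list is a hypothetical helper and is not part of the hl_qual package.

```python
from typing import Iterable, Iterator, Tuple


def parse_train_list(lines: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Yield (relative image path, label) pairs from train_list.txt lines."""
    for line in lines:
        line = line.strip()
        if not line or line.startswith("@"):
            continue  # skip blank lines and the "@FILE STRING" header
        # Split on the last run of whitespace: path on the left, label on the right.
        path, label = line.rsplit(None, 1)
        yield path, int(label)


sample = [
    "@FILE   STRING",
    "train/n01582220/n01582220_8497.JPEG     18",
    "train/n01582220/n01582220_11460.JPEG    18",
]
entries = list(parse_train_list(sample))
print(entries)
```

For a real file, pass an open file object instead of the sample list, for example parse_train_list(open("train_list.txt")).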

Note

Reinstalling the package or relaunching the script overwrites those files. It is highly recommended to make a copy of the following files and folders so that you can restore them if they are overwritten:

  • train_list.txt

  • train folder

  • training256.json

For first-gen Gaudi, you can place the train folder and the train_list.txt file in any location within the user's file system, as long as the training256.json file under /opt/habanalabs/qual/gaudi/bin is updated accordingly.
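If the dataset is moved, the two manifest entries in training256.json can be updated with a small script rather than by hand. The sketch below is illustrative only. Only the two key names come from this section; the overall layout of the file is an assumption, so the helper searches the whole JSON tree for those keys. The helper names are hypothetical.

```python
import json
from pathlib import Path


def set_manifest_paths(node, manifest_filename: str, manifest_root: str) -> int:
    """Recursively overwrite both manifest keys; return how many values changed."""
    changed = 0
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "manifest_filename":
                node[key] = manifest_filename
                changed += 1
            elif key == "manifest_root":
                node[key] = manifest_root
                changed += 1
            else:
                changed += set_manifest_paths(value, manifest_filename, manifest_root)
    elif isinstance(node, list):
        for item in node:
            changed += set_manifest_paths(item, manifest_filename, manifest_root)
    return changed


def update_config(config_path: Path, dataset_dir: Path) -> int:
    """Point the config at dataset_dir, which holds train_list.txt and train/."""
    config = json.loads(config_path.read_text())
    changed = set_manifest_paths(
        config, str(dataset_dir / "train_list.txt"), str(dataset_dir)
    )
    config_path.write_text(json.dumps(config, indent=4))
    return changed
```

Keep a backup copy of training256.json before rewriting it, as recommended in the note above.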

ResNet-50 and Linux File Cache Considerations

Linux has a file cache that accelerates I/O operations. Once a file has been read, it is kept in the host RAM’s file cache so that subsequent reads of the same file are faster. During the first training epoch, however, the files are not yet cached and must be read from storage. As a result, when running the training plugin with the full ImageNet dataset, which is about 140 GB, first-epoch performance could be lower than expected. To address this issue, perform one of the following:

  • Run a dummy training run for a full epoch and ignore the failure. During this dummy run, the full ImageNet dataset is loaded into the Linux file cache, so the following run achieves the expected performance.

  • Use the iocache_loader application supplied in the hl_qual installation folder to load the ImageNet dataset into the Linux file cache.

Note

Rebooting the host clears the Linux file cache, so one of the above options must be repeated after a reboot to achieve the expected performance.

Running iocache_loader Application

The following is the iocache_loader command line interface. To list the applicable switches, run the command without any parameters.

$ ./iocache_loader
path to training dataset must be supplied
./iocache_loader -p <path> -t <num of threads> -e
-p <path> - path to dataset - this is mandatory
-t <num of threads> - number of threads default 20
-e enable output printouts

For example:

./iocache_loader -p /user/imagenet/train -t 40
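The general idea behind warming the file cache is to read every file of the dataset once, using several threads. The sketch below illustrates that idea only; it is not the iocache_loader implementation, and the function names are hypothetical. The default of 20 threads mirrors the default shown in the usage text above.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def _read_file(path: Path, chunk_size: int = 1 << 20) -> int:
    """Read a file to the end in 1 MiB chunks and return the bytes read."""
    total = 0
    with open(path, "rb") as handle:
        while chunk := handle.read(chunk_size):
            total += len(chunk)
    return total


def warm_file_cache(root: str, threads: int = 20) -> int:
    """Read every regular file under root; return the total bytes read."""
    files = [p for p in Path(root).rglob("*") if p.is_file()]
    with ThreadPoolExecutor(max_workers=threads) as pool:
        return sum(pool.map(_read_file, files))
```

Reading a file pulls its pages into the page cache only while enough free RAM is available, so the host memory should be larger than the dataset.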

ResNet-50 Training Stress Test - Pass/Fail Criteria

The performance and accuracy test evaluates the loss function received from the device. If the loss function shows unexpected behavior, the test fails.

  • Accuracy - The loss function must decrease monotonically, without sharp jumps during training. Such jumps can indicate that NaN values have propagated into the training process calculations.

  • Performance (FPS) - The performance [images/sec] is calculated per training epoch.
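The two criteria above can be illustrated with a short sketch. It is not the check that hl_qual performs: the function names and the max_jump_ratio tolerance are hypothetical, chosen only to show what "no NaN values and no sharp jumps" could look like in code.

```python
import math
from typing import Sequence


def loss_is_healthy(losses: Sequence[float], max_jump_ratio: float = 1.5) -> bool:
    """Return False if any loss is NaN/inf or jumps sharply upward."""
    previous = None
    for loss in losses:
        if not math.isfinite(loss):
            return False  # NaN or infinity reached the loss value
        if previous is not None and loss > previous * max_jump_ratio:
            return False  # sharp upward jump (hypothetical tolerance)
        previous = loss
    return True


def images_per_second(images_processed: int, epoch_seconds: float) -> float:
    """Per-epoch throughput: images processed divided by epoch duration."""
    return images_processed / epoch_seconds
```

A stricter variant would reject any increase at all; the tolerance here only keeps the example from flagging small fluctuations.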

The expected results per device:

  • HL-325 and HL-325L:

    • FPS: 7300 images/sec

    • Epoch runtime: 2.9 minutes

    • Expected accuracy: first epoch - ~0.15

  • HL-338 PCIe card:

    • FPS: 3850 images/sec

    • Epoch runtime: 5.5 minutes

    • Expected accuracy: first epoch - ~0.15

  • HL-328 PCIe card:

    • FPS: 2850 images/sec

    • Epoch runtime: 7.5 minutes

    • Expected accuracy: first epoch - ~0.15

Note

The above results can only be achieved if you run the test on a bare metal machine and use a dataset saved locally or on an SSD.

  • HL-225H:

    • FPS: 5900 images/sec

    • Epoch runtime: 3.6 minutes

  • HL-225C:

    • FPS: 5150 images/sec

    • Epoch runtime: 3.9 minutes

  • HL-225D:

    • FPS: 3058 images/sec

    • Epoch runtime: 6.0 minutes

  • Expected accuracy: first epoch - ~0.15

Note

The above results can only be achieved if you run the test on a bare metal machine and use a dataset saved locally or on an SSD.

  • FPS: 1580 images/sec

  • Epoch runtime: 13.3 minutes

  • Expected accuracy: first epoch - 0.12

ResNet-50 Training Stress Test Plugin Switches and Parameters

The following lists the training test plugin switches and parameters:

hl_qual -gaudi3 | -gaudi2 -c <pci bus id> -rmod <serial | parallel> [-dis_mon] [-mon_cfg <monitor INI path>]
         -trainingApp [-epoch <number of epochs>] [-n <number of iterations>] [-train_list <path to train_list.txt>]

Switches and Parameters

Description

-trainingApp

Training test plugin selector.

-epoch

Defines epoch count.

-n

Defines iteration count. You must also provide the number of epochs using the -epoch switch. For example, -epoch 1 -n 1000.

-train_list

Specifies path to the train_list.txt file.

-log

Writes statistics to file.

./hl_qual -gaudi2 -c all -rmod parallel -trainingApp -epoch 3 -train_list <path to train_list.txt>
./hl_qual -gaudi3 -c all -rmod parallel -trainingApp -epoch 3 -train_list <path to train_list.txt>
hl_qual -gaudi -c <pci bus id> -rmod <serial | parallel> [-dis_mon] [-mon_cfg <monitor INI path>]
        -trainingApp [-epoch <number of epochs>] [-n <number of iterations>] [-rand]

Switches and Parameters

Description

-trainingApp

Training test plugin selector.

-epoch

Defines epoch count.

-n

Defines iteration count. You must also provide the number of epochs using the -epoch switch. For example, -epoch 1 -n 1000.

-rand

Random input generation. This mode disables accuracy and loss validation. Only FPS is calculated. The test uses 1500 iterations preset.

-log

Writes statistics to file.

./hl_qual -gaudi -c all -rmod parallel -trainingApp -rand
./hl_qual -gaudi -c all -rmod parallel -trainingApp -epoch 3
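When launching the training test from automation, the switch rules described above (-n requires -epoch, -rand applies to first-gen Gaudi, -train_list applies to Gaudi 2 and Gaudi 3) can be enforced before hl_qual is invoked. The helper below is a hypothetical sketch that only assembles the argument list; it uses the switches documented in this section and fixes -c all -rmod parallel as in the examples.

```python
from typing import List, Optional


def build_training_command(
    device: str,
    epochs: Optional[int] = None,
    iterations: Optional[int] = None,
    train_list: Optional[str] = None,
    random_data: bool = False,
) -> List[str]:
    """Assemble an hl_qual -trainingApp argument list from documented switches."""
    if device not in ("gaudi", "gaudi2", "gaudi3"):
        raise ValueError(f"unknown device switch: {device}")
    if iterations is not None and epochs is None:
        raise ValueError("-n requires -epoch")
    if random_data and device != "gaudi":
        raise ValueError("-rand is listed for first-gen Gaudi only")
    if train_list is not None and device == "gaudi":
        raise ValueError("-train_list is listed for Gaudi 2 and Gaudi 3 only")

    cmd = ["./hl_qual", f"-{device}", "-c", "all", "-rmod", "parallel", "-trainingApp"]
    if epochs is not None:
        cmd += ["-epoch", str(epochs)]
    if iterations is not None:
        cmd += ["-n", str(iterations)]
    if train_list is not None:
        cmd += ["-train_list", train_list]
    if random_data:
        cmd.append("-rand")
    return cmd
```

The returned list can be passed directly to subprocess.run from the hl_qual bin directory.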

Functional Test 2 Plugin Design Consideration and Responsibilities

Functional test 2 runs all hardware components on the Gaudi SoCs simultaneously in parallel mode to test their functionality and the interaction between the different units. Functional test 2 uses a synthetic topology based on multiple operations. In these operations, all computational units and memory space are utilized while maintaining high power usage. The test can run for long periods (more than two hours), testing the following device functionalities:

  • Thermal stress test, cooling system functionality, heat dissipation and thermal protection mechanisms. These are tested while running the power stress plugin under extreme load.

  • PID and clock relaxation mechanisms verification.

  • Long work periods in typical power levels.

  • Full bit-exact calculation.

Tested units:

  • PCI links

  • DMA engines – moving data between:

    • PCI ==> HBM, HBM ==> PCI

    • HBM ==> SRAM, SRAM ==> HBM

  • MME engines

  • TPC engines

  • Serdes connectivity - only when using -serdes switch

Prerequisites

To run this test on Gaudi 3, 1.5TB-2TB of free memory is required.

Functional Test 2 Synthetic Topology

The functional test 2 plugin builds a test topology including large tensors and multiple operators (Conv, Batchnorm). The test application runs the topology by injecting pre-calculated inputs and verifies the output against a pre-calculated reference for each topology execution to ensure bit-exact results. When applying the -serdes switch, the topology includes full Serdes transmit/receive verification in addition to the basic topology.

Functional Test 2 Testing Modes

The test verifies the calculation result of each topology execution while sustaining high power usage. All execution steps are verified for this purpose. The test also checks execution throughput [executions/second], reported as measured frames per second (FPS), against the performance thresholds. The test supports the following sub-test modes:

  • HL-325 and HL-325L:

    • Extreme - Measured power level for 54V power supply: 860 [watts]

    • High – Measured power level for 54V power supply: 790 [watts]

  • HL-338 PCIe card:

    • Extreme - Measured power level for 54V power supply: 600 [watts]

    • High – Measured power level for 54V power supply: 600 [watts]

  • HL-328 PCIe card:

    • Extreme - Measured power level for 54V power supply: 635-645 [watts]

    • High – Measured power level for 54V power supply: 515-525 [watts]

  • HL-225H and HL-225C:

    • Extreme - Measured power level for 54V power supply: 530-560 [watts]

    • High - Measured power level for 54V power supply: 450-490 [watts]

  • HL-225D:

    • Extreme - Measured power level for 54V power supply: 360 [watts]

    • High - Measured power level for 54V power supply: 270 [watts]

  • Extreme - Measured power level: 345-355 [watts]

  • High - Measured power level: 230-240 [watts]

The measurements above were recorded during a four-minute run. They can vary with the ambient conditions of the system (fan speed, server box configuration, and ambient temperature).

Note

The initialization stage can take up to 170 seconds. This time is needed to calculate and generate the expected reference output tensors, compile the test topology, and calibrate the test runtime execution. When using the -t switch, the initialization time is not included in the test duration that you set.
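Because the initialization time is not counted in the -t value, the wall-clock time of a run is roughly the initialization time plus the requested duration. A minimal sketch of that arithmetic follows; the function name is hypothetical and teardown time is ignored.

```python
MAX_INIT_SECONDS = 170  # upper bound on initialization quoted in the note above


def approx_wall_clock_seconds(test_seconds: int) -> int:
    """Approximate worst-case wall-clock time for a run started with -t."""
    return MAX_INIT_SECONDS + test_seconds


# A run started with -t 450 may take about 620 seconds end to end.
print(approx_wall_clock_seconds(450))
```

This is useful when setting a timeout in a script that wraps hl_qual.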

Functional Test - Pass/Fail Criteria

The pass/fail criteria are as follows:

  • The calculated value of each topology launch must be identical to a pre-calculated reference.

  • The execution throughput [executions/second] must not fall below a predefined threshold:

  • HL-325 and HL-325L - Regular functional test:

    • Extreme - FPS 751 [Frame/Sec], measured on HLS3 server

    • High - FPS 1275 [Frame/Sec], measured on HLS3 server

  • HL-325 and HL-325L - Regular functional test + Serdes test:

    • Extreme - FPS 1458 [Frame/Sec], measured on HLS3 server

    • High - FPS 2450 [Frame/Sec], measured on HLS3 server

  • HL-338 PCIe card - Regular functional test:

    • Extreme - FPS 391 [Frame/Sec], measured on HLS3 server

    • High - FPS 670 [Frame/Sec], measured on HLS3 server

  • HL-338 PCIe card - Regular functional test + Serdes test:

    • Extreme - FPS 734 [Frame/Sec], measured on HLS3 server

    • High - FPS 1242 [Frame/Sec], measured on HLS3 server

  • HL-328 PCIe card - Regular functional test:

    • Extreme - FPS 550 [Frame/Sec], measured on HLS3 server

    • High - FPS 1104 [Frame/Sec], measured on HLS3 server

  • HL-328 PCIe card - Regular functional test + Serdes test:

    • Extreme - FPS 550 [Frame/Sec], measured on HLS3 server

    • High - FPS 1104 [Frame/Sec], measured on HLS3 server

  • HL-225H - Regular functional test:

    • Extreme - FPS 621 [Frame/Sec], measured on HLS2 server

    • High - FPS 729 [Frame/Sec], measured on HLS2 server

  • HL-225H - Regular functional test + Serdes test:

    • Extreme - FPS 1233 [Frame/Sec], measured on HLS2 server

    • High - FPS 1450 [Frame/Sec], measured on HLS2 server

  • HL-225C - Regular functional test:

    • Extreme - FPS 621 [Frame/Sec], measured on HLS2 server

    • High - FPS 657 [Frame/Sec], measured on HLS2 server

  • HL-225C - Regular functional test + Serdes test:

    • Extreme - FPS 1200 [Frame/Sec], measured on HLS2 server

    • High - FPS 1296 [Frame/Sec], measured on HLS2 server

  • HL-225D - Regular functional test:

    • Extreme - FPS 585 [Frame/Sec], measured on HLS2 server

    • High - FPS 585 [Frame/Sec], measured on HLS2 server

  • HL-225D - Regular functional test + Serdes test:

    • Extreme - FPS 1151 [Frame/Sec], measured on HLS2 server

    • High - FPS 1151 [Frame/Sec], measured on HLS2 server

  • Extreme - FPS 260 [Frame/Sec], measured on HLS1 server

  • High - FPS 270 [Frame/Sec], measured on HLS1 server

The measurements above were recorded during a four-minute run.
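The two criteria combine into a simple rule: a run passes only if every output is bit-exact and the measured FPS does not fall below the threshold for the device and power level. The sketch below illustrates that rule with a few of the regular functional test thresholds copied from the list above; the function and table names are hypothetical, and this is not the check implemented inside hl_qual.

```python
# Minimum FPS for the regular functional test (no Serdes), copied from the list above.
FPS_THRESHOLDS = {
    ("HL-325", "extreme"): 751,
    ("HL-325", "high"): 1275,
    ("HL-225H", "extreme"): 621,
    ("HL-225H", "high"): 729,
}


def f2_passes(device: str, level: str, measured_fps: float, bit_exact: bool) -> bool:
    """Apply both criteria: bit-exact output and FPS at or above the threshold."""
    threshold = FPS_THRESHOLDS[(device, level.lower())]
    return bit_exact and measured_fps >= threshold


print(f2_passes("HL-325", "extreme", 760.0, bit_exact=True))
```

Extend the table with the Serdes variants or other devices as needed; an unknown device or level raises KeyError rather than passing silently.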

Functional Test 2 Plugin Switches and Parameters

hl_qual -gaudi3 -c <pci bus id> [-t <time in seconds>] -rmod <serial | parallel> [-dis_mon] [-mon_cfg <monitor INI path>]
      -f2 -l <extreme | high> [-d] [-dis_val] [-serdes <int | ext>]

Switches and Parameters

Description

-t

Test duration in seconds.

-f2

Functional test 2 plugin selector.

-d

Download once option. The input tensors are downloaded at the beginning of the test and reused for each test run. This switch is useful when you suspect that functional test performance degradation is due to low PCI bandwidth.

-dis_val

Disables output tensor validation. The test does not fail on the bit-exact check, but it may still fail on low FPS. With this switch, test performance is higher because the output data is not uploaded to the host for verification.

-serdes <int | ext>

  • int - Runs allreduce collective operation to test Serdes connectivity along with the other components tested using the Functional 2 test.

  • ext - Runs loopback test on the external ports. The loopback dongles must be fitted on the external ports.

If the value is not specified, the default value is int.

-l <extreme | high>

Power level selector for 54V power supply. For the power levels, refer to Functional Test 2 Testing Modes.

If the value is not specified, the default value is High.

./hl_qual -gaudi3 -c all -rmod parallel -f2 -l extreme -t 450
./hl_qual -gaudi3 -c all -rmod parallel -f2 -l extreme -t 450 -serdes
./hl_qual -gaudi3 -c all -rmod parallel -f2 -l extreme -t 450 -serdes int
hl_qual -gaudi2 -c <pci bus id> [-t <time in seconds>] -rmod <serial | parallel> [-dis_mon] [-mon_cfg <monitor INI path>]
      -f2 -l <extreme | high> [-d] [-dis_val] [-serdes <int | ext>] [-enable_ports_check <all | int>]

Switches and Parameters

Description

-t

Test duration in seconds.

-f2

Functional test 2 plugin selector.

-d

Download once option. The input tensors are downloaded at the beginning of the test and reused for each test run. This switch is useful when you suspect that functional test performance degradation is due to low PCI bandwidth.

-dis_val

Disables output tensor validation. The test does not fail on the bit-exact check, but it may still fail on low FPS. With this switch, test performance is higher because the output data is not uploaded to the host for verification.

-serdes <int | ext>

  • int - Runs allreduce collective operation to test Serdes connectivity along with the other components tested using the Functional 2 test.

  • ext - Runs loopback test on the external ports. The loopback dongles must be fitted on the external ports.

If the value is not specified, the default value is int.

-l <extreme | high>

Power level selector. For the power levels, refer to Functional Test 2 Testing Modes.

If the value is not specified, the default value is High.

-enable_ports_check <all | int>

Checks whether the ports are UP or DOWN. If the ports are DOWN, the test fails:

  • all - Checks all the external and internal ports.

  • int - Checks the internal ports only.

./hl_qual -gaudi2 -c all -rmod parallel -f2 -l extreme -t 450
./hl_qual -gaudi2 -c all -rmod parallel -f2 -l extreme -t 450 -serdes
./hl_qual -gaudi2 -c all -rmod parallel -f2 -l extreme -t 450 -serdes int
hl_qual -gaudi -c <pci bus id> [-t <time in seconds>] -rmod <serial | parallel> [-dis_mon] [-mon_cfg <monitor INI path>]
      -f2 -l <extreme | high> [-d] [-dis_val] [-serdes] [-enable_ports_check <all | int>]

Switches and Parameters

Description

-t

Test duration in seconds.

-f2

Functional test 2 plugin selector.

-d

Download once option. The input tensors are downloaded at the beginning of the test and reused for each test run. This switch is useful when you suspect that functional test performance degradation is due to low PCI bandwidth.

-dis_val

Disables output tensor validation. The test does not fail on the bit-exact check, but it may still fail on low FPS. With this switch, test performance is higher because the output data is not uploaded to the host for verification.

-serdes

Runs allreduce collective operation to test Serdes connectivity along with the other components tested using the functional 2 test.

-l <extreme | high>

Power level selector. For the power levels, refer to Functional Test 2 Testing Modes.

If the value is not specified, the default value is High.

-enable_ports_check <all | int>

Checks whether the ports are UP or DOWN. If the ports are DOWN, the test fails:

  • all - Checks all the external and internal ports.

  • int - Checks the internal ports only.

./hl_qual -gaudi -c all -rmod parallel -f2 -l extreme -t 450
./hl_qual -gaudi -c all -rmod parallel -f2 -l high -t 450 -serdes