Functional Test Plugins Design, Switches and Parameters¶

This section describes plugin-specific switches only. Common plugin switches and parameters are described in hl_qual Common Plugin Switches and Parameters.

Functional tests verify the full chip functionality while running several chip hardware modules in parallel and in a synchronized manner. All the tests described in this section verify calculation accuracy alongside performance, which is measured in frames per second (FPS). Executing the accuracy check on the host does not affect the FPS measurement of the test.

ResNet-50 Training Stress Test Plugin Design Consideration and Responsibilities¶

Note

Before running this plugin, make sure to set the __python_cmd environment variable: export __python_cmd=python3.

The ResNet-50 training stress test plugin runs a functional ResNet-50 training test as a real-life training scenario. The test verifies accuracy and performance. To enable an accuracy check, you must supply a full ImageNet training dataset.

ResNet-50 Training Stress Test Plugin Testing Modes¶

The ResNet-50 training stress test plugin supports a single batch size of 256, that is, 256 images in each batch. The test can run on the following input data:
(First-gen Gaudi only) Random Data - The test can run on random data tensors. When using this mode, the accuracy check is skipped and only the achievable FPS is taken into account in the pass/fail criteria. This enables a pure FPS test that does not depend on the image augmenter. This test is limited to 1000 iterations.
ImageNet - When using the ImageNet dataset, the test evaluates accuracy and FPS. The FPS can be influenced by the image augmenter: AEON on first-gen Gaudi, and Intel Gaudi Media Pipe on Gaudi 3 and Gaudi 2. For more details, refer to Accuracy and FPS Evaluation.

Note

For first-gen Gaudi, running the test with the ImageNet dataset is the default mode. To run with random data, use the -rand switch.

The suggested number of training epochs should not exceed 90. The ResNet-50 training app should converge within that range. A 90-epoch run takes 20-21 hours.

Prerequisites¶

Gaudi 3 and Gaudi 2

Make sure to install habanalabs-qual-workloads as shown in the Driver and Software Installation.

First-gen Gaudi

The trainingApp test uses the AEON augmenter which could be a limiting factor on the achievable FPS results when running on multiple devices. Before running this testing mode with ImageNet dataset on first-gen Gaudi, make sure you have the following:

96 CPU cores per server (minimum), evenly distributed between the NUMA nodes.

A successful pass of the Memory Bandwidth Test and the PCI Bandwidth Test.

512 GB of RAM.
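
The core and memory prerequisites above can be checked before a run. The following is a minimal sketch, not part of hl_qual: the function names and the 5% RAM tolerance are assumptions (the kernel usually reports slightly less than the installed memory), and the sketch does not check the NUMA distribution or the bandwidth test results.

```python
MIN_CPU_CORES = 96    # minimum core count from the prerequisites above
MIN_RAM_GB = 512      # RAM size from the prerequisites above
RAM_TOLERANCE = 0.05  # assumption: allow for memory reserved by the kernel


def mem_total_gb(meminfo_text):
    """Return the MemTotal value of /proc/meminfo-style text, in GiB."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemTotal:"):
            kib = int(line.split()[1])  # the value is reported in kB (KiB)
            return kib / (1024 * 1024)
    raise ValueError("MemTotal not found")


def check_host(cpu_cores, ram_gb):
    """Return a list of problems; an empty list means both checks passed."""
    problems = []
    if cpu_cores < MIN_CPU_CORES:
        problems.append(f"{cpu_cores} CPU cores found, {MIN_CPU_CORES} required")
    if ram_gb < MIN_RAM_GB * (1 - RAM_TOLERANCE):
        problems.append(f"{ram_gb:.0f} GB of RAM found, {MIN_RAM_GB} GB required")
    return problems
```

On a Linux host, the inputs could come from os.cpu_count() and from the content of /proc/meminfo.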

Accuracy and FPS Evaluation¶

When running this test in ImageNet testing mode, the FPS results may be influenced by the different augmentation and image preprocessing mechanisms used on each device generation:

First-gen Gaudi - Uses the AEON augmenter, which utilizes the host CPU.

Gaudi 3 and Gaudi 2 - Use the Intel Gaudi Media Pipe for JPEG decoding and image preprocessing. The preprocessing of the ImageNet dataset is done on the decoder HW path and does not require CPU resources.

ResNet-50 Training Stress Test Plugin Configuration Files and Requirements¶

Before using the plugin, make sure to perform the following:

Download the training/validation dataset from ImageNet.
Download the labels file from ImageNet Object Localization Challenge page.
Untar the ImageNet tar files (ILSVRC2012_img_train.tar, ILSVRC2012_img_val.tar). Use the following commands:
- mkdir tar_folder
- cd tar_folder
- tar -xvf ../ILSVRC2012_img_train.tar

The above instructions assume that the ILSVRC2012_img_train.tar file resides at the same folder level as tar_folder.

Change your current directory to hl_qual bin directory.
Run the preparation script - prepare.sh (included in the package). This script untars all the tar files (ILSVRC2012_img_train.tar file), and generates the training list file (train_list.txt).
./prepare.sh -m MODE -d EXTRACTED_DIR -f LABEL_FILE [-h]

Parameters:

MODE - ‘train’ or ‘val’.
EXTRACTED_DIR - The path to the directory that contains the untarred files from the ILSVRC2012_img_train.tar file.
LABEL_FILE - The path to LOC_synset_mapping.txt file.

The script generates and updates the following items:

train_list.txt - A file containing the location of all JPEG files and the label associated with each image. Each line contains one image path and one label. For Gaudi 3 and Gaudi 2, specify the path to train_list.txt using the -train_list switch. For example:

@FILE   STRING
train/n01582220/n01582220_8497.JPEG     18
train/n01582220/n01582220_11460.JPEG    18
train/n01582220/n01582220_20482.JPEG    18
train/n01582220/n01582220_37512.JPEG    18
train/n01582220/n01582220_5317.JPEG     18
train/n01582220/n01582220_4839.JPEG     18

train folder - This folder contains 1000 sub-folders each containing the images associated with the same class (label).
(For first-gen Gaudi) training256.json - The entries below are updated to include the full path to the train folder and the train_list file as generated by the prepare.sh script.
- "manifest_filename": "/home/labuser/builds/qual_release_build/gaudi/bin/train_list.txt",
- "manifest_root": "/home/labuser/builds/qual_release_build/gaudi/bin",
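
The generated train_list.txt can be sanity-checked before a long run. The sketch below is an illustration only: it assumes, based on the sample above, that the header line starts with "@" and that the path and the label are separated by whitespace. The function names are hypothetical and not part of the package.

```python
from collections import Counter


def parse_train_list(text):
    """Parse train_list.txt content into (relative_path, label) pairs."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and the "@FILE   STRING" header line.
        if not line or line.startswith("@"):
            continue
        # Split on the last run of whitespace: path first, label last.
        path, label = line.rsplit(None, 1)
        entries.append((path, int(label)))
    return entries


def images_per_label(entries):
    """Count how many images are listed for each label."""
    return Counter(label for _, label in entries)
```

For a complete ImageNet training list, the number of distinct labels returned by images_per_label should match the 1000 class sub-folders of the train folder.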

Note

Reinstalling the package or relaunching the script overwrites those files. It is highly recommended to make a copy of the following files/folders so you can restore them in case they are overwritten:

train_list.txt
train folder
training256.json

For first-gen Gaudi, you can place the train folder and the train_list.txt file in any location within the user’s file system, as long as the training256.json file located under /opt/habanalabs/qual/gaudi/bin is updated accordingly.
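
Updating training256.json after relocating the dataset can be scripted. The sketch below is a hedged illustration: only the two key names, manifest_filename and manifest_root, come from this section. The overall layout of the file is not shown here, so the hypothetical helper walks the whole JSON structure and updates the two keys wherever they appear.

```python
import json


def repoint_manifest(config, manifest_filename, manifest_root):
    """Set every "manifest_filename" and "manifest_root" key found in config."""
    if isinstance(config, dict):
        for key, value in config.items():
            if key == "manifest_filename":
                config[key] = manifest_filename
            elif key == "manifest_root":
                config[key] = manifest_root
            else:
                # Descend into nested objects and arrays.
                repoint_manifest(value, manifest_filename, manifest_root)
    elif isinstance(config, list):
        for item in config:
            repoint_manifest(item, manifest_filename, manifest_root)
    return config


def update_config_file(json_path, manifest_filename, manifest_root):
    """Load the JSON file, update the manifest keys, and write it back."""
    with open(json_path) as f:
        config = json.load(f)
    repoint_manifest(config, manifest_filename, manifest_root)
    with open(json_path, "w") as f:
        json.dump(config, f, indent=4)
```

Keep a backup copy of training256.json before rewriting it, as recommended in the note above.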

ResNet-50 and Linux File Cache Considerations¶

Linux has a file cache that accelerates I/O operations. Once a file has been read, it is kept in the host RAM’s file cache so that subsequent reads of the same file are faster. As a result, when running the training plugin with the full ImageNet dataset, which is 140 GB in size, the first training epoch may perform below expectations because the files are not yet cached. To address this issue, perform one of the following:

Run a dummy training run for a full epoch and ignore the failure. During this dummy run, the full ImageNet dataset is loaded into the Linux cache, so that the following run achieves the expected performance.
Use the iocache_loader application supplied in the hl_qual installation folder to load the ImageNet dataset into the Linux cache.

Note

Rebooting the host clears the Linux file cache. After a reboot, repeat one of the above options to achieve the expected performance.
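
The cache warm-up idea can be illustrated in a few lines: reading every file once is enough for Linux to keep its pages in the file cache, provided the host has enough free RAM. This is a minimal sketch of the technique, not the implementation of iocache_loader. The function names are hypothetical, and the default of 20 threads simply mirrors the default documented for the tool.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def read_file(path, chunk_size=1 << 20):
    """Read a file to the end, discard the data, and return the bytes read."""
    total = 0
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            total += len(chunk)
    return total


def warm_cache(root, threads=20):
    """Read every file under root; return (file_count, total_bytes)."""
    paths = [
        os.path.join(dirpath, name)
        for dirpath, _, names in os.walk(root)
        for name in names
    ]
    # Threads are sufficient here because the work is I/O bound.
    with ThreadPoolExecutor(max_workers=threads) as pool:
        sizes = list(pool.map(read_file, paths))
    return len(sizes), sum(sizes)
```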

Running iocache_loader Application¶

The following is the iocache_loader command line interface. To list the applicable switches, run the command without any parameters.

$ ./iocache_loader
path to training dataset must be supplied
./iocache_loader -p <path> -t <num of threads> -e
-p <path> - path to dataset - this is mandatory
-t <num of threads> - number of threads default 20
-e enable output printouts

For example:

./iocache_loader -p /user/imagenet/train -t 40

ResNet-50 Training Stress Test - Pass/Fail Criteria¶

The performance and accuracy test evaluates the loss function received from the device. If the loss function shows an unexpected behavior, the test fails.

Accuracy - the loss function must decrease monotonically, without sharp jumps between training tests. Sharp jumps indicate that NaN values have been propagated into the training process calculations.
Performance (FPS) - the performance [images/sec] is calculated per training epoch.
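
The accuracy criterion can be pictured as a scan over the reported loss values. The sketch below is an assumption-laden illustration, not the plugin's actual check: the function name and the max_jump tolerance are hypothetical, and the real thresholds are not documented in this section. Setting max_jump to 0.0 enforces a strictly non-increasing loss.

```python
import math


def check_loss(losses, max_jump=0.5):
    """Return (ok, reason) for a sequence of loss values.

    max_jump is a hypothetical tolerance for how much the loss may
    rise from one value to the next before it counts as a sharp jump.
    """
    previous = None
    for step, loss in enumerate(losses):
        # NaN or infinite values mean the calculation has broken down.
        if math.isnan(loss) or math.isinf(loss):
            return False, f"non-finite loss at step {step}"
        if previous is not None and loss - previous > max_jump:
            return False, f"loss jumped from {previous} to {loss} at step {step}"
        previous = loss
    return True, "ok"
```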

The expected results per device:

Gaudi 3

HL-325 and HL-325L:
- FPS: 7000 images/sec
- Epoch runtime: 2.9 minutes
- Expected accuracy: first epoch - ~0.15
HL-338 PCIe card:
- FPS: 3850 images/sec
- Epoch runtime: 5.5 minutes
- Expected accuracy: first epoch - ~0.15
HL-328 PCIe card:
- FPS: 2850 images/sec
- Epoch runtime: 7.5 minutes
- Expected accuracy: first epoch - ~0.15

Note

The above results can only be achieved if you run the test on a bare metal machine and use a dataset saved locally or on an SSD.

Gaudi 2

HL-225H:
- FPS: 5900 images/sec
- Epoch runtime: 3.6 minutes
- Expected accuracy: first epoch - ~0.15
HL-225C:
- FPS: 5150 images/sec
- Epoch runtime: 3.9 minutes
- Expected accuracy: first epoch - ~0.15
HL-225D:
- FPS: 4474 images/sec
- Epoch runtime: 4.7 minutes
- Expected accuracy: first epoch - ~0.16
HL-288 PCIe card:
- FPS: 4514 images/sec
- Epoch runtime: 4.7 minutes
- Expected accuracy: first epoch - ~0.16

Note

The above results can only be achieved if you run the test on a bare metal machine and use a dataset saved locally or on an SSD.

First-gen Gaudi

FPS: 1580 images/sec
Epoch runtime: 13.3 minutes
Expected accuracy: first epoch - 0.12
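
The FPS and epoch runtime figures above are related by simple arithmetic: the epoch runtime is the number of training images divided by the FPS. The sketch below assumes the ILSVRC2012 training set size of 1,281,167 images, which is not stated in this section. The listed runtimes are measured values, so they can differ from this estimate by a few tenths of a minute.

```python
# Assumption: size of the ILSVRC2012 training set, not stated in this section.
IMAGENET_TRAIN_IMAGES = 1_281_167


def epoch_minutes(fps, images=IMAGENET_TRAIN_IMAGES):
    """Estimated runtime of one training epoch, in minutes."""
    return images / fps / 60.0


def total_hours(fps, epochs, images=IMAGENET_TRAIN_IMAGES):
    """Estimated runtime of a multi-epoch training run, in hours."""
    return epoch_minutes(fps, images) * epochs / 60.0
```

For example, round(epoch_minutes(5900), 1) gives 3.6, matching the HL-225H entry, and total_hours(1580, 90) gives roughly 20.3, in line with the 20-21 hours quoted for a 90-epoch run on first-gen Gaudi.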

ResNet-50 Training Stress Test Plugin Switches and Parameters¶

The following lists the training test plugin switches and parameters:

Gaudi 3 and Gaudi 2

hl_qual -gaudi3 | -gaudi2 -c <pci bus id> -rmod <serial | parallel> [-dis_mon] [-mon_cfg <monitor INI path>]
         -trainingApp [-epoch <number of epochs>] [-train_list <path to train_list.txt>]

Switches and Parameters	Description
`-trainingApp`	Training test plugin selector.
`-epoch`	Defines epoch count.
`-train_list`	Specifies path to the `train_list.txt` file.
`-log`	Writes statistics to file.

./hl_qual -gaudi2 -c all -rmod parallel -trainingApp -epoch 3 -train_list <path to train_list.txt>
./hl_qual -gaudi3 -c all -rmod parallel -trainingApp -epoch 3 -train_list <path to train_list.txt>

First-gen Gaudi

hl_qual -gaudi -c <pci bus id> -rmod <serial | parallel> [-dis_mon] [-mon_cfg <monitor INI path>]
        -trainingApp [-epoch <number of epochs>] [-n <number of iterations>] [-rand]

Switches and Parameters	Description
`-trainingApp`	Training test plugin selector.
`-epoch`	Defines epoch count.
`-n`	Defines iteration count. You must also provide the number of epochs using the `-epoch` switch. For example, `-epoch 1 -n 1000`.
`-rand`	Random input generation. This mode disables accuracy and loss validation. Only FPS is calculated. The test uses a preset of 1500 iterations.
`-log`	Writes statistics to file.

./hl_qual -gaudi -c all -rmod parallel -trainingApp -rand
./hl_qual -gaudi -c all -rmod parallel -trainingApp -epoch 3

Functional Test 2 Plugin Design Consideration and Responsibilities¶

Functional test 2 runs all hardware components on the Gaudi SOCs simultaneously in parallel mode to test their functionality and interaction between the different units. The functional test 2 uses a synthetic topology based on multiple operations. In these operations, all computational units and memory space are utilized while maintaining high power usage. The test can run for long hours (more than two) testing the following device functionalities:

Thermal stress test, cooling system functionality, temperature dissipation and thermal protection mechanisms. These are tested while running the power stress plugin under extreme load.
PID and clock relaxation mechanisms verification.
Long work periods in typical power levels.
Full bit-exact calculation.

Tested units:

PCI links
DMA engines – moving data between:
- PCI ==> HBM, HBM ==> PCI
- HBM ==> SRAM, SRAM ==> HBM
MME engines
TPC engines
Serdes connectivity - only when using -serdes switch

Note

The test initialization stage can take up to 170 seconds. This time is needed to recalculate and generate the reference expected output tensors, compile the test topology, and calibrate the test runtime execution. When using the -t switch, the initialization time is not included in the test running duration.

Prerequisites¶

To run this test on Gaudi 3, 1.5TB-2TB of free memory is required.

Functional Test 2 Synthetic Topology¶

The functional test 2 plugin builds a test topology including large tensors and multiple operators (Conv, Batchnorm). The test application runs the topology by injecting pre-calculated inputs and verifies the output against a pre-calculated reference for each topology execution to ensure bit-exact results. When applying the -serdes switch, the topology includes full Serdes transmit/receive verification in addition to the basic topology.
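
Bit-exact verification differs from an ordinary numeric comparison: the raw bit patterns must match, so even values that compare as equal, such as 0.0 and -0.0, count as a mismatch. The sketch below illustrates the idea only. The data type and the comparison code used by the plugin are not documented here, so the 32-bit float encoding and the function names are assumptions.

```python
import struct


def float32_bits(values):
    """Encode values as little-endian IEEE-754 float32 and return the bytes."""
    return struct.pack(f"<{len(values)}f", *values)


def bit_exact(output, reference):
    """Return True only if both sequences have identical float32 bit patterns."""
    if len(output) != len(reference):
        return False
    return float32_bits(output) == float32_bits(reference)
```

A tolerance-based comparison would accept small rounding differences. A bit-exact check accepts none, which makes it suitable for detecting hardware calculation faults.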

Functional Test 2 Expected Maximum Power Levels¶

The test verifies the calculation result on each topology execution while enabling high power usage. All execution steps are verified for this purpose. The test also verifies execution throughput [executions/second] against the performance metrics, which are measured in frames per second (FPS). The following lists the expected maximum power levels:

Gaudi 3

HL-325 and HL-325L - Measured power level for 54V power supply: 820-835 [watts]
HL-338 PCIe card - Measured power level for 54V power supply: 600 [watts]
HL-328 PCIe card - Measured power level for 54V power supply: 515-525 [watts]

Gaudi 2

HL-225H and HL-225C:
- Extreme - Measured power level: 530-560 [watts]
- High - Measured power level: 450-490 [watts]
HL-225D:
- Extreme - Measured power level: 410 [watts]
- High - Measured power level: 310 [watts]
HL-288 PCIe card:
- Extreme - Measured power level: 390 [watts]
- High - Measured power level: 310 [watts]

First-gen Gaudi

Extreme - Measured power level: 345-355 [watts]
High - Measured power level: 230-240 [watts]

The measurements above are recorded from a four-minute run. They can change depending on the ambient conditions of the system (fan speed, server box configuration and ambient temperatures).

Functional Test - Pass/Fail Criteria¶

The pass/fail criteria are composed of the following:

The calculated value of each topology launch must be identical to a pre-calculated reference.
The execution throughput [executions/second] must not fall below a predefined threshold:

Gaudi 3

HL-325 and HL-325L - Regular functional test:
- FPS 1235 [Frame/Sec], measured on HLS3 server
HL-325 and HL-325L - Regular functional test + Serdes test:
- FPS 2372 [Frame/Sec], measured on HLS3 server
HL-338 PCIe card - Regular functional test:
- FPS 770-795 [Frame/Sec], measured on HLS3 server
HL-338 PCIe card - Regular functional test + Serdes test:
- FPS 1242 [Frame/Sec], measured on HLS3 server
HL-328 PCIe card - Regular functional test:
- FPS 1104 [Frame/Sec], measured on HLS3 server
HL-328 PCIe card - Regular functional test + Serdes test:
- FPS 1104 [Frame/Sec], measured on HLS3 server

Gaudi 2

HL-225H - Regular functional test:
- Extreme - FPS 621 [Frame/Sec], measured on HLS2 server
- High - FPS 729 [Frame/Sec], measured on HLS2 server
HL-225H - Regular functional test + Serdes test:
- Extreme - FPS 1233 [Frame/Sec], measured on HLS2 server
- High - FPS 1450 [Frame/Sec], measured on HLS2 server
HL-225C - Regular functional test:
- Extreme - FPS 621 [Frame/Sec], measured on HLS2 server
- High - FPS 657 [Frame/Sec], measured on HLS2 server
HL-225C - Regular functional test + Serdes test:
- Extreme - FPS 1200 [Frame/Sec], measured on HLS2 server
- High - FPS 1296 [Frame/Sec], measured on HLS2 server
HL-225D - Regular functional test:
- Extreme - FPS 682 [Frame/Sec], measured on HLS2 server
- High - FPS 749 [Frame/Sec], measured on HLS2 server
HL-225D - Regular functional test + Serdes test:
- Extreme - FPS 1200 [Frame/Sec], measured on HLS2 server
- High - FPS 1410 [Frame/Sec], measured on HLS2 server
HL-288 PCIe card - Regular functional test:
- Extreme - FPS 710 [Frame/Sec], measured on HLS2 server
- High - FPS 720 [Frame/Sec], measured on HLS2 server
HL-288 PCIe card - Regular functional test + Serdes test:
- Extreme - FPS 1400 [Frame/Sec], measured on HLS2 server
- High - FPS 1400 [Frame/Sec], measured on HLS2 server

First-gen Gaudi

Extreme - FPS 260 [Frame/Sec], measured on HLS1 server
High - FPS 270 [Frame/Sec], measured on HLS1 server

The measurements above are recorded from a 10-minute run.
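
The two criteria combine into a single decision: every execution must be bit-exact, and the measured throughput must not fall below the threshold for the card, power level and Serdes setting. The sketch below shows this logic with a subset of the HL-225H values listed above. The table layout and the function name are hypothetical.

```python
# Subset of the thresholds listed above, keyed by
# (card, power level, Serdes test enabled).
FPS_THRESHOLDS = {
    ("HL-225H", "extreme", False): 621,
    ("HL-225H", "high", False): 729,
    ("HL-225H", "extreme", True): 1233,
    ("HL-225H", "high", True): 1450,
}


def f2_passes(card, level, serdes, measured_fps, bit_exact_ok):
    """Return True if both pass/fail criteria are met."""
    threshold = FPS_THRESHOLDS[(card, level, serdes)]
    return bit_exact_ok and measured_fps >= threshold
```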

Functional Test 2 Plugin Switches and Parameters¶

Gaudi 3

hl_qual -gaudi3 -c <pci bus id> [-t <time in seconds>] -rmod <serial | parallel> [-dis_mon] [-mon_cfg <monitor INI path>]
      -f2 [-d] [-serdes <int | ext>] [-enable_ports_check <all | int>] [-toggle]

Switches and Parameters	Description
`-t`	Test duration in seconds. Applicable range: 240-172800.
`-f2`	Functional test 2 plugin selector.
`-d`	Download once option. The input tensors are downloaded at the beginning of the test and reused for each test run. This switch is useful when you suspect that functional test performance degradation is due to low PCI bandwidth.
`-serdes <int \| ext>`	`int` - Runs allreduce collective operation to test Serdes connectivity along with the other components tested using the Functional 2 test. `ext` - Runs loopback test on the external ports. The loopback dongles must be fitted on the external ports. If the value is not specified, the default value is `int`.
`-enable_ports_check <all \| int>`	Indicates whether the ports are UP or DOWN. If the ports are DOWN, the test fails: `all` - Checks all the external and internal ports. `int` - Checks the internal ports only.
`-toggle`	Enables the port toggle check and counts the number of port toggles during the test.

./hl_qual -gaudi3 -c all -rmod parallel -f2 -t 450
./hl_qual -gaudi3 -c all -rmod parallel -f2 -t 450 -serdes
./hl_qual -gaudi3 -c all -rmod parallel -f2 -t 450 -serdes int

Gaudi 2

hl_qual -gaudi2 -c <pci bus id> [-t <time in seconds>] -rmod <serial | parallel> [-dis_mon] [-mon_cfg <monitor INI path>]
      -f2 -l <extreme | high> [-d] [-serdes <int | ext>] [-enable_ports_check <all | int>] [-toggle]

Switches and Parameters	Description
`-t`	Test duration in seconds. Applicable range: 240-172800.
`-f2`	Functional test 2 plugin selector.
`-d`	Download once option. The input tensors are downloaded at the beginning of the test and reused for each test run. This switch is useful when you suspect that functional test performance degradation is due to low PCI bandwidth.
`-serdes <int \| ext>`	`int` - Runs allreduce collective operation to test Serdes connectivity along with the other components tested using the Functional 2 test. `ext` - Runs loopback test on the external ports. The loopback dongles must be fitted on the external ports. If the value is not specified, the default value is `int`.
`-l <extreme \| high>`	Power level selector. For the power levels measurements, refer to Functional Test 2 Expected Maximum Power Levels. If the value is not specified, the default value is High.
`-enable_ports_check <all \| int>`	Indicates whether the ports are UP or DOWN. If the ports are DOWN, the test fails: `all` - Checks all the external and internal ports. `int` - Checks the internal ports only.
`-toggle`	Enables the port toggle check and counts the number of port toggles during the test.

./hl_qual -gaudi2 -c all -rmod parallel -f2 -l extreme -t 450
./hl_qual -gaudi2 -c all -rmod parallel -f2 -l extreme -t 450 -serdes
./hl_qual -gaudi2 -c all -rmod parallel -f2 -l extreme -t 450 -serdes int

First-gen Gaudi

hl_qual -gaudi -c <pci bus id> [-t <time in seconds>] -rmod <serial | parallel> [-dis_mon] [-mon_cfg <monitor INI path>]
      -f2 -l <extreme | high> [-d] [-dis_val] [-serdes] [-enable_ports_check <all | int>]

Switches and Parameters	Description
`-t`	Test duration in seconds.
`-f2`	Functional test 2 plugin selector.
`-d`	Download once option. The input tensors are downloaded at the beginning of the test and reused for each test run. This switch is useful when you suspect that functional test performance degradation is due to low PCI bandwidth.
`-dis_val`	Disables output tensor validation. The test does not fail on bit-exact test, but it may fail on low FPS. When using this switch, the test performance becomes higher as the data is not uploaded to the host for verification.
`-serdes`	Runs allreduce collective operation to test Serdes connectivity along with the other components tested using the functional 2 test.
`-l <extreme \| high>`	Power level selector. For the power level measurements, refer to Functional Test 2 Expected Maximum Power Levels. If the value is not specified, the default value is High.
`-enable_ports_check <all \| int>`	Indicates whether the ports are UP or DOWN. If the ports are DOWN, the test fails: `all` - Checks all the external and internal ports. `int` - Checks the internal ports only.

./hl_qual -gaudi -c all -rmod parallel -f2 -l extreme -t 450
./hl_qual -gaudi -c all -rmod parallel -f2 -l high -t 450 -serdes
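
When the test is launched from automation, the differences between the three device generations can be encoded once. The sketch below is a hypothetical helper, not part of hl_qual. It uses only the switches documented above and encodes three rules taken from the tables: -l is not listed for Gaudi 3, the 240-172800 range for -t is documented for Gaudi 3 and Gaudi 2, and -serdes takes a value only on Gaudi 3 and Gaudi 2.

```python
def build_f2_command(generation, duration=None, level=None, serdes=False,
                     devices="all", run_mode="parallel"):
    """Build an hl_qual functional test 2 command line as an argument list.

    serdes: False (off), True (bare -serdes), or "int" / "ext".
    """
    if generation not in ("gaudi", "gaudi2", "gaudi3"):
        raise ValueError(f"unknown generation: {generation}")
    cmd = ["./hl_qual", f"-{generation}", "-c", devices, "-rmod", run_mode, "-f2"]
    if level is not None:
        if generation == "gaudi3":
            raise ValueError("-l is not listed for Gaudi 3")
        if level not in ("extreme", "high"):
            raise ValueError(f"unknown power level: {level}")
        cmd += ["-l", level]
    if duration is not None:
        # The 240-172800 range is documented for Gaudi 3 and Gaudi 2 only.
        if generation != "gaudi" and not 240 <= duration <= 172800:
            raise ValueError("-t must be in the range 240-172800")
        cmd += ["-t", str(duration)]
    if serdes:
        cmd.append("-serdes")
        if serdes is not True:
            if generation == "gaudi" or serdes not in ("int", "ext"):
                raise ValueError(f"unsupported -serdes value: {serdes}")
            cmd.append(serdes)
    return cmd
```

The returned list can be passed to subprocess.run from the hl_qual bin directory. For example, build_f2_command("gaudi2", duration=450, level="extreme", serdes="int") reproduces the last Gaudi 2 example above.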

Gaudi Documentation 1.21.1 documentation
