Functional Test Plugins Design, Switches and Parameters¶
This section describes plugin-specific switches only. Common plugin switches and parameters are described in hl_qual Common Plugin Switches and Parameters.
Functional tests verify full chip functionality by running several chip hardware modules in parallel and in a synchronized manner. All the tests described in this section verify calculation accuracy alongside performance, measured in frames per second (FPS). Executing the accuracy check on the host does not affect the FPS measurement of the test.
ResNet-50 Training Stress Test Plugin Design Consideration and Responsibilities¶
Note
Before running this plugin, make sure to set the __python_cmd environment variable: export __python_cmd=python3
The ResNet-50 training stress test plugin runs a functional ResNet-50 training test as a real life training scenario. The test verifies accuracy and performance. To enable an accuracy check, you must supply a full ImageNet training dataset.
ResNet-50 Training Stress Test Plugin Testing Modes¶
The ResNet-50 training stress test plugin supports a single batch size of 256 (256 images in each batch) and the following data modes:
(First-gen Gaudi only) Random Data - The test can run on random data tensors. When using this mode, the accuracy check is skipped and only the achievable FPS is taken into account in the pass/fail criteria. This enables a pure FPS test that does not depend on the image augmenter. This test is limited to 1000 iterations.
ImageNet - When using the ImageNet dataset, the test evaluates accuracy and FPS. The FPS can be influenced by the image preprocessing path: the AEON augmenter on first-gen Gaudi, and the Intel Gaudi Media Pipe on Gaudi 3 and Gaudi 2. For more details, refer to Accuracy and FPS Evaluation.
Note
For first-gen Gaudi, running the test with the ImageNet dataset is the default mode. To run with random data, use the -rand switch.
The recommended number of training epochs is at most 90. The ResNet-50 training app should converge within that range. A 90-epoch run takes 20-21 hours.
Prerequisites¶
Make sure to install habanalabs-qual-workloads as shown in the Driver and Software Installation.
The trainingApp test uses the AEON augmenter which could be a limiting factor on the achievable FPS results when running on multiple devices. Before running this testing mode with ImageNet dataset on first-gen Gaudi, make sure you have the following:
96 CPU cores per server (minimum), evenly distributed between the NUMA nodes.
A successful pass of the Memory Bandwidth Test and the PCI Bandwidth Test.
512 GB of RAM.
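The CPU and RAM prerequisites above can be spot-checked from a shell before starting a long run. The following is a minimal sketch, assuming a Linux host where nproc and /proc/meminfo are available. The helper names meets_minimum and report are illustrative and are not part of hl_qual, and the script does not check NUMA distribution or the two bandwidth tests.

```shell
#!/usr/bin/env bash
# Thresholds taken from the prerequisites list above.
MIN_CORES=96
MIN_RAM_GB=512

# meets_minimum VALUE MINIMUM -> exit status 0 when VALUE >= MINIMUM
meets_minimum() {
    [ "$1" -ge "$2" ]
}

# report LABEL VALUE MINIMUM -> prints one OK/WARN line (hypothetical helper)
report() {
    local label=$1 value=$2 minimum=$3
    if meets_minimum "$value" "$minimum"; then
        echo "OK: $label = $value (minimum $minimum)"
    else
        echo "WARN: $label = $value (minimum $minimum)"
    fi
}

# Number of CPU cores visible to this shell (0 if nproc is unavailable).
cores=$(nproc 2>/dev/null || echo 0)

# MemTotal is reported in kB; convert to whole GB (0 if the file is unavailable).
ram_gb=$(awk '/^MemTotal:/ {print int($2 / 1024 / 1024)}' /proc/meminfo 2>/dev/null || echo 0)

report "CPU cores" "$cores" "$MIN_CORES"
report "RAM (GB)" "$ram_gb" "$MIN_RAM_GB"
```

MemTotal is usually somewhat lower than the installed RAM because the kernel reserves part of it, so a WARN on a 512 GB machine should be read as a prompt to verify the configuration rather than as a failure.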
Accuracy and FPS Evaluation¶
When running this test in the ImageNet testing mode, the FPS results may be influenced by the different augmentation and image preprocessing mechanisms used for first-gen Gaudi, Gaudi 2 and Gaudi 3:
First-gen Gaudi - Uses the AEON augmenter, which utilizes the host CPU.
Gaudi 3 and Gaudi 2 - Use the Intel Gaudi Media Pipe for JPEG decoding and image preprocessing. The preprocessing of the ImageNet dataset is done on the decoder HW path and does not require CPU resources.
ResNet-50 Training Stress Test Plugin Configuration Files and Requirements¶
Before using the plugin, make sure to perform the following:
Download the training/validation dataset from ImageNet.
Download the labels file from ImageNet Object Localization Challenge page.
Untar the ImageNet tar files (ILSVRC2012_img_train.tar, ILSVRC2012_img_val.tar). Use the following commands:
mkdir tar_folder
cd tar_folder
tar -xvf ../ILSVRC2012_img_train.tar
The above instructions assume that the ILSVRC2012_img_train.tar file resides at the same folder level as tar_folder.
Change your current directory to the hl_qual bin directory.
Run the preparation script prepare.sh (included in the package). This script untars all the tar files (ILSVRC2012_img_train.tar file), and generates the training list file (train_list.txt):
./prepare.sh -m MODE -d EXTRACTED_DIR -f LABEL_FILE [-h]
Parameters:
MODE - 'train' or 'val'.
EXTRACTED_DIR - The path to the directory that contains the untarred files from the ILSVRC2012_img_train.tar file.
LABEL_FILE - The path to the LOC_synset_mapping.txt file.
The script generates and updates the following items:
train_list.txt - A file containing the location of all JPEG files and the label associated with each image. Each line contains one image path and one label. For Gaudi 3 and Gaudi 2, the path to train_list.txt should be specified using the -train_list switch. For example:
@FILE STRING
train/n01582220/n01582220_8497.JPEG 18
train/n01582220/n01582220_11460.JPEG 18
train/n01582220/n01582220_20482.JPEG 18
train/n01582220/n01582220_37512.JPEG 18
train/n01582220/n01582220_5317.JPEG 18
train/n01582220/n01582220_4839.JPEG 18
train folder - This folder contains 1000 sub-folders, each containing the images associated with the same class (label).
(For first-gen Gaudi) training256.json - The entries below change to include the full path to the train folder and the train_list file as generated by the prepare.sh script:
"manifest_filename": "/home/labuser/builds/qual_release_build/gaudi/bin/train_list.txt",
"manifest_root": "/home/labuser/builds/qual_release_build/gaudi/bin",
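Because each line of train_list.txt is expected to hold one image path and one label, the generated file can be sanity-checked before a long training run. The following is a minimal sketch, assuming the layout shown in the example above (an optional header line starting with @, then a path and a numeric label separated by whitespace). The function name count_bad_lines is illustrative and is not part of the hl_qual package.

```shell
# count_bad_lines FILE -> prints the number of lines that are not "<path> <numeric label>"
count_bad_lines() {
    awk '
        /^@/ { next }                                 # skip the "@FILE STRING" header line
        NF != 2 || $2 !~ /^[0-9]+$/ { bad++ }         # wrong field count or non-numeric label
        END { print bad + 0 }                         # print 0 when no bad line was seen
    ' "$1"
}
```

For example, `count_bad_lines train_list.txt` prints 0 for a well-formed list. Paths containing spaces would be reported as malformed by this check.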
Note
Reinstalling the package or relaunching the script overwrites those files. It is highly recommended to make a copy of the following files/folders so you can restore them in case they are overwritten:
train_list.txt
train folder
training256.json
For first-gen Gaudi, you can place the train folder and the train_list.txt file in any location within the user's file system, as long as the training256.json file placed under /opt/habanalabs/qual/gaudi/bin is updated.
ResNet-50 and Linux File Cache Considerations¶
Linux has a file cache that accelerates I/O operations. Once a file has been read, it is kept in the file cache in host RAM so that subsequent reads of the same file are faster. As a result, when running the training plugin with the full ImageNet training set, which is about 140 GB in size, performance in the first training epoch could be lower than expected. To address this issue, perform one of the following:
Run a dummy training run for a full epoch and ignore the failure. During this dummy run, the full ImageNet dataset is loaded into the Linux file cache, so the following run achieves the expected performance.
Use the iocache_loader application supplied in the hl_qual installation folder to load the ImageNet dataset into the Linux file cache.
Note
Rebooting the host clears the Linux file cache, so one of the above options must be repeated after a reboot to achieve the expected performance.
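The cache-warming idea described above amounts to reading every dataset file once so that the kernel keeps it in the file cache. The following is a generic sketch of that idea using find, xargs and cat. It is not the iocache_loader implementation, the function name warm_file_cache is hypothetical, and the default of 20 parallel readers is only chosen to mirror the iocache_loader default thread count.

```shell
# warm_file_cache DIR [THREADS] -> reads every file under DIR once and discards the data
warm_file_cache() {
    local dir=$1 threads=${2:-20}
    # -print0 / -0 keep unusual file names intact; -P runs several readers in parallel;
    # -r skips running cat when no files are found.
    find "$dir" -type f -print0 | xargs -0 -r -P "$threads" -n 64 cat > /dev/null
}
```

The files stay cached only while the host has enough free RAM to hold the dataset, which is one reason the 512 GB prerequisite matters for this testing mode.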
Running iocache_loader Application¶
The following is the iocache_loader command-line interface. To list the applicable switches, run the command without any parameters.
$ ./iocache_loader
path to training dataset must be supplied
./iocache_loader -p <path> -t <num of threads> -e
-p <path> - path to dataset - this is mandatory
-t <num of threads> - number of threads default 20
-e enable output printouts
For example:
./iocache_loader -p /user/imagenet/train -t 40
ResNet-50 Training Stress Test - Pass/Fail Criteria¶
The performance and accuracy test evaluates the loss function received from the device. If the loss function shows unexpected behavior, the test fails.
Accuracy - the loss function must decrease monotonically, without sharp jumps between training tests. Such jumps indicate that NaN values have been propagated into the training process calculations.
Performance (FPS) - the performance [images/sec] is calculated per training epoch.
The expected results per device:
HL-325 and HL-325L:
FPS: 7300 images/sec
Epoch runtime: 2.9 minutes
Expected accuracy: first epoch - ~0.15
HL-338 PCIe card:
FPS: 3850 images/sec
Epoch runtime: 5.5 minutes
Expected accuracy: first epoch - ~0.15
HL-328 PCIe card:
FPS: 2850 images/sec
Epoch runtime: 7.5 minutes
Expected accuracy: first epoch - ~0.15
Note
The above results can only be achieved if you run the test on a bare metal machine and use a dataset saved locally or on an SSD card.
HL-225H:
FPS: 5900 images/sec
Epoch runtime: 3.6 minutes
HL-225C:
FPS: 5150 images/sec
Epoch runtime: 3.9 minutes
HL-225D:
FPS: 3058 images/sec
Epoch runtime: 6.0 minutes
Expected accuracy: first epoch - ~0.15
Note
The above results can only be achieved if you run the test on a bare metal machine and use a dataset saved locally or on an SSD card.
First-gen Gaudi:
FPS: 1580 images/sec
Epoch runtime: 13.3 minutes
Expected accuracy: first epoch - 0.12
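The FPS and epoch runtime figures above are related by simple arithmetic: the epoch runtime is the number of training images divided by the FPS. The following is a rough sketch, assuming the standard ImageNet-1k (ILSVRC2012) training set of 1,281,167 images. That image count and the function name epoch_minutes are assumptions introduced here and do not come from hl_qual.

```shell
# Assumption: size of the ILSVRC2012 training set (not stated in this document).
IMAGENET_TRAIN_IMAGES=1281167

# epoch_minutes FPS -> estimated minutes per epoch, rounded to one decimal place
epoch_minutes() {
    awk -v n="$IMAGENET_TRAIN_IMAGES" -v fps="$1" 'BEGIN { printf "%.1f\n", n / fps / 60 }'
}

epoch_minutes 7300   # prints 2.9
epoch_minutes 3850   # prints 5.5
```

Some of the listed runtimes differ slightly from this estimate, so it is best used as a quick plausibility check on a measured FPS rather than as a pass/fail rule.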
ResNet-50 Training Stress Test Plugin Switches and Parameters¶
The following lists the training test plugin switches and parameters:
hl_qual -gaudi3 | -gaudi2 -c <pci bus id> -rmod <serial | parallel> [-dis_mon] [-mon_cfg <monitor INI path>]
-trainingApp [-epoch <number of epochs>] [-n <number of iterations>] [-train_list <path to train_list.txt>]
Switches and Parameters | Description |
---|---|
-trainingApp | Training test plugin selector. |
-epoch | Defines epoch count. |
-n | Defines iteration count. You must provide the number of epochs using the -epoch switch. |
-train_list | Specifies path to the train_list.txt file. |
| Writes statistics to file. |
./hl_qual -gaudi2 -c all -rmod parallel -trainingApp -epoch 3 -train_list <path to train_list.txt>
./hl_qual -gaudi3 -c all -rmod parallel -trainingApp -epoch 3 -train_list <path to train_list.txt>
hl_qual -gaudi -c <pci bus id> -rmod <serial | parallel> [-dis_mon] [-mon_cfg <monitor INI path>]
-trainingApp [-epoch <number of epochs>] [-n <number of iterations>] [-rand]
Switches and Parameters | Description |
---|---|
-trainingApp | Training test plugin selector. |
-epoch | Defines epoch count. |
-n | Defines iteration count. You must provide the number of epochs using the -epoch switch. |
-rand | Random input generation. This mode disables accuracy and loss validation. Only FPS is calculated. The test uses a preset of 1500 iterations. |
| Writes statistics to file. |
./hl_qual -gaudi -c all -rmod parallel -trainingApp -rand
./hl_qual -gaudi -c all -rmod parallel -trainingApp -epoch 3
Functional Test 2 Plugin Design Consideration and Responsibilities¶
Functional test 2 runs all hardware components on the Gaudi SOCs simultaneously in parallel mode to test their functionality and interaction between the different units. The functional test 2 uses a synthetic topology based on multiple operations. In these operations, all computational units and memory space are utilized while maintaining high power usage. The test can run for long hours (more than two) testing the following device functionalities:
Thermal stress test, cooling system functionality, temperature dissipation and thermal protection mechanisms. These are tested while running the power stress plugin in extreme load.
PID and clock relaxation mechanisms verification.
Long work periods in typical power levels.
Full bit-exact calculation.
Tested units:
PCI links
DMA engines – moving data between:
PCI ==> HBM, HBM ==> PCI
HBM ==> SRAM, SRAM ==> HBM
MME engines
TPC engines
Serdes connectivity - only when using the -serdes switch
Prerequisites¶
To run this test on Gaudi 3, 1.5TB-2TB of free memory is required.
Functional Test 2 Synthetic Topology¶
The functional test 2 plugin builds a test topology including large tensors and multiple operators (Conv, Batchnorm).
The test application runs the topology by injecting pre-calculated inputs and verifies the output against a pre-calculated reference for each topology execution to ensure bit-exact results.
When applying the -serdes switch, the topology includes full Serdes transmit/receive verification in addition to the basic topology.
Functional Test 2 Testing Modes¶
The test verifies the calculation result on each topology execution while enabling high power usage. All execution steps are verified for this purpose. The test also verifies execution throughput [executions/second] against a performance threshold, measured in frames per second (FPS). The test supports the following sub-test modes:
HL-325 and HL-325L:
Extreme - Measured power level for 54V power supply: 860 [watts]
High – Measured power level for 54V power supply: 790 [watts]
HL-338 PCIe card:
Extreme - Measured power level for 54V power supply: 600 [watts]
High – Measured power level for 54V power supply: 600 [watts]
HL-328 PCIe card:
Extreme - Measured power level for 54V power supply: 635-645 [watts]
High – Measured power level for 54V power supply: 515-525 [watts]
HL-225H and HL-225C:
Extreme - Measured power level for 54V power supply: 530-560 [watts]
High - Measured power level for 54V power supply: 450-490 [watts]
HL-225D:
Extreme - Measured power level for 54V power supply: 360 [watts]
High - Measured power level for 54V power supply: 270 [watts]
First-gen Gaudi:
Extreme - Measured power level: 345-355 [watts]
High - Measured power level: 230-240 [watts]
The measurements above were recorded from a four-minute run. They can change depending on the ambient conditions of the system (fan speed, server box configuration and ambient temperature).
Note
The initialization stage can take up to 170 seconds. This time is needed to recalculate and generate the reference expected output tensors, compile the test topology and calibrate the test runtime execution.
When using the -t switch, the initialization time is not included in the specified test duration.
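Since the initialization stage is not counted against the -t duration, the wall-clock time of a run can exceed the requested duration by up to 170 seconds. The following small sketch computes that upper bound, which can be useful when sizing a job scheduler time limit. The function name f2_max_wall_seconds is illustrative only, and any teardown time after the test is not accounted for.

```shell
# Maximum initialization time stated in the note above.
F2_INIT_MAX_SECONDS=170

# f2_max_wall_seconds DURATION -> DURATION plus the maximum initialization time
f2_max_wall_seconds() {
    echo $(( $1 + F2_INIT_MAX_SECONDS ))
}

f2_max_wall_seconds 450   # prints 620
```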
Functional Test - Pass/Fail Criteria¶
The pass/fail criteria are the following:
The calculated value of each topology launch must be identical to a pre-calculated reference.
The execution throughput [executions/seconds] must not fall below an existing predefined threshold:
HL-325 and HL-325L - Regular functional test:
Extreme - FPS 751 [Frame/Sec], measured on HLS3 server
High - FPS 1275 [Frame/Sec], measured on HLS3 server
HL-325 and HL-325L - Regular functional test + Serdes test:
Extreme - FPS 1458 [Frame/Sec], measured on HLS3 server
High - FPS 2450 [Frame/Sec], measured on HLS3 server
HL-338 PCIe card - Regular functional test:
Extreme - FPS 391 [Frame/Sec], measured on HLS3 server
High - FPS 670 [Frame/Sec], measured on HLS3 server
HL-338 PCIe card - Regular functional test + Serdes test:
Extreme - FPS 734 [Frame/Sec], measured on HLS3 server
High - FPS 1242 [Frame/Sec], measured on HLS3 server
HL-328 PCIe card - Regular functional test:
Extreme - FPS 550 [Frame/Sec], measured on HLS3 server
High - FPS 1104 [Frame/Sec], measured on HLS3 server
HL-328 PCIe card - Regular functional test + Serdes test:
Extreme - FPS 550 [Frame/Sec], measured on HLS3 server
High - FPS 1104 [Frame/Sec], measured on HLS3 server
HL-225H - Regular functional test:
Extreme - FPS 621 [Frame/Sec], measured on HLS2 server
High - FPS 729 [Frame/Sec], measured on HLS2 server
HL-225H - Regular functional test + Serdes test:
Extreme - FPS 1233 [Frame/Sec], measured on HLS2 server
High - FPS 1450 [Frame/Sec], measured on HLS2 server
HL-225C - Regular functional test:
Extreme - FPS 621 [Frame/Sec], measured on HLS2 server
High - FPS 657 [Frame/Sec], measured on HLS2 server
HL-225C - Regular functional test + Serdes test:
Extreme - FPS 1200 [Frame/Sec], measured on HLS2 server
High - FPS 1296 [Frame/Sec], measured on HLS2 server
HL-225D - Regular functional test:
Extreme - FPS 585 [Frame/Sec], measured on HLS2 server
High - FPS 585 [Frame/Sec], measured on HLS2 server
HL-225D - Regular functional test + Serdes test:
Extreme - FPS 1151 [Frame/Sec], measured on HLS2 server
High - FPS 1151 [Frame/Sec], measured on HLS2 server
First-gen Gaudi:
Extreme - FPS 260 [Frame/Sec], measured on HLS1 server
High - FPS 270 [Frame/Sec], measured on HLS1 server
The measurements above were recorded from a four-minute run.
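The throughput criterion above is a comparison of a measured FPS against the threshold for the card and power level. The following is a minimal sketch of that comparison for a few of the listed values (regular functional test, without Serdes). The function names f2_min_fps and fps_verdict are hypothetical, the measured FPS must be supplied by the caller because the hl_qual output format is not covered here, and the bit-exact criterion is not checked.

```shell
# f2_min_fps CARD LEVEL -> threshold from the list above (regular functional test only)
f2_min_fps() {
    case "$1:$2" in
        HL-325:extreme)  echo 751 ;;
        HL-325:high)     echo 1275 ;;
        HL-225H:extreme) echo 621 ;;
        HL-225H:high)    echo 729 ;;
        *) return 1 ;;   # card/level not covered by this sketch
    esac
}

# fps_verdict MEASURED THRESHOLD -> prints PASS when MEASURED is not below THRESHOLD
fps_verdict() {
    awk -v m="$1" -v t="$2" 'BEGIN { print ((m + 0 >= t + 0) ? "PASS" : "FAIL") }'
}

fps_verdict 760.5 "$(f2_min_fps HL-325 extreme)"   # prints PASS
fps_verdict 700 "$(f2_min_fps HL-325 extreme)"     # prints FAIL
```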
Functional Test 2 Plugin Switches and Parameters¶
hl_qual -gaudi3 -c <pci bus id> [-t <time in seconds>] -rmod <serial | parallel> [-dis_mon] [-mon_cfg <monitor INI path>]
-f2 -l <extreme | high> [-d] [-dis_val] [-serdes <int | ext>]
Switches and Parameters | Description |
---|---|
-t | Test duration in seconds. |
-f2 | Functional test 2 plugin selector. |
-d | Download once option. The input tensors are downloaded at the beginning of the test and reused for each test run. This switch is useful when you suspect that functional test performance degradation is due to PCI low BW. |
-dis_val | Disables output tensor validation. The test does not fail on bit-exact test, but it may fail on low FPS. When using this switch, the test performance becomes higher as the data is not uploaded to the host for verification. |
-serdes | If the value is not specified, the default value is |
-l | Power level selector for 54V power supply. For the power levels, refer to Functional Test 2 Testing Modes. If the value is not specified, the default value is High. |
./hl_qual -gaudi3 -c all -rmod parallel -f2 -l extreme -t 450
./hl_qual -gaudi3 -c all -rmod parallel -f2 -l extreme -t 450 -serdes
./hl_qual -gaudi3 -c all -rmod parallel -f2 -l extreme -t 450 -serdes int
hl_qual -gaudi2 -c <pci bus id> [-t <time in seconds>] -rmod <serial | parallel> [-dis_mon] [-mon_cfg <monitor INI path>]
-f2 -l <extreme | high> [-d] [-dis_val] [-serdes <int | ext>] [-enable_ports_check <all | int>]
Switches and Parameters | Description |
---|---|
-t | Test duration in seconds. |
-f2 | Functional test 2 plugin selector. |
-d | Download once option. The input tensors are downloaded at the beginning of the test and reused for each test run. This switch is useful when you suspect that functional test performance degradation is due to PCI low BW. |
-dis_val | Disables output tensor validation. The test does not fail on bit-exact test, but it may fail on low FPS. When using this switch, the test performance becomes higher as the data is not uploaded to the host for verification. |
-serdes | If the value is not specified, the default value is |
-l | Power level selector. For the power levels, refer to Functional Test 2 Testing Modes. If the value is not specified, the default value is High. |
-enable_ports_check | Indicates whether the ports are UP or DOWN. If the ports are DOWN, the test fails: |
./hl_qual -gaudi2 -c all -rmod parallel -f2 -l extreme -t 450
./hl_qual -gaudi2 -c all -rmod parallel -f2 -l extreme -t 450 -serdes
./hl_qual -gaudi2 -c all -rmod parallel -f2 -l extreme -t 450 -serdes int
hl_qual -gaudi -c <pci bus id> [-t <time in seconds>] -rmod <serial | parallel> [-dis_mon] [-mon_cfg <monitor INI path>]
-f2 -l <extreme | high> [-d] [-dis_val] [-serdes] [-enable_ports_check <all | int>]
Switches and Parameters | Description |
---|---|
-t | Test duration in seconds. |
-f2 | Functional test 2 plugin selector. |
-d | Download once option. The input tensors are downloaded at the beginning of the test and reused for each test run. This switch is useful when you suspect that functional test performance degradation is due to PCI low BW. |
-dis_val | Disables output tensor validation. The test does not fail on bit-exact test, but it may fail on low FPS. When using this switch, the test performance becomes higher as the data is not uploaded to the host for verification. |
-serdes | Runs allreduce collective operation to test Serdes connectivity along with the other components tested using the functional 2 test. |
-l | Power level selector. For the power levels, refer to Functional Test 2 Testing Modes. If the value is not specified, the default value is High. |
-enable_ports_check | Indicates whether the ports are UP or DOWN. If the ports are DOWN, the test fails: |
./hl_qual -gaudi -c all -rmod parallel -f2 -l extreme -t 450
./hl_qual -gaudi -c all -rmod parallel -f2 -l high -t 450 -serdes