Functional Test Plugins Design, Switches and Parameters
This section describes plugin-specific switches. Common switches appear in the command examples for completeness but are not covered here. For the common plugin switches and parameters, refer to hl_qual Common Plugin Switches and Parameters.
Functional tests verify full chip functionality by running several chip hardware modules in parallel and in a synchronized manner. All the tests described in this section verify calculation accuracy alongside a performance metric, measured in frames per second (FPS). Executing the accuracy check on the host does not affect the FPS measurement of the test.
ResNet-50 Training Stress Test Plugin Design Consideration and Responsibilities¶
Note
The ResNet-50 training stress test plugin is applicable for both first-gen Gaudi and Gaudi2.
Before running this plugin, set the following environment variable:
export __python_cmd=python3
The ResNet-50 training stress test plugin runs a functional ResNet-50 training test in a real-life training scenario. The test verifies accuracy and performance. To enable an accuracy check, you must supply a full ImageNet training dataset.
ResNet-50 Training Stress Test Plugin Testing Modes¶
The ResNet-50 training stress test plugin has two testing modes, each with different batch size options:
64 batch size
256 batch size
Random Data vs ImageNet verification:
Random Data - The test can run on random data tensors. In this mode, the accuracy check is skipped and only the achievable FPS counts toward the pass/fail criteria. This enables a pure FPS test that does not depend on the image augmenter. This test is limited to 1000 iterations.
ImageNet - When using the ImageNet dataset, the test evaluates accuracy and FPS. The FPS can be influenced by the image augmenter: AEON on first-gen Gaudi and Habana Media Pipe on Gaudi2. For more details, refer to Accuracy and FPS Evaluation.
Note
Running the test with the ImageNet dataset is the default mode. To run it with random data, use the -rand switch.
The suggested number of training epochs should not exceed 90. The ResNet-50 training app should converge within that range. A 90-epoch run takes 20-21 hours.
Preconditions¶
The trainingApp test uses the AEON augmenter which could be a limiting factor on the achievable FPS results when running on multiple devices.
Before running this testing mode with the ImageNet dataset on first-gen Gaudi, make sure you have the following:
A minimum of 96 CPU cores per server, evenly distributed between the NUMA nodes.
A successful pass of the Memory Bandwidth Test and the PCI Bandwidth Test.
512 GB of RAM.
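The CPU and RAM minimums above can be checked quickly before a run. The following is a minimal sketch, assuming a Linux host with nproc and /proc/meminfo available; meets_minimum is a hypothetical helper and is not part of hl_qual. It does not check NUMA distribution or the bandwidth test results.

```shell
#!/usr/bin/env bash
# Sketch: report whether this host meets the documented first-gen Gaudi minimums.
# The thresholds come from the list above; the helper name is hypothetical.
MIN_CORES=96
MIN_RAM_GB=512

# meets_minimum <actual> <required>: succeeds when actual >= required (integers).
meets_minimum() {
    [ "$1" -ge "$2" ]
}

# nproc counts logical CPUs, which may include hyperthreads.
# Falls back to 0 if nproc is unavailable.
cores=$(nproc --all 2>/dev/null || echo 0)

# MemTotal is in kB and excludes memory reserved by firmware and the kernel,
# so a 512 GB host usually reports a little under 512 here.
ram_gb=$(awk '/^MemTotal:/ { printf "%d", $2 / 1024 / 1024 }' /proc/meminfo 2>/dev/null)
ram_gb=${ram_gb:-0}

if meets_minimum "$cores" "$MIN_CORES"; then
    echo "CPU cores: $cores (meets minimum of $MIN_CORES)"
else
    echo "CPU cores: $cores (below minimum of $MIN_CORES)"
fi

if meets_minimum "$ram_gb" "$MIN_RAM_GB"; then
    echo "RAM: ${ram_gb} GiB (meets minimum of $MIN_RAM_GB)"
else
    echo "RAM: ${ram_gb} GiB reported (check against the $MIN_RAM_GB GB requirement)"
fi
```

To review how the cores are spread across NUMA nodes, the "NUMA node" lines printed by lscpu can be inspected manually.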
Accuracy and FPS Evaluation¶
When running this test with ImageNet testing mode, the FPS results may be influenced by the different augmentation and image preprocessing mechanisms used for first-gen Gaudi and Gaudi2:
First-gen Gaudi - Uses AEON augmenter which utilizes the host CPU.
Gaudi2 - Uses the Habana Media Pipe for JPEG decoding and image preprocessing. The preprocessing of the ImageNet dataset is done on the decoder HW path and does not require CPU resources.
ResNet-50 Training Stress Test Plugin Configuration Files and Requirements¶
Before using the plugin, make sure to perform the following:
Download the training/validation dataset from ImageNet.
Download the labels file from ImageNet Object Localization Challenge page.
Untar the ImageNet tar files (ILSVRC2012_img_train.tar, ILSVRC2012_img_val.tar).
Change your current directory to the hl_qual bin directory.
Run the preparation script, prepare.sh (included in the package). This script untars all the tar files (ILSVRC2012_img_train.tar file) and generates the training list file (train_list.txt).
./prepare.sh -m MODE -d EXTRACTED_DIR -f LABEL_FILE [-h]
Parameters:
MODE - ‘train’ or ‘val’.
EXTRACTED_DIR - The path to the directory that contains the untarred files from the ILSVRC2012_img_train.tar file.
LABEL_FILE - The path to the LOC_synset_mapping.txt file.
Note
IMPORTANT: Expected execution time depends on the number of epochs configured for the test run. Each epoch can take up to 18 minutes.
Make a copy of the following files as they can be used after installing a new package on the same setup:
train_list.txt
training256.json
training64.json
Reinstalling the package or relaunching the script overwrites those files.
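One way to keep those files safe is to copy them to a directory outside the package tree. The following is a minimal sketch; backup_qual_files is a hypothetical helper, and it assumes that the three files sit in the directory passed as the first argument (for example, the hl_qual bin directory where prepare.sh was run).

```shell
# backup_qual_files <source dir> <backup dir>
# Copies the three files named above, preserving timestamps.
# A missing file is reported on stderr and skipped.
backup_qual_files() {
    local src_dir=$1 dest_dir=$2 name
    mkdir -p "$dest_dir" || return 1
    for name in train_list.txt training256.json training64.json; do
        if [ -f "$src_dir/$name" ]; then
            cp -p "$src_dir/$name" "$dest_dir/" && echo "saved $name"
        else
            echo "missing $name" >&2
        fi
    done
}

# Example (both paths are placeholders):
#   backup_qual_files /path/to/hl_qual/bin "$HOME/hl_qual_backup"
```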
ResNet-50 and Linux File Cache Considerations¶
Linux has a file cache that accelerates I/O operations. Once a file has been read, it is kept in the file cache in host RAM so that subsequent reads of the same file are faster. As a result, when running the training plugin with the full ImageNet dataset, which is 140 GB in size, the first training epoch may perform below expectations because the files are not yet cached. To address this issue, perform one of the following:
Run a dummy training run for a full epoch and ignore the failure. During this dummy run, the full ImageNet dataset is loaded into the Linux file cache so that the following run achieves the expected performance.
Use the iocache_loader application supplied in the hl_qual installation folder to load the ImageNet dataset into the Linux file cache.
Note
Rebooting the host clears the Linux file cache. After a reboot, repeat one of the options above to restore the expected performance.
Running iocache_loader Application¶
The following shows the iocache_loader command line interface. To list the applicable switches, run the command without any parameters.
$ ./iocache_loader
path to training dataset must be supplied
./iocache_loader -p <path> -t <num of threads> -e
-p <path> - path to dataset - this is mandatory
-t <num of threads> - number of threads default 20
-e enable output printouts
For example:
./iocache_loader -p /user/imagenet/train -t 40
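For intuition about what warming the file cache involves, the sketch below reads every file under a directory once with several parallel readers and discards the data. This is not how iocache_loader is implemented, which the source does not describe; warm_file_cache is a hypothetical helper built from standard find, xargs and cat, and its default of 20 readers simply mirrors the default thread count listed above.

```shell
# warm_file_cache <dataset dir> [parallel readers]
# Reads each regular file once so the kernel keeps its pages in the file cache.
# The data itself is discarded.
warm_file_cache() {
    local dataset_dir=$1 readers=${2:-20}
    find "$dataset_dir" -type f -print0 |
        xargs -0 -r -P "$readers" -n 64 cat > /dev/null
}

# Example (the path is a placeholder):
#   warm_file_cache /user/imagenet/train 40
```

The files stay cached only while the host has enough free RAM to hold them.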
ResNet-50 Training Stress Test - Pass/Fail Criteria¶
The performance and accuracy test evaluates the loss function received from the device. If the loss function shows unexpected behavior, the test fails.
Accuracy - the loss function must decrease monotonically, without sharp jumps between training tests. A sharp jump indicates that NaN values have propagated into the training process calculations.
Performance (FPS) - the performance [images/sec] is calculated per training epoch. The expected results per device:
Gaudi2 - FPS: 5750 images/sec, epoch runtime: 3.6 minutes
First-gen Gaudi - FPS: 1580 images/sec, epoch runtime: 13.3 minutes
Expected accuracy for a 90-epoch run: 0.743.
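As a rough cross-check, the per-epoch runtimes listed above can be scaled to a full run. The sketch below multiplies the epoch count by the epoch runtime; total_hours is a hypothetical helper, and the result ignores initialization and any overhead not included in the listed epoch runtimes.

```shell
# total_hours <epochs> <minutes per epoch>
# Prints the total training time in hours, to two decimals.
total_hours() {
    awk -v epochs="$1" -v minutes="$2" \
        'BEGIN { printf "%.2f\n", epochs * minutes / 60 }'
}

total_hours 90 3.6     # prints 5.40
total_hours 90 13.3    # prints 19.95
```

The 13.3-minute figure gives just under 20 hours for 90 epochs, which is in line with the 20-21 hour estimate quoted earlier for a 90-epoch run.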
ResNet-50 Training Stress Test Plugin Switches and Parameters¶
The following lists the training test plugin switches and parameters:
hl_qual -gaudi | -gaudi2 -c <pci bus id> [-rmod <serial | parallel>] [-dis_mon] [-mon_cfg <monitor INI path>]
-trainingApp [-bs <batch size 64 | 256>] [-epoch <number of epochs>] [-n <number of iterations>] [-rand]
Switches and Parameters | Description |
---|---|
-trainingApp | Training test plugin selector. |
-bs <batch size> | Defines the training batch size (64 or 256). If the value is not specified, the default value is 256. |
-epoch <number of epochs> | Defines epoch count. |
-n <number of iterations> | Defines iteration count. You must provide the -epoch flag with this flag. Example: -epoch 1 -n 1000. |
-rand | Random input generation. This mode disables accuracy and loss validation. Only FPS is calculated. The test uses a preset of 1500 iterations. |
 | Writes statistics to file. |
For example:
./hl_qual -gaudi -c all -rmod parallel -trainingApp -bs 256 -rand
./hl_qual -gaudi -c all -rmod parallel -trainingApp -bs 256 -epoch 3
./hl_qual -gaudi2 -c all -rmod parallel -trainingApp -bs 256 -epoch 3
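The constraints described in the table above (batch size of 64 or 256, and -n only together with -epoch) can be checked before launching a long run. The following is a minimal sketch; build_training_cmd is a hypothetical wrapper that only prints a command line in the same form as the examples above, using -c all and -rmod parallel as those examples do. It does not run hl_qual.

```shell
# build_training_cmd <gaudi|gaudi2> <batch size> [epochs] [iterations]
# Prints an hl_qual trainingApp command line, or fails with a message on stderr.
build_training_cmd() {
    local device=$1 batch=$2 epochs=$3 iterations=$4
    case $device in
        gaudi|gaudi2) ;;
        *) echo "device must be gaudi or gaudi2" >&2; return 1 ;;
    esac
    case $batch in
        64|256) ;;
        *) echo "batch size must be 64 or 256" >&2; return 1 ;;
    esac
    # The iteration count is only accepted together with an epoch count.
    if [ -n "$iterations" ] && [ -z "$epochs" ]; then
        echo "-n requires -epoch" >&2
        return 1
    fi
    local cmd="./hl_qual -$device -c all -rmod parallel -trainingApp -bs $batch"
    [ -n "$epochs" ] && cmd="$cmd -epoch $epochs"
    [ -n "$iterations" ] && cmd="$cmd -n $iterations"
    echo "$cmd"
}

build_training_cmd gaudi2 256 3
build_training_cmd gaudi 64 1 1000
```

Printing the command first makes it easy to review the switches before committing to a multi-hour run.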
Functional Test 2 Plugin Design Consideration and Responsibilities¶
Note
Functional test 2 plugin is applicable for both first-gen Gaudi and Gaudi2.
Functional test 2 runs all hardware components on the first-gen Gaudi and Gaudi2 SOCs simultaneously in parallel mode to test their functionality and interaction between the different units.
The functional test 2 uses a synthetic topology based on multiple operations. In these operations, all computational units and memory space are utilized while maintaining high power usage.
The test can run for long periods (more than 2 hours), testing the following device functionalities:
Thermal stress test, cooling system functionality, temperature dissipation and thermal protection mechanisms can be checked while running the plugin under extreme load.
PID and clock relaxation mechanisms verification.
Long work periods in typical power levels.
Full bit-exact calculation.
Tested units:
PCI links
DMA engines – moving data between:
PCI ==> HBM, HBM ==> PCI
HBM ==> SRAM, SRAM ==> HBM
MME engines
TPC engines
Serdes connectivity - only when using -serdes switch
Functional Test 2 Synthetic Topology¶
The functional test 2 plugin builds a test topology including large tensors and multiple operators (Conv, Batchnorm).
The test application runs the topology by injecting pre-calculated inputs and verifies the output against a pre-calculated reference for each topology execution to ensure bit-exact results.
When applying the -serdes switch, the topology includes full serdes transmit/receive verification in addition to the basic topology.
Functional Test 2 Testing Modes¶
The test verifies the calculation result on each topology execution while enabling high power usage. All execution steps are verified for this purpose.
The test also verifies execution throughput [executions/second] against the performance metric, measured in frames per second (FPS).
The functional test 2 contains the following sub-test modes:
Gaudi2:
Extreme - measured power level for 54V power supply: 530-560 [watt]
High - measured power level for 54V power supply: 450-490 [watt]
First-gen Gaudi:
Extreme - measured power level: 345-355 [watt]
Mid - measured power level: 230-240 [watt]
The measurements above were recorded from a 4-minute run. They can change depending on the ambient conditions of the system (fan speed, server box configuration and ambient temperatures).
Note
The initialization stage can take up to 170 seconds. This time is needed to recalculate and generate the reference expected output tensors, compile the test topology, and calibrate the test runtime execution. When using the -t switch, the initialization time is not included in the test duration that you set.
Functional Test 2 - Pass/Fail Criteria¶
The pass/fail criteria are as follows:
The calculated value of each topology launch must be identical to a pre-calculated reference.
The execution throughput [executions/seconds] must not fall below an existing predefined threshold:
Gaudi2:
Extreme - FPS 750 [Frame/Sec], measured on HLS2 server
High - FPS 880 [Frame/Sec], measured on HLS2 server
First-gen Gaudi:
Extreme - FPS 260 [Frame/Sec]
High - FPS 270 [Frame/Sec]
The measurements above were recorded from a 4-minute run.
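The throughput criterion amounts to comparing a measured FPS value with the threshold for the selected power level. The following is a minimal sketch of that comparison; fps_meets_threshold is a hypothetical helper, the measured values are made-up illustrations, and how hl_qual performs the check internally is not described here.

```shell
# fps_meets_threshold <measured fps> <threshold fps>
# Succeeds when the measured value is at or above the threshold.
# awk is used so that fractional FPS values compare correctly.
fps_meets_threshold() {
    awk -v measured="$1" -v threshold="$2" \
        'BEGIN { exit !(measured >= threshold) }'
}

# 750 is one of the thresholds listed above; 762.4 and 741.0 are illustrative.
if fps_meets_threshold 762.4 750; then echo "pass"; else echo "fail"; fi
if fps_meets_threshold 741.0 750; then echo "pass"; else echo "fail"; fi
```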
Functional Test 2 Plugin Switches and Parameters¶
hl_qual -gaudi2 -c <pci bus id> [-t <time in seconds>] -rmod <serial | parallel> [-dis_mon] [-mon_cfg <monitor INI path>]
-f2 -l <extreme | high> [-d] [-dis_val] [-serdes] [-enable_ports_check <all | int>]
Switches and Parameters | Description |
---|---|
-t <time in seconds> | Test duration in seconds. |
-f2 | Functional test 2 plugin selector. |
-d | Download once option. The input tensors are downloaded at the beginning of the test and reused for each test run. This switch is useful when you suspect that functional test performance degradation is due to low PCI bandwidth. |
-dis_val | Disables output tensor validation. The test will not fail on the bit-exact check, but it may fail on low FPS. When using this switch, test performance is higher because the data is not uploaded to the host for verification. |
-serdes | Runs an allreduce collective operation to test serdes connectivity along with the other components tested by functional test 2. |
-l <extreme \| high> | Power level selector for 54V power supply. |
-enable_ports_check <all \| int> | Indicates whether the ports are UP or DOWN. If they are DOWN, the test fails. |
hl_qual -gaudi -c <pci bus id> [-t <time in seconds>] -rmod <serial | parallel> [-dis_mon] [-mon_cfg <monitor INI path>]
-f2 -l <extreme | mid> [-d] [-dis_val] [-serdes] [-enable_ports_check <all | int>]
Switches and Parameters | Description |
---|---|
-t <time in seconds> | Test duration in seconds. |
-f2 | Functional test 2 plugin selector. |
-d | Download once option. The input tensors are downloaded at the beginning of the test and reused for each test run. This switch is useful when you suspect that functional test performance degradation is due to low PCI bandwidth. |
-dis_val | Disables output tensor validation. The test will not fail on the bit-exact check, but it may fail on low FPS. When using this switch, test performance is higher because the data is not uploaded to the host for verification. |
-serdes | Runs an allreduce collective operation to test serdes connectivity along with the other components tested by functional test 2. |
-l <extreme \| mid> | Power level selector. If the value is not specified, the default value is MID. |
-enable_ports_check <all \| int> | Indicates whether the ports are UP or DOWN. If they are DOWN, the test fails. |
For example:
./hl_qual -gaudi -c all -rmod parallel -f2 -d -l mid
./hl_qual -gaudi2 -c all -rmod parallel -f2 -l extreme -t 450
./hl_qual -gaudi2 -c all -rmod parallel -f2 -l extreme -t 450 -serdes
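Because the accepted power levels differ per device (extreme or mid with -gaudi, extreme or high with -gaudi2), a small wrapper can reject a mismatched pair before the run starts. The following is a minimal sketch; build_f2_cmd is a hypothetical helper that only prints a command line in the same form as the examples above and does not run hl_qual.

```shell
# build_f2_cmd <gaudi|gaudi2> <power level> [duration in seconds]
# Prints an hl_qual functional test 2 command line, or fails on a
# device/level pair that the syntax above does not list.
build_f2_cmd() {
    local device=$1 level=$2 seconds=$3
    case "$device:$level" in
        gaudi:extreme|gaudi:mid|gaudi2:extreme|gaudi2:high) ;;
        *) echo "unsupported device/level: $device/$level" >&2; return 1 ;;
    esac
    local cmd="./hl_qual -$device -c all -rmod parallel -f2 -l $level"
    [ -n "$seconds" ] && cmd="$cmd -t $seconds"
    echo "$cmd"
}

build_f2_cmd gaudi2 extreme 450
build_f2_cmd gaudi mid
```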