Test Plan Automation
Test Plan Automation¶
The Test Plan Automation functionality is designed to streamline the execution of hl_qual tests by organizing them into structured workflows and allowing them to be configured via YAML files.
The following terms are defined and used throughout this section:
Test plan - A set of hl_qual tests to be run in a single execution flow. A test plan is executed by using configuration YAML files.
Default test plan - A set of hl_qual test configuration YAML files provided in their original form, without any customization.
Default test - A test-specific configuration YAML file provided in its original form, without any customization. The default test includes the test name defined by hl_qual, all applicable environment variables initially marked as not_used, and the test switches categorized as mandatory, used, or not_used.
Virtual UART - FW log stream that is generated for the host via the hl-smi daemon.
The Test Plan Automation functionality contains the following components:
Component |
Description |
---|---|
diag_tool_automation.py |
The main Python script which runs the Test Plan Automation. See Switches and Usage. |
|
The script that analyzes log files and creates reports. See Log Analysis. |
|
Bash scripts designed for specific tasks. See Bash Scripts. |
|
Suggested test plans for hl_qual tests, featuring around 25 different test variants. These test plans serve as a starting point for creating customized test plans. See Configuration YAML Files. |
Switches and Usage¶
The diag_tool_automation.py script executes all the features included in the Test Plan Automation.
The table below lists the switches and options available for the script.
Example:
diag_tool_automation.py [-h] --core {gaudi3,gaudi2} --exec {tests_name_list,gen_plan_cfg,gen_test_cfg,validate_yaml,validate_test_plan,run_test_plan}
[--input_path INPUT_PATH] [--output_path OUTPUT_PATH] [--plan_name PLAN_NAME] [--test_name TEST_NAME] [--postfix POSTFIX]
[--disable_prints DISABLE_PRINTS] [--hpu_only_dmesg] [--enable_sudo_collection]
Switch Name |
Description |
---|---|
--core |
Defines the Gaudi core type. Supported options: gaudi3, gaudi2. |
--exec |
Executes the Test Plan Automation features. Supported options: tests_name_list, gen_plan_cfg, gen_test_cfg, validate_yaml, validate_test_plan, run_test_plan. |
--input_path |
Sets the path to the YAML configuration file depending on the selected --exec option. |
--output_path |
Sets the path to save the run artifacts depending on the selected --exec option. |
--plan_name |
Sets the test plan name. |
--test_name |
Sets the test name during the test-specific YAML file generation. |
--postfix |
Adjusts the test plan name or the test name. If not set, the default name is used. |
--disable_prints |
Disables screen printout during the test plan run. |
--hpu_only_dmesg |
Enables the dmesg collector to log only HPU-related messages. |
--enable_sudo_collection |
Enables execution of the dmesg command when superuser permissions are required. |
Note
For further help, run python diag_tool_automation.py -h.
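The synopsis above can also be assembled programmatically, for example when the script is launched from another tool. The following is a minimal sketch rather than part of the Test Plan Automation package: build_command is a hypothetical helper, the choice lists are copied from the synopsis, and the sketch assumes nothing about which --exec option requires which path.

```python
# Choice lists copied from the usage synopsis above.
CORES = ("gaudi3", "gaudi2")
EXEC_MODES = (
    "tests_name_list", "gen_plan_cfg", "gen_test_cfg",
    "validate_yaml", "validate_test_plan", "run_test_plan",
)


def build_command(core, exec_mode, input_path=None, output_path=None,
                  script="diag_tool_automation.py"):
    """Return an argv list for the automation script (hypothetical helper)."""
    if core not in CORES:
        raise ValueError(f"unsupported core: {core}")
    if exec_mode not in EXEC_MODES:
        raise ValueError(f"unsupported exec mode: {exec_mode}")
    cmd = ["python", script, "--core", core, "--exec", exec_mode]
    # Optional switches are appended only when a value was supplied.
    if input_path is not None:
        cmd += ["--input_path", input_path]
    if output_path is not None:
        cmd += ["--output_path", output_path]
    return cmd
```

The returned list can be passed to subprocess.run(cmd, check=True) on a host where the script is available.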
Validation Reports¶
The following are examples of the validation reports printed when using the --exec validate_yaml and --exec validate_test_plan switches.
Test-specific YAML file validation report (success):
Test-specific YAML file validation report (failure):
Test plan YAML validation report:
Configuration YAML Files¶
YAML files are used to configure the Test Plan Automation. The YAML file types and their customization options are described below.
YAML file examples for Gaudi 3 and Gaudi 2 can be found under the /habanalabs/qual/diag_tool/test_plans/ folder.
The examples provided in this document are for Gaudi 3.
The test plan YAML file defines the general setup of the test plan and provides paths to all test-specific configuration YAML files included in the test plan.
The following is the structure of the test plan YAML file:
General configuration:
Test plan folder path
Test plan name
hl_qual bin folder path
Enable/disable flag for the virtual UART collection process
Test-specific configuration:
Test name - The default YAML file assigns a name based on hl_qual, but it can be modified.
Test YAML file name - When combined with the test plan folder path, it provides access to all test configuration YAML files (e.g., E2E.yaml).
Number of repetitions for the test run.
Pre-test and post-test hooks. These can be any of the following:
Linux script
Linux command
Python script execution
Note
The default test plan can be run, but it is highly limited because many switches in the test-specific configuration YAML files are marked as not_used. Therefore, it is recommended to customize the test plan.
To prevent a test from being executed, one of the following methods can be used:
Comment out the test entry in the test plan YAML file.
Delete the test entry from the test plan YAML file.
Set the test-repeat-no parameter to 0.
Test plan YAML example:
test_plan_yaml_dir: /opt/habanalabs/qual/diag_tool/test_plans/gaudi3/qual_test_plan
test-plan-name: qual_test_plan
bin-Folder: /opt/habanalabs/qual/gaudi3/bin
enable-vuart: true
tests:
- test-name: E2E
  test-yaml-path: E2E.yaml
  test-repeat-no: 3
  pre-run: 'sudo dmesg -C'
  post-run: N/A
- test-name: FunctionalTest_extreme
  test-yaml-path: FunctionalTest_extreme.yaml
  test-repeat-no: 6
  pre-run: 'driver_load_unload.sh'
  post-run: N/A
Customization
The following changes are applicable to the test plan YAML file:
Test plan:
test_plan_yaml_dir - Modify the path to a new location. This is useful when copying the test plan. Make sure that the folder path exists.
test-plan-name - Can be changed. Used only in logs and report printouts.
bin-Folder - Must point to /opt/habanalabs/qual/gaudi3/bin.
enable-vuart - Can be set to either true or false to enable or disable virtual UART collection.
Tests entry:
test-name - Any string. Used as a logical label in logs and reports.
test-yaml-path - Any name, as long as a file with this name exists in test_plan_yaml_dir.
test-repeat-no - Accepts any integer between 0 and 10,000. A value of zero means that the test is not executed.
pre-run or post-run - Any Linux script or command can be placed here.
Test entry commenting and deletion - Each test entry can be deleted, or commented out using #. For example:
# - test-name: FunctionalTest_extreme
#   test-yaml-path: FunctionalTest_extreme.yaml
#   test-repeat-no: 6
#   pre-run: 'driver_load_unload.sh'
#   post-run: N/A
The test-specific YAML file includes environment variables and all relevant switches for the test.
The following is the structure of the test-specific YAML file:
Environment variables - Lists all the available environment variables for the specific test. The entries in this section can be added, removed, or commented out. However, if an entry is unrecognized by hl_qual or any other component called by hl_qual, it may have no effect.
Example:
env-var:
- var-name: ENABLE_CONSOLE
  value: 'true'
  is-used: false
- var-name: LOG_LEVEL_QUAL
  value: '0'
  is-used: false
- var-name: LOG_LEVEL_ALL
  value: 0
  is-used: false
Switches - This section is validated by hl_qual. The validation process consists of two stages: first, the automation validation process converts the YAML file into JSON, and then the JSON file is fed to hl_qual for verification.
Example:
switches:
- switch: -gaudi3
  usage-state: mandatory
  description: core type selection switch
- switch: -c
  usage-state: mandatory
  value: all
  description: 'PCI bus ID, with the applicable range: [all,0000:08:00.0,0000:09:00.0,quad_0]'
- switch: -dis_mon
  usage-state: used
  description: Disables the monitor display (monitoring is still executed)
- switch: -dmesg
  usage-state: not_used
  description: 'Adds running dmesg to the qual report (Note: the dmesg will be cleaned)'
- switch: -enable_serr
  usage-state: used
  description: Enable SERR for ECC error
- switch: -h
  usage-state: not_used
  description: print help guide
- switch: -mon_cfg
  usage-state: not_used
  description: ScreenDrawer INI configuration path
- switch: -rmod
  usage-state: mandatory
  value: parallel
  description: 'Running mode, with the applicable range: [parallel,serial]'
- switch: -f2
  usage-state: mandatory
  description: Functional test selector
- switch: -d
  usage-state: not_used
  description: Download input tensors only once, the same tensors will be used throughout all the iterations of the test
- switch: -enable_ports_check
  usage-state: used
  value: int
  description: 'Enable verifying NIC internal/external ports are up, with the applicable range: [int,all]'
- switch: -l
  usage-state: mandatory
  value: extreme
  description: 'Power level [extreme,high], with the applicable range: [extreme,high]'
- switch: -sensors
  usage-state: used
  value: 15
  description: 'Enable sensors collection, -sensors <sample time in seconds> , with the applicable range: [1 - 3600]'
- switch: -serdes
  usage-state: used
  value: int
  description: 'Enable serdes test [int / ext]. using -serdes ext requires loopback dongles on external ports, with the applicable range: [int,ext]'
- switch: -t
  usage-state: mandatory
  value: 600
  description: 'Execution time in seconds, with the applicable range: [240 - 259200]'
- switch: -toggle
  usage-state: used
  description: check toggling
Customization
The following changes are applicable to the test-specific YAML file:
test-name - Cannot be changed.
env-var or switches - Cannot be deleted, commented out, or modified.
Switch entries:
If the usage-state is mandatory, the entry cannot be deleted or commented out.
If the usage-state is used or not_used, the entry can be deleted or commented out.
Entries marked as used can be changed to not_used, and vice versa.
Switches that take a value must use a value within the predefined range. Refer to the description field for range details.
The description can be edited, as this field is ignored by the Test Plan Automation functionality and serves only as a help guideline.
Bash Scripts¶
The Test Plan Automation functionality provides access to pre-defined bash scripts that perform the following operations:
The hard_reset.sh script performs a hard reset by writing to the appropriate sysfs location. Therefore, the driver must be loaded before using this script:
hard_reset.sh -sleep <int value> -n <int value>
Options:
Option |
Description |
---|---|
-n |
Specifies the number of test trials to verify if the device is operational. The script pauses for 10 seconds between each trial. |
-sleep |
Specifies the number of seconds to wait after all devices are fully operational. |
The driver_reload.sh script attempts to load the driver by first unloading it and then reloading it.
driver_reload.sh -sleep <int value> -n <int value> -timeout_locked <int value> -d
Options:
Option |
Description |
---|---|
-n |
Specifies the number of test trials to verify if the device is operational. The script pauses for 10 seconds between each trial. |
-sleep |
Specifies the number of seconds to wait after all devices are fully operational. |
-timeout_locked |
Required only for Gaudi 2. The expected value is 0. |
-d |
Enables debug mode for the driver. |
Note
Before enabling the virtual UART option, the driver must be loaded only once. Make sure to set enable-vuart: false when using the driver_reload.sh script in the pre-run stage.
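The constraint in the note above can be checked mechanically on a parsed test plan. This is a hedged sketch, not a feature of the tool: vuart_reload_conflicts is a hypothetical helper, and it assumes that matching the script name inside the pre-run string is sufficient to detect a driver reload.

```python
def vuart_reload_conflicts(plan, reload_script="driver_reload.sh"):
    """Return names of tests whose pre-run hook reloads the driver
    while virtual UART collection is enabled."""
    if not plan.get("enable-vuart", False):
        return []
    return [
        t.get("test-name", "<unnamed>")
        for t in plan.get("tests", [])
        if reload_script in str(t.get("pre-run", ""))
    ]
```

An empty result means that no pre-run hook reloads the driver while enable-vuart is true.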
Bash Scripts Usage in a Test Plan¶
The following is an example of a test plan YAML file that includes the bash scripts. In this example, the driver is loaded during the pre-test stage, which occurs before any other test stage, including enabling UART and capturing dmesg logs. During execution, the NIC_base allreduce test is run five times, with a hard reset performed before each test execution.
test_plan_yaml_dir: /home/lbf/test_plans/qual_test_plan
test-plan-name: qual_test_plan
bin-Folder: /opt/habanalabs/qual/gaudi3/bin
enable-vuart: true
pre-test-plan: 'driver_reload.sh -sleep 60 -n 15 -d'
tests:
- test-name: NIC_BASE_COLLECTIVE_allreduce
  test-yaml-path: NIC_BASE_COLLECTIVE_allreduce.yaml
  test-repeat-no: 5
  pre-run: 'hard_reset.sh -sleep 60 -n 15'
  post-run: N/A
Output Folder Structure¶
The output folder for a test plan is generated under the path provided in the run command.
Each test plan run creates a main output folder with the following naming convention: <server_name>_<test_plan_name>_<date_time>.
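When post-processing results, it can help to predict or match the main output folder name. The sketch below follows the naming convention above under two labeled assumptions: the server name is taken from the host name, and the timestamp format is a placeholder, since the exact <date_time> layout is not specified here. output_folder_name is a hypothetical helper.

```python
import socket
from datetime import datetime


def output_folder_name(test_plan_name, server_name=None, now=None,
                       time_format="%Y%m%d_%H%M%S"):
    """Build <server_name>_<test_plan_name>_<date_time>.

    time_format is an assumed placeholder; adjust it to match the
    folders actually produced by a test plan run.
    """
    server_name = server_name or socket.gethostname()
    now = now or datetime.now()
    return f"{server_name}_{test_plan_name}_{now.strftime(time_format)}"
```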
The following shows the main folder structure:
Folder |
Description |
Naming Format |
---|---|---|
Red (Base Folder) |
Root output folder for the test plan run. |
<server_name>_<test_plan_name>_<date_time> |
Green (dmesg.log) |
Log file containing dmesg messages with timestamps collected throughout the test plan execution. |
|
Orange (Test Result Folders) |
Individual folders for each test run. Test name is taken from the test plan YAML file. |
|
Purple (Summary File) |
Summary of all tests executed, including: - Test type - Test name - Test configuration YAML file path - Test status - Test output folder path |
|
Yellow (UART Logs Folder) |
Contains UART log files per device with timestamps. |
|
Test-Specific Folder Structure¶
Each test-specific folder contains results for an individual test and includes the following files:
File |
Description |
Naming Format |
---|---|---|
hl_qual Report File |
Log file generated by hl_qual summarizing the test run. |
|
Clock Logs |
CSV file containing clock samples with timestamps. |
|
Temperature Logs |
CSV file with temperature readings and timestamps. |
|
Power Logs |
CSV file logging power data with timestamps. |
|
Sensor Logs |
Generated using the |
|
Port Toggle Counters |
CSV file with toggle counters per port. |
|