Test Plan Automation¶

The Test Plan Automation functionality is designed to streamline the execution of hl_qual tests by organizing them into structured workflows and allowing them to be configured via YAML files.

The following terms are defined and used throughout this section:

Test plan - A set of hl_qual tests to be run in a single execution flow. A test plan is executed by using configuration YAML files.
Default test plan - A set of hl_qual test configuration YAML files provided in their original form, without any customization.
Default test - A test-specific configuration YAML file provided in its original form, without any customization. The default test includes the test name defined by hl_qual, all applicable environment variables initially marked as not_used, and the test switches categorized as mandatory, used, or not_used.
Virtual UART - A firmware (FW) log stream that is generated for the host via the hl-smi daemon.

The Test Plan Automation functionality contains the following components:

Component	Description
`diag_tool_automation.py`	The main Python script which runs the Test Plan Automation. See Switches and Usage.
`diag_tool.py`	The script that analyzes log files and creates reports. See Log Analysis.
`scripts`	Bash scripts designed for specific tasks. See Bash Scripts.
`test_plans`	Suggested test plans for hl_qual tests, featuring around 25 different test variants. These test plans serve as a starting point for creating customized test plans. See Configuration YAML Files.

Switches and Usage¶

The diag_tool_automation.py script executes all the features included in the Test Plan Automation. The table below lists the switches and options available for the script.

Usage:

```
diag_tool_automation.py [-h] --core {gaudi3,gaudi2} --exec {tests_name_list,gen_plan_cfg,gen_test_cfg,validate_yaml,validate_test_plan,run_test_plan}
[--input_path INPUT_PATH] [--output_path OUTPUT_PATH] [--plan_name PLAN_NAME] [--test_name TEST_NAME] [--postfix POSTFIX]
[--disable_prints DISABLE_PRINTS] [--hpu_only_dmesg] [--enable_sudo_collection]
```

Switch Name	Description
`--core`	Defines the Gaudi core type. Supported options: `gaudi3` `gaudi2`
`--exec`	Executes the Test Plan Automation features. Each supported option is described in its own row below.
`--exec tests_name_list`	Prints a list of all the supported hl_qual tests: `python diag_tool_automation.py --core <gaudi3 | gaudi2> --exec tests_name_list`
`--exec gen_test_cfg`	Generates the default test-specific YAML file. Must be used with the `--test_name` and `--output_path` switches: `python diag_tool_automation.py --core <gaudi3 | gaudi2> --exec gen_test_cfg --test_name FunctionalTest --output_path <Path to the output folder>`
`--exec gen_plan_cfg`	Generates the default test plan, including the test plan configuration YAML file and the YAML files for all hl_qual tests. Must be used with the `--plan_name` and `--output_path` switches: `python diag_tool_automation.py --core <gaudi3 | gaudi2> --exec gen_plan_cfg --plan_name very_long_test_plan --output_path <Path to the output folder>`
`--exec validate_yaml`	Validates the integrity of a test-specific YAML file. Must be used with the `--input_path` switch: `python diag_tool_automation.py --core <gaudi3 | gaudi2> --exec validate_yaml --input_path <Path to YAML file>` Note: The screen printout displays an error report if validation encounters issues, or a decoded command-line output if validation is successful. See Validation Reports.
`--exec validate_test_plan`	Validates the integrity of a test plan YAML file. Must be used with the `--input_path` switch: `python diag_tool_automation.py --core <gaudi3 | gaudi2> --exec validate_test_plan --input_path <Path to YAML file>` Note: The validation process verifies only the test-specific YAML files specified by the main test plan YAML file. The screen printout displays an error report if validation encounters issues, or a decoded command-line output if validation is successful. See Validation Reports.
`--exec run_test_plan`	Runs the specified test plan and collects logs. Must be used with the `--input_path` and `--output_path` switches: `python diag_tool_automation.py --core <gaudi3 | gaudi2> --exec run_test_plan --input_path <Path to YAML file> --output_path <Path to log files>` Note: If `--output_path` is not set, the logs are saved under $HABANA_LOGS. The `test_plan_run_results.yaml` file is saved in the specified `--output_path`. This file contains the results of `--exec run_test_plan` and serves as the input (`-i test_plan_run_results.yaml`) for the Log Analysis.
`--input_path`	Sets the path to the YAML configuration file depending on the selected `--exec` option.
`--output_path`	Sets the path to save the run artifacts depending on the selected `--exec` option.
`--plan_name`	Sets the test plan name.
`--test_name`	Sets the test name during the test-specific YAML file generation.
`--postfix`	Appends a postfix to the test plan name or the test name. If not set, the default name is used: `python diag_tool_automation.py --core <gaudi3 | gaudi2> --exec gen_test_cfg --test_name FunctionalTest --output_path <Path to the output folder> --postfix V2 --hpu_only_dmesg --enable_sudo_collection`
`--disable_prints`	Disables screen printout during the test plan run.
`--hpu_only_dmesg`	Enables the dmesg collector to log only HPU-related messages, specifically messages tagged with `accel` or `habanalabs`: `python diag_tool_automation.py --core <gaudi3 | gaudi2> --exec gen_test_cfg --test_name FunctionalTest --output_path <Path to the output folder> --postfix V2 --hpu_only_dmesg --enable_sudo_collection`
`--enable_sudo_collection`	Enables execution of the dmesg command when superuser permissions are required: `python diag_tool_automation.py --core <gaudi3 | gaudi2> --exec gen_test_cfg --test_name FunctionalTest --output_path <Path to the output folder> --postfix V2 --hpu_only_dmesg --enable_sudo_collection`

Note

For further help, run python diag_tool_automation.py -h.
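
The switch combinations described above can be assembled programmatically. The following is a minimal sketch, not part of the tool: `build_command` and the `REQUIRED` table are hypothetical helpers that only restate the "must be used with" rules from the switch table, and the sketch covers value-taking switches only (flags such as `--hpu_only_dmesg` are left out).

```python
import shlex

# Values accepted by --core and --exec, copied from the usage synopsis.
CORES = ("gaudi3", "gaudi2")
EXEC_MODES = (
    "tests_name_list", "gen_plan_cfg", "gen_test_cfg",
    "validate_yaml", "validate_test_plan", "run_test_plan",
)

# Switches that the table says each --exec option must be used with.
REQUIRED = {
    "gen_test_cfg": ("test_name", "output_path"),
    "gen_plan_cfg": ("plan_name", "output_path"),
    "validate_yaml": ("input_path",),
    "validate_test_plan": ("input_path",),
    "run_test_plan": ("input_path", "output_path"),
}


def build_command(core, exec_mode, **options):
    """Return the argv list for one diag_tool_automation.py invocation."""
    if core not in CORES:
        raise ValueError(f"unsupported core: {core}")
    if exec_mode not in EXEC_MODES:
        raise ValueError(f"unsupported --exec option: {exec_mode}")
    missing = [name for name in REQUIRED.get(exec_mode, ()) if name not in options]
    if missing:
        raise ValueError(f"--exec {exec_mode} also needs: {missing}")
    argv = ["python", "diag_tool_automation.py", "--core", core, "--exec", exec_mode]
    for name, value in options.items():
        argv += [f"--{name}", str(value)]
    return argv


cmd = build_command("gaudi3", "validate_yaml", input_path="E2E.yaml")
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run` on a machine where the script is available; the sketch itself only builds and prints the command.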

Validation Reports¶

The following are examples of the validation reports printed when using the --exec validate_yaml and --exec validate_test_plan switches.

Test-specific YAML file validation report (success):
Test-specific YAML file validation report (failure):
Test plan YAML validation report:

Configuration YAML Files¶

YAML files are used to configure the Test Plan Automation. The YAML file types and their customization options are described below. Example YAML files for Gaudi 3 and Gaudi 2 can be found under the /habanalabs/qual/diag_tool/test_plans/ folder. The examples provided in this document are for Gaudi 3.

Test Plan YAML File

The test plan YAML file defines the general setup of the test plan and provides paths to all test-specific configuration YAML files included in the test plan.

The following is the structure of the test plan YAML file:

General configuration:

Test plan folder path

Test plan name

hl_qual bin folder path

Enable/disable flag for the virtual UART collection process

Test-specific configuration:

Test name - The default YAML file assigns a name based on hl_qual, but it can be modified.

Test YAML file name - When combined with the test plan folder path, it provides access to all test configuration YAML files (e.g., E2E.yaml).

Number of repetitions for the test run.

Pre-test and post-test hooks. These can be any of the following:

Linux script

Linux command

Python script execution

Note

The default test plan can be run, but its coverage is highly limited because many switches in the test-specific configuration YAML files are marked as not_used. Customizing the test plan is therefore recommended.

To prevent a test from being executed, one of the following methods can be used:

Comment out the test entry in the test plan YAML file.

Delete the test entry from the test plan YAML file.

Set the test-repeat-no parameter to 0.

Test plan YAML example:

```
test_plan_yaml_dir: /opt/habanalabs/qual/diag_tool/test_plans/gaudi3/qual_test_plan
test-plan-name: qual_test_plan
bin-Folder: /opt/habanalabs/qual/gaudi3/bin
enable-vuart: true
tests:
- test-name: E2E
  test-yaml-path: E2E.yaml
  test-repeat-no: 3
  pre-run: 'sudo dmesg -C'
  post-run: N/A
- test-name: FunctionalTest_extreme
  test-yaml-path: FunctionalTest_extreme.yaml
  test-repeat-no: 6
  pre-run: 'driver_load_unload.sh'
  post-run: N/A
```
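
To illustrate how the plan fields fit together, the sketch below lists which tests would actually run. It is an assumption-laden illustration: the plan is written as the Python structure a YAML parser would return, `scheduled_runs` is a hypothetical helper, and the second test's repeat count is set to 0 (instead of 6) to show a disabled entry.

```python
import posixpath

# The example plan, expressed as the data a YAML parser would return.
plan = {
    "test_plan_yaml_dir": "/opt/habanalabs/qual/diag_tool/test_plans/gaudi3/qual_test_plan",
    "test-plan-name": "qual_test_plan",
    "bin-Folder": "/opt/habanalabs/qual/gaudi3/bin",
    "enable-vuart": True,
    "tests": [
        {"test-name": "E2E", "test-yaml-path": "E2E.yaml",
         "test-repeat-no": 3, "pre-run": "sudo dmesg -C", "post-run": "N/A"},
        # Repeat count changed to 0 here to demonstrate a skipped test.
        {"test-name": "FunctionalTest_extreme",
         "test-yaml-path": "FunctionalTest_extreme.yaml",
         "test-repeat-no": 0, "pre-run": "driver_load_unload.sh", "post-run": "N/A"},
    ],
}


def scheduled_runs(plan):
    """Return (test name, full YAML path, repetitions) for tests that run."""
    runs = []
    for entry in plan["tests"]:
        if entry["test-repeat-no"] > 0:
            # The test YAML file name is resolved against the plan folder.
            yaml_path = posixpath.join(plan["test_plan_yaml_dir"], entry["test-yaml-path"])
            runs.append((entry["test-name"], yaml_path, entry["test-repeat-no"]))
    return runs


runs = scheduled_runs(plan)
for name, path, repeats in runs:
    print(f"{name}: {repeats} x {path}")
```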

Customization

The following changes are applicable to the test plan YAML file:
Test plan:

test_plan_yaml_dir - Modify the path to a new location. This is useful when copying the test plan. Make sure that the folder path exists.

test-plan-name - Can be changed. Used only in logs and report printouts.

bin-Folder - Must point to /opt/habanalabs/qual/gaudi3/bin.

enable-vuart - Can be set to either True or False to enable or disable virtual UART collection.
Tests entry:
test-name - Any string. Used as a logical label in logs and reports.

test-yaml-path - Any name, as long as a file with this name exists in test_plan_yaml_dir.

test-repeat-no - Accepts any integer between 0 and 10,000. A value of 0 means that the test is not executed.

pre-run or post-run - Any Linux script or command can be placed here.
Test entry commenting and deletion - Each test entry can be deleted, or commented out using #. For example:
```
#  test-name: FunctionalTest_extreme
#  test-yaml-path: FunctionalTest_extreme.yaml
#  test-repeat-no: 6
#  pre-run: 'driver_load_unload.sh'
#  post-run: N/A
```
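
The customization rules above can be checked before a run. The following is a rough sketch only: `check_test_plan` is a hypothetical helper, it is not the tool's own `validate_test_plan` logic, and it covers just the rules stated in this section (existing folder, boolean `enable-vuart`, repeat count from 0 to 10,000, existing test YAML file).

```python
import os
import tempfile


def check_test_plan(plan):
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    plan_dir = plan.get("test_plan_yaml_dir", "")
    if not os.path.isdir(plan_dir):
        problems.append(f"test_plan_yaml_dir does not exist: {plan_dir!r}")
    if not isinstance(plan.get("enable-vuart"), bool):
        problems.append("enable-vuart must be true or false")
    for entry in plan.get("tests", []):
        name = entry.get("test-name", "<unnamed>")
        repeat = entry.get("test-repeat-no")
        # bool is a subclass of int in Python, so reject it explicitly.
        valid_int = isinstance(repeat, int) and not isinstance(repeat, bool)
        if not valid_int or not 0 <= repeat <= 10000:
            problems.append(f"{name}: test-repeat-no must be an integer from 0 to 10000")
        yaml_path = os.path.join(plan_dir, entry.get("test-yaml-path", ""))
        if not os.path.isfile(yaml_path):
            problems.append(f"{name}: missing test YAML file {yaml_path}")
    return problems


# Demonstration on a throwaway folder that holds only E2E.yaml.
with tempfile.TemporaryDirectory() as plan_dir:
    open(os.path.join(plan_dir, "E2E.yaml"), "w").close()
    plan = {
        "test_plan_yaml_dir": plan_dir,
        "enable-vuart": True,
        "tests": [
            {"test-name": "E2E", "test-yaml-path": "E2E.yaml", "test-repeat-no": 3},
            {"test-name": "FunctionalTest_extreme",
             "test-yaml-path": "FunctionalTest_extreme.yaml", "test-repeat-no": 20000},
        ],
    }
    problems = check_test_plan(plan)

for problem in problems:
    print(problem)
```

In this demonstration the first entry passes, while the second is reported twice: once for the out-of-range repeat count and once for the missing YAML file.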

Test-specific YAML File

The test-specific YAML file includes environment variables and all relevant switches for the test.

The following is the structure of the test-specific YAML file:

Environment variables - Lists all the available environment variables for the specific test. The entries in this section can be added, removed, or commented out. However, if an entry is unrecognized by hl_qual or any other component called by hl_qual, it may have no effect.

Example:
```
env-var:
- var-name: ENABLE_CONSOLE
  value: 'true'
  is-used: false
- var-name: LOG_LEVEL_QUAL
  value: '0'
  is-used: false
- var-name: LOG_LEVEL_ALL
  value: 0
  is-used: false
```

Switches - This section is validated by hl_qual. The validation process consists of two stages: first, the automation validation process converts the YAML file into JSON, and then the JSON file is fed to hl_qual for verification.

Example:

```
switches:
- switch: -gaudi3
  usage-state: mandatory
  description: core type selection switch
- switch: -c
  usage-state: mandatory
  value: all
  description: 'PCI bus ID, with the applicable range: [all,0000:08:00.0,0000:09:00.0,quad_0]'
- switch: -dis_mon
  usage-state: used
  description: Disables the monitor display (monitoring is still executed)
- switch: -dmesg
  usage-state: not_used
  description: 'Adds running dmesg to the qual report (Note: the dmesg will be cleaned)'
- switch: -enable_serr
  usage-state: used
  description: Enable SERR for ECC error
- switch: -h
  usage-state: not_used
  description: print help guide
- switch: -mon_cfg
  usage-state: not_used
  description: ScreenDrawer INI configuration path
- switch: -rmod
  usage-state: mandatory
  value: parallel
  description: 'Running mode, with the applicable range: [parallel,serial]'
- switch: -f2
  usage-state: mandatory
  description: Functional test selector
- switch: -d
  usage-state: not_used
  description: Download input tensors only once, the same tensors will be used throughout all the iterations of the test
- switch: -enable_ports_check
  usage-state: used
  value: int
  description: 'Enable verifying NIC internal/external ports are up, with the applicable range: [int,all]'
- switch: -l
  usage-state: mandatory
  value: extreme
  description: 'Power level [extreme,high], with the applicable range: [extreme,high]'
- switch: -sensors
  usage-state: used
  value: 15
  description: 'Enable sensors collection, -sensors <sample time in seconds> , with the applicable range: [1 - 3600]'
- switch: -serdes
  usage-state: used
  value: int
  description: 'Enable serdes test [int / ext]. using -serdes ext requires loopback dongles on external ports, with the applicable range: [int,ext]'
- switch: -t
  usage-state: mandatory
  value: 600
  description: 'Execution time in seconds, with the applicable range: [240 - 259200]'
- switch: -toggle
  usage-state: used
  description: check toggling
```
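
As a rough illustration of how the two sections of a test-specific YAML file might translate into a run, the sketch below collects the environment variables flagged as used and flattens the mandatory and used switches into an argument list. This is an assumption about the mapping, not the tool's documented decoding: `used_env` and `switch_args` are hypothetical helpers, the data is a shortened copy of the examples above, and `ENABLE_CONSOLE` is switched to `is-used: true` for demonstration.

```python
def used_env(config):
    """Collect environment variables whose is-used flag is true."""
    return {
        str(var["var-name"]): str(var["value"])
        for var in config.get("env-var", [])
        if var.get("is-used")
    }


def switch_args(config):
    """Flatten mandatory and used switches into a list of arguments."""
    args = []
    for entry in config.get("switches", []):
        if entry["usage-state"] == "not_used":
            continue  # not_used switches are left off the command line
        args.append(entry["switch"])
        if "value" in entry:
            args.append(str(entry["value"]))
    return args


config = {
    "env-var": [
        {"var-name": "ENABLE_CONSOLE", "value": "true", "is-used": True},
        {"var-name": "LOG_LEVEL_ALL", "value": 0, "is-used": False},
    ],
    "switches": [
        {"switch": "-gaudi3", "usage-state": "mandatory"},
        {"switch": "-c", "usage-state": "mandatory", "value": "all"},
        {"switch": "-dmesg", "usage-state": "not_used"},
        {"switch": "-rmod", "usage-state": "mandatory", "value": "parallel"},
        {"switch": "-f2", "usage-state": "mandatory"},
        {"switch": "-l", "usage-state": "mandatory", "value": "extreme"},
        {"switch": "-t", "usage-state": "mandatory", "value": 600},
    ],
}

env = used_env(config)
args = switch_args(config)
print(env)
print(" ".join(args))
```

The actual decoded command line printed by `--exec validate_yaml` may order or format the switches differently.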

Customization

The following changes are applicable to the test-specific YAML file:

test-name - Cannot be changed.

env-var or switches - Cannot be deleted, commented out, or modified.

Switch entries:

If the usage-state is mandatory, the entry cannot be deleted or commented out.

If the usage-state is used or not_used, the entry can be deleted or commented out.

Entries marked as used can be changed to not_used, and vice versa.

For switches that take a value, the value must fall within a predefined range. Refer to the description field for range details.

The description can be edited, as this field is ignored by the Test Plan Automation functionality and serves only as a help guideline.
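
Because the allowed range is stated only in the description text, a quick pre-check can parse it from there. The sketch below assumes the wording "with the applicable range: [...]" seen in the examples above stays stable; `value_in_range` is a hypothetical helper and does not replace the validation performed by hl_qual.

```python
import re

# Matches the bracketed list that follows "applicable range:".
RANGE_RE = re.compile(r"applicable range:\s*\[([^\]]*)\]")
# Matches a numeric span such as "240 - 259200".
NUMERIC_RE = re.compile(r"^\s*(\d+)\s*-\s*(\d+)\s*$")


def value_in_range(entry):
    """Return True or False if a range is stated for a valued switch, else None."""
    match = RANGE_RE.search(entry.get("description", ""))
    if match is None or "value" not in entry:
        return None
    spec = match.group(1)
    numeric = NUMERIC_RE.match(spec)
    if numeric:
        low, high = int(numeric.group(1)), int(numeric.group(2))
        try:
            return low <= int(entry["value"]) <= high
        except (TypeError, ValueError):
            return False
    allowed = [item.strip() for item in spec.split(",")]
    return str(entry["value"]) in allowed


time_switch = {
    "switch": "-t", "usage-state": "mandatory", "value": 600,
    "description": "Execution time in seconds, with the applicable range: [240 - 259200]",
}
level_switch = {
    "switch": "-l", "usage-state": "mandatory", "value": "low",
    "description": "Power level [extreme,high], with the applicable range: [extreme,high]",
}
print(value_in_range(time_switch))
print(value_in_range(level_switch))
```

Here the `-t` entry is accepted, while the `-l` entry is rejected because `low` is not one of the listed power levels.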

Bash Scripts¶

The Test Plan Automation functionality provides access to pre-defined bash scripts that perform the following operations:

hard_reset

The hard_reset.sh script performs the reset by writing to the appropriate sysfs location. Therefore, the driver must be loaded before using this script:

hard_reset.sh -sleep <int value> -n <int value>

Options:

Option	Description
`-n <int value>`	Specifies the number of test trials to verify if the device is operational. The script pauses for 10 seconds between each trial.
`-sleep <int value>`	Specifies the number of seconds to wait after all devices are fully operational.
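
When choosing `-n` and `-sleep`, it can help to estimate how much waiting the script may add before a test starts. The sketch below is a back-of-the-envelope calculation under two assumptions: the 10-second pause occurs only between trials, and the time each trial itself takes is ignored. Both function names are hypothetical.

```python
PAUSE_BETWEEN_TRIALS_S = 10  # stated in the option table above


def hard_reset_command(trials, sleep_s):
    """Build the hook string for a pre-run or post-run entry."""
    return f"hard_reset.sh -sleep {sleep_s} -n {trials}"


def max_added_delay_s(trials, sleep_s):
    """Upper bound on pause plus sleep time, excluding the trials themselves."""
    return max(trials - 1, 0) * PAUSE_BETWEEN_TRIALS_S + sleep_s


command = hard_reset_command(trials=15, sleep_s=60)
delay = max_added_delay_s(trials=15, sleep_s=60)
print(command)
print(delay)
```

With 15 trials and a 60-second sleep, the estimate is 14 pauses of 10 seconds plus the final 60 seconds, that is, 200 seconds.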

driver_reload

The driver_reload.sh script reloads the driver by first unloading it and then loading it again.

driver_reload.sh -sleep <int value> -n <int value> -timeout_locked <int value> -d

Options:

Option	Description
`-n <int value>`	Specifies the number of test trials to verify if the device is operational. The script pauses for 10 seconds between each trial.
`-sleep <int value>`	Specifies the number of seconds to wait after all devices are fully operational.
`-timeout_locked <int value>`	Required only for Gaudi 2. The expected value is 0.
`-d`	Enables debug mode for the driver.

Note

Before enabling the virtual UART option, the driver must be loaded only once. Make sure to set enable-vuart: false when using the driver_reload.sh script in the pre-run stage.
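
The constraint in this note can be checked mechanically before a run. The following sketch is illustrative only: `vuart_conflicts` is a hypothetical helper, the plan is given as already-parsed data, and only the per-test pre-run hooks are inspected.

```python
def vuart_conflicts(plan):
    """Return names of tests whose pre-run reloads the driver while vUART is on."""
    if not plan.get("enable-vuart"):
        return []
    return [
        entry.get("test-name", "<unnamed>")
        for entry in plan.get("tests", [])
        if "driver_reload.sh" in str(entry.get("pre-run", ""))
    ]


plan = {
    "enable-vuart": True,
    "tests": [
        {"test-name": "E2E", "pre-run": "driver_reload.sh -sleep 60 -n 15"},
        {"test-name": "FunctionalTest_extreme", "pre-run": "sudo dmesg -C"},
    ],
}
conflicts = vuart_conflicts(plan)
print(conflicts)
```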

Bash Scripts Usage in a Test Plan¶

The following is an example of a test plan YAML file that includes the bash scripts. In this example, the driver is reloaded during the pre-test-plan stage, which occurs before any other test stage, including enabling the virtual UART and capturing dmesg logs. During execution, the NIC_BASE_COLLECTIVE_allreduce test is run five times, with a hard reset performed before each test execution.

```
test_plan_yaml_dir: /home/lbf/test_plans/qual_test_plan
test-plan-name: qual_test_plan
bin-Folder: /opt/habanalabs/qual/gaudi3/bin
enable-vuart: true
pre-test-plan: 'driver_reload.sh -sleep 60 -n 15 -d'
tests:
- test-name: NIC_BASE_COLLECTIVE_allreduce
  test-yaml-path: NIC_BASE_COLLECTIVE_allreduce.yaml
  test-repeat-no: 5
  pre-run: 'hard_reset.sh -sleep 60 -n 15'
  post-run: N/A
```
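
The ordering described for this example can be made explicit by expanding the plan into a flat list of steps. This is a sketch under assumptions: `execution_steps` and `is_set` are hypothetical helpers, `N/A` is treated as "no hook", and post-run hooks are assumed to run after every repetition in the same way that pre-run hooks run before each one.

```python
def is_set(hook):
    """Treat a missing hook or the literal N/A as 'no hook'."""
    return hook not in (None, "", "N/A")


def execution_steps(plan):
    """Expand a test plan into an ordered list of (stage, command or test)."""
    steps = []
    if is_set(plan.get("pre-test-plan")):
        steps.append(("pre-test-plan", plan["pre-test-plan"]))
    for entry in plan.get("tests", []):
        for rep in range(1, entry.get("test-repeat-no", 0) + 1):
            if is_set(entry.get("pre-run")):
                steps.append(("pre-run", entry["pre-run"]))
            steps.append(("test", f"{entry['test-name']} (run {rep})"))
            if is_set(entry.get("post-run")):
                steps.append(("post-run", entry["post-run"]))
    return steps


plan = {
    "enable-vuart": True,
    "pre-test-plan": "driver_reload.sh -sleep 60 -n 15 -d",
    "tests": [
        {"test-name": "NIC_BASE_COLLECTIVE_allreduce",
         "test-yaml-path": "NIC_BASE_COLLECTIVE_allreduce.yaml",
         "test-repeat-no": 5,
         "pre-run": "hard_reset.sh -sleep 60 -n 15",
         "post-run": "N/A"},
    ],
}

steps = execution_steps(plan)
for stage, what in steps[:3]:
    print(f"{stage}: {what}")
```

For this plan the list holds 11 steps: one driver reload, followed by five pairs of hard reset and test run.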

Output Folder Structure¶

The output folder for a test plan is generated under the path provided in the run command. Each test plan run creates a main output folder with the following naming convention: <server_name>_<test_plan_name>_<date_time>. The following shows the main folder structure:

Item	Description	Naming Format
Base folder	Root output folder for the test plan run.	`<server_name>_<test_plan_name>_<date_time>`
dmesg log	Log file containing dmesg messages with timestamps collected throughout the test plan execution.	`dmesg.log`
Test result folders	Individual folders for each test run. The test name is taken from the test plan YAML file.	`<test_name>_<rep_num>_<date_time>`
Summary file	Summary of all tests executed, including the test type, test name, test configuration YAML file path, test status, and test output folder path.	`test_plan_run_results_<date_time>`
UART logs folder	Contains UART log files per device with timestamps.	`uart_<bus_id>.log`
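
The naming formats above are enough to locate the plan-level artifacts of a run with simple glob patterns. The sketch below is illustrative: `find_artifacts` is a hypothetical helper, the folder and file names in the demonstration (including the date format, the `uart_logs` folder name, and the bus ID spelling) are made up, and the UART logs are searched recursively because the name of their folder is not stated here.

```python
import tempfile
from pathlib import Path


def find_artifacts(run_dir):
    """Locate the plan-level artifacts named in the table above."""
    run_dir = Path(run_dir)
    return {
        "dmesg": sorted(p.name for p in run_dir.glob("dmesg.log")),
        "summary": sorted(p.name for p in run_dir.glob("test_plan_run_results*")),
        # The UART folder name is an unknown here, so search recursively.
        "uart": sorted(p.name for p in run_dir.rglob("uart_*.log")),
    }


# Build a throwaway folder tree with made-up names to exercise the function.
with tempfile.TemporaryDirectory() as tmp:
    run_dir = Path(tmp) / "server1_qual_test_plan_2025-01-01_12-00-00"
    (run_dir / "uart_logs").mkdir(parents=True)
    (run_dir / "dmesg.log").touch()
    (run_dir / "test_plan_run_results_2025-01-01_12-00-00").touch()
    (run_dir / "uart_logs" / "uart_0000_08_00.0.log").touch()
    artifacts = find_artifacts(run_dir)

print(artifacts)
```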

Test-Specific Folder Structure¶

Each test-specific folder contains results for an individual test and includes the following files:

File	Description	Naming Format
hl_qual Report File	Log file generated by hl_qual summarizing the test run.	`<server_name>_hl_qual_report_<day>_<month>_<time>_<year>.log`
Clock Logs	CSV file containing clock samples with timestamps.	`clock_log_<bus_id>.csv`
Temperature Logs	CSV file with temperature readings and timestamps.	`temp_log_<bus_id>.csv`
Power Logs	CSV file logging power data with timestamps.	`power_log_<bus_id>.csv`
Sensor Logs	Generated using the `-sensors` switch in hl_qual. Contains sensor data with timestamps as generated by the lm-sensors Linux command (minimum interval: 10s).	`sensors_HL325L-pci-2100.txt`
Port Toggle Counters	CSV file with toggle counters per port.	`toggle_counters_<bus_id>.csv`
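
Since most files in a test-specific folder carry the bus ID in their name, they can be grouped per device before analysis. The following is a small sketch based only on the naming formats in the table; `logs_by_device` is a hypothetical helper and the bus IDs in the demonstration are made up.

```python
import re
from collections import defaultdict

# Per-device CSV files listed in the table: <kind>_<bus_id>.csv
LOG_RE = re.compile(r"^(clock_log|temp_log|power_log|toggle_counters)_(.+)\.csv$")


def logs_by_device(filenames):
    """Group per-device CSV log files by bus ID, keyed by log kind."""
    grouped = defaultdict(dict)
    for name in filenames:
        match = LOG_RE.match(name)
        if match:
            kind, bus_id = match.groups()
            grouped[bus_id][kind] = name
    return dict(grouped)


files = [
    "clock_log_0000:08:00.0.csv",
    "temp_log_0000:08:00.0.csv",
    "power_log_0000:09:00.0.csv",
    "toggle_counters_0000:09:00.0.csv",
    "sensors_HL325L-pci-2100.txt",  # not a per-device CSV, so it is skipped
]
grouped = logs_by_device(files)
for bus_id, logs in grouped.items():
    print(bus_id, sorted(logs))
```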

Gaudi Documentation 1.21.1 documentation
