Test Plan Automation

The Test Plan Automation functionality is designed to streamline the execution of hl_qual tests by organizing them into structured workflows and allowing them to be configured via YAML files.

The following terms are defined and used throughout this section:

  • Test plan - A set of hl_qual tests to be run in a single execution flow. A test plan is executed by using configuration YAML files.

  • Default test plan - A set of hl_qual test configuration YAML files provided in their original form, without any customization.

  • Default test - A test-specific configuration YAML file provided in its original form, without any customization. The default test includes the test name defined by hl_qual, all applicable environment variables initially marked as not_used, and the test switches categorized as mandatory, used, or not_used.

  • Virtual UART - A firmware (FW) log stream generated for the host via the hl-smi daemon.

The Test Plan Automation functionality contains the following components:

  • diag_tool_automation.py - The main Python script which runs the Test Plan Automation. See Switches and Usage.

  • diag_tool.py - The script that analyzes log files and creates reports. See Log Analysis.

  • scripts - Bash scripts designed for specific tasks. See Bash Scripts.

  • test_plans - Suggested test plans for hl_qual tests, featuring around 25 different test variants. These test plans serve as a starting point for creating customized test plans. See Configuration YAML Files.

Switches and Usage

The diag_tool_automation.py script executes all the features included in the Test Plan Automation. The table below lists the switches and options available for the script.

Example:

diag_tool_automation.py [-h] --core {gaudi3,gaudi2} --exec {tests_name_list,gen_plan_cfg,gen_test_cfg,validate_yaml,validate_test_plan,run_test_plan}
[--input_path INPUT_PATH] [--output_path OUTPUT_PATH] [--plan_name PLAN_NAME] [--test_name TEST_NAME]
[--postfix POSTFIX] [--disable_prints DISABLE_PRINTS]

Switch Name

Description

--core

Defines the Gaudi core type. Supported options:

  • gaudi3

  • gaudi2

--exec

Executes the Test Plan Automation features. Supported options:

  • tests_name_list - Prints out a list of all the supported hl_qual tests:

    python diag_tool_automation.py --core <gaudi3 | gaudi2>
    --exec tests_name_list
    
  • gen_test_cfg - Generates the default test-specific YAML file. Must be used with the --test_name and --output_path switches:

    python diag_tool_automation.py --core <gaudi3 | gaudi2>
    --exec gen_test_cfg --test_name FunctionalTest
    --output_path <Path to the output folder>
    
  • gen_plan_cfg - Generates the default test plan including the test plan configuration YAML file and YAML files for all hl_qual tests. Must be used with the --plan_name and --output_path switches:

    python diag_tool_automation.py --core <gaudi3 | gaudi2>
    --exec gen_plan_cfg --plan_name very_long_test_plan
    --output_path <Path to the output folder>
    
  • validate_yaml - Validates the test-specific YAML file integrity. Must be used with the --input_path switch:

    python diag_tool_automation.py --core <gaudi3 | gaudi2>
    --exec validate_yaml --input_path <Path to YAML file>
    

    Note

    The screen printout displays an error report if validation encounters issues or a decoded command-line output if validation is successful. See Validation Reports.

  • validate_test_plan - Validates the test plan YAML file integrity. Must be used with the --input_path switch:

    python diag_tool_automation.py --core <gaudi3 | gaudi2>
    --exec validate_test_plan --input_path <Path to YAML file>
    

    Note

    • The validation process verifies only the test-specific YAML files specified by the main test plan YAML file.

    • The screen printout displays an error report if validation encounters issues or a decoded command-line output if validation is successful. See Validation Reports.

  • run_test_plan - Runs the specified test plans and collects logs. Must be used with the --input_path and --output_path switches:

    python diag_tool_automation.py --core <gaudi3 | gaudi2>
    --exec run_test_plan --input_path <Path to YAML file>
    --output_path <Path to log files>
    

    Note

    • If the --output_path is not set, the logs are saved under $HABANA_LOGS.

    • The test_plan_run_results.yaml file is saved in the specified --output_path. This file contains the results of the --exec run_test_plan run and serves as the input (-i test_plan_run_results.yaml) for the Log Analysis.

--input_path

Sets the path to the YAML configuration file depending on the selected --exec option.

--output_path

Sets the path to save the run artifacts depending on the selected --exec option.

--plan_name

Sets the test plan name.

--test_name

Sets the test name during the test-specific YAML file generation.

--postfix

Adjusts the test plan name or the test name. If not set, the default name is used:

python diag_tool_automation.py --core <gaudi3 | gaudi2>
--exec gen_test_cfg --test_name FunctionalTest
--output_path <Path to the output folder> --postfix V2

--disable_prints

Disables screen printout during the test plan run.

Note

For further help, run python diag_tool_automation.py -h.
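
The switch combinations listed above can be sketched as a small Python helper that assembles one invocation and rejects a missing companion switch before the script is called. This is only an illustration: build_command and the /tmp/out path are hypothetical names, and only the switches documented in this section are used.

```python
import shlex

# Companion switches each --exec option is documented to need.
REQUIRED = {
    "tests_name_list": [],
    "gen_test_cfg": ["test_name", "output_path"],
    "gen_plan_cfg": ["plan_name", "output_path"],
    "validate_yaml": ["input_path"],
    "validate_test_plan": ["input_path"],
    # --output_path is left optional here because the logs fall back to $HABANA_LOGS.
    "run_test_plan": ["input_path"],
}


def build_command(core, exec_mode, **switches):
    """Return the argument list for one diag_tool_automation.py invocation."""
    if core not in ("gaudi3", "gaudi2"):
        raise ValueError(f"unsupported core: {core}")
    if exec_mode not in REQUIRED:
        raise ValueError(f"unsupported --exec option: {exec_mode}")
    missing = [name for name in REQUIRED[exec_mode] if name not in switches]
    if missing:
        raise ValueError(f"--exec {exec_mode} also needs: {missing}")
    argv = ["python", "diag_tool_automation.py", "--core", core, "--exec", exec_mode]
    for name, value in switches.items():
        argv += [f"--{name}", str(value)]
    return argv


# Hypothetical output folder, used only for illustration.
cmd = build_command("gaudi3", "gen_test_cfg",
                    test_name="FunctionalTest", output_path="/tmp/out", postfix="V2")
print(shlex.join(cmd))
```

The resulting list can be passed to subprocess.run as-is, which avoids shell quoting issues in paths.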

Validation Reports

The following are examples of the validation reports printed when using the --exec validate_yaml and --exec validate_test_plan switches.

  • Test-specific YAML file validation report (success):

    [Image: validate_yaml_passed.jpg]

  • Test-specific YAML file validation report (failure):

    [Image: validate_yaml.jpg]

  • Test plan YAML validation report:

    [Image: validating_test_plan.jpg]

Configuration YAML Files

YAML files are used to configure the Test Plan Automation. The YAML file types and their customization options are described below.

The test plan YAML file defines the general setup of the test plan and provides paths to all test-specific configuration YAML files included in the test plan.

The following is the structure of the test plan YAML file:

  • General configuration:

    • Test plan folder path

    • Test plan name

    • hl_qual bin folder path

    • Enable/disable flag for the virtual UART collection process

  • Test-specific configuration:

    • Test name - The default YAML file assigns a name based on hl_qual, but it can be modified.

    • Test YAML file name - When combined with the test plan folder path, it provides access to all test configuration YAML files (e.g., E2E.yaml).

    • Number of repetitions for the test run.

    • Pre-test and post-test hooks. These can be any of the following:

      • Linux script

      • Linux command

      • Python script execution

Note

The default test plan can be run, but its coverage is highly limited because many switches in the test-specific configuration YAML files are marked as not_used. Customizing the test plan is therefore recommended.

Test plan YAML example:

test_plan_yaml_dir: /home/lab/trees/npu-stack/qual/diag_tool/automation/test_plans/qual_internal_test_plan/
test-plan-name: qual_internal_test_plan
bin-Folder: /home/lab/builds/qual_release_build/gaudi3/bin
enable-vuart: true
tests:
- test-name: E2E
  test-yaml-path: E2E.yaml
  test-repeat-no: 3
  pre-run: 'sudo dmesg -C'
  post-run: N/A
- test-name: FunctionalTest_extreme
  test-yaml-path: FunctionalTest_extreme.yaml
  test-repeat-no: 6
  pre-run: 'driver_load_unload.sh'
  post-run: N/A

Customization

The following changes are applicable to the test plan YAML file:

  • Test plan:

    • test_plan_yaml_dir - Modify the path to a new location. This is useful when copying the test plan. Make sure that the folder path exists.

    • test-plan-name - Can be changed. Used only in logs and report printouts.

    • bin-Folder - Must point to /opt/habanalabs/qual/gaudi3/bin.

    • enable-vuart - Can be set to either True or False to enable or disable virtual UART collection.

  • Tests entry:

    • test-name - Any string. Used as a logical label in logs and reports.

    • test-yaml-path - Any name, as long as a file with this name exists in test_plan_yaml_dir.

    • test-repeat-no - Accepts any integer between 1 and 10,000.

    • pre-run or post-run - Any Linux script or command can be placed here.

    • Test entry commenting and deletion - Each test entry can be deleted, or commented out using #. For example:

      #  test-name: FunctionalTest_extreme
      #  test-yaml-path: FunctionalTest_extreme.yaml
      #  test-repeat-no: 6
      #  pre-run: 'driver_load_unload.sh'
      #  post-run: N/A
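
The customization rules above can be checked locally before the plan is handed to the tool. The sketch below assumes the test plan has already been parsed into a plain Python dict (for example with a YAML library). check_test_plan is a hypothetical helper and does not replace --exec validate_test_plan.

```python
import os
import tempfile


def check_test_plan(plan):
    """Return a list of problems found in a parsed test plan (a plain dict)."""
    problems = []
    yaml_dir = plan.get("test_plan_yaml_dir", "")
    if not os.path.isdir(yaml_dir):
        problems.append(f"test_plan_yaml_dir does not exist: {yaml_dir}")
    if not isinstance(plan.get("enable-vuart"), bool):
        problems.append("enable-vuart must be true or false")
    for test in plan.get("tests", []):
        name = test.get("test-name", "<unnamed>")
        repeat = test.get("test-repeat-no")
        # bool is a subclass of int in Python, so it is excluded explicitly.
        if isinstance(repeat, bool) or not isinstance(repeat, int) or not 1 <= repeat <= 10000:
            problems.append(f"{name}: test-repeat-no must be an integer between 1 and 10,000")
        yaml_path = os.path.join(yaml_dir, test.get("test-yaml-path", ""))
        if not os.path.isfile(yaml_path):
            problems.append(f"{name}: test-yaml-path not found in test_plan_yaml_dir")
    return problems


# Demonstration in a temporary folder that contains only E2E.yaml.
with tempfile.TemporaryDirectory() as yaml_dir:
    open(os.path.join(yaml_dir, "E2E.yaml"), "w").close()
    plan = {
        "test_plan_yaml_dir": yaml_dir,
        "test-plan-name": "demo_plan",
        "enable-vuart": True,
        "tests": [
            {"test-name": "E2E", "test-yaml-path": "E2E.yaml", "test-repeat-no": 3},
            {"test-name": "FunctionalTest_extreme",
             "test-yaml-path": "FunctionalTest_extreme.yaml", "test-repeat-no": 0},
        ],
    }
    problems = check_test_plan(plan)

for problem in problems:
    print(problem)
```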
      

The test-specific YAML file includes environment variables and all relevant switches for the test.

The following is the structure of the test-specific YAML file:

  • Environment variables - Lists all the available environment variables for the specific test. The entries in this section can be added, removed, or commented out. However, if an entry is unrecognized by hl_qual or any other component called by hl_qual, it may have no effect.

    Example:

    env-var:
    - var-name: ENABLE_CONSOLE
      value: 'true'
      is-used: false
    - var-name: LOG_LEVEL_QUAL
      value: '0'
      is-used: false
    - var-name: LOG_LEVEL_ALL
      value: 0
      is-used: false
    
  • Switches - This section is validated by hl_qual. The validation process consists of two stages: first, the automation validation process converts the YAML file into JSON, and then the JSON file is fed to hl_qual for verification.

    Example:

    switches:
    - switch: -gaudi3
      usage-state: mandatory
      description: core type selection switch
    - switch: -c
      usage-state: mandatory
      value: all
      description: 'PCI bus ID, with the applicable range: [all,0000:08:00.0,0000:09:00.0,quad_0]'
    - switch: -dis_mon
      usage-state: used
      description: Disables the monitor display (monitoring is still executed)
    - switch: -dmesg
      usage-state: not_used
      description: 'Adds running dmesg to the qual report (Note: the dmesg will be cleaned)'
    - switch: -enable_serr
      usage-state: used
      description: Enable SERR for ECC error
    - switch: -h
      usage-state: not_used
      description: print help guide
    - switch: -mon_cfg
      usage-state: not_used
      description: ScreenDrawer INI configuration path
    - switch: -rmod
      usage-state: mandatory
      value: parallel
      description: 'Running mode, with the applicable range: [parallel,serial]'
    - switch: -f2
      usage-state: mandatory
      description: Functional test selector
    - switch: -d
      usage-state: not_used
      description: Download input tensors only once, the same tensors will be used throughout all the iterations of the test
    - switch: -enable_ports_check
      usage-state: used
      value: int
      description: 'Enable verifying NIC internal/external ports are up, with the applicable range: [int,all]'
    - switch: -l
      usage-state: mandatory
      value: extreme
      description: 'Power level [extreme,high], with the applicable range: [extreme,high]'
    - switch: -sensors
      usage-state: used
      value: 15
      description: 'Enable sensors collection, -sensors <sample time in seconds> , with the applicable range: [1 - 3600]'
    - switch: -serdes
      usage-state: used
      value: int
      description: 'Enable serdes test [int / ext]. using -serdes ext requires loopback dongles on external ports, with the applicable range: [int,ext]'
    - switch: -t
      usage-state: mandatory
      value: 600
      description: 'Execution time in seconds, with the applicable range: [240 - 259200]'
    - switch: -toggle
      usage-state: used
      description: check toggling
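
To make the three usage states concrete, the sketch below turns a list of switch entries into a command-line argument list: mandatory and used entries are emitted, and not_used entries are skipped. This is an assumed reading of how the entries map to hl_qual arguments, not the tool's actual decoder, and decode_switches is a hypothetical name.

```python
def decode_switches(switches):
    """Build an argument list from switch entries, skipping not_used ones."""
    args = []
    for entry in switches:
        if entry["usage-state"] == "not_used":
            continue
        args.append(entry["switch"])
        # Only some switches carry a value (for example -c all or -t 600).
        if "value" in entry:
            args.append(str(entry["value"]))
    return args


# A shortened version of the switches example, as parsed Python data.
switches = [
    {"switch": "-gaudi3", "usage-state": "mandatory"},
    {"switch": "-c", "usage-state": "mandatory", "value": "all"},
    {"switch": "-dis_mon", "usage-state": "used"},
    {"switch": "-dmesg", "usage-state": "not_used"},
    {"switch": "-t", "usage-state": "mandatory", "value": 600},
]
print(" ".join(decode_switches(switches)))
# -gaudi3 -c all -dis_mon -t 600
```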
    

Customization

The following changes are applicable to the test-specific YAML file:

  • test-name - Cannot be changed.

  • env-var or switches - These section keys cannot be deleted, commented out, or modified. The entries under them can be edited as described in this section.

  • Switch entries:

    • If the usage-state is mandatory, the entry cannot be deleted or commented out.

    • If the usage-state is used or not_used, the entry can be deleted or commented out.

    • Entries marked as used can be changed to not_used, and vice versa.

  • For switches that take a value, the value must fall within a predefined range. Refer to the description field for range details.

  • The description can be edited, as this field is ignored by the Test Plan Automation functionality and serves only as a help guideline.
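
Because the allowed range is written into each description field, a value can be sanity-checked before running --exec validate_yaml. The sketch below assumes the "with the applicable range: [...]" wording shown in the switches example and treats comma-separated ranges as an exhaustive list of choices, which may be too strict for switches such as -c. parse_range and value_in_range are hypothetical helpers.

```python
import re


def parse_range(description):
    """Extract the range from a description, or None if it has none."""
    match = re.search(r"applicable range:\s*\[([^\]]*)\]", description)
    if match is None:
        return None
    body = match.group(1)
    numeric = re.fullmatch(r"\s*(\d+)\s*-\s*(\d+)\s*", body)
    if numeric:
        return ("numeric", int(numeric.group(1)), int(numeric.group(2)))
    return ("choice", [item.strip() for item in body.split(",")])


def value_in_range(entry):
    """Return True if the entry's value fits the range in its description."""
    spec = parse_range(entry.get("description", ""))
    if spec is None or "value" not in entry:
        return True  # nothing to check
    if spec[0] == "numeric":
        try:
            value = int(entry["value"])
        except (TypeError, ValueError):
            return False
        return spec[1] <= value <= spec[2]
    return str(entry["value"]) in spec[1]


time_switch = {
    "switch": "-t",
    "value": 100,
    "description": "Execution time in seconds, with the applicable range: [240 - 259200]",
}
level_switch = {
    "switch": "-l",
    "value": "extreme",
    "description": "Power level [extreme,high], with the applicable range: [extreme,high]",
}
print(value_in_range(time_switch))   # False, 100 is below 240
print(value_in_range(level_switch))  # True
```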

Bash Scripts

The Test Plan Automation functionality provides access to pre-defined bash scripts that perform the following operations:

The hard_reset.sh script performs operations by writing to the appropriate sysfs location. Therefore, the driver must be loaded before using this script:

hard_reset.sh -sleep <int value> -n <int value>

Options:

  • -n <int value> - Specifies the number of test trials to verify if the device is operational. The script pauses for 10 seconds between each trial.

  • -sleep <int value> - Specifies the number of seconds to wait after all devices are fully operational.
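
The interaction between -n and -sleep can be pictured as a polling loop: up to n checks with a 10-second pause between them, followed by a settle delay once the devices respond. The sketch below only models that timing and is not the script's implementation. wait_until_operational and the is_operational callback are hypothetical names, and the sleep function is injectable so that the example runs instantly.

```python
import time


def wait_until_operational(is_operational, trials, settle_seconds,
                           pause_seconds=10, sleep=time.sleep):
    """Poll up to `trials` times, then wait `settle_seconds` on success."""
    for attempt in range(1, trials + 1):
        if is_operational():
            sleep(settle_seconds)  # corresponds to -sleep
            return True
        if attempt < trials:
            sleep(pause_seconds)   # pause between trials
    return False


# Simulated device that becomes operational on the third check.
states = iter([False, False, True])
pauses = []
ok = wait_until_operational(lambda: next(states), trials=15,
                            settle_seconds=60, sleep=pauses.append)
print(ok, pauses)
# True [10, 10, 60]
```

With -sleep 60 -n 15, as used later in this section, the worst case under this model is 14 pauses of 10 seconds before the script gives up.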

The driver_reload.sh script attempts to load the driver by first unloading it and then reloading it.

driver_reload.sh -sleep <int value> -n <int value> -timeout_locked <int value> -d

Options:

  • -n <int value> - Specifies the number of test trials to verify if the device is operational. The script pauses for 10 seconds between each trial.

  • -sleep <int value> - Specifies the number of seconds to wait after all devices are fully operational.

  • -timeout_locked <int value> - Required only for Gaudi 2. The expected value is 0.

  • -d - Enables debug mode for the driver.

Note

Before enabling the virtual UART option, make sure the driver is loaded only once. Set enable-vuart: false when using the driver_reload.sh script in the pre-run stage.
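
The restriction in this note is easy to overlook in a long plan, so it can be checked mechanically. The sketch below assumes the test plan has been parsed into a plain Python dict. vuart_reload_conflicts is a hypothetical helper that lists the test entries whose pre-run hook calls driver_reload.sh while enable-vuart is set to true.

```python
def vuart_reload_conflicts(plan):
    """Return test names whose pre-run reloads the driver while vUART is on."""
    if not plan.get("enable-vuart"):
        return []
    return [
        test.get("test-name", "<unnamed>")
        for test in plan.get("tests", [])
        if "driver_reload.sh" in str(test.get("pre-run", ""))
    ]


plan = {
    "enable-vuart": True,
    "tests": [
        {"test-name": "E2E", "pre-run": "sudo dmesg -C"},
        {"test-name": "FunctionalTest_extreme",
         "pre-run": "driver_reload.sh -sleep 60 -n 15"},
    ],
}
print(vuart_reload_conflicts(plan))
# ['FunctionalTest_extreme']
```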

Bash Scripts Usage in a Test Plan

The following is an example of a test plan YAML file that uses the bash scripts. In this example, the driver is reloaded by the pre-test-plan hook, which runs before any other stage, including enabling virtual UART and capturing dmesg logs. During execution, the NIC_BASE_COLLECTIVE_allreduce test is run five times, with a hard reset performed before each run.

test_plan_yaml_dir: /home/lbf/test_plans/qual_test_plan
test-plan-name: qual_test_plan
bin-Folder: /opt/habanalabs/qual/gaudi3/bin
enable-vuart: true
pre-test-plan: 'driver_reload.sh -sleep 60 -n 15 -d'
tests:
- test-name: NIC_BASE_COLLECTIVE_allreduce
  test-yaml-path: NIC_BASE_COLLECTIVE_allreduce.yaml
  test-repeat-no: 5
  pre-run: 'hard_reset.sh -sleep 60 -n 15'
  post-run: N/A