Rack Scale Script

The Rack Scale script for the Diagnostic tool runs the diag_tool_automation.py test plan in parallel across multiple nodes in a rack-scale environment when using hl_qual.

The script parses node URIs from a host file and runs the remote Diagnostic tool test command concurrently on those nodes using pdsh with the ssh remote command module. A custom SSH port can be specified to accommodate firewall configurations; the default is port 22.

When the tests complete, the result artifacts from all nodes can be collected on either the master node or a remote analyzer node by passing the --output_path <remote_uri>:/path/to/result option to the Diagnostic tool.
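The fan-out described above can be pictured with a short sketch. This is not the implementation of diag_tool_automation_rack_scale.py, which is not shown here; the helper name build_pdsh_command is hypothetical, and the sketch only assumes the standard pdsh options -R (remote command module) and -w (target host list).

```python
import shlex


def build_pdsh_command(hosts, remote_command):
    """Return an argv list that runs remote_command on every host via pdsh.

    Hypothetical helper for illustration; the real script may build its
    command differently.
    """
    if not hosts:
        raise ValueError("at least one host is required")
    return [
        "pdsh",
        "-R", "ssh",            # select pdsh's ssh remote command module
        "-w", ",".join(hosts),  # comma-separated list of target nodes
        remote_command,         # passed to each node as a single argument
    ]


argv = build_pdsh_command(
    ["user@slave_ip1", "user@slave_ip2"],
    "python diag_tool_automation.py --exec run_test_plan",
)
# Print the command in a form that can be pasted into a shell.
print(shlex.join(argv))
```

A custom SSH port would not appear in this argument list; as in the example later in this section, it is supplied through the PDSH_SSH_ARGS environment variable.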

[Figure: diag_tool_automation_rack_scale]

Environment Setup

  1. Set up the Diagnostic tool environment on all nodes in the cluster. For detailed setup instructions, refer to the :ref:`Test_Plan_Automation` section.

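  2. If no pdsh package is available for your distribution, clone the pdsh source code and build it with SSH support:

    git clone https://github.com/chaos/pdsh.git
    cd pdsh
    ./bootstrap
    ./configure --with-ssh
    make && make install
    
  3. Otherwise, install pdsh on the master node with your distribution's package manager:

    # Debian/Ubuntu
    sudo apt install pdsh
    
    # RHEL/CentOS
    sudo yum install pdsh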
    
  4. Set up SSH authentication. Use the automation/ssh_key_authentication.py script to establish SSH key-based authentication between the master node and all slave nodes in the cluster.

  5. Prepare a host file. The host file (hostfile.txt) must contain two sections:

    • Master – at least one master node entry

    • Slave – one or more slave node entries

    Example:

    # hostfile.txt template
    [Master]
    user@master_ip
    
    [Slave]
    user@slave_ip1
    ...
    user@slave_ipN
    
  6. Run the SSH key authentication script on the master node:

    python automation/ssh_key_authentication.py -f <path_to_hostfile> [-k <key_type>] [-p <port>]
    

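    Options

    ===============  =========================================================================================
    Option           Description
    ===============  =========================================================================================
    -f / --hostfile  Path to the hostfile.txt
    -k / --key_type  (Optional) SSH key type: dsa, ecdsa, ecdsa-sk, ed25519, ed25519-sk, or rsa (default: rsa)
    -p / --port      (Optional) SSH port (default: 22)
    ===============  =========================================================================================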

    Note

    When using a custom SSH port, make sure sshd is configured to listen on that port in /etc/ssh/sshd_config (the Port directive).

    Example Output:

    2025-07-09 15:12:25,209 - INFO - Master: master_uri
    2025-07-09 15:12:25,209 - INFO - Slaves: ['slave_uri']
    2025-07-09 15:12:25,209 - INFO - Key Type: ed25519
    2025-07-09 15:12:25,209 - INFO - ############### Start distribute ssh keys ###############
    2025-07-09 15:12:25,359 - INFO - SSH key generated on master_uri
    ...
    2025-07-09 15:12:26,245 - INFO - verify_ssh_connectivity done.
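For readers who want a feel for what key distribution involves, the sketch below lists the standard OpenSSH commands a manual setup would run on the master node. It is an assumption-laden illustration, not the logic of ssh_key_authentication.py, whose internals are not shown here. The function name key_distribution_commands and the key file location under ~/.ssh are hypothetical; the ssh-keygen (-t, -f, -N) and ssh-copy-id (-i, -p) options are the standard ones.

```python
import shlex

# Key types accepted by the -k / --key_type option described above.
KEY_TYPES = ("dsa", "ecdsa", "ecdsa-sk", "ed25519", "ed25519-sk", "rsa")


def key_distribution_commands(slaves, key_type="rsa", port=22):
    """Return shell commands for a manual key setup run from the master.

    Hypothetical helper: it only builds command strings, it runs nothing.
    """
    if key_type not in KEY_TYPES:
        raise ValueError(f"unsupported key type: {key_type}")
    # Assumed key location, e.g. ~/.ssh/id_ed25519 or ~/.ssh/id_ecdsa_sk.
    key_path = "~/.ssh/id_" + key_type.replace("-", "_")
    # Generate a key pair with an empty passphrase.
    commands = [f"ssh-keygen -t {key_type} -f {key_path} -N ''"]
    for slave in slaves:
        # Install the public key on each slave over the chosen SSH port.
        commands.append(
            f"ssh-copy-id -i {key_path}.pub -p {port} {shlex.quote(slave)}"
        )
    return commands


for command in key_distribution_commands(
    ["user@slave_ip1"], key_type="ed25519", port=22
):
    print(command)
```

The helper only prints commands so they can be reviewed before anything is run; the provided script remains the supported way to set up authentication.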
    

Options and Usage

The diag_tool_automation_rack_scale.py script can be found under /opt/habanalabs/qual/diag_tool/. The following is a run command example:

python diag_tool_automation_rack_scale.py -c <diag_tool_command> [-f <path_to_hostfile>]

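Options

===============  ===================================================
Option           Description
===============  ===================================================
-c / --command   Diagnostic tool test plan execution command.
-f / --hostfile  Path to the host file. The default is hostfile.txt.
===============  ===================================================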

Note

For further help, run python diag_tool_automation_rack_scale.py -h.
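The host file passed with -f uses the [Master] and [Slave] layout from the Environment Setup steps. As a minimal sketch of how such a file could be read, the hypothetical parse_hostfile function below splits it into the two sections. The exact parsing rules of the real script are not documented here, so the handling of comments, blank lines, and unknown sections is an assumption.

```python
def parse_hostfile(text):
    """Split hostfile text into {"Master": [...], "Slave": [...]}.

    Hypothetical parser for illustration only.
    """
    sections = {"Master": [], "Slave": []}
    current = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if line.startswith("[") and line.endswith("]"):
            current = line[1:-1]
            if current not in sections:
                raise ValueError(f"unknown section: {current}")
            continue
        if current is None:
            raise ValueError(f"entry outside a section: {line}")
        sections[current].append(line)
    if not sections["Master"] or not sections["Slave"]:
        raise ValueError("hostfile needs at least one Master and one Slave entry")
    return sections


example = """\
# hostfile.txt template
[Master]
user@master_ip

[Slave]
user@slave_ip1
user@slave_ip2
"""
print(parse_hostfile(example))
```

Rejecting unknown sections and stray entries early makes a malformed host file fail before any remote command is launched.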

Example

The following example demonstrates how to run the Diagnostic tool in a rack-scale development environment:

export PDSH_SSH_ARGS="-p <ssh_port>"  # (Optional) Set SSH port (default: 22)
export REMOTE_URI="remote_uri"
export DIAG_TOOL_COMMAND="python /opt/habanalabs/qual/diag_tool/diag_tool_automation.py \
  --exec run_test_plan \
  --input_path /opt/habanalabs/qual/diag_tool/test_plans/E2E.yaml \
  --output_path $REMOTE_URI:/var/log/habanalabs \
  --core gaudi2 \
  --uri_key_path ~/.ssh/id_ed25519"

cd <QUAL_SRC>/diag_tool
python diag_tool_automation_rack_scale.py -c "$DIAG_TOOL_COMMAND" -f hostfile.txt
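The same invocation can be assembled from Python, which may be convenient when the run is driven by another script. This is a sketch under stated assumptions: build_diag_command is a hypothetical helper, and the paths and option values are simply copied from the shell example above.

```python
import shlex


def build_diag_command(remote_uri, core="gaudi2", key_path="~/.ssh/id_ed25519"):
    """Return the Diagnostic tool command string used in the example above.

    Hypothetical helper; values mirror the shell example.
    """
    base = "/opt/habanalabs/qual/diag_tool"
    args = [
        "python", f"{base}/diag_tool_automation.py",
        "--exec", "run_test_plan",
        "--input_path", f"{base}/test_plans/E2E.yaml",
        "--output_path", f"{remote_uri}:/var/log/habanalabs",
        "--core", core,
        "--uri_key_path", key_path,
    ]
    # Joined without quoting so that "~" is still expanded by the shell
    # that eventually runs the command.
    return " ".join(args)


diag_command = build_diag_command("remote_uri")
launcher = [
    "python", "diag_tool_automation_rack_scale.py",
    "-c", diag_command,
    "-f", "hostfile.txt",
]
# Show the launcher command in a shell-pasteable form.
print(shlex.join(launcher))
```

To actually start the run, the launcher list could be passed to subprocess.run(launcher, check=True) from the diag_tool directory. A custom SSH port would still be set through PDSH_SSH_ARGS in the environment.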

For more information on generating and executing test plans, see the :ref:`Test_Plan_Automation` section.