hl_qual Design

Overview

This section describes the general design of the hl_qual package and the responsibilities of each software component. In this section, two terms are used:

  • Test plugin - A testing library that performs a specific testing task, for example, the power stress test plugin.

  • Runner - An application responsible for loading and running a specific test plugin. hl_qual runs one runner per tested device.

Each plugin conforms to specific design considerations as further described in this section. All plugins are implemented as dynamically linked libraries which have a common API.

As shown in Package Content, the hl_qual package is composed of the hl_qual application, the runner, and the plugin libraries shown in Figure 21. The hl_qual application is the entry point for running all test plugins: it forks a process per test plugin and per device.

Figure 21 The Running Hierarchy between hl_qual and Plugins

hl_qual and the runner run as separate processes. hl_qual forks one new process per tested device and then loads the runner application image into it. This ensures that the runner starts from a new image with no IO or IPC resources in common with hl_qual.

The runner then decodes the parameters passed on its command line and loads the required test plugin. After loading the plugin, it calls the plugin's run function to perform the desired test.
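The fork-then-load-image pattern described above can be sketched as follows. This is a minimal illustration, not the hl_qual implementation: the function names `spawn_runner` and `wait_for_runner` and the shape of the runner command line are assumptions, and any executable can stand in for the runner.

```python
import os
import sys


def spawn_runner(runner_argv):
    """Fork a child and replace its image with the runner program.

    runner_argv is a hypothetical command line: [program, arg1, ...].
    Returns the child's PID to the parent.
    """
    pid = os.fork()
    if pid == 0:
        # Child: exec replaces the process image, so the parent's
        # in-memory state (globals, buffers) does not carry over.
        try:
            os.execv(runner_argv[0], runner_argv)
        finally:
            # Only reached if exec failed; never fall back into parent code.
            os._exit(127)
    return pid


def wait_for_runner(pid):
    """Block until the runner exits and return its exit code."""
    _, status = os.waitpid(pid, 0)
    if os.WIFEXITED(status):
        return os.WEXITSTATUS(status)
    return -os.WTERMSIG(status)


if __name__ == "__main__":
    # Stand-in runner: a Python one-liner that exits with the device index.
    script = "import sys; sys.exit(int(sys.argv[1]))"
    pids = [spawn_runner([sys.executable, "-c", script, str(dev)])
            for dev in range(2)]
    print([wait_for_runner(pid) for pid in pids])
```

Each child gets its own process image, so a crash or leaked resource in one runner does not affect the parent or the other runners.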

Inter Process Communication

hl_qual and the runner process communicate via multiple IPC structures.

Pipes

  • Result pipe - Returns the test result (pass/fail).

  • Message pipe - Carries all messages and printouts produced by the test plugin during the init and run phases, including result metrics.

hl_qual opens a result pipe and a message pipe per runner process and per tested device.

The pipes are limited in size, so the hl_qual runner manager keeps reading data from them to free space and allow long-running tests to proceed. The message pipe output from each device is redirected to the final hl_qual report to give a full view of the test results for each device.
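The need to keep draining the pipes can be illustrated with a small sketch. It assumes a POSIX system; `run_child_with_pipes` and `noisy_test` are hypothetical names, and the child function stands in for a runner that prints a large amount of output before reporting its result.

```python
import os
import select


def run_child_with_pipes(child_fn):
    """Fork a child with a message pipe and a result pipe, and drain both.

    child_fn(msg_fd, res_fd) is a hypothetical stand-in for a runner.
    Returns (message_bytes, result_bytes).
    """
    msg_r, msg_w = os.pipe()
    res_r, res_w = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child keeps only the write ends.
        os.close(msg_r)
        os.close(res_r)
        try:
            child_fn(msg_w, res_w)
        finally:
            os._exit(0)
    # Parent keeps only the read ends.
    os.close(msg_w)
    os.close(res_w)
    chunks = {msg_r: [], res_r: []}
    open_fds = {msg_r, res_r}
    while open_fds:
        readable, _, _ = select.select(list(open_fds), [], [])
        for fd in readable:
            data = os.read(fd, 65536)
            if data:
                chunks[fd].append(data)
            else:
                # EOF: the child closed its write end.
                open_fds.discard(fd)
                os.close(fd)
    os.waitpid(pid, 0)
    return b"".join(chunks[msg_r]), b"".join(chunks[res_r])


def noisy_test(msg_fd, res_fd):
    # Writes far more than a typical pipe buffer holds; the writes only
    # complete because the parent keeps reading the message pipe.
    line = b"metric: power=123\n"
    for _ in range(20000):
        os.write(msg_fd, line)
    os.write(res_fd, b"PASS")
```

If the parent stopped reading, the child would block in `os.write` as soon as the pipe buffer filled, which is the situation the runner manager avoids by reading periodically.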

Control Room

Control room is a shared memory synchronization structure that enables the following features:

  • Synchronized start of test plugin execution via a barrier.

  • State progress notification - init, main run, finalize.

  • Main loop test progress monitoring - progress bar which monitors the test completion percentage.

hl_qual is responsible for constructing the control room. The synchronization objects are passed to the runner and the test libraries via shared memory.
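The three control room features (barrier start, state notification, progress monitoring) can be sketched with Python's `multiprocessing` primitives. This is only an analogy under stated assumptions: the real control room is a shared memory structure handed to separately loaded runner images, whereas this sketch relies on fork inheritance, and all names (`runner`, `run_with_control_room`, `STATES`) are hypothetical.

```python
import multiprocessing as mp

STATES = ("init", "main run", "finalize")  # phases named in the text


def runner(index, barrier, state, progress, steps):
    """Stand-in for one runner process working on one device."""
    state[index] = 0                 # init
    barrier.wait(timeout=10)         # all runners start the main run together
    state[index] = 1                 # main run
    for step in range(1, steps + 1):
        progress[index] = 100 * step // steps   # completion percentage
    state[index] = 2                 # finalize


def run_with_control_room(num_devices, steps=10):
    """Build the shared objects, start one runner per device, and wait."""
    ctx = mp.get_context("fork")
    barrier = ctx.Barrier(num_devices)
    state = ctx.Array("i", num_devices)      # one state slot per device
    progress = ctx.Array("i", num_devices)   # one percentage per device
    procs = [
        ctx.Process(target=runner, args=(i, barrier, state, progress, steps))
        for i in range(num_devices)
    ]
    for proc in procs:
        proc.start()
    for proc in procs:
        proc.join()
    return [STATES[s] for s in state[:]], progress[:]
```

A monitoring thread in the parent could poll `state` and `progress` while the runners are active to drive a per-device progress display.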

Metric Monitor

The monitor module has two roles:

  • Readout of device metrics and measurements such as power usage, clocks, temperature and memory errors which will be included in the hl_qual report.

  • Textual display showing the measured metrics and the test progress per device.

The monitor is also supplied as a standalone application to enable the option of monitoring the device while running other loads on Gaudi devices. For further details on the monitor, refer to hl_qual Monitor.

Test Running Mode

hl_qual can run the test plugins on multiple devices using one of two running modes:

  • Serial mode - In this mode, the test plugins run one by one, with each test plugin starting once the test on the previously tested device is complete. Only one Gaudi device is tested at a time. This running mode is recommended when running the PCI bandwidth test to calculate the peak achievable PCI bandwidth between the host and the device.

  • Parallel mode - In this mode, all test plugin processes are forked simultaneously and run in parallel, with each test plugin running on the Gaudi device allocated for it. The execution of the test plugins is not synchronized, which means each plugin may start and finish at different times. The hl_qual application considers testing completed once the last running test plugin ends.
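The difference between the two modes comes down to when the parent waits. The sketch below is an illustration only: `run_serial` and `run_parallel` are hypothetical names, and each command stands in for a runner invocation on one device.

```python
import subprocess
import sys


def run_serial(commands):
    """Run one runner at a time; each starts after the previous one exits."""
    return [subprocess.run(cmd).returncode for cmd in commands]


def run_parallel(commands):
    """Start every runner first, then wait until the last one has ended."""
    procs = [subprocess.Popen(cmd) for cmd in commands]
    return [proc.wait() for proc in procs]


if __name__ == "__main__":
    # Stand-in runners: each exits with its device index as the exit code.
    commands = [
        [sys.executable, "-c", f"import sys; sys.exit({dev})"]
        for dev in range(3)
    ]
    print(run_serial(commands))
    print(run_parallel(commands))
```

In serial mode the devices never compete for shared host resources, which is consistent with the recommendation to use it for peak PCI bandwidth measurements.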

Pass/Fail Criteria Considerations

Each test plugin is responsible for setting and checking the pass/fail criteria. The hl_qual application receives a pass/fail indication from each test plugin but is not responsible for the interpretation of the test results. This allows the hl_qual application to have a similar interface to all the test plugins.
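A rough sketch of this separation of concerns is shown below. The plugin functions, their parameters, and the threshold values are invented purely for illustration and do not reflect real hl_qual criteria. The point is that the aggregating side only ever sees booleans.

```python
def power_stress_plugin(measured_watts, limit_watts=350.0):
    """Hypothetical plugin: owns its criterion (made-up power limit)."""
    return measured_watts <= limit_watts


def bandwidth_plugin(measured_gbps, minimum_gbps=20.0):
    """Hypothetical plugin: owns a different criterion (made-up minimum)."""
    return measured_gbps >= minimum_gbps


def overall_verdict(results):
    """Aggregating side: sees pass/fail flags only, never the criteria."""
    return "PASS" if all(results) else "FAIL"


if __name__ == "__main__":
    results = [power_stress_plugin(300.0), bandwidth_plugin(25.0)]
    print(overall_verdict(results))
```

Because the criteria live inside each plugin, adding a new test does not require any change to the aggregating code.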

hl_qual Design and Responsibilities

The hl_qual application is the glue logic responsible for running multiple tests on multiple devices. The following list describes the tasks that hl_qual performs while running these tests:

  • Identification of available Gaudi devices in the system.

  • Generation of a command line for the test plugins. hl_qual identifies and validates the different switches.

  • Running the runner application on all available devices under the running modes described in this section. This includes forking a process, opening a message pipe and a result pipe per forked process, loading the runner image, and packing the runner command line.

  • Capturing system parameters for all available devices. This includes temperature, clock and power usage, which are displayed on screen using the monitor add-on.

  • Waiting for test results from all running test plugins (Pass/Fail indications).

  • Periodically reading the result and message pipes.

  • Running the monitor thread.

  • Constructing the control room.

  • Handling the CTRL-C signal and forwarding it to all runners to enable a clean shutdown of the application.

  • General error checking of all devices under test. This includes reading PCI AER and reading single/double errors on the internal memories of all devices.
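The CTRL-C handling task in the list above can be sketched as follows, assuming a POSIX system where CTRL-C is delivered as SIGINT. All names are hypothetical, and the stand-in runner simply installs a handler and exits with a distinctive code so that a clean shutdown can be observed.

```python
import signal
import subprocess
import sys

# Stand-in runner: installs a SIGINT handler, reports readiness, then waits.
RUNNER_CODE = (
    "import signal, sys, time\n"
    "signal.signal(signal.SIGINT, lambda signum, frame: sys.exit(3))\n"
    "print('ready', flush=True)\n"
    "time.sleep(30)\n"
)


def start_runners(count):
    """Start the stand-in runners and wait until each has its handler set."""
    procs = [
        subprocess.Popen([sys.executable, "-c", RUNNER_CODE],
                         stdout=subprocess.PIPE)
        for _ in range(count)
    ]
    for proc in procs:
        proc.stdout.readline()  # blocks until the runner prints 'ready'
    return procs


def forward_sigint(procs):
    """Send SIGINT to every runner, then collect their exit codes."""
    for proc in procs:
        proc.send_signal(signal.SIGINT)
    codes = [proc.wait() for proc in procs]
    for proc in procs:
        proc.stdout.close()
    return codes


def install_forwarder(procs):
    """Make CTRL-C in the parent forward SIGINT to all runners."""
    signal.signal(signal.SIGINT, lambda signum, frame: forward_sigint(procs))
```

Waiting for the readiness line before forwarding avoids signalling a runner that has not yet installed its handler.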

Runner Design and Responsibilities

The runner is an application that runs per tested device. It is forked from hl_qual and loaded with the runner image to ensure a safe run that does not carry over the IPC, IO and global variables previously created in the hl_qual process. The following list describes the tasks that the runner performs while running these tests:

  • Decoding the command line switches.

  • Identification of the plugin to be loaded.

  • Loading the plugin library dynamically.

  • Verifying that all required parameters are compliant with plugin parameters using the plugin verify API function.

  • Execution of the plugin run function.

  • Redirecting all standard output to the message pipe.

  • Capturing signals sent from the hl_qual process (CTRL-C).
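The runner steps listed above (decode switches, identify and load the plugin, verify parameters, run, redirect output) can be sketched in miniature. This is an analogy only: real plugins are dynamically linked libraries, while here a Python module loaded from a file path plays that role. The switch names `--plugin` and `--iterations`, the `verify(params)` and `run(params)` signatures, and the demo plugin are all assumptions.

```python
import argparse
import contextlib
import importlib.util
import os
import tempfile

# Hypothetical plugin exposing the assumed common API: verify() and run().
DEMO_PLUGIN_SOURCE = (
    "def verify(params):\n"
    "    return params['iterations'] > 0\n"
    "\n"
    "def run(params):\n"
    "    print('iterations', params['iterations'])\n"
    "    return True\n"
)


def load_plugin(path):
    """Load a plugin module from a file path at run time."""
    spec = importlib.util.spec_from_file_location("test_plugin", path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module


def runner_main(argv, message_fd):
    """Decode switches, load and verify the plugin, then run it."""
    parser = argparse.ArgumentParser(prog="runner")
    parser.add_argument("--plugin", required=True)             # hypothetical
    parser.add_argument("--iterations", type=int, default=1)   # hypothetical
    args = parser.parse_args(argv)

    plugin = load_plugin(args.plugin)
    params = {"iterations": args.iterations}
    if not plugin.verify(params):
        return False
    # Everything the plugin prints goes to the message pipe.
    with os.fdopen(message_fd, "w") as pipe, contextlib.redirect_stdout(pipe):
        return bool(plugin.run(params))


def demo():
    """Write the demo plugin to a temp directory and run it once."""
    with tempfile.TemporaryDirectory() as directory:
        path = os.path.join(directory, "demo_plugin.py")
        with open(path, "w") as handle:
            handle.write(DEMO_PLUGIN_SOURCE)
        read_fd, write_fd = os.pipe()
        passed = runner_main(["--plugin", path, "--iterations", "4"], write_fd)
        with os.fdopen(read_fd) as reader:
            return passed, reader.read()
```

Running verification before the test starts lets the runner reject a bad command line without touching the device.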