Power Stress and EDP Test Plugins: Design, Switches and Parameters

This section describes plugin-specific switches only. Common plugin switches and parameters are described in hl_qual Common Plugin Switches and Parameters.

Gaudi 3/2 Power Stress Plugin Design Consideration and Responsibilities

The power stress plugin runs a power stress test that places the device under a constant, steady power load for an extended period (more than two hours). The following device functionalities are tested:

  • Thermal stress testing of the cooling system, heat dissipation and thermal protection mechanisms, performed while the power stress plugin runs at extreme load.

  • The PID module, which is responsible for power management and for keeping power usage below the maximum allowed power. See Supported Power Levels.

Prerequisites

For Gaudi 2 only: Before running this test, load the driver:

sudo modprobe habanalabs timeout_locked=0
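
If the habanalabs driver is already loaded, modprobe returns without reloading it, so timeout_locked=0 is not applied. The following is a minimal sketch, not part of hl_qual, that unloads a loaded driver before loading it with the parameter. The function names and the DRY_RUN switch are hypothetical, and the sketch assumes no process is using the devices. If other modules depend on habanalabs, modprobe -r may refuse to unload it; this sketch does not handle that case.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print each command instead of running it when DRY_RUN=1.
run_cmd() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Hypothetical helper: (re)load the habanalabs driver with timeout_locked=0.
# modprobe does not reload a module that is already loaded, so a loaded
# driver is removed first; otherwise the new parameter would be ignored.
reload_habanalabs_for_qual() {
  # /sys/module/habanalabs exists while the module is loaded.
  if [ -d /sys/module/habanalabs ]; then
    # Fails if a process still has a device open; stop it and retry.
    run_cmd sudo modprobe -r habanalabs || return 1
  fi
  run_cmd sudo modprobe habanalabs timeout_locked=0
}
```

Run DRY_RUN=1 reload_habanalabs_for_qual to preview the commands, then reload_habanalabs_for_qual to apply them.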

Supported Power Levels

The power stress plugin applies maximum power stress to the device. The test plugin supports the following power levels:

  • Gaudi 3:

    • Extreme - Measured power level: 850-870 [watts]

Note

The above power measurements were achieved on an HL-325 device and are peak power levels measured by the HLML API.

  • Gaudi 2 - 225H/225C:

    • Extreme - Measured power level: 530-560 [watts]

    • High - Measured power level: 450-490 [watts]

  • Gaudi 2 - 225D:

    • Extreme - Measured power level: 370 [watts]

    • High - Measured power level: 260 [watts]

Note

The above power measurements were achieved on an HL-225H device and are peak power levels measured by the HLML API.

Power Stress Test - Pass/Fail Criteria

The pass/fail criteria consist of the following:

  • The test must not trigger the built-in thermal protection.

  • MME calculation results must be bit-exact.

First-gen Gaudi Power Stress Plugin Design Consideration and Responsibilities

The power stress plugin runs a power stress test that places the device under a constant, steady power load for an extended period (more than two hours). The following device functionalities are tested:

  • Thermal stress testing of the cooling system, heat dissipation and thermal protection mechanisms, performed while the power stress plugin runs at extreme load.

  • Power limiter and clock relaxation mechanisms. The power limiter keeps power usage below 300 [watts]; when the power limit is reached, the device clocks are lowered. To test the power limiter mechanism, the plugin must run at the extreme power level.

  • Long work periods at typical power levels (extreme, high).

Power usage and reported temperature may vary between servers, as they depend on the server's ambient conditions (fan speed, ambient temperature, and number of devices being tested).

The power stress plugin conducts a multi-level power stress test. The test plugin supports the following power levels:

  • Extreme - Measured power level: 345-355 [watts]

  • High - Measured power level: 230-240 [watts]

Note

  • The above power measurements were achieved on an HL-205 device and are peak power levels measured by the HLML API.

  • Tests should run longer than 30 seconds to reach maximum power.

Power Stress Test - Pass/Fail Criteria

The power stress tests must run until completion without overheating or power supply failures.

Power Stress Test Plugin Switches and Parameters

Gaudi 3:

hl_qual -gaudi3 -c <pci bus id> [-t <time in seconds>] -rmod <serial | parallel> [-dis_mon] -s

Switches and Parameters

Description

-t

Power stress test duration in seconds. If this switch is omitted, the default value is 60 seconds.

-s

Power stress test selector.

./hl_qual -gaudi3 -c all -rmod parallel -s -t 120
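
The design calls for runs of more than two hours, while -t defaults to 60 seconds, so long runs must set -t explicitly. The following is a minimal sketch of a wrapper that converts whole hours to seconds. The function names and the DRY_RUN switch are hypothetical; the hl_qual options are taken from the example command above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the command instead of running it when DRY_RUN=1.
run_cmd() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Hypothetical wrapper: run the Gaudi 3 power stress test on all devices for
# a whole number of hours. -t expects seconds, so hours are converted first.
gaudi3_power_stress_hours() {
  local hours="$1"
  case "$hours" in
    ''|*[!0-9]*)
      echo "hours must be a positive integer" >&2
      return 1
      ;;
  esac
  # 10# forces base 10, so inputs such as 08 are not parsed as octal.
  local seconds=$(( 10#$hours * 3600 ))
  if [ "$seconds" -eq 0 ]; then
    echo "hours must be a positive integer" >&2
    return 1
  fi
  run_cmd ./hl_qual -gaudi3 -c all -rmod parallel -s -t "$seconds"
}
```

For example, DRY_RUN=1 gaudi3_power_stress_hours 3 prints ./hl_qual -gaudi3 -c all -rmod parallel -s -t 10800.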

Gaudi 2:

hl_qual -gaudi2 -c <pci bus id> [-t <time in seconds>] -rmod <serial | parallel> [-dis_mon] -s [-l <extreme | high>]

Switches and Parameters

Description

-t

Power stress test duration in seconds. If this switch is omitted, the default value is 60 seconds.

-s

Power stress test selector.

-l <extreme | high>

Power level selector for 54V power supply:

  • extreme - 530-560 [watts] measured on HL-225H

  • high - 450-490 [watts] measured on HL-225H

If the value is not specified, the default value is High.

./hl_qual -gaudi2 -c all -rmod parallel -s -l extreme -t 120
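
To cover both documented Gaudi 2 power levels in one session, the example command can be repeated once per level. The following is a minimal sketch, assuming a sequential extreme-then-high order suits your qualification flow. The function names and the DRY_RUN switch are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the command instead of running it when DRY_RUN=1.
run_cmd() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Hypothetical wrapper: run the Gaudi 2 power stress test once per requested
# power level, in order, stopping at the first failure.
gaudi2_power_stress_levels() {
  local seconds="$1"
  shift
  local level
  for level in "$@"; do
    # Only the two documented -l values are accepted.
    case "$level" in
      extreme|high) ;;
      *)
        echo "unsupported level: $level (use extreme or high)" >&2
        return 1
        ;;
    esac
    run_cmd ./hl_qual -gaudi2 -c all -rmod parallel -s -l "$level" -t "$seconds" || return 1
  done
}
```

DRY_RUN=1 gaudi2_power_stress_levels 120 extreme high prints two hl_qual commands, one per level.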

First-gen Gaudi:

hl_qual -gaudi -c <pci bus id> [-t <time in seconds>] -rmod <serial | parallel> [-dis_mon] -s [-l <extreme | high>]

Switches and Parameters

Description

-t

Power stress test duration in seconds. If this switch is omitted, the default value is 40 seconds.

-s

Power stress test selector.

-l <extreme | high>

Power level selector:

  • extreme - 345-355 [watts] measured on HL-205

  • high - 230-240 [watts] measured on HL-205

If the value is not specified, the default value is High.

./hl_qual -gaudi -c all -rmod parallel -s -l extreme -t 120
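
The first-gen Gaudi power level notes state that tests should run longer than 30 seconds to reach maximum power. The following is a minimal sketch of a wrapper that warns about shorter durations while keeping the documented defaults (-l high, -t 40). The function names and the DRY_RUN switch are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the command instead of running it when DRY_RUN=1.
run_cmd() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Hypothetical wrapper for the first-gen Gaudi power stress test. Defaults
# mirror the documented ones (-l high, -t 40). Durations of 30 s or less only
# trigger a warning, because the device may not reach maximum power in time.
gaudi_power_stress() {
  local level="${1:-high}"
  local seconds="${2:-40}"
  if [ "$seconds" -le 30 ]; then
    echo "warning: -t $seconds may be too short to reach maximum power" >&2
  fi
  run_cmd ./hl_qual -gaudi -c all -rmod parallel -s -l "$level" -t "$seconds"
}
```

DRY_RUN=1 gaudi_power_stress extreme 120 prints the example command above; with no arguments, the wrapper uses high and 40 seconds.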

EDP Stress Test Plugin Design Consideration and Responsibilities

The EDP stress test verifies the functionality of the power supply by generating fast power usage transients from low power to high power and back. These power cycles repeat throughout the test’s execution time.

Figure 18: EDP Stress Test Power Cycles

The test’s configurable parameters, -Tw/-Ts (Gaudi 3 and Gaudi 2, in seconds) or -tw/-ts (first-gen Gaudi, in milliseconds), set the duration of each power cycle, which consists of a high-power period and a low-power (idle) period. The high-power level is configurable via the -l switch (Gaudi 2 and first-gen Gaudi).

Prerequisites

For Gaudi 2 only: Before running this test, load the driver:

sudo modprobe habanalabs timeout_locked=0

EDP Stress Test - Pass/Fail Criteria

The EDP stress test must run until completion without overheating or power supply failures.

EDP Stress Test Plugin Switches and Parameters

Gaudi 3:

./hl_qual -gaudi3 -c <pci bus id> [-t <time in seconds>] -rmod <serial | parallel> [-dis_mon] -e [-sync] [-Tw <time in seconds>] [-Ts <time in seconds>]

Switches and Parameters

Description

-t

EDP stress test duration in seconds. If this switch is omitted, the default value is 40 seconds.

-e

EDP test selector.

-sync

Enables device sync. When using this switch, the rising edge of the power cycle is synchronized between all devices.

-Tw <time in seconds>

Sets the duration, in seconds, of the high-power period of each EDP power cycle.

-Ts <time in seconds>

Sets the duration, in seconds, of the idle period of each EDP power cycle.

The following command runs the EDP stress test for 40 seconds, which yields 10 power cycles. Each power cycle consists of 3 seconds of high power usage and 1 second of low power usage:

./hl_qual -gaudi3 -c all -rmod parallel -t 40 -e -Tw 3 -Ts 1
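
The cycle count in the example follows from dividing the test duration by the cycle length: 40 / (3 + 1) = 10. The following is a minimal sketch of that arithmetic, assuming cycles run back to back with no setup overhead. The function name is hypothetical, and the documentation does not say whether a final partial cycle is executed.

```shell
#!/usr/bin/env bash
# Hypothetical helper: number of complete EDP power cycles in a run of
# t seconds, where each cycle is Tw seconds of high power plus Ts seconds idle.
# Assumes cycles run back to back with no extra setup time.
edp_cycles() {
  local t="$1" tw="$2" ts="$3"
  local period=$(( tw + ts ))
  if [ "$period" -le 0 ]; then
    echo "Tw + Ts must be greater than zero" >&2
    return 1
  fi
  # Integer division keeps only complete cycles.
  echo $(( t / period ))
}
```

The same arithmetic applies to the Gaudi 2 example (-t 40 -Tw 3 -Ts 1). Conversely, to run N cycles, set -t to N * (Tw + Ts); for example, 25 cycles of -Tw 3 -Ts 1 need -t 100.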

Gaudi 2:

./hl_qual -gaudi2 -c <pci bus id> [-t <time in seconds>] -rmod <serial | parallel> [-dis_mon] [-mon_cfg <monitor INI path>]
      -e [-sync] [-inc_power] [-enable_ports_check <all | int>] [-Tw <time in seconds>] [-Ts <time in seconds>] [-l <extreme | high>]

Switches and Parameters

Description

-t

EDP stress test duration in seconds. If this switch is omitted, the default value is 40 seconds.

-e

EDP test selector.

-l <extreme | high>

Power level selector for 54V power supply:

  • extreme - 530-560 [watts] measured on HL-225H

  • high - 450-490 [watts] measured on HL-225H

If the value is not specified, the default value is High.

-sync

Enables device sync. When using this switch, the rising edge of the power cycle is synchronized between all devices.

-inc_power

Enables NIC execution to ensure high power usage.

-enable_ports_check <all | int>

Checks whether the ports are UP or DOWN. If the checked ports are DOWN, the test fails:

  • all - Checks all the external and internal ports.

  • int - Checks the internal ports only.

-Tw <time in seconds>

Sets the duration, in seconds, of the high-power period of each EDP power cycle.

-Ts <time in seconds>

Sets the duration, in seconds, of the idle period of each EDP power cycle.

The following command runs the EDP stress test at the extreme power level for 40 seconds, which yields 10 power cycles. Each power cycle consists of 3 seconds of high power usage and 1 second of low power usage:

./hl_qual -gaudi2 -c all -rmod parallel -t 40 -e -l extreme -Tw 3 -Ts 1

First-gen Gaudi:

hl_qual -gaudi -c <pci bus id> [-t <time in seconds>] -rmod <serial | parallel> [-dis_mon] [-mon_cfg <monitor INI path>]
      -e [-tw <time in milliseconds>] [-ts <time in milliseconds>] [-l <extreme | high>]

Switches and Parameters

Description

-t

EDP stress test duration in seconds.

-e

EDP test selector.

-l <extreme | high>

Power level selector:

  • extreme - 345-355 [watts] measured on HL-205

  • high - 250-260 [watts] measured on HL-205

If the value is not specified, the default value is High.

-tw <time in milliseconds>

Duration of high power usage in each EDP power cycle, in milliseconds. The value should be a multiple of 200 ms; otherwise, hl_qual rounds it to the nearest multiple of 200 ms.

-ts <time in milliseconds>

Duration of low power usage (idle mode) in each EDP power cycle, in milliseconds. The value should be a multiple of 200 ms; otherwise, hl_qual rounds it to the nearest multiple of 200 ms.

The following command runs the EDP stress test at the extreme power level for 40 seconds, which yields 10 power cycles. Each power cycle consists of 2 seconds of high power usage and 2 seconds of low power usage:

./hl_qual -gaudi -c all -rmod parallel -t 40 -e -l extreme -tw 2000 -ts 2000
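
For first-gen Gaudi, -tw and -ts are in milliseconds and are rounded to a multiple of 200 ms, so the effective cycle length can differ from the requested one. The following is a minimal sketch of the rounding and the resulting cycle count, assuming halfway values round up (hl_qual's tie-breaking is not documented). The function names are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: round a duration in milliseconds to the nearest
# multiple of 200 ms, as the text says hl_qual does for -tw/-ts.
# How hl_qual breaks ties (e.g. 2100 ms) is not documented; this rounds up.
round_to_200ms() {
  local ms="$1"
  echo $(( (ms + 100) / 200 * 200 ))
}

# Hypothetical helper: complete EDP power cycles in a run of t seconds,
# using the rounded -tw/-ts values (both in milliseconds).
edp_cycles_ms() {
  local t_s="$1"
  local tw_ms ts_ms period
  tw_ms=$(round_to_200ms "$2")
  ts_ms=$(round_to_200ms "$3")
  period=$(( tw_ms + ts_ms ))
  if [ "$period" -le 0 ]; then
    echo "rounded -tw + -ts must be greater than zero" >&2
    return 1
  fi
  # Convert the run time to ms; integer division keeps complete cycles only.
  echo $(( t_s * 1000 / period ))
}
```

With the example values, edp_cycles_ms 40 2000 2000 prints 10; requesting -tw 1950 -ts 2050 gives the same 10 cycles, because both values round to 2000 ms.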