System Management Interface Tool (hl-smi)

This section describes the system management interface tool, hl-smi, which reports information about Intel® Gaudi® AI accelerators and monitors their state.

hl-smi Utility Options

Note

Run sudo hl-smi on the host as root (not inside a Docker container) to see the process IDs of all users on the system.

Running hl-smi without any options displays a summary table of the detected Gaudi devices.

$ hl-smi [<options>]

Use the -h option to list all options. The following table describes each available option.

Option

Description

-L, --list-aips

Lists all Gaudi devices.

-i, --device <pci-addr>

Acts on a specific PCIe device.

--fw-versions

Displays the versions of the firmware components.

-r, --reset-aip

Triggers a reset of the Intel Gaudi AIP. Requires root.

Requires the -i switch to target a specific device.

-p, --power-max <watts>

Sets maximum power limit. For Gaudi 3:

  • Minimum power limit: 450W

  • Maximum power limit: 850W

-q, --query

Displays information for AIPs in the system.

-d, --display <type>

Displays only the selected type of information:

PRODUCT, POWER, CLOCK, TEMPERATURE, FAN, BUS, MEMORY, NET, ROW_REPLACEMENT.

-l, --loop [seconds]

Continuously reports query data at the specified interval. The default interval is 5 seconds.

-h, --help

Prints the help message.

-v, --version

Prints version information and exits.

-D

Adds debug prints.

-Q, --query-aip <types>

Selects queries to be presented in CSV format.

Possible queries:

timestamp, name, bus_id, driver_version, temperature.aip, module_id, utilization.aip, memory.total, memory.free, memory.used, index, serial, uuid, power.draw, ecc.errors.uncorrected.aggregate.total, ecc.errors.uncorrected.volatile.total, ecc.errors.corrected.aggregate.total, ecc.errors.corrected.volatile.total, ecc.errors.dram.aggregate.total, ecc.errors.dram-corrected.aggregate.total, ecc.errors.dram.volatile.total, ecc.errors.dram-corrected.volatile.total, ecc.mode.current, ecc.mode.pending, stats.violation.power, stats.violation.thermal, clocks.current.soc, clocks.max.soc, clocks.limit.soc, clocks.limit.tpc, pcie.link.gen.max, pcie.link.gen.current, pcie.link.width.max, pcie.link.width.current, pcie.link.speed.max, pcie.link.speed.current, ecc.errors.hbm.sram.critical.

--query-row-replacement <types>

Displays row replacement information according to type selections:

uuid, replaced_rows.address, replaced_rows.cause.

-f, --format <types>

Displays the required information in CSV format:

csv (mandatory), noheader, nounits

-n, --nic <ports|link|stats>

Retrieves NIC port information (internal/external), including link state (up/down) or statistics for the requested internal port(s). The stats query is equivalent to ethtool -S <port>.

-P, --port [ports]

Specifies the selected ports (optional) for NIC information. If no ports are specified, data is retrieved for all ports.

-t, --tpm

Provides data for further TPM validation. The output includes:

nonce, quote, quote_sig, public_key, iak_der_cert, dev_info, dev_info_sig, idev_id_public_key, idev_id_der_cert
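The -Q and -f options described above combine into a single command line. The following is a minimal Python sketch of assembling one; the helper name build_query_cmd is hypothetical, and the sketch only builds the argument list, it does not run hl-smi.

```python
import shlex


def build_query_cmd(fields, noheader=False, nounits=False):
    """Build an argv list for `hl-smi -Q <fields> -f csv[,...]` (hypothetical helper)."""
    if not fields:
        raise ValueError("at least one query field is required")
    fmt = ["csv"]                  # csv is listed as mandatory for -f
    if noheader:
        fmt.append("noheader")
    if nounits:
        fmt.append("nounits")
    return ["hl-smi", "-Q", ",".join(fields), "-f", ",".join(fmt)]


cmd = build_query_cmd(["timestamp", "bus_id", "memory.used"], nounits=True)
print(shlex.join(cmd))
```

The resulting list can be passed to subprocess.run on a host where hl-smi is installed.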

hl-smi -Q Option

The following table describes --query-aip <types>:

Name

Description

Format

timestamp

The time at which hl-smi was invoked. The time zone is that of the system on which hl-smi runs.

Day-of-week Month Day HH:MM:SS Timezone Year. Example: Wed Oct 21 15:38:16 IDT 2020

name

The name of the Intel Gaudi board.

Alphanumeric string. Example: HL202

bus_id

The PCI address of the AIP.

Domain (DDDD):bus number (BB):device number (DD).function number (F). Example: 0000:01:00.0

driver_version

The host driver version.

Release number (XX.YY.ZZ)-SHA1 commit (XXXXXXXX). Example: 0.11.0-44980077

temperature.aip

Maximum temperature read from the four available temperature sensors on the SoC.

Maximum temperature in degrees Celsius. Example: 40 C

utilization.aip

Returns a simple utilization measurement that checks whether any of the available HW components was busy over a period of 1000 ms.

Utilization is given in percent format. Example: 80%

memory.total

The total size of available memory including free and used memory.

[Value] in MiB. Example: 32768 MiB

memory.free

Available size of unused memory.

[Value] in MiB. Example: 32256 MiB

memory.used

The size of used memory.

[Value] in MiB. Example: 512 MiB

module_id

Location of the card on the board.

Decimal string. Example: 0

index

Index of the AIP in the host.

A number (0, 1, ...). Example: 0

serial

The unique serial number of the AIP.

YYXXXXXXXX: 2 letters and 8 digits. Example: AJ42038679

uuid

The universally unique 128-bit identifier of the SoC.

Alphanumeric string made up of table_version, device_ID, FAB#, LOT#, Wafer#, X Coordinate, Y Coordinate. Example: 00P1-HL2000B0-14-P63B83-04-08-10

power.draw

The total power consumption of the AIP device is calculated using current and voltage samples. Power draw is current multiplied by voltage. The reported power data reflects the power consumption of the 54V power rail.

[Value] W. Example: 40W

ecc.errors.uncorrected.aggregate.total

Number of total uncorrected ECC events for all modules (SRAM, TPC, MME, etc.), which are of type DERR (double error). The total aggregated number of ECC errors is counted from the time the driver is loaded.

[Counter]. Example: 0

ecc.errors.uncorrected.volatile.total

Number of total uncorrected ECC events for all modules (SRAM, TPC, MME, etc.), which are of type DERR. The total volatile number of ECC errors is counted from the time a file descriptor is opened. Double bit errors are detected but not corrected.

[Counter]. Example: 0

ecc.errors.corrected.aggregate.total

Number of total corrected ECC events for all modules (SRAM, TPC, MME, etc.), which are of type SERR (single error). The total aggregated number of ECC errors is counted from the time the driver is loaded.

[Counter]. Example: 0

ecc.errors.corrected.volatile.total

Number of total corrected ECC events for all modules (SRAM, TPC, MME, etc.), which are of type SERR. The total volatile number of ECC errors is counted from the time a file descriptor is opened.

[Counter]. Example: 0

ecc.errors.dram.aggregate.total

Number of uncorrected HBM errors. The total aggregated number of ECC errors is counted from the time the driver is loaded.

[Counter]. Example: 0

ecc.mode.current

The ECC mode that the AIP is currently operating under.

Enabled/Disabled

ecc.mode.pending

The ECC mode that the AIP will operate under after the next reboot.

Enabled/Disabled

stats.violation.power

Duration of the latest power-related throttling event per device, in nanoseconds.

XXXXX nsec

stats.violation.thermal

Duration of the latest thermal-related throttling event per device, in nanoseconds.

XXXXX nsec
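Most entries in the Format column are a number followed by a unit (MiB, C, W, %, nsec). As a rough sketch of separating the two in a script, assuming the values look like the examples in the table, a small regular expression is enough. The name split_value is hypothetical.

```python
import re

# A number, optional whitespace, then an optional unit such as MiB, C, W or %.
_VALUE_RE = re.compile(r"^\s*(-?\d+(?:\.\d+)?)\s*([A-Za-z%]*)\s*$")


def split_value(text):
    """Split '32768 MiB' into (32768.0, 'MiB'); return None for non-numeric text."""
    match = _VALUE_RE.match(text)
    if match is None:
        return None
    return float(match.group(1)), match.group(2)


# Example values copied from the Format column of the table.
total, _ = split_value("32768 MiB")
free, _ = split_value("32256 MiB")
used, _ = split_value("512 MiB")
print(total == free + used)      # memory.total covers free plus used memory
print(split_value("80%"))
print(split_value("Enabled"))    # non-numeric fields such as ecc.mode.current
```

With the nounits format flag the unit part is simply empty, so the same helper still applies.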

Note

Gaudi 3 and Gaudi 2 have two power rails: 54V and 12V. The power.draw reported by hl-smi -Q reflects the power consumption of the 54V power rail only. The power consumption (Pwr: Usage/Cap) reported by hl-smi includes the combined power usage of both the 54V and 12V power rails. Therefore, the power data from the hl-smi -Q command with the power.draw query is typically lower than that obtained by running the hl-smi command.

Running hl-smi as Daemon

You can run hl-smi as a daemon with hl-smi dmon, selecting the device with -i or --device <pci-addr>:

Option

Description

-i, --device <pci-addr>

Opens a PTS for the firmware of the requested PCIe device.

Running hl-smi with topo Option

You can run hl-smi with the topo subcommand, hl-smi topo, using -c or -N:

Option

Description

-c, --cpu

Gets the ideal CPU affinity for the device. Relevant only with topo.

-N, --numa

Gets the NUMA affinity for the device. Relevant only with topo.

Example:

$ hl-smi topo -c -N
modID    CPU Affinity    NUMA Affinity
-----    ------------    -------------
0        0-27, 56-83      0
1        28-55, 84-111    1
2        28-55, 84-111    1
3        0-27, 56-83      0
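To use the affinity table programmatically (for example, to pin a worker process to the CPUs closest to its device), the output can be parsed as below. This is a sketch that assumes the three-column layout shown in the example; the function names are hypothetical.

```python
def expand_cpu_list(spec):
    """Expand a range list such as '0-27, 56-83' into a sorted list of CPU numbers."""
    cpus = []
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            low, high = part.split("-", 1)
            cpus.extend(range(int(low), int(high) + 1))
        else:
            cpus.append(int(part))
    return sorted(cpus)


def parse_topo(text):
    """Map module ID -> (CPU list, NUMA node) from `hl-smi topo -c -N` output."""
    result = {}
    for line in text.splitlines():
        fields = line.split()
        # Data rows start with a numeric module ID; skip the header and dashes.
        if len(fields) < 3 or not fields[0].isdigit():
            continue
        cpu_spec = " ".join(fields[1:-1])
        result[int(fields[0])] = (expand_cpu_list(cpu_spec), int(fields[-1]))
    return result


sample = """modID    CPU Affinity    NUMA Affinity
-----    ------------    -------------
0        0-27, 56-83      0
1        28-55, 84-111    1
2        28-55, 84-111    1
3        0-27, 56-83      0
"""
topo = parse_topo(sample)
print(len(topo[0][0]), topo[0][1])
```

On Linux, the CPU list for a module could then be handed to os.sched_setaffinity.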

hl-smi Command Line Examples

The following is an example of the tool command line:

$ hl-smi -q # when invoked as root, shows the fields that are otherwise N/A

The following are examples of the tool command line with CSV format:

Example 1:

$ hl-smi -Q timestamp,bus_id,memory.used -f csv
timestamp, bus_id, memory.used
Sun Oct 27 16:27:39 IST 201, 0000:01:00.0, 536870912 B
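The CSV output in Example 1 separates cells with a comma followed by a space. A minimal sketch of reading it with Python's csv module follows; the sample text is copied from Example 1, and parse_csv_output is a hypothetical name.

```python
import csv
import io


def parse_csv_output(text):
    """Parse `hl-smi -Q ... -f csv` output into a list of dicts keyed by the header."""
    # skipinitialspace drops the space that follows each comma.
    reader = csv.reader(io.StringIO(text), skipinitialspace=True)
    rows = [[cell.strip() for cell in row] for row in reader if row]
    if not rows:
        return []
    header, data = rows[0], rows[1:]
    return [dict(zip(header, row)) for row in data]


sample = (
    "timestamp, bus_id, memory.used\n"
    "Sun Oct 27 16:27:39 IST 201, 0000:01:00.0, 536870912 B\n"
)
records = parse_csv_output(sample)
print(records[0]["bus_id"], records[0]["memory.used"])
```

This relies on the header row being present, so it does not apply when noheader is passed to -f.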

Example 2:

$ hl-smi -Q timestamp,memory.used,memory.free -f csv,nounits -l
timestamp, memory.free, memory.used
Sun Oct 27 16:30:32 IST 201, 3758096384, 536870912
Sun Oct 27 16:30:37 IST 201, 3758096384, 536870912
Sun Oct 27 16:30:42 IST 201, 3758096384, 536870912
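With -l, hl-smi keeps printing one row per interval, so a consumer has to read the output line by line rather than wait for the command to finish. The generator below is a sketch of that pattern; it assumes one header line followed by data rows with no commas inside values, as in the sample above, and iter_samples is a hypothetical name.

```python
def iter_samples(lines):
    """Yield one dict per sample from looping `-f csv,nounits` output."""
    header = None
    for line in lines:
        cells = [cell.strip() for cell in line.split(",")]
        if cells == [""]:
            continue                  # ignore blank lines
        if header is None:
            header = cells            # the first non-blank line names the columns
            continue
        yield dict(zip(header, cells))


# Sample lines copied from Example 2.
sample_lines = [
    "timestamp, memory.free, memory.used",
    "Sun Oct 27 16:30:32 IST 201, 3758096384, 536870912",
    "Sun Oct 27 16:30:37 IST 201, 3758096384, 536870912",
    "Sun Oct 27 16:30:42 IST 201, 3758096384, 536870912",
]
samples = list(iter_samples(sample_lines))
peak_used = max(int(sample["memory.used"]) for sample in samples)
print(len(samples), peak_used)
```

In a live setup, the lines argument could be the stdout of a subprocess.Popen object opened with text=True, which yields rows as hl-smi emits them.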

Note

After running the query, copy the results and save them in a file with a .csv extension (for example, results.csv). Open the file in Excel to view the table.

[Figure: the saved results displayed as a table in Excel]
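The save step in the note can also be scripted instead of copied by hand. The sketch below writes captured CSV text to a file and trims the space after each comma so that spreadsheet cells come out clean. The name save_as_csv is hypothetical, and a temporary directory stands in for the real destination of results.csv.

```python
import csv
import io
import os
import tempfile


def save_as_csv(text, path):
    """Write hl-smi CSV text to `path`; return the number of rows written."""
    reader = csv.reader(io.StringIO(text), skipinitialspace=True)
    rows = [[cell.strip() for cell in row] for row in reader if row]
    with open(path, "w", newline="") as handle:
        csv.writer(handle).writerows(rows)
    return len(rows)


# Sample text copied from Example 1.
sample = (
    "timestamp, bus_id, memory.used\n"
    "Sun Oct 27 16:27:39 IST 201, 0000:01:00.0, 536870912 B\n"
)
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "results.csv")
    count = save_as_csv(sample, path)
    with open(path, newline="") as handle:
        written = handle.read()
print(count)
print(written.splitlines()[0])
```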

Example 3:

$ hl-smi topo -c -N
modID    CPU Affinity    NUMA Affinity
-----    ------------    -------------
0        0-27, 56-83      0
1        28-55, 84-111    1
2        28-55, 84-111    1
3        0-27, 56-83      0

Example 4:

$ hl-smi -n ports -i 0000:08:00.0
port 0: external
port 1: internal
port 2: internal
port 3: internal
port 4: internal
port 5: internal
port 6: internal
port 7: internal
port 8: external
port 9: external

Example 5:

$ hl-smi --nic=link -i 0000:08:00.0 -P 7,3,2
port 2: DOWN
port 3: UP
port 7: DOWN
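The port listings in Examples 4 and 5 share one shape, "port N: value", so a single parser can serve both. The following sketch assumes that shape holds; parse_port_lines is a hypothetical name.

```python
def parse_port_lines(text):
    """Map port number -> value from lines of the form 'port N: value'."""
    ports = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        words = name.split()
        # Accept only 'port <number>' before the colon; this skips the command line.
        if sep and len(words) == 2 and words[0] == "port" and words[1].isdigit():
            ports[int(words[1])] = value.strip()
    return ports


kinds = parse_port_lines(
    "port 0: external\nport 1: internal\nport 2: internal\nport 3: internal\n"
    "port 4: internal\nport 5: internal\nport 6: internal\nport 7: internal\n"
    "port 8: external\nport 9: external\n"
)
links = parse_port_lines("port 2: DOWN\nport 3: UP\nport 7: DOWN\n")

external = sorted(port for port, kind in kinds.items() if kind == "external")
down = sorted(port for port, state in links.items() if state == "DOWN")
print(external, down)
```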

Example 6:

$ hl-smi --query-row-replacement uuid,replaced_rows.address,replaced_rows.cause -f csv
uuid, address.hbm_idx, address.pc, address.sid, address.bank_idx, address.row_addr, replaced_rows.cause
00P3-HL2000B0-14-P63X13-03-03-07, 1, 14, 0, 0, 0, Single Bit ECC
00P3-HL2000B0-14-P63X13-03-03-07, 3, 10, 0, 0, 0, Double Bit ECC
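Row replacement records lend themselves to a quick tally by cause. This sketch uses csv.DictReader on the output of Example 6 and assumes the replaced_rows.cause column is present, as it is in that example; count_causes is a hypothetical name.

```python
import csv
import io
from collections import Counter


def count_causes(text):
    """Count replaced rows per cause from `--query-row-replacement ... -f csv` output."""
    reader = csv.DictReader(io.StringIO(text), skipinitialspace=True)
    return Counter(row["replaced_rows.cause"].strip() for row in reader)


# Sample text copied from Example 6.
sample = (
    "uuid, address.hbm_idx, address.pc, address.sid, address.bank_idx, "
    "address.row_addr, replaced_rows.cause\n"
    "00P3-HL2000B0-14-P63X13-03-03-07, 1, 14, 0, 0, 0, Single Bit ECC\n"
    "00P3-HL2000B0-14-P63X13-03-03-07, 3, 10, 0, 0, 0, Double Bit ECC\n"
)
causes = count_causes(sample)
print(causes["Single Bit ECC"], causes["Double Bit ECC"])
```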

Example 7:

$ hl-smi -q -d ROW_REPLACEMENT

================ HL-SMI LOG ================

Timestamp               : Mon Nov  1 12:13:16 IST 2021

Driver Version              : 1.2.0-5502e1915
HL-SMI Version              : hl-1.1.0-fw-32.3.0-6-g20afbf28-dirty (Sep 12 2021 - 10:49:45)

Attached AIPs               : 1

[0] AIP (hl0) 0000:01:00.0
    Replaced Rows
        Single Bit ECC      : 1
        Double Bit ECC      : 1
        Pending             : Yes
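The -q log is a set of "key : value" lines, with per-device lines grouped under a header that begins with a bracketed index. A rough sketch of splitting it into global and per-device fields is shown below. It assumes the layout of Example 7 and has not been checked against other -d types; parse_query_log is a hypothetical name.

```python
def parse_query_log(text):
    """Return (global_fields, {device_header: fields}) parsed from `hl-smi -q` text."""
    global_fields, devices = {}, {}
    current = global_fields
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.startswith("["):
            # A line such as '[0] AIP (hl0) 0000:01:00.0' opens a device section.
            current = devices.setdefault(stripped, {})
            continue
        key, sep, value = stripped.partition(":")
        if sep and value.strip():
            current[key.strip()] = value.strip()
    return global_fields, devices


# Sample text copied from Example 7.
sample = """================ HL-SMI LOG ================

Timestamp               : Mon Nov  1 12:13:16 IST 2021

Driver Version              : 1.2.0-5502e1915

Attached AIPs               : 1

[0] AIP (hl0) 0000:01:00.0
    Replaced Rows
        Single Bit ECC      : 1
        Double Bit ECC      : 1
        Pending             : Yes
"""
info, devices = parse_query_log(sample)
print(info["Driver Version"], devices["[0] AIP (hl0) 0000:01:00.0"]["Pending"])
```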