System Management Interface Tool (hl-smi)

This section describes hl-smi, the system management interface tool. The tool reports information about the Intel® Gaudi® AI accelerators and monitors their data.

hl-smi Utility Options

Note

Run hl-smi with sudo on the host (not inside a Docker container) to see the process IDs of all users on the system.

Running hl-smi without any options displays a summary table of the detected Gaudi devices.

$ hl-smi [<options>]

Use the -h argument to view all options. The table below lists the available options and their descriptions.

Option

Description

-L, --list-aips

Lists all Gaudi devices.

-i, --device <pci addr>

Acts on a specific PCIe device.

-r, --reset-aip

Triggers a reset of the Intel Gaudi AIP. Requires root.

Requires -i switch to target a specific device.

-p, --power-limit <watts>

Sets maximum power limit. Functional only when security is disabled.

-q, --query

Displays information for AIPs in the system.

-d, --display <type>

Displays only the selected type of information, which is one of the following:

PRODUCT, POWER, CLOCK, TEMPERATURE, FAN, BUS, MEMORY, NET, ROW_REPLACEMENT.

-l, --loop [seconds]

Continuously reports query data at the specified interval. The default interval is 5 seconds.

-h, --help

Prints this help message.

-v, --version

Prints version information and exits.

-D

Adds debug prints.

-Q, --query-aip <types>

Selects queries to be presented in csv format.

Possible queries:

timestamp, name, bus_id, driver_version, temperature.aip, module_id, utilization.aip, memory.total, memory.free, memory.used, index, serial, uuid, power.draw, ecc.errors.uncorrected.aggregate.total, ecc.errors.uncorrected.volatile.total, ecc.errors.corrected.aggregate.total, ecc.errors.corrected.volatile.total, ecc.errors.dram.aggregate.total, ecc.mode.current, ecc.mode.pending, stats.violation.power, stats.violation.thermal.

--query-row-replacement <types>

Displays row replacement info according to type selections:

uuid, replaced_rows.address, replaced_rows.cause.

-f, --format <types>

Displays the required information in csv format:

csv (mandatory), noheader, nounits

-c, --cpu

Gets the ideal CPU affinity for the device. Note: Relevant only with ‘topo’.

-N, --numa

Gets the NUMA affinity for the device. Note: Relevant only with ‘topo’.

-n, --nic <ports|link|stats>

Gets NIC port info (internal/external), link state (up/down), or statistics for the requested internal port(s). The stats selection is equivalent to ethtool -S <port>.

-P, --port [ports]

Selected ports (optional) for NIC info. In case ports are not specified, data will be retrieved for all ports.
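The options in this table can be combined. As a rough sketch, `-Q bus_id` with `-f csv,noheader` can supply the PCI addresses that `-i` expects. This assumes the output holds one address per line, possibly padded with spaces, which the table does not state explicitly. The helper name `list_bus_ids` is hypothetical and not part of hl-smi.

```shell
#!/bin/sh
# Hypothetical helper: normalize the output of
#   hl-smi -Q bus_id -f csv,noheader
# into one trimmed PCI address per line (assumes one device per line).
list_bus_ids() {
    sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//' -e '/^$/d'
}

# Only runs on a host where hl-smi is installed: query each device in turn.
if command -v hl-smi >/dev/null 2>&1; then
    hl-smi -Q bus_id -f csv,noheader | list_bus_ids | while read -r addr; do
        hl-smi -q -i "$addr"
    done
fi
```

Trimming first keeps stray padding out of the value handed to `-i`.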

hl-smi -Q Option

The following table describes --query-aip <types>:

Name

Description

Format

timestamp

The time at which hl-smi was invoked. The timezone is that of the system hl-smi runs on.

Day-of-week Month Day HH:MM:SS Timezone Year. Example: Wed Oct 21 15:38:16 IDT 2020

name

The name of the Intel Gaudi board.

Alphanumeric string. Example: HL202

bus_id

The PCI address of the AIP.

Domain (DDDD):bus number (BB):device number (DD).function number (F). Example: 0000:01:00.0

driver_version

The host driver version.

Release number(XX.YY.ZZ)-SHA1-commit(XXXXXX). Example: 0.11.0-44980077

temperature.aip

Maximum temperature read from the four available temperature sensors on the SoC.

Maximum temperature in degrees Celsius. Example: 40 C

utilization.aip

Returns a simple utilization measurement that checks whether any of the available HW components was busy over a period of 1000 msec.

Utilization is given in percent format. Example: 80%

memory.total

The total size of available memory including free and used memory.

[Value] in MiB. Example: 32768 MiB

memory.free

Available size of unused memory.

[Value] in MiB. Example: 32256 MiB

memory.used

The size of used memory.

[Value] in MiB. Example: 512 MiB

module_id

Location of card in board.

Decimal string. Example: 0

index

Index of the AIP in the host.

A number (0, 1, ...). Example: 0

serial

The unique serial number of the AIP.

YYXXXXXXXX: 2 letters followed by 8 digits. Example: AJ42038679

uuid

The universally unique 128-bit identifier of the SoC.

Alphanumeric string made up of table_version, device_ID, FAB#, LOT#, Wafer#, X Coordinate, Y Coordinate. Example: 00P1-HL2000B0-14-P63B83-04-08-10

power.draw

The total power consumption of the AIP device, calculated from current and voltage samples (power draw is current multiplied by voltage).

[Value] W. Example: 40W

ecc.errors.uncorrected.aggregate.total

Number of total ECC events for all modules (SRAM, TPC, MME, etc.), which are of type SERR (single error). The total aggregated number of ECC errors is counted from the time the driver is loaded.

[Counter]. Example: 0

ecc.errors.uncorrected.volatile.total

Number of total ECC events for all modules (SRAM, TPC, MME, etc), which are of type DERR. The total volatile number of ECC errors is counted from the time a file descriptor is opened. Double bit errors are detected but not corrected.

[Counter]. Example: 0

ecc.errors.corrected.aggregate.total

Number of total corrected ECC events for all modules (SRAM, TPC, MME, etc.), which are of type SERR (single error). The total aggregated number of ECC errors is counted from the time the driver is loaded.

[Counter]. Example: 0

ecc.errors.corrected.volatile.total

Number of total corrected ECC events for all modules (SRAM, TPC, MME, etc), which are of type DERR. The total volatile number of ECC errors is counted from the time a file descriptor is opened.

[Counter]. Example: 0

ecc.errors.dram.aggregate.total

Number of uncorrected HBM errors. The total aggregated number of ECC errors is counted from the time the driver is loaded.

[Counter]. Example: 0

ecc.mode.current

The ECC mode that the AIP is currently operating under.

Enabled/Disabled

ecc.mode.pending

The ECC mode that the AIP will operate on after the next reboot.

Enabled/Disabled

stats.violation.power

Duration of the latest power-related throttling event per device (ns).

XXXXX nsec

stats.violation.thermal

Duration of the latest thermal-related throttling event per device (ns).

XXXXX nsec
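The fields in this table can be combined in a script. For example, memory.used and memory.total give a per-device usage percentage. The sketch below assumes that `-f csv,nounits` prints a header line followed by one comma-separated row per device. It locates columns by header name rather than by position, since the column order of the output is not guaranteed here. The helper name `mem_percent` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: read the assumed output of
#   hl-smi -Q index,memory.total,memory.used -f csv,nounits
# and print the used-memory percentage for each device.
mem_percent() {
    awk -F',' '
        function trim(s) { gsub(/^[ \t]+|[ \t]+$/, "", s); return s }
        NR == 1 {
            # Map header names to column numbers rather than trusting order.
            for (i = 1; i <= NF; i++) col[trim($i)] = i
            next
        }
        {
            used  = trim($(col["memory.used"])) + 0
            total = trim($(col["memory.total"])) + 0
            if (total > 0)
                printf "device %s: %.1f%% used\n", trim($(col["index"])), 100 * used / total
        }'
}
```

On a host with hl-smi installed, the helper would be used as `hl-smi -Q index,memory.total,memory.used -f csv,nounits | mem_percent`.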

hl-smi Command Line Examples

The following is an example of the tool command line:

$ hl-smi -q # if invoked as root, shows all fields that would otherwise be N/A

The following examples show the tool command line with CSV output:

Example 1:

$ hl-smi -Q timestamp,bus_id,memory.used -f csv
timestamp, bus_id, memory.used
Sun Oct 27 16:27:39 IST 201, 0000:01:00.0, 536870912 B
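The sample output above reports memory.used in bytes, while the format table lists MiB. If a script receives a byte value such as the one shown, it can be converted with integer arithmetic. This is a minimal sketch; the helper name `bytes_to_mib` is hypothetical, and the input is assumed to be a number optionally followed by a unit suffix.

```shell
#!/bin/sh
# Hypothetical helper: convert a value such as "536870912 B" to MiB.
bytes_to_mib() {
    # Keep only the digits, then divide by 1024 * 1024.
    bytes=$(printf '%s' "$1" | tr -cd '0-9')
    echo "$(( ${bytes:-0} / 1048576 )) MiB"
}

bytes_to_mib '536870912 B'   # prints "512 MiB"
```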

Example 2:

$ hl-smi -Q timestamp,memory.used,memory.free -f csv,nounits -l
timestamp, memory.free, memory.used
Sun Oct 27 16:30:32 IST 201, 3758096384, 536870912
Sun Oct 27 16:30:37 IST 201, 3758096384, 536870912
Sun Oct 27 16:30:42 IST 201, 3758096384, 536870912

Note

After running the query, save the results in a file with a .csv extension (for example, results.csv) and open the file in Excel to view the data as a table.
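Instead of copying the output by hand, shell redirection can write the query results straight to a file. A minimal sketch follows. The function name `save_snapshot` and the default file name are illustrative and not part of hl-smi; the query fields are the ones used in Example 1.

```shell
#!/bin/sh
# Hypothetical wrapper: write one CSV snapshot to the given file
# (defaults to results.csv in the current directory).
save_snapshot() {
    out=${1:-results.csv}
    hl-smi -Q timestamp,bus_id,memory.used -f csv > "$out"
}

# Usage on a host with hl-smi installed:
#   save_snapshot /tmp/results.csv
```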

Example 3:

$ hl-smi topo -c -N
modID    CPU Affinity    NUMA Affinity
-----    ------------    -------------
0        0-27, 56-83      0
1        28-55, 84-111    1
2        28-55, 84-111    1
3        0-27, 56-83      0

Example 4:

$ hl-smi -n ports  -i  0000:08:00.0
port 0: external
port 1: internal
port 2: internal
port 3: internal
port 4: internal
port 5: internal
port 6: internal
port 7: internal
port 8: external
port 9: external

Example 5:

$ hl-smi --nic=link  -i  0000:08:00.0 -P 7,3,2
port 2: DOWN
port 3: UP
port 7: DOWN
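Link-state output in this shape is easy to filter in a script, for instance to list only the ports that are down. The sketch below assumes every line follows the `port <n>: <STATE>` pattern shown above. The helper name `down_ports` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print the numbers of ports reported as DOWN in the
# output of `hl-smi --nic=link -i <pci addr>`.
down_ports() {
    # Split on runs of spaces and colons: "port 2: DOWN" -> port, 2, DOWN.
    awk -F'[ :]+' '$1 == "port" && $3 == "DOWN" { print $2 }'
}

# Usage on a host with hl-smi installed:
#   hl-smi --nic=link -i 0000:08:00.0 | down_ports
```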

Example 6:

$ hl-smi --query-row-replacement uuid,replaced_rows.address,replaced_rows.cause -f csv
uuid, address.hbm_idx, address.pc, address.sid, address.bank_idx, address.row_addr, replaced_rows.cause
00P3-HL2000B0-14-P63X13-03-03-07, 1, 14, 0, 0, 0, Single Bit ECC
00P3-HL2000B0-14-P63X13-03-03-07, 3, 10, 0, 0, 0, Double Bit ECC

Example 7:

$ hl-smi -q -d ROW_REPLACEMENT

 ================ HL-SMI LOG ================

 Timestamp               : Mon Nov  1 12:13:16 IST 2021

 Driver Version              : 1.2.0-5502e1915
HL-SMI Version               : hl-1.1.0-fw-32.3.0-6-g20afbf28-dirty (Sep 12 2021 - 10:49:45)

 Attached AIPs               : 1

 [0] AIP (hl0) 0000:01:00.0
     Replaced Rows
         Single Bit ECC      : 1
         Double Bit ECC      : 1
         Pending             : Yes
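A monitoring script might check this log for a pending row replacement. The sketch below assumes the `Pending` line keeps the `Pending : Yes` layout shown above. It makes no claim about what action a pending state requires. The helper name `has_pending` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: exit with status 0 if the output of
#   hl-smi -q -d ROW_REPLACEMENT
# contains a "Pending : Yes" line, and with status 1 otherwise.
has_pending() {
    awk -F':' '
        /Pending/ {
            v = $2
            gsub(/[ \t]/, "", v)
            if (v == "Yes") found = 1
        }
        END { exit (found ? 0 : 1) }'
}

# Usage on a host with hl-smi installed:
#   hl-smi -q -d ROW_REPLACEMENT | has_pending && echo "row replacement pending"
```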

Running hl-smi as Daemon

You can run hl-smi as a daemon with hl-smi dmon, using -i or --device <pci-addr>:

Option

Description

-i, --device <pci-addr>

Opens a PTS for the firmware of the requested PCIe device.

Running hl-smi with topo Option

You can run hl-smi with the topo option (hl-smi topo), using -c or -N:

Option

Description

-c, --cpu

Gets the ideal CPU affinity for the device. Note: Relevant only with ‘topo’.

-N, --numa

Gets the NUMA affinity for the device. Note: Relevant only with ‘topo’.

Example:

$ hl-smi topo -c -N
modID    CPU Affinity    NUMA Affinity
-----    ------------    -------------
0        0-27, 56-83      0
1        28-55, 84-111    1
2        28-55, 84-111    1
3        0-27, 56-83      0
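The affinity table can feed a CPU-pinning tool. The sketch below extracts the CPU list for one module ID, assuming both `-c` and `-N` were passed so that the last column is the NUMA node. The helper name `cpus_for_module` and the application `./my_app` are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print the CPU list for one module ID from the output
# of `hl-smi topo -c -N`, with spaces removed (for example "0-27,56-83").
cpus_for_module() {
    awk -v id="$1" '
        $1 == id {
            # Skip the first field (modID) and the last (NUMA); join the rest.
            list = ""
            for (i = 2; i < NF; i++) list = list $i
            print list
        }'
}

# Usage on a host with hl-smi installed (taskset is part of util-linux):
#   taskset -c "$(hl-smi topo -c -N | cpus_for_module 0)" ./my_app
```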