System Management Interface Tool (hl-smi)

This section describes the system management interface tool, hl-smi. The tool reports information about the Intel® Gaudi® AI accelerators and monitors their data.

hl-smi Utility Options

Note

Run sudo hl-smi as root on the host (not inside a Docker container) to see the process IDs of all users on the system.

Running hl-smi without any options displays a summary table of the detected Gaudi devices.

$ hl-smi [<options>]

Use the -h argument to view all options. The following table lists the available options and their descriptions.

| Option | Description |
| --- | --- |
| -L, --list-aips | Lists all Gaudi devices. |
| -i, --device <pci addr> | Acts on a specific PCIe device. |
| -r, --reset-aip | Triggers a reset of the Intel Gaudi AIP. Requires root and the -i switch to target a specific device. |
| -p, --power-limit <watts> | Sets the maximum power limit. |
| -q, --query | Displays information for the AIPs in the system. |
| -d, --display <type> | Displays only the selected type of information. Possible values: PRODUCT, POWER, CLOCK, TEMPERATURE, FAN, BUS, MEMORY, NET, ROW_REPLACEMENT. |
| -l, --loop [seconds] | Continuously reports query data at the specified interval. The default interval is 5 seconds. |
| -h, --help | Prints the help message. |
| -v, --version | Prints version information and exits. |
| -D | Adds debug prints. |
| -Q, --query-aip <types> | Selects the queries to present in CSV format. Possible queries: timestamp, name, bus_id, driver_version, temperature.aip, module_id, utilization.aip, memory.total, memory.free, memory.used, index, serial, uuid, power.draw, ecc.errors.uncorrected.aggregate.total, ecc.errors.uncorrected.volatile.total, ecc.errors.corrected.aggregate.total, ecc.errors.corrected.volatile.total, ecc.errors.dram.aggregate.total, ecc.mode.current, ecc.mode.pending, stats.violation.power, stats.violation.thermal. |
| --query-row-replacement <types> | Displays row replacement information for the selected types: uuid, replaced_rows.address, replaced_rows.cause. |
| -f, --format <types> | Displays the requested information in CSV format. Format types: csv (mandatory), noheader, nounits. |
| -c, --cpu | Gets the ideal CPU affinity for the device. Relevant only with topo. |
| -N, --numa | Gets the NUMA affinity for the device. Relevant only with topo. |
| -n, --nic <ports\|link\|stats> | Gets NIC port information (internal/external), link state (up/down), or statistics for the requested internal ports. The stats option is equivalent to ethtool -S <port>. |
| -P, --port [ports] | Selects the ports for NIC information (optional). If no ports are specified, data is retrieved for all ports. |
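As a minimal sketch, the options listed above can be assembled into a command line programmatically. The helper name build_query_command and its parameters are hypothetical, and the sketch only composes an argument list; it assumes hl-smi is on the PATH and that -i and -l can be combined with -Q, which the option descriptions suggest but do not state outright.

```python
from typing import List, Optional, Sequence


def build_query_command(
    fields: Sequence[str],
    device: Optional[str] = None,
    loop_seconds: Optional[int] = None,
    header: bool = True,
    units: bool = True,
) -> List[str]:
    """Compose an hl-smi argument list for a CSV query (hypothetical helper)."""
    if not fields:
        raise ValueError("at least one query field is required")
    # 'csv' is mandatory for -f; noheader and nounits are optional modifiers.
    fmt = ["csv"]
    if not header:
        fmt.append("noheader")
    if not units:
        fmt.append("nounits")
    cmd = ["hl-smi", "-Q", ",".join(fields), "-f", ",".join(fmt)]
    if device is not None:
        # Assumption: -i narrows the query to one PCI address, e.g. 0000:01:00.0.
        cmd += ["-i", device]
    if loop_seconds is not None:
        # Assumption: -l takes the reporting interval in seconds.
        cmd += ["-l", str(loop_seconds)]
    return cmd


if __name__ == "__main__":
    print(" ".join(build_query_command(["timestamp", "bus_id", "memory.used"])))
```

On a host with Gaudi devices, the returned list could be passed to subprocess.run(cmd, capture_output=True, text=True) to collect the output as text.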

hl-smi -Q Option

The following table describes --query-aip <types>:

| Name | Description | Format |
| --- | --- | --- |
| timestamp | The time at which hl-smi was invoked. The time zone is that of the system on which hl-smi runs. | Day-of-week Month Day HH:MM:SS Timezone Year. Example: Wed Oct 21 15:38:16 IDT 2020 |
| name | The name of the Intel Gaudi board. | Alphanumeric string. Example: HL202 |
| bus_id | The PCI address of the AIP. | Domain (DDDD):bus number (BB):device number (DD).function number (F). Example: 0000:01:00.0 |
| driver_version | The host driver version. | Release number (XX.YY.ZZ)-SHA1 commit (XXXXXX). Example: 0.11.0-44980077 |
| temperature.aip | Maximum temperature read from the four available temperature sensors on the SoC. | Maximum temperature in degrees Celsius. Example: 40 C |
| utilization.aip | A simple utilization measurement that checks whether any of the available hardware components was busy over a period of 1000 ms. | Utilization in percent. Example: 80% |
| memory.total | The total memory size (free plus used). | [Value] in MiB. Example: 32768 MiB |
| memory.free | The size of unused memory. | [Value] in MiB. Example: 32256 MiB |
| memory.used | The size of used memory. | [Value] in MiB. Example: 512 MiB |
| module_id | Location of the card on the board. | Decimal string. Example: 0 |
| index | Index of the AIP in the host. | A number (0, 1, ...). Example: 0 |
| serial | The unique serial number of the AIP. | YYXXXXXXXX: 2 letters and 8 digits. Example: AJ42038679 |
| uuid | The universally unique 128-bit identifier of the SoC. | Alphanumeric string made up of table_version, device_ID, FAB#, LOT#, Wafer#, X Coordinate, Y Coordinate. Example: 00P1-HL2000B0-14-P63B83-04-08-10 |
| power.draw | The total power consumption of the AIP device, calculated from current and voltage samples (power draw is current multiplied by voltage). | [Value] W. Example: 40W |
| ecc.errors.uncorrected.aggregate.total | Total number of ECC events of type SERR (single error) for all modules (SRAM, TPC, MME, etc.). The aggregate count starts when the driver is loaded. | [Counter]. Example: 0 |
| ecc.errors.uncorrected.volatile.total | Total number of ECC events of type DERR for all modules (SRAM, TPC, MME, etc.). In the current version it reports SERR instead, which appears to be a bug. The volatile count starts when a file descriptor is opened. Double-bit errors are detected but not corrected. | [Counter]. Example: 0 |
| ecc.errors.corrected.aggregate.total | Total number of corrected ECC events of type SERR (single error) for all modules (SRAM, TPC, MME, etc.). The aggregate count starts when the driver is loaded. | [Counter]. Example: 0 |
| ecc.errors.corrected.volatile.total | Total number of corrected ECC events of type DERR for all modules (SRAM, TPC, MME, etc.). In the current version it reports SERR instead, which appears to be a bug. The volatile count starts when a file descriptor is opened. | [Counter]. Example: 0 |
| ecc.errors.dram.aggregate.total | Number of uncorrected HBM errors. The aggregate count starts when the driver is loaded. | [Counter]. Example: 0 |
| ecc.mode.current | The ECC mode under which the AIP is currently operating. | Enabled/Disabled |
| ecc.mode.pending | The ECC mode under which the AIP will operate after the next reboot. | Enabled/Disabled |
| stats.violation.power | Duration of the latest power-related throttling event per device, in nanoseconds. | XXXXX nsec |
| stats.violation.thermal | Duration of the latest thermal-related throttling event per device, in nanoseconds. | XXXXX nsec |
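The formats described above can be converted into typed values once the query output has been read. The following is a rough sketch under stated assumptions: the helper names parse_quantity, is_pci_address, and parse_timestamp are hypothetical, the time zone token is returned as plain text rather than interpreted, and the day and month names are assumed to be English.

```python
import re
from datetime import datetime
from typing import Tuple

# A number followed by an optional unit, with or without a space ("40 C", "40W", "80%").
_QUANTITY = re.compile(r"^\s*(-?\d+(?:\.\d+)?)\s*([A-Za-z%]*)\s*$")
# Domain:bus:device.function, as in the bus_id example 0000:01:00.0.
_PCI_ADDRESS = re.compile(r"^[0-9a-fA-F]{4}:[0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-7]$")


def parse_quantity(text: str) -> Tuple[float, str]:
    """Split a field such as '32768 MiB' into (32768.0, 'MiB')."""
    match = _QUANTITY.match(text)
    if match is None:
        raise ValueError(f"not a numeric field: {text!r}")
    return float(match.group(1)), match.group(2)


def is_pci_address(text: str) -> bool:
    """Return True if text looks like a full PCI address."""
    return _PCI_ADDRESS.match(text.strip()) is not None


def parse_timestamp(text: str) -> Tuple[datetime, str]:
    """Parse 'Wed Oct 21 15:38:16 IDT 2020' into (naive datetime, zone name)."""
    parts = text.split()
    if len(parts) != 6:
        raise ValueError(f"unexpected timestamp layout: {text!r}")
    zone = parts[4]
    # Zone abbreviations are ambiguous, so the zone is kept as text only.
    naive = datetime.strptime(
        " ".join(parts[:4] + parts[5:]), "%a %b %d %H:%M:%S %Y"
    )
    return naive, zone
```

Fields such as ecc.mode.current hold Enabled/Disabled rather than a number, so parse_quantity raises ValueError for them by design.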

hl-smi Command Line Examples

The following is an example of the tool command line:

$ hl-smi -q # when run as root, also shows the fields that are otherwise N/A

The following examples show the tool command line with CSV format:

Example 1:

$ hl-smi -Q timestamp,bus_id,memory.used -f csv
timestamp, bus_id, memory.used
Sun Oct 27 16:27:39 IST 201, 0000:01:00.0, 536870912 B
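The output in Example 1 separates fields with a comma followed by a space, which is worth accounting for when reading it programmatically. A minimal sketch using Python's csv module follows; the function name parse_csv_output is hypothetical, and SAMPLE is copied verbatim from Example 1.

```python
import csv
import io
from typing import Dict, List

# Sample text copied verbatim from Example 1.
SAMPLE = """\
timestamp, bus_id, memory.used
Sun Oct 27 16:27:39 IST 201, 0000:01:00.0, 536870912 B
"""


def parse_csv_output(text: str) -> List[Dict[str, str]]:
    """Turn hl-smi CSV text (with a header row) into one dict per device row."""
    # skipinitialspace drops the blank that follows each comma.
    reader = csv.DictReader(io.StringIO(text), skipinitialspace=True)
    return [dict(row) for row in reader]


if __name__ == "__main__":
    for row in parse_csv_output(SAMPLE):
        print(row["bus_id"], row["memory.used"])
```

This relies on the header row, so it does not apply to output produced with the noheader format type.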

Example 2:

$ hl-smi -Q timestamp,memory.used,memory.free -f csv,nounits -l
timestamp, memory.free, memory.used
Sun Oct 27 16:30:32 IST 201, 3758096384, 536870912
Sun Oct 27 16:30:37 IST 201, 3758096384, 536870912
Sun Oct 27 16:30:42 IST 201, 3758096384, 536870912

Note

After running the query, copy the results and save them in a file with a .csv extension (for example, results.csv). Open the file in Excel to view the table:

[Image: excel_table.jpg]
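Instead of copying the results by hand, the captured text can be written to a .csv file directly. The sketch below is one possible approach, not part of the tool: save_query_output is a hypothetical helper that takes text already captured from hl-smi, drops the space after each comma so that spreadsheet cells do not start with a blank, and writes the rows to the given path.

```python
import csv
import io
import os
import tempfile


def save_query_output(text: str, path: str) -> int:
    """Write hl-smi CSV text to a file and return the number of rows written."""
    reader = csv.reader(io.StringIO(text), skipinitialspace=True)
    rows = [row for row in reader if row]  # ignore blank lines
    with open(path, "w", newline="", encoding="utf-8") as handle:
        csv.writer(handle).writerows(rows)
    return len(rows)


if __name__ == "__main__":
    # Sample text copied from Example 2; a temporary directory keeps the demo tidy.
    sample = (
        "timestamp, memory.free, memory.used\n"
        "Sun Oct 27 16:30:32 IST 201, 3758096384, 536870912\n"
    )
    target = os.path.join(tempfile.mkdtemp(), "results.csv")
    print(save_query_output(sample, target), "rows written to", target)
```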

Example 3:

$ hl-smi topo -c -N
modID    CPU Affinity    NUMA Affinity
-----    ------------    -------------
0        0-27, 56-83      0
1        28-55, 84-111    1
2        28-55, 84-111    1
3        0-27, 56-83      0

Example 4:

$ hl-smi -n ports -i 0000:08:00.0
port 0: external
port 1: internal
port 2: internal
port 3: internal
port 4: internal
port 5: internal
port 6: internal
port 7: internal
port 8: external
port 9: external

Example 5:

$ hl-smi --nic=link -i 0000:08:00.0 -P 7,3,2
port 2: DOWN
port 3: UP
port 7: DOWN
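The port listings in Examples 4 and 5 share the same "port N: value" layout, so a single parser can cover both. This is a minimal sketch that assumes the layout shown in those examples; the function name parse_port_lines is hypothetical.

```python
import re
from typing import Dict

# Matches lines such as "port 2: DOWN" or "port 0: external".
_PORT_LINE = re.compile(r"^port\s+(\d+):\s*(\S+)$")


def parse_port_lines(text: str) -> Dict[int, str]:
    """Map each port number to the value printed after the colon."""
    ports: Dict[int, str] = {}
    for line in text.splitlines():
        match = _PORT_LINE.match(line.strip())
        if match:  # lines in any other layout are skipped
            ports[int(match.group(1))] = match.group(2)
    return ports


if __name__ == "__main__":
    # Sample text copied from Example 5.
    links = parse_port_lines("port 2: DOWN\nport 3: UP\nport 7: DOWN\n")
    print(sorted(port for port, state in links.items() if state == "UP"))
```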

Example 6:

$ hl-smi --query-row-replacement uuid,replaced_rows.address,replaced_rows.cause -f csv
uuid, address.hbm_idx, address.pc, address.sid, address.bank_idx, address.row_addr, replaced_rows.cause
00P3-HL2000B0-14-P63X13-03-03-07, 1, 14, 0, 0, 0, Single Bit ECC
00P3-HL2000B0-14-P63X13-03-03-07, 3, 10, 0, 0, 0, Double Bit ECC

Example 7:

$ hl-smi -q -d ROW_REPLACEMENT

================ HL-SMI LOG ================

Timestamp                   : Mon Nov  1 12:13:16 IST 2021

Driver Version              : 1.2.0-5502e1915
HL-SMI Version              : hl-1.1.0-fw-32.3.0-6-g20afbf28-dirty (Sep 12 2021 - 10:49:45)

Attached AIPs               : 1

[0] AIP (hl0) 0000:01:00.0
    Replaced Rows
        Single Bit ECC      : 1
        Double Bit ECC      : 1
        Pending             : Yes

Running hl-smi as Daemon

You can run hl-smi as a daemon with hl-smi dmon, using -i or --device <pci-addr>:

| Option | Description |
| --- | --- |
| -i, --device <pci-addr> | Opens a PTS for the firmware of the requested PCIe device. |

Running hl-smi with topo Option

You can run hl-smi with the topo command (hl-smi topo), using -c or -N:

| Option | Description |
| --- | --- |
| -c, --cpu | Gets the ideal CPU affinity for the device. Relevant only with topo. |
| -N, --numa | Gets the NUMA affinity for the device. Relevant only with topo. |

Example:

$ hl-smi topo -c -N
modID    CPU Affinity    NUMA Affinity
-----    ------------    -------------
0        0-27, 56-83      0
1        28-55, 84-111    1
2        28-55, 84-111    1
3        0-27, 56-83      0
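The topo output above can be turned into per-module CPU lists, for example as a first step toward pinning a worker process to the CPUs closest to its device. The sketch below assumes the three-column layout shown in the example, with columns separated by two or more spaces; the names expand_cpu_ranges and parse_topo are hypothetical.

```python
import re
from typing import Dict, List, Tuple


def expand_cpu_ranges(spec: str) -> List[int]:
    """Expand a range list such as '0-27, 56-83' into individual CPU numbers."""
    cpus: List[int] = []
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start, end = part.split("-", 1)
            cpus.extend(range(int(start), int(end) + 1))  # ranges are inclusive
        else:
            cpus.append(int(part))
    return cpus


def parse_topo(text: str) -> Dict[int, Tuple[List[int], int]]:
    """Map modID to (CPU list, NUMA node) from 'hl-smi topo -c -N' output."""
    result: Dict[int, Tuple[List[int], int]] = {}
    for line in text.splitlines():
        # Columns are separated by runs of two or more spaces.
        columns = re.split(r"\s{2,}", line.strip())
        if len(columns) != 3 or not columns[0].isdigit():
            continue  # skips the header and the dashed separator
        result[int(columns[0])] = (expand_cpu_ranges(columns[1]), int(columns[2]))
    return result
```

On Linux, the CPU list for a module could then be passed to os.sched_setaffinity(0, cpus) to restrict the current process to those CPUs.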