System Management Interface Tool (hl-smi)
This section describes the system management interface tool, hl-smi. The tool reports information about Intel® Gaudi® AI accelerators and monitors their data.
hl-smi Utility Options
Note
Run sudo hl-smi as root on the host (not inside a Docker container) to see the process IDs of all users on the system.
Running hl-smi without any options displays a summary table of the detected Gaudi devices.
$ hl-smi [<options>]
Use the -h argument to view all options.
The following table lists the available options and their descriptions.
Option | Description |
---|---|
-L, --list-aips | Lists all Gaudi devices. |
-i, --device <pci-addr> | Acts on a specific PCIe device. |
--fw-versions | Displays versions of the FW components. |
-r, --reset-aip | Triggers a reset of the Intel Gaudi AIP. Requires root. Requires |
-p, --power-max <watts> | Sets maximum power limit. For Gaudi 3: |
-q, --query | Displays information for AIPs in the system. |
-d, --display <type> | Displays only one of the following selected information: PRODUCT, POWER, CLOCK, TEMPERATURE, FAN, BUS, MEMORY, NET, ROW_REPLACEMENT. |
-l, --loop [seconds] | Continuously reports query data at the specified interval. The default interval is 5 seconds. |
-h, --help | Prints the help message. |
-v, --version | Prints version information and exits. |
-D | Adds debug prints. |
-Q, --query-aip <types> | Selects queries to be presented in CSV format. Possible queries: timestamp, name, bus_id, driver_version, temperature.aip, module_id, utilization.aip, memory.total, memory.free, memory.used, index, serial, uuid, power.draw, ecc.errors.uncorrected.aggregate.total, ecc.errors.uncorrected.volatile.total, ecc.errors.corrected.aggregate.total, ecc.errors.corrected.volatile.total, ecc.errors.dram.aggregate.total, ecc.errors.dram-corrected.aggregate.total, ecc.errors.dram.volatile.total, ecc.errors.dram-corrected.volatile.total, ecc.mode.current, ecc.mode.pending, stats.violation.power, stats.violation.thermal, clocks.current.soc, clocks.max.soc, clocks.limit.soc, clocks.limit.tpc, pcie.link.gen.max, pcie.link.gen.current, pcie.link.width.max, pcie.link.width.current, pcie.link.speed.max, pcie.link.speed.current, ecc.errors.hbm.sram.critical. |
--query-row-replacement <types> | Displays row replacement information according to type selections: uuid, replaced_rows.address, replaced_rows.cause. |
-f, --format <types> | Displays the requested information in CSV format: csv (mandatory), noheader, nounits. |
-n, --nic <ports|link|stats> | Retrieves NIC port information (internal/external), including link state (up/down) or statistics for the requested internal port(s). The |
-P, --port [ports] | Specifies the selected ports (optional) for NIC information. If no ports are specified, data is retrieved for all ports. |
-t, --tpm | Provides data for further TPM validation. The output includes: |
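As a rough illustration of how the -Q, -f, -i, and -l options combine, the following Python sketch assembles an argument list for a CSV query. The helper name build_query_command is hypothetical, and combining -i with -Q is an assumption not confirmed by the table above; the sketch only builds the command and does not run it.

```python
import shlex


def build_query_command(fields, device=None, noheader=False, nounits=False, loop=None):
    """Return an argv list for an `hl-smi -Q ... -f csv` call (hypothetical helper)."""
    if not fields:
        raise ValueError("at least one query field is required")
    fmt = ["csv"]  # the options table marks csv as mandatory for -f
    if noheader:
        fmt.append("noheader")
    if nounits:
        fmt.append("nounits")
    argv = ["hl-smi", "-Q", ",".join(fields), "-f", ",".join(fmt)]
    if device is not None:
        # Assumption: -i can be combined with -Q to target one PCIe device.
        argv += ["-i", device]
    if loop is not None:
        argv += ["-l", str(loop)]  # reporting interval in seconds
    return argv


cmd = build_query_command(["timestamp", "bus_id", "memory.used"], nounits=True)
print(shlex.join(cmd))
```

The resulting list could be passed to subprocess.run on a host where hl-smi is installed.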
hl-smi -Q Option
The following table describes the --query-aip <types> values:
Name | Description | Format |
---|---|---|
timestamp | The timestamp at the time in which the | Day-of-week Month Day HH:MM:SS Year. Example: Wed Oct 21 15:38:16 IDT 2020 |
name | The name of the Intel Gaudi board. | Alphanumeric string. Example: HL202 |
bus_id | The PCI address of the AIP. | Domain(DDDD):bus number(BB):device number(DD).function number(F). Example: 0000:01:00.0 |
driver_version | The host driver version. | Release number(XX.YY.ZZ)-SHA1-commit(XXXXXX). Example: 0.11.0-44980077 |
temperature.aip | Maximum temperature read from the four available temperature sensors on the SoC. | Maximum temperature in degrees Celsius. Example: 40 C |
utilization.aip | A simple utilization measurement that checks whether any of the available HW components was busy over a period of 1000 ms. | Utilization as a percentage. Example: 80% |
memory.total | The total size of available memory, including free and used memory. | [Value] in MiB. Example: 32768 MiB |
memory.free | The size of unused memory. | [Value] in MiB. Example: 32256 MiB |
memory.used | The size of used memory. | [Value] in MiB. Example: 512 MiB |
module_id | Location of the card on the board. | Decimal string. Example: 0 |
index | Index of the AIP in the host. | A number (0, 1, ...). Example: 0 |
serial | The unique serial number of the AIP. | YYXXXXXXXX: 2 letters and 8 digits. Example: AJ42038679 |
uuid | The universally unique 128-bit identifier of the SoC. | Alphanumeric string made up of table_version, device_ID, FAB#, LOT#, Wafer#, X Coordinate, Y Coordinate. Example: 00P1-HL2000B0-14-P63B83-04-08-10 |
power.draw | The total power consumption of the AIP device, calculated from current and voltage samples (power draw is current multiplied by voltage). The reported value reflects the power consumption of the 54V power rail. | [Value] W. Example: 40W |
ecc.errors.uncorrected.aggregate.total | Number of total ECC events for all modules (SRAM, TPC, MME, etc.) of type SERR (single error). The aggregate number of ECC errors is counted from the time the driver is loaded. | [Counter]. Example: 0 |
ecc.errors.uncorrected.volatile.total | Number of total ECC events for all modules (SRAM, TPC, MME, etc.) of type DERR. The volatile number of ECC errors is counted from the time a file descriptor is opened. Double-bit errors are detected but not corrected. | [Counter]. Example: 0 |
ecc.errors.corrected.aggregate.total | Number of total corrected ECC events for all modules (SRAM, TPC, MME, etc.) of type SERR (single error). The aggregate number of ECC errors is counted from the time the driver is loaded. | [Counter]. Example: 0 |
ecc.errors.corrected.volatile.total | Number of total corrected ECC events for all modules (SRAM, TPC, MME, etc.) of type DERR. The volatile number of ECC errors is counted from the time a file descriptor is opened. | [Counter]. Example: 0 |
ecc.errors.dram.aggregate.total | Number of uncorrected HBM errors. The aggregate number of ECC errors is counted from the time the driver is loaded. | [Counter]. Example: 0 |
ecc.mode.current | The ECC mode that the AIP is currently operating under. | Enabled/Disabled |
ecc.mode.pending | The ECC mode that the AIP will operate under after the next reboot. | Enabled/Disabled |
stats.violation.power | Duration of the latest power-related throttling event per device (ns). | XXXXX nsec |
stats.violation.thermal | Duration of the latest thermal-related throttling event per device (ns). | XXXXX nsec |
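Several values in the Format column pair a number with a unit suffix (for example 32768 MiB, 40 C, 80%). The following Python sketch shows one possible way to split such a value into its number and unit when post-processing query results. The helper name split_value is hypothetical, and the pattern covers only the formats shown in the table above.

```python
import re

# A number, optional whitespace, then an optional alphabetic or "%" unit suffix.
_VALUE_RE = re.compile(r"^\s*(-?\d+(?:\.\d+)?)\s*([A-Za-z%]*)\s*$")


def split_value(text):
    """Split a value such as '32768 MiB' into (32768, 'MiB').

    Returns None for values that are not numeric, such as 'Enabled'
    or a PCI address.
    """
    match = _VALUE_RE.match(text)
    if match is None:
        return None
    raw, unit = match.group(1), match.group(2)
    number = float(raw) if "." in raw else int(raw)
    return number, unit


for sample in ["32768 MiB", "40 C", "80%", "40W", "Enabled"]:
    print(sample, "->", split_value(sample))
```

Running the query with -f csv,nounits avoids the need for this step when the units are already known.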
Note
Gaudi 3 and Gaudi 2 have two power rails: 54V and 12V. The power.draw value reported by hl-smi -Q reflects the power consumption of the 54V power rail only. The power consumption (Pwr: Usage/Cap) reported by hl-smi includes the combined power usage of both the 54V and 12V power rails. Therefore, the power.draw value from hl-smi -Q is typically lower than the value shown by running hl-smi.
Running hl-smi as Daemon
You can run hl-smi as a daemon with hl-smi dmon, using -i or --device <pci-addr>:
Option | Description |
---|---|
-i, --device <pci-addr> | Opens PTS for the firmware of the requested PCIe device. |
Running hl-smi with topo Option
You can run hl-smi with topo (hl-smi topo), using -c or -N:
Option | Description |
---|---|
-c, --cpu | Gets the ideal CPU affinity for the device. Relevant only with topo. |
-N, --numa | Gets the NUMA affinity for the device. Relevant only with topo. |
Example:
$ hl-smi topo -c -N
modID   CPU Affinity     NUMA Affinity
-----   ------------     -------------
0       0-27, 56-83      0
1       28-55, 84-111    1
2       28-55, 84-111    1
3       0-27, 56-83      0
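The topo output above can be turned into a per-module mapping when a script needs to pin processes close to a device. The following Python sketch parses the sample output shown in this section. The function names are hypothetical, and the column layout is assumed to match the example exactly (module ID, CPU ranges, NUMA node).

```python
import re

# Module ID, then the CPU range list, then the NUMA node at the end of the line.
_ROW_RE = re.compile(r"^(\d+)\s+(.*\S)\s+(\d+)$")


def expand_cpu_list(spec):
    """Expand a range list such as '0-27, 56-83' into a list of CPU numbers."""
    cpus = []
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        low, _, high = part.partition("-")
        cpus.extend(range(int(low), int(high or low) + 1))
    return cpus


def parse_topo(text):
    """Map each modID to its CPU list and NUMA node; header lines are skipped."""
    result = {}
    for line in text.splitlines():
        match = _ROW_RE.match(line.strip())
        if match:
            mod_id, cpu_spec, numa = match.groups()
            result[int(mod_id)] = {"cpus": expand_cpu_list(cpu_spec), "numa": int(numa)}
    return result


SAMPLE = """modID CPU Affinity NUMA Affinity
----- ------------ -------------
0 0-27, 56-83 0
1 28-55, 84-111 1
"""
topo = parse_topo(SAMPLE)
print(topo[1]["numa"], len(topo[1]["cpus"]))
```

On Linux, the resulting CPU list could be passed to os.sched_setaffinity to restrict a process to the cores nearest a given module.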
hl-smi Command Line Examples
The following is an example of the tool command line:
$ hl-smi -q # when invoked as the root user, shows all fields otherwise reported as N/A
The following examples show the tool command line with CSV output:
Example 1:
$ hl-smi -Q timestamp,bus_id,memory.used -f csv
timestamp, bus_id, memory.used
Sun Oct 27 16:27:39 IST 201, 0000:01:00.0, 536870912 B
Example 2:
$ hl-smi -Q timestamp,memory.used,memory.free -f csv,nounits -l
timestamp, memory.free, memory.used
Sun Oct 27 16:30:32 IST 201, 3758096384, 536870912
Sun Oct 27 16:30:37 IST 201, 3758096384, 536870912
Sun Oct 27 16:30:42 IST 201, 3758096384, 536870912
Note
After running the query, save the results in a file with a .csv extension (for example, results.csv). Open the file in Excel to view the table.
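Instead of copying the output by hand, the CSV text can be parsed and saved with a short script. The following Python sketch reads output in the shape of Example 1 and writes it to a file. The function names and the results.csv path are illustrative assumptions, and the sample text is copied from Example 1 as it appears above.

```python
import csv
import io


def parse_csv_output(text):
    """Parse hl-smi CSV text into a list of dicts keyed by the header row."""
    # skipinitialspace handles the ", " separators used in the output.
    reader = csv.reader(io.StringIO(text), skipinitialspace=True)
    rows = [[cell.strip() for cell in row] for row in reader if row]
    if not rows:
        return []
    header, data = rows[0], rows[1:]
    return [dict(zip(header, row)) for row in data]


def save_csv(records, path):
    """Write the parsed records to a CSV file that a spreadsheet can open."""
    with open(path, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=list(records[0]))
        writer.writeheader()
        writer.writerows(records)


SAMPLE = """timestamp, bus_id, memory.used
Sun Oct 27 16:27:39 IST 201, 0000:01:00.0, 536870912 B
"""
records = parse_csv_output(SAMPLE)
print(records[0]["bus_id"])
# save_csv(records, "results.csv")  # uncomment to write the file
```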
Example 3:
$ hl-smi topo -c -N
modID   CPU Affinity     NUMA Affinity
-----   ------------     -------------
0       0-27, 56-83      0
1       28-55, 84-111    1
2       28-55, 84-111    1
3       0-27, 56-83      0
Example 4:
$ hl-smi -n ports -i 0000:08:00.0
port 0: external
port 1: internal
port 2: internal
port 3: internal
port 4: internal
port 5: internal
port 6: internal
port 7: internal
port 8: external
port 9: external
Example 5:
$ hl-smi --nic=link -i 0000:08:00.0 -P 7,3,2
port 2: DOWN
port 3: UP
port 7: DOWN
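A monitoring script might want the link states from Example 5 as structured data. The following Python sketch parses lines of the form "port N: STATE" into a dictionary. The function name is hypothetical, and the line format is assumed to match the example output exactly.

```python
def parse_link_states(text):
    """Map each port number to True (UP) or False (any other state)."""
    states = {}
    for line in text.splitlines():
        name, sep, state = line.partition(":")
        fields = name.split()
        # Accept only lines shaped like "port <number>: <state>".
        if sep and len(fields) == 2 and fields[0] == "port" and fields[1].isdigit():
            states[int(fields[1])] = state.strip().upper() == "UP"
    return states


SAMPLE = """port 2: DOWN
port 3: UP
port 7: DOWN
"""
links = parse_link_states(SAMPLE)
down_ports = sorted(port for port, is_up in links.items() if not is_up)
print(down_ports)
```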
Example 6:
$ hl-smi --query-row-replacement uuid,replaced_rows.address,replaced_rows.cause -f csv
uuid, address.hbm_idx, address.pc, address.sid, address.bank_idx, address.row_addr, replaced_rows.cause
00P3-HL2000B0-14-P63X13-03-03-07, 1, 14, 0, 0, 0, Single Bit ECC
00P3-HL2000B0-14-P63X13-03-03-07, 3, 10, 0, 0, 0, Double Bit ECC
Example 7:
$ hl-smi -q -d ROW_REPLACEMENT
================ HL-SMI LOG ================
Timestamp : Mon Nov 1 12:13:16 IST 2021
Driver Version : 1.2.0-5502e1915
HL-SMI Version : hl-1.1.0-fw-32.3.0-6-g20afbf28-dirty (Sep 12 2021 - 10:49:45)
Attached AIPs : 1
[0] AIP (hl0) 0000:01:00.0
Replaced Rows
Single Bit ECC : 1
Double Bit ECC : 1
Pending : Yes