System Management Interface Tool User Guide (hl-smi Tool)
This document describes the System Management Interface tool, hl-smi. The tool retrieves device information and monitors device data.
hl-smi Utility Options
Note
Run sudo hl-smi as root on the host (not inside a Docker container) to see the process IDs of all users on the system.
Running hl-smi without any options displays a summary table of the detected Habana Labs devices.
$ hl-smi [<options>]
Use the -h argument to view all options.
The following table lists all available options and their descriptions.
Option | Description |
---|---|
-L, --list-aips | Lists all Habana Labs devices. |
-i, --device <pci addr> | Acts on a specific PCIe device. |
-r, --reset-aip | Triggers a reset of the Habana Labs AIP. Requires root privileges and the -i switch to target a specific device. |
-p, --power-limit <watts> | Sets the maximum power limit. |
-q, --query | Displays information for the AIPs in the system. |
-d, --display <type> | Displays only one of the following types of information: PRODUCT, POWER, CLOCK, TEMPERATURE, FAN, BUS, MEMORY, NET, ROW_REPLACEMENT. |
-l, --loop [seconds] | Continuously reports query data at the specified interval. The default interval is 5 seconds. |
-h, --help | Prints the help message. |
-v, --version | Prints version information and exits. |
-D | Adds debug prints. |
-Q, --query-aip <types> | Selects the queries to present in CSV format. Possible queries: timestamp, name, bus_id, driver_version, temperature.aip, module_id, utilization.aip, memory.total, memory.free, memory.used, index, serial, uuid, power.draw, ecc.errors.uncorrected.aggregate.total, ecc.errors.uncorrected.volatile.total, ecc.errors.corrected.aggregate.total, ecc.errors.corrected.volatile.total, ecc.errors.dram.aggregate.total, ecc.mode.current, ecc.mode.pending, stats.violation.power, stats.violation.thermal. |
--query-row-replacement <types> | Displays row replacement information according to the selected types: uuid, replaced_rows.address, replaced_rows.cause. |
-f, --format <types> | Formats the requested information as CSV. Accepted values: csv (mandatory), noheader, nounits. |
-n, --nic <ports\|link\|stats> | Gets NIC port information (internal/external), link state (up/down), or statistics for the requested internal port(s). ‘nic stats’ is equivalent to “ethtool -S <port>”. |
-P, --port [ports] | Selects the ports (optional) for NIC information. If no ports are specified, data is retrieved for all ports. |
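As a minimal sketch of how the query options above combine, the following Python helper assembles an hl-smi argument list for a CSV query. The helper name build_query_command and its parameters are hypothetical and introduced only for illustration; the flags themselves (-Q, -f, -i) are the ones listed in the table.

```python
def build_query_command(fields, device=None, noheader=False, nounits=False):
    """Assemble an hl-smi argument list for a CSV query (hypothetical helper)."""
    if not fields:
        raise ValueError("at least one query field is required")
    fmt = ["csv"]  # the table lists csv as mandatory for -f
    if noheader:
        fmt.append("noheader")
    if nounits:
        fmt.append("nounits")
    cmd = ["hl-smi", "-Q", ",".join(fields), "-f", ",".join(fmt)]
    if device is not None:
        cmd += ["-i", device]  # PCI address of the target device
    return cmd


print(build_query_command(["timestamp", "bus_id", "memory.used"]))
```

On a host where hl-smi is installed, the resulting list could be passed to subprocess.run(cmd, capture_output=True, text=True) to collect the output.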
hl-smi -Q Option
The following table describes --query-aip <types>:
Name | Description | Format |
---|---|---|
timestamp | The time at which hl-smi was invoked. The time zone follows the system on which hl-smi runs. | Day-of-week Month Day HH:MM:SS Timezone Year. Example: Wed Oct 21 15:38:16 IDT 2020 |
name | The name of the Habana Labs board. | Alphanumeric string. Example: HL202 |
bus_id | The PCI address of the AIP. | Domain (DDDD):bus number (BB):device number (DD).function number (F). Example: 0000:01:00.0 |
driver_version | The host driver version. | Release number (XX.YY.ZZ)-SHA1 commit (XXXXXX). Example: 0.11.0-44980077 |
temperature.aip | Maximum temperature read from the four available temperature sensors on the SoC. | Maximum temperature in degrees Celsius. Example: 40 C |
utilization.aip | A simple utilization measurement that checks whether any of the available HW components was busy over a period of 1000 ms. | Utilization in percent. Example: 80% |
memory.total | The total memory size (free plus used). | [Value] in MiB. Example: 32768 MiB |
memory.free | The size of unused memory. | [Value] in MiB. Example: 32256 MiB |
memory.used | The size of used memory. | [Value] in MiB. Example: 512 MiB |
module_id | Location of the card on the board. | Decimal string. Example: 0 |
index | Index of the AIP in the host. | A number (0, 1, ...). Example: 0 |
serial | The unique serial number of the AIP. | YYXXXXXXXX: 2 letters followed by 8 digits. Example: AJ42038679 |
uuid | The universally unique 128-bit identifier of the SoC. | Alphanumeric string made up of table_version, device_ID, FAB#, LOT#, Wafer#, X Coordinate, Y Coordinate. Example: 00P1-HL2000B0-14-P63B83-04-08-10 |
power.draw | The total power consumption of the AIP, calculated from current and voltage samples (current multiplied by voltage). | [Value] W. Example: 40W |
ecc.errors.uncorrected.aggregate.total | Total number of ECC events for all modules (SRAM, TPC, MME, etc.) of type SERR (single error). The aggregate count starts from the time the driver is loaded. | [Counter]. Example: 0 |
ecc.errors.uncorrected.volatile.total | Total number of ECC events for all modules (SRAM, TPC, MME, etc.) of type DERR (the current version reports SERR instead, which appears to be a bug). The volatile count starts from the time a file descriptor is opened. Double-bit errors are detected but not corrected. | [Counter]. Example: 0 |
ecc.errors.corrected.aggregate.total | Total number of corrected ECC events for all modules (SRAM, TPC, MME, etc.) of type SERR (single error). The aggregate count starts from the time the driver is loaded. | [Counter]. Example: 0 |
ecc.errors.corrected.volatile.total | Total number of corrected ECC events for all modules (SRAM, TPC, MME, etc.) of type DERR (the current version reports SERR instead, which appears to be a bug). The volatile count starts from the time a file descriptor is opened. | [Counter]. Example: 0 |
ecc.errors.dram.aggregate.total | Number of uncorrected HBM errors. The aggregate count starts from the time the driver is loaded. | [Counter]. Example: 0 |
ecc.mode.current | The ECC mode under which the AIP is currently operating. | Enabled/Disabled |
ecc.mode.pending | The ECC mode under which the AIP will operate after the next reboot. | Enabled/Disabled |
stats.violation.power | Duration of the latest power-related throttling event per device, in nanoseconds. | XXXXX nsec |
stats.violation.thermal | Duration of the latest thermal-related throttling event per device, in nanoseconds. | XXXXX nsec |
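Several of the formats above attach a unit to a number (for example 32768 MiB, 80%, 40 C, 40W). The sketch below shows one way a script might split such a field into a number and a unit. The function name parse_value is hypothetical, and the regular expression assumes the field shapes shown in the Format column rather than an official grammar.

```python
import re

# Optional sign, digits, optional fraction, then an optional unit such as MiB, %, C, W.
_VALUE_RE = re.compile(r"^\s*(-?\d+(?:\.\d+)?)\s*([A-Za-z%]*)\s*$")


def parse_value(text):
    """Split a field such as '32768 MiB' into (number, unit)."""
    match = _VALUE_RE.match(text)
    if match is None:
        # Not a plain number, e.g. 'Enabled' or a PCI address: return it unchanged.
        return text.strip(), None
    raw = match.group(1)
    number = float(raw) if "." in raw else int(raw)
    return number, match.group(2) or None


for sample in ["32768 MiB", "80%", "40 C", "40W", "Enabled"]:
    print(sample, "->", parse_value(sample))
```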
hl-smi Command Line Examples
The following is an example of the tool command line:
$ hl-smi -q # if invoked as root, shows all N/A fields
The following examples show the tool command line with CSV format:
Example 1:
$ hl-smi -Q timestamp,bus_id,memory.used -f csv
timestamp, bus_id, memory.used
Sun Oct 27 16:27:39 IST 201, 0000:01:00.0, 536870912 B
Example 2:
$ hl-smi -Q timestamp,memory.free,memory.used -f csv,nounits -l
timestamp, memory.free, memory.used
Sun Oct 27 16:30:32 IST 201, 3758096384, 536870912
Sun Oct 27 16:30:37 IST 201, 3758096384, 536870912
Sun Oct 27 16:30:42 IST 201, 3758096384, 536870912
Note
After running the query, copy the results and save them in a file with a .csv extension (for example, results.csv). Open the file in Excel to view the results as a table.
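Instead of copying the results by hand, a script can parse the CSV text and write it out itself. The following is a minimal sketch using Python's standard csv module, with the Example 1 output above pasted in verbatim as a sample string. The function names parse_csv_output and write_csv are hypothetical.

```python
import csv
import io

# Output of Example 1, copied as shown above.
SAMPLE = """timestamp, bus_id, memory.used
Sun Oct 27 16:27:39 IST 201, 0000:01:00.0, 536870912 B
"""


def parse_csv_output(text):
    """Parse hl-smi CSV text into a list of dicts keyed by the header names."""
    # skipinitialspace drops the blank that follows each comma in the output.
    reader = csv.DictReader(io.StringIO(text), skipinitialspace=True)
    return [dict(row) for row in reader]


def write_csv(rows, fileobj):
    """Write parsed rows to an open text file object in standard CSV form."""
    writer = csv.DictWriter(fileobj, fieldnames=list(rows[0]))
    writer.writeheader()
    writer.writerows(rows)


rows = parse_csv_output(SAMPLE)
print(rows[0]["bus_id"])
```

To produce the file described in the note, the rows could be written with open("results.csv", "w", newline="") as the file object.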
Example 3:
$ hl-smi -n ports -i 0000:08:00.0
port 0: external
port 1: internal
port 2: internal
port 3: internal
port 4: internal
port 5: internal
port 6: internal
port 7: internal
port 8: external
port 9: external
Example 4:
$ hl-smi --nic=link -i 0000:08:00.0 -P 7,3,2
port 2: DOWN
port 3: UP
port 7: DOWN
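The port listings in Examples 3 and 4 share the same "port N: value" shape, so one small parser can handle both. This is a sketch that assumes the output looks exactly like the lines shown above; parse_port_lines is a hypothetical name.

```python
import re

PORT_LINE = re.compile(r"^port\s+(\d+):\s*(\S+)$")


def parse_port_lines(text):
    """Map port number to the reported value (e.g. 'UP', 'DOWN', 'internal')."""
    ports = {}
    for line in text.splitlines():
        match = PORT_LINE.match(line.strip())
        if match:
            ports[int(match.group(1))] = match.group(2)
    return ports


# Output of Example 4, copied as shown above.
link_output = """port 2: DOWN
port 3: UP
port 7: DOWN"""

states = parse_port_lines(link_output)
down_ports = [port for port, state in states.items() if state == "DOWN"]
print(down_ports)  # prints [2, 7]
```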
Example 5:
$ hl-smi --query-row-replacement uuid,replaced_rows.address,replaced_rows.cause -f csv
uuid, address.hbm_idx, address.pc, address.sid, address.bank_idx, address.row_addr, replaced_rows.cause
00P3-HL2000B0-14-P63X13-03-03-07, 1, 14, 0, 0, 0, Single Bit ECC
00P3-HL2000B0-14-P63X13-03-03-07, 3, 10, 0, 0, 0, Double Bit ECC
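A monitoring script might want to tally the replaced rows by cause. The sketch below does this for the Example 5 output above, which is embedded as a sample string. The function name count_causes is hypothetical, and the column name replaced_rows.cause is taken from the header line shown in the example.

```python
import csv
import io
from collections import Counter

# Output of Example 5, copied as shown above.
ROW_REPLACEMENT_SAMPLE = """uuid, address.hbm_idx, address.pc, address.sid, address.bank_idx, address.row_addr, replaced_rows.cause
00P3-HL2000B0-14-P63X13-03-03-07, 1, 14, 0, 0, 0, Single Bit ECC
00P3-HL2000B0-14-P63X13-03-03-07, 3, 10, 0, 0, 0, Double Bit ECC
"""


def count_causes(text):
    """Count replaced rows per cause from --query-row-replacement CSV text."""
    reader = csv.DictReader(io.StringIO(text), skipinitialspace=True)
    return Counter(row["replaced_rows.cause"] for row in reader)


print(count_causes(ROW_REPLACEMENT_SAMPLE))
```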
Example 6:
$ hl-smi -q -d ROW_REPLACEMENT
================ HL-SMI LOG ================
Timestamp : Mon Nov 1 12:13:16 IST 2021
Driver Version : 1.2.0-5502e1915
HL-SMI Version : hl-1.1.0-fw-32.3.0-6-g20afbf28-dirty (Sep 12 2021 - 10:49:45)
Attached AIPs : 1
[0] AIP (hl0) 0000:01:00.0
Replaced Rows
Single Bit ECC : 1
Double Bit ECC : 1
Pending : Yes
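The log in Example 6 consists mostly of "Key : Value" lines, which can be collected into a dictionary. The sketch below assumes that layout holds for the fields of interest. parse_log_fields is a hypothetical name, and the sample is a subset of the lines shown above.

```python
def parse_log_fields(text):
    """Collect 'Key : Value' lines into a dict; lines without ' : ' are skipped."""
    fields = {}
    for line in text.splitlines():
        # Split on the first ' : ' only, so times such as 12:13:16 stay intact.
        key, sep, value = line.partition(" : ")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


sample = """Timestamp : Mon Nov 1 12:13:16 IST 2021
Driver Version : 1.2.0-5502e1915
Attached AIPs : 1
[0] AIP (hl0) 0000:01:00.0
Replaced Rows
Single Bit ECC : 1
Double Bit ECC : 1
Pending : Yes"""

fields = parse_log_fields(sample)
print(fields["Driver Version"])
```

With more than one attached AIP, per-device keys such as Pending would overwrite each other in this flat dictionary, so a real script would need to group fields by device.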
Running hl-smi as Daemon
You can run hl-smi as a daemon with hl-smi dmon, using -i or --device <pci-addr>:
Option | Description |
---|---|
-i, --device <pci-addr> | Opens a PTS for the firmware of the requested PCIe device. |
Running hl-smi with topo Option
You can run hl-smi with the topo option, hl-smi topo, using -c or -N:
Option | Description |
---|---|
-c, --cpu | Gets the ideal CPU affinity for the device. Note: Relevant only with ‘topo’. |
-N, --numa | Gets the NUMA affinity for the device. Note: Relevant only with ‘topo’. |
Example:
$ hl-smi topo -c -N
modID CPU Affinity NUMA Affinity
----- ------------ -------------
0 0-27, 56-83 0
1 28-55, 84-111 1
2 28-55, 84-111 1
3 0-27, 56-83 0
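The CPU Affinity column reports ranges such as 0-27, 56-83. To use them in a script, the ranges first have to be expanded into individual CPU indices. The following is a minimal sketch. expand_cpu_ranges is a hypothetical helper, and the input string is the value shown for modID 0 above.

```python
def expand_cpu_ranges(spec):
    """Expand a string such as '0-27, 56-83' into a sorted list of CPU indices."""
    cpus = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            low, high = part.split("-", 1)
            cpus.update(range(int(low), int(high) + 1))  # ranges are inclusive
        else:
            cpus.add(int(part))
    return sorted(cpus)


cpus = expand_cpu_ranges("0-27, 56-83")
print(len(cpus))  # prints 56
```

On Linux, the resulting list could then be passed to os.sched_setaffinity(0, cpus) to pin the current process to the CPUs reported for a device.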