6. System Management Interface Tool User Guide (hl-smi Tool)

6.1. Overview

This document describes hl-smi, the system management interface tool. The tool retrieves device information and monitors device data such as power, temperature, and memory usage.

6.2. hl-smi Utility Options

Running hl-smi without any options displays a summary table of the detected Habana Labs devices.

$ hl-smi [<options>]

Use the -h argument to list all options. The table below describes each available option.

| Option | Description |
|---|---|
| -L, --list-aips | Lists all Habana Labs devices. |
| -i, --device <pci addr> | Acts on a specific PCIe device. |
| -r, --reset-aip | Triggers a reset of the Habana Labs AIP. Requires root. Requires the -i switch to target a specific device. |
| -p, --power-limit <watts> | Sets the maximum power limit. |
| -q, --query | Displays information for AIPs in the system. |
| -d, --display <type> | Displays only the selected type of information. One of: POWER, CLOCK, TEMPERATURE, FAN, BUS, MEMORY, NET. |
| -l, --loop [seconds] | Continuously reports query data at the specified interval. The default interval is 5 seconds. |
| -h, --help | Prints this help message. |
| -v, --version | Prints version information and exits. |
| -D | Adds debug prints. |
| -Q, --query-aip <types> | Selects the queries to present in csv format. Possible queries: timestamp, name, bus_id, driver_version, temperature.aip, utilization.aip, memory.total, memory.free, memory.used, module_id, index, serial, uuid, power.draw, ecc.errors.uncorrected.aggregate.total, ecc.errors.uncorrected.volatile.total |
| -f, --format <types> | Sets the output format for the queried information: csv (mandatory), noheader, nounits. |
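As an illustration of how the -Q and -f options combine, the following Python sketch assembles an hl-smi command line and runs it once. The helper names `build_query_command` and `run_query` are hypothetical, and combining -i with -Q is an assumption based on the option descriptions above rather than documented behavior.

```python
import subprocess


def build_query_command(fields, device=None, noheader=False, nounits=False,
                        executable="hl-smi"):
    """Build an argv list for an hl-smi csv query (hypothetical helper)."""
    if not fields:
        raise ValueError("at least one query field is required")
    # csv is mandatory for -f; noheader and nounits are optional modifiers.
    formats = ["csv"]
    if noheader:
        formats.append("noheader")
    if nounits:
        formats.append("nounits")
    argv = [executable, "-Q", ",".join(fields), "-f", ",".join(formats)]
    if device is not None:
        # Assumption: -i <pci addr> can be combined with -Q to target one device.
        argv += ["-i", device]
    return argv


def run_query(fields, **kwargs):
    """Run the query once and return its standard output as text."""
    argv = build_query_command(fields, **kwargs)
    result = subprocess.run(argv, capture_output=True, text=True, check=True)
    return result.stdout


if __name__ == "__main__":
    # Only prints the command; call run_query() on a host where hl-smi is installed.
    print(build_query_command(["timestamp", "bus_id", "memory.used"]))
```

Passing the arguments as a list avoids shell quoting issues with the comma-separated field and format strings.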

6.2.1. hl-smi -Q Option

The following table describes the --query-aip <types> values:

| Name | Description | Format |
|---|---|---|
| timestamp | The time at which hl-smi was invoked. The time zone is that of the system hl-smi runs on. | Day-of-week Month Day HH:MM:SS Timezone Year. Example: Wed Oct 21 15:38:16 IDT 2020 |
| name | The name of the Habana Labs board. | Alphanumeric string. Example: HL202 |
| bus_id | The PCI address of the AIP. | Domain (DDDD):bus number (BB):device number (DD).function number (F). Example: 0000:01:00.0 |
| driver_version | The host driver version. | Release number (XX.YY.ZZ)-SHA1 commit (XXXXXX). Example: 0.11.0-44980077 |
| temperature.aip | Maximum temperature read from the four available temperature sensors on the SoC. | Maximum temperature in degrees Celsius. Example: 40 C |
| utilization.aip | A simple utilization measurement that checks whether any of the available hardware components was busy over a period of 1000 ms. | Percentage. Example: 80% |
| memory.total | The total memory size (free plus used). | [Value] in MiB. Example: 32768 MiB |
| memory.free | The size of unused memory. | [Value] in MiB. Example: 32256 MiB |
| memory.used | The size of used memory. | [Value] in MiB. Example: 512 MiB |
| module_id | Location of the card in the board. | Decimal string. Example: 0 |
| index | Index of the AIP in the host. | A number (0, 1, ...). Example: 0 |
| serial | The unique serial number of the AIP. | YYXXXXXXXX: 2 letters followed by 8 digits. Example: AJ42038679 |
| uuid | The universally unique 128-bit identifier of the SoC. | Alphanumeric string made up of table_version, device_ID, FAB#, LOT#, Wafer#, X Coordinate, Y Coordinate. Example: 00P1-HL2000B0-14-P63B83-04-08-10 |
| power.draw | The total power consumption of the AIP device, calculated from current and voltage samples (current multiplied by voltage). | [Value] W. Example: 40W |
| ecc.errors.uncorrected.aggregate.total | Total number of ECC events of type SERR (single error) for all modules (SRAM, TPC, MME, etc.). The aggregate count starts from the time the driver is loaded. | [Counter]. Example: 0 |
| ecc.errors.uncorrected.volatile.total | Total number of ECC events of type DERR for all modules (SRAM, TPC, MME, etc.). In the current version this counter reports SERR events instead, which appears to be a bug. The volatile count starts from the time a file descriptor is opened. Double-bit errors are detected but not corrected. | [Counter]. Example: 0 |
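The value formats in the table can be decoded with a few small helpers. The following Python sketch is one possible approach, not part of hl-smi itself. The function names are hypothetical, and the patterns are inferred only from the example values shown in the table.

```python
import re
from datetime import datetime


def parse_bus_id(text):
    """Split a bus_id such as '0000:01:00.0' into its numeric parts."""
    match = re.fullmatch(
        r"([0-9a-fA-F]{4}):([0-9a-fA-F]{2}):([0-9a-fA-F]{2})\.([0-7])",
        text.strip(),
    )
    if match is None:
        raise ValueError(f"unrecognized bus_id: {text!r}")
    domain, bus, device, function = (int(part, 16) for part in match.groups())
    return {"domain": domain, "bus": bus, "device": device, "function": function}


def parse_quantity(text):
    """Split values like '32768 MiB', '40 C', '80%' or '40W' into (number, unit)."""
    match = re.fullmatch(r"\s*(-?\d+(?:\.\d+)?)\s*([A-Za-z%]*)\s*", text)
    if match is None:
        raise ValueError(f"unrecognized value: {text!r}")
    number, unit = match.groups()
    return float(number), unit


def parse_timestamp(text):
    """Parse 'Wed Oct 21 15:38:16 IDT 2020' into (naive datetime, zone name)."""
    parts = text.split()
    if len(parts) != 6:
        raise ValueError(f"unrecognized timestamp: {text!r}")
    zone = parts[4]
    # Zone abbreviations are ambiguous, so the zone is returned as plain text
    # and the datetime is left without time zone information.
    stamp = datetime.strptime(" ".join(parts[:4] + parts[5:]),
                              "%a %b %d %H:%M:%S %Y")
    return stamp, zone


if __name__ == "__main__":
    total, _ = parse_quantity("32768 MiB")
    free, _ = parse_quantity("32256 MiB")
    used, _ = parse_quantity("512 MiB")
    # With the example values from the table, total equals free plus used.
    print(total == free + used)
```

The timestamp helper relies on English day and month names, so it assumes hl-smi runs under an English (or C) locale.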

6.2.2. hl-smi Command Line Examples

The following is an example of the tool command line:

$ hl-smi -q # when invoked as root, also shows the fields that otherwise appear as N/A

The following examples show the tool command line with csv output:

Example 1:

$ hl-smi -Q timestamp,bus_id,memory.used -f csv
timestamp, bus_id, memory.used
Sun Oct 27 16:27:39 IST 201, 0000:01:00.0, 536870912 B

Example 2:

$ hl-smi -Q timestamp,memory.free,memory.used -f csv,nounits -l
timestamp, memory.free, memory.used
Sun Oct 27 16:30:32 IST 201, 3758096384, 536870912
Sun Oct 27 16:30:37 IST 201, 3758096384, 536870912
Sun Oct 27 16:30:42 IST 201, 3758096384, 536870912
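Output in this shape can be loaded into a script for further processing. The following Python sketch shows one way to do that, assuming the first line is the header (that is, noheader was not given). The function name `parse_csv_output` is hypothetical, and the sample text is copied from Example 2.

```python
import csv
import io


def parse_csv_output(text):
    """Turn hl-smi csv output (header line plus data lines) into a list of dicts."""
    # skipinitialspace drops the blank that follows each comma in the output.
    reader = csv.reader(io.StringIO(text), skipinitialspace=True)
    rows = [[cell.strip() for cell in row] for row in reader if row]
    if not rows:
        return []
    header, data = rows[0], rows[1:]
    return [dict(zip(header, row)) for row in data]


SAMPLE = """timestamp, memory.free, memory.used
Sun Oct 27 16:30:32 IST 201, 3758096384, 536870912
Sun Oct 27 16:30:37 IST 201, 3758096384, 536870912
"""

if __name__ == "__main__":
    for record in parse_csv_output(SAMPLE):
        # With nounits, numeric columns can be converted directly.
        print(record["timestamp"], int(record["memory.used"]))
```

Using nounits, as in Example 2, keeps the numeric columns free of unit suffixes, which simplifies conversion.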

Note

After running the query, copy the results into a file with a .csv extension (for example, results.csv). Open the file in Excel to view the results as a table.

[Figure: query results displayed as a table in Excel]
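The copy-and-save step in the note can also be scripted. The following Python sketch writes captured hl-smi csv output to a file and trims the space after each comma so that spreadsheet cells do not start with a blank. The function name `save_as_csv` is hypothetical, and the output text is assumed to have been captured already (for example, by redirecting the command output).

```python
import csv
import io


def save_as_csv(output_text, path):
    """Write hl-smi csv output to a file and return the number of rows written."""
    reader = csv.reader(io.StringIO(output_text), skipinitialspace=True)
    count = 0
    # newline="" lets the csv module control line endings, as its docs recommend.
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        for row in reader:
            if row:  # skip blank lines
                writer.writerow([cell.strip() for cell in row])
                count += 1
    return count


if __name__ == "__main__":
    text = (
        "timestamp, bus_id, memory.used\n"
        "Sun Oct 27 16:27:39 IST 201, 0000:01:00.0, 536870912 B\n"
    )
    print(save_as_csv(text, "results.csv"))
```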

6.3. Running hl-smi as Daemon

You can run hl-smi as a daemon with hl-smi dmon, selecting the device with -i or --device <pci-addr>:

| Option | Description |
|---|---|
| -i, --device <pci-addr> | Opens a PTS for the firmware of the requested PCIe device. |