hl_smi_async Tool

hl_smi_async is a utility for monitoring and managing Gaudi devices asynchronously. Its basic function is to collect telemetry data and log it. hl_smi_async reads the different telemetry packets and handles all related configuration and handshakes. For more information about the telemetry packets, see the Gaudi 3 In-Band Telemetry for Hypervisor document located in the Intel Gaudi vault and Intel RDC. The hl_smi_async application is located in the hl-smi repository: hl-smi/app/hl-smi-async.

Note

  • The tool can run with or without the driver loaded.

  • The tool can run only on a bare-metal machine or a hypervisor. Running it on a guest VM may cause undefined behavior.

Options and Usage

The following table lists the available hl_smi_async options and their usage.

Example:

sudo /usr/sbin/hl-smi-async -D 0000:b1:00.0 -O console -L info -I 5

Option

Description

-h, --help

Outputs the help message and exits.

-V, --version

Outputs version information and exits.

-D, --device [device]

Specifies the PCIe address of the device (e.g. 0000:b1:00.0).

-O, --output [output]

Specifies the telemetry output type. Valid values:

  • console (default) - Prints all collected telemetry to the console.

  • logfile - Writes all collected telemetry to a log file: telem_log.txt

-L, --loglevel [level]

Specifies the log level. Valid values:

  • info (default) - Log informative messages.

  • debug - Log debug info.

  • error - Log all errors.

-I, --iterations [number]

Specifies the number of iterations. If not set (default), the tool runs in an endless loop.
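The options above can be assembled programmatically before launching the tool. The following is a minimal sketch, assuming the tool is installed at the path used in the example command; build_command is a hypothetical helper name, not part of the tool, and it only builds the argument list without running anything.

```python
from typing import List, Optional

# Path taken from the example command in this section.
TOOL_PATH = "/usr/sbin/hl-smi-async"


def build_command(device: str,
                  output: str = "console",
                  loglevel: str = "info",
                  iterations: Optional[int] = None) -> List[str]:
    """Return the argument list for one hl-smi-async run (hypothetical helper)."""
    # Accept only the values listed in the options table.
    if output not in ("console", "logfile"):
        raise ValueError(f"unsupported output type: {output}")
    if loglevel not in ("info", "debug", "error"):
        raise ValueError(f"unsupported log level: {loglevel}")
    cmd = ["sudo", TOOL_PATH, "-D", device, "-O", output, "-L", loglevel]
    # Leaving out -I keeps the documented default: an endless loop.
    if iterations is not None:
        cmd += ["-I", str(iterations)]
    return cmd


if __name__ == "__main__":
    # Rebuilds the example command shown above.
    print(" ".join(build_command("0000:b1:00.0", iterations=5)))
```

The returned list can be passed to subprocess.run(cmd, capture_output=True, text=True) on a host where the tool is installed.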

Output Examples

Starting telemetry data collection
Reading all synced telemetry packets
Timestamp: 23958 ms, Packet: Temperature, Field: temperature.aip, Data: 47
Timestamp: 24340 ms, Packet: Power, Field: power.draw.54v, Data: 191
Timestamp: 23988 ms, Packet: Health, Field: health, Data: 1
Timestamp: 24340 ms, Packet: Perf, Field: uptime, Data: 24
Timestamp: 24340 ms, Packet: Perf, Field: utilization.aip, Data: 0
Timestamp: 3835 ms, Packet: System Status, Field: ib_fw_update.stat, Data: 2
Timestamp: 3835 ms, Packet: System Status, Field: ethernet_ports.state, Data: 16777215
Timestamp: 3835 ms, Packet: System Status, Field: current_running_fw, Data: 1
Timestamp: 3835 ms, Packet: Security, Field: security.hash_spi_code, Data:
da a8 37 cd ed bc 7c 6f df 58 18 06 f9 86 05 4a d9 cf ac 8e 0d 4d 52 f1 da 82 eb 09 b6 69 06 26 f8 89 5c 52 c9 59 5f 13 3b 69 8c f0 c9 43 c8 51
Timestamp: 3842 ms, Packet: Mem_utility, Field: utilization.memory, Data: 0
Timestamp: 3842 ms, Packet: Mem_utility, Field: memory.total, Data: 131072
Timestamp: 3842 ms, Packet: Mem_utility, Field: memory.free, Data: 131072
Timestamp: 3842 ms, Packet: Mem_utility, Field: memory.used, Data: 0
Timestamp: 3835 ms, Packet: System Information, Field: fw_version.spi, Data: 1.21.1.0

The following table describes each field in the output above:

Packet Type

Field

Value

Unit

Description

Temperature

temperature.aip

47

Degrees Celsius. Example: 40 C

Single composite temperature measurement that provides maximal measurement of the SoC, HBM and VRM sensors (with alignment of all to the same threshold).

Power

power.draw.54v

191

[Value] W. Example: 40 W

The total power consumption of the AIP device drawn from the 54V.

Health

health

1

Values: 0 - Unknown, 1 - Normal, 2 - Non-critical, 3 - Critical, 4 - Fatal

Information on the device’s health with severity indication.

Performance

uptime

24

XXXXX sec

Uptime since last reset (any type of reset) counted from the OS bring-up, i.e. the count is restarted when moving from preboot to management FW.

Performance

utilization.aip

0

Percentage. Example: 80%

Returns a utilization measurement based on the consumed power out of the total power.

System Status

ib_fw_update.stat

2

Values: 0 - Unknown, 1 - Locked, 2 - Unlocked

IB FW update state: Locked or unlocked.

System Status

ethernet_ports.state

16777215

Bitmask. Refer to the description.

Bitmask of the Ethernet ports’ state. Each bit, according to its position, represents one port: 1 - port is enabled, 0 - port is disabled. Bit 0 relates to port 1, bit 1 relates to port 2, and so on until bit 23, which relates to port 24. Since there are 24 ports, the 8 most significant bits are always set to 0. By default, all valid ports are enabled (0x00FF_FFFF). The value can be changed using the related effecter via OOB or using IB FW update with the relevant ITBs. The value is currently displayed in decimal format; convert it to hexadecimal to interpret it as a bitmask.

System Status

current_running_fw

1

Refer to the description.

Running FW states: 0x1 = preBoot, 0x2 = management FW, 0x3 = preBoot recovery, 0x4 = Margin tool, 0x5 = FW loader agent.

Security

security.hash_spi_code

da a8 37 cd …

SHA384

SHA384 for the Manifest of SPI code (ppBoot and preBoot).

Mem_utility

utilization.memory

0

Percentage of used memory.

HBM memory usage (including reserved) measured over 1000ms.

Mem_utility

memory.total

131072

[Value] in MiB. Example: 32768 MiB

Total memory available (free + used).

Mem_utility

memory.free

131072

[Value] in MiB. Example: 32256 MiB

Free memory size.

Mem_utility

memory.used

0

[Value] in MiB. Example: 512 MiB

Used memory size.

System Information

fw_version.spi

1.21.1.0

Displayed in the following format: Major.Minor.Update.Reserved

The version of the preboot which is stored in the SPI flash.
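Because ethernet_ports.state is printed in decimal, a short conversion helps when reading it as a bitmask. The following is a minimal sketch of the bit layout described in the table (bit 0 is port 1, up to bit 23 for port 24); decode_port_state is a hypothetical helper name.

```python
PORT_COUNT = 24  # the field description states there are 24 ports


def decode_port_state(value: int) -> dict:
    """Map each port number (1..24) to True (enabled) or False (disabled)."""
    # Bit 0 relates to port 1, bit 1 to port 2, ..., bit 23 to port 24.
    return {port: bool((value >> (port - 1)) & 1)
            for port in range(1, PORT_COUNT + 1)}


if __name__ == "__main__":
    raw = 16777215  # decimal value from the example output
    print(f"0x{raw:08X}")  # the same value in hexadecimal
    states = decode_port_state(raw)
    print("disabled ports:", [p for p, on in states.items() if not on])
```

For the example value, every one of the 24 bits is set, so the list of disabled ports is empty.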

Note

Some fields are invalid when the driver is not up (preboot mode). For more information, refer to the Gaudi 3 In-Band Telemetry for Hypervisor document located in the Intel Gaudi vault and Intel RDC.
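One way to tell which stage a sample was collected in is the current_running_fw field. The following is a minimal sketch that maps the documented values to their names; the helper names are hypothetical, and treating only 0x1 as preboot mode is an assumption. Which fields are invalid in that mode is covered by the referenced document, not by this sketch.

```python
# Values copied from the current_running_fw row of the field table.
RUNNING_FW = {
    0x1: "preBoot",
    0x2: "management FW",
    0x3: "preBoot recovery",
    0x4: "Margin tool",
    0x5: "FW loader agent",
}


def describe_running_fw(value: int) -> str:
    """Return the documented name for a current_running_fw value."""
    return RUNNING_FW.get(value, f"unrecognized value {value}")


def is_preboot(value: int) -> bool:
    """Assumption: only 0x1 (preBoot) is treated as preboot mode here."""
    return value == 0x1


if __name__ == "__main__":
    for sample in (1, 2):  # the values seen in the two example outputs
        print(sample, describe_running_fw(sample), is_preboot(sample))
```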

Starting telemetry data collection
Reading all synced telemetry packets
Timestamp: 36115 ms, Packet: Temperature, Field: temperature.aip, Data: 45
Timestamp: 35909 ms, Packet: Power, Field: power.draw.54v, Data: 178
Timestamp: 35909 ms, Packet: Power, Field: power.draw.12v, Data: 14
Timestamp: 35219 ms, Packet: Health, Field: health, Data: 1
Timestamp: 35909 ms, Packet: Perf, Field: uptime, Data: 35
Timestamp: 35909 ms, Packet: Perf, Field: utilization.aip, Data: 0
Timestamp: 35909 ms, Packet: Perf, Field: stats.violation.power, Data: 0
Timestamp: 35909 ms, Packet: Perf, Field: stats.violation.thermal, Data: 0
Timestamp: 905 ms, Packet: System Status, Field: ib_fw_update.stat, Data: 2
Timestamp: 905 ms, Packet: System Status, Field: ethernet_ports.state, Data: 16777215
Timestamp: 905 ms, Packet: System Status, Field: current_running_fw, Data: 2
Timestamp: 906 ms, Packet: Security, Field: security.hash_spi_code, Data:
da a8 37 cd ed bc 7c 6f df 58 18 06 f9 86 05 4a d9 cf ac 8e 0d 4d 52 f1 da 82 eb 09 b6 69 06 26 f8 89 5c 52 c9 59 5f 13 3b 69 8c f0 c9 43 c8 51
Timestamp: 906 ms, Packet: Security, Field: security.hash_boot_fit, Data:
b9 c0 f4 2e 40 77 ca 3f 21 52 fe 55 7b a0 3a 8c 47 ce cf 42 17 ad 98 e7 c8 98 e1 cd 0d d0 ff f6 a2 0e f5 98 c0 9c b5 61 b4 5c 51 e3 c2 84 d2 f7
Timestamp: 1270 ms, Packet: Mem_utility, Field: utilization.memory, Data: 0
Timestamp: 1270 ms, Packet: Mem_utility, Field: memory.total, Data: 131072
Timestamp: 1270 ms, Packet: Mem_utility, Field: memory.free, Data: 130400
Timestamp: 1270 ms, Packet: Mem_utility, Field: memory.used, Data: 672
Timestamp: 1084 ms, Packet: System Information, Field: fw_version.spi, Data: 1.21.1.0
Timestamp: 1084 ms, Packet: System Information, Field: fw_version.os, Data: 1.21.1.0
Timestamp: 35909 ms, Packet: Errors, Field: ecc.errors.uncorrected.aggregate.total, Data: 0
Timestamp: 35909 ms, Packet: Errors, Field: ecc.errors.uncorrected.volatile.total, Data: 0
Timestamp: 35909 ms, Packet: Errors, Field: ecc.errors.corrected.aggregate.total, Data: 0
Timestamp: 35909 ms, Packet: Errors, Field: ecc.errors.corrected.volatile.total, Data: 0
Timestamp: 35909 ms, Packet: Errors, Field: ecc.errors.dram.aggregate.total, Data: 0
Timestamp: 35909 ms, Packet: Errors, Field: ecc.mode.current, Data: 1
Timestamp: 35909 ms, Packet: Errors, Field: ecc.mode.pending, Data: 1

The following table describes each field in the output above:

Packet Type

Field

Value

Unit

Description

Temperature

temperature.aip

45

Degrees Celsius. Example: 40 C

Single composite temperature measurement that provides maximal measurement of the SoC, HBM and VRM sensors (with alignment of all to the same threshold).

Power

power.draw.54v

178

[Value] W. Example: 40 W

The total power consumption of the AIP device drawn from the 54V.

Power

power.draw.12v

14

[Value] W. Example: 40 W

The total power consumption of the AIP device drawn from the 12V.

Health

health

1

Values: 0 - Unknown, 1 - Normal, 2 - Non-critical, 3 - Critical, 4 - Fatal

Information on the device’s health with severity indication.

Performance

uptime

35

XXXXX sec

Uptime since last reset (any type of reset) counted from the OS bring-up, i.e. the count is restarted when moving from preboot to management FW.

Performance

utilization.aip

0

Percentage. Example: 80%

Returns a utilization measurement based on the consumed power out of the total power.

Performance

stats.violation.power

0

XXXXX nsec

Duration of latest power-related throttling event per device (ns). This is the actual duration in which PID applied throttling. Internally, it is counted in ms, returned in ns by request.

Performance

stats.violation.thermal

0

XXXXX nsec

Duration of latest thermal-related throttling event per device (ns). Internally counted in ms, returned in ns by request.

System Status

ib_fw_update.stat

2

Values: 0 - Unknown, 1 - Locked, 2 - Unlocked

IB FW update state: Locked or unlocked.

System Status

ethernet_ports.state

16777215

Bitmask. Refer to the description.

Bitmask of the Ethernet ports’ state. Each bit, according to its position, represents one port: 1 - port is enabled, 0 - port is disabled. Bit 0 relates to port 1, bit 1 relates to port 2, and so on until bit 23, which relates to port 24. Since there are 24 ports, the 8 most significant bits are always set to 0. By default, all valid ports are enabled (0x00FF_FFFF). The value can be changed using the related effecter via OOB or using IB FW update with the relevant ITBs. The value is currently displayed in decimal format; convert it to hexadecimal to interpret it as a bitmask.

System Status

current_running_fw

2

Refer to the description.

Running FW: 0x1 is preBoot, 0x2 is management FW, 0x3 is preBoot recovery, 0x4 is Margin tool, 0x5 is FW loader agent.

Security

security.hash_spi_code

da a8 37 cd …

SHA384

SHA384 for the Manifest of SPI code (ppBoot and preBoot).

Security

security.hash_boot_fit

b9 c0 f4 2e …

SHA384

SHA384 for the boot fit.

Mem_utility

utilization.memory

0

Percentage of used memory.

HBM memory usage including memory reserved by driver/FW over 1000ms.

Mem_utility

memory.total

131072

[Value] in MiB. Example: 32768 MiB

Total size of available memory (free + used).

Mem_utility

memory.free

130400

[Value] in MiB. Example: 32256 MiB

Size of free memory.

Mem_utility

memory.used

672

[Value] in MiB. Example: 512 MiB

Size of used memory.

System Information

fw_version.spi

1.21.1.0

Displayed in the following format: Major.Minor.Update.Reserved

The version of the preboot which is stored in the SPI flash.

System Information

fw_version.os

1.21.1.0

Displayed in the following format: Major.Minor.Update.Reserved

The version of arcmgmt. Available only when the driver is up.

Errors

ecc.errors.uncorrected.aggregate.total

0

[Counter]. Example: 0

Total number of uncorrected ECC events for all modules (SRAM, TPC, MME, etc.), i.e. errors of type DERR (double ECC error). The aggregate count starts from the time the driver is loaded.

Errors

ecc.errors.uncorrected.volatile.total

0

[Counter]. Example: 0

Total number of uncorrected ECC events for all modules (SRAM, TPC, MME, etc.), i.e. errors of type DERR (double ECC error). The volatile count starts from the time a file descriptor is opened. Double-bit errors are detected but not corrected.

Errors

ecc.errors.corrected.aggregate.total

0

[Counter]. Example: 0

Total number of corrected ECC events for all modules (SRAM, TPC, MME, etc.), i.e. errors of type SERR (single ECC error). The aggregate count starts from the time the driver is loaded.

Errors

ecc.errors.corrected.volatile.total

0

[Counter]. Example: 0

Total number of corrected ECC events for all modules (SRAM, TPC, MME, etc.), i.e. errors of type SERR (single ECC error). The volatile count starts from the time a file descriptor is opened.

Errors

ecc.errors.dram.aggregate.total

0

[Counter]. Example: 0

Number of uncorrected HBM errors. The total aggregated number of ECC errors is counted from the time the driver is loaded.

Errors

ecc.mode.current

1

1: Enabled, 0: Disabled

The ECC mode that the AIP is currently operating under.

Errors

ecc.mode.pending

1

1: Enabled, 0: Disabled

The ECC mode that the AIP will operate under after the next reboot.
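The console lines in the output examples follow a regular pattern, so they can be turned into structured records for further processing. The following is a minimal sketch, assuming the line layout is exactly as shown in the examples; parse_telemetry and the record keys are hypothetical names. Hash fields print their bytes on the line after "Data:", so the sketch attaches such a line to the preceding record.

```python
import re
from typing import Dict, List

# Layout assumed from the example output lines in this section.
LINE_RE = re.compile(
    r"^Timestamp: (\d+) ms, Packet: (.+?), Field: (\S+), Data:\s*(.*)$")
# A continuation line made only of space-separated hex byte pairs.
HEX_RE = re.compile(r"^[0-9a-fA-F]{2}(?: [0-9a-fA-F]{2})*$")

SAMPLE = """Starting telemetry data collection
Reading all synced telemetry packets
Timestamp: 36115 ms, Packet: Temperature, Field: temperature.aip, Data: 45
Timestamp: 905 ms, Packet: System Status, Field: current_running_fw, Data: 2
Timestamp: 906 ms, Packet: Security, Field: security.hash_boot_fit, Data:
b9 c0 f4 2e 40 77 ca 3f 21 52 fe 55 7b a0 3a 8c 47 ce cf 42 17 ad 98 e7 c8 98 e1 cd 0d d0 ff f6 a2 0e f5 98 c0 9c b5 61 b4 5c 51 e3 c2 84 d2 f7
Timestamp: 1270 ms, Packet: Mem_utility, Field: memory.used, Data: 672
"""


def parse_telemetry(text: str) -> List[Dict[str, object]]:
    """Parse console output into a list of records (data kept as text)."""
    records: List[Dict[str, object]] = []
    for line in text.splitlines():
        line = line.strip()
        match = LINE_RE.match(line)
        if match:
            ts, packet, field, data = match.groups()
            records.append({"timestamp_ms": int(ts), "packet": packet,
                            "field": field, "data": data})
        elif records and records[-1]["data"] == "" and HEX_RE.match(line):
            # Hash bytes printed on the line following an empty "Data:".
            records[-1]["data"] = line
        # Any other line, such as the two banner lines, is skipped.
    return records


if __name__ == "__main__":
    for record in parse_telemetry(SAMPLE):
        print(record["packet"], record["field"], record["data"])
```

The data values are kept as strings because the fields mix integers, version strings, and hashes; convert them per field as needed.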

Validating FW Images Authenticity

This section explains how to validate the authenticity of firmware images using the hl-smi-async tool by comparing its output with the corresponding SHA files included in the habanalabs-hypervisor-utils package:

  • img-hash-gaudi3-boot-fit.sha384

  • img-hash-gaudi3-images-pointers.bin.be.sha384

After you download and install the habanalabs-hypervisor-utils package, which includes the hl-smi-async tool, as described in the Installing Hypervisor Tools Package section, the SHA files are located in /lib/firmware/habanalabs/gaudi3.

To retrieve the hashes of the latest FW version, perform the following:

  1. Upgrade FW version to the latest SPI flash version as described in FW_Upgrade_Sec section.

  2. Load the LKD driver on the VM.

  3. Run the hl-smi-async utility on the hypervisor:

    sudo /usr/sbin/hl-smi-async -D <pci_addr> -O console -L info -I 1
    
  4. Using a command line utility (such as xxd), compare the contents of the SHA files with the security.hash_spi_code and security.hash_boot_fit hashes in the output of the hl-smi-async tool. The security.hash_spi_code hash is the signature of the flash image, while the security.hash_boot_fit hash is the signature of the FW application (mgmt app), which is loaded directly into RAM. The hash_spi_code is available during both the preboot and FW application run stages, while the hash_boot_fit is available only while the FW application is running, after the LKD driver has been loaded. Therefore, before the LKD driver is loaded, only the hash_spi_code is displayed. See the examples below. The values displayed in the outputs vary depending on the release build number in use:

    In the file output:

    $ xxd /lib/firmware/habanalabs/gaudi3/img-hash-gaudi3-images-pointers.bin.be.sha384
    00000000: 5c49 ed19 051e d6a6 0381 13d0 7487 dce4  \I..........t...
    00000010: 906b bb74 ee06 3569 f67a 6906 eb8a c8a9  .k.t..5i.zi.....
    00000020: abbd 0c32 0e59 1e55 48fe aa8d 8732 a6e1  ...2.Y.UH....2..
    

    Look for the following in the hl-smi-async tool output:

    Security, Field: security.hash_spi_code, Data:
    5c 49 ed 19 05 1e d6 a6 03 81 13 d0 74 87 dc e4 90 6b bb 74 ee 06 35 69 f6 7a 69 06 eb 8a c8 a9 ab bd 0c 32 0e 59 1e 55 48 fe aa 8d 87 32 a6 e1
    

    In the file output:

    $ xxd /lib/firmware/habanalabs/gaudi3/img-hash-gaudi3-boot-fit.sha384
    00000000: 0d8d d302 3aea d1d1 8821 030f 404d bf98  ....:....!..@M..
    00000010: a05c 52b1 3e6e 16c4 1678 0cc7 b195 8c42  .\R.>n...x.....B
    00000020: 6675 e039 9adf 4dfb 40d8 f622 3b11 d46c  fu.9..M.@..";..l
    

    Look for the following in the hl-smi-async tool output:

    Security, Field: security.hash_boot_fit, Data:
    0d 8d d3 02 3a ea d1 d1 88 21 03 0f 40 4d bf 98 a0 5c 52 b1 3e 6e 16 c4 16 78 0c c7 b1 95 8c 42 66 75 e0 39 9a df 4d fb 40 d8 f6 22 3b 11 d4 6c
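The manual comparison above can also be scripted. The following is a minimal sketch, assuming each .sha384 file holds the raw 48-byte digest (as the xxd dumps suggest) and that the tool prints the hash as space-separated hex bytes; hash_matches and check_file are hypothetical helper names. The demonstration uses the bytes from the first example rather than reading a file.

```python
from pathlib import Path


def hash_matches(sha_file_bytes: bytes, tool_hex: str) -> bool:
    """Compare raw SHA file bytes with the hex string printed by the tool."""
    # bytes.fromhex skips the spaces between the byte pairs.
    return sha_file_bytes == bytes.fromhex(tool_hex)


def check_file(path: str, tool_hex: str) -> bool:
    """Read a SHA file from disk and compare it with the tool output."""
    return hash_matches(Path(path).read_bytes(), tool_hex)


# Bytes copied from the xxd dump of the images-pointers SHA file above.
FILE_BYTES = bytes.fromhex(
    "5c49 ed19 051e d6a6 0381 13d0 7487 dce4"
    "906b bb74 ee06 3569 f67a 6906 eb8a c8a9"
    "abbd 0c32 0e59 1e55 48fe aa8d 8732 a6e1")

# security.hash_spi_code line copied from the tool output above.
TOOL_HEX = ("5c 49 ed 19 05 1e d6 a6 03 81 13 d0 74 87 dc e4 "
            "90 6b bb 74 ee 06 35 69 f6 7a 69 06 eb 8a c8 a9 "
            "ab bd 0c 32 0e 59 1e 55 48 fe aa 8d 87 32 a6 e1")

if __name__ == "__main__":
    print("hash_spi_code matches:", hash_matches(FILE_BYTES, TOOL_HEX))
```

On a host with the package installed, check_file can be called with the path of a SHA file under /lib/firmware/habanalabs/gaudi3 and the matching hash line from the tool output.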