hl_smi_async Tool
hl_smi_async is a utility for monitoring and managing Gaudi devices asynchronously.
Its basic function is to collect telemetry data and log it. hl_smi_async reads the different telemetry packets and handles all the related configuration and handshakes. For more information
about the telemetry packets, see the Gaudi 3 In-Band Telemetry for Hypervisor document located in the Intel Gaudi vault and Intel RDC.
The hl_smi_async application is located in the hl-smi repository: hl-smi/app/hl-smi-async.
Note
The tool can run with or without the driver loaded.
The tool can run only on a bare-metal machine or a hypervisor. Running it on a guest VM may cause undefined behavior.
Options and Usage
The table in this section lists the available hl_smi_async options and their usage.
Example:
sudo /usr/sbin/hl-smi-async -D 0000:b1:00.0 -O console -L info -I 5
| Option | Description |
|---|---|
| | Outputs the help message and exits. |
| | Outputs version information and exits. |
| -D | Specifies the PCIe address of the device (e.g. 0000:b1:00.0). |
| -O | Specifies the telemetry output type. Valid values: |
| -L | Specifies the log level. Valid values: |
| -I | Specifies the number of iterations. If not set (default), the tool runs in an endless loop. |
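As a minimal sketch, the command line from the example above can be assembled programmatically. The helper name `build_command`, the PCIe address check, and the default binary path are assumptions made for illustration; only the `-D`, `-O`, `-L` and `-I` flags and the `console` and `info` values come from the example in this section.

```python
import re
import shlex

# Hypothetical helper for illustration; it is not part of hl-smi-async.
# Matches a PCIe address of the form domain:bus:device.function.
PCI_ADDR_RE = re.compile(r"^[0-9a-fA-F]{4}:[0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-7]$")


def build_command(pci_addr, iterations=None, output="console", log_level="info",
                  binary="/usr/sbin/hl-smi-async"):
    """Return the argv list for one hl-smi-async invocation."""
    if not PCI_ADDR_RE.match(pci_addr):
        raise ValueError(f"not a PCIe address: {pci_addr!r}")
    cmd = ["sudo", binary, "-D", pci_addr, "-O", output, "-L", log_level]
    # Without -I the tool runs in an endless loop, so the flag is added
    # only when a positive iteration count is requested.
    if iterations is not None:
        if iterations < 1:
            raise ValueError("iterations must be a positive integer")
        cmd += ["-I", str(iterations)]
    return cmd


if __name__ == "__main__":
    print(shlex.join(build_command("0000:b1:00.0", iterations=5)))
```

The returned list can be passed to `subprocess.run` as is; printing it with `shlex.join` reproduces the example command.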
Output Examples
Starting telemetry data collection
Reading all synced telemetry packets
Timestamp: 23958 ms, Packet: Temperature, Field: temperature.aip, Data: 47
Timestamp: 24340 ms, Packet: Power, Field: power.draw.54v, Data: 191
Timestamp: 23988 ms, Packet: Health, Field: health, Data: 1
Timestamp: 24340 ms, Packet: Perf, Field: uptime, Data: 24
Timestamp: 24340 ms, Packet: Perf, Field: utilization.aip, Data: 0
Timestamp: 3835 ms, Packet: System Status, Field: ib_fw_update.stat, Data: 2
Timestamp: 3835 ms, Packet: System Status, Field: ethernet_ports.state, Data: 16777215
Timestamp: 3835 ms, Packet: System Status, Field: current_running_fw, Data: 1
Timestamp: 3835 ms, Packet: Security, Field: security.hash_spi_code, Data:
da a8 37 cd ed bc 7c 6f df 58 18 06 f9 86 05 4a d9 cf ac 8e 0d 4d 52 f1 da 82 eb 09 b6 69 06 26 f8 89 5c 52 c9 59 5f 13 3b 69 8c f0 c9 43 c8 51
Timestamp: 3842 ms, Packet: Mem_utility, Field: utilization.memory, Data: 0
Timestamp: 3842 ms, Packet: Mem_utility, Field: memory.total, Data: 131072
Timestamp: 3842 ms, Packet: Mem_utility, Field: memory.free, Data: 131072
Timestamp: 3842 ms, Packet: Mem_utility, Field: memory.used, Data: 0
Timestamp: 3835 ms, Packet: System Information, Field: fw_version.spi, Data: 1.21.1.0
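Console output in this shape can be turned into structured records for further processing. The sketch below assumes the line layout shown in the sample above (`Timestamp: <n> ms, Packet: <name>, Field: <name>, Data: <value>`), and assumes that a hash value is printed on the line following an empty `Data:` field. The names `LINE_RE` and `parse_telemetry` are hypothetical.

```python
import re

# Layout assumed from the sample output above.
LINE_RE = re.compile(
    r"^Timestamp: (?P<ts>\d+) ms, Packet: (?P<packet>.+?), "
    r"Field: (?P<field>\S+), Data:\s*(?P<data>.*)$"
)


def parse_telemetry(text):
    """Parse console output into a list of record dictionaries."""
    records = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        match = LINE_RE.match(line)
        if match:
            records.append({
                "timestamp_ms": int(match["ts"]),
                "packet": match["packet"],
                "field": match["field"],
                "data": match["data"].strip(),
            })
        elif records and records[-1]["data"] == "":
            # Continuation line: the value of the preceding record,
            # as happens with the security hash fields.
            records[-1]["data"] = line
        # Any other line (for example, a status banner) is ignored.
    return records


SAMPLE = """\
Starting telemetry data collection
Reading all synced telemetry packets
Timestamp: 23958 ms, Packet: Temperature, Field: temperature.aip, Data: 47
Timestamp: 3835 ms, Packet: System Status, Field: ethernet_ports.state, Data: 16777215
Timestamp: 3835 ms, Packet: Security, Field: security.hash_spi_code, Data:
da a8 37 cd ed bc 7c 6f df 58 18 06 f9 86 05 4a d9 cf ac 8e 0d 4d 52 f1 da 82 eb 09 b6 69 06 26 f8 89 5c 52 c9 59 5f 13 3b 69 8c f0 c9 43 c8 51
"""

if __name__ == "__main__":
    for record in parse_telemetry(SAMPLE):
        print(record)
```

All values are kept as strings because their meaning depends on the field; conversion to integers or bytes is left to the caller.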
The following table describes each field in the output above:
| Packet Type | Field | Value | Unit | Description |
|---|---|---|---|---|
| Temperature | temperature.aip | 47 | Measured in Celsius. Example: 40 C | Single composite temperature measurement that provides the maximal measurement of the SoC, HBM and VRM sensors (with all aligned to the same threshold). |
| Power | power.draw.54v | 191 | [Value] W. Example: 40 W | The total power consumption of the AIP device drawn from the 54V. |
| Health | health | 1 | Values: 0 - Unknown, 1 - Normal, 2 - Non-critical, 3 - Critical, 4 - Fatal | Information on the device’s health with severity indication. |
| Performance | uptime | 24 | XXXXX sec | Uptime since the last reset (any type of reset), counted from the OS bring-up, i.e. the count is restarted when moving from preboot to management FW. |
| Performance | utilization.aip | 0 | Percentage. Example: 80% | Returns a utilization measurement based on the consumed power out of the total power. |
| System Status | ib_fw_update.stat | 2 | Values: 0 - Unknown, 1 - Locked, 2 - Unlocked | IB FW update state: Locked or unlocked. |
| System Status | ethernet_ports.state | 16777215 | Bitmask. Refer to the description. | Bitmask field for the Ethernet ports’ state. Per bit (according to its location): 1 - Port is enabled, 0 - Port is disabled. Bit 0 relates to port 1, bit 1 relates to port 2, and so on until bit 23, which relates to port 24. Since there are 24 ports, the 8 most significant bits are always set to 0. By default, all valid ports are enabled (0x00FF_FFFF). The value can be changed using the related effecter via OOB or using IB FW update with the relevant ITBs. The value is currently represented in decimal format. Convert it to hexadecimal to interpret and use it as a bitmask. |
| System Status | current_running_fw | 1 | Refer to the description. | Running FW states: 0x1 = preBoot, 0x2 = management FW, 0x3 = preBoot recovery, 0x4 = Margin tool, 0x5 = FW loader agent. |
| Security | security.hash_spi_code | da a8 37 cd … | SHA384 | SHA384 for the Manifest of SPI code (ppBoot and preBoot). |
| Mem_utility | utilization.memory | 0 | Percentage of used memory. | HBM memory usage (including reserved) measured over 1000 ms. |
| Mem_utility | memory.total | 131072 | [Value] in MiB. Example: 32768 MiB | Total memory available (free + used). |
| Mem_utility | memory.free | 131072 | [Value] in MiB. Example: 32256 MiB | Free memory size. |
| Mem_utility | memory.used | 0 | [Value] in MiB. Example: 512 MiB | Used memory size. |
| System Information | fw_version.spi | 1.21.1.0 | Displayed in the following format | OS firmware version running on the system. |
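The `ethernet_ports.state` value is printed in decimal, so it needs converting before it can be read as a bitmask. The following sketch does that conversion and lists the enabled ports, using the bit-to-port mapping from the table (bit 0 is port 1, bit 23 is port 24). The function name `decode_port_mask` is hypothetical, and rejecting values with any of the upper 8 bits set is an assumption based on the statement that those bits are always 0.

```python
PORT_COUNT = 24  # bits 0..23 map to ports 1..24


def decode_port_mask(value):
    """Return (hex string, list of enabled port numbers) for a decimal mask."""
    mask = int(value)
    if mask < 0 or mask >> PORT_COUNT:
        # The 8 most significant bits are documented as always 0.
        raise ValueError(f"value outside the 24-bit port mask range: {value}")
    enabled = [bit + 1 for bit in range(PORT_COUNT) if mask & (1 << bit)]
    return f"0x{mask:08X}", enabled


if __name__ == "__main__":
    hex_mask, ports = decode_port_mask("16777215")
    print(hex_mask)    # 0x00FFFFFF
    print(len(ports))  # 24, meaning all ports are enabled
```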
Note
Some fields are invalid when the driver is not up (preboot mode). For more information, refer to the Gaudi 3 In-Band Telemetry for Hypervisor document located in the Intel Gaudi vault and Intel RDC.
Starting telemetry data collection
Reading all synced telemetry packets
Timestamp: 36115 ms, Packet: Temperature, Field: temperature.aip, Data: 45
Timestamp: 35909 ms, Packet: Power, Field: power.draw.54v, Data: 178
Timestamp: 35909 ms, Packet: Power, Field: power.draw.12v, Data: 14
Timestamp: 35219 ms, Packet: Health, Field: health, Data: 1
Timestamp: 35909 ms, Packet: Perf, Field: uptime, Data: 35
Timestamp: 35909 ms, Packet: Perf, Field: utilization.aip, Data: 0
Timestamp: 35909 ms, Packet: Perf, Field: stats.violation.power, Data: 0
Timestamp: 35909 ms, Packet: Perf, Field: stats.violation.thermal, Data: 0
Timestamp: 905 ms, Packet: System Status, Field: ib_fw_update.stat, Data: 2
Timestamp: 905 ms, Packet: System Status, Field: ethernet_ports.state, Data: 16777215
Timestamp: 905 ms, Packet: System Status, Field: current_running_fw, Data: 2
Timestamp: 906 ms, Packet: Security, Field: security.hash_spi_code, Data:
da a8 37 cd ed bc 7c 6f df 58 18 06 f9 86 05 4a d9 cf ac 8e 0d 4d 52 f1 da 82 eb 09 b6 69 06 26 f8 89 5c 52 c9 59 5f 13 3b 69 8c f0 c9 43 c8 51
Timestamp: 906 ms, Packet: Security, Field: security.hash_boot_fit, Data:
b9 c0 f4 2e 40 77 ca 3f 21 52 fe 55 7b a0 3a 8c 47 ce cf 42 17 ad 98 e7 c8 98 e1 cd 0d d0 ff f6 a2 0e f5 98 c0 9c b5 61 b4 5c 51 e3 c2 84 d2 f7
Timestamp: 1270 ms, Packet: Mem_utility, Field: utilization.memory, Data: 0
Timestamp: 1270 ms, Packet: Mem_utility, Field: memory.total, Data: 131072
Timestamp: 1270 ms, Packet: Mem_utility, Field: memory.free, Data: 130400
Timestamp: 1270 ms, Packet: Mem_utility, Field: memory.used, Data: 672
Timestamp: 1084 ms, Packet: System Information, Field: fw_version.spi, Data: 1.21.1.0
Timestamp: 1084 ms, Packet: System Information, Field: fw_version.os, Data: 1.21.1.0
Timestamp: 35909 ms, Packet: Errors, Field: ecc.errors.uncorrected.aggregate.total, Data: 0
Timestamp: 35909 ms, Packet: Errors, Field: ecc.errors.uncorrected.volatile.total, Data: 0
Timestamp: 35909 ms, Packet: Errors, Field: ecc.errors.corrected.aggregate.total, Data: 0
Timestamp: 35909 ms, Packet: Errors, Field: ecc.errors.corrected.volatile.total, Data: 0
Timestamp: 35909 ms, Packet: Errors, Field: ecc.errors.dram.aggregate.total, Data: 0
Timestamp: 35909 ms, Packet: Errors, Field: ecc.mode.current, Data: 1
Timestamp: 35909 ms, Packet: Errors, Field: ecc.mode.pending, Data: 1
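The second sample reports fields that are absent from the first one, such as `power.draw.12v`, `security.hash_boot_fit`, `fw_version.os` and the Errors packet. A quick way to see the difference between two captures is to compare their field names. This is a minimal sketch that assumes the `Field: <name>, Data:` layout shown in the samples; `field_names` and `added_fields` are hypothetical helpers.

```python
import re

# Layout assumed from the sample output in this section.
FIELD_RE = re.compile(r"Field: (\S+), Data:")


def field_names(text):
    """Return the set of field names found in a console capture."""
    return set(FIELD_RE.findall(text))


def added_fields(before, after):
    """Return the sorted field names present in `after` but not in `before`."""
    return sorted(field_names(after) - field_names(before))


if __name__ == "__main__":
    first = "Timestamp: 24340 ms, Packet: Power, Field: power.draw.54v, Data: 191\n"
    second = (
        "Timestamp: 35909 ms, Packet: Power, Field: power.draw.54v, Data: 178\n"
        "Timestamp: 35909 ms, Packet: Power, Field: power.draw.12v, Data: 14\n"
    )
    print(added_fields(first, second))  # ['power.draw.12v']
```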
The following table describes each field in the output above:
| Packet Type | Field | Value | Unit | Description |
|---|---|---|---|---|
| Temperature | temperature.aip | 45 | Degrees Celsius. Example: 40 C | Single composite temperature measurement that provides the maximal measurement of the SoC, HBM and VRM sensors (with all aligned to the same threshold). |
| Power | power.draw.54v | 178 | [Value] W. Example: 40 W | The total power consumption of the AIP device drawn from the 54V. |
| Power | power.draw.12v | 14 | [Value] W. Example: 40 W | The total power consumption of the AIP device drawn from the 12V. |
| Health | health | 1 | Values: 0 - Unknown, 1 - Normal, 2 - Non-critical, 3 - Critical, 4 - Fatal | Information on the device’s health with severity indication. |
| Performance | uptime | 35 | XXXXX sec | Uptime since the last reset (any type of reset), counted from the OS bring-up, i.e. the count is restarted when moving from preboot to management FW. |
| Performance | utilization.aip | 0 | Percentage. Example: 80% | Returns a utilization measurement based on the consumed power out of the total power. |
| Performance | stats.violation.power | 0 | XXXXX nsec | Duration of the latest power-related throttling event per device (ns). This is the actual duration in which PID applied throttling. Internally, it is counted in ms and returned in ns by request. |
| Performance | stats.violation.thermal | 0 | XXXXX nsec | Duration of the latest thermal-related throttling event per device (ns). Internally, it is counted in ms and returned in ns by request. |
| System Status | ib_fw_update.stat | 2 | Values: 0 - Unknown, 1 - Locked, 2 - Unlocked | IB FW update state: Locked or unlocked. |
| System Status | ethernet_ports.state | 16777215 | Bitmask. Refer to the description. | Bitmask field for the Ethernet ports’ state. Per bit (according to its location): 1 - Port is enabled, 0 - Port is disabled. Bit 0 relates to port 1, bit 1 relates to port 2, and so on until bit 23, which relates to port 24. Since there are 24 ports, the 8 most significant bits are always set to 0. By default, all valid ports are enabled (0x00FF_FFFF). The value can be changed using the related effecter via OOB or using IB FW update with the relevant ITBs. The value is currently represented in decimal format. Convert it to hexadecimal to interpret and use it as a bitmask. |
| System Status | current_running_fw | 2 | Refer to the description. | Running FW: 0x1 is preBoot, 0x2 is management FW, 0x3 is preBoot recovery, 0x4 is Margin tool, 0x5 is FW loader agent. |
| Security | security.hash_spi_code | da a8 37 cd … | SHA384 | SHA384 for the Manifest of SPI code (ppBoot and preBoot). |
| Security | security.hash_boot_fit | b9 c0 f4 2e … | SHA384 | SHA384 for the boot fit. |
| Mem_utility | utilization.memory | 0 | Percentage of used memory. | HBM memory usage, including memory reserved by driver/FW, over 1000 ms. |
| Mem_utility | memory.total | 131072 | [Value] in MiB. Example: 32768 MiB | Total size of available memory (free + used). |
| Mem_utility | memory.free | 130400 | [Value] in MiB. Example: 32256 MiB | Size of free memory. |
| Mem_utility | memory.used | 672 | [Value] in MiB. Example: 512 MiB | Size of used memory. |
| System Information | fw_version.spi | 1.21.1.0 | Displayed in the following format | The version of the preboot, which is stored in the SPI flash. |
| System Information | fw_version.os | 1.21.1.0 | Displayed in the following format | The version of the arcmgmt. Available only when the driver is up. |
| Errors | ecc.errors.uncorrected.aggregate.total | 0 | [Counter]. Example: 0 | Number of total uncorrected ECC events for all modules (SRAM, TPC, MME, etc.), i.e. errors of type DERR (double ECC error). The total aggregated number of ECC errors is counted from the time the driver is loaded. |
| Errors | ecc.errors.uncorrected.volatile.total | 0 | [Counter]. Example: 0 | Number of total uncorrected ECC events for all modules (SRAM, TPC, MME, etc.), i.e. errors of type DERR (double ECC error). The total volatile number of ECC errors is counted from the time a file descriptor is opened. Double-bit errors are detected but not corrected. |
| Errors | ecc.errors.corrected.aggregate.total | 0 | [Counter]. Example: 0 | Number of total corrected ECC events for all modules (SRAM, TPC, MME, etc.), i.e. errors of type SERR (single ECC error). The total aggregated number of ECC errors is counted from the time the driver is loaded. |
| Errors | ecc.errors.corrected.volatile.total | 0 | [Counter]. Example: 0 | Number of total corrected ECC events for all modules (SRAM, TPC, MME, etc.), i.e. errors of type SERR (single ECC error). The total volatile number of ECC errors is counted from the time a file descriptor is opened. |
| Errors | ecc.errors.dram.aggregate.total | 0 | [Counter]. Example: 0 | Number of uncorrected HBM errors. The total aggregated number of ECC errors is counted from the time the driver is loaded. |
| Errors | ecc.mode.current | 1 | 1: Enabled, 0: Disabled | The ECC mode that the AIP is currently operating under. |
| Errors | ecc.mode.pending | 1 | 1: Enabled, 0: Disabled | The ECC mode that the AIP will operate under after the next reboot. |
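Several fields in the table are numeric codes or related quantities, and a small lookup makes them easier to read. The sketch below maps the coded fields to the labels listed in the table and checks that `memory.total` equals `memory.free` plus `memory.used`, as the description states. The names `DECODERS`, `describe` and `memory_is_consistent` are hypothetical, and returning "Unrecognized" for a code outside the table is an assumption.

```python
# Code-to-label mappings copied from the table above.
HEALTH = {0: "Unknown", 1: "Normal", 2: "Non-critical", 3: "Critical", 4: "Fatal"}
IB_FW_UPDATE = {0: "Unknown", 1: "Locked", 2: "Unlocked"}
RUNNING_FW = {
    1: "preBoot",
    2: "management FW",
    3: "preBoot recovery",
    4: "Margin tool",
    5: "FW loader agent",
}
ECC_MODE = {0: "Disabled", 1: "Enabled"}

DECODERS = {
    "health": HEALTH,
    "ib_fw_update.stat": IB_FW_UPDATE,
    "current_running_fw": RUNNING_FW,
    "ecc.mode.current": ECC_MODE,
    "ecc.mode.pending": ECC_MODE,
}


def describe(field, data):
    """Return the label for a coded field, or the raw data for other fields."""
    table = DECODERS.get(field)
    if table is None:
        return data
    return table.get(int(data), f"Unrecognized ({data})")


def memory_is_consistent(total_mib, free_mib, used_mib):
    """Check the documented relation: total = free + used (all in MiB)."""
    return total_mib == free_mib + used_mib


if __name__ == "__main__":
    print(describe("health", "1"))              # Normal
    print(describe("current_running_fw", "2"))  # management FW
    print(memory_is_consistent(131072, 130400, 672))  # True
```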
Validating FW Images Authenticity
This section explains how to validate the authenticity of firmware images using the hl-smi-async tool by comparing its output with the corresponding SHA files included in the habanalabs-hypervisor-utils package:
img-hash-gaudi3-boot-fit.sha384
img-hash-gaudi3-images-pointers.bin.be.sha384
After you download and install the habanalabs-hypervisor-utils package, which includes the hl-smi-async tool, as described in the Installing Hypervisor Tools Package section, the SHA files are located in /lib/firmware/habanalabs/gaudi3.
To retrieve the hashes of the latest FW version, perform the following:
1. Upgrade the FW to the latest SPI flash version as described in the FW_Upgrade_Sec section.
2. Load the LKD driver on the VM.
3. Run the hl-smi-async utility on the hypervisor:
sudo /usr/sbin/hl-smi-async -D <pci_addr> -O console -L info -I 1
4. Using a command-line utility (such as xxd), compare the contents of the SHA files with the security.hash_spi_code and security.hash_boot_fit hashes in the output of the hl-smi-async tool.
The security.hash_spi_code hash is the signature of the flash image, while the security.hash_boot_fit hash is the signature of the FW application (mgmt app), which is loaded directly into RAM. The hash_spi_code is accessible during both the preboot and FW application run stages, while the hash_boot_fit is available only while the FW application is running, after the LKD driver has been loaded. Therefore, before the LKD driver is loaded, only the hash_spi_code is displayed; the hash_boot_fit is shown only after the LKD is running. See the examples below. The values displayed in the outputs vary depending on the release build number in use.
In the file output:
$ xxd /lib/firmware/habanalabs/gaudi3/img-hash-gaudi3-images-pointers.bin.be.sha384
00000000: 5c49 ed19 051e d6a6 0381 13d0 7487 dce4  \I..........t...
00000010: 906b bb74 ee06 3569 f67a 6906 eb8a c8a9  .k.t..5i.zi.....
00000020: abbd 0c32 0e59 1e55 48fe aa8d 8732 a6e1  ...2.Y.UH....2..
Look for the following from hl-smi-async tool output:
Security, Field: security.hash_spi_code, Data: 5c 49 ed 19 05 1e d6 a6 03 81 13 d0 74 87 dc e4 90 6b bb 74 ee 06 35 69 f6 7a 69 06 eb 8a c8 a9 ab bd 0c 32 0e 59 1e 55 48 fe aa 8d 87 32 a6 e1
In the file output:
$ xxd /lib/firmware/habanalabs/gaudi3/img-hash-gaudi3-boot-fit.sha384
00000000: 0d8d d302 3aea d1d1 8821 030f 404d bf98  ....:....!..@M..
00000010: a05c 52b1 3e6e 16c4 1678 0cc7 b195 8c42  .\R.>n...x.....B
00000020: 6675 e039 9adf 4dfb 40d8 f622 3b11 d46c  fu.9..M.@..";..l
Look for the following from hl-smi-async tool output:
Security, Field: security.hash_boot_fit, Data: 0d 8d d3 02 3a ea d1 d1 88 21 03 0f 40 4d bf 98 a0 5c 52 b1 3e 6e 16 c4 16 78 0c c7 b1 95 8c 42 66 75 e0 39 9a df 4d fb 40 d8 f6 22 3b 11 d4 6c
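The manual comparison above can also be scripted. The sketch below assumes that each SHA file holds the raw 48-byte SHA-384 digest, as the xxd dumps suggest, and that the tool prints the hash as space-separated hexadecimal bytes. The function name `hash_matches` is hypothetical.

```python
from pathlib import Path

SHA384_LENGTH = 48  # bytes in a SHA-384 digest


def hash_matches(sha_file, tool_hex):
    """Compare a raw digest file with the hex bytes printed by the tool.

    sha_file: path to a file such as img-hash-gaudi3-boot-fit.sha384,
              assumed to contain the raw digest bytes.
    tool_hex: the Data value copied from the tool output,
              for example "0d 8d d3 02 ...".
    """
    expected = Path(sha_file).read_bytes()
    # bytes.fromhex accepts whitespace between byte pairs and raises
    # ValueError if the text is not valid hexadecimal.
    reported = bytes.fromhex(tool_hex)
    return len(expected) == SHA384_LENGTH and expected == reported


if __name__ == "__main__":
    import tempfile

    # Demonstration with a temporary file in place of the real SHA file.
    sample_hex = (
        "0d 8d d3 02 3a ea d1 d1 88 21 03 0f 40 4d bf 98 "
        "a0 5c 52 b1 3e 6e 16 c4 16 78 0c c7 b1 95 8c 42 "
        "66 75 e0 39 9a df 4d fb 40 d8 f6 22 3b 11 d4 6c"
    )
    with tempfile.NamedTemporaryFile(suffix=".sha384") as handle:
        handle.write(bytes.fromhex(sample_hex))
        handle.flush()
        print(hash_matches(handle.name, sample_hex))  # True
```

On a real system, the first argument would be one of the files under /lib/firmware/habanalabs/gaudi3 and the second the matching Data value from the tool output.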