hl_smi_async Tool

hl_smi_async is a utility for asynchronously monitoring and managing Gaudi devices. Its basic function is to collect telemetry data and log it. The tool reads the different telemetry packets and handles all related configuration and handshakes. For more information about the telemetry packets, see the Gaudi 3 In-Band Telemetry for Hypervisor document located in the Intel Gaudi vault and Intel RDC. The hl_smi_async application is located in the hl-smi repository: hl-smi/app/hl-smi-async.

Note

  • The tool can run with or without the driver loaded.

  • Run the tool only on a bare-metal machine or a hypervisor. Running it on a guest VM may cause undefined behavior.

Options and Usage

The following table lists the available hl_smi_async options and their usage.

Example:

sudo /usr/sbin/hl-smi-async -D b1:00.0 -O console -L info -I 5

Option                       Description

-h, --help                   Outputs the help message and exits.

-V, --version                Outputs version information and exits.

-D, --device [device]        Specifies the PCIe address of the device (e.g. b1:00.0).

-O, --output [output]        Specifies the telemetry output type. Valid values:
                               • console (default) - Prints all collected telemetry to the console.
                               • logfile - Prints all collected telemetry to a logfile: telem_log.txt

-L, --loglevel [level]       Specifies the log level. Valid values:
                               • info (default) - Log informative messages.
                               • debug - Log debug info.
                               • error - Log all errors.

-I, --iterations [number]    Specifies the number of iterations. If not set (default), the tool runs in an endless loop.
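When the tool is launched from a monitoring script, the options above can be assembled programmatically. The following is a minimal Python sketch and is not part of the hl-smi package: the helper name build_hl_smi_async_cmd is hypothetical, and the binary path and the use of sudo are taken from the example command above. It uses only the flags listed in the table.

```python
from typing import List, Optional

HL_SMI_ASYNC = "/usr/sbin/hl-smi-async"  # path taken from the example command
VALID_OUTPUTS = ("console", "logfile")
VALID_LOGLEVELS = ("info", "debug", "error")


def build_hl_smi_async_cmd(device: str,
                           output: str = "console",
                           loglevel: str = "info",
                           iterations: Optional[int] = None) -> List[str]:
    """Return an argv list for hl-smi-async (hypothetical helper)."""
    if output not in VALID_OUTPUTS:
        raise ValueError(f"unsupported output type: {output!r}")
    if loglevel not in VALID_LOGLEVELS:
        raise ValueError(f"unsupported log level: {loglevel!r}")
    cmd = ["sudo", HL_SMI_ASYNC, "-D", device, "-O", output, "-L", loglevel]
    # Omitting -I keeps the documented default: the tool runs in an endless loop.
    if iterations is not None:
        cmd += ["-I", str(iterations)]
    return cmd


if __name__ == "__main__":
    # Rebuilds the example command shown above.
    print(" ".join(build_hl_smi_async_cmd("b1:00.0", iterations=5)))
```

The returned list can be passed directly to subprocess.run, which avoids shell quoting issues with the PCIe address.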

Output example:

  • Driver is not up (preboot mode):

    Reading all synced telemetry packets
    Timestamp: 10943 ms, Packet: Temperature, Field: temperature.aip, Data: 38
    Timestamp: 10861 ms, Packet: Power, Field: power.draw.54v, Data: 157
    Timestamp: 10979 ms, Packet: Health, Field: health, Data: 1
    Timestamp: 10861 ms, Packet: Perf, Field: uptime, Data: 10
    Timestamp: 3858 ms, Packet: sys_stat, Field: ib_fw_update.stat, Data: 2
    Timestamp: 3859 ms, Packet: Security, Field: security.hash_spi_code, Data:
    4a ac 10 c9 8c 97 76 1f b0 b6 7d 04 7f 67 f9 e0 42 8c cc f3 b4 7a 22 05 74 07 23 f1 ee 14 31 43 cd 99 55 e3 05 9a 3e 3e 69 8c 0d 4d 1f 1b 0b f1
    Timestamp: 3858 ms, Packet: Mem_utility, Field: utilization.memory, Data: 0
    
  • Driver is up:

    Starting telemetry data collection
    Reading all synced telemetry packets
    Timestamp: 42164 ms, Packet: Temperature, Field: temperature.aip, Data: 44
    Timestamp: 42420 ms, Packet: Power, Field: power.draw.54v, Data: 157
    Timestamp: 42420 ms, Packet: Power, Field: power.draw.12v, Data: 14
    Timestamp: 42026 ms, Packet: Health, Field: health, Data: 3
    Timestamp: 42419 ms, Packet: Perf, Field: uptime, Data: 42
    Timestamp: 919 ms, Packet: System Status, Field: ib_fw_update.stat, Data: 2
    Timestamp: 919 ms, Packet: System Status, Field: ethernet_ports.state, Data: 16777215
    Timestamp: 919 ms, Packet: Security, Field: security.hash_spi_code, Data:
    5c 49 ed 19 05 1e d6 a6 03 81 13 d0 74 87 dc e4 90 6b bb 74 ee 06 35 69 f6 7a 69 06 eb 8a c8 a9 ab bd 0c 32 0e 59 1e 55 48 fe aa 8d 87 32 a6 e1
    Timestamp: 919 ms, Packet: Security, Field: security.hash_boot_fit, Data:
    0d 8d d3 02 3a ea d1 d1 88 21 03 0f 40 4d bf 98 a0 5c 52 b1 3e 6e 16 c4 16 78 0c c7 b1 95 8c 42 66 75 e0 39 9a df 4d fb 40 d8 f6 22 3b 11 d4 6c
    Timestamp: 5061 ms, Packet: Mem_utility, Field: utilization.memory, Data: 0
    Reading all synced telemetry packets
    

Some fields are invalid when the driver is not up (preboot mode). For more information, refer to the Gaudi 3 In-Band Telemetry for Hypervisor document located in the Intel Gaudi vault and Intel RDC.
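For post-processing, the console output shown above can be turned into structured records. The following Python sketch is illustrative only: the function name parse_telemetry is hypothetical, and the line format is inferred from the output examples above, so it may need adjusting for other releases. It assumes that a field with an empty Data value (such as the security hashes) continues on the next non-empty line.

```python
import re
from typing import Dict, Iterable, List

# Matches lines such as:
# "Timestamp: 10943 ms, Packet: Temperature, Field: temperature.aip, Data: 38"
LINE_RE = re.compile(
    r"Timestamp:\s*(?P<ts>\d+)\s*ms,\s*Packet:\s*(?P<packet>.+?),\s*"
    r"Field:\s*(?P<field>[^,\s]+),\s*Data:\s*(?P<data>.*)$"
)


def parse_telemetry(lines: Iterable[str]) -> List[Dict[str, object]]:
    """Parse hl-smi-async console lines into records (format is an assumption)."""
    records: List[Dict[str, object]] = []
    pending = None  # record whose Data value is printed on the following line
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if pending is not None:
            pending["data"] = line  # e.g. the hex bytes of a security hash
            records.append(pending)
            pending = None
            continue
        match = LINE_RE.match(line)
        if match is None:
            continue  # status lines such as "Reading all synced telemetry packets"
        record = {
            "timestamp_ms": int(match["ts"]),
            "packet": match["packet"],
            "field": match["field"],
            "data": match["data"].strip(),
        }
        if record["data"]:
            records.append(record)
        else:
            pending = record
    if pending is not None:
        records.append(pending)  # keep a trailing field even if no data followed
    return records
```

Data values are kept as strings because the fields mix integers and hex byte sequences; callers can convert them per field.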

Validating FW Images Authenticity

This section explains how to validate the authenticity of firmware images using the hl-smi-async tool by comparing its output with the corresponding SHA files included in the habanalabs-hypervisor-utils package:

  • img-hash-gaudi3-boot-fit.sha384

  • img-hash-gaudi3-images-pointers.bin.be.sha384

After you download and install the habanalabs-hypervisor-utils package, which includes the hl-smi-async tool, as described in the Installing Hypervisor Tools Package section, the SHA files are located in /lib/firmware/habanalabs/gaudi3.

To retrieve the hashes of the latest FW version, perform the following:

  1. Upgrade the FW to the latest SPI flash version as described in the Firmware Upgrade section.

  2. Load the LKD driver on the VM.

  3. Run the hl-smi-async utility on the hypervisor:

    sudo /usr/sbin/hl-smi-async -D <pci_addr> -O console -L info -I 1
    
  4. Using a command-line utility such as xxd, compare the contents of the SHA files with the security.hash_spi_code and security.hash_boot_fit values in the hl-smi-async output. The security.hash_spi_code hash is the signature of the flash image, while the security.hash_boot_fit hash is the signature of the FW application (mgmt app), which is loaded directly into RAM. The hash_spi_code is available during both the preboot and FW application stages, whereas the hash_boot_fit is available only while the FW application is running, that is, after the LKD driver has been loaded. Before the LKD driver is loaded, therefore, only the hash_spi_code is displayed. See the examples below. The values displayed in the outputs vary depending on the release build number in use:

    In the file output:

    $ xxd /lib/firmware/habanalabs/gaudi3/img-hash-gaudi3-images-pointers.bin.be.sha384
    00000000: 5c49 ed19 051e d6a6 0381 13d0 7487 dce4  \I..........t...
    00000010: 906b bb74 ee06 3569 f67a 6906 eb8a c8a9  .k.t..5i.zi.....
    00000020: abbd 0c32 0e59 1e55 48fe aa8d 8732 a6e1  ...2.Y.UH....2..
    

    Look for the matching value in the hl-smi-async tool output:

    Security, Field: security.hash_spi_code, Data:
    5c 49 ed 19 05 1e d6 a6 03 81 13 d0 74 87 dc e4 90 6b bb 74 ee 06 35 69 f6 7a 69 06 eb 8a c8 a9 ab bd 0c 32 0e 59 1e 55 48 fe aa 8d 87 32 a6 e1
    

    In the file output:

    $ xxd /lib/firmware/habanalabs/gaudi3/img-hash-gaudi3-boot-fit.sha384
    00000000: 0d8d d302 3aea d1d1 8821 030f 404d bf98  ....:....!..@M..
    00000010: a05c 52b1 3e6e 16c4 1678 0cc7 b195 8c42  .\R.>n...x.....B
    00000020: 6675 e039 9adf 4dfb 40d8 f622 3b11 d46c  fu.9..M.@..";..l
    

    Look for the matching value in the hl-smi-async tool output:

    Security, Field: security.hash_boot_fit, Data:
    0d 8d d3 02 3a ea d1 d1 88 21 03 0f 40 4d bf 98 a0 5c 52 b1 3e 6e 16 c4 16 78 0c c7 b1 95 8c 42 66 75 e0 39 9a df 4d fb 40 d8 f6 22 3b 11 d4 6c
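The comparison in step 4 can also be scripted instead of being checked by eye. The following is a minimal Python sketch under stated assumptions: the helper names hash_matches and check_against_file are hypothetical, the SHA file is assumed to hold the raw 48-byte SHA-384 digest (as the xxd dumps above suggest), and the tool data is the space-separated hex string printed after the field name.

```python
from pathlib import Path

SHA384_LEN = 48  # a SHA-384 digest is 48 bytes, matching the xxd dumps above


def hash_matches(sha_file_bytes: bytes, tool_data: str) -> bool:
    """Compare raw SHA file contents with the hex string printed by the tool."""
    reported = bytes.fromhex(tool_data)  # fromhex skips the spaces between bytes
    return len(reported) == SHA384_LEN and reported == sha_file_bytes


def check_against_file(sha_path: str, tool_data: str) -> bool:
    """Read a .sha384 file and compare it with a value copied from the output."""
    return hash_matches(Path(sha_path).read_bytes(), tool_data)


if __name__ == "__main__":
    # Value copied from the security.hash_boot_fit example above; real values
    # vary with the release build number in use.
    boot_fit = (
        "0d 8d d3 02 3a ea d1 d1 88 21 03 0f 40 4d bf 98 "
        "a0 5c 52 b1 3e 6e 16 c4 16 78 0c c7 b1 95 8c 42 "
        "66 75 e0 39 9a df 4d fb 40 d8 f6 22 3b 11 d4 6c"
    )
    path = "/lib/firmware/habanalabs/gaudi3/img-hash-gaudi3-boot-fit.sha384"
    print("match" if check_against_file(path, boot_fit) else "MISMATCH")
```

As in the examples above, img-hash-gaudi3-images-pointers.bin.be.sha384 is compared with security.hash_spi_code, and img-hash-gaudi3-boot-fit.sha384 with security.hash_boot_fit.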