hl_smi_async Tool

hl_smi_async is a utility for monitoring and managing Gaudi devices asynchronously. Its basic function is to collect telemetry data and log it. The tool reads the different telemetry packets and handles all related configuration and handshakes. For more information about the telemetry packets, see the Gaudi 3 In-Band Telemetry for Hypervisor document, located in the Intel Gaudi vault and Intel RDC. The hl_smi_async application is located in the hl-smi repository: hl-smi/app/hl-smi-async.

Note

  • The tool can run with or without the driver loaded.

  • The tool can run only on a bare-metal machine or a hypervisor. Running it on a guest VM may cause undefined behavior.

Options and Usage

The following table lists the available hl_smi_async options and their usage.

Example:

sudo /usr/sbin/hl-smi-async -D b1:00.0 -O console -L info -I 5

Option                       Description

-h, --help                   Outputs the help message and exits.

-V, --version                Outputs version information and exits.

-D, --device [device]        Specifies the PCIe address of the device (e.g. b1:00.0).

-O, --output [output]        Specifies the telemetry output type. Valid values:
                               • console (default) - Prints all collected telemetry to the console.
                               • logfile - Writes all collected telemetry to the log file telem_log.txt.

-L, --loglevel [level]       Specifies the log level. Valid values:
                               • info (default) - Logs informative messages.
                               • debug - Logs debug information.
                               • error - Logs all errors.

-I, --iterations [number]    Specifies the number of iterations. If not set (default), the tool runs in an endless loop.

Output example:

  • Driver is not up (preboot mode):

    Reading all synced telemetry packets
    Timestamp: 10943 ms, Packet: Temperature, Field: temperature.aip, Data: 38
    Timestamp: 10861 ms, Packet: Power, Field: power.draw.54v, Data: 157
    Timestamp: 10979 ms, Packet: Health, Field: health, Data: 1
    Timestamp: 10861 ms, Packet: Perf, Field: uptime, Data: 10
    Timestamp: 3858 ms, Packet: sys_stat, Field: ib_fw_update.stat, Data: 2
    Timestamp: 3859 ms, Packet: Security, Field: security.hash_spi_code, Data:
    4a ac 10 c9 8c 97 76 1f b0 b6 7d 04 7f 67 f9 e0 42 8c cc f3 b4 7a 22 05 74 07 23 f1 ee 14 31 43 cd 99 55 e3 05 9a 3e 3e 69 8c 0d 4d 1f 1b 0b f1
    Timestamp: 3858 ms, Packet: Mem_utility, Field: utilization.memory, Data: 0
    
  • Driver is up:

    Starting telemetry data collection
    Reading all synced telemetry packets
    Timestamp: 42164 ms, Packet: Temperature, Field: temperature.aip, Data: 44
    Timestamp: 42420 ms, Packet: Power, Field: power.draw.54v, Data: 157
    Timestamp: 42420 ms, Packet: Power, Field: power.draw.12v, Data: 14
    Timestamp: 42026 ms, Packet: Health, Field: health, Data: 3
    Timestamp: 42419 ms, Packet: Perf, Field: uptime, Data: 42
    Timestamp: 919 ms, Packet: System Status, Field: ib_fw_update.stat, Data: 2
    Timestamp: 919 ms, Packet: System Status, Field: ethernet_ports.state, Data: 16777215
    Timestamp: 919 ms, Packet: Security, Field: security.hash_spi_code, Data:
    5c 49 ed 19 05 1e d6 a6 03 81 13 d0 74 87 dc e4 90 6b bb 74 ee 06 35 69 f6 7a 69 06 eb 8a c8 a9 ab bd 0c 32 0e 59 1e 55 48 fe aa 8d 87 32 a6 e1
    Timestamp: 919 ms, Packet: Security, Field: security.hash_boot_fit, Data:
    0d 8d d3 02 3a ea d1 d1 88 21 03 0f 40 4d bf 98 a0 5c 52 b1 3e 6e 16 c4 16 78 0c c7 b1 95 8c 42 66 75 e0 39 9a df 4d fb 40 d8 f6 22 3b 11 d4 6c
    Timestamp: 5061 ms, Packet: Mem_utility, Field: utilization.memory, Data: 0
    Reading all synced telemetry packets
    

Some fields are invalid when the driver is not up (preboot mode). For more information, refer to the Gaudi 3 In-Band Telemetry for Hypervisor document located in the Intel Gaudi vault and Intel RDC.
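For post-processing, the console lines shown in the output examples can be turned into structured records. The sketch below is an illustration only: the names Sample and parse_telemetry are hypothetical, and the line layout is inferred from the examples above rather than from a format specification, so it may need adjusting for other tool versions. It treats a line of hex bytes that follows an empty "Data:" value as the data of that sample, as seen with the security hash fields.

```python
import re
from typing import Iterable, Iterator, NamedTuple


class Sample(NamedTuple):
    timestamp_ms: int
    packet: str
    field: str
    data: str  # raw text; hash fields hold space-separated hex bytes


# Layout inferred from the output examples in this section (an assumption).
LINE_RE = re.compile(
    r"Timestamp:\s*(\d+)\s*ms,\s*Packet:\s*(.+?),\s*Field:\s*([^,\s]+),\s*Data:\s*(.*)$"
)
HEX_RE = re.compile(r"^[0-9a-fA-F]{2}(?:\s+[0-9a-fA-F]{2})*$")


def parse_telemetry(lines: Iterable[str]) -> Iterator[Sample]:
    """Yield one Sample per telemetry line; other lines are ignored."""
    pending = None  # sample whose data is expected on a following line
    for raw in lines:
        line = raw.strip()
        match = LINE_RE.match(line)
        if match:
            if pending is not None:
                yield pending  # no continuation line arrived; data stays empty
                pending = None
            ts, packet, field, data = match.groups()
            sample = Sample(int(ts), packet, field, data.strip())
            if sample.data:
                yield sample
            else:
                pending = sample
        elif pending is not None and HEX_RE.match(line):
            yield pending._replace(data=line)
            pending = None
    if pending is not None:
        yield pending


# Lines copied from the preboot output example above.
SAMPLE_OUTPUT = """\
Reading all synced telemetry packets
Timestamp: 10943 ms, Packet: Temperature, Field: temperature.aip, Data: 38
Timestamp: 3859 ms, Packet: Security, Field: security.hash_spi_code, Data:
4a ac 10 c9 8c 97 76 1f b0 b6 7d 04 7f 67 f9 e0 42 8c cc f3 b4 7a 22 05 74 07 23 f1 ee 14 31 43 cd 99 55 e3 05 9a 3e 3e 69 8c 0d 4d 1f 1b 0b f1
Timestamp: 3858 ms, Packet: Mem_utility, Field: utilization.memory, Data: 0
"""

if __name__ == "__main__":
    for sample in parse_telemetry(SAMPLE_OUTPUT.splitlines()):
        print(sample.timestamp_ms, sample.packet, sample.field, sample.data[:20])
```

Keeping the data as raw text avoids guessing the type or validity of each field, which depends on the packet definition and on whether the driver is up.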