2. Habana Labs Management Library (HLML) API Reference

2.1. C API

The Habana Labs Management Library (HLML) is a C-based programmatic interface for monitoring and managing various states within Habana Labs AI accelerators. HLML can be used as a platform for building third-party applications and is also the underlying library used by the hl-smi tool (see System Management Interface Tool User Guide).

All APIs, structures and enums used are defined in the public header file. APIs that provide information about a specific device require a handle as a parameter. This handle can be retrieved using the device’s index or PCI address.

2.1.1. Return Value

The return value of all the APIs is the following enum:

typedef enum hlml_return {

    HLML_SUCCESS = 0,

    HLML_ERROR_UNINITIALIZED = 1,

    HLML_ERROR_INVALID_ARGUMENT = 2,

    HLML_ERROR_NOT_SUPPORTED = 3,

    HLML_ERROR_ALREADY_INITIALIZED = 5,

    HLML_ERROR_NOT_FOUND = 6,

    HLML_ERROR_INSUFFICIENT_SIZE = 7,

    HLML_ERROR_DRIVER_NOT_LOADED = 9,

    HLML_ERROR_TIMEOUT = 10,

    HLML_ERROR_AIP_IS_LOST = 15,

    HLML_ERROR_MEMORY = 20,

    HLML_ERROR_NO_DATA = 21,

    HLML_ERROR_UNKNOWN = 49,

} hlml_return_t;
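When logging failures, it can help to turn a return code into its symbolic name. The following is a minimal sketch of one way to do that. hlml_return_name is a hypothetical helper, not part of HLML, and the enum is repeated here only so that the sketch compiles on its own; real code would take it from the public header.

```c
/* Repeated from the reference above so the sketch compiles on its own;
 * real code would take this enum from the public header instead. */
typedef enum hlml_return {
    HLML_SUCCESS = 0,
    HLML_ERROR_UNINITIALIZED = 1,
    HLML_ERROR_INVALID_ARGUMENT = 2,
    HLML_ERROR_NOT_SUPPORTED = 3,
    HLML_ERROR_ALREADY_INITIALIZED = 5,
    HLML_ERROR_NOT_FOUND = 6,
    HLML_ERROR_INSUFFICIENT_SIZE = 7,
    HLML_ERROR_DRIVER_NOT_LOADED = 9,
    HLML_ERROR_TIMEOUT = 10,
    HLML_ERROR_AIP_IS_LOST = 15,
    HLML_ERROR_MEMORY = 20,
    HLML_ERROR_NO_DATA = 21,
    HLML_ERROR_UNKNOWN = 49,
} hlml_return_t;

/* Hypothetical helper (not part of HLML): symbolic name of a return code. */
static const char *hlml_return_name(hlml_return_t rc)
{
    switch (rc) {
    case HLML_SUCCESS:                   return "HLML_SUCCESS";
    case HLML_ERROR_UNINITIALIZED:       return "HLML_ERROR_UNINITIALIZED";
    case HLML_ERROR_INVALID_ARGUMENT:    return "HLML_ERROR_INVALID_ARGUMENT";
    case HLML_ERROR_NOT_SUPPORTED:       return "HLML_ERROR_NOT_SUPPORTED";
    case HLML_ERROR_ALREADY_INITIALIZED: return "HLML_ERROR_ALREADY_INITIALIZED";
    case HLML_ERROR_NOT_FOUND:           return "HLML_ERROR_NOT_FOUND";
    case HLML_ERROR_INSUFFICIENT_SIZE:   return "HLML_ERROR_INSUFFICIENT_SIZE";
    case HLML_ERROR_DRIVER_NOT_LOADED:   return "HLML_ERROR_DRIVER_NOT_LOADED";
    case HLML_ERROR_TIMEOUT:             return "HLML_ERROR_TIMEOUT";
    case HLML_ERROR_AIP_IS_LOST:         return "HLML_ERROR_AIP_IS_LOST";
    case HLML_ERROR_MEMORY:              return "HLML_ERROR_MEMORY";
    case HLML_ERROR_NO_DATA:             return "HLML_ERROR_NO_DATA";
    case HLML_ERROR_UNKNOWN:             return "HLML_ERROR_UNKNOWN";
    }
    /* Values outside the documented enum (for example 4) end up here. */
    return "(unrecognized hlml_return_t value)";
}
```

Omitting a default label lets most compilers warn if a new enumerator is added to the enum without a matching case.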

2.2. Common APIs

2.2.1. hlml_return_t hlml_init( void )

Operation:

This API must be called first, before any other API is called.

Return Value:

  • HLML_SUCCESS if initialization succeeded.

  • HLML_ERROR_UNKNOWN on any unexpected error.

  • HLML_ERROR_ALREADY_INITIALIZED if function was already called.

2.2.2. hlml_return_t hlml_init_with_flags( unsigned int mode )

Operation:

This API must be called first, before any other API is called.

Parameters:

Parameter

Description

mode

[in] Initialization mode – should be set to 0.

Return Value:

  • HLML_SUCCESS if initialization succeeded.

  • HLML_ERROR_UNKNOWN on any unexpected error.

  • HLML_ERROR_ALREADY_INITIALIZED if function was already called.

2.2.3. hlml_return_t hlml_shutdown( void )

Operation:

This API should always be called last, after completing all calls to other APIs. The API properly cleans up allocated resources.

Return Value:

  • HLML_SUCCESS if shutdown succeeded.

2.2.4. hlml_return_t hlml_device_get_count ( unsigned int* device_count )

Operation:

Retrieves the number of AIP devices in the system.

Parameters:

Parameter

Description

device_count

[out] Reference in which to return the number of accessible AIPs.

Return Value:

  • HLML_SUCCESS if device_count has been set.

  • HLML_ERROR_UNINITIALIZED if the library has not been successfully initialized.

  • HLML_ERROR_INVALID_ARGUMENT if device_count is NULL.

  • HLML_ERROR_UNKNOWN on any unexpected error.
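The common APIs are typically used together: initialize, query, shut down. The sketch below illustrates that call order only. To keep it buildable without the library, it defines small stand-in versions of hlml_init, hlml_device_get_count and hlml_shutdown that mimic the documented return values; these stand-ins, the device count of 8, and the query_device_count wrapper are all assumptions made for illustration. Real code would include the public header and link against HLML instead.

```c
#include <stddef.h>

/* --- Stand-ins so the sketch builds without the library. ----------------
 * Real code would include the public header and link against HLML. The
 * bodies below only mimic the documented return values. */
typedef enum hlml_return {
    HLML_SUCCESS = 0,
    HLML_ERROR_UNINITIALIZED = 1,
    HLML_ERROR_INVALID_ARGUMENT = 2,
    HLML_ERROR_ALREADY_INITIALIZED = 5,
    HLML_ERROR_UNKNOWN = 49
} hlml_return_t;

static int stub_initialized = 0;

static hlml_return_t hlml_init(void)
{
    if (stub_initialized)
        return HLML_ERROR_ALREADY_INITIALIZED;
    stub_initialized = 1;
    return HLML_SUCCESS;
}

static hlml_return_t hlml_shutdown(void)
{
    stub_initialized = 0;
    return HLML_SUCCESS;
}

static hlml_return_t hlml_device_get_count(unsigned int *device_count)
{
    if (!stub_initialized)
        return HLML_ERROR_UNINITIALIZED;
    if (device_count == NULL)
        return HLML_ERROR_INVALID_ARGUMENT;
    *device_count = 8; /* pretend the system has eight AIPs */
    return HLML_SUCCESS;
}
/* --- End of stand-ins. --------------------------------------------------- */

/* Hypothetical wrapper: initialize, query the device count, shut down. */
static hlml_return_t query_device_count(unsigned int *count)
{
    hlml_return_t rc = hlml_init();
    if (rc != HLML_SUCCESS)
        return rc;

    rc = hlml_device_get_count(count);

    /* Pair every successful hlml_init() with hlml_shutdown(). */
    hlml_shutdown();
    return rc;
}
```

The wrapper shuts the library down even when the query fails, so a failed call does not leave it initialized.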

2.3. Per device APIs

2.3.1. hlml_return_t hlml_device_get_handle_by_pci_bus_id ( const char *pci_addr, hlml_device_t* device )

Operation:

Acquires the handle for a device, based on its PCI address.

Parameters:

Parameter

Description

pci_addr

[in] The bus ID of the target AIP (The tuple domain:bus:device.function).

device

[out] Reference in which to return the device handle.

Return Value:

  • HLML_SUCCESS if device has been found.

  • HLML_ERROR_UNINITIALIZED if the library has not been successfully initialized.

  • HLML_ERROR_INVALID_ARGUMENT if pci_addr is invalid or device is NULL.

  • HLML_ERROR_NOT_FOUND if the PCI address does not exist.

  • HLML_ERROR_UNKNOWN on any unexpected error.
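The pci_addr string follows the domain:bus:device.function tuple. The sketch below builds such a string from its numeric parts. It assumes the conventional Linux zero-padded hexadecimal form (for example 0000:19:00.0); the reference above states only the tuple order, so the padding widths are an assumption, and format_pci_addr is a hypothetical helper that is not part of HLML.

```c
#include <stdio.h>

/* Hypothetical helper (not part of HLML): writes "dddd:bb:dd.f" into buf.
 * Returns 0 on success, -1 if buf is NULL or too small. */
static int format_pci_addr(char *buf, size_t len,
                           unsigned int domain, unsigned int bus,
                           unsigned int device, unsigned int function)
{
    int n;

    if (buf == NULL || len == 0)
        return -1;

    /* Assumed layout: 4 hex digits, 2 hex digits, 2 hex digits, 1 hex digit. */
    n = snprintf(buf, len, "%04x:%02x:%02x.%x", domain, bus, device, function);

    /* snprintf returns the length it wanted to write; reject truncation. */
    return (n > 0 && (size_t)n < len) ? 0 : -1;
}
```

The resulting buffer could then be passed as pci_addr to hlml_device_get_handle_by_pci_bus_id.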

2.3.2. hlml_return_t hlml_device_get_handle_by_index ( unsigned int index, hlml_device_t* device )

Operation:

Acquires the handle for a device, based on its index.

Parameters:

Parameter

Description

index

[in] Index {x} of an existing /dev/hl{x} entry in the file system, that is, the index of a device that was successfully initialized by the driver.

device

[out] Reference in which to return the device handle.

Return Value:

  • HLML_SUCCESS if device has been found.

  • HLML_ERROR_UNINITIALIZED if the library has not been successfully initialized.

  • HLML_ERROR_INVALID_ARGUMENT if index is invalid or device is NULL.

  • HLML_ERROR_UNKNOWN on any unexpected error.

2.3.3. hlml_return_t hlml_device_get_handle_by_UUID ( const char *uuid, hlml_device_t* device )

Operation:

Acquires the handle for a device, based on its UUID.

Parameters:

Parameter

Description

uuid

[in] The UUID of the target AIP.

device

[out] Reference in which to return the device handle.

Return Value:

  • HLML_SUCCESS if device has been found.

  • HLML_ERROR_UNINITIALIZED if the library has not been successfully initialized.

  • HLML_ERROR_INVALID_ARGUMENT if uuid is NULL.

  • HLML_ERROR_UNKNOWN on any unexpected error.

2.3.4. hlml_return_t hlml_device_get_name ( hlml_device_t device, char* name, unsigned int  length)

Operation:

Retrieves the name of this AIP.

Parameters:

Parameter

Description

device

[in] The identifier of the target AIP.

name

[out] Reference in which to return the product name.

length

[in] The maximum allowed length of the string returned in name.

Return Value:

  • HLML_SUCCESS if name has been set.

  • HLML_ERROR_UNINITIALIZED if the library has not been successfully initialized.

  • HLML_ERROR_INVALID_ARGUMENT if device is invalid, or name is NULL.

  • HLML_ERROR_INSUFFICIENT_SIZE if length is too small.

  • HLML_ERROR_UNKNOWN on any unexpected error.

2.3.5. hlml_return_t hlml_device_get_pci_info ( hlml_device_t device, hlml_pci_info_t *pci )

Operation:

Retrieves the PCI attributes of this AIP.

Parameters:

Parameter

Description

device

[in] The identifier of the target AIP.

pci

[out] Reference in which to return the PCI info

#define PCI_LINK_INFO_LEN 10

typedef struct hlml_pci_cap {

    char link_speed[PCI_LINK_INFO_LEN]; // current pci link speed

    char link_width[PCI_LINK_INFO_LEN]; // current pci link width

} hlml_pci_cap_t;

typedef struct hlml_pci_info {

    unsigned int bus; // The bus on which the device resides, 0 to 0xff

    char bus_id[PCI_ADDR_LEN]; // The tuple domain:bus:device.function

    unsigned int device; // The device's id on the bus, 0 to 31.

    unsigned int domain; // The PCI domain on which the device's bus resides

    unsigned int pci_device_id; // The combined 16b deviceId and 16b vendor id

    hlml_pci_cap_t caps; // The device capabilities

} hlml_pci_info_t;

Return Value:

  • HLML_SUCCESS if pci has been populated.

  • HLML_ERROR_UNINITIALIZED if the library has not been successfully initialized.

  • HLML_ERROR_INVALID_ARGUMENT if device is invalid or pci is NULL.

  • HLML_ERROR_AIP_IS_LOST if PCI data is missing.

  • HLML_ERROR_UNKNOWN on any unexpected error.

2.3.6. hlml_return_t hlml_device_get_clock_info ( hlml_device_t device, hlml_clock_type_t type, unsigned int *clock )

Operation:

Retrieves the current clock speeds of the device.

Parameters:

Parameter

Description

device

[in] The identifier of the target AIP.

type

[in] Identify which clock to query.

clock

[out] Reference in which to return the clock speed in MHz.

typedef enum hlml_clock_type {

    HLML_CLOCK_SOC = 0,

    HLML_CLOCK_IC = 1,

    HLML_CLOCK_MME = 2,

    HLML_CLOCK_TPC = 3,

    HLML_CLOCK_COUNT

} hlml_clock_type_t;

Note

HLML_CLOCK_SOC is supported only for Habana(R) Gaudi(R). HLML_CLOCK_IC, HLML_CLOCK_MME and HLML_CLOCK_TPC are supported only for Goya.

Return Value:

  • HLML_SUCCESS if clock has been set.

  • HLML_ERROR_UNINITIALIZED if the library has not been successfully initialized.

  • HLML_ERROR_INVALID_ARGUMENT if device or type is invalid.

  • HLML_ERROR_NOT_SUPPORTED if clock is NULL.

  • HLML_ERROR_AIP_IS_LOST if the target AIP has fallen off the bus or is otherwise inaccessible.

  • HLML_ERROR_UNKNOWN on any unexpected error.

2.3.7. hlml_return_t hlml_device_get_max_clock_info ( hlml_device_t device, hlml_clock_type_t type, unsigned int *clock )

Operation:

Retrieves the maximum clock speeds of the device.

Parameters:

Parameter

Description

device

[in] The identifier of the target AIP.

type

[in] Identify which clock to query.

clock

[out] Reference in which to return the max clock speed in MHz.

typedef enum hlml_clock_type {

    HLML_CLOCK_SOC = 0,

    HLML_CLOCK_IC = 1,

    HLML_CLOCK_MME = 2,

    HLML_CLOCK_TPC = 3,

    HLML_CLOCK_COUNT

} hlml_clock_type_t;

Return Value:

  • HLML_SUCCESS if clock has been set.

  • HLML_ERROR_UNINITIALIZED if the library has not been successfully initialized.

  • HLML_ERROR_INVALID_ARGUMENT if device or type is invalid or clock is NULL.

  • HLML_ERROR_UNKNOWN on any unexpected error.

2.3.8. hlml_return_t hlml_device_get_utilization_rates ( hlml_device_t device, hlml_utilization_t *utilization )

Operation:

Returns the utilization, that is, the percentage of time over the past second during which one or more kernels were running on the AIP.

Parameters:

Parameter

Description

device

[in] The identifier of the target AIP.

utilization

[out] Reference in which to return the utilization information.

typedef struct hlml_utilization {

    unsigned int aip;

} hlml_utilization_t;

Return Value:

  • HLML_SUCCESS if utilization has been set.

  • HLML_ERROR_UNINITIALIZED if the library has not been successfully initialized.

  • HLML_ERROR_INVALID_ARGUMENT if device is invalid or utilization is NULL.

  • HLML_ERROR_UNKNOWN on any unexpected error.

  • HLML_ERROR_NOT_SUPPORTED if the device does not support this feature.

2.3.9. hlml_return_t hlml_device_get_memory_info ( hlml_device_t device, hlml_memory_t *memory )

Operation:

Retrieves the total, used and free memory.

Parameters:

Parameter

Description

device

[in] The identifier of the target AIP.

memory

[out] Reference in which to return the memory information.

typedef struct hlml_memory {

    unsigned long long free; // Free memory (in bytes)

    unsigned long long total; // Total installed memory (in bytes)

    unsigned long long used; // Used memory (in bytes)

} hlml_memory_t;

Return Value:

  • HLML_SUCCESS if memory has been populated.

  • HLML_ERROR_UNINITIALIZED if the library has not been successfully initialized.

  • HLML_ERROR_INVALID_ARGUMENT if device is invalid or memory is NULL.

  • HLML_ERROR_UNKNOWN on any unexpected error.
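Monitoring tools often report memory as a percentage rather than in bytes. The following is a minimal sketch of that calculation on a populated hlml_memory_t. memory_used_percent is a hypothetical helper, not part of HLML, and the struct is repeated here only so that the sketch compiles on its own.

```c
#include <stddef.h>

/* Repeated from the reference above so the sketch compiles on its own;
 * real code would take this struct from the public header instead. */
typedef struct hlml_memory {
    unsigned long long free;  /* Free memory (in bytes) */
    unsigned long long total; /* Total installed memory (in bytes) */
    unsigned long long used;  /* Used memory (in bytes) */
} hlml_memory_t;

/* Hypothetical helper (not part of HLML): used memory as a percentage of
 * total. Returns 0.0 for a NULL pointer or a zero total. */
static double memory_used_percent(const hlml_memory_t *memory)
{
    if (memory == NULL || memory->total == 0)
        return 0.0;
    return 100.0 * (double)memory->used / (double)memory->total;
}
```

Guarding against a zero total avoids a division by zero if the struct was never populated.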

2.3.10. hlml_return_t hlml_device_get_temperature ( hlml_device_t device, hlml_temperature_sensors_t sensor_type, unsigned int *temp )

Operation:

Retrieves the current temperature, in degrees C, for the sensor selected by sensor_type.

Parameters:

Parameter

Description

device

[in] The identifier of the target AIP.

sensor_type

[in] Flag that indicates if the sensor is on the AIP or on the board.

temp

[out] Reference in which to return the temperature reading.

typedef enum hlml_temperature_sensors {

    HLML_TEMPERATURE_ON_AIP = 0,

    HLML_TEMPERATURE_ON_BOARD = 1

} hlml_temperature_sensors_t;

Return Value:

  • HLML_SUCCESS if temp has been set.

  • HLML_ERROR_UNINITIALIZED if the library has not been successfully initialized.

  • HLML_ERROR_INVALID_ARGUMENT if device is invalid, sensor_type is invalid or temp is NULL.

  • HLML_ERROR_UNKNOWN on any unexpected error.

2.3.11. hlml_return_t hlml_device_get_temperature_threshold ( hlml_device_t device, hlml_temperature_thresholds_t threshold_type, unsigned int *temp )

Operation:

Retrieves the known temperature threshold for the AIP with the specified threshold type in degrees C. Currently, this is a hard-coded value for all the types.

Parameters:

Parameter

Description

device

[in] The identifier of the target AIP.

threshold_type

[in] The type of threshold value queried.

temp

[out] Reference in which to return the temperature reading.

typedef enum hlml_temperature_thresholds {

    HLML_TEMPERATURE_THRESHOLD_SHUTDOWN = 0,

    HLML_TEMPERATURE_THRESHOLD_SLOWDOWN = 1,

    HLML_TEMPERATURE_THRESHOLD_MEM_MAX = 2,

    HLML_TEMPERATURE_THRESHOLD_GPU_MAX = 3,

    HLML_TEMPERATURE_THRESHOLD_COUNT

} hlml_temperature_thresholds_t;

Return Value:

  • HLML_SUCCESS if temp has been set.

  • HLML_ERROR_UNINITIALIZED if the library has not been successfully initialized.

  • HLML_ERROR_INVALID_ARGUMENT if device is invalid, threshold_type is invalid or temp is NULL.

2.3.12. hlml_return_t hlml_device_get_persistence_mode ( hlml_device_t device, hlml_enable_state_t *mode )

Operation:

API is not supported.

Parameters:

Parameter

Description

device

[in] The identifier of the target AIP.

mode

[out] Reference in which to return the current driver persistence mode.

typedef enum hlml_enable_state {

    HLML_FEATURE_DISABLED = 0,

    HLML_FEATURE_ENABLED = 1

} hlml_enable_state_t;

Return Value:

  • HLML_ERROR_UNINITIALIZED if the library has not been successfully initialized

  • HLML_ERROR_INVALID_ARGUMENT if device is invalid or mode is NULL

  • HLML_ERROR_NOT_SUPPORTED if the device does not support this feature

2.3.13. hlml_return_t hlml_device_get_performance_state ( hlml_device_t device, hlml_p_states_t *p_state )

Operation:

API is not supported.

Parameters:

Parameter

Description

device

[in] The identifier of the target AIP.

p_state

[out] Reference in which to return the performance state reading

typedef enum hlml_p_states {

    HLML_PSTATE_0 = 0,

    HLML_PSTATE_UNKNOWN = 32

} hlml_p_states_t;

Return Value:

  • HLML_ERROR_UNINITIALIZED if the library has not been successfully initialized.

  • HLML_ERROR_INVALID_ARGUMENT if device is invalid or p_state is NULL.

  • HLML_ERROR_NOT_SUPPORTED if the device does not support this feature.

2.3.14. hlml_return_t hlml_device_get_power_usage ( hlml_device_t device, unsigned int *power )

Operation:

Retrieves the power usage, in milliwatts, for this AIP and its associated circuitry (e.g. memory).

Parameters:

Parameter

Description

device

[in] The identifier of the target AIP.

power

[out] Reference in which to return the power usage information.

Return Value:

  • HLML_SUCCESS if power has been populated.

  • HLML_ERROR_UNINITIALIZED if the library has not been successfully initialized.

  • HLML_ERROR_INVALID_ARGUMENT if device is invalid or power is NULL.

  • HLML_ERROR_UNKNOWN on any unexpected error.

2.3.15. hlml_return_t hlml_device_get_power_management_default_limit ( hlml_device_t device, unsigned int *default_limit )

Operation:

Retrieves the default power management limit on this device, in milliwatts. The default power management limit is the limit that the device boots with.

Parameters:

Parameter

Description

device

[in] The identifier of the target AIP.

default_limit

[out] Reference in which to return the default power management limit in milliwatts

Return Value:

  • HLML_SUCCESS if default_limit has been set.

  • HLML_ERROR_UNINITIALIZED if the library has not been successfully initialized.

  • HLML_ERROR_INVALID_ARGUMENT if device is invalid or default_limit is NULL.

  • HLML_ERROR_UNKNOWN on any unexpected error.

2.3.16. hlml_return_t hlml_device_get_ecc_mode ( hlml_device_t device, hlml_enable_state_t* current, hlml_enable_state_t* pending )

Operation:

Retrieves the current and pending ECC modes of the device.

Parameters:

Parameter

Description

device

[in] The identifier of the target AIP.

current

[out] Reference in which to return the current ECC mode.

pending

[out] Reference in which to return the pending ECC mode.

typedef enum hlml_enable_state {

    HLML_FEATURE_DISABLED = 0,

    HLML_FEATURE_ENABLED = 1

} hlml_enable_state_t;

Return Value:

  • HLML_SUCCESS if current and pending have been set.

  • HLML_ERROR_UNINITIALIZED if the library has not been successfully initialized.

  • HLML_ERROR_INVALID_ARGUMENT if device is invalid or current or pending are NULL.

2.3.17. hlml_return_t hlml_device_get_total_ecc_errors ( hlml_device_t device, hlml_memory_error_type_t error_type, hlml_ecc_counter_type_t counter_type, unsigned long long* ecc_counts )

Operation:

Returns the number of ECC errors for a specific device, since the last device reset, or since the driver was installed. Only the number of uncorrected errors is supported.

Parameters:

Parameter

Description

device

[in] The identifier of the target AIP.

error_type

[in] Flag that specifies the type of the errors.

counter_type

[in] Flag that specifies the counter type of the errors.

ecc_counts

[out] Reference in which to return the specified ECC errors.

typedef enum hlml_memory_error_type {

    HLML_MEMORY_ERROR_TYPE_CORRECTED = 0, // Not supported

    HLML_MEMORY_ERROR_TYPE_UNCORRECTED = 1,

    HLML_MEMORY_ERROR_TYPE_COUNT

} hlml_memory_error_type_t;

typedef enum hlml_ecc_counter_type {

    HLML_VOLATILE_ECC = 0, // Since last device reset

    HLML_AGGREGATE_ECC = 1, // Since driver is up

    HLML_ECC_COUNTER_TYPE_COUNT

} hlml_ecc_counter_type_t;

Return Value:

  • HLML_SUCCESS if ecc_counts was set.

  • HLML_ERROR_UNINITIALIZED if the library has not been successfully initialized.

  • HLML_ERROR_INVALID_ARGUMENT if device, error_type or counter_type is invalid, or ecc_counts is NULL.

  • HLML_ERROR_NOT_SUPPORTED if the device does not support this feature.

  • HLML_ERROR_UNKNOWN if an error occurred during ECC error retrieval.

2.3.18. hlml_return_t hlml_device_get_memory_error_counter( hlml_device_t device, hlml_memory_error_type_t error_type, hlml_ecc_counter_type_t counter_type, hlml_memory_location_type_t location, unsigned long long *ecc_counts)

Operation:

Returns the number of ECC errors for a specific device and location, since the last device reset, or since the driver was installed.

Parameters:

Parameter

Description

device

[in] The identifier of the target AIP.

error_type

[in] Flag that specifies the type of errors.

counter_type

[in] Flag that specifies the counter type of the errors.

location

[in] Flag that specifies the location of the errors

ecc_counts

[out] Reference in which to return the specified ECC errors.

typedef enum hlml_memory_error_type {

    HLML_MEMORY_ERROR_TYPE_CORRECTED = 0, // Not supported

    HLML_MEMORY_ERROR_TYPE_UNCORRECTED = 1,

    HLML_MEMORY_ERROR_TYPE_COUNT

} hlml_memory_error_type_t;

typedef enum hlml_ecc_counter_type {

    HLML_VOLATILE_ECC = 0, // Since last device reset

    HLML_AGGREGATE_ECC = 1, // Since driver is up

    HLML_ECC_COUNTER_TYPE_COUNT

} hlml_ecc_counter_type_t;

typedef enum hlml_memory_location_type {

    HLML_MEMORY_LOCATION_SRAM = 0,

    HLML_MEMORY_LOCATION_DRAM = 1,

    HLML_MEMORY_LOCATION_COUNT
} hlml_memory_location_type_t;

Return Value:

  • HLML_SUCCESS if ecc_counts was set.

  • HLML_ERROR_UNINITIALIZED if the library has not been successfully initialized.

  • HLML_ERROR_INVALID_ARGUMENT if device, error_type or counter_type is invalid, or ecc_counts is NULL.

  • HLML_ERROR_NOT_SUPPORTED if the device does not support this feature.

  • HLML_ERROR_UNKNOWN if an error occurred during ECC error retrieval.

2.3.19. hlml_return_t hlml_device_get_uuid(hlml_device_t device, char *uuid, unsigned int length)

Operation:

Returns the UUID of the device as a string.

Parameters:

Parameter

Description

device

[in] The identifier of the target AIP.

uuid

[out] The UUID.

length

[in] The maximum size of the string allocated by the user.

Return Value:

  • HLML_SUCCESS if uuid has been set.

  • HLML_ERROR_UNINITIALIZED if the library has not been successfully initialized.

  • HLML_ERROR_INVALID_ARGUMENT if device is invalid or uuid is NULL.

  • HLML_ERROR_INSUFFICIENT_SIZE if the UUID string length is longer than the size allocated by the user.

2.3.20. hlml_return_t hlml_device_get_minor_number(hlml_device_t device, unsigned int *minor_number)

Operation: Retrieves the minor number of the device. The minor number of the device is such that the Habanalabs device node file for each device will have the following form: /sys/class/habanalabs/hl[minor number].

Parameters:

Parameter

Description

device

[in] The identifier of the target AIP.

minor_number

[out] Reference in which to return the minor number for the device.

Return Value:

  • HLML_SUCCESS if minor_number has been set.

  • HLML_ERROR_UNINITIALIZED if the library has not been successfully initialized.

  • HLML_ERROR_INVALID_ARGUMENT if device is invalid or minor_number is NULL.

  • HLML_ERROR_NO_DATA if unable to retrieve the minor number for any reason.

2.3.21. hlml_return_t hlml_event_set_create(hlml_event_set_t *set)

Operation:

Creates an empty set of events. Event set should be freed by hlml_event_set_free.

Parameters:

Parameter

Description

set

[out] Reference in which to return the event handle.

typedef void* hlml_event_set_t;

Return Value:

  • HLML_SUCCESS if the event set has been created.

  • HLML_ERROR_UNINITIALIZED if the library has not been successfully initialized.

  • HLML_ERROR_INVALID_ARGUMENT if set is NULL.

  • HLML_ERROR_MEMORY if failed to allocate a set.

2.3.22. hlml_return_t hlml_event_set_free(hlml_event_set_t set)

Operation:

Releases a set of events.

Parameters:

Parameter

Description

set

[in] Reference to events to be released

Return Value:

  • HLML_SUCCESS if the event set has been released.

  • HLML_ERROR_UNINITIALIZED if the library has not been successfully initialized.

  • HLML_ERROR_INVALID_ARGUMENT if set is invalid.

2.3.23. hlml_return_t hlml_device_register_events(hlml_device_t device, unsigned long long event_types, hlml_event_set_t set)

Operation:

Starts recording events on the specified device and adds them to the specified hlml_event_set_t. Events that occurred before this call are not recorded.

Supported events:

  • ECC single/double bit errors – BIT(0)

  • Critical errors that occurred – BIT(1)

  • Clock rate changes – BIT(2)

Parameters:

Parameter

Description

device

[in] The identifier of the target AIP.

event_types

[in] Bitmask of event types to record.

set

[in] Set to which the new event types are added.

Return Value:

  • HLML_SUCCESS if the event types were registered.

  • HLML_ERROR_UNINITIALIZED if the library has not been successfully initialized.

  • HLML_ERROR_INVALID_ARGUMENT if device or set is invalid or event_types is 0.

  • HLML_ERROR_UNKNOWN if the library failed to retrieve information regarding events.
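The event_types argument is a bitmask built from the bit positions listed under Supported events. The sketch below shows one way to compose such a mask and to test a delivered event_type against it. The EXAMPLE_* macro names and both helper functions are hypothetical; the public header may define its own constants for these bits, which real code should prefer.

```c
/* Hypothetical names for the three documented event bits; the public
 * header may define its own constants, which real code should prefer. */
#define EXAMPLE_BIT(n)            (1ULL << (n))
#define EXAMPLE_EVENT_ECC_ERROR   EXAMPLE_BIT(0) /* ECC single/double bit errors */
#define EXAMPLE_EVENT_CRITICAL    EXAMPLE_BIT(1) /* Critical errors */
#define EXAMPLE_EVENT_CLOCK_RATE  EXAMPLE_BIT(2) /* Clock rate changes */

/* Hypothetical helper: build an event_types mask from three on/off flags. */
static unsigned long long build_event_mask(int ecc, int critical, int clock_rate)
{
    unsigned long long mask = 0;

    if (ecc)
        mask |= EXAMPLE_EVENT_ECC_ERROR;
    if (critical)
        mask |= EXAMPLE_EVENT_CRITICAL;
    if (clock_rate)
        mask |= EXAMPLE_EVENT_CLOCK_RATE;
    return mask;
}

/* Hypothetical helper: non-zero if the given bit is set in event_type. */
static int event_occurred(unsigned long long event_type, unsigned long long bit)
{
    return (event_type & bit) != 0;
}
```

A mask of 0 is rejected by hlml_device_register_events with HLML_ERROR_INVALID_ARGUMENT, so at least one flag should be enabled before registering.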

2.3.24. hlml_return_t hlml_event_set_wait(hlml_event_set_t set, hlml_event_data_t *data, unsigned int timeoutms)

Operation:

Waits for events and delivers them. If events are ready to be delivered at the time of the call, the function returns immediately. Otherwise, the function sleeps until an event arrives, but no longer than the specified timeout.

Parameters:

Parameter

Description

set

[in] Reference to set of events to wait on.

data

[out] Reference in which to return event data.

timeoutms

[in] Maximum amount of wait time in milliseconds for registered event.

typedef struct hlml_event_data {

    hlml_device_t device; /* Specific device where the event occurred. */

    unsigned long long event_type; /* Specific event that occurred */

} hlml_event_data_t;

Return Value:

  • HLML_SUCCESS if data has been set.

  • HLML_ERROR_UNINITIALIZED if the library has not been successfully initialized.

  • HLML_ERROR_INVALID_ARGUMENT if set is invalid or data is NULL.

  • HLML_ERROR_UNKNOWN if the library failed to retrieve information regarding events.

  • HLML_ERROR_TIMEOUT if no event arrived within timeoutms milliseconds.

2.3.25. hlml_return_t hlml_device_err_inject(hlml_device_t device, hlml_err_inject_t err_type)

Operation:

Injects an error to test the responsiveness of Habana devices.

Supported error types:

typedef enum hlml_err_inject {

    HLML_ERR_INJECT_ENDLESS_COMMAND = 0,

    HLML_ERR_INJECT_NON_FATAL_EVENT = 1,

    HLML_ERR_INJECT_FATAL_EVENT = 2,

    HLML_ERR_INJECT_LOSS_OF_HEARTBEAT = 3,

    HLML_ERR_INJECT_THERMAL_EVENT = 4,

    HLML_ERR_INJECT_COUNT

} hlml_err_inject_t;

hlml_err_inject_t is based on the hlthunk library interface.

Parameters:

Parameter

Description

device

[in] The identifier of the target AIP.

err_type

[in] The error type to inject.

Return Value:

  • HLML_SUCCESS if error was injected successfully.

  • HLML_ERROR_INVALID_ARGUMENT if device is invalid.

  • HLML_ERROR_UNKNOWN if the library failed to inject the error.

2.3.26. hlml_return_t hlml_device_get_mac_info(hlml_device_t device, hlml_mac_info_t *mac_info, unsigned int mac_info_size, unsigned int start_mac_id, unsigned int *actual_mac_info_count)

Operation:

Retrieves the MAC addresses of the device.

#define ETHER_ADDR_LEN 6

typedef struct hlml_mac_info {

    unsigned char addr[ETHER_ADDR_LEN];

    int id;

} hlml_mac_info_t;

Parameters

Parameter

Description

device

[in] The identifier of the target AIP.

mac_info

[out] Array of MAC addresses allocated by the user.

mac_info_size

[in] Number of elements in mac_info[].

start_mac_id

[in] MAC id to start from. Number in the range of [1…20].

actual_mac_info_count

[out] The actual number of elements set in the user’s array (mac_info).

Return Value:

  • HLML_SUCCESS if MAC address was retrieved successfully.

  • HLML_ERROR_INVALID_ARGUMENT if device, mac_info or actual_mac_info_count is invalid, or if start_mac_id is less than 1 or greater than 20.

  • HLML_ERROR_NO_DATA if the requested start_mac_id is greater than the MAC count for the device.
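Each hlml_mac_info_t entry carries the address as six raw bytes. The sketch below formats one entry in the usual colon-separated notation. It assumes that addr[0] holds the first (most significant) octet, which the reference above does not state. format_mac is a hypothetical helper, not part of HLML, and the struct is repeated here only so that the sketch compiles on its own.

```c
#include <stdio.h>

/* Repeated from the reference above so the sketch compiles on its own;
 * real code would take these from the public header instead. */
#define ETHER_ADDR_LEN 6

typedef struct hlml_mac_info {
    unsigned char addr[ETHER_ADDR_LEN];
    int id;
} hlml_mac_info_t;

/* Hypothetical helper (not part of HLML): writes "xx:xx:xx:xx:xx:xx" into
 * buf, which needs room for 17 characters plus the terminating NUL.
 * Returns 0 on success, -1 on a NULL argument or a buffer that is too small.
 * Assumes addr[0] is the first octet of the address. */
static int format_mac(const hlml_mac_info_t *info, char *buf, size_t len)
{
    int n;

    if (info == NULL || buf == NULL || len == 0)
        return -1;

    n = snprintf(buf, len, "%02x:%02x:%02x:%02x:%02x:%02x",
                 (unsigned int)info->addr[0], (unsigned int)info->addr[1],
                 (unsigned int)info->addr[2], (unsigned int)info->addr[3],
                 (unsigned int)info->addr[4], (unsigned int)info->addr[5]);

    /* snprintf returns the length it wanted to write; reject truncation. */
    return (n > 0 && (size_t)n < len) ? 0 : -1;
}
```

A caller would apply this to each of the first actual_mac_info_count elements of the mac_info array.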

2.3.27. hlml_return_t hlml_device_get_hl_revision(hlml_device_t device, int *hl_revision)

Operation:

Retrieves the HL revision.

Parameters

Parameter

Description

device

[in] The identifier of the target AIP.

hl_revision

[out] Reference in which to return the hl_revision number.

Return Value:

  • HLML_SUCCESS if hl_revision has been set.

  • HLML_ERROR_UNINITIALIZED if the library has not been successfully initialized

  • HLML_ERROR_INVALID_ARGUMENT if device is invalid, or hl_revision is NULL.

  • HLML_ERROR_NOT_FOUND if failed to retrieve hl_revision.

2.3.28. hlml_return_t hlml_device_get_pcb_info(hlml_device_t device, hlml_pcb_info_t *pcb)

Operation: Retrieves the PCB info.

#define HL_FIELD_MAX_SIZE 32

/*
 * pcb_ver - The device's PCB version
 * pcb_assembly_ver - The device's PCB Assembly version
 */

typedef struct hlml_pcb_info {

    char pcb_ver[HL_FIELD_MAX_SIZE];

    char pcb_assembly_ver[HL_FIELD_MAX_SIZE];

} hlml_pcb_info_t;

Parameters

Parameter

Description

device

[in] The identifier of the target AIP.

pcb

[out] Reference in which to return the pcb info.

Return Value:

  • HLML_SUCCESS if pcb info has been set.

  • HLML_ERROR_UNINITIALIZED if the library has not been successfully initialized

  • HLML_ERROR_INVALID_ARGUMENT if device is invalid, or pcb is NULL.

  • HLML_ERROR_UNKNOWN on any unexpected error.

2.3.29. hlml_return_t hlml_device_get_serial(hlml_device_t device, char *serial, unsigned int length)

Operation: Retrieves the globally unique board serial number associated with this device’s board.

Parameters

Parameter

Description

device

[in] The identifier of the target AIP.

serial

[out] Reference in which to return the board/module serial number.

length

[in] The maximum allowed length of the string returned in serial.

Return Value:

  • HLML_SUCCESS if serial has been set.

  • HLML_ERROR_UNINITIALIZED if the library has not been successfully initialized

  • HLML_ERROR_INVALID_ARGUMENT if device is invalid, or serial is NULL.

  • HLML_ERROR_INSUFFICIENT_SIZE if length is too small.

2.3.30. hlml_return_t hlml_device_get_board_id(hlml_device_t device, unsigned int* board_id)

Operation: Retrieves the device's board ID, a value in the range 0 to 7.

Parameters

Parameter

Description

device

[in] The identifier of the target AIP.

board_id

[out] Reference in which to return the device’s board ID.

Return Value:

  • HLML_SUCCESS if board_id has been set

  • HLML_ERROR_UNINITIALIZED if the library has not been successfully initialized

  • HLML_ERROR_INVALID_ARGUMENT if device is invalid or board_id is NULL.

  • HLML_ERROR_NOT_FOUND if no AIP matching device was found.

2.3.31. hlml_return_t hlml_device_get_pcie_throughput(hlml_device_t device, hlml_pcie_util_counter_t counter, unsigned int *value)

Operation: Retrieves PCIe utilization information. This function queries the PCIe throughput calculated over a 10 ms interval.

typedef enum hlml_pcie_util_counter {

    HLML_PCIE_UTIL_TX_BYTES = 0,

    HLML_PCIE_UTIL_RX_BYTES = 1,

    HLML_PCIE_UTIL_COUNT,

} hlml_pcie_util_counter_t;

Parameters

Parameter

Description

device

[in] The identifier of the target AIP.

counter

[in] The specific counter that should be queried.

value

[out] Reference in which to return throughput in KB/s.

Return Value:

  • HLML_SUCCESS if value has been set.

  • HLML_ERROR_UNINITIALIZED if the library has not been successfully initialized

  • HLML_ERROR_INVALID_ARGUMENT if device or counter is invalid, or value is NULL.

  • HLML_ERROR_UNKNOWN on any unexpected error.

2.3.32. hlml_return_t hlml_device_get_pcie_replay_counter(hlml_device_t device, unsigned int *value)

Operation: Retrieve the PCIe replay counter.

Parameters

Parameter

Description

device

[in] The identifier of the target AIP.

value

[out] Reference in which to return the counter’s value.

Return Value:

  • HLML_SUCCESS if value has been set.

  • HLML_ERROR_UNINITIALIZED if the library has not been successfully initialized

  • HLML_ERROR_INVALID_ARGUMENT if device is invalid, or value is NULL.

  • HLML_ERROR_UNKNOWN on any unexpected error.

2.3.35. hlml_return_t hlml_device_get_current_clocks_throttle_reasons(hlml_device_t device, unsigned long long *clocks_throttle_reasons)

Operation: Retrieves current clocks throttling reasons.

Parameters

Parameter

Description

device

[in] The identifier of the target AIP.

clocks_throttle_reasons

[out] Reference in which to return bitmask of active clocks throttle reasons.

Return Value:

  • HLML_SUCCESS if clocks_throttle_reasons was set.

  • HLML_ERROR_UNINITIALIZED if the library has not been successfully initialized

  • HLML_ERROR_INVALID_ARGUMENT if device is invalid, or clocks_throttle_reasons is NULL.

  • HLML_ERROR_UNKNOWN on any unexpected error.

2.3.36. hlml_return_t hlml_device_get_total_energy_consumption(hlml_device_t device, unsigned long long *energy)

Operation: Retrieves total energy consumption in millijoules (mJ) since the driver was last reloaded.

Parameters

Parameter

Description

device

[in] The identifier of the target AIP.

energy

[out] Reference in which to return energy consumption.

Return Value:

  • HLML_SUCCESS if energy has been set.

  • HLML_ERROR_UNINITIALIZED if the library has not been successfully initialized

  • HLML_ERROR_INVALID_ARGUMENT if device is invalid, or energy is NULL.

  • HLML_ERROR_UNKNOWN on any unexpected error.
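Because the counter is cumulative, two readings taken some time apart give the average power over that interval: millijoules divided by milliseconds yields watts. The following is a minimal sketch of that arithmetic. average_power_watts is a hypothetical helper, not part of HLML, and the caller is assumed to measure the elapsed time itself.

```c
/* Hypothetical helper (not part of HLML): average power in watts between
 * two energy readings given in millijoules, taken elapsed_ms apart.
 * Returns -1.0 if the interval is empty or the counter went backwards
 * (for example because the driver was reloaded between the readings). */
static double average_power_watts(unsigned long long energy_start_mj,
                                  unsigned long long energy_end_mj,
                                  unsigned long long elapsed_ms)
{
    if (elapsed_ms == 0 || energy_end_mj < energy_start_mj)
        return -1.0;

    /* Millijoules per millisecond equals joules per second, i.e. watts. */
    return (double)(energy_end_mj - energy_start_mj) / (double)elapsed_ms;
}
```

For instance, readings of 1000 mJ and 351000 mJ taken 1000 ms apart correspond to an average of 350 W.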

2.3.37. hlml_return_t hlml_get_mac_addr_info(hlml_device_t device, uint64_t *mask, uint64_t *ext_mask)

Operation:

Retrieves the masks for supported ports and external ports.

Parameters

Parameter

Description

device

[in] The identifier of the target AIP.

mask

[out] Array of 2 uint64_t. Reference in which to return a bitmask for supported ports.

ext_mask

[out] Array of 2 uint64_t. Reference in which to return a bitmask for external ports within the supported ports.

Return Value:

  • HLML_SUCCESS if the port masks were retrieved.

  • HLML_ERROR_INVALID_ARGUMENT if device is invalid, or mask is NULL or ext_mask is NULL.

  • HLML_ERROR_UNKNOWN on any unexpected error.
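Both outputs are two-element uint64_t arrays, giving room for 128 port bits. The sketch below tests and counts ports in such a mask. It assumes that port p maps to bit (p % 64) of element (p / 64), which the reference above does not state, so treat the layout as an assumption. Both helpers are hypothetical and not part of HLML.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not part of HLML): non-zero if port is set in mask.
 * Assumes port p lives in bit (p % 64) of mask[p / 64]. */
static int port_in_mask(const uint64_t mask[2], unsigned int port)
{
    if (mask == NULL || port >= 128)
        return 0;
    return (int)((mask[port / 64] >> (port % 64)) & 1u);
}

/* Hypothetical helper (not part of HLML): number of ports set in mask. */
static unsigned int count_ports(const uint64_t mask[2])
{
    unsigned int port;
    unsigned int count = 0;

    for (port = 0; port < 128; port++)
        count += (unsigned int)port_in_mask(mask, port);
    return count;
}
```

Under the same layout assumption, a port is external when it is set in both mask and ext_mask.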

2.3.39. hlml_return_t hlml_nic_get_statistics(hlml_device_t device, hlml_nic_stats_info_t *stats_info)

Operation:

Retrieves the NIC statistics for the requested port.

typedef struct hlml_nic_stats_info {
    uint32_t port;
    char *str_buf;
    uint64_t *val_buf;
    uint32_t *num_of_counters_out;
} hlml_nic_stats_info_t;

Parameters

Parameter

Description

device

[in] The identifier of the target AIP.

stats_info.port

[in] Port for which the statistics are requested.

stats_info.str_buf

[out] Reference in which the names of the counters are stored. Buffer is allocated by the user in the size of num_of_counters_in * 32.

stats_info.val_buf

[out] Reference in which the values of the counters are stored. This buffer is allocated by the user in the size of num_of_counters_in * sizeof(uint64_t).

stats_info.num_of_counters_out

[out] Reference in which the actual number of counters retrieved is stored.

Return value:

  • HLML_SUCCESS if managed to retrieve the NICs statistics.

  • HLML_ERROR_INVALID_ARGUMENT if device is invalid, stats_info.str_buf is NULL, stats_info.val_buf is NULL or stats_info.num_of_counters_out is NULL.

  • HLML_ERROR_UNKNOWN on any unexpected error.
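
All three buffers are owned by the caller, so they have to be allocated before the call and released afterwards. The following is a minimal sketch of that bookkeeping. The helper names and the max_counters parameter are hypothetical, the struct is repeated locally only so the sketch compiles without the HLML header, and reading names at a fixed 32-byte stride is an assumption inferred from the documented str_buf size.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Per-counter name size implied by the documented str_buf size. */
#define NIC_COUNTER_NAME_LEN 32

/* Local copy of the documented struct so the sketch compiles without the
 * HLML header; drop it when including the real header. */
typedef struct hlml_nic_stats_info {
    uint32_t port;
    char *str_buf;
    uint64_t *val_buf;
    uint32_t *num_of_counters_out;
} hlml_nic_stats_info_t;

/* Releases the buffers and resets the pointers. Hypothetical helper. */
void nic_stats_free(hlml_nic_stats_info_t *info)
{
    assert(info != NULL);
    free(info->str_buf);
    free(info->val_buf);
    free(info->num_of_counters_out);
    info->str_buf = NULL;
    info->val_buf = NULL;
    info->num_of_counters_out = NULL;
}

/* Allocates zeroed buffers sized for max_counters counters on `port`.
 * Returns 0 on success, -1 if any allocation fails. Hypothetical helper. */
int nic_stats_alloc(hlml_nic_stats_info_t *info, uint32_t port,
                    uint32_t max_counters)
{
    assert(info != NULL && max_counters > 0);
    info->port = port;
    info->str_buf = calloc(max_counters, NIC_COUNTER_NAME_LEN);
    info->val_buf = calloc(max_counters, sizeof(uint64_t));
    info->num_of_counters_out = calloc(1, sizeof(uint32_t));
    if (!info->str_buf || !info->val_buf || !info->num_of_counters_out) {
        nic_stats_free(info);
        return -1;
    }
    return 0;
}

/* Name of counter i, ASSUMING names sit at a fixed 32-byte stride. */
const char *nic_stats_counter_name(const hlml_nic_stats_info_t *info, uint32_t i)
{
    assert(info != NULL && info->str_buf != NULL);
    return info->str_buf + (size_t)i * NIC_COUNTER_NAME_LEN;
}
```

After a successful hlml_nic_get_statistics call, the caller would iterate from 0 up to *num_of_counters_out, pairing each name with the matching entry in val_buf, and then call the free helper.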

2.3.40. hlml_return_t hlml_device_clear_cpu_affinity(hlml_device_t device);

Operation:

Clear all affinity bindings for the calling process.

Parameters

Parameter

Description

device

[in] The identifier of the target AIP.

Return value:

  • HLML_SUCCESS if the affinity bindings were cleared.

  • HLML_ERROR_INVALID_ARGUMENT if device is invalid.

  • HLML_ERROR_UNKNOWN on any unexpected error.

2.3.41. hlml_return_t hlml_device_get_cpu_affinity(hlml_device_t device, unsigned int cpu_set_size, unsigned long *cpu_set);

Operation:

Retrieves an array of unsigned longs (sized to cpu_set_size) of bitmasks with the ideal CPU affinity for the device. For example, on a 64-bit machine, if processors 0, 1, 64, and 65 are ideal for the device and cpu_set_size == 2, then cpu_set[0] = 0x3 and cpu_set[1] = 0x3. This is equivalent to calling hlml_device_get_cpu_affinity_within_scope with HLML_AFFINITY_SCOPE_NODE.

Parameters

Parameter

Description

device

[in] The identifier of the target AIP.

cpu_set_size

[in] The size of the cpu_set array that is safe to access.

cpu_set

[out] Array reference in which to return a bitmask of CPUs, 64 CPUs per unsigned long on 64-bit machines, 32 on 32-bit machines.

Return value:

  • HLML_SUCCESS if cpu_set was populated.

  • HLML_ERROR_INVALID_ARGUMENT if device is invalid, or cpu_set is NULL.

  • HLML_ERROR_UNKNOWN on any unexpected error.
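
The bitmask layout described above can be decoded with a few lines of C. The sketch below sizes the array for a given CPU count and tests whether a CPU is part of the returned set. Both helper names are hypothetical. The word width is taken from the platform rather than hard-coded, so the same code covers the 64-bit and 32-bit cases.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

/* CPUs per array element: 64 on typical 64-bit Linux, 32 on 32-bit. */
#define BITS_PER_ULONG (sizeof(unsigned long) * CHAR_BIT)

/* Number of unsigned longs needed to hold one bit per CPU. */
unsigned int cpu_set_words(unsigned int cpu_count)
{
    return (unsigned int)((cpu_count + BITS_PER_ULONG - 1) / BITS_PER_ULONG);
}

/* Returns true if `cpu` is set in a cpu_set of cpu_set_size elements. */
bool cpu_in_set(const unsigned long *cpu_set, unsigned int cpu_set_size,
                unsigned int cpu)
{
    assert(cpu_set != NULL);
    size_t word = cpu / BITS_PER_ULONG;
    if (word >= cpu_set_size)
        return false;
    return ((cpu_set[word] >> (cpu % BITS_PER_ULONG)) & 1UL) != 0;
}
```

With the example from the description (cpu_set[0] = 0x3, cpu_set[1] = 0x3 on a 64-bit machine), the second helper reports CPUs 0, 1, 64, and 65 as members and every other CPU as a non-member.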

2.3.42. hlml_return_t hlml_device_get_cpu_affinity_within_scope(hlml_device_t device, unsigned int cpu_set_size, unsigned long *cpu_set, hlml_affinity_scope_t scope);

Operation:

Retrieves an array of unsigned longs (sized to cpu_set_size) of bitmasks with the ideal CPU affinity within node or socket for the device. For example, on a 64-bit machine, if processors 0, 1, 64, and 65 are ideal for the device and cpu_set_size == 2, then cpu_set[0] = 0x3 and cpu_set[1] = 0x3.

#define HLML_AFFINITY_SCOPE_NODE 0   //Scope of NUMA node for affinity queries.
#define HLML_AFFINITY_SCOPE_SOCKET 1 //Scope of processor socket for affinity queries

Parameters

Parameter

Description

device

[in] The identifier of the target AIP.

cpu_set_size

[in] The size of the cpu_set array that is safe to access.

cpu_set

[out] Array reference in which to return a bitmask of CPUs, 64 CPUs per unsigned long on 64-bit machines, 32 on 32-bit machines.

scope

[in] Scope that changes the default behavior.

Return value:

  • HLML_SUCCESS if cpu_set was populated.

  • HLML_ERROR_INVALID_ARGUMENT if device is invalid, or cpu_set is NULL.

  • HLML_ERROR_NOT_SUPPORTED if scope is not supported.

  • HLML_ERROR_UNKNOWN on any unexpected error.

2.3.43. hlml_return_t hlml_device_get_memory_affinity(hlml_device_t device, unsigned int node_set_size, unsigned long *node_set);

Operation:

Retrieves an array of unsigned longs (sized to node_set_size) of bitmasks with the ideal memory affinity within node or socket for the device. For example, if NUMA nodes 0 and 1 are ideal within the socket for the device and node_set_size == 1, then node_set[0] = 0x3.

Parameters

Parameter

Description

device

[in] The identifier of the target AIP

node_set_size

[in] The size of the node_set array that is safe to access

node_set

[out] Array reference in which to return a bitmask of NODEs, 64 NODEs per unsigned long on 64-bit machines, 32 on 32-bit machines

scope

[in] Scope that changes the default behavior.

Return value:

  • HLML_SUCCESS if node_set was populated.

  • HLML_ERROR_INVALID_ARGUMENT if device is invalid, or node_set is NULL.

  • HLML_ERROR_NOT_SUPPORTED if scope is not supported.

  • HLML_ERROR_UNKNOWN on any unexpected error.

2.3.44. hlml_return_t hlml_device_set_cpu_affinity(hlml_device_t device);

Operation:

Sets the ideal affinity for the calling thread and device using the guidelines given in hlml_device_get_cpu_affinity().

Parameters

Parameter

Description

device

[in] The identifier of the target AIP

Return value:

  • HLML_SUCCESS if the affinity of the calling thread was set.

  • HLML_ERROR_INVALID_ARGUMENT if device is invalid.

  • HLML_ERROR_UNKNOWN on any unexpected error.

2.3.45. hlml_return_t hlml_device_get_violation_status(hlml_device_t device, hlml_perf_policy_type_t perf_policy_type, hlml_violation_time_t *viol_time)

Operation:

Gets the duration of time during which the device was throttled (lower than requested clocks) due to power or thermal constraints.

This method is useful for users who want to know whether their AIPs throttled at any point while their applications were running. If a throttling event is still in progress, the duration is measured from the start of the event up to the current point in time.

typedef enum hlml_perf_policy_type {
    HLML_PERF_POLICY_POWER,
    HLML_PERF_POLICY_THERMAL,
} hlml_perf_policy_type_t;

typedef struct hlml_violation_time {
    unsigned long long reference_time; /* microseconds */
    unsigned long long violation_time; /* nanoseconds */
} hlml_violation_time_t;

reference_time is the CPU timestamp, in microseconds, of the start of the event (unique for each event). violation_time is the duration of the event, in nanoseconds.

Parameters

Parameter

Description

device

[in] The identifier of the target AIP

perf_policy_type

[in] Represents performance policy which can trigger AIP throttling

viol_time

[out] Reference to which violation time related information is returned

Return value:

  • HLML_SUCCESS if managed to retrieve the violation info.

  • HLML_ERROR_INVALID_ARGUMENT if device is invalid, or viol_time is NULL.

  • HLML_ERROR_UNKNOWN on any unexpected error.
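
The two fields use different units, which is easy to get wrong when reporting them. The sketch below converts both to seconds. The helper names are hypothetical, and the struct is repeated locally only so the sketch compiles without the HLML header.

```c
#include <assert.h>
#include <stddef.h>

/* Local copy of the documented struct; drop it when including the real header. */
typedef struct hlml_violation_time {
    unsigned long long reference_time; /* microseconds */
    unsigned long long violation_time; /* nanoseconds */
} hlml_violation_time_t;

/* Duration of the throttling event, in seconds. */
double violation_duration_seconds(const hlml_violation_time_t *t)
{
    assert(t != NULL);
    return (double)t->violation_time / 1e9; /* ns -> s */
}

/* Start of the throttling event, in seconds on the same CPU timestamp clock. */
double violation_start_seconds(const hlml_violation_time_t *t)
{
    assert(t != NULL);
    return (double)t->reference_time / 1e6; /* us -> s */
}
```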

2.3.46. hlml_return_t hlml_device_get_replaced_rows(hlml_device_t device, hlml_row_replacement_cause_t cause, unsigned int *row_count, hlml_row_address_t *addresses)

Operation:

Returns the list of replaced rows, including rows that are pending replacement. The address information provided from this API is the full address of the row that was retired (see struct below).

typedef enum hlml_row_replacement_cause {
    HLML_ROW_REPLACEMENT_CAUSE_MULTIPLE_SINGLE_BIT_ECC_ERRORS,
    HLML_ROW_REPLACEMENT_CAUSE_DOUBLE_BIT_ECC_ERROR,
} hlml_row_replacement_cause_t;

Row address info struct:

typedef struct hlml_row_address {
    uint8_t hbm_idx;
    uint8_t pc;
    uint8_t sid;
    uint8_t bank_idx;
    uint16_t row_addr;
} hlml_row_address_t;

Parameters

Parameter

Description

device

[in] The identifier of the target AIP

cause

[in] Filter replaced rows by cause of retirement

row_count

[in, out] On input, the size of the addresses buffer. Set to 0 to query the size without allocating an addresses buffer. On output, the number of replaced rows that match ‘cause’.

addresses

[out] Buffer to write the row addresses into. Should be set to NULL if ‘row_count’ is set to 0.

Return value:

  • HLML_SUCCESS if row_count was updated and addresses were populated.

  • HLML_ERROR_INVALID_ARGUMENT if device is invalid, row_count is NULL or row_count is not 0 while addresses is NULL.

  • HLML_ERROR_INSUFFICIENT_SIZE if row_count indicates that the addresses buffer is not large enough to store all the matching replaced rows. row_count will be set to the required size.

  • HLML_ERROR_UNKNOWN on any unexpected error.
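
The row_count contract above lends itself to a two-call pattern: query the size with row_count set to 0, allocate, then fetch. The sketch below illustrates that flow under stated assumptions. It does not call HLML directly: the callback type, the helper, the RC_ constants, and the fake data source are all hypothetical stand-ins, and the struct is repeated locally only so the sketch compiles without the HLML header. In real code the callback would be a thin adapter around hlml_device_get_replaced_rows that takes the device and cause from ctx.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Local copy of the documented struct; drop it when including the real header. */
typedef struct hlml_row_address {
    uint8_t hbm_idx;
    uint8_t pc;
    uint8_t sid;
    uint8_t bank_idx;
    uint16_t row_addr;
} hlml_row_address_t;

enum { /* numeric values taken from the hlml_return enum */
    RC_SUCCESS = 0,
    RC_INSUFFICIENT_SIZE = 7,
    RC_MEMORY = 20
};

/* Hypothetical callback with the same row_count/addresses contract as
 * hlml_device_get_replaced_rows; ctx would carry the device and cause. */
typedef int (*row_query_fn)(void *ctx, unsigned int *row_count,
                            hlml_row_address_t *addresses);

/* Size query first (row_count = 0, addresses = NULL), then the real fetch.
 * On success the caller owns *rows_out and must free() it. */
int fetch_replaced_rows(row_query_fn query, void *ctx,
                        hlml_row_address_t **rows_out, unsigned int *count_out)
{
    assert(query != NULL && rows_out != NULL && count_out != NULL);
    *rows_out = NULL;
    *count_out = 0;

    unsigned int count = 0;
    int rc = query(ctx, &count, NULL);
    if (rc != RC_SUCCESS && rc != RC_INSUFFICIENT_SIZE)
        return rc;
    if (count == 0)
        return RC_SUCCESS; /* no matching rows */

    hlml_row_address_t *rows = calloc(count, sizeof(*rows));
    if (rows == NULL)
        return RC_MEMORY;

    rc = query(ctx, &count, rows);
    if (rc != RC_SUCCESS) {
        free(rows);
        return rc;
    }
    *rows_out = rows;
    *count_out = count;
    return RC_SUCCESS;
}

/* Fake data source used only to exercise the helper; not part of HLML. */
int fake_row_query(void *ctx, unsigned int *row_count,
                   hlml_row_address_t *addresses)
{
    static const hlml_row_address_t rows[2] = {
        {0, 1, 0, 2, 100}, {1, 0, 1, 3, 200}
    };
    (void)ctx;
    if (*row_count < 2) {
        *row_count = 2; /* report the required size */
        return RC_INSUFFICIENT_SIZE;
    }
    addresses[0] = rows[0];
    addresses[1] = rows[1];
    *row_count = 2;
    return RC_SUCCESS;
}
```

If the number of replaced rows grows between the two calls, the second call reports an insufficient size and the helper returns that code. A caller that needs to handle this case can simply retry.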

2.3.47. hlml_return_t hlml_device_get_replaced_rows_pending_status(hlml_device_t device, hlml_enable_state_t *is_pending)

Operation:

Checks if any rows that are pending replacement require a reboot to be replaced.

Parameters

Parameter

Description

device

[in] The identifier of the target AIP

is_pending

[out] Reference in which to return the pending status

Return value:

  • HLML_SUCCESS if is_pending was updated.

  • HLML_ERROR_INVALID_ARGUMENT if device is invalid or is_pending is NULL.

  • HLML_ERROR_UNKNOWN on any unexpected error.

2.4. Linkage HLML

To link against the HLML static library, use the --whole-archive linker option, as in the following example:

-Wl,--whole-archive libhlml.a -Wl,--no-whole-archive