Per Device APIs

hlml_return_t hlml_device_get_handle_by_pci_bus_id ( const char *pci_addr, hlml_device_t* device )

Operation:

Acquires the handle for a device, based on its PCI address.

Parameters:

Parameter

Description

pci_addr

[in] The bus ID of the target AIP (the tuple domain:bus:device.function).

device

[out] Reference in which to return the device handle.

Return Value:

  • HLML_SUCCESS if device has been found.

  • HLML_ERROR_UNINITIALIZED if the library has not been successfully initialized.

  • HLML_ERROR_INVALID_ARGUMENT if pci_addr is invalid or device is NULL.

  • HLML_ERROR_NOT_FOUND if the PCI address does not exist.

  • HLML_ERROR_UNKNOWN on any unexpected error.
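As a rough illustration of the pci_addr argument, the sketch below builds a domain:bus:device.function string from its numeric parts. It assumes the library accepts the usual Linux spelling (hex, zero padded, for example 0000:3b:00.0), which this reference does not state explicitly. The helper name format_pci_bus_id is hypothetical and not part of HLML.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper, not part of HLML.
 * Formats a PCI address as "domain:bus:device.function" in the common Linux
 * notation. Returns 0 on success, -1 if a field is out of range or the
 * buffer is too small. */
int format_pci_bus_id(char *buf, size_t len, unsigned int domain,
                      unsigned int bus, unsigned int device,
                      unsigned int function)
{
    assert(buf != NULL);
    /* bus is 0..0xff and device is 0..31; a PCI function number is 0..7 */
    if (bus > 0xff || device > 31 || function > 7)
        return -1;
    int n = snprintf(buf, len, "%04x:%02x:%02x.%x",
                     domain, bus, device, function);
    return (n > 0 && (size_t)n < len) ? 0 : -1;
}
```

The resulting string could then be passed as pci_addr, provided the format assumption holds for the installed library version.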

hlml_return_t hlml_device_get_handle_by_index ( unsigned int index, hlml_device_t* device )

Operation:

Acquires the handle for a device, based on its index.

Parameters:

Parameter

Description

index

[in] A valid integer {x} that matches an existing /dev/hl{x} entry in the file system (the index of a device that was successfully initialized by the driver).

device

[out] Reference in which to return the device handle.

Return Value:

  • HLML_SUCCESS if device has been found.

  • HLML_ERROR_UNINITIALIZED if the library has not been successfully initialized.

  • HLML_ERROR_INVALID_ARGUMENT if index is invalid or device is NULL.

  • HLML_ERROR_UNKNOWN on any unexpected error.
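Because the index corresponds to the {x} in /dev/hl{x}, a caller that scans /dev may want to turn a node name into an index first. The following is a minimal sketch of that parsing step only. The helper name parse_hl_index is hypothetical, and the directory scan itself is left out.

```c
#include <assert.h>
#include <ctype.h>
#include <errno.h>
#include <limits.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper, not part of HLML.
 * Extracts x from a node name of the form "hl<x>" (for example "hl3").
 * Returns 0 and stores the index on success, -1 for any other name. */
int parse_hl_index(const char *name, unsigned int *index)
{
    assert(name != NULL && index != NULL);
    if (strncmp(name, "hl", 2) != 0 || !isdigit((unsigned char)name[2]))
        return -1;
    char *end = NULL;
    errno = 0;
    unsigned long value = strtoul(name + 2, &end, 10);
    /* reject trailing characters, overflow, and values too large for index */
    if (errno != 0 || *end != '\0' || value > UINT_MAX)
        return -1;
    *index = (unsigned int)value;
    return 0;
}
```

Names that start with hl but do not continue with digits only are rejected, so unrelated nodes are not mistaken for device indices.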

hlml_return_t hlml_device_get_handle_by_UUID ( const char *uuid, hlml_device_t* device )

Operation:

Acquires the handle for a device, based on its UUID.

Parameters:

Parameter

Description

uuid

[in] The UUID of the target AIP.

device

[out] Reference in which to return the device handle.

Return Value:

  • HLML_SUCCESS if device has been found.

  • HLML_ERROR_UNINITIALIZED if the library has not been successfully initialized.

  • HLML_ERROR_INVALID_ARGUMENT if uuid is NULL.

  • HLML_ERROR_UNKNOWN on any unexpected error.

hlml_return_t hlml_device_get_name ( hlml_device_t device, char* name, unsigned int  length)

Operation:

Retrieves the name of this AIP.

Parameters:

Parameter

Description

device

[in] The identifier of the target AIP.

name

[out] Reference in which to return the product name.

length

[in] The maximum allowed length of the string returned in name.

Return Value:

  • HLML_SUCCESS if name has been set.

  • HLML_ERROR_UNINITIALIZED if the library has not been successfully initialized.

  • HLML_ERROR_INVALID_ARGUMENT if device is invalid, or name is NULL.

  • HLML_ERROR_INSUFFICIENT_SIZE if length is too small.

  • HLML_ERROR_UNKNOWN on any unexpected error.

hlml_return_t hlml_device_get_pci_info ( hlml_device_t device, hlml_pci_info_t *pci )

Operation:

Retrieves the PCI attributes of this AIP.

Parameters:

Parameter

Description

device

[in] The identifier of the target AIP.

pci

[out] Reference in which to return the PCI info.

#define PCI_LINK_INFO_LEN 10

typedef struct hlml_pci_cap {

    char link_speed[PCI_LINK_INFO_LEN]; // current pci link speed

    char link_width[PCI_LINK_INFO_LEN]; // current pci link width

} hlml_pci_cap_t;

typedef struct hlml_pci_info {

    unsigned int bus; // The bus on which the device resides, 0 to 0xff

    char bus_id[PCI_ADDR_LEN]; // The tuple domain:bus:device.function

    unsigned int device; // The device's id on the bus, 0 to 31.

    unsigned int domain; // The PCI domain on which the device's bus resides

    unsigned int pci_device_id; // The combined 16-bit device ID and 16-bit vendor ID

    hlml_pci_cap_t caps; // The device capabilities

} hlml_pci_info_t;

Return Value:

  • HLML_SUCCESS if pci has been populated.

  • HLML_ERROR_UNINITIALIZED if the library has not been successfully initialized.

  • HLML_ERROR_INVALID_ARGUMENT if device is invalid or pci is NULL.

  • HLML_ERROR_AIP_IS_LOST if PCI data is missing.

  • HLML_ERROR_UNKNOWN on any unexpected error.

hlml_return_t hlml_device_get_clock_info ( hlml_device_t device, hlml_clock_type_t type, unsigned int *clock )

Operation:

Retrieves the current clock speeds of the device.

Parameters:

Parameter

Description

device

[in] The identifier of the target AIP.

type

[in] Identify which clock to query.

clock

[out] Reference in which to return the clock speed in MHz.

typedef enum hlml_clock_type {

    HLML_CLOCK_SOC = 0,

    HLML_CLOCK_IC = 1,

    HLML_CLOCK_MME = 2, // Not supported

    HLML_CLOCK_TPC = 3, // Not supported

    HLML_CLOCK_COUNT

} hlml_clock_type_t;

Return Value:

  • HLML_SUCCESS if clock has been set.

  • HLML_ERROR_UNINITIALIZED if the library has not been successfully initialized.

  • HLML_ERROR_INVALID_ARGUMENT if device or type is invalid.

  • HLML_ERROR_NOT_SUPPORTED if clock is NULL.

  • HLML_ERROR_AIP_IS_LOST if the target AIP has fallen off the bus or is otherwise inaccessible.

  • HLML_ERROR_UNKNOWN on any unexpected error.

hlml_return_t hlml_device_get_max_clock_info ( hlml_device_t device, hlml_clock_type_t type, unsigned int *clock )

Operation:

Retrieves the maximum clock speeds of the device.

Parameters:

Parameter

Description

device

[in] The identifier of the target AIP.

type

[in] Identify which clock to query.

clock

[out] Reference in which to return the max clock speed in MHz.

typedef enum hlml_clock_type {

    HLML_CLOCK_SOC = 0,

    HLML_CLOCK_IC = 1,

    HLML_CLOCK_MME = 2,

    HLML_CLOCK_TPC = 3,

    HLML_CLOCK_COUNT

} hlml_clock_type_t;

Return Value:

  • HLML_SUCCESS if clock has been set.

  • HLML_ERROR_UNINITIALIZED if the library has not been successfully initialized.

  • HLML_ERROR_INVALID_ARGUMENT if device or type is invalid or clock is NULL.

  • HLML_ERROR_UNKNOWN on any unexpected error.

hlml_return_t hlml_device_get_clock_limit_info ( hlml_device_t device, hlml_clock_type_t type, unsigned int *clock )

Operation:

Retrieves the clock frequency limit of the device.

Parameters:

Parameter

Description

device

[in] The identifier of the target AIP.

type

[in] Identify which clock to query.

clock

[out] Reference in which to return the clock frequency limit in MHz.

typedef enum hlml_clock_type {

    HLML_CLOCK_SOC = 0,

    HLML_CLOCK_IC = 1,

    HLML_CLOCK_MME = 2,

    HLML_CLOCK_TPC = 3,

    HLML_CLOCK_COUNT

} hlml_clock_type_t;

hlml_return_t hlml_device_get_utilization_rates ( hlml_device_t device, hlml_utilization_t *utilization )

Operation:

Returns the percentage of time over the past second during which one or more kernels were running on the AIP.

Parameters:

Parameter

Description

device

[in] The identifier of the target AIP.

utilization

[out] Reference in which to return the utilization information.

typedef struct hlml_utilization {

    unsigned int aip;

} hlml_utilization_t;

Return Value:

  • HLML_SUCCESS if utilization has been set.

  • HLML_ERROR_UNINITIALIZED if the library has not been successfully initialized.

  • HLML_ERROR_INVALID_ARGUMENT if device is invalid or utilization is NULL.

  • HLML_ERROR_UNKNOWN on any unexpected error.

  • HLML_ERROR_NOT_SUPPORTED if the device does not support this feature.

hlml_return_t hlml_device_get_memory_info ( hlml_device_t device, hlml_memory_t *memory )

Operation:

Retrieves the total, used and free memory.

Parameters:

Parameter

Description

device

[in] The identifier of the target AIP.

memory

[out] Reference in which to return the memory information.

typedef struct hlml_memory {

    unsigned long long free; // Free memory (in bytes)

    unsigned long long total; // Total installed memory (in bytes)

    unsigned long long used; // Used memory (in bytes)

} hlml_memory_t;

Return Value:

  • HLML_SUCCESS if memory has been populated.

  • HLML_ERROR_UNINITIALIZED if the library has not been successfully initialized.

  • HLML_ERROR_INVALID_ARGUMENT if device is invalid or memory is NULL.

  • HLML_ERROR_UNKNOWN on any unexpected error.

hlml_return_t hlml_device_get_temperature ( hlml_device_t device, hlml_temperature_sensors_t sensor_type, unsigned int *temp )

Operation:

Retrieves the current temperature of the higher sensor_type, in degrees C.

Parameters:

Parameter

Description

device

[in] The identifier of the target AIP.

sensor_type

[in] Flag that indicates if the sensor is on the AIP or on the board.

temp

[out] Reference in which to return the temperature reading.

typedef enum hlml_temperature_sensors {

    HLML_TEMPERATURE_ON_AIP = 0,

    HLML_TEMPERATURE_ON_BOARD = 1,

    HLML_TEMPERATURE_OTHER = 2,

    HLML_TEMPERATURE_HBM = 3,

} hlml_temperature_sensors_t;

Return Value:

  • HLML_SUCCESS if temp has been set.

  • HLML_ERROR_UNINITIALIZED if the library has not been successfully initialized.

  • HLML_ERROR_INVALID_ARGUMENT if device is invalid, sensor_type is invalid or temp is NULL.

  • HLML_ERROR_UNKNOWN on any unexpected error.

hlml_return_t hlml_device_get_temperature_threshold ( hlml_device_t device, hlml_temperature_thresholds_t threshold_type, unsigned int *temp )

Operation:

Retrieves the known temperature threshold for the AIP with the specified threshold type in degrees C. Currently, this is a hard-coded value for all the types.

Parameters:

Parameter

Description

device

[in] The identifier of the target AIP.

threshold_type

[in] The type of threshold value queried.

temp

[out] Reference in which to return the temperature reading.

typedef enum hlml_temperature_thresholds {

    HLML_TEMPERATURE_THRESHOLD_SHUTDOWN = 0,

    HLML_TEMPERATURE_THRESHOLD_SLOWDOWN = 1,

    HLML_TEMPERATURE_THRESHOLD_MEM_MAX = 2,

    HLML_TEMPERATURE_THRESHOLD_GPU_MAX = 3,

    HLML_TEMPERATURE_THRESHOLD_COUNT

} hlml_temperature_thresholds_t;

Return Value:

  • HLML_SUCCESS if temp has been set.

  • HLML_ERROR_UNINITIALIZED if the library has not been successfully initialized.

  • HLML_ERROR_INVALID_ARGUMENT if device is invalid, threshold_type is invalid or temp is NULL.

hlml_return_t hlml_device_get_persistence_mode ( hlml_device_t device, hlml_enable_state_t *mode )

Operation:

API is not supported.

Parameters:

Parameter

Description

device

[in] The identifier of the target AIP.

mode

[out] Reference in which to return the current driver persistence mode.

typedef enum hlml_enable_state {

    HLML_FEATURE_DISABLED = 0,

    HLML_FEATURE_ENABLED = 1

} hlml_enable_state_t;

Return Value:

  • HLML_ERROR_UNINITIALIZED if the library has not been successfully initialized.

  • HLML_ERROR_INVALID_ARGUMENT if device is invalid or mode is NULL.

  • HLML_ERROR_NOT_SUPPORTED if the device does not support this feature.

hlml_return_t hlml_device_get_performance_state ( hlml_device_t device, hlml_p_states_t *p_state )

Operation:

API is not supported.

Parameters:

Parameter

Description

device

[in] The identifier of the target AIP.

p_state

[out] Reference in which to return the performance state reading.

typedef enum hlml_p_states {

    HLML_PSTATE_0 = 0,

    HLML_PSTATE_UNKNOWN = 32

} hlml_p_states_t;

Return Value:

  • HLML_ERROR_UNINITIALIZED if the library has not been successfully initialized.

  • HLML_ERROR_INVALID_ARGUMENT if device is invalid or p_state is NULL.

  • HLML_ERROR_NOT_SUPPORTED if the device does not support this feature.

hlml_return_t hlml_device_get_power_usage ( hlml_device_t device, unsigned int *power )

Operation:

Retrieves the power usage, in milliwatts, of this AIP and its associated circuitry (e.g. memory).

Parameters:

Parameter

Description

device

[in] The identifier of the target AIP.

power

[out] Reference in which to return the power usage information.

Return Value:

  • HLML_SUCCESS if power has been populated.

  • HLML_ERROR_UNINITIALIZED if the library has not been successfully initialized.

  • HLML_ERROR_INVALID_ARGUMENT if device is invalid or power is NULL.

  • HLML_ERROR_UNKNOWN on any unexpected error.

hlml_return_t hlml_device_get_power_management_mode ( hlml_device_t device, hlml_enable_state_t *state )

Operation:

Retrieves the power management mode associated with this device.

Parameters:

Parameter

Description

device

[in] The identifier of the target AIP.

state

[out] Reference in which to return the current power management mode.

Return Value:

  • HLML_ERROR_UNINITIALIZED if the library has not been successfully initialized.

  • HLML_ERROR_INVALID_ARGUMENT if device is invalid or state is NULL.

  • HLML_ERROR_NOT_SUPPORTED if the device does not support this feature.

  • HLML_ERROR_UNKNOWN on any unexpected error.

hlml_return_t hlml_device_set_power_management_limit ( hlml_device_t device, unsigned int limit )

Operation:

Sets power management limit on this device, in milliwatts.

Parameters:

Parameter

Description

device

[in] The identifier of the target AIP.

limit

[in] Power management limit in milliwatts to set.

Return Value:

  • HLML_SUCCESS if the limit has been set.

  • HLML_ERROR_UNINITIALIZED if the library has not been successfully initialized.

  • HLML_ERROR_INVALID_ARGUMENT if device is invalid.

  • HLML_ERROR_UNKNOWN on any unexpected error.

hlml_return_t hlml_device_get_power_management_limit ( hlml_device_t device, unsigned int *limit )

Operation:

Retrieves the power management limit on this device, in milliwatts. The power management limit defines the upper boundary for the card’s power draw. If the card’s total power draw reaches this limit, the power management algorithm is triggered.

Parameters:

Parameter

Description

device

[in] The identifier of the target AIP.

limit

[out] Reference in which to return the power management limit in milliwatts.

Return Value:

  • HLML_SUCCESS if limit has been set.

  • HLML_ERROR_UNINITIALIZED if the library has not been successfully initialized.

  • HLML_ERROR_INVALID_ARGUMENT if device is invalid or limit is NULL.

  • HLML_ERROR_NOT_SUPPORTED if the device does not support this feature.

  • HLML_ERROR_UNKNOWN on any unexpected error.
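Since both the power usage and the power management limit are reported in milliwatts, the two readings can be compared directly. The sketch below shows one way to derive the draw as a percentage of the limit and to convert a reading to watts for display. Both helper names are hypothetical, and the inputs are assumed to come from the corresponding getter functions.

```c
/* Hypothetical helper, not part of HLML.
 * Returns the current draw as a percentage of the limit, or -1.0 if the
 * limit is 0, which would make the ratio meaningless. */
double power_percent_of_limit(unsigned int usage_mw, unsigned int limit_mw)
{
    if (limit_mw == 0)
        return -1.0;
    return 100.0 * (double)usage_mw / (double)limit_mw;
}

/* Hypothetical helper: converts milliwatts to watts for display. */
double milliwatts_to_watts(unsigned int mw)
{
    return (double)mw / 1000.0;
}
```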

hlml_return_t hlml_device_get_power_management_default_limit ( hlml_device_t device, unsigned int *default_limit )

Operation:

Retrieves the default power management limit on this device, in milliwatts. This is the power management limit that the device boots with.

Parameters:

Parameter

Description

device

[in] The identifier of the target AIP.

default_limit

[out] Reference in which to return the default power management limit in milliwatts.

Return Value:

  • HLML_SUCCESS if default limit has been set.

  • HLML_ERROR_UNINITIALIZED if the library has not been successfully initialized.

  • HLML_ERROR_INVALID_ARGUMENT if device is invalid or default_limit is NULL.

  • HLML_ERROR_UNKNOWN on any unexpected error.

hlml_return_t hlml_device_get_ecc_mode ( hlml_device_t device, hlml_enable_state_t* current, hlml_enable_state_t* pending )

Operation:

Retrieves the current and pending ECC modes of the device.

Parameters:

Parameter

Description

device

[in] The identifier of the target AIP.

current

[out] Reference in which to return the current ECC mode.

pending

[out] Reference in which to return the pending ECC mode.

typedef enum hlml_enable_state {

    HLML_FEATURE_DISABLED = 0,

    HLML_FEATURE_ENABLED = 1

} hlml_enable_state_t;

Return Value:

  • HLML_SUCCESS if ECC mode is set.

  • HLML_ERROR_UNINITIALIZED if the library has not been successfully initialized.

  • HLML_ERROR_INVALID_ARGUMENT if device is invalid or current or pending are NULL.

hlml_return_t hlml_device_get_total_ecc_errors ( hlml_device_t device, hlml_memory_error_type_t error_type, hlml_ecc_counter_type_t counter_type, unsigned long long* ecc_counts )

Operation:

Returns the number of ECC errors for a specific device, since the last device reset, or since the driver was installed. Only the number of uncorrected errors is supported.

Parameters:

Parameter

Description

device

[in] The identifier of the target AIP.

error_type

[in] Flag that specifies the type of the errors.

counter_type

[in] Flag that specifies the counter type of the errors.

ecc_counts

[out] Reference in which to return the specified ECC errors.

typedef enum hlml_memory_error_type {

    HLML_MEMORY_ERROR_TYPE_CORRECTED = 0, // Not supported

    HLML_MEMORY_ERROR_TYPE_UNCORRECTED = 1,

    HLML_MEMORY_ERROR_TYPE_COUNT

} hlml_memory_error_type_t;

typedef enum hlml_ecc_counter_type {

    HLML_VOLATILE_ECC = 0, // Since last device reset

    HLML_AGGREGATE_ECC = 1, // Since driver is up

    HLML_ECC_COUNTER_TYPE_COUNT

} hlml_ecc_counter_type_t;

Return Value:

  • HLML_SUCCESS if ecc_counts was set.

  • HLML_ERROR_UNINITIALIZED if the library has not been successfully initialized.

  • HLML_ERROR_INVALID_ARGUMENT if device, error_type or counter_type is invalid, or ecc_counts is NULL.

  • HLML_ERROR_NOT_SUPPORTED if the device does not support this feature.

  • HLML_ERROR_UNKNOWN if error occurred during ECC error retrieval.

hlml_return_t hlml_device_get_memory_error_counter( hlml_device_t device, hlml_memory_error_type_t error_type, hlml_ecc_counter_type_t counter_type, hlml_memory_location_type_t location, unsigned long long *ecc_counts)

Operation:

Returns the number of ECC errors for a specific device and location, since the last device reset, or since the driver was installed.

Parameters:

Parameter

Description

device

[in] The identifier of the target AIP.

error_type

[in] Flag that specifies the type of errors.

counter_type

[in] Flag that specifies the counter type of the errors.

location

[in] Flag that specifies the location of the errors.

ecc_counts

[out] Reference in which to return the specified ECC errors.

typedef enum hlml_memory_error_type {

    HLML_MEMORY_ERROR_TYPE_CORRECTED = 0, // Not supported

    HLML_MEMORY_ERROR_TYPE_UNCORRECTED = 1,

    HLML_MEMORY_ERROR_TYPE_COUNT

} hlml_memory_error_type_t;

typedef enum hlml_ecc_counter_type {

    HLML_VOLATILE_ECC = 0, // Since last device reset

    HLML_AGGREGATE_ECC = 1, // Since driver is up

    HLML_ECC_COUNTER_TYPE_COUNT

} hlml_ecc_counter_type_t;

typedef enum hlml_memory_location_type {

    HLML_MEMORY_LOCATION_SRAM = 0,

    HLML_MEMORY_LOCATION_DRAM = 1,

    HLML_MEMORY_LOCATION_COUNT
} hlml_memory_location_type_t;

Return Value:

  • HLML_SUCCESS if ecc_counts was set.

  • HLML_ERROR_UNINITIALIZED if the library has not been successfully initialized.

  • HLML_ERROR_INVALID_ARGUMENT if device, error_type, counter_type or location is invalid, or ecc_counts is NULL.

  • HLML_ERROR_NOT_SUPPORTED if the device does not support this feature.

  • HLML_ERROR_UNKNOWN if error occurred during ECC error retrieval.

hlml_return_t hlml_device_get_uuid(hlml_device_t device, char *uuid, unsigned int length)

Operation:

Returns the UUID for the device as a string.

Parameters:

Parameter

Description

device

[in] The identifier of the target AIP.

uuid

[out] The UUID.

length

[in] The maximum size of the string allocated by the user.

Return Value:

  • HLML_SUCCESS if uuid has been set.

  • HLML_ERROR_UNINITIALIZED if the library has not been successfully initialized.

  • HLML_ERROR_INVALID_ARGUMENT if device is invalid or uuid is NULL.

  • HLML_ERROR_INSUFFICIENT_SIZE if the UUID string length is longer than the size allocated by the user.

hlml_return_t hlml_device_get_minor_number(hlml_device_t device, unsigned int *minor_number)

Operation:

Retrieves the minor number of the device. The minor number of the device is such that the Gaudi device node file for each device will have the following form: /sys/class/habanalabs/hl[minor number].

Parameters:

Parameter

Description

device

[in] The identifier of the target AIP.

minor_number

[out] Reference in which to return the minor number for the device.

Return Value:

  • HLML_SUCCESS if minor_number has been set.

  • HLML_ERROR_UNINITIALIZED if the library has not been successfully initialized.

  • HLML_ERROR_INVALID_ARGUMENT if device is invalid or minor_number is NULL.

  • HLML_ERROR_NO_DATA if unable to retrieve the minor number for any reason.

hlml_return_t hlml_event_set_create(hlml_event_set_t *set)

Operation:

Creates an empty set of events. Event set should be freed by hlml_event_set_free.

Parameters:

Parameter

Description

set

[out] Reference in which to return the event handle.

typedef void* hlml_event_set_t;

Return Value:

  • HLML_SUCCESS if the event set has been created.

  • HLML_ERROR_UNINITIALIZED if the library has not been successfully initialized.

  • HLML_ERROR_INVALID_ARGUMENT if set is NULL.

  • HLML_ERROR_MEMORY if failed to allocate a set.

hlml_return_t hlml_event_set_free(hlml_event_set_t set)

Operation:

Releases a set of events.

Parameters:

Parameter

Description

set

[in] Reference to events to be released.

Return Value:

  • HLML_SUCCESS if the event set has been released.

  • HLML_ERROR_UNINITIALIZED if the library has not been successfully initialized.

  • HLML_ERROR_INVALID_ARGUMENT if set is invalid.

hlml_return_t hlml_device_register_events(hlml_device_t device, unsigned long long event_types, hlml_event_set_t set)

Operation:

Starts recording events on the specified device and adds them to the specified hlml_event_set_t. Events that occurred before this call are not recorded.

Supported events:

  • ECC single/double bit errors – BIT(0)

  • Critical errors that occurred – BIT(1)

  • Clock rate changes – BIT(2)

Parameters:

Parameter

Description

device

[in] The identifier of the target AIP.

event_types

[in] Bitmask of event types to record.

set

[in] Set to which to add the new event types.

Return Value:

  • HLML_SUCCESS if the event types have been registered.

  • HLML_ERROR_UNINITIALIZED if the library has not been successfully initialized.

  • HLML_ERROR_INVALID_ARGUMENT if device or set is invalid or event_types is 0.

  • HLML_ERROR_UNKNOWN if it failed to retrieve information regarding events.
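The event_types argument is a bitmask built from the bit positions listed under "Supported events". The sketch below composes and inspects such a mask. The macro names are hypothetical stand-ins for the documented positions BIT(0) to BIT(2); the real header may define its own constants. Rejecting unknown bits is a conservative local check, not something this reference says the library enforces.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical names for the documented bit positions. */
#define EVENT_ECC_ERROR      (1ULL << 0) /* ECC single/double bit errors */
#define EVENT_CRITICAL_ERROR (1ULL << 1) /* Critical errors */
#define EVENT_CLOCK_CHANGE   (1ULL << 2) /* Clock rate changes */
#define EVENT_ALL (EVENT_ECC_ERROR | EVENT_CRITICAL_ERROR | EVENT_CLOCK_CHANGE)

/* Returns 1 if the mask is non-zero and uses only the documented bits. */
int event_mask_is_valid(unsigned long long event_types)
{
    return event_types != 0 && (event_types & ~EVENT_ALL) == 0;
}

/* Writes a comma-separated description of the set bits into buf. */
void describe_events(unsigned long long event_types, char *buf, size_t len)
{
    static const char *const names[] = { "ecc", "critical", "clock" };
    assert(buf != NULL && len > 0);
    buf[0] = '\0';
    for (unsigned int bit = 0; bit < 3; bit++) {
        if (!(event_types & (1ULL << bit)))
            continue;
        size_t used = strlen(buf);
        /* snprintf truncates safely if the buffer is too short */
        snprintf(buf + used, len - used, "%s%s", used ? "," : "", names[bit]);
    }
}
```

The same decoding could be applied to the event_type field returned when waiting on an event set, assuming it uses the same bit positions.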

hlml_return_t hlml_event_set_wait(hlml_event_set_t set, hlml_event_data_t *data, unsigned int timeoutms)

Operation:

Waits on events and delivers them. If events are ready to be delivered at the time of the call, the function returns immediately. Otherwise, the function sleeps until an event arrives, but no longer than the specified timeout.

Parameters:

Parameter

Description

set

[in] Reference to set of events to wait on.

data

[out] Reference in which to return event data.

timeoutms

[in] Maximum amount of wait time in milliseconds for registered event.

typedef struct hlml_event_data {

    hlml_device_t device; /* Specific device where the event occurred. */

    unsigned long long event_type; /* Specific event that occurred */

} hlml_event_data_t;

Return Value:

  • HLML_SUCCESS if data has been set.

  • HLML_ERROR_UNINITIALIZED if the library has not been successfully initialized.

  • HLML_ERROR_INVALID_ARGUMENT if set is invalid or data is NULL.

  • HLML_ERROR_UNKNOWN if it failed to retrieve information regarding events.

  • HLML_ERROR_TIMEOUT if no event arrived within timeoutms milliseconds.

hlml_return_t hlml_device_get_mac_info(hlml_device_t device, hlml_mac_info_t *mac_info, unsigned int mac_info_size, unsigned int start_mac_id, unsigned int *actual_mac_info_count)

Operation:

Gets MAC addresses of device.

#define ETHER_ADDR_LEN 6

typedef struct hlml_mac_info {

    unsigned char addr[ETHER_ADDR_LEN];

    int id;

} hlml_mac_info_t;

Parameters:

Parameter

Description

device

[in] The identifier of the target AIP.

mac_info

[out] Array of MAC addresses allocated by the user.

mac_info_size

[in] Number of elements in mac_info[].

start_mac_id

[in] MAC id to start from. Number in the range of [1…20].

actual_mac_info_count

[out] The actual number of elements that we set in the user’s array (mac_info).

Return Value:

  • HLML_SUCCESS if MAC address was retrieved successfully.

  • HLML_ERROR_INVALID_ARGUMENT if device, mac_info or actual_mac_info_count is invalid, or if start_mac_id is less than 1 or greater than 20.

  • HLML_ERROR_NO_DATA if requested start MAC address is bigger than the MAC count for the device.
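Each returned hlml_mac_info_t entry holds the raw six address bytes. The sketch below formats one entry in the usual colon-separated notation. The struct is repeated from this section so that the example compiles on its own, and the helper name format_mac is hypothetical.

```c
#include <assert.h>
#include <stdio.h>

/* Repeated from the reference so the sketch is self-contained. */
#define ETHER_ADDR_LEN 6

typedef struct hlml_mac_info {
    unsigned char addr[ETHER_ADDR_LEN];
    int id;
} hlml_mac_info_t;

/* Hypothetical helper, not part of HLML.
 * Formats one entry as "aa:bb:cc:dd:ee:ff"; buf needs at least 18 bytes.
 * Returns 0 on success, -1 if the buffer is too small. */
int format_mac(const hlml_mac_info_t *info, char *buf, size_t len)
{
    assert(info != NULL && buf != NULL);
    int n = snprintf(buf, len, "%02x:%02x:%02x:%02x:%02x:%02x",
                     info->addr[0], info->addr[1], info->addr[2],
                     info->addr[3], info->addr[4], info->addr[5]);
    return (n > 0 && (size_t)n < len) ? 0 : -1;
}
```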

hlml_return_t hlml_device_get_hl_revision(hlml_device_t device, int *hl_revision)

Operation:

Gets the HL Revision.

Parameters:

Parameter

Description

device

[in] The identifier of the target AIP.

hl_revision

[out] Reference in which to return the hl_revision number.

Return Value:

  • HLML_SUCCESS if hl_revision has been set.

  • HLML_ERROR_UNINITIALIZED if the library has not been successfully initialized.

  • HLML_ERROR_INVALID_ARGUMENT if device is invalid, or hl_revision is NULL.

  • HLML_ERROR_NOT_FOUND if failed to retrieve hl_revision.

hlml_return_t hlml_device_get_pcb_info(hlml_device_t device, hlml_pcb_info_t *pcb)

Operation:

Gets the PCB info.

#define HL_FIELD_MAX_SIZE 32

/*
 * pcb_ver - The device's PCB version
 * pcb_assembly_ver - The device's PCB Assembly version
 */

typedef struct hlml_pcb_info {

    char pcb_ver[HL_FIELD_MAX_SIZE];

    char pcb_assembly_ver[HL_FIELD_MAX_SIZE];

} hlml_pcb_info_t;

Parameters:

Parameter

Description

device

[in] The identifier of the target AIP.

pcb

[out] Reference in which to return the PCB info.

Return Value:

  • HLML_SUCCESS if PCB info has been set.

  • HLML_ERROR_UNINITIALIZED if the library has not been successfully initialized.

  • HLML_ERROR_INVALID_ARGUMENT if device is invalid, or pcb is NULL.

  • HLML_ERROR_UNKNOWN on any unexpected error.

hlml_return_t hlml_device_get_serial(hlml_device_t device, char *serial, unsigned int length)

Operation:

Retrieves the globally unique board serial number associated with this device’s board.

Parameters:

Parameter

Description

device

[in] The identifier of the target AIP.

serial

[out] Reference in which to return the board/module serial number.

length

[in] The maximum allowed length of the string returned in serial.

Return Value:

  • HLML_SUCCESS if serial has been set.

  • HLML_ERROR_UNINITIALIZED if the library has not been successfully initialized.

  • HLML_ERROR_INVALID_ARGUMENT if device is invalid, or serial is NULL.

  • HLML_ERROR_INSUFFICIENT_SIZE if length is too small.

hlml_return_t hlml_device_get_module_id(hlml_device_t device, unsigned int *module_id)

Operation:

Retrieves the module id configured on the device.

Parameters:

Parameter

Description

device

[in] The identifier of the target AIP.

module_id

[out] Reference in which to return the module id.

Return Value:

  • HLML_SUCCESS if module_id has been set.

  • HLML_ERROR_INVALID_ARGUMENT if device is invalid, or module_id is NULL.

  • HLML_ERROR_UNKNOWN on any unexpected error.

hlml_return_t hlml_device_get_board_id(hlml_device_t device, unsigned int* board_id)

Operation:

Retrieves the device board ID, a value in the range 0-7.

Parameters:

Parameter

Description

device

[in] The identifier of the target AIP.

board_id

[out] Reference in which to return the device’s board ID.

Return Value:

  • HLML_SUCCESS if board_id has been set.

  • HLML_ERROR_UNINITIALIZED if the library has not been successfully initialized.

  • HLML_ERROR_INVALID_ARGUMENT if device is invalid or board_id is NULL.

  • HLML_ERROR_NOT_FOUND if no AIP matching device was found.

hlml_return_t hlml_device_get_pcie_throughput(hlml_device_t device, hlml_pcie_util_counter_t counter, unsigned int *value);

Operation:

Retrieves PCIe utilization information. The throughput is calculated over a 10 ms interval.

typedef enum hlml_pcie_util_counter {

    HLML_PCIE_UTIL_TX_BYTES = 0,

    HLML_PCIE_UTIL_RX_BYTES = 1,

    HLML_PCIE_UTIL_COUNT,

} hlml_pcie_util_counter_t;

Parameters:

Parameter

Description

device

[in] The identifier of the target AIP.

counter

[in] The specific counter that should be queried.

value

[out] Reference in which to return throughput in KB/s.

Return Value:

  • HLML_SUCCESS if value has been set.

  • HLML_ERROR_UNINITIALIZED if the library has not been successfully initialized.

  • HLML_ERROR_INVALID_ARGUMENT if device or counter is invalid, or value is NULL.

  • HLML_ERROR_UNKNOWN on any unexpected error.

hlml_return_t hlml_device_get_pcie_replay_counter(hlml_device_t device, unsigned int *value)

Operation:

Retrieves the PCIe replay counter.

Parameters:

Parameter

Description

device

[in] The identifier of the target AIP.

value

[out] Reference in which to return the counter’s value.

Return Value:

  • HLML_SUCCESS if value has been set.

  • HLML_ERROR_UNINITIALIZED if the library has not been successfully initialized.

  • HLML_ERROR_INVALID_ARGUMENT if device is invalid, or value is NULL.

  • HLML_ERROR_UNKNOWN on any unexpected error.

hlml_return_t hlml_device_get_current_clocks_throttle_reasons(hlml_device_t device, unsigned long long *clocks_throttle_reasons)

Operation:

Retrieves current clocks throttling reasons.

Parameters:

Parameter

Description

device

[in] The identifier of the target AIP.

clocks_throttle_reasons

[out] Reference in which to return bitmask of active clocks throttle reasons.

Return Value:

  • HLML_SUCCESS if clocks_throttle_reasons was set.

  • HLML_ERROR_UNINITIALIZED if the library has not been successfully initialized.

  • HLML_ERROR_INVALID_ARGUMENT if device is invalid, or clocks_throttle_reasons is NULL.

  • HLML_ERROR_UNKNOWN on any unexpected error.

hlml_return_t hlml_device_get_total_energy_consumption(hlml_device_t device, unsigned long long *energy);

Operation:

Retrieves total energy consumption in millijoules (mJ) since the driver was last reloaded.

Parameters:

Parameter

Description

device

[in] The identifier of the target AIP.

energy

[out] Reference in which to return energy consumption.

Return Value:

  • HLML_SUCCESS if energy has been set.

  • HLML_ERROR_UNINITIALIZED if the library has not been successfully initialized.

  • HLML_ERROR_INVALID_ARGUMENT if device is invalid, or energy is NULL.

  • HLML_ERROR_UNKNOWN on any unexpected error.
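Because the counter is cumulative, two readings taken some time apart give the average power over that interval (millijoules per second equal milliwatts). The sketch below shows that calculation under the assumption that the caller measures the elapsed time itself. The helper name average_power_mw is hypothetical.

```c
/* Hypothetical helper, not part of HLML.
 * Average power in milliwatts between two energy readings taken
 * elapsed_seconds apart. Returns -1.0 if the inputs are unusable: no time
 * elapsed, or the counter went backwards (for example after the driver
 * was reloaded between the two readings). */
double average_power_mw(unsigned long long energy_start_mj,
                        unsigned long long energy_end_mj,
                        double elapsed_seconds)
{
    if (elapsed_seconds <= 0.0 || energy_end_mj < energy_start_mj)
        return -1.0;
    return (double)(energy_end_mj - energy_start_mj) / elapsed_seconds;
}
```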

hlml_return_t hlml_get_mac_addr_info(hlml_device_t device, uint64_t *mask, uint64_t *ext_mask);

Operation:

Retrieves the masks for supported ports and external ports.

Parameters:

Parameter

Description

device

[in] The identifier of the target AIP.

mask

[out] Array of 2 uint64_t. Reference in which to return a bitmask for supported ports.

ext_mask

[out] Array of 2 uint64_t. Reference in which to return a bitmask for external ports within the supported ports. The internal ports bitmask will be the difference between the mask and ext_mask.

Return Value:

  • HLML_SUCCESS if the port masks were retrieved.

  • HLML_ERROR_INVALID_ARGUMENT if device is invalid, or mask is NULL or ext_mask is NULL.

  • HLML_ERROR_UNKNOWN on any unexpected error.

hlml_return_t hlml_nic_get_statistics(hlml_device_t device, hlml_nic_stats_info_t *stats_info);

Operation:

Retrieves the NICs statistics for the requested internal ports.

typedef struct hlml_nic_stats_info {
    uint32_t port;
    char *str_buf;
    uint64_t *val_buf;
    uint32_t *num_of_counters_out;
} hlml_nic_stats_info_t;

#define HABANA_LINK_CNT_MAX_NUM 256

Parameters:

Parameter

Description

device

[in] The identifier of the target AIP.

stats_info.port

[in] Port for which the statistics are requested.

stats_info.str_buf

[out] Reference in which the names of the counters are stored. Buffer is allocated by the user in the size of HABANA_LINK_CNT_MAX_NUM * 32.

stats_info.val_buf

[out] Reference in which the values of the counters are stored. Buffer is allocated by the user in the size of HABANA_LINK_CNT_MAX_NUM * sizeof (uint64_t).

stats_info.num_of_counters_out

[out] Reference in which the actual number of counters retrieved is stored.

Return Value:

  • HLML_SUCCESS if the NICs statistics were retrieved.

  • HLML_ERROR_INVALID_ARGUMENT if device is invalid, stats_info.str_buf is NULL, stats_info.val_buf is NULL or stats_info.num_of_counters_out is NULL.

  • HLML_ERROR_UNKNOWN on any unexpected error.
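All three output buffers in hlml_nic_stats_info_t are allocated by the caller. The sketch below shows one way to set them up with the sizes given in the parameter descriptions and to release them afterwards. The struct and HABANA_LINK_CNT_MAX_NUM are repeated from this section so that the example compiles on its own. The helper names and COUNTER_NAME_LEN are hypothetical, and nic_counter_name assumes that each name occupies its own fixed 32-byte slot in str_buf, which the reference implies through the buffer size but does not state.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define HABANA_LINK_CNT_MAX_NUM 256
#define COUNTER_NAME_LEN 32 /* hypothetical name for the 32-byte name slot */

typedef struct hlml_nic_stats_info {
    uint32_t port;
    char *str_buf;
    uint64_t *val_buf;
    uint32_t *num_of_counters_out;
} hlml_nic_stats_info_t;

/* Releases the buffers and clears the pointers. */
void nic_stats_free(hlml_nic_stats_info_t *info)
{
    assert(info != NULL);
    free(info->str_buf);
    free(info->val_buf);
    free(info->num_of_counters_out);
    info->str_buf = NULL;
    info->val_buf = NULL;
    info->num_of_counters_out = NULL;
}

/* Allocates zeroed output buffers for one port. Returns 0 on success,
 * -1 on allocation failure (nothing is left allocated in that case). */
int nic_stats_alloc(hlml_nic_stats_info_t *info, uint32_t port)
{
    assert(info != NULL);
    info->port = port;
    info->str_buf = calloc(HABANA_LINK_CNT_MAX_NUM, COUNTER_NAME_LEN);
    info->val_buf = calloc(HABANA_LINK_CNT_MAX_NUM, sizeof(uint64_t));
    info->num_of_counters_out = calloc(1, sizeof(uint32_t));
    if (!info->str_buf || !info->val_buf || !info->num_of_counters_out) {
        nic_stats_free(info);
        return -1;
    }
    return 0;
}

/* Returns the name of counter i, or NULL if i is out of range.
 * Assumes one fixed-size slot per name (see the note above). */
const char *nic_counter_name(const hlml_nic_stats_info_t *info, uint32_t i)
{
    assert(info != NULL && info->str_buf != NULL);
    if (info->num_of_counters_out == NULL || i >= *info->num_of_counters_out)
        return NULL;
    return info->str_buf + (size_t)i * COUNTER_NAME_LEN;
}
```

After a successful query, the value for counter i would be read from val_buf[i] alongside the name.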

hlml_return_t hlml_device_clear_cpu_affinity(hlml_device_t device);

Operation:

Clears all affinity bindings for the calling process.

Parameters:

Parameter

Description

device

[in] The identifier of the target AIP.

Return Value:

  • HLML_SUCCESS if the affinity bindings were cleared.

  • HLML_ERROR_INVALID_ARGUMENT if device is invalid.

  • HLML_ERROR_UNKNOWN on any unexpected error.

hlml_return_t hlml_device_get_cpu_affinity(hlml_device_t device, unsigned int cpu_set_size, unsigned long *cpu_set);

Operation:

Retrieves an array of unsigned longs (sized to cpu_set_size) of bitmasks with the ideal CPU affinity for the device. For example, on a 64-bit machine, if processors 0, 1, 64, and 65 are ideal for the device and cpu_set_size == 2, then cpu_set[0] = 0x3 and cpu_set[1] = 0x3. This is equivalent to calling hlml_device_get_cpu_affinity_within_scope with HLML_AFFINITY_SCOPE_NODE.

Parameters:

Parameter

Description

device

[in] The identifier of the target AIP.

cpu_set_size

[in] The size of the cpu_set array that is safe to access.

cpu_set

[out] Array reference in which to return a bitmask of CPUs, 64 CPUs per unsigned long on 64-bit machines, 32 on 32-bit machines.

Return Value:

  • HLML_SUCCESS if the CPU affinity bitmask was retrieved.

  • HLML_ERROR_INVALID_ARGUMENT if device is invalid, or cpu_set is NULL.

  • HLML_ERROR_UNKNOWN on any unexpected error.
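The bitmask layout described above can be decoded with a short helper. The sketch below is illustrative only: cpu_set_words and cpu_set_to_list are hypothetical names, and the code derives the word width from sizeof(unsigned long) so that it works on both 64-bit and 32-bit machines.

```c
#include <limits.h>
#include <stddef.h>

/* Bits per array element: 64 on typical 64-bit machines, 32 on 32-bit ones. */
#define BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)

/* Number of array elements needed to cover `num_cpus` CPUs (rounded up). */
static unsigned int cpu_set_words(unsigned int num_cpus)
{
    return (unsigned int)((num_cpus + BITS_PER_WORD - 1) / BITS_PER_WORD);
}

/* Writes the indices of the set bits into `out` (at most `max_out` entries)
 * and returns the total number of set bits found. */
static size_t cpu_set_to_list(const unsigned long *cpu_set, unsigned int cpu_set_size,
                              unsigned int *out, size_t max_out)
{
    size_t count = 0;

    for (unsigned int w = 0; w < cpu_set_size; w++) {
        for (unsigned int b = 0; b < BITS_PER_WORD; b++) {
            if ((cpu_set[w] >> b) & 1ul) {
                if (count < max_out)
                    out[count] = (unsigned int)(w * BITS_PER_WORD + b);
                count++;
            }
        }
    }
    return count;
}
```

With the example from the description (two words, each 0x3, on a 64-bit machine), the decoded list is CPUs 0, 1, 64, and 65. cpu_set_words gives a suitable cpu_set_size for a machine with a known CPU count.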

hlml_return_t hlml_device_get_cpu_affinity_within_scope(hlml_device_t device, unsigned int cpu_set_size, unsigned long *cpu_set, hlml_affinity_scope_t scope);

Operation:

Retrieves an array of unsigned longs (sized to cpu_set_size) of bitmasks with the ideal CPU affinity within node or socket for the device. For example, on a 64-bit machine, if processors 0, 1, 64, and 65 are ideal for the device and cpu_set_size == 2, then cpu_set[0] = 0x3 and cpu_set[1] = 0x3.

#define HLML_AFFINITY_SCOPE_NODE   0 // Scope of NUMA node for affinity queries.
#define HLML_AFFINITY_SCOPE_SOCKET 1 // Scope of processor socket for affinity queries.

Parameters:

Parameter

Description

device

[in] The identifier of the target AIP.

cpu_set_size

[in] The size of the cpu_set array that is safe to access.

cpu_set

[out] Array reference in which to return a bitmask of CPUs, 64 CPUs per unsigned long on 64-bit machines, 32 on 32-bit machines.

scope

[in] Scope that changes the default behavior.

Return Value:

  • HLML_SUCCESS if the CPU affinity bitmask was retrieved.

  • HLML_ERROR_INVALID_ARGUMENT if device is invalid, or cpu_set is NULL.

  • HLML_ERROR_NOT_SUPPORTED if scope is not supported.

  • HLML_ERROR_UNKNOWN on any unexpected error.

hlml_return_t hlml_device_get_memory_affinity(hlml_device_t device, unsigned int node_set_size, unsigned long *node_set, hlml_affinity_scope_t scope);

Operation:

Retrieves an array of unsigned longs (sized to node_set_size) of bitmasks with the ideal memory affinity within node or socket for the device. For example, if NUMA nodes 0 and 1 are ideal within the socket for the device and node_set_size == 1, then node_set[0] = 0x3.

Parameters:

Parameter

Description

device

[in] The identifier of the target AIP.

node_set_size

[in] The size of the node_set array that is safe to access.

node_set

[out] Array reference in which to return a bitmask of NUMA nodes, 64 nodes per unsigned long on 64-bit machines, 32 on 32-bit machines.

scope

[in] Scope that changes the default behavior.

Return Value:

  • HLML_SUCCESS if the memory affinity bitmask was retrieved.

  • HLML_ERROR_INVALID_ARGUMENT if device is invalid, or node_set is NULL.

  • HLML_ERROR_NOT_SUPPORTED if scope is not supported.

  • HLML_ERROR_UNKNOWN on any unexpected error.

hlml_return_t hlml_device_set_cpu_affinity(hlml_device_t device);

Operation:

Sets the ideal affinity for the calling thread and device using the guidelines given in hlml_device_get_cpu_affinity().

Parameters:

Parameter

Description

device

[in] The identifier of the target AIP.

Return Value:

  • HLML_SUCCESS if the affinity was set for the calling thread.

  • HLML_ERROR_INVALID_ARGUMENT if device is invalid.

  • HLML_ERROR_UNKNOWN on any unexpected error.

hlml_return_t hlml_device_get_violation_status(hlml_device_t device, hlml_perf_policy_type_t perf_policy_type, hlml_violation_time_t *viol_time)

Operation:

Gets the duration of time during which the device was throttled (lower than requested clocks) due to power or thermal constraints.

This method helps determine whether the AIP throttled at any point while an application was running. If a throttling event is currently in progress, the duration is measured from the start of the event up to the time of the call.

typedef enum hlml_perf_policy_type {
    HLML_PERF_POLICY_POWER,
    HLML_PERF_POLICY_THERMAL,
} hlml_perf_policy_type_t;

typedef struct hlml_violation_time {
    unsigned long long reference_time; /* microseconds */
    unsigned long long violation_time; /* nanoseconds */
} hlml_violation_time_t;

  • reference_time represents CPU timestamp in microseconds - time of the start of the event (unique for each event).

  • violation_time indicates the duration of the event in nanoseconds.

Parameters:

Parameter

Description

device

[in] The identifier of the target AIP.

perf_policy_type

[in] Represents performance policy which can trigger AIP throttling.

viol_time

[out] Reference to which violation time related information is returned.

Return Value:

  • HLML_SUCCESS if the violation info was retrieved.

  • HLML_ERROR_INVALID_ARGUMENT if device is invalid, or viol_time is NULL.

  • HLML_ERROR_UNKNOWN on any unexpected error.
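Because the two fields use different units (microseconds and nanoseconds), it is easy to mix them up when reporting. The following sketch shows one way to convert them; struct violation_sample is a local mirror of the two documented fields so the example compiles on its own, and the helper names are hypothetical.

```c
/* Local mirror of the two documented fields; the real definition
 * (hlml_violation_time_t) comes from the HLML header. */
struct violation_sample {
    unsigned long long reference_time; /* start of the event, microseconds */
    unsigned long long violation_time; /* duration of the event, nanoseconds */
};

/* Start of the throttling event, converted from microseconds to seconds. */
static double reference_time_seconds(const struct violation_sample *v)
{
    return (double)v->reference_time / 1e6;
}

/* Duration of the throttling event, converted from nanoseconds to milliseconds. */
static double violation_time_ms(const struct violation_sample *v)
{
    return (double)v->violation_time / 1e6;
}
```

In real use, the two fields would be copied from the hlml_violation_time_t filled in by hlml_device_get_violation_status.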

hlml_return_t hlml_device_get_replaced_rows(hlml_device_t device, hlml_row_replacement_cause_t cause, unsigned int *row_count, hlml_row_address_t *addresses)

Operation:

Returns the list of replaced rows, including rows that are pending replacement. The address information provided from this API is the full address of the row that was retired (see struct below).

typedef enum hlml_row_replacement_cause {
    HLML_ROW_REPLACEMENT_CAUSE_MULTIPLE_SINGLE_BIT_ECC_ERRORS,
    HLML_ROW_REPLACEMENT_CAUSE_DOUBLE_BIT_ECC_ERROR,
} hlml_row_replacement_cause_t;

Row address info struct:

typedef struct hlml_row_address {
    uint8_t hbm_idx;
    uint8_t pc;
    uint8_t sid;
    uint8_t bank_idx;
    uint16_t row_addr;
} hlml_row_address_t;

Parameters:

Parameter

Description

device

[in] The identifier of the target AIP.

cause

[in] Filter replaced rows by cause of retirement.

row_count

[in, out] On input, the number of entries the addresses buffer can hold. Set to 0 to query the size without allocating an addresses buffer. On output, the number of replaced rows that match ‘cause’.

addresses

[out] Buffer to write the row addresses into. Should be set to NULL if ‘row_count’ is set to 0.

Return Value:

  • HLML_SUCCESS if row_count was updated and addresses were populated.

  • HLML_ERROR_INVALID_ARGUMENT if device is invalid, row_count is NULL or row_count is not 0 while addresses is NULL.

  • HLML_ERROR_INSUFFICIENT_SIZE if row_count indicates that the addresses buffer is not large enough to store all the matching replaced rows. row_count will be set to the required size.

  • HLML_ERROR_UNKNOWN on any unexpected error.
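The row_count contract above suggests a two-call pattern: query the size first, then allocate and fetch. The sketch below illustrates that pattern without linking against the library. Everything in it is a stand-in: rc_t, row_address_t, and fake_get_replaced_rows are hypothetical local substitutes for hlml_return_t, hlml_row_address_t, and hlml_device_get_replaced_rows, and the fake simply pretends that two rows were replaced. It also assumes the size query may return either success or the insufficient-size code, since the description does not say which.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Local stand-ins so the sketch compiles without the HLML header. */
typedef enum { RC_SUCCESS, RC_INVALID_ARGUMENT, RC_INSUFFICIENT_SIZE, RC_NO_MEMORY } rc_t;

typedef struct {
    uint8_t hbm_idx, pc, sid, bank_idx;
    uint16_t row_addr;
} row_address_t;

/* Stand-in for the library call: mimics the documented row_count contract
 * for a device that has two replaced rows (made-up data). */
static rc_t fake_get_replaced_rows(unsigned int *row_count, row_address_t *addresses)
{
    static const row_address_t rows[2] = { {0, 1, 0, 3, 100}, {1, 0, 1, 2, 200} };

    if (row_count == NULL || (*row_count != 0 && addresses == NULL))
        return RC_INVALID_ARGUMENT;
    if (*row_count < 2) {
        *row_count = 2; /* report the required size */
        return RC_INSUFFICIENT_SIZE;
    }
    memcpy(addresses, rows, sizeof rows);
    *row_count = 2;
    return RC_SUCCESS;
}

/* Size query first, then fetch. On success, *out holds *count entries
 * (NULL when there are none) and must be released with free(). */
static rc_t fetch_replaced_rows(row_address_t **out, unsigned int *count)
{
    unsigned int n = 0;
    rc_t rc;

    *out = NULL;
    *count = 0;

    rc = fake_get_replaced_rows(&n, NULL); /* row_count = 0, addresses = NULL */
    if (rc != RC_SUCCESS && rc != RC_INSUFFICIENT_SIZE)
        return rc;
    if (n == 0)
        return RC_SUCCESS; /* no matching rows */

    row_address_t *buf = calloc(n, sizeof *buf);
    if (buf == NULL)
        return RC_NO_MEMORY;

    rc = fake_get_replaced_rows(&n, buf); /* second call fills the buffer */
    if (rc != RC_SUCCESS) {
        free(buf);
        return rc;
    }
    *out = buf;
    *count = n;
    return RC_SUCCESS;
}
```

With the real library, the two calls to the stand-in would become calls to hlml_device_get_replaced_rows with the device handle and cause filter. Note that the row count can change between the two calls, so a robust caller may want to retry when the second call reports an insufficient size.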

hlml_return_t hlml_device_get_replaced_rows_pending_status(hlml_device_t device, hlml_enable_state_t *is_pending)

Operation:

Checks if any rows that are pending replacement require a reboot to be replaced.

Parameters:

Parameter

Description

device

[in] The identifier of the target AIP.

is_pending

[out] Reference in which to return the pending status.

Return Value:

  • HLML_SUCCESS if is_pending was updated.

  • HLML_ERROR_INVALID_ARGUMENT if device is invalid or is_pending is NULL.

  • HLML_ERROR_UNKNOWN on any unexpected error.

hlml_return_t hlml_get_hlml_version(char *version, unsigned int length)

Operation:

Returns the version of the HLML library.

Parameters:

Parameter

Description

version

[out] Reference in which to return the version.

length

[in] The maximum allowed length of the string returned in version.

Return Value:

  • HLML_SUCCESS if version has been set.

  • HLML_ERROR_INVALID_ARGUMENT if version is NULL.

  • HLML_ERROR_INSUFFICIENT_SIZE if length is too small.

hlml_return_t hlml_get_driver_version(char *driver_version, unsigned int length)

Operation:

Returns the version of the Intel Gaudi kernel driver.

Parameters:

Parameter

Description

driver_version

[out] Reference in which to return the version.

length

[in] The maximum allowed length of the string returned in driver_version.

Return Value:

  • HLML_SUCCESS if driver_version has been set.

  • HLML_ERROR_INVALID_ARGUMENT if driver_version is NULL.

  • HLML_ERROR_INSUFFICIENT_SIZE if length is too small.

  • HLML_ERROR_UNKNOWN on any unexpected error.

hlml_return_t hlml_get_model_number(hlml_device_t device, char *model_number, unsigned int length)

Operation:

Returns the model number of the device.

Parameters:

Parameter

Description

device

[in] The identifier of the target AIP.

model_number

[out] Reference in which to return the model number.

length

[in] The maximum allowed length of the string returned in model_number.

Return Value:

  • HLML_SUCCESS if model_number has been set.

  • HLML_ERROR_INVALID_ARGUMENT if model_number is NULL.

  • HLML_ERROR_INSUFFICIENT_SIZE if length is too small.

  • HLML_ERROR_NO_DATA if unable to retrieve the model number for any reason.

hlml_return_t hlml_get_firmware_fit_version(hlml_device_t device, char *firmware_fit, unsigned int length)

Operation:

Returns the version of the FIT component of the firmware.

Parameters:

Parameter

Description

device

[in] The identifier of the target AIP.

firmware_fit

[out] Reference in which to return the version.

length

[in] The maximum allowed length of the string returned in firmware_fit.

Return Value:

  • HLML_SUCCESS if firmware_fit has been set.

  • HLML_ERROR_INVALID_ARGUMENT if firmware_fit is NULL.

  • HLML_ERROR_INSUFFICIENT_SIZE if length is too small.

  • HLML_ERROR_UNKNOWN on any unexpected error.

hlml_return_t hlml_get_firmware_spi_version(hlml_device_t device, char *firmware_spi, unsigned int length)

Operation:

Returns the version of the firmware SPI.

Parameters:

Parameter

Description

device

[in] The identifier of the target AIP.

firmware_spi

[out] Reference in which to return the version.

length

[in] The maximum allowed length of the string returned in firmware_spi.

Return Value:

  • HLML_SUCCESS if firmware_spi has been set.

  • HLML_ERROR_INVALID_ARGUMENT if firmware_spi is NULL.

  • HLML_ERROR_INSUFFICIENT_SIZE if length is too small.

  • HLML_ERROR_UNKNOWN on any unexpected error.

hlml_return_t hlml_get_fw_boot_version(hlml_device_t device, char *fw_boot_version, unsigned int length)

Operation:

Returns the version of the firmware U-Boot.

Parameters:

Parameter

Description

device

[in] The identifier of the target AIP.

fw_boot_version

[out] Reference in which to return the version.

length

[in] The maximum allowed length of the string returned in fw_boot_version.

Return Value:

  • HLML_SUCCESS if fw_boot_version has been set.

  • HLML_ERROR_INVALID_ARGUMENT if fw_boot_version is NULL.

  • HLML_ERROR_INSUFFICIENT_SIZE if length is too small.

  • HLML_ERROR_UNKNOWN on any unexpected error.

hlml_return_t hlml_get_fw_os_version(hlml_device_t device, char *fw_os_version, unsigned int length)

Operation:

Returns the version of the firmware operating system.

Parameters:

Parameter

Description

device

[in] The identifier of the target AIP.

fw_os_version

[out] Reference in which to return the version.

length

[in] The maximum allowed length of the string returned in fw_os_version.

Return Value:

  • HLML_SUCCESS if fw_os_version has been set.

  • HLML_ERROR_INVALID_ARGUMENT if fw_os_version is NULL.

  • HLML_ERROR_INSUFFICIENT_SIZE if length is too small.

  • HLML_ERROR_UNKNOWN on any unexpected error.

hlml_return_t hlml_get_cpld_version(hlml_device_t device, char *cpld_version, unsigned int length)

Operation:

Returns the version of the device CPLD.

Parameters:

Parameter

Description

device

[in] The identifier of the target AIP.

cpld_version

[out] Reference in which to return the version.

length

[in] The maximum allowed length of the string returned in cpld_version.

Return Value:

  • HLML_SUCCESS if cpld_version has been set.

  • HLML_ERROR_INVALID_ARGUMENT if cpld_version is NULL.

  • HLML_ERROR_INSUFFICIENT_SIZE if length is too small.

  • HLML_ERROR_UNKNOWN on any unexpected error.

hlml_return_t hlml_device_get_oper_status(hlml_device_t device, char *status, unsigned int length)

Operation:

Retrieves the AIP’s status as a descriptive string. Available values are (case insensitive):

  • operational

  • in reset

  • disabled

  • need reset

  • in device creation

  • in reset after device release

The above values are defined by the LKD and the list might be extended in the future.

Parameters:

Parameter

Description

device

[in] The identifier of the target AIP.

status

[out] Reference to which the descriptive string is copied.

length

[in] The maximum allowed length of the string returned in status.

Return Value:

  • HLML_SUCCESS if the operational status was read correctly.

  • HLML_ERROR_UNINITIALIZED if the library has not been successfully initialized.

  • HLML_ERROR_INVALID_ARGUMENT if device is invalid or status is NULL.

  • HLML_ERROR_INSUFFICIENT_SIZE if length is too small.

  • HLML_ERROR_UNKNOWN on any unexpected error.
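Since the status values are case insensitive and the list may grow, a caller is probably safest comparing without regard to case and treating any unrecognized string as "not operational". The sketch below shows one way to do that in portable C; status_equals and device_is_operational are hypothetical helper names, not part of the library.

```c
#include <ctype.h>

/* Case-insensitive equality for NUL-terminated ASCII strings. */
static int status_equals(const char *status, const char *expected)
{
    while (*status != '\0' && *expected != '\0') {
        if (tolower((unsigned char)*status) != tolower((unsigned char)*expected))
            return 0;
        status++;
        expected++;
    }
    /* Equal only if both strings ended at the same position. */
    return *status == '\0' && *expected == '\0';
}

/* Returns 1 only for "operational"; any other or unknown status counts as not ready. */
static int device_is_operational(const char *status)
{
    return status_equals(status, "operational");
}
```

The status argument would be the buffer filled in by hlml_device_get_oper_status after a successful call.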