Implementing and Integrating New lib


To add a new lib that contains your kernel, implement the following components:

  • Kernels

  • Glue Code

  • Tests (optional)

For complete code examples, see Intel Gaudi Custom Kernel.


Kernels

The kernel is written in the TPC-C language, as described in TPC Programming Language. A kernel is a main function whose signature lists its parameters. Parameters can be tensors or scalars, subject to some restrictions.

The following is an example of a simple kernel:

void main(tensor inputA, tensor outputB)
{
    const int dim0 = 0;
    const int dim1 = 1;
    const int dim2 = 2;
    const int dim3 = 3;
    // Index space coordinates
    const int5 idx_s = get_index_space_offset();
    const int5 idx_e = get_index_space_size() + idx_s;
    int5 ifmCoords = {0, 0, 0, 0, 0};

    float64 in, out;
    // Each index space member on dim0 covers one vector of 64 float elements
    for (int idx0 = idx_s[dim0] * 64; idx0 < idx_e[dim0] * 64; idx0 += 64)
    {
        ifmCoords[dim0] = idx0;
        for (int idx3 = idx_s[dim3]; idx3 < idx_e[dim3]; idx3 += 1)
        {
            ifmCoords[dim3] = idx3;
            for (int idx2 = idx_s[dim2]; idx2 < idx_e[dim2]; idx2 += 1)
            {
                ifmCoords[dim2] = idx2;
                for (int idx1 = idx_s[dim1]; idx1 < idx_e[dim1]; idx1 += 1)
                {
                    ifmCoords[dim1] = idx1;
                    // Load a vector, take the absolute value, store the result
                    in = v_f32_ld_tnsr_i(ifmCoords, inputA);
                    out = v_f32_abs_v(in);
                    f32_st_tnsr_i_v(ifmCoords, outputB, out);
                }
            }
        }
    }
}
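To make the loop structure of the kernel concrete, the following host-side C++ sketch emulates what a single index space partition computes. It is not TPC-C and does not use any Intel Gaudi API: the names Tensor4D and absPartition are hypothetical, the tensor is modelled as a flat array with dim0 as the fastest-changing dimension, and the 64-lane vector load/store is replaced by a scalar inner loop. It also assumes that the size of dim0 is a multiple of 64.

```cpp
#include <array>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical host-side stand-in for a 4D float tensor (dim0 is fastest-changing).
struct Tensor4D {
    std::array<int, 4> size;
    std::vector<float> data;

    explicit Tensor4D(const std::array<int, 4>& s)
        : size(s),
          data(static_cast<std::size_t>(s[0]) * s[1] * s[2] * s[3], 0.0f) {}

    float& at(int d0, int d1, int d2, int d3) {
        const std::size_t idx =
            ((static_cast<std::size_t>(d3) * size[2] + d2) * size[1] + d1) * size[0] + d0;
        return data[idx];
    }
};

constexpr int kLanes = 64;  // number of f32 elements in one vector

// Emulates one partition of the index space: members [start, start + count)
// in each dimension. Each member on dim0 covers kLanes consecutive elements.
void absPartition(Tensor4D& input, Tensor4D& output,
                  const std::array<int, 4>& start,
                  const std::array<int, 4>& count) {
    for (int i0 = start[0] * kLanes; i0 < (start[0] + count[0]) * kLanes; i0 += kLanes)
        for (int i3 = start[3]; i3 < start[3] + count[3]; ++i3)
            for (int i2 = start[2]; i2 < start[2] + count[2]; ++i2)
                for (int i1 = start[1]; i1 < start[1] + count[1]; ++i1)
                    // Scalar replacement for the vector load / abs / store
                    for (int lane = 0; lane < kLanes; ++lane)
                        output.at(i0 + lane, i1, i2, i3) =
                            std::fabs(input.at(i0 + lane, i1, i2, i3));
}
```

Calling absPartition once per partition, with the partitions together covering the whole index space, processes every element exactly once, which is the property the index space offset and size provide on the device.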

Glue Code

The program and its associated definition set are passed to the graph compiler, to be incorporated into the DNN topology, through a host-side interface called Glue Code. The outer component (GraphCompiler) interacts with the new lib through two connectivity points:

  • GetKernelNames

  • InstantiateTpcKernel

Examples of both methods can be found in entry_points.cpp.


GetKernelNames

gcapi::GlueCodeReturn_t GetKernelNames(_OUT_ char**       names,
                                       unsigned*          kernelCount,
                                       gcapi::DeviceId_t  deviceId);

The method returns the list of exported kernel names. A kernel name must not exceed 64 bytes in length.

  • names: [out] List of strings to be filled with kernel names.

  • kernelCount: [in/out]

    • [in] The maximum number of strings that the ‘names’ argument can hold.

    • [out] The number of exported kernels. If this number does not exceed the maximum list length, the kernel names are copied into ‘names’; otherwise, only the required list length is returned.

  • deviceId: [in] The type of device (an enum defined in tpc_kernel_lib_interface.h). Possible values:

    • tpc_lib_api::DEVICE_ID_GAUDI

    • tpc_lib_api::DEVICE_ID_GAUDI2
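The kernelCount contract described above can be sketched as follows. This is a self-contained illustration, not the code from entry_points.cpp: the namespace names_sketch, the enum values, the return codes and the kernel names are all hypothetical stand-ins for the declarations in the real interface header, and the assumption that the second device exports fewer kernels is made up for the example.

```cpp
#include <cstring>

namespace names_sketch {

// Hypothetical stand-ins for the real interface types.
enum GlueCodeReturn { GLUE_SUCCESS = 0, GLUE_LIST_TOO_SHORT = 1 };
enum DeviceId { DEVICE_ID_GAUDI = 0, DEVICE_ID_GAUDI2 = 1 };

constexpr unsigned kMaxNameSize = 64;  // a kernel name must fit in 64 bytes

// Hypothetical kernel names exported by this lib.
constexpr const char* kKernels[] = {"custom_abs_f32", "custom_relu_f32"};

GlueCodeReturn GetKernelNames(char** names, unsigned* kernelCount, DeviceId deviceId) {
    // Assumption for this sketch: the second device exports only the first kernel.
    const unsigned available = (deviceId == DEVICE_ID_GAUDI2) ? 1u : 2u;

    // List missing or too short: report only the required length.
    if (names == nullptr || *kernelCount < available) {
        *kernelCount = available;
        return GLUE_LIST_TOO_SHORT;
    }

    // List is long enough: copy the names and report how many were written.
    for (unsigned i = 0; i < available; ++i) {
        std::strncpy(names[i], kKernels[i], kMaxNameSize - 1);
        names[i][kMaxNameSize - 1] = '\0';
    }
    *kernelCount = available;
    return GLUE_SUCCESS;
}

}  // namespace names_sketch
```

A caller would typically invoke the function twice: once to learn the required list length, and again with a list of that length, each entry pointing to a 64-byte buffer.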


InstantiateTpcKernel

InstantiateTpcKernel(_IN_ const tpc_lib_api::HabanaKernelParams* params,
                     _OUT_ tpc_lib_api::HabanaKernelInstantiation* instance);

typedef struct _HabanaKernelParams
{
   _IN_    int        apiVersion;
   _IN_    DeviceId   deviceId;                /* ASIC ID */
   _IN_    GuidInfo   guid;                    /* GUID of node in the graph */
   _IN_    UserParams nodeParams;              /* Kernel specific parameters */
   _IN_    Tensor*    inputTensors;            /* array of input tensors */
   _IN_    uint32_t   inputTensorNr;           /* the number of input tensors */
   _IN_    Tensor*    outputTensors;           /* array of output tensors */
   _IN_    uint32_t   outputTensorNr;          /* the number of output tensors */
   _IN_    uint32_t   maxAvailableTpc;         /* the maximum number of TPC engines the kernel will execute on */
   _IN_    uint32_t   useDeterministic;        /* directive to return deterministic version of program */
   _IN_    uint64_t   uniqueNodeId;            /* provided to be able to easily trace in the logs */
   _IN_    uint32_t   debugFlags;              /* for internal use; used to debug/profile */
   _IN_    uint16_t   validInputTensors;       /* valid input tensors bit mask */
   _IN_    uint16_t   validOutputTensors;      /* valid output tensors bit mask */
           uint32_t   reserved[24];
} HabanaKernelParams;

This method is the main entry point of the new kernel lib.

  • params: [in] The kernel properties:

    • Requested kernel name and data type (e.g. maxpool_2d_i8, averagepool_2d_f32).

    • Number of input/output tensors for the kernel.

    • For each input/output tensor, the graph compiler supplies:

      • Data type

      • Size in each dimension

      • Quantization parameters (scale/zero point)

typedef struct _HabanaKernelInstantiation
{
   _OUT_   uint32_t             indexSpaceRank;
   _OUT_   uint64_t             indexSpaceGeometry[MAX_INDEX_SPACE_DIM_SIZE];
   _OUT_   TensorAccessPattern* inputTensorAccessPattern;
   _OUT_   TensorAccessPattern* outputTensorAccessPattern;
   _INOUT_ AuxTensor*           auxiliaryTensors; // see comment below
   _INOUT_ uint32_t             auxiliaryTensorNr;
   _INOUT_ DeviceKernel         kernel;
   _OUT_   uint32_t             preferredSplitDim;  /* Set a dimension that GC code-gen pass must
                                                   partition with each partition size of 1.
                                                   This dimension must be set as allRequired within
                                                   DimIndexSpaceMapping for all tensors.
                                                   At the GC code_gen level (after slice pass),
                                                   GC must begin with this dimension with each
                                                   partition size of 1 and may continue to other
                                                   dimensions in case preferredSplitDim is smaller
                                                   than maxAvailableTpc
                                                   0 - Disabled feature (default),
                                                   Otherwise - (preferredSplitDim-1) is dim to split
                                                   E.g. - 1 means dim0 (FCD), etc. */
           uint32_t             reserved[15];
} HabanaKernelInstantiation;

  • instance: [out] The final properties of the returned kernel:

    • Program binary.

    • Size of index space as described in Index Space.

    • Index space mapping as described in Index Space Mapping.

    • Values of the scalar parameters passed to the TPC-C ‘main’ function (up to 32 dwords).
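As an illustration of how the index space fields of the instance might be filled, the sketch below computes the geometry for an elementwise f32 kernel in which each index space member covers 64 elements of dim0, as in the kernel example on this page. It also decodes preferredSplitDim according to the struct comment (0 means disabled, otherwise the value minus one is the dimension to split). The namespace geometry_sketch, the Instantiation struct, the constant kMaxDims and both helper functions are hypothetical simplifications, not the real HabanaKernelInstantiation type.

```cpp
#include <cstdint>

namespace geometry_sketch {

constexpr uint32_t kMaxDims = 5;   // assumed stand-in for MAX_INDEX_SPACE_DIM_SIZE
constexpr uint64_t kLanes = 64;    // f32 elements per vector

// Hypothetical, reduced version of the instantiation struct.
struct Instantiation {
    uint32_t indexSpaceRank;
    uint64_t indexSpaceGeometry[kMaxDims];
    uint32_t preferredSplitDim;
};

// Fills the index space for an elementwise kernel: dim0 is divided into
// vectors of kLanes elements (rounded up), other dimensions map one to one.
void fillIndexSpace(const uint64_t* tensorSizes, uint32_t rank, Instantiation& inst) {
    inst.indexSpaceRank = rank;
    for (uint32_t d = 0; d < kMaxDims; ++d) {
        inst.indexSpaceGeometry[d] = (d < rank) ? tensorSizes[d] : 1;
    }
    if (rank > 0) {
        inst.indexSpaceGeometry[0] = (tensorSizes[0] + kLanes - 1) / kLanes;
    }
    inst.preferredSplitDim = 0;  // feature disabled (default)
}

// Decodes preferredSplitDim: returns -1 when disabled, else the dimension index.
int splitDimension(uint32_t preferredSplitDim) {
    return preferredSplitDim == 0 ? -1 : static_cast<int>(preferredSplitDim) - 1;
}

}  // namespace geometry_sketch
```

With a tensor of size 130 x 4 x 3 x 2, this yields an index space of 3 x 4 x 3 x 2: the last member on dim0 covers a partially filled vector.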

Glue code should perform the following:

  • Verify that the input/output tensor properties are correct (that is, they fit the kernel definition):

    • Input/output tensor count matches the kernel definition.

    • Input/output tensor dimensions match the kernel definition.

    • Input/output tensor data types match the kernel definition.

  • Return program binary.

  • Return size of index space as described in Index Space.

  • Return index space mapping as described in Index Space Mapping.

  • Return the values of the scalar parameters passed to the TPC-C ‘main’ function (up to 32 dwords).
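The verification step in the list above could look roughly like the following for a kernel with one f32 input and one f32 output of identical shape. Everything in the validate_sketch namespace (the Status codes, DataType, TensorDesc and Params) is a hypothetical, reduced stand-in for the real interface types, chosen only to keep the example self-contained.

```cpp
#include <cstdint>

namespace validate_sketch {

// Hypothetical status codes and data types for this sketch.
enum Status { OK, BAD_INPUT_COUNT, BAD_OUTPUT_COUNT, BAD_DATA_TYPE, BAD_DIMENSIONS };
enum DataType { DATA_F32, DATA_I8 };

constexpr uint32_t kMaxDims = 5;

// Hypothetical, reduced tensor descriptor.
struct TensorDesc {
    DataType dataType;
    uint32_t rank;
    uint64_t sizes[kMaxDims];
};

// Hypothetical, reduced version of the kernel parameters.
struct Params {
    const TensorDesc* inputTensors;
    uint32_t inputTensorNr;
    const TensorDesc* outputTensors;
    uint32_t outputTensorNr;
};

// Checks tensor count, data type and dimensions against the kernel definition:
// exactly one f32 input and one f32 output with the same shape.
Status validateAbsF32(const Params& p) {
    if (p.inputTensorNr != 1) return BAD_INPUT_COUNT;
    if (p.outputTensorNr != 1) return BAD_OUTPUT_COUNT;

    const TensorDesc& in = p.inputTensors[0];
    const TensorDesc& out = p.outputTensors[0];
    if (in.dataType != DATA_F32 || out.dataType != DATA_F32) return BAD_DATA_TYPE;

    if (in.rank != out.rank || in.rank > kMaxDims) return BAD_DIMENSIONS;
    for (uint32_t d = 0; d < in.rank; ++d) {
        if (in.sizes[d] != out.sizes[d]) return BAD_DIMENSIONS;
    }
    return OK;
}

}  // namespace validate_sketch
```

Running these checks before returning the program binary means a mismatched node is rejected with a specific error instead of producing a kernel that reads or writes out of bounds.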

Build lib Project

Building the lib project requires the habanatools package. See installation instructions provided in the TPC Tools Installation Guide.

Upon successful compilation, the new lib is generated:

<build path>/builds/<Debug or Release>/src/lib<name> – the plugin shared object to be loaded by the Intel Gaudi software in production.


Print

‘printf’ is a built-in utility function that the TPC compiler exposes to TPC kernel writers. It provides entry-level debugging capabilities on the TPC processor. printf is implemented through an ABI established between the compiler and the Intel Gaudi runtime.


The printf syntax is identical to the C runtime library syntax, with the following restriction:

  • printf accepts at most one variable in addition to the message string.

To enable printf support, define the following pragma:

#pragma tpc_printf(enable)

  • Scalar printing: Similar to the C library function, for example printf("depth=%d ", depth);

  • Vector printing: Use a loop to print the whole vector or just part of it. For example, for a vector of floats (64 elements in a vector): for (int i = 0; i < 64; i++) { printf("%f, ", vec[i]); }

The code below demonstrates the printing format:

#pragma tpc_printf(enable)

void printTest(void)
{
    char char_val = 0xff;
    unsigned char uchar_val = 0xff;
    short short_val = 0xb221;
    unsigned short ushort_val = 0xb221; // 45,601
    int int_val = 0x8455CDD1;
    unsigned int uint_val = 0x8455CDD1; // 2,220,215,761
    bf16 bf16_val = 46.25;
    float float_val = 15.23423;
    /* V_LANE_ID_32 vector, values 0-63 */
    uint64 vec_lane_id = V_LANE_ID_32;

    printf("Test string!\n");
    printf("char value is %hhd\n", char_val);
    printf("unsigned char value is %hhu\n", uchar_val);
    printf("short value is %hd\n", short_val);
    printf("unsigned short value is %hu\n", ushort_val);
    printf("int value is %d\n", int_val);
    printf("unsigned int value is %u\n", uint_val);

    printf("bfloat value is %bf\n", bf16_val);
    //printf("half float value is %hf\n", f16_val);
    printf("float value is %f\n", float_val);
    printf("Vector Print:\n");
    for (int i = 0; i < 64; i++)
    {
        printf("%u, ", vec_lane_id[i]);
    }
}

Example output:

Test string!
char value is -1
unsigned char value is 255
short value is -19935
unsigned short value is 45601
int value is -2074751535
unsigned int value is 2220215761
bfloat value is 46.250000
float value is 15.234230
Vector Print:
0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, ... , 62, 63
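The signed values in the example output follow from reading the same bit patterns as two's complement numbers of the corresponding width. The host-side C++ sketch below reproduces that arithmetic; the helper toSigned is hypothetical, is unrelated to the TPC printf implementation, and supports widths from 1 to 32 bits only.

```cpp
#include <cstdint>

// Interprets the low `width` bits of `bits` as a two's complement signed value.
// Hypothetical helper for illustration; valid for 1 <= width <= 32.
constexpr int64_t toSigned(uint64_t bits, unsigned width) {
    const uint64_t mask = (uint64_t{1} << width) - 1;      // low `width` bits set
    const uint64_t value = bits & mask;
    const uint64_t signBit = uint64_t{1} << (width - 1);
    return (value & signBit)
               ? static_cast<int64_t>(value) - static_cast<int64_t>(mask) - 1  // value - 2^width
               : static_cast<int64_t>(value);
}

// The bit patterns from the printing example, read as signed values.
static_assert(toSigned(0xff, 8) == -1, "char 0xff");
static_assert(toSigned(0xb221, 16) == -19935, "short 0xb221");
static_assert(toSigned(0x8455CDD1, 32) == -2074751535LL, "int 0x8455CDD1");
```

The unsigned format specifiers print the same bits without the subtraction of 2^width, which is why 0xb221 appears as both -19935 and 45601.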