Implementing and Integrating a New lib


To add a new lib that contains your kernel, implement the following components:

  • Kernels

  • Glue Code

  • Tests (optional)

For complete code examples, see Habana Custom Kernel.


Kernels

The kernel is written in the TPC-C language, as described in TPC Programming Language. A kernel is a main function whose signature lists its parameters. Parameters can be tensors and scalars, subject to some restrictions.

The following is an example of a simple kernel:

void main(tensor inputA, tensor outputB)
{
    const int dim0 = 0;
    const int dim1 = 1;
    const int dim2 = 2;
    const int dim3 = 3;
    // Index space coordinates
    const int5 idx_s = get_index_space_offset();
    const int5 idx_e = get_index_space_size() + idx_s;
    int5 ifmCoords = {0, 0, 0, 0, 0};

    float64 in, out;
    // Each index-space step along dim0 covers one 64-element vector
    for (int idx0 = idx_s[dim0] * 64; idx0 < idx_e[dim0] * 64; idx0 += 64)
    {
        ifmCoords[dim0] = idx0;
        for (int idx3 = idx_s[dim3]; idx3 < idx_e[dim3]; idx3 += 1)
        {
            ifmCoords[dim3] = idx3;
            for (int idx2 = idx_s[dim2]; idx2 < idx_e[dim2]; idx2 += 1)
            {
                ifmCoords[dim2] = idx2;
                for (int idx1 = idx_s[dim1]; idx1 < idx_e[dim1]; idx1 += 1)
                {
                    ifmCoords[dim1] = idx1;
                    in = v_f32_ld_tnsr_i(ifmCoords, inputA);   // load one vector
                    out = v_f32_abs_v(in);                     // element-wise absolute value
                    f32_st_tnsr_i_v(ifmCoords, outputB, out);  // store the result
                }
            }
        }
    }
}
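
When checking a kernel like this one against expected results, a host-side reference can be useful. The following is a minimal sketch in ordinary C++, not TPC-C. It walks a flat buffer in 64-element steps, the way the outer loop of the kernel walks dim0, and applies an absolute value to each lane. The names kVectorLanes and absReference are hypothetical, and the flat one-dimensional layout is an assumption made to keep the sketch short.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical constant: number of float lanes in one TPC vector (float64).
constexpr std::size_t kVectorLanes = 64;

// Host-side reference for the element-wise abs kernel.
// Assumption: the tensor is flattened to one dimension and its length
// is a multiple of kVectorLanes, so every step covers a full vector.
std::vector<float> absReference(const std::vector<float>& input)
{
    assert(input.size() % kVectorLanes == 0);
    std::vector<float> output(input.size());
    // Outer loop: one iteration per 64-element vector, like idx0 += 64.
    for (std::size_t base = 0; base < input.size(); base += kVectorLanes)
    {
        // Inner loop: models the 64 lanes handled by one vector instruction.
        for (std::size_t lane = 0; lane < kVectorLanes; ++lane)
        {
            output[base + lane] = std::fabs(input[base + lane]);
        }
    }
    return output;
}
```

Comparing the device output with absReference(input) element by element gives a simple pass/fail check for this kind of kernel.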

Glue Code

The program and its associated definition set are passed to the Graph Compiler, to be incorporated into the DNN topology, through a host-side interface called Glue Code. The outer component (the Graph Compiler) interacts with the new lib through two connectivity points:

  • GetKernelNames

  • HabanaKernel

An example of these two methods can be found in entry_points.cpp.


GetKernelNames

gcapi::GlueCodeReturn_t GetKernelNames(_OUT_ char**          names,
                                        unsigned*            kernelCount,
                                        gcapi::DeviceId_t    deviceId);

The method returns the list of exported kernel names. A kernel name must not exceed 64 bytes in length.

  • names: [out] List of strings to be filled with kernel names.

  • kernelCount: [in/out]

    • [in] The maximum number of strings in the ‘names’ argument.

    • [out] If the number of kernels is less than or equal to the maximum list length, the kernel names are copied into the list (names) and kernelCount is set to the number of kernels. Otherwise, only kernelCount is updated, with the required list length.

  • deviceId: [in] The type of device (an enum defined in gc_interface.h). Possible values:

    • gcapi::DEVICE_ID_GOYA

    • gcapi::DEVICE_ID_GAUDI
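
The kernelCount handshake described above can be sketched as follows. This is a simplified stand-alone model, not the real entry point: the sketch namespace, the ReturnCode values, and the kernel names abs_f32 and relu_f32 are hypothetical stand-ins for what gc_interface.h and your lib actually define. It also assumes that each caller-provided name buffer holds 64 bytes, including the terminator.

```cpp
#include <cassert>
#include <cstring>

namespace sketch {

// Hypothetical stand-ins for the declarations in gc_interface.h.
enum ReturnCode { RET_OK, RET_LIST_TOO_SHORT };
enum DeviceId { DEVICE_ID_GOYA, DEVICE_ID_GAUDI };

constexpr unsigned kMaxNameLength = 64;  // 64-byte name limit from the text

// Hypothetical table of kernels exported by this lib.
const char* const kKernelNames[] = {"abs_f32", "relu_f32"};
constexpr unsigned kKernelCount = 2;

ReturnCode GetKernelNames(char** names, unsigned* kernelCount, DeviceId /*deviceId*/)
{
    const unsigned capacity = *kernelCount;  // [in] slots available in 'names'
    *kernelCount = kKernelCount;             // [out] kernel count / required length
    if (names == nullptr || capacity < kKernelCount)
    {
        // List too short: only the required length is reported.
        return RET_LIST_TOO_SHORT;
    }
    for (unsigned i = 0; i < kKernelCount; ++i)
    {
        // Assumption: every buffer in 'names' holds kMaxNameLength bytes.
        std::strncpy(names[i], kKernelNames[i], kMaxNameLength - 1);
        names[i][kMaxNameLength - 1] = '\0';
    }
    return RET_OK;
}

} // namespace sketch
```

A caller would typically invoke the function once with a count of 0 to learn the required length, allocate the list, and then call it again to receive the names.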


HabanaKernel

gcapi::GlueCodeReturn_t
HabanaKernel(_IN_  const gcapi::HabanaKernelParams_t*  params,
             _OUT_ gcapi::HabanaKernelInstantiation_t* instance);

typedef struct _HabanaKernelParams_t
{
    _IN_     int          apiVersion;
    _IN_     char         nodeName[MAX_NODE_NAME];
    _IN_     UserParams_t NodeParams;                    /* user specific. */
    _IN_     DeviceId_t   deviceId;                      /* asic ID */
    _IN_     KernelType_t kernelType;                    /* deprecated */
    _INOUT_  Tensor_t     inputTensors[MAX_TENSOR_NR];   /* array of the input tensor handles. */
    _INOUT_  unsigned     inputTensorNr;                 /* the number of input tensors. */
    _INOUT_  Tensor_t     outputTensors[MAX_TENSOR_NR];  /* array of the output tensor handles. */
    _INOUT_  unsigned     outputTensorNr;                /* the number of output tensors. */
    _IN_     unsigned     debugFlags;                    /* for internal use; used to debug/profile
                                                          * programs. */
    _IN_     unsigned     NodeParamsSize;                /* size of the struct pointed to by NodeParams */
    _IN_     unsigned     maxAvailableTpc;               /* The kernel may be given any number of TPCs between 1 and maxAvailableTpc.
                                                          * Kernels that rely on the number of TPCs in the index space should expose
                                                          * an index space size of maxAvailableTpc.
                                                          * Examples: sparse segment sum, embedding bag kernels, etc. */
             unsigned     reserved[28];
} HabanaKernelParams_t;

This method is the main entry point of the new kernel lib.

  • params: [in] The kernel properties:

    • Requested kernel name and data type (e.g. maxpool_2d_i8, averagepool_2d_f32).

    • Number of input/output tensors for the kernel.

    • For each input/output tensor the Graph Compiler supplies:

      • Data type

      • Size in each dimension

      • Quantization parameters (scale / zero point)

typedef struct _HabanaKernelInstantiation_t
{
    _OUT_   TensorGeometry_t      indexSpaceGeometry;
    _OUT_   TensorAccessPattern_t inputTensorAccessPattern[MAX_TENSOR_NR];
    _OUT_   PadValue              inputPadValues[MAX_TENSOR_NR];
    _OUT_   TensorAccessPattern_t outputTensorAccessPattern[MAX_TENSOR_NR];
    _INOUT_ AuxTensor_t           auxiliaryTensors[MAX_TENSOR_NR];
    _OUT_   unsigned              auxiliaryTensorCount;
    _INOUT_ DeviceKernel_t        kernel;
    _OUT_   ProgramFlags          flags;
    _INOUT_ void*                 kernelElf;
    _INOUT_ unsigned              elfSize;
    _OUT_   PadValue              outputMemsetValues[MAX_TENSOR_NR];
    _OUT_   unsigned              auxNotRequiringInit; /* Bit mask defining which aux tensors
                                                        * should be regarded as SRAM
                                                        * scratch-pad aux tensors */
            unsigned              reserved[16];
} HabanaKernelInstantiation_t;

  • instance: [out] The final properties of the returned kernel:

    • Program binary.

    • Size of index space as described in Index Space.

    • Index space mapping as described in Index Space Mapping.

    • Values of the scalar parameters passed to the TPC-C ‘main’ function (up to 32 dwords).

    • Optionally, decide the pad value of the input tensors.
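
As an illustration of how an index space size might be derived, consider the kernel example earlier in this section, where each index-space step along dim 0 covers 64 elements and the other dimensions advance one element at a time. The sketch below computes the matching sizes. The helper name indexSpaceSizes and the use of std::array are assumptions for illustration only; real glue code would write the result into indexSpaceGeometry.

```cpp
#include <array>
#include <cassert>

// Float lanes per vector, matching the 64-element step of the kernel example.
constexpr unsigned kElementsPerVector = 64;

// Hypothetical helper: index-space sizes for a 4-D tensor whose dim 0 is
// consumed 64 elements at a time and whose other dims are consumed one by one.
std::array<unsigned, 4> indexSpaceSizes(const std::array<unsigned, 4>& tensorSizes)
{
    std::array<unsigned, 4> sizes = tensorSizes;
    // Round up so a partial trailing vector still gets an index-space member.
    sizes[0] = (tensorSizes[0] + kElementsPerVector - 1) / kElementsPerVector;
    return sizes;
}
```

For a tensor of size 130 x 3 x 5 x 2, this gives an index space of 3 x 3 x 5 x 2, because 130 elements need three 64-element steps.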

Glue code should perform the following:

  • Verify that the input/output tensor properties are correct (that is, they fit the kernel definition):

    • The input/output tensor count matches the kernel definition.

    • The input/output tensor dimensions match the kernel definition.

    • The input/output tensor data types match the kernel definition.

  • Return program binary.

  • Return size of index space as described in Index Space.

  • Return index space mapping as described in Index Space Mapping.

  • Return the values of the scalar parameters passed to the TPC-C ‘main’ function (up to 32 dwords).

  • Optionally, decide the pad value of the input tensors.
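
The verification step in the list above can be sketched as follows, assuming a kernel that takes one 4-D f32 input and produces one 4-D f32 output of the same size. All types here (Tensor, KernelParams, DataType, ReturnCode) are simplified, hypothetical stand-ins for the gcapi structures, and the return code names are invented for the sketch.

```cpp
#include <cassert>

namespace sketch {

// Hypothetical, heavily simplified stand-ins for the gcapi types.
enum DataType { DATA_F32, DATA_I8 };
enum ReturnCode
{
    RET_OK,
    RET_BAD_INPUT_COUNT,
    RET_BAD_OUTPUT_COUNT,
    RET_BAD_DIMS,
    RET_BAD_DATA_TYPE
};

struct Tensor
{
    DataType dataType;
    unsigned dims;      // number of dimensions
    unsigned sizes[5];  // size in each dimension
};

struct KernelParams
{
    Tensor   inputTensors[8];
    unsigned inputTensorNr;
    Tensor   outputTensors[8];
    unsigned outputTensorNr;
};

// Checks tensor count, dimensions, and data type against the kernel definition.
ReturnCode validate(const KernelParams& p)
{
    if (p.inputTensorNr != 1)  { return RET_BAD_INPUT_COUNT; }
    if (p.outputTensorNr != 1) { return RET_BAD_OUTPUT_COUNT; }

    const Tensor& in  = p.inputTensors[0];
    const Tensor& out = p.outputTensors[0];
    if (in.dims != 4 || out.dims != 4) { return RET_BAD_DIMS; }
    for (unsigned d = 0; d < 4; ++d)
    {
        // An element-wise kernel needs matching input and output sizes.
        if (in.sizes[d] != out.sizes[d]) { return RET_BAD_DIMS; }
    }
    if (in.dataType != DATA_F32 || out.dataType != DATA_F32) { return RET_BAD_DATA_TYPE; }
    return RET_OK;
}

} // namespace sketch
```

Running these checks before filling in the instance keeps a mismatched node from reaching the kernel with tensors it was not written for.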

Build lib Project

Building the lib project requires the habanatools package. See the installation instructions in the TPC Tools Installation Guide.

Upon successful compilation, the new lib is generated:

<build path>/builds/<Debug or Release>/src/lib<name> – the plugin shared object to be loaded by SynapseAI in production.


‘printf’ is a built-in utility function that the TPC compiler exposes to TPC kernel writers. It provides entry-level debugging capabilities on the TPC processor. printf is implemented through an ABI established between the compiler and the Habana runtime.


The printf syntax is identical to the C runtime library syntax, with the following restriction:

  • printf accepts at most one variable in addition to the message string.
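
Because of this restriction, a message that would normally carry several values has to be split into several calls. The sketch below is ordinary host-side C++ using the standard printf, shown only to illustrate the pattern. The function name printCoords is hypothetical, and the TPC-specific pragma is left out because it applies only to TPC-C sources.

```cpp
#include <cassert>
#include <cstdio>

// Prints two values with one variable per printf call, the pattern that the
// restriction above requires. Returns the total number of characters written.
int printCoords(int x, int y)
{
    int written = 0;
    // Not allowed in a TPC kernel: printf("x=%d y=%d\n", x, y);
    written += std::printf("x=%d ", x);   // first call: message plus one variable
    written += std::printf("y=%d\n", y);  // second call: message plus one variable
    return written;
}
```

Calling printCoords(3, 42) writes the single line "x=3 y=42", assembled from two calls.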

To enable printf support, define the following pragma:

#pragma tpc_printf(enable)

  • Scalar print - Similar to the C library function: printf("depth=%d ", depth);

  • Vector print - Use a loop to print the whole vector or part of it. For example, for a vector of floats (64 elements per vector): for (int i = 0; i < 64; i++) { printf("%f, ", vec[i]); }

The code below demonstrates the printing format:

#pragma tpc_printf(enable)

void printTest(void)
{
    char char_val = 0xff;
    unsigned char uchar_val = 0xff;
    short short_val = 0xb221;
    unsigned short ushort_val = 0xb221; //45,601
    int int_val = 0x8455CDD1;
    unsigned int uint_val = 0x8455CDD1; //2,220,215,761
    bf16 bf16_val = 46.25;
    float float_val = 15.23423;
    /* V_LANE_ID_32 vector, values 0-63 */
    uint64 vec_lane_id = V_LANE_ID_32;

    printf("Test string!\n");
    printf("char value is %hhd\n", char_val);
    printf("unsigned char value is %hhu\n", uchar_val);
    printf("short value is %hd\n", short_val);
    printf("unsigned short value is %hu\n", ushort_val);
    printf("int value is %d\n", int_val);
    printf("unsigned int value is %u\n", uint_val);

    printf("bfloat value is %bf\n", bf16_val);
    //printf("half float value is %hf\n", f16_val);
    printf("float value is %f\n", float_val);
    printf("Vector Print:\n");
    for (int i = 0; i < 64; i++)
    {
        printf("%u, ", vec_lane_id[i]);
    }
}

Example output:

Test string!
char value is -1
unsigned char value is 255
short value is -19935
unsigned short value is 45601
int value is -2074751535
unsigned int value is 2220215761
bfloat value is 46.250000
float value is 15.234230
Vector Print:
0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, ... ,  62, 63