2. TPC User Guide

2.1. Introduction

The Tensor Processor Core™ (TPC™) is a fully programmable VLIW4 processor designed to execute non-linear deep learning operators, such as spatial pooling/batch normalization. It is embedded in Habana’s deep learning accelerators. Habana’s SoC contains numerous TPC cores all operating in parallel, with each core running a single thread. TPC is designed with Very Long Instruction Word (VLIW) architecture. Its wide Single Instruction Multiple Data (SIMD) vector unit supports 2048-bit SIMD operations with data types such as float, bfloat16, INT16, INT32 and INT8. In each cycle, the TPC’s ALU can execute up to 64 floats/INT32 ops, 128 INT16 ops, or 256 INT8 ops.


Figure 2.2 Neural Network Hardware Mapping – Use of MME and TPC

2.2. TPC Programming Language

The TPC core can be programmed using a derivative of C language called TPC-C.

2.2.1. TPC-C

The TPC-C programming language is used to author TPC programs (AKA kernels) that are executed on TPC device(s). The language is based on the ISO/IEC 9899:1999 (C99) specification, with TPC-specific extensions and restrictions; refer to that specification for a detailed description of the base language grammar. The principal extension is a set of vector data types that enable easy utilization of the processor's unique SIMD capabilities.

It has many dedicated features to accelerate DNN ops such as:

  • Tensor-based memory accesses

  • Accelerations for special functions

  • Random number generation

  • Multiple data types similar to the MME

2.2.2. TPC Program Components

A TPC program consists of two parts:

  • TPC code

  • Host glue code

2.2.3. TPC Code

TPC code is the code executed by the TPC processor; it contains the kernel implementation.

2.2.4. Host Glue Code

Host glue code is executed on the host machine served by the Habana DNN SoC. It specifies how the program's inputs/outputs can be dynamically partitioned between the numerous TPC processors in the Habana device.

2.3. Processor Architectural Overview

2.3.1. Instruction Slots and Processor Pipeline

The TPC processor has four execution slots:

  • Load slot - loads from memory, moves and sets values.

  • SPU slot - performs scalar arithmetic.

  • VPU slot - performs vector arithmetic.

  • Store slot - stores to memory, moves and sets values.


Figure 2.3 Example TPC Instruction Assembly – LOAD, SPU, VPU and STORE

TPC has an exposed pipeline architecture. Each instruction has a predefined latency, with four cycles being the most prevalent, and its result becomes visible to software only after that latency period has elapsed.

For example, the latency of multiplication instruction (MUL) is four cycles. In this case, the following code is legal:

  • Initial values are V0 = 0, V1 = 1, V2= 2.

  • ⇓ MUL V0, V1, V2 // V0 = V1*V2 -> V0 == 2.

  • ⇓ MUL V3, V0, 4 // V3 is equal to 0. V0 has not yet been updated.

  • ⇓ MUL V4, V0, 4 // V4 is equal to 0. V0 has not yet been updated.

  • ⇓ MUL V5, V0, 4 // V5 is equal to 0. V0 has not yet been updated.

  • ⇓ MUL V6, V0, 4 // V6 is equal to 8. The first multiplication result is visible.
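The timing above can be reproduced with a small host-side model. The following is an illustrative C sketch (the `Pipe`, `mul_rr`, and `mul_ri` names are ours, not part of any TPC API) of a 4-cycle exposed pipeline in which a result becomes architecturally visible only once its latency has elapsed:

```c
#include <assert.h>
#include <string.h>

/* Host-side model (illustrative, not TPC code) of an exposed pipeline:
 * a MUL issued at cycle c commits its result only at cycle c+4, so the
 * three instructions issued in between still read the old value. */
#define LATENCY 4
#define NREGS 8

typedef struct {
    int regs[NREGS];       /* architectural register values       */
    int pend_reg[LATENCY]; /* destination of each in-flight write */
    int pend_val[LATENCY]; /* value of each in-flight write       */
    int cycle;
} Pipe;

static void pipe_init(Pipe *p) {
    memset(p, 0, sizeof *p);
    for (int i = 0; i < LATENCY; i++) p->pend_reg[i] = -1;
}

/* Commit the write scheduled to land on the current cycle, if any. */
static void commit(Pipe *p) {
    int slot = p->cycle % LATENCY;
    if (p->pend_reg[slot] >= 0) {
        p->regs[p->pend_reg[slot]] = p->pend_val[slot];
        p->pend_reg[slot] = -1;
    }
}

/* MUL dst, srcA, srcB (register * register). */
static void mul_rr(Pipe *p, int dst, int sa, int sb) {
    commit(p);
    p->pend_reg[p->cycle % LATENCY] = dst;
    p->pend_val[p->cycle % LATENCY] = p->regs[sa] * p->regs[sb];
    p->cycle++;
}

/* MUL dst, srcA, imm (register * immediate). */
static void mul_ri(Pipe *p, int dst, int sa, int imm) {
    commit(p);
    p->pend_reg[p->cycle % LATENCY] = dst;
    p->pend_val[p->cycle % LATENCY] = p->regs[sa] * imm;
    p->cycle++;
}

/* Flush all in-flight writes. */
static void drain(Pipe *p) {
    for (int i = 0; i < LATENCY; i++) { commit(p); p->cycle++; }
}
```

Running the five MULs of the example through this model reproduces V3 = V4 = V5 = 0 and V6 = 8.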

2.3.2. Predication

All instructions in the TPC core can be predicated. Each VLIW slot is predicated in a different way:

  • The SPU and store slots support only scalar predication.

  • The VPU and Load slots can be predicated either by a single scalar value or by a bit array enabling masking of specific vector elements.

Predication is exposed to the TPC-C programmer through intrinsics.
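As a rough mental model, the two predication forms behave like the host-side C sketch below (illustrative only: `add_scalar_pred`/`add_vector_pred` are made-up names, and the lane count is shortened to 8):

```c
#include <assert.h>
#include <stdbool.h>

#define LANES 8  /* illustrative; the real VPU has 64 f32 lanes */

/* Scalar predication: the whole operation executes only if 'pred' is true. */
static void add_scalar_pred(int *dst, const int *a, const int *b, bool pred) {
    if (!pred) return;                 /* entire instruction is skipped */
    for (int i = 0; i < LANES; i++) dst[i] = a[i] + b[i];
}

/* Vector predication: only lanes whose mask bit is set are written. */
static void add_vector_pred(int *dst, const int *a, const int *b, const bool *mask) {
    for (int i = 0; i < LANES; i++)
        if (mask[i]) dst[i] = a[i] + b[i];
}
```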

2.3.3. Memory Spaces

The TPC processor has four memory spaces:

  • Scalar Local Memory

  • Vector Local Memory

  • Global Memory

  • Configuration space Global Memory

Global memory is accessed using dedicated accessors called tensors. For more details about tensors, see TPC Programming Model.

Global memory is not coherent with program execution. This means that the program must issue an atomic semaphore operation when performing a read-after-write, in order to guarantee that the write operation result is visible before reading it back. A 2,048-bit vector can be loaded from or written to global memory every four cycles, on average.

Local Memory

Each TPC processor has its own instance of local memory. Each TPC can only access its own local copy. That is, TPC A cannot access TPC B local memory.

Local memory is coherent with program execution and is divided into two banks:

  • Scalar local memory:

    • Size is 1 KB.

    • Reading/writing to this memory is allowed in aligned 4-byte chunks.

  • Vector local memory:

    • Size is 80 KB. If the program utilizes special functions such as tanh, sin, or cos, only 16 KBs are available.

    • Reading/writing to this memory is allowed in aligned 128-/256-byte chunks.

Local memory can be either read from or written to on every cycle with no bandwidth constraint.

Configuration Space

The TPC configuration space holds a set of definitions required to successfully execute a program such as tensor descriptors, program binary location, etc. The “Programming Reference Manual” further describes the structure of the configuration space. Under normal circumstances, a program should not modify the content of the configuration space.

2.4. TPC Programming Model

2.4.1. TPC Program Inputs/Outputs

A TPC program can only accept tensor objects as its input/output vehicles. A tensor is a multidimensional array. A 1D tensor can be thought of as a simple C array. Matrices are two dimensional tensors, which can be either row major or column major. The TPC processor supports tensors with 1–5 dimensions.

To make TPC programming easier, the TPC programming language includes a dedicated, built-in tensor data type. Tensor variables are opaque handles used to access elements of the tensor data structure using Cartesian coordinates. If the Cartesian coordinates of a read operation fall outside the tensor, a special padding value is returned. The padding value is determined by the glue code.

If the Cartesian coordinates of a write operation fall outside the written tensor, the write operation is culled by the memory control unit. The TPC vector units always align with dimension zero (AKA dim0). On dim0, a write operation is partially culled if some values fall inside the tensor and some fall outside it. A TPC program can address up to eight tensors in a single program. This set of tensors can be arbitrarily divided between input and output tensors. The TPC programming language provides dedicated access intrinsics to read/write tensor data (for example, v_f32_ld_tnsr_i/v_f32_st_tnsr_i).
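The read-padding and write-culling rules can be pictured with a minimal host-side C sketch for the 1D case (the `Tensor1D` type and helper names are illustrative, not the TPC API):

```c
#include <assert.h>

/* Illustrative sketch of coordinate-based tensor access: out-of-bounds
 * reads return the tensor's pad value, out-of-bounds writes are culled. */
typedef struct {
    float *data;
    int    size;      /* elements in dim0 */
    float  pad_value; /* determined by the glue code */
} Tensor1D;

static float tensor_read(const Tensor1D *t, int coord) {
    if (coord < 0 || coord >= t->size) return t->pad_value; /* padding */
    return t->data[coord];
}

static void tensor_write(Tensor1D *t, int coord, float v) {
    if (coord < 0 || coord >= t->size) return; /* write is culled */
    t->data[coord] = v;
}
```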

2.4.2. General limitations of TPC-C

Scalar variables can be assigned into a specific vector lane, or broadcasted to all vector datatype lanes, but vector lane values cannot be assigned back into scalar variables.

// supported code
float a = 65;
float64 b = a;

// unsupported code:
float64 b ;
float a = b[44];

A TPC-C program can access at most 16 tensors overall. Any partition between input tensors and output tensors is supported. If the program uses printf, only 15 tensors can be used.

2.4.3. Index Space

The Gaudi ASIC has multiple TPC processors. To divide workloads between them effectively, Habana introduced index spaces. When writing a TPC program, you must define how its inputs/outputs can be partitioned into indivisible units of work; this is done by defining a multi-dimensional index space. The index space may have one to five dimensions, chosen by the kernel writer according to the dimensionality and sizes of the input/output tensors and the semantics of the operation itself.

For example, assume we want to write a program that performs elementwise add operations on two 2D tensors of size (3*192) with a single precision data type.

Since the VPU unit processes 64 single-precision elements in one instruction, an adequate index space for such an input is a two-dimensional index space of size (3,3). A (3,3) index space has nine members, each responsible for processing 64 elements of the input/output tensors, as illustrated in Figure 2.4.


Figure 2.4 Values of Nine Index Space Members in a (3,3) Index Space
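The (3,3) choice follows from simple arithmetic, sketched below in host-side C (assumption: the 192-element dimension is the one consumed 64 single-precision lanes at a time; `ceil_div` and `index_space_members` are illustrative names):

```c
#include <assert.h>

/* Illustrative sketch of index-space sizing: the vectorized dimension is
 * divided by the vector width (64 f32 lanes), rounding up; the other
 * dimension maps one index-space member per element. */
static int ceil_div(int n, int d) { return (n + d - 1) / d; }

static int index_space_members(int vec_dim, int other_dim, int vector_width) {
    return ceil_div(vec_dim, vector_width) * other_dim;
}
```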

In this example, each member of the index space directly correlates to 64 elements in the resulting tensor. The machinery around the TPC may invoke the TPC program several times, each time with a different contiguous subset of the index space. In our example, the program may be invoked only once for the entire index space (0,0)–(2,2), or it may be invoked up to nine times, each time for a single index space member.

When program execution starts, the program should call the built-in functions get_index_space_offset and get_index_space_size to query which subset of the index space it is invoked against (see the example below), and then use the result to perform the actual computation.

Each member of the index space represents a unit of work that is executed on a single TPC and is independent of other index space members. Index space members can be executed at different times and on different TPC engines. Therefore, you cannot assume any order of execution between index space points or try to share data between them. You can assume that each index space member will be invoked exactly once and that all index space members will eventually be invoked.

void main(tensor inputA, tensor inputB, tensor outputC)
{
    int5 start = get_index_space_offset();
    int5 end = start + get_index_space_size();
    int5 targetCoord = { 0 };
    for (int i = start[0]; i < end[0]; i++)
    {
        targetCoord[1] = i;
        for (int j = start[1]; j < end[1]; j++)
        {
            targetCoord[2] = j;
            float64 a = v_f32_ld_tnsr(targetCoord, inputA);
            float64 b = v_f32_ld_tnsr(targetCoord, inputB);
            float64 c = a + b;
            v_f32_st_tnsr(targetCoord, outputC, c); // store the result
        }
    }
}

Depending on a variety of considerations, the machinery around the TPC may invoke the program several times. Each program invocation is called a program instance, and is invoked with a unique contiguous subset of the index space.

Examples of several options that can call the program are as follows:

  • It may invoke the program only once with the following offset/size:

    • Offset (0,0), size (3,3)

  • It may invoke the program three times, once for each row of the tensor:

    • Offset (0,0), size (1,3)

    • Offset (1,0), size (1,3)

    • Offset (2,0), size (1,3)

  • It may invoke the program twice:

    • Offset (0,0), size (2,3)

    • Offset (2,0), size (1,3)

The execution model has two restrictions:

  • No member of the index space can be called twice.

  • All members of the index space are addressed.
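These two restrictions can be checked mechanically. Below is a host-side C sketch (illustrative names; index space dimensions capped at 16 for the demo) that validates a set of invocations against a 2D index space:

```c
#include <assert.h>

/* Illustrative check of the two execution-model restrictions for a 2D
 * index space: every member must be covered by exactly one invocation. */
typedef struct { int off[2]; int size[2]; } Invocation;

static int valid_partition(int dim0, int dim1,
                           const Invocation *inv, int n) {
    int cover[16][16] = {{0}};         /* per-member coverage count (demo cap) */
    for (int k = 0; k < n; k++)
        for (int i = inv[k].off[0]; i < inv[k].off[0] + inv[k].size[0]; i++)
            for (int j = inv[k].off[1]; j < inv[k].off[1] + inv[k].size[1]; j++)
                cover[i][j]++;
    for (int i = 0; i < dim0; i++)
        for (int j = 0; j < dim1; j++)
            if (cover[i][j] != 1) return 0; /* missed or called twice */
    return 1;
}
```

The two-invocation split from the example, offsets (0,0)/(2,0) with sizes (2,3)/(1,3), passes this check for a (3,3) index space.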

2.4.4. Index Space Mapping

When writing a TPC kernel, you can also define how the index space maps to each of the program's input/output tensor elements. For each member of the index space, the pattern specifies which tensor elements are read/written during its associated computation. The mapping takes the index space values as input and produces a range of element coordinates in each dimension, for each input/output tensor. This mapping is intended to help the graph compiler improve pipelining between the MME and the TPC.

This mapping, however, is not mandatory. You can skip it by setting the allRequired flag in the kernel's glue code. The result is still a fully functional kernel, but fine-grain pipelining between the MME and TPC is disabled.

The mapping is a simple linear transformation: startF(x) = starta*x + startb and endF(x) = enda*x + endb. For each dimension of each input/output tensor, you must define which dimension of the index space it maps to, and provide the four constants starta, enda, startb, endb. x is the index space member value as defined in Figure 2.4.



Figure 2.5 1D input tensor, 128 elements

Consider the function abs activated on a 1D single precision tensor of size 128.

When writing glue code, you are more likely to choose a 1D index space of size 2, since the array can be processed with two VPU operations. Index space member (0) should be mapped to applying abs to elements 0-63 of the vector, and index space member (1) should be mapped to applying abs to elements 64-127 in the array.

  • The a/b constants for such use case would be:

    • starta = 64, startb = 0

    • enda = 64, endb = 63

  • The mapping between index space and tensors would be:

    • startF(x) = 64*x + 0

    • endF(x) = 64*x+63

  • When evaluating the first index space element (0)

    • startF(0) = 64*0 + 0 = 0

    • endF(0) = 64*0 + 63 = 63

  • When evaluating the second index space element (1)

    • startF(1) = 64*1 = 64

    • endF(1) = 64*1 + 63 = 127

A set of starta, enda, startb, endb constants is defined for each dimension of each input/output tensor of a kernel.
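The abs example's mapping can be written out as plain C (illustrative host-side code; `DimMapping` is our name, not a gcapi type):

```c
#include <assert.h>

/* Illustrative sketch of the per-dimension linear mapping:
 * startF(x) = starta*x + startb, endF(x) = enda*x + endb. */
typedef struct { int starta, startb, enda, endb; } DimMapping;

static int start_f(const DimMapping *m, int x) { return m->starta * x + m->startb; }
static int end_f  (const DimMapping *m, int x) { return m->enda   * x + m->endb; }
```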

Full end-to-end examples can be found in the GitHub repo under Habana Custom Kernel:

  • /kernels/gaudi/filter_fwd_2d_bf16.c – TPC-C code

  • /src/gaudi_src/filter_fwd_2d_bf16.cpp – Glue code

  • /src/spatial_reduction_kernels.cpp – Glue code

2.4.5. Additional Considerations

  • Several program instances may execute concurrently, as there are several TPC processors in the accelerator. Sharing global memory between program instances is only possible using ASO instructions.

  • The order of instance execution is not guaranteed.

2.4.6. Data Layout for Convolutional Neural Networks

Two allocations represent a tensor in memory – a contiguous slab of memory holding the tensor content, and a tensor descriptor holding stride and size values for each dimension.

For example, a tensor representing a row major matrix of size (3,10) of floats is represented by a 120‑byte array (3*10*4 bytes/element) and a 2D array holding the following values:

  • dim 0 (size =10, stride = 1)

  • dim 1 (size = 3, stride = 10)

The stride value represents the number of elements separating one member of the dimension from the next. The dimension whose stride value equals 1 is called the fastest changing dimension. With TPC, the fastest changing dimension is always dimension 0. Convolutional neural networks accept 3D and 4D tensors as input. Habana devices effectively support only input tensors with an NHWC layout, meaning the channel component must be the fastest changing dimension (dimension 0). A TPC program incorporated into CNNs should assume this layout.
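The descriptor example above translates directly into an offset computation: the flat element offset is the dot product of the coordinates and the per-dimension strides. A host-side C sketch (`DimDesc` and `flat_offset` are illustrative names):

```c
#include <assert.h>

/* Illustrative sketch of a tensor descriptor and the flat-offset
 * computation for the row-major (3,10) float example: dim0 has stride 1
 * (fastest changing), dim1 has stride 10. */
typedef struct { int size; int stride; } DimDesc;

static int flat_offset(const DimDesc *desc, const int *coord, int ndims) {
    int off = 0;
    for (int d = 0; d < ndims; d++) off += coord[d] * desc[d].stride;
    return off;  /* offset in elements, not bytes */
}
```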

2.5. TPC-C Language

The TPC compiler accepts a derivative of the C99 standard C language as input.

2.5.1. Built-in Types

The language adds several built-in types to support the SIMD capabilities of the processor. The table below describes them.

Table 1: Extended TPC Data Types

  • tensor - Opaque handle pointing to a tensor object.

  • int5 - 5-dimensional Cartesian coordinates pointing into a tensor.

  • float64 - 64-element vector; each element is a 4-byte single precision value.

  • bfloat128 - 128-element vector; each element is a 2-byte floating point value.

  • short128 / ushort128 - 128-element vector; each element is a 2-byte signed/unsigned integer value.

  • int64 - 64-element vector; each element is a 4-byte signed integer value.

  • uint64 - 64-element vector; each element is a 4-byte unsigned integer value.

  • char256 - 256-element vector; each element is a 1-byte signed integer value.

  • uchar256 - 256-element vector; each element is a 1-byte unsigned integer value.

  • bool256 - 256-element vector; each element is a 1-bit value. Only logical operations are supported with this type.

2.5.2. Global Memory Space

The global memory space maps to memory external to the TPC processor.

The following apply to global memory:

  • Tensor objects are always nested in global memory.

  • Only built-in gen_addr intrinsics can initialize pointers to global memory. They are immutable: the pointed-to address cannot be changed after initialization.

  • Pointers to global memory are annotated with the __global__ address space qualifier.

  • Only pointers to scalar data types can be initialized.

  • Global memory is not coherent. Call the aso intrinsic when performing a read-after-write operation.

  • Global memory cannot be statically allocated at compile time nor dynamically allocated using C runtime functions, such as malloc/free. Synapse runtime pre-allocates tensors before program execution.

__local__ int localArray[5];

void main (tensor t1)
{
    int5 offset = {0,1,2,3,3};
    __global__ int* pointer = a_gen_addr_i_b(t1, offset);
    int tmp = *pointer;
    tmp = tmp + localArray[0];
    *pointer = tmp;
}

// Illegal syntax - a global pointer cannot point to local memory.
__global__ int* pointer = &(localArray[1]);
// Illegal declaration - the program cannot statically allocate global memory.
__global__ int64 array[64];

2.5.3. Local Memory Space

The local memory space is a private memory space adjacent to each TPC processor. Each TPC processor has its own copy of local memory. Local memory offers improved latency and bandwidth on repetitive read/write operations.

Local memory is statically allocated at compile time through definition of global variables bearing the __local__ address space qualifier. See the example below for reference. The following apply to local memory:

  • Local memory is sequentially consistent with program instance execution. Read-after‑write memory barrier instructions are not needed.

  • Local memory can only be allocated statically at compile time.

  • There are two banks of local memory:

    • The local memory size for scalar types is 1 KB.

    • The local memory size for vector types is either 16 KB or 80 KB. If the program uses special functions, the available VLM size is reduced to 16 KB.

__local__ float64 polynom_constants[3];

void main(tensor inputA, tensor inputB, tensor outputC)
{
    int5 targetCoord = { 0 };
    targetCoord[0] += 1;
    targetCoord[0] += 1;
    // use 'polynom_constants' here
}

Built-in Global Variables

The following built-in global variables are available:

  • LFSR (vector) / S_LFSR (scalar) - Reading these registers is a destructive read: each access yields a different, uniformly distributed pseudo-random result. Writing to these variables is also possible, so seeding is supported:

    char256 seed;

  • LFSR_NO_CHANGE (vector) / S_LFSR_NO_CHANGE (scalar) - Return the next value that LFSR will produce. Reading these variables does not affect the LFSR content.

  • read_lane_id_4b_b() - Returns a 32-bit-wide lane-id vector (uint64): 64 elements of 32 bits each, assigned the values 0, 1, 2, … up to 63.

    uint64 lane_id_32 = read_lane_id_4b_b();

    The lane-id vector can be used to generate a predicate bitmask for a single element of a 32-bit-wide vector type. For example:

    bool256 mask = bv_u32_cmp_eq_v_s(lane_id_32, j); // 0 <= j <= 63

    In conjunction with a move:

    float64 tmpV = v_f32_mov_s_vb(tmpS, tmpV, mask, 0);

  • read_lane_id_2b_b() - Returns a 16-bit-wide lane-id vector (ushort128): 128 elements of 16 bits each, assigned the values 0, 1, 2, … up to 127.

    ushort128 lane_id_16 = read_lane_id_2b_b();

  • read_lane_id_1b_b() - Returns an 8-bit-wide lane-id vector (uchar256): 256 elements of 8 bits each, assigned the values 0, 1, 2, … up to 255.

    uchar256 lane_id_8 = read_lane_id_1b_b();
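The lane-id / compare / predicated-move pattern above can be emulated on the host to see what it computes. Below is an illustrative C sketch (64 lanes of 32-bit elements; `set_lane` is our name, not a TPC intrinsic):

```c
#include <assert.h>

#define LANES 64  /* 32-bit lanes in a 2048-bit vector */

/* Illustrative emulation of the pattern: build a lane-id vector, compare
 * it against a scalar j to get a one-hot mask, then use the mask as a
 * vector predicate to move a scalar into a single lane. */
static void set_lane(float *vec, float scalar, int j) {
    unsigned lane_id[LANES];
    int mask[LANES];
    for (int i = 0; i < LANES; i++) lane_id[i] = (unsigned)i;           /* read_lane_id_4b_b */
    for (int i = 0; i < LANES; i++) mask[i] = (lane_id[i] == (unsigned)j); /* cmp_eq */
    for (int i = 0; i < LANES; i++)                                     /* predicated mov */
        if (mask[i]) vec[i] = scalar;
}
```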

2.6. Built-in Functions

2.6.1. Program Management Special Functions

The following program management special functions are available:

int5 get_index_space_offset()

  • return value - Returns the index space offset for the current program invocation.

int5 get_index_space_size()

  • return value - Returns the index space size for the current program invocation.

unsigned int get_dim_size(tensor a, unsigned int dim)

  • a - [in] Tensor handle.

  • dim - [in] Tensor dimension index to be queried.

  • return value - Tensor dimension size, in elements.

unsigned int get_dim_stride(tensor a, unsigned int dim)

  • a - [in] Tensor handle.

  • dim - [in] Tensor dimension index to be queried.

  • return value - Tensor dimension stride, in elements.

unsigned int get_pad_value_<tensor data type>(tensor a)

  • a - [in] Tensor handle.

  • return value - The tensor's pad value.

This function is supported for the following data types:

  • uint

  • int

  • float

  • bf16

  • short

  • ushort

  • char

  • uchar

void set_pad_value_<tensor data type>(tensor a, <tensor data type> val)

  • a - [in] Tensor handle.

  • val - [in] New pad value to set.

This function supports the following data types:

  • uint

  • int

  • float

  • bf16

  • short

  • ushort

  • char

  • uchar

2.6.2. Built-in Special Functions

Table 2 describes the available built-in special functions.

To use special functions in your TPC-C code, set the following flag in the glue code:

specialFunctionsUsed = 1

Table 2: Built-in Special Functions


Single-precision Floating Point – max ULPs

float64 v_reciprocal_f32(float64 x)


float64 v_sqrt_f32(float64 x)


float64 v_exp_f32(float64 x)


float64 v_exp_cephes_f32(float64 x)


float64 v_log_f32(float64 x)


float64 v_log2_f32(float64 x)


float64 v_tanh_f32(float64 x)


float64 v_pow_f32(float64 x, float64 y)


float64 v_pow2_f32(float64 x)


float64 v_rsqrt_f32(float64 x)


float64 v_div_f32(float64 x, float64 y)


float64 v_sin_f32(float64 x)


float64 v_cos_f32(float64 x)


float64 v_tan_f32(float64 x)


float64 v_sigmoid_f32(float64 input)


float64 v_asin_cephes_f32(float64 input)


float64 v_acos_cephes_f32(float64 input)


float64 v_atan_cephes_f32(float64 input)


float64 v_asinh_f32(float64 input)


float64 v_acosh_f32(float64 input)


float64 v_atanh_f32(float64 input)


float64 v_sinh_cephes_f32(float64 input)


float64 v_cosh_cephes_f32(float64 input)


float64 v_mod_f32(float64 input)


float64 v_expm1_f32(float64 input)

INT8/INT16 Built-in Special Functions

The following INT8/INT16 built-in special functions are available:

  • int8 tanh(int8 a);

  • int16 tanh(int16 a);

  • int8 sigmoid(int8 a);

  • int16 sigmoid(int16 a);

  • int8 exp(int8 a); // for X < 0

  • int16 exp (int16 a); // for X < 0

  • 1/x for x in [0.5, 1)

Intrinsics

Every TPC instruction is wrapped with an intrinsic for every supported data type and scalar/vector argument combination.

The intrinsic function name is usually derived from the instruction name, instruction data type, return data type width, scalar/vector properties of its arguments and predicate values.

The intrinsic naming convention adheres to the following pattern:

<return type width>_<instruction datatype>_<instruction name>_<arg1
width>_<arg2 width>_<b|bv>( arguments… );
  • The return type width can be:

    • v - Vector type

    • av - Augmented vector (4096-bit or 8192-bit vectors)

    • s - Scalar type

    • b - Boolean data type

    • bv - Boolean vector data type

  • The instruction type can be:

    • f32 - Single-precision floating point

    • i32 - 32-bit signed integer

    • u32 - 32-bit unsigned integer

    • bf16 - Brain floating point

    • i16 - 16-bit signed integer

    • u16 - 16-bit unsigned integer

    • i8 - 8-bit signed integer

    • u8 - 8-bit unsigned integer

    • i5 - INT5 data type

  • The argument width can be:

    • s - Scalar data type

    • v - Vector data type

  • Predicate arguments can be:

    • b - Scalar Boolean

    • bv - Vector Boolean

Intrinsic usage example:

bool256 bv_u16_cmp_leq_v_v_b(ushort128 a,ushort128 b, bool
predicate,bool predicatePolarity);

bool256 bv_f32_cmp_leq_v_s_vb(float64 a, float b, bool256 predicate,
bool predicatePolarity);

float64 v_f32_mul_v_v_b(float64 a, float64 b, bool predicate, bool
predicatePolarity);
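As a rough model of how such a predicated intrinsic behaves, consider the host-side C sketch below. It assumes (based on the v_f32_mov_s_vb example earlier, not on an authoritative definition) that predicatePolarity inverts the predicate's sense and that a false effective predicate leaves an income value in place; consult the Programming Reference Manual for the exact semantics:

```c
#include <assert.h>

#define LANES 64

/* Illustrative model of a scalar-predicated vector multiply. The
 * 'income' argument and the polarity interpretation are assumptions. */
static void mul_pred(float *out, const float *income,
                     const float *a, const float *b,
                     int predicate, int predicatePolarity) {
    int take = predicate != predicatePolarity; /* polarity flips the test */
    for (int i = 0; i < LANES; i++)
        out[i] = take ? a[i] * b[i] : income[i];
}
```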

2.6.3. Built-in Vector Reduction Intrinsics

Vector reduction intrinsics provide an easy way to compute the summation, product, minimum, maximum, argmin and argmax of a vector. The vector values are reduced to a single value, which is then broadcast to all lanes of the result vector.

Table 3 describes the available built-in reduction intrinsics for different datatypes.

Table 3: Built-in Reduction Intrinsics

Reduction Intrinsics


float64 v_f32_reduce_add(float64 x)

Summation of all elements of the F32 vector

float64 v_f32_reduce_mul(float64 x)

Product of all elements of the F32 vector

float64 v_f32_reduce_min(float64 x)

Minimum value of all elements of the F32 vector

float64 v_f32_reduce_max(float64 x)

Maximum value of all elements of the F32 vector

uint64_float64_pair_t v_f32_reduce_argmin(float64 x)

Index of the minimum value of all elements of the F32 vector

uint64_float64_pair_t v_f32_reduce_argmax(float64 x)

Index of the maximum value of all elements of the F32 vector

int64 v_i32_reduce_add(int64 x)

Summation of all elements of the I32 vector

int64 v_i32_reduce_max(int64 x)

Maximum value of all elements of the I32 vector

uint64_int64_pair_t v_i32_reduce_argmin(int64 x)

Index of the minimum value of all elements of the I32 vector

uint64_int64_pair_t v_i32_reduce_argmax(int64 x)

Index of the maximum value of all elements of the I32 vector

bfloat128 v_bf16_reduce_add(bfloat128 x)

Summation of all elements of the BF16 vector

bfloat128 v_bf16_reduce_min(bfloat128 x)

Minimum value of all elements of the BF16 vector

bfloat128 v_bf16_reduce_max(bfloat128 x)

Maximum value of all elements of the BF16 vector

short128 v_i16_reduce_min(short128 x)

Minimum value of all elements of the I16 vector

short128 v_i16_reduce_max(short128 x)

Maximum value of all elements of the I16 vector

char256 v_i8_reduce_min(char256 x)

Minimum value of all elements of the I8 vector

char256 v_i8_reduce_max(char256 x)

Maximum value of all elements of the I8 vector

uchar256 v_u8_reduce_min(uchar256 x)

Minimum value of all elements of the U8 vector

uchar256 v_u8_reduce_max(uchar256 x)

Maximum value of all elements of the U8 vector
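The reduce-then-broadcast semantics can be emulated on the host. Below is an illustrative C sketch of the v_f32_reduce_add and v_f32_reduce_argmax behavior (function names are ours, not the intrinsics themselves):

```c
#include <assert.h>

#define LANES 64

/* Illustrative sketch: reduce the vector to one value, then broadcast
 * that value to every lane of the result. */
static void reduce_add_f32(float *out, const float *x) {
    float sum = 0.0f;
    for (int i = 0; i < LANES; i++) sum += x[i];
    for (int i = 0; i < LANES; i++) out[i] = sum;
}

/* argmax returns an (index, value) pair, broadcast the same way. */
static void reduce_argmax_f32(unsigned *out_idx, float *out_val, const float *x) {
    unsigned best = 0;
    for (int i = 1; i < LANES; i++) if (x[i] > x[best]) best = (unsigned)i;
    for (int i = 0; i < LANES; i++) { out_idx[i] = best; out_val[i] = x[best]; }
}
```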

2.6.4. Exceptions to C99 Standard

Initialization of Bool256 Variable

The compiler regards Bool256 as an array of chars with length of 32. Use the following syntax to initialize all bits of the array to one:

bool256 a = {0xff};

Initialization of Local Memory

According to C99: "If an object that has static or thread storage duration is not initialized explicitly and if it has arithmetic type, it is initialized to (positive or unsigned) zero."

For performance reasons, local memory is left uninitialized at the beginning of a program, even though it has static storage duration.

2.7. Implementing and Integrating a New Lib

2.7.1. Coding

Implement the following components in order to add a new lib that contains your implemented kernel:

  • Kernels

  • Glue Code

  • Tests (optional)

For the complete code examples, please visit Habana Custom Kernel.

Kernels

The kernel is written in the TPC-C language as described in TPC Programming Language. A kernel is a main function whose signature contains a list of parameters. Tensors and scalars are allowed as parameters, with some restrictions.

The following is an example of a simple kernel:

void main(tensor inputA, tensor outputB)
{
    const int dim0 = 0;
    const int dim1 = 1;
    const int dim2 = 2;
    const int dim3 = 3;
    // Index space coordinates
    const int5 idx_s = get_index_space_offset();
    const int5 idx_e = get_index_space_size() + idx_s;
    int5 ifmCoords = {0, 0, 0, 0, 0};

    float64 in, out;
    for (int idx0 = idx_s[dim0] * 64; idx0 < idx_e[dim0] * 64; idx0 += 64)
    {
        ifmCoords[dim0] = idx0;
        for (int idx3 = idx_s[dim3]; idx3 < idx_e[dim3]; idx3 += 1)
        {
            ifmCoords[dim3] = idx3;
            for (int idx2 = idx_s[dim2]; idx2 < idx_e[dim2]; idx2 += 1)
            {
                ifmCoords[dim2] = idx2;
                for (int idx1 = idx_s[dim1]; idx1 < idx_e[dim1]; idx1 += 1)
                {
                    ifmCoords[dim1] = idx1;
                    in = v_f32_ld_tnsr_i(ifmCoords, inputA);
                    out = v_f32_abs_v(in);
                    f32_st_tnsr_i_v(ifmCoords, outputB, out);
                }
            }
        }
    }
}

Glue Code

The program and its associated definition set is passed to the Graph Compiler to be incorporated into the DNN topology through a host side interface called Glue Code. The outer component (GraphCompiler) interacts with the new lib through two connectivity points:

  • GetKernelNames

  • HabanaKernel

An example of these two methods is found under entry_points.cpp.

GetKernelNames

gcapi::GlueCodeReturn_t GetKernelNames(_OUT_ char**          names,
                                        unsigned*            kernelCount,
                                        gcapi::DeviceId_t    deviceId);

The method returns the list of exported kernel names. A kernel name must not exceed 64 bytes in length.

  • names: [out] List of strings to be filled with kernel names.

  • kernelCount: [in/out].

    • [in] The maximum number of strings in ‘names’ argument.

    • [out] If the number of kernels is less than or equal to the maximum list length, copy the kernel names into 'names' and update the number of kernels; otherwise, just update the required list length.

  • DeviceId - [in] The type of device (an enum under gc_interface.h). Possible values:

    • gcapi::DEVICE_ID_GOYA

    • gcapi::DEVICE_ID_GAUDI

HabanaKernel
HabanaKernel(_IN_  const gcapi::HabanaKernelParams_t* params,
             _OUT_ gcapi::HabanaKernelInstantiation_t*instance)
typedef struct _HabanaKernelParams_t {
    _IN_     int          apiVersion;
    _IN_     char         nodeName[MAX_NODE_NAME];
    _IN_     UserParams_t NodeParams;                    /* user specific. */
    _IN_     DeviceId_t   deviceId;                      /* asic ID */
    _IN_     KernelType_t kernelType;                    /* deprecated */
    _INOUT_  Tensor_t     inputTensors[MAX_TENSOR_NR];   /* array of the input tensor handles.*/
    _INOUT_  unsigned     inputTensorNr;                 /* the number of input tensors */
    _INOUT_  Tensor_t     outputTensors[MAX_TENSOR_NR];  /* array of the output tensor handles. */
    _INOUT_  unsigned     outputTensorNr;                /* the number of output tensors. */
    _IN_     unsigned     debugFlags;                    /* for internal use.- used to debug/profile
                                                          *  programs. */
    _IN_     unsigned     NodeParamsSize;                /* Size of struct pointed by NodeParams */
    _IN_     unsigned     maxAvailableTpc;               /* Kernel writer should know that it will get any number between 1 and maxAvailableTpc
                                                          * Kernels that rely on number of TPC in the index space should expose index space size with maxAvailableTpc
                                                          * Examples: Sparse segment sum, Embedding bag kernels, etc .. */
             unsigned     reserved[28];
} HabanaKernelParams_t;

This method is the main entry point of the new kernel lib.

  • params:[in] The kernel properties:

    • Requested kernel name and data type (e.g. maxpool_2d_i8 / averagepool_2d_f32 etc).

    • Number of input/output tensor for the kernels.

    • For each input/output tensor the Graph Compiler supplies:

      • Data type

      • Size in each dimension

      • Quantization parameters (scale /zero point)

typedef struct _HabanaKernelInstantiation_t {
    _OUT_   TensorGeometry_t      indexSpaceGeometry;
    _OUT_   TensorAccessPattern_t inputTensorAccessPattern[MAX_TENSOR_NR];
    _OUT_   PadValue              inputPadValues[MAX_TENSOR_NR];
    _OUT_   TensorAccessPattern_t outputTensorAccessPattern[MAX_TENSOR_NR];
    _INOUT_ AuxTensor_t           auxiliaryTensors[MAX_TENSOR_NR];
    _OUT_   unsigned              auxiliaryTensorCount;
    _INOUT_ DeviceKernel_t        kernel;
    _OUT_   ProgramFlags          flags;
    _INOUT_ void*                 kernelElf;
    _INOUT_ unsigned              elfSize;
    _OUT_   PadValue              outputMemsetValues[MAX_TENSOR_NR];
    _OUT_   unsigned              auxNotRequiringInit; /* This is a bit mask is defining which aux
                                                        *  tensor should be regarded as SRAM
                                                        * scratch pad aux tensor*/
            unsigned              reserved[16];
} HabanaKernelInstantiation_t;
  • instance:[out] Returned kernel final properties.

    • Program binary.

    • Size of index space as described in Index Space.

    • Index space mapping as described in Index Space Mapping.

    • Values of the scalar parameters given to the TPC-C ‘main’ function (up to 32 dwords).

    • Optionally, decide the pad value of the input tensors.

Glue code should perform the following:

  • Verify that the input/output tensor properties are correct (fit the kernel definition):

    • Input/output tensor count matches the kernel definition.

    • Input/output tensor dimensions match the kernel definition.

    • Input/output tensor data types match the kernel definition.

  • Return program binary.

  • Return size of index space as described in Index Space.

  • Return index space mapping as described in Index Space Mapping.

  • Return the values of the scalar parameters given to the TPC-C ‘main’ function (up to 32 dwords).

  • Optionally, decide the pad value of the input tensors.
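The validation flow above can be sketched as a minimal glue-code entry point. Note that the struct fields, the return codes, and the kernel name below are simplified stand-ins for illustration; they are not the real Habana headers, which define richer `HabanaKernelParams_t`/`HabanaKernelInstantiation_t` structures.

```c
/* Hedged sketch of a glue-code entry point: verify tensor counts against
 * the kernel definition, then fill in the instantiation properties.
 * All types and names here are illustrative, not the actual Habana API. */
#include <string.h>

typedef enum {
    GLUE_SUCCESS = 0,
    GLUE_INCOMPATIBLE_NODE_NAME,
    GLUE_INCOMPATIBLE_TENSOR_COUNT
} GlueReturnSketch_t;

typedef struct {
    const char* nodeName;       /* requested kernel GUID, e.g. "my_relu_f32" */
    unsigned    inputTensorNr;  /* number of input tensors */
    unsigned    outputTensorNr; /* number of output tensors */
} KernelParamsSketch_t;

typedef struct {
    unsigned indexSpaceDims;    /* size of the index space (see Index Space) */
} KernelInstantiationSketch_t;

GlueReturnSketch_t my_relu_f32_glue(const KernelParamsSketch_t* params,
                                    KernelInstantiationSketch_t* instance)
{
    /* Verify the requested GUID matches this kernel. */
    if (strcmp(params->nodeName, "my_relu_f32") != 0)
        return GLUE_INCOMPATIBLE_NODE_NAME;
    /* Verify tensor counts match the kernel definition (1 in, 1 out). */
    if (params->inputTensorNr != 1 || params->outputTensorNr != 1)
        return GLUE_INCOMPATIBLE_TENSOR_COUNT;
    /* Return the index space size; a 1-D space for an elementwise op. */
    instance->indexSpaceDims = 1;
    return GLUE_SUCCESS;
}
```

In the real library, this function would also return the program binary, the index space mapping, and the scalar parameters described above.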

2.7.2. Build

Building the lib project requires the habanatools package. See installation instructions provided in the TPC Tools Installation Guide.

Upon successful compilation, the new lib is generated:

<build path>/builds/<Debug or Release>/src/lib<name>_kernels.so – the plugin shared object to be loaded by Synapse in production.

2.7.3. Print

‘printf’ is a built-in utility function exposed by the TPC compiler to TPC kernel writers. It enables entry-level debugging capabilities on the TPC processor. printf is implemented by establishing an ABI between the compiler and the Habana runtime.

Syntax

The printf syntax is identical to the C runtime library syntax, with the following restriction:

  • printf accepts at most one variable in addition to the format string.

To enable printf support, define the following pragma:

#pragma tpc_printf(enable)

  • Scalar printing - similar to the C library function: printf("depth=%d ", depth);

  • Vector printing - use a loop to print the whole vector or part of it. For example, for a vector of floats (64 elements per vector): for (int i = 0; i < 64; i++) { printf("%f, ", vec[i]); }

The code below demonstrates the printing format:

#pragma tpc_printf(enable)

void printTest(void)
{
    char char_val = 0xff;
    unsigned char uchar_val = 0xff;
    short short_val = 0xb221;
    unsigned short ushort_val = 0xb221; //45,601
    int int_val = 0x8455CDD1;
    unsigned int uint_val = 0x8455CDD1; //2,220,215,761
    bf16 bf16_val = 46.25;
    float float_val = 15.23423;
    /* V_LANE_ID_32 vector, values 0-63 */
    uint64 vec_lane_id = V_LANE_ID_32;

    printf("Test string!\n");
    printf("char value is %hhd\n", char_val);
    printf("unsigned char value is %hhu\n", uchar_val);
    printf("short value is %hd\n", short_val);
    printf("unsigned short value is %hu\n", ushort_val);
    printf("int value is %d\n", int_val);
    printf("unsigned int value is %u\n", uint_val);

    printf("bfloat value is %bf\n", bf16_val);
    //printf("half float value is %hf\n", f16_val);
    printf("float value is %f\n", float_val);
    printf("Vector Print:\n");
    for (int i = 0; i < 64; i++)
        printf("%u, ", vec_lane_id[i]);
}

Example output:

Test string!
char value is -1
unsigned char value is 255
short value is -19935
unsigned short value is 45601
int value is -2074751535
unsigned int value is 2220215761
bfloat value is 46.250000
float value is 15.234230
Vector Print:
0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, ... ,  62, 63

2.8. TPC Coherency

TPC has a scalar D$ to keep coherency between its scalar load/store accesses to global memory, but it has no vector data cache for its vector accesses to global memory. Vector loads are issued in the scalar pipe, a special HW mechanism hides the global memory latency, and the data is read later in the vector pipe. If a load arrives at the vector pipe before the data is ready, the vector pipe stalls. Vector stores are issued in the vector pipe after the vector data is ready.

Therefore, a store that comes after a load always keeps data coherency, because it is issued only after the load data returns to the TPC. A vector load that follows a vector store to the same address, however, is not coherent: it is unknown whether the returned load data is the old data (before the store) or the new data (after the store, as it should be). In that case, the SPU pipe must be stalled until the vector store is complete, i.e. until the data is written back to global memory. The ASO (Atomic Semaphore Operation) instruction performs this stall: it ensures the TPC commits all older writes (in the vector pipe) before updating the semaphore (in the scalar pipe). When coherency between vector load and scalar store accesses is required, use explicit fencing (cache_invalidate).

The following table summarizes all global load and store instructions, and the HW support for coherency in each case.


st_tnsr* refers to st_tnsr, st_tnsr_low, and st_tnsr_high.

Table 4: Different Coherency Cases

Older instruction       Younger instruction     Coherency kept by HW
----------------------  ----------------------  ----------------------------------------------
ld_tnsr/ld_g to VRF     st_tnsr*                Yes - pipeline structure (ld_tnsr will always
                                                retire before the younger st_tnsr is issued)
st_tnsr*                ld_tnsr/ld_g to VRF     No - use ASO to stall until the store
                                                completes
st_tnsr*                ld_g scalar/prefetch    No - use explicit fencing (cache_invalidate)

2.9. Multiple Kernel Libraries

GC_KERNEL_PATH may define multiple libraries, separated by colons. For example:

export GC_KERNEL_PATH=/home/labuser/workspace/Habana_Custom_Kernel/build/src/
  • If you intend to also use kernels not written by you, augment Habana’s perf lib list with your proprietary kernel library.

  • When several identical GUIDs are available in the perf-lib list, the Graph Compiler picks the first one it finds, according to the order of libs in GC_KERNEL_PATH.

For example, if you want to override our implementation, define:

export GC_KERNEL_PATH=/home/labuser/workspace/Habana_Custom_Kernel/build/src/

Otherwise:

export GC_KERNEL_PATH=/usr/lib/habanalabs/libtpc_kernels.so:
  • If a specific data type is not found, the Graph Compiler injects cast nodes to F32 into the model and searches again.
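The first-match rule above follows the order of libraries in GC_KERNEL_PATH. A minimal shell sketch of that search order (the custom-library path below is illustrative):

```shell
# Illustrative sketch: the Graph Compiler resolves a duplicate GUID to the
# first library listed in GC_KERNEL_PATH, left to right.
GC_KERNEL_PATH="/home/labuser/my_kernels/libcustom_kernels.so:/usr/lib/habanalabs/libtpc_kernels.so"

# Split on ':' and print the search order; the first lib listed wins.
IFS=':' read -ra LIBS <<< "$GC_KERNEL_PATH"
echo "search order:"
for lib in "${LIBS[@]}"; do
    echo "$lib"
done
```

With this ordering, a GUID implemented in both libraries resolves to the custom library; reversing the two paths restores Habana's implementation.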

2.10. Abbreviations




TPC     Tensor Processing Core

VPU     Vector Processing Unit

SPU     Scalar Processing Unit

PE      Processor Element - 32-bit arithmetic building block

VLM     Vector Local Memory

SLM     Scalar Local Memory

VLIW    Very Long Instruction Word

SIMD    Single Instruction Multiple Data

IRF     Index Register File

SRF     Scalar Register File of the scalar pipe

VRF     Vector Register File

SPRF    Scalar Predicate Register File

VPRF    Vector Predicate Register File

ADRF    Address and $ Attributes Register File

LUT     Look Up Table

VA      Virtual Address

I$      Instruction Cache

GEMM    General Matrix Multiply

MME     Matrix Multiply Engine

DNN     Deep Neural Network