Implementing and Integrating New lib¶
Coding¶
To add a new lib that contains your kernel, implement the following components:
Kernels
Glue Code
Tests (optional)
For complete code examples, see Intel Gaudi Custom Kernel.
Kernels¶
The kernel is written in the TPC-C language, as described in TPC Programming Language. The kernel is a main function whose signature contains a list of parameters. Tensors and scalars can be passed as parameters, subject to some restrictions.
The following is an example of a simple kernel:
void main(tensor inputA, tensor outputB)
{
    const int dim0 = 0;
    const int dim1 = 1;
    const int dim2 = 2;
    const int dim3 = 3;
    // Index space coordinates
    const int5 idx_s = get_index_space_offset();
    const int5 idx_e = get_index_space_size() + idx_s;
    int5 ifmCoords = {0, 0, 0, 0, 0};
    float64 in, out;
    for (int idx0 = idx_s[dim0] * 64; idx0 < idx_e[dim0] * 64; idx0 += 64)
    {
        ifmCoords[dim0] = idx0;
        for (int idx3 = idx_s[dim3]; idx3 < idx_e[dim3]; idx3 += 1)
        {
            ifmCoords[dim3] = idx3;
            for (int idx2 = idx_s[dim2]; idx2 < idx_e[dim2]; idx2 += 1)
            {
                ifmCoords[dim2] = idx2;
                for (int idx1 = idx_s[dim1]; idx1 < idx_e[dim1]; idx1 += 1)
                {
                    ifmCoords[dim1] = idx1;
                    in = v_f32_ld_tnsr_i(ifmCoords, inputA);
                    out = v_f32_abs_v(in);
                    f32_st_tnsr_i_v(ifmCoords, outputB, out);
                }
            }
        }
    }
}
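To make the loop structure of the kernel above easier to follow, here is a host-side C++ sketch (not TPC-C) that emulates the same traversal on a flat buffer. It assumes a dense 4-D float tensor with dim0 as the fastest-changing dimension and a dim0 size that is a multiple of 64. The names Shape4 and abs_reference are hypothetical and do not belong to any Intel Gaudi API.

```cpp
#include <array>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical 4-D shape; sizes[0] is the fastest-changing dimension.
using Shape4 = std::array<std::size_t, 4>;

// Host-side emulation of the abs kernel. Each step along dim0 covers 64
// consecutive elements, mirroring one 64-lane vector load and store.
// Assumes sizes[0] is a multiple of 64 and in.size() equals the element count.
std::vector<float> abs_reference(const std::vector<float>& in, const Shape4& sizes)
{
    constexpr std::size_t kLanes = 64;
    std::vector<float> out(in.size());
    for (std::size_t i0 = 0; i0 < sizes[0]; i0 += kLanes)
        for (std::size_t i3 = 0; i3 < sizes[3]; ++i3)
            for (std::size_t i2 = 0; i2 < sizes[2]; ++i2)
                for (std::size_t i1 = 0; i1 < sizes[1]; ++i1) {
                    // Flat offset of coordinate (i0, i1, i2, i3).
                    const std::size_t base =
                        i0 + sizes[0] * (i1 + sizes[1] * (i2 + sizes[2] * i3));
                    for (std::size_t lane = 0; lane < kLanes; ++lane)
                        out[base + lane] = std::fabs(in[base + lane]);
                }
    return out;
}
```

A reference like this can be useful when writing tests, since its output can be compared element by element with the tensor produced by the kernel.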
Glue Code¶
The program and its associated definition set are passed to the graph compiler, to be incorporated into the DNN topology, through a host-side interface called Glue Code. The outer component (GraphCompiler) interacts with the new lib through two connectivity points:
GetKernelNames
HabanaKernel
An example of these two methods can be found in entry_points.cpp.
GetKernelNames¶
gcapi::GlueCodeReturn_t GetKernelNames(_OUT_ char** names,
unsigned* kernelCount,
gcapi::DeviceId_t deviceId);
The method returns the list of exported kernel names. A kernel name must not exceed 64 bytes in length.
names: [out] List of strings to be filled with kernel names.
kernelCount: [in/out]
[in] The maximum number of strings in the 'names' argument.
[out] If the number of kernels <= maximum list length, copy the kernel names into the list (names) and update the number of kernels; otherwise, just update the required list length.
deviceId: [in] The type of device (an enum under tpc_kernel_lib_interface.h). Possible values:
tpc_lib_api::DEVICE_ID_GAUDI
tpc_lib_api::DEVICE_ID_GAUDI2
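The kernelCount behavior described above implies a two-call pattern: query the required length first, then call again with a large enough list. The following is a minimal sketch of that pattern using stand-in types only. ReturnCode, kKernelNames, and get_kernel_names_sketch are hypothetical; the real return codes and types come from the TPC kernel library interface headers. The sketch also assumes each name buffer holds 64 bytes including the terminator, and it omits the deviceId argument.

```cpp
#include <cstring>
#include <string>
#include <vector>

// Hypothetical stand-ins for the real glue-code types.
enum class ReturnCode { Success, InsufficientBuffer };
constexpr unsigned kMaxNameLength = 64; // assumed to include the terminator

// Hypothetical list of kernels exported by this lib.
static const std::vector<std::string> kKernelNames = {"abs_f32", "relu_f32"};

// Fills 'names' when the caller's list is large enough; otherwise only
// reports the required list length through 'kernelCount'.
ReturnCode get_kernel_names_sketch(char** names, unsigned* kernelCount)
{
    const unsigned required = static_cast<unsigned>(kKernelNames.size());
    if (names != nullptr && required <= *kernelCount) {
        for (unsigned i = 0; i < required; ++i) {
            std::strncpy(names[i], kKernelNames[i].c_str(), kMaxNameLength - 1);
            names[i][kMaxNameLength - 1] = '\0';
        }
        *kernelCount = required;
        return ReturnCode::Success;
    }
    *kernelCount = required;
    return ReturnCode::InsufficientBuffer;
}
```

A caller would typically pass a count of 0 on the first call, allocate that many 64-byte buffers, and then call again.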
HabanaKernel¶
tpc_lib_api::GlueCodeReturn
InstantiateTpcKernel(_IN_  const tpc_lib_api::HabanaKernelParams* params,
                     _OUT_ tpc_lib_api::HabanaKernelInstantiation* instance);
typedef struct _HabanaKernelParams
{
    _IN_ int        apiVersion;
    _IN_ DeviceId   deviceId;           /* ASIC ID */
    _IN_ GuidInfo   guid;               /* GUID of the node in the graph */
    _IN_ UserParams nodeParams;         /* Kernel-specific parameters */
    _IN_ Tensor*    inputTensors;       /* Array of input tensors */
    _IN_ uint32_t   inputTensorNr;      /* Number of input tensors */
    _IN_ Tensor*    outputTensors;      /* Array of output tensors */
    _IN_ uint32_t   outputTensorNr;     /* Number of output tensors */
    _IN_ uint32_t   maxAvailableTpc;    /* Maximum number of TPC engines the kernel will execute on */
    _IN_ uint32_t   useDeterministic;   /* Directive to return the deterministic version of the program */
    _IN_ uint64_t   uniqueNodeId;       /* Provided to make tracing in the logs easier */
    _IN_ uint32_t   debugFlags;         /* For internal use; used to debug/profile */
    _IN_ uint16_t   validInputTensors;  /* Valid input tensors bit mask */
    _IN_ uint16_t   validOutputTensors; /* Valid output tensors bit mask */
    uint32_t        reserved[24];
} HabanaKernelParams;
This method is the main entry point of the new kernel lib.
params: [in] The kernel properties:
Requested kernel name and data type (e.g. maxpool_2d_i8, averagepool_2d_f32).
Number of input/output tensors for the kernel.
For each input/output tensor, the graph compiler supplies:
Data type
Size in each dimension
Quantization parameters (scale / zero point)
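The validInputTensors and validOutputTensors fields in the params struct are 16-bit masks. The sketch below shows one way glue code might query such a mask. It assumes that bit i corresponds to tensor i, which the text above does not state explicitly, and both helper names are hypothetical.

```cpp
#include <cstdint>

// Assumption: bit i of the mask corresponds to tensor i.
inline bool is_tensor_valid(std::uint16_t mask, unsigned index)
{
    return index < 16 && ((mask >> index) & 1u) != 0;
}

// Counts how many tensors the mask marks as valid.
inline unsigned count_valid_tensors(std::uint16_t mask)
{
    unsigned n = 0;
    for (unsigned i = 0; i < 16; ++i)
        n += (mask >> i) & 1u;
    return n;
}
```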
typedef struct _HabanaKernelInstantiation
{
    _OUT_   uint32_t             indexSpaceRank;
    _OUT_   uint64_t             indexSpaceGeometry[MAX_INDEX_SPACE_DIM_SIZE];
    _OUT_   TensorAccessPattern* inputTensorAccessPattern;
    _OUT_   TensorAccessPattern* outputTensorAccessPattern;
    _INOUT_ AuxTensor*           auxiliaryTensors; // see comment below
    _INOUT_ uint32_t             auxiliaryTensorNr;
    _INOUT_ DeviceKernel         kernel;
    _OUT_   uint32_t             preferredSplitDim; /* Set a dimension that GC code-gen pass must
                                                       partition with each partition size of 1.
                                                       This dimension must be set as allRequired within
                                                       DimIndexSpaceMapping for all tensors.
                                                       At the GC code_gen level (after slice pass),
                                                       GC must begin with this dimension with each
                                                       partition size of 1 and may continue to other
                                                       dimensions in case preferredSplitDim is smaller
                                                       than maxAvailableTpc
                                                       API:
                                                       0 - Disabled feature (default),
                                                       Otherwise - (preferredSplitDim-1) is dim to split
                                                       E.g. - 1 means dim0 (FCD), etc.
                                                    */
    uint32_t                     reserved[15];
} HabanaKernelInstantiation;
instance: [out] Returned kernel final properties:
Program binary.
Size of index space as described in Index Space.
Index space mapping as described in Index Space Mapping.
Values of scalar parameters given to the TPC-C 'main' function (up to 32 dwords).
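The preferredSplitDim comment in the struct uses an off-by-one encoding: 0 disables the feature, and any other value means dimension (value - 1). The helpers below are a small sketch of that convention; their names and the kNoSplitDim sentinel are hypothetical.

```cpp
#include <cstdint>

// Hypothetical sentinel meaning "no preferred split dimension".
constexpr int kNoSplitDim = -1;

// Encodes a dimension index into the preferredSplitDim convention:
// 0 disables the feature, otherwise the stored value is (dimension + 1).
inline std::uint32_t encode_preferred_split_dim(int dim)
{
    return dim < 0 ? 0u : static_cast<std::uint32_t>(dim) + 1u;
}

// Decodes a preferredSplitDim value back into a dimension index.
inline int decode_preferred_split_dim(std::uint32_t value)
{
    return value == 0 ? kNoSplitDim : static_cast<int>(value) - 1;
}
```

For example, encoding dimension 0 (the FCD) yields 1, which matches the "1 means dim0" note in the comment.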
Glue code should perform the following:
Verify that the input/output tensor properties are correct (that is, they fit the kernel definition):
The input/output tensor count matches the kernel definition.
The input/output tensor dimensions match the kernel definition.
The input/output tensor data types match the kernel definition.
Return the program binary.
Return the size of the index space as described in Index Space.
Return the index space mapping as described in Index Space Mapping.
Return the values of scalar parameters given to the TPC-C 'main' function (up to 32 dwords).
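The validation and index-space steps listed above might look roughly like the following for the abs kernel shown earlier. This is only a sketch under stated assumptions: TensorDesc, Instantiation, Status, and instantiate_abs_f32 are simplified, hypothetical stand-ins for the real params and instance structs, and returning the program binary and the index space mapping is omitted. The index space size along dim0 is assumed to be the tensor size divided by 64 and rounded up, because the kernel advances dim0 in steps of 64 elements.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical, simplified stand-ins for the real glue-code types.
enum class DataType { Float32, Int8 };
enum class Status { Ok, WrongTensorCount, WrongDataType, WrongDims };

struct TensorDesc {
    DataType type;
    std::vector<std::uint64_t> sizes; // sizes[0] is the fastest-changing dim
};

struct Instantiation {
    std::uint32_t indexSpaceRank = 0;
    std::array<std::uint64_t, 5> indexSpaceGeometry{};
};

Status instantiate_abs_f32(const std::vector<TensorDesc>& inputs,
                           const std::vector<TensorDesc>& outputs,
                           Instantiation& instance)
{
    // 1. Tensor count must match the kernel definition (one in, one out).
    if (inputs.size() != 1 || outputs.size() != 1)
        return Status::WrongTensorCount;
    const TensorDesc& in = inputs[0];
    const TensorDesc& out = outputs[0];
    // 2. Data types must match the kernel definition.
    if (in.type != DataType::Float32 || out.type != DataType::Float32)
        return Status::WrongDataType;
    // 3. Dimensions must match: 4-D, with identical input and output sizes.
    if (in.sizes.size() != 4 || in.sizes != out.sizes)
        return Status::WrongDims;

    // 4. Index space: one member per 64 elements on dim0, one per element elsewhere.
    constexpr std::uint64_t kLanes = 64;
    instance.indexSpaceRank = 4;
    instance.indexSpaceGeometry[0] = (in.sizes[0] + kLanes - 1) / kLanes;
    for (std::size_t d = 1; d < 4; ++d)
        instance.indexSpaceGeometry[d] = in.sizes[d];
    return Status::Ok;
}
```

Rejecting a mismatch early with a distinct status keeps a node that does not fit the kernel definition from being instantiated with a wrong index space.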
Build lib Project¶
Building the lib project requires the habanatools package. See the installation instructions provided in the TPC Tools Installation Guide.
Upon successful compilation, the new lib is generated at <build path>/builds/<Debug or Release>/src/lib<name>_kernels.so. This is the plugin shared object to be loaded by the Intel Gaudi software in production.
Print¶
'printf' is a built-in utility function exposed by the TPC compiler to TPC kernel writers. It provides entry-level debugging capabilities on the TPC processor. printf is implemented through an ABI established between the compiler and the Intel Gaudi runtime.
Syntax¶
Printf syntax is identical to the C runtime library syntax, with the following restriction:
Printf accepts at most one variable in addition to the message string.
To enable printf support, define the following pragma:
#pragma tpc_printf(enable)
Scalar printing - Similar to the C library function: printf("depth=%d ", depth);
Vector printing - Use a loop to print the whole vector or just part of it. For example, for a vector of floats (64 elements in a vector): for (int i = 0; i < 64; i++) { printf("%f, ", vec[i]); }
The code below demonstrates the printing format:
#pragma tpc_printf(enable)
void printTest(void)
{
    char char_val = 0xff;
    unsigned char uchar_val = 0xff;
    short short_val = 0xb221;
    unsigned short ushort_val = 0xb221; // 45,601
    int int_val = 0x8455CDD1;
    unsigned int uint_val = 0x8455CDD1; // 2,220,215,761
    bf16 bf16_val = 46.25;
    float float_val = 15.23423;
    /* V_LANE_ID_32 vector, values 0-63 */
    uint64 vec_lane_id = V_LANE_ID_32;
    printf("Test string!\n");
    printf("char value is %hhd\n", char_val);
    printf("unsigned char value is %hhu\n", uchar_val);
    printf("short value is %hd\n", short_val);
    printf("unsigned short value is %hu\n", ushort_val);
    printf("int value is %d\n", int_val);
    printf("unsigned int value is %u\n", uint_val);
    printf("bfloat value is %bf\n", bf16_val);
    //printf("half float value is %hf\n", f16_val);
    printf("float value is %f\n", float_val);
    printf("Vector Print:\n");
    printf("=============\n");
    for (int i = 0; i < 64; i++)
    {
        printf("%u, ", vec_lane_id[i]);
    }
}
Example output:
Test string!
char value is -1
unsigned char value is 255
short value is -19935
unsigned short value is 45601
int value is -2074751535
unsigned int value is 2220215761
bfloat value is 46.250000
float value is 15.234230
Vector Print:
=============
0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, ... , 62, 63
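The negative values in the example output follow from reading the same bit patterns as signed, two's-complement integers. The host-side C++ sketch below reproduces those conversions; the helper names are hypothetical, and the narrowing casts rely on two's-complement wrap-around, which is guaranteed only from C++20 onward but is what mainstream compilers do in practice.

```cpp
#include <cstdint>

// Reinterpret a raw bit pattern as a signed integer of the same width.
inline std::int8_t as_signed8(std::uint8_t bits)
{
    return static_cast<std::int8_t>(bits);
}

inline std::int16_t as_signed16(std::uint16_t bits)
{
    return static_cast<std::int16_t>(bits);
}

inline std::int32_t as_signed32(std::uint32_t bits)
{
    return static_cast<std::int32_t>(bits);
}
```

For instance, 0xb221 is 45601 as an unsigned 16-bit value and 45601 - 65536 = -19935 as a signed one, which matches the short and unsigned short lines in the output.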