Built-in Special Functions

Overview

The below table describes the available built-in special functions.

return value

Function

float

s_div_fast_f32(float, float, float)

float64

v_acos_cephes_f32(float64)

float64

v_acosh_f32(float64)

float64

v_asin_cephes_f32(float64)

float64

v_asinh_f32(float64)

float64

v_atan_cephes_f32(float64)

float64

v_atanh_f32(float64)

float64

v_cos_f32(float64)

float64

v_cos_fast_f32(float64)

float64

v_cosh_cephes_f32(float64)

float64

v_div_f32(float64, float64)

float64

v_div_fast_f32(float64, float64)

float64

v_exp_cephes_f32(float64)

float64

v_exp_cephes_fast_f32(float64)

float64

v_exp_f32(float64)

float64

v_exp_fast_f32(float64)

short128

v_exp_i16(short128, short)

char256

v_exp_i8(char256, char)

float64

v_expm1_f32(float64)

float64

v_log2_f32(float64)

float64

v_log2_fast_f32(float64)

float64

v_log_f32(float64)

float64

v_log_fast_f32(float64)

float64

v_mod_f32(float64, float64)

float64

v_pow2_f32(float64)

float64

v_pow2_fast_f32(float64)

float64

v_pow_f32(float64, float64)

float64

v_pow_fast_f32(float64, float64)

short128

v_recip_i16(short128, short)

float64

v_reciprocal_f32(float64)

float64

v_reciprocal_fast_f32(float64)

float64

v_rsqrt_f32(float64)

float64

v_rsqrt_fast_f32(float64)

float64

v_sigmoid_f32(float64)

short128

v_sigmoid_i16(short128, short, short, char, char)

float64

v_sin_f32(float64)

float64

v_sin_fast_f32(float64)

float64

v_sinh_cephes_f32(float64)

float64

v_sqrt_f32(float64)

float64

v_sqrt_fast_f32(float64)

float64

v_tan_cephes_f32(float64)

float64

v_tanh_f32(float64)

float64

v_tanh_fast_abs_in_f32(float64, float64)

float64

v_tanh_fast_f32(float64)

short128

v_tanh_i16(short128, short, short, char, char)

INT8/INT16 Built-in Special Functions

The following INT8/INT16 built-in special functions are available:

  • int8 tanh(int8 a);

  • int16 tanh(int16 a);

  • int8 sigmoid(int8 a);

  • int16 sigmoid(int16 a);

  • int8 exp(int8 a); // for X < 0

  • int16 exp(int16 a); // for X < 0

  • 1/x for x in [0.5 , 1)

Intrinsics

A set of intrinsics wrap every TPC machine instruction, allowing all legal combinations of arguments for all supported data types and scalar/vector type domains.

In most cases, the intrinsic function name is derived from the instruction name, instruction data type, return data type width, scalar/vector properties of its arguments and predicate values.

The common intrinsic naming convention adheres to the following pattern

<argument width>_<instruction data type>_<instruction name>_<b|vb>( arguments… );

The argument width can be:

  • v – vector type

  • s – scalar type

  • i - int5 type

The instruction data type:

  • f32 – single-precision floating point

  • i32 – 32-bit signed integer

  • u32 – 32-bit unsigned integer

  • i16 – 16-bit signed integer

  • u16 – 16-bit unsigned integer

  • f16 – Half-precision floating-point format

  • bf16 – bfloat floating-point format

  • i8 – 8-bit signed integer

  • u8 – 8-bit unsigned integer

  • h8 – fp_152 8-bit float number

  • f8 – fp_143 8-bit float number

  • i – int5 data type

Predicate arguments can be:

  • b – scalar boolean

  • bv – vector boolean

Examples:

bool128 v_u16_cmp_less_vb(ushort128 a, ushort128 b, int switches, bool128 income,
                          bool128 predicate, bool polarity=0);

float64 v_f32_mul_vb(float64 a, float64 b, int switches, float64 income,
                     bool64 predicate, bool polarity=0);

Intrinsic Arguments

  1. Operation input - List of source to the operation.

  2. Switches - Instructions extra control flag. Allows to control the instruction behavior.

  3. Income - Working argument to be update.

  4. Predicate - Mask the result of the operation with the income value. The length of the predicate can be either a scalar (all true or all false) or bits vector with a length of the income.

  5. Polarity - Reverse the predicate bits.

Examples:

float64 result = v_f32_mul_vb(a,b, 0, float64 income,
                              bool64 predicate, bool polarity=0);
float64 result;
result = v_f32_mul_vb(float64 a, float64 b, 0, c, predicate, 0);
for(int i=0;i<64;i++)
    result[i] = predicate[i] ? mul(a[i],b[i]) : c[i];

printf

‘printf’ is a built-in, utility function exposed by TPC compiler to TPC kernel writers. Its goal is to enable entry level debugging capabilities in the TPC processor. ‘printf’ is implemented by establishing an ABI between the compiler and Habana runtime.

printf Syntax

‘printf’ syntax is identical to the standard C library syntax with the following restrictions:

  • ‘printf’ accepts, at most, only one variable in addition to the message string.

  • ‘printf’ supports only scalar data types that are supported with TPC processor.

Note

Using #pragma tpc_printf(enable) enables ‘printf’ explicit support.

The compiler is not expected to detect ‘printf’ as a built-in function if the pragma is not enabled. The user can add ‘printf’ in the code. See below a simple printf example:

void main(tensor input)
{

int5 coords = { 0 };
addr0 = (__global__ float*)gen_addr(coords, input);
float val = s_f32_ld_g(__global void *addr);
printf("first value read is %f in tensor\n", val);
return;

}

Extension for Vector Prints

‘printf’ supports printing a single element of a vector using the following syntax:

float64 a = v_f32_ld_tnsr_i_b(input, coords);

printf(“first value read is %f in tensorn”, a[17]);

Element index can be arbitrary integer expression. Out-of-bounds behavior is unspecified. Such use of the subscript operator ([]) is specific to the printf facility. It is not supported in other expressions.

printf ABI

The program accepts a set of tensor objects as its input/output vehicles. When ‘printf’ extension is enabled, an additional hidden tensor will be defined. This hidden tensor may cause the compilation to fail if the program exceeds the number of permitted tensor objects.

The hidden tensor index is equal to the number of tensors the program uses. E.g. if the program uses two tensors, these tensors are numbered with indices 0-1, ‘printf’s hidden tensor occupies slot 2. The hidden tensor is a scalar tensor of type ‘unsigned int’. It is single dimension of arbitrary size.

  • The first entry written to the hidden tensor by a ‘printf’ function is a magic number (0xcdcdcdcd) to identify the beginning of a ‘printf’ session.

  • The second entry is the value to be printed in binary form. If there is no value (string only ‘printf’), this entry should be written nonetheless to keep parsing simple; in this case the written value is ignored.

  • Each next entry holds 4 characters in ascii representation. The compiler should add null termination character. If the printed string is not a multiple of 4, then trailing characters appearing after null termination are ignored.

  • The last entry in the hidden tensor is an additional magic value (0xffffffff). This magic value signifies the end of all ‘printf’ printing.

Index

Value

0

0xcdcdcdcd

1

val – int32 binary value to be printed.

2

‘firs’

3

‘t va’

4

‘lue ’

4

‘read‘

5

‘ is ‘

6

‘%f I’

7

‘n ten’

8

‘sorn’

Table 4.7. Example on how the sentence “first value read is %f in tensorn” is written to the hidden tensor

Note

If the amount of printing exceeds hidden tensor size, then some writes are lost.

Main ABI

The function ‘main’ shall return void. It can accept arbitrary number of formal parameters of scalar types. Only a basic scalar types can be passed to ‘main’ (e.g. int/float). Main cannot accept structs, union or vector types (such as int64, float64 etc).

The parameter list of ‘main’ may contain declarations of tensor handles. Such declaration is a pair of the keyword ‘tensor’ followed by an identifier. The identifiers in such declarations represent integer constants, which are assigned sequential values starting from 0. For instance, the declaration:

void main(tensor input, tensor output, tensor coeff)

is equivalent to the following:

const int input = 0;

const int output = 1;

const int coeff = 2;

void main()

The used architecture defines the number of tensors that may be declared in a program, whether the program uses ‘printf’ and whether it has more than 31 parameters (see below).

Declarations of tensors and parameters are independent; the following declarations are equivalent:

void main(tensor input, tensor output, int x, float y)

void main(tensor input, int x, float y, tensor output)

void main(int x, float y, tensor input, tensor output)

Examples on main definitions:

void main(int a, float b, char c, \_Bfloat16 d, tensor e); // allowed

void main(char256 a); //not allowed

The TPC has 32 scalar registers. The first 32 parameters, excluding tensor objects, are mapped according to their order to scalar registers 0 – 31. A controller, external to the TPC, loads values to these registers before program execution.

If the program requires more than 32 variables, it addresses a hidden tensor to fetch the rest of the argument. The compiler automatically injects code that reads content of this tensor and reads the extra arguments into local variable with the same name as the parameter.

Example on getting values of scalar arguments beyond 32nd explicitly:

In this example ‘arg34’ is a local variable initialized by the value read from the hidden tensor.

void main(tensor a, int arg1, int arg2, int arg3, int arg4, int arg5,

int arg6, int arg7, int arg8, int arg9, int arg10, int arg11,

int arg12, int arg13, int arg14, int arg15, int arg16,

int arg17, int arg18, int arg19, int arg20, int arg21,

int arg22, int arg23, int arg24, int arg25, int arg26,

int arg27, int arg28, int arg29, int arg30, int arg31,

int arg32, int arg33, int arg34)

{

   int5 offset = 0;

   i8_st_tnsr_i_v(offset, a, arg34);

}