Built-in Functions

Program Management Special Functions

The following lists the available program management special functions:

int5 get_index_space_offset()

Parameter	Description
return value	Returns the index space offset for the current program invocation.

int5 get_index_space_size()

Parameter	Description
return value	Returns the index space size for the current program invocation.
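
A kernel usually combines these two values to find the slice of the index space it has to process. The sketch below models that in plain host-side C, not TPC-C: `int5_model`, `index_space_end` and `count_members_2d` are hypothetical names, and treating the slice as the half-open range from offset to offset plus size is an assumption about how the two values are meant to be used.

```c
#include <stddef.h>

/* Hypothetical host-side stand-in for the TPC-C int5 type. */
typedef struct { int v[5]; } int5_model;

/* Exclusive end coordinate of this invocation's slice, assuming the
   slice is the half-open range [offset, offset + size) per dimension. */
static int5_model index_space_end(int5_model offset, int5_model size)
{
    int5_model end;
    for (size_t d = 0; d < 5; ++d)
        end.v[d] = offset.v[d] + size.v[d];
    return end;
}

/* Count the index-space members visited when looping over
   dimensions 0 and 1 of the slice. */
static long count_members_2d(int5_model start, int5_model end)
{
    long count = 0;
    for (int d1 = start.v[1]; d1 < end.v[1]; ++d1)
        for (int d0 = start.v[0]; d0 < end.v[0]; ++d0)
            ++count;
    return count;
}
```

In a real kernel the loop body would load, process and store the tensor data that belongs to each index-space member.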

unsigned int get_dim_size(tensor a, unsigned int dim)

Parameter	Description
a	[in] Tensor handle.
dim	[in] Tensor dimension index to be queried.
return value	Tensor dimension size, in elements.

unsigned int get_dim_stride(tensor a, unsigned int dim)

Parameter	Description
a	[in] Tensor handle.
dim	[in] Tensor dimension index to be queried.
return value	Tensor dimension stride, in elements.
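
Because sizes and strides are both reported in elements, they can be combined into a linear element offset. The following is a minimal host-side C sketch, not TPC-C: `tensor_model` and the `model_*` functions are hypothetical stand-ins for the tensor handle and the two built-ins, and the dense layout with dimension 0 changing fastest is an assumption made for the example.

```c
#define MODEL_DIMS 5

/* Hypothetical host-side tensor descriptor; not the TPC-C tensor handle. */
typedef struct {
    unsigned int size[MODEL_DIMS];   /* elements per dimension */
    unsigned int stride[MODEL_DIMS]; /* elements between neighbors */
} tensor_model;

static unsigned int model_get_dim_size(const tensor_model *t, unsigned int dim)
{
    return t->size[dim];
}

static unsigned int model_get_dim_stride(const tensor_model *t, unsigned int dim)
{
    return t->stride[dim];
}

/* Fill in strides for a densely packed tensor (assumed layout:
   dimension 0 is the fastest-changing one). */
static void model_set_dense_strides(tensor_model *t)
{
    unsigned int step = 1;
    for (unsigned int d = 0; d < MODEL_DIMS; ++d) {
        t->stride[d] = step;
        step *= model_get_dim_size(t, d);
    }
}

/* Linear element offset of a coordinate: sum of coord[d] * stride[d]. */
static unsigned long model_element_offset(const tensor_model *t,
                                          const unsigned int coord[MODEL_DIMS])
{
    unsigned long offset = 0;
    for (unsigned int d = 0; d < MODEL_DIMS; ++d)
        offset += (unsigned long)coord[d] * model_get_dim_stride(t, d);
    return offset;
}
```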

unsigned int get_pad_value_<tensor data type>(tensor a)

Parameter	Description
a	[in] Tensor handle.
return value	Tensor’s pad value.

This function supports the following data types:

uint
int
float
bf16
short
ushort
char
uchar

void set_pad_value_<tensor data type>(tensor a, <tensor data type> val)

Parameter	Description
a	[in] Tensor handle.
val	[in] New pad value to set.

This function supports the following data types:

uint
int
float
bf16
short
ushort
char
uchar
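
The role of the pad value can be illustrated with a small host-side C model. This is a sketch under a stated assumption that the text above does not spell out: that a load from a coordinate outside the tensor yields the pad value. `tensor1d_model`, `model_set_pad_value_float`, `model_get_pad_value_float` and `model_load` are hypothetical names, not TPC-C functions.

```c
/* Hypothetical 1-D tensor model; not the TPC-C tensor handle. */
typedef struct {
    unsigned int size;   /* number of valid elements */
    const float *data;   /* element storage */
    float pad_value;     /* value assumed to be returned out of bounds */
} tensor1d_model;

static void model_set_pad_value_float(tensor1d_model *t, float val)
{
    t->pad_value = val;
}

static float model_get_pad_value_float(const tensor1d_model *t)
{
    return t->pad_value;
}

/* Assumed behavior: loads outside the tensor return the pad value. */
static float model_load(const tensor1d_model *t, int index)
{
    if (index < 0 || (unsigned int)index >= t->size)
        return t->pad_value;
    return t->data[index];
}
```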

Built-in Special Functions

Table 2 describes the available built-in special functions.

To use special functions in your TPC-C code, set the following flag in the glue code:

specialFunctionsUsed = 1

Table 2: Built-in Special Functions

Function	Single-precision Floating Point – max ULPs
float64 v_reciprocal_f32(float64 x)	2
float64 v_sqrt_f32(float64 x)	2
float64 v_exp_f32(float64 x)	2
float64 v_exp_cephes_f32(float64 x)	1
float64 v_log_f32(float64 x)	3
float64 v_log2_f32(float64 x)	3
float64 v_tanh_f32(float64 x)	3
float64 v_pow_f32(float64 x, float64 y)	20
float64 v_pow2_f32(float64 x)	2
float64 v_rsqrt_f32(float64 x)	3
float64 v_div_f32(float64 x, float64 y)	2
float64 v_sin_f32(float64 x)	2
float64 v_cos_f32(float64 x)	2
float64 v_tan_f32(float64 x)	3
float64 v_sigmoid_f32(float64 input)	16
float64 v_asin_cephes_f32(float64 input)	2
float64 v_acos_cephes_f32(float64 input)	2
float64 v_atan_cephes_f32(float64 input)	3
float64 v_asinh_f32(float64 input)	6
float64 v_acosh_f32(float64 input)	10
float64 v_atanh_f32(float64 input)	3
float64 v_sinh_cephes_f32(float64 input)	3
float64 v_cosh_cephes_f32(float64 input)	3
float64 v_mod_f32(float64 input)	70
float64 v_expm1_f32(float64 input)	10
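
To relate the ULP column to measured results, it helps to be able to count ULPs between a computed value and a reference. The sketch below is plain host-side C and is not part of the TPC-C library: `ordered_bits` and `ulp_distance` are hypothetical helpers, and the assumption is that the table's bound refers to the distance, in representable single-precision values, between the function result and a correctly rounded reference.

```c
#include <stdint.h>
#include <string.h>

/* Map a float's bit pattern to an integer that increases monotonically
   with the float's value (finite inputs only). */
static int64_t ordered_bits(float f)
{
    uint32_t u;
    memcpy(&u, &f, sizeof u);             /* well-defined type punning */
    if (u & 0x80000000u)                   /* negative: sign-magnitude */
        return -(int64_t)(u & 0x7FFFFFFFu);
    return (int64_t)u;
}

/* Number of representable floats between a and b (0 if equal). */
static int64_t ulp_distance(float a, float b)
{
    int64_t d = ordered_bits(a) - ordered_bits(b);
    return d < 0 ? -d : d;
}
```

A test harness could copy a result vector back to the host, call `ulp_distance` per lane against a double-precision reference rounded to float, and compare the largest distance with the table entry.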

INT8/INT16 Built-in Special Functions

The following lists the available INT8/INT16 built-in special functions:

int8 tanh(int8 a);
int16 tanh(int16 a);
int8 sigmoid(int8 a);
int16 sigmoid(int16 a);
int8 exp(int8 a); // for a < 0
int16 exp(int16 a); // for a < 0
1/x for x in [0.5, 1)

Built-in Shuffle Functions

Shuffle functions perform element-wise operations on a vector and are available for 8-bit, 16-bit, and 32-bit elements.

The following sections list the available shuffle functions.

v_broadcast_element

Parameter	Description
input	[in] Vector.
N	[in] LANE_ID of the element to broadcast. 0 <= N <= max lane id.
return value	Vector with the selected element broadcast to all lanes.

char256 v_broadcast_element_i8(char256 input, int N)
uchar256 v_broadcast_element_u8(uchar256 input, int N)
short128 v_broadcast_element_i16(short128 input, int N)
ushort128 v_broadcast_element_u16(ushort128 input, int N)
bfloat128 v_broadcast_element_bf16(bfloat128 input, int N)
half128 v_broadcast_element_f16(half128 input, int N) (for supported devices only)
int64 v_broadcast_element_i32(int64 input, int N)
uint64 v_broadcast_element_u32(uint64 input, int N)
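
The effect of a broadcast can be modeled on the host in a few lines of C. This is only an illustrative sketch: `bcast_vec` and `model_broadcast_element_i32` are hypothetical names, and the 64-lane width is taken from the `int64` type name used in the signatures above.

```c
#define BCAST_LANES 64  /* 32-bit elements: 64 lanes, as in int64 */

/* Hypothetical host-side stand-in for the TPC-C int64 vector type. */
typedef struct { int lane[BCAST_LANES]; } bcast_vec;

/* Copy the element in lane n to every lane of the result.
   The caller must keep n within 0..BCAST_LANES-1. */
static bcast_vec model_broadcast_element_i32(bcast_vec input, int n)
{
    bcast_vec out;
    for (int i = 0; i < BCAST_LANES; ++i)
        out.lane[i] = input.lane[n];
    return out;
}
```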

v_element_shift_up

Parameter	Description
input	[in] Vector.
N	[in] The shift value. 0 < N <= max lane id.
return value	Vector with each element shifted up by N lanes. The shift is cyclic, so elements that leave one end of the vector re-enter at the other.

char256 v_element_shift_up_i8(char256 input, int N)
uchar256 v_element_shift_up_u8(uchar256 input, int N)
short128 v_element_shift_up_i16(short128 input, int N)
ushort128 v_element_shift_up_u16(ushort128 input, int N)
bfloat128 v_element_shift_up_bf16(bfloat128 input, int N)
half128 v_element_shift_up_f16(half128 input, int N) (for supported devices only)
int64 v_element_shift_up_i32(int64 input, int N)
uint64 v_element_shift_up_u32(uint64 input, int N)

v_element_shift_down

Parameter	Description
input	[in] Vector.
N	[in] The shift value. 0 < N <= max lane id.
return value	Vector with each element shifted down by N lanes. The shift is cyclic, so elements that leave one end of the vector re-enter at the other.

char256 v_element_shift_down_i8(char256 input, int N)
uchar256 v_element_shift_down_u8(uchar256 input, int N)
short128 v_element_shift_down_i16(short128 input, int N)
ushort128 v_element_shift_down_u16(ushort128 input, int N)
bfloat128 v_element_shift_down_bf16(bfloat128 input, int N)
half128 v_element_shift_down_f16(half128 input, int N) (for supported devices only)
int64 v_element_shift_down_i32(int64 input, int N)
uint64 v_element_shift_down_u32(uint64 input, int N)
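
The cyclic behavior of the two shift functions can be sketched with a host-side C model. The names `shift_vec`, `model_shift_up` and `model_shift_down` are hypothetical, and the direction is an assumption: here "up" moves the element in lane i to lane i + N (modulo the lane count) and "down" is its inverse. Check the direction against the hardware before relying on it.

```c
#define SHIFT_LANES 64  /* 32-bit elements: 64 lanes */

/* Hypothetical host-side stand-in for a 64-lane vector. */
typedef struct { int lane[SHIFT_LANES]; } shift_vec;

/* Assumed semantics: lane i moves to lane (i + n) mod SHIFT_LANES. */
static shift_vec model_shift_up(shift_vec in, int n)
{
    shift_vec out;
    for (int i = 0; i < SHIFT_LANES; ++i)
        out.lane[(i + n) % SHIFT_LANES] = in.lane[i];
    return out;
}

/* Assumed semantics: lane (i + n) mod SHIFT_LANES moves to lane i. */
static shift_vec model_shift_down(shift_vec in, int n)
{
    shift_vec out;
    for (int i = 0; i < SHIFT_LANES; ++i)
        out.lane[i] = in.lane[(i + n) % SHIFT_LANES];
    return out;
}
```

Under these assumptions, shifting up by N and then down by N returns the original vector, which is a convenient sanity check.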

v_element_shift_xor

Parameter	Description
input	[in] Vector.
N	[in] The value XORed with each lane id to select the source lane. 0 < N <= max lane id.
return value	Vector in which each lane holds the element copied from the lane whose id is the bitwise XOR of its own lane id and N.

char256 v_element_shift_xor_i8(char256 input, int N)
uchar256 v_element_shift_xor_u8(uchar256 input, int N)
short128 v_element_shift_xor_i16(short128 input, int N)
ushort128 v_element_shift_xor_u16(ushort128 input, int N)
bfloat128 v_element_shift_xor_bf16(bfloat128 input, int N)
half128 v_element_shift_xor_f16(half128 input, int N) (for supported devices only)
int64 v_element_shift_xor_i32(int64 input, int N)
uint64 v_element_shift_xor_u32(uint64 input, int N)
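
An XOR shuffle pairs each lane with a partner lane, which makes it a natural building block for butterfly-style reductions. The host-side C sketch below models the shuffle as described above and then uses it to sum a vector. `xor_vec`, `model_shift_xor` and `model_butterfly_sum` are hypothetical names, and the butterfly reduction is an illustrative use chosen for this example, not something the text above prescribes.

```c
#define XOR_LANES 64  /* 32-bit elements: 64 lanes */

/* Hypothetical host-side stand-in for a 64-lane vector. */
typedef struct { int lane[XOR_LANES]; } xor_vec;

/* Each lane i receives the element from lane (i XOR n). */
static xor_vec model_shift_xor(xor_vec in, int n)
{
    xor_vec out;
    for (int i = 0; i < XOR_LANES; ++i)
        out.lane[i] = in.lane[i ^ n];
    return out;
}

/* Butterfly sum: after log2(XOR_LANES) steps every lane holds the total. */
static xor_vec model_butterfly_sum(xor_vec v)
{
    for (int step = 1; step < XOR_LANES; step <<= 1) {
        xor_vec partner = model_shift_xor(v, step);
        for (int i = 0; i < XOR_LANES; ++i)
            v.lane[i] += partner.lane[i];
    }
    return v;
}
```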

Intrinsics

Every TPC instruction is wrapped with an intrinsic for every supported data type and scalar/vector argument combination.

The intrinsic function name is usually derived from the instruction name, instruction data type, return data type width, scalar/vector properties of its arguments and predicate values.

The intrinsic naming convention adheres to the following pattern:

<return type width>_<instruction data type>_<instruction name>_<arg1 width>_<arg2 width>_<b|bv>( arguments… );

The return type width can be:

<return type width>	Description
V	Vector type
AV	Augmented vector (4096-bit or 8192-bit vectors)
S	Scalar type
B	Boolean data type
BV	Boolean vector data type

The instruction type can be:

<instruction data type>	Description
F32	Single-precision floating point
I32	32-bit signed integer
U32	32-bit unsigned integer
BF16	Brain floating point
I16	16-bit signed integer
U16	16-bit unsigned integer
I8	8-bit signed integer
U8	8-bit unsigned integer
I	INT5 data type

The argument width can be:

<arg width>	Description
S	Scalar data type
V	Vector data type

Predicate arguments can be:

Predicate Argument	Description
B	Scalar Boolean
BV	Vector Boolean

Intrinsic usage example:

bool256 bv_u16_cmp_leq_v_v_b(ushort128 a, ushort128 b, bool predicate, bool predicatePolarity);

bool256 bv_f32_cmp_leq_v_s_vb(float64 a, float b, bool256 predicate, bool predicatePolarity);

float64 v_f32_mul_v_v_b(float64 a, float64 b, bool predicate, bool predicatePolarity);
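
The naming pattern can be checked mechanically by assembling a name from its parts. The helper below is a hypothetical host-side C utility written for this illustration, not part of the TPC tools. It joins the six fields with underscores, using the lowercase spelling that appears in the example names rather than the uppercase spelling used in the tables.

```c
#include <stdio.h>

/* Join the six naming fields with underscores. Returns the length of the
   full name, as snprintf does (a value >= buf_len means truncation). */
static int build_intrinsic_name(char *buf, size_t buf_len,
                                const char *return_width,
                                const char *data_type,
                                const char *instruction,
                                const char *arg1_width,
                                const char *arg2_width,
                                const char *predicate)
{
    return snprintf(buf, buf_len, "%s_%s_%s_%s_%s_%s",
                    return_width, data_type, instruction,
                    arg1_width, arg2_width, predicate);
}
```

For instance, the fields `v`, `f32`, `mul`, `v`, `v`, `b` reproduce the name of the third example above, `v_f32_mul_v_v_b`.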

Built-in Vector Reduction Intrinsics

Vector reduction intrinsics provide an easy way to compute the sum, product, minimum, maximum, argmin, and argmax of a vector. The vector is reduced to a single value, which is then broadcast to all lanes of the result vector.

The table below describes the available built-in reduction intrinsics for different data types.

Table 3: Built-in Reduction Intrinsics

Reduction Intrinsics	Description
float64 v_f32_reduce_add(float64 x)	Summation of all elements of the F32 vector
float64 v_f32_reduce_mul(float64 x)	Product of all elements of the F32 vector
float64 v_f32_reduce_min(float64 x)	Minimum value of all elements of the F32 vector
float64 v_f32_reduce_max(float64 x)	Maximum value of all elements of the F32 vector
uint64_float64_pair_t v_f32_reduce_argmin(float64 x)	Index of the minimum value of all elements of the F32 vector
uint64_float64_pair_t v_f32_reduce_argmax(float64 x)	Index of the maximum value of all elements of the F32 vector
int64 v_i32_reduce_add(int64 x)	Summation of all elements of the I32 vector
int64 v_i32_reduce_max(int64 x)	Maximum value of all elements of the I32 vector
uint64_int64_pair_t v_i32_reduce_argmin(int64 x)	Index of the minimum value of all elements of the I32 vector
uint64_int64_pair_t v_i32_reduce_argmax(int64 x)	Index of the maximum value of all elements of the I32 vector
bfloat128 v_bf16_reduce_add(bfloat128 x)	Summation of all elements of the BF16 vector
bfloat128 v_bf16_reduce_min(bfloat128 x)	Minimum value of all elements of the BF16 vector
bfloat128 v_bf16_reduce_max(bfloat128 x)	Maximum value of all elements of the BF16 vector
short128 v_i16_reduce_min(short128 x)	Minimum value of all elements of the I16 vector
short128 v_i16_reduce_max(short128 x)	Maximum value of all elements of the I16 vector
char256 v_i8_reduce_min(char256 x)	Minimum value of all elements of the I8 vector
char256 v_i8_reduce_max(char256 x)	Maximum value of all elements of the I8 vector
uchar256 v_u8_reduce_min(uchar256 x)	Minimum value of all elements of the U8 vector
uchar256 v_u8_reduce_max(uchar256 x)	Maximum value of all elements of the U8 vector
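
The reduce-then-broadcast behavior described at the start of this section can be modeled on the host as follows. This is a sketch in plain C, not TPC-C: `red_vec`, `model_f32_reduce_add` and `model_f32_reduce_max` are hypothetical names, and the sequential summation order is a simplification that may not match the order used by the hardware.

```c
#define RED_LANES 64  /* F32 elements: 64 lanes, as in float64 */

/* Hypothetical host-side stand-in for the TPC-C float64 vector type. */
typedef struct { float lane[RED_LANES]; } red_vec;

/* Sum all lanes, then broadcast the sum to every lane of the result. */
static red_vec model_f32_reduce_add(red_vec x)
{
    float sum = 0.0f;
    for (int i = 0; i < RED_LANES; ++i)
        sum += x.lane[i];
    red_vec out;
    for (int i = 0; i < RED_LANES; ++i)
        out.lane[i] = sum;
    return out;
}

/* Find the maximum lane value, then broadcast it to every lane. */
static red_vec model_f32_reduce_max(red_vec x)
{
    float best = x.lane[0];
    for (int i = 1; i < RED_LANES; ++i)
        if (x.lane[i] > best)
            best = x.lane[i];
    red_vec out;
    for (int i = 0; i < RED_LANES; ++i)
        out.lane[i] = best;
    return out;
}
```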

Exceptions to C99 standard

Initialization of Bool256 Variable

The compiler treats bool256 as an array of 32 chars. Unlike standard C99 aggregate initialization, which would set only the first element, the following syntax sets every bit of the array to one:

bool256 a = {0xff};
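
For comparison, this is what the same initializer does to an ordinary 32-byte array in standard C. The sketch is plain host-side C with hypothetical helper names. It shows that `{0xff}` sets only the first element under C99 rules, and that `memset` is the portable way to get an all-ones array on the host.

```c
#include <string.h>

/* Count how many of the 32 bytes equal 0xff. */
static int count_ff_bytes(const unsigned char a[32])
{
    int n = 0;
    for (int i = 0; i < 32; ++i)
        if (a[i] == 0xff)
            ++n;
    return n;
}

/* Standard C99: element 0 is 0xff, the remaining 31 are zero. */
static int ff_bytes_with_brace_init(void)
{
    unsigned char a[32] = {0xff};
    return count_ff_bytes(a);
}

/* Portable host-side way to set every bit of the array. */
static int ff_bytes_with_memset(void)
{
    unsigned char a[32];
    memset(a, 0xff, sizeof a);
    return count_ff_bytes(a);
}
```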

Initialization of Local Memory

According to C99, “If an object that has static or thread storage duration is not initialized explicitly and if it has arithmetic type, it is initialized to (positive or unsigned) zero”.

For performance reasons, local memory is left uninitialized at the beginning of a program, even though it has static storage duration.

Gaudi Documentation 1.21.1 documentation
