Built-in Functions
Program Management Special Functions¶
The following lists the available program management special functions:
int5 get_index_space_offset()¶
Parameter | Description |
---|---|
return value | Returns the index space offset for the current program invocation. |
int5 get_index_space_size()¶
Parameter | Description |
---|---|
return value | Returns the index space size for the current program invocation. |
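The two functions above are typically combined to find the range of index-space coordinates that one program invocation should process. The following is a minimal sketch in plain C, not TPC-C: `int5_model`, `index_space_end` and `index_space_members` are hypothetical names introduced for illustration, and the sketch assumes the invocation covers the half-open range from offset to offset + size in each of the five dimensions.

```c
#define INDEX_SPACE_DIMS 5

/* Hypothetical stand-in for the TPC-C int5 type: five integer coordinates. */
typedef struct { int v[INDEX_SPACE_DIMS]; } int5_model;

/* Exclusive end coordinate of the range: offset + size in each dimension
 * (assumed interpretation of the two return values). */
static int5_model index_space_end(int5_model offset, int5_model size)
{
    int5_model end;
    for (int d = 0; d < INDEX_SPACE_DIMS; ++d)
        end.v[d] = offset.v[d] + size.v[d];
    return end;
}

/* Number of index-space members covered by one invocation. */
static long index_space_members(int5_model size)
{
    long n = 1;
    for (int d = 0; d < INDEX_SPACE_DIMS; ++d)
        n *= size.v[d];
    return n;
}
```

In a kernel, the values passed as `offset` and `size` would come from `get_index_space_offset()` and `get_index_space_size()`, and the loop bounds in each dimension would run from the offset up to, but not including, the computed end.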
unsigned int get_dim_size(tensor a, unsigned int dim)¶
Parameter | Description |
---|---|
a | [in] Tensor handle. |
dim | [in] Tensor dimension index to be queried. |
return value | Tensor dimension size, in elements. |
unsigned int get_dim_stride(tensor a, unsigned int dim)¶
Parameter | Description |
---|---|
a | [in] Tensor handle. |
dim | [in] Tensor dimension index to be queried. |
return value | Tensor dimension stride, in elements. |
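Because both queries report values in elements, a coordinate can be turned into a linear element offset by summing coordinate times stride over all dimensions. The sketch below is plain C rather than TPC-C; `element_offset` and `dense_strides` are hypothetical helpers, and the dense layout (dimension 0 changing fastest) is an assumption made only to produce example strides.

```c
/* Linear element offset of a coordinate, given per-dimension strides in elements. */
static unsigned int element_offset(const unsigned int *coord,
                                   const unsigned int *stride,
                                   unsigned int rank)
{
    unsigned int off = 0;
    for (unsigned int d = 0; d < rank; ++d)
        off += coord[d] * stride[d];
    return off;
}

/* Strides of a densely packed tensor, assuming dimension 0 changes fastest. */
static void dense_strides(const unsigned int *size, unsigned int rank,
                          unsigned int *stride)
{
    unsigned int s = 1;
    for (unsigned int d = 0; d < rank; ++d) {
        stride[d] = s;
        s *= size[d];
    }
}
```

In TPC-C code, the `size` and `stride` arrays would be filled from `get_dim_size(a, d)` and `get_dim_stride(a, d)` instead of being computed.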
unsigned int get_pad_value_<tensor data type>(tensor a)¶
Parameter | Description |
---|---|
a | [in] Tensor handle. |
return value | Tensor’s pad value. |
This function supports the following data types:
uint
int
float
bf16
short
ushort
char
uchar
void set_pad_value_<tensor data type>(tensor a, <tensor data type> val)¶
Parameter | Description |
---|---|
a | [in] Tensor handle. |
val | [in] New pad value to set. |
This function supports the following data types:
uint
int
float
bf16
short
ushort
char
uchar
Built-in Special Functions¶
Table 2 describes the available built-in special functions.
To use special functions in your TPC-C code, set the following flag in the glue code:
specialFunctionsUsed = 1
Table 2: Built-in Special Functions
Function | Single-precision Floating Point, max ULPs |
---|---|
float64 v_reciprocal_f32(float64 x) | 2 |
float64 v_sqrt_f32(float64 x) | 2 |
float64 v_exp_f32(float64 x) | 2 |
float64 v_exp_cephes_f32(float64 x) | 1 |
float64 v_log_f32(float64 x) | 3 |
float64 v_log2_f32(float64 x) | 3 |
float64 v_tanh_f32(float64 x) | 3 |
float64 v_pow_f32(float64 x, float64 y) | 20 |
float64 v_pow2_f32(float64 x) | 2 |
float64 v_rsqrt_f32(float64 x) | 3 |
float64 v_div_f32(float64 x, float64 y) | 2 |
float64 v_sin_f32(float64 x) | 2 |
float64 v_cos_f32(float64 x) | 2 |
float64 v_tan_f32(float64 x) | 3 |
float64 v_sigmoid_f32(float64 input) | 16 |
float64 v_asin_cephes_f32(float64 input) | 2 |
float64 v_acos_cephes_f32(float64 input) | 2 |
float64 v_atan_cephes_f32(float64 input) | 3 |
float64 v_asinh_f32(float64 input) | 6 |
float64 v_acosh_f32(float64 input) | 10 |
float64 v_atanh_f32(float64 input) | 3 |
float64 v_sinh_cephes_f32(float64 input) | 3 |
float64 v_cosh_cephes_f32(float64 input) | 3 |
float64 v_mod_f32(float64 input) | 70 |
float64 v_expm1_f32(float64 input) | 10 |
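The max ULP column bounds how many representable single-precision values may lie between a function's result and the correctly rounded reference. As an illustration of how such a bound could be checked on the host, here is a minimal plain-C sketch; `ulp_distance` and `within_max_ulps` are hypothetical helpers, not part of the TPC-C library, and NaN inputs are not handled.

```c
#include <stdint.h>
#include <string.h>

/* Map a float's bit pattern to an integer that grows monotonically with the
 * float's value, so that adjacent floats map to adjacent integers. */
static int64_t float_to_ordered(float f)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);
    /* Floats are sign-magnitude: negative values map to minus the magnitude. */
    if (bits & 0x80000000u)
        return -(int64_t)(bits & 0x7fffffffu);
    return (int64_t)bits;
}

/* Number of representable floats between a and b (0 when they are equal). */
static int64_t ulp_distance(float a, float b)
{
    int64_t d = float_to_ordered(a) - float_to_ordered(b);
    return d < 0 ? -d : d;
}

/* Nonzero when result is within max_ulps of the reference value. */
static int within_max_ulps(float result, float reference, int64_t max_ulps)
{
    return ulp_distance(result, reference) <= max_ulps;
}
```

For example, a value read back from a `v_exp_f32` lane could be compared against a host-side reference with `within_max_ulps(result, reference, 2)`, using the bound listed in Table 2.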
INT8/INT16 Built-in Special Functions¶
The following lists the available INT8/INT16 built-in special functions:
int8 tanh(int8 a);
int16 tanh(int16 a);
int8 sigmoid(int8 a);
int16 sigmoid(int16 a);
int8 exp(int8 a); // for a < 0
int16 exp(int16 a); // for a < 0
Reciprocal (1/x), for x in [0.5, 1)
Built-in Shuffle Functions¶
Shuffle functions perform elementwise operations on a vector and are available for 8-bit, 16-bit, and 32-bit elements.
The following sections list the available shuffle functions.
v_broadcast_element¶
Parameter | Description |
---|---|
input | [in] Vector. |
N | [in] LANE_ID of the element to broadcast. 0 <= N <= max lane id. |
return value | Vector with the selected element broadcast to all lanes. |
char256 v_broadcast_element_i8(char256 input, int N)
uchar256 v_broadcast_element_u8(uchar256 input, int N)
short128 v_broadcast_element_i16(short128 input, int N)
ushort128 v_broadcast_element_u16(ushort128 input, int N)
bfloat128 v_broadcast_element_bf16(bfloat128 input, int N)
half128 v_broadcast_element_f16(half128 input, int N) (for supported devices only)
int64 v_broadcast_element_i32(int64 input, int N)
uint64 v_broadcast_element_u32(uint64 input, int N)
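The broadcast semantics described in the table can be expressed as a scalar reference model. This is a plain-C sketch for a 64-lane vector of 32-bit elements, not TPC-C code; `broadcast_element_model` is a hypothetical name, and the input and output buffers are assumed not to overlap.

```c
#define LANES 64 /* lane count of an int64 vector (64 x 32-bit elements) */

/* Every output lane receives the element held in lane n of the input. */
static void broadcast_element_model(const int *input, int n, int *output)
{
    for (int lane = 0; lane < LANES; ++lane)
        output[lane] = input[n];
}
```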
v_element_shift_up¶
Parameter | Description |
---|---|
input | [in] Vector. |
N | [in] The shift value. 0 < N <= max lane id. |
return value | Vector with the elements shifted up to the Nth neighbor. The shift is cyclic. |
char256 v_element_shift_up_i8(char256 input, int N)
uchar256 v_element_shift_up_u8(uchar256 input, int N)
short128 v_element_shift_up_i16(short128 input, int N)
ushort128 v_element_shift_up_u16(ushort128 input, int N)
bfloat128 v_element_shift_up_bf16(bfloat128 input, int N)
half128 v_element_shift_up_f16(half128 input, int N) (for supported devices only)
int64 v_element_shift_up_i32(int64 input, int N)
uint64 v_element_shift_up_u32(uint64 input, int N)
v_element_shift_down¶
Parameter | Description |
---|---|
input | [in] Vector. |
N | [in] The shift value. 0 < N <= max lane id. |
return value | Vector with the elements shifted down to the Nth neighbor. The shift is cyclic. |
char256 v_element_shift_down_i8(char256 input, int N)
uchar256 v_element_shift_down_u8(uchar256 input, int N)
short128 v_element_shift_down_i16(short128 input, int N)
ushort128 v_element_shift_down_u16(ushort128 input, int N)
bfloat128 v_element_shift_down_bf16(bfloat128 input, int N)
half128 v_element_shift_down_f16(half128 input, int N) (for supported devices only)
int64 v_element_shift_down_i32(int64 input, int N)
uint64 v_element_shift_down_u32(uint64 input, int N)
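The cyclic behavior of v_element_shift_up and v_element_shift_down can be illustrated with a scalar reference model. The sketch below is plain C for a 64-lane vector, not TPC-C code. The function names are hypothetical, and the direction is an assumption: "up" is taken to mean that the element in lane i moves to lane i + N, and "down" that it moves to lane i - N, both modulo the lane count.

```c
#define LANES 64 /* lane count of an int64 vector (64 x 32-bit elements) */

/* Assumed "up": the element in lane i moves to lane (i + n) mod LANES. */
static void element_shift_up_model(const int *in, int n, int *out)
{
    for (int lane = 0; lane < LANES; ++lane)
        out[(lane + n) % LANES] = in[lane];
}

/* Assumed "down": lane i receives the element from lane (i + n) mod LANES. */
static void element_shift_down_model(const int *in, int n, int *out)
{
    for (int lane = 0; lane < LANES; ++lane)
        out[lane] = in[(lane + n) % LANES];
}
```

Under these assumptions the two operations are inverses: shifting up by N and then down by N returns the original vector, because no element is lost in a cyclic shift.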
v_element_shift_xor¶
Parameter | Description |
---|---|
input | [in] Vector. |
N | [in] The lane id for xor operation. 0 < N <= max lane id. |
return value | Vector with elements copied from a lane based on bitwise xor with own lane id. |
char256 v_element_shift_xor_i8(char256 input, int N)
uchar256 v_element_shift_xor_u8(uchar256 input, int N)
short128 v_element_shift_xor_i16(short128 input, int N)
ushort128 v_element_shift_xor_u16(ushort128 input, int N)
bfloat128 v_element_shift_xor_bf16(bfloat128 input, int N)
half128 v_element_shift_xor_f16(half128 input, int N) (for supported devices only)
int64 v_element_shift_xor_i32(int64 input, int N)
uint64 v_element_shift_xor_u32(uint64 input, int N)
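The xor shuffle pairs each lane with the lane whose id differs by the bits set in N, which is the access pattern of a butterfly exchange. The following plain-C sketch models that for a 64-lane vector and shows one possible use, an all-lanes sum built from log2(64) = 6 exchange steps. Both function names are hypothetical, and the sum is only an illustration of the pattern, not a description of how the hardware or the reduction intrinsics are implemented.

```c
#define LANES 64 /* lane count of an int64 vector (64 x 32-bit elements) */

/* Lane i receives the element from lane (i XOR n). For 0 < n < LANES the
 * partner lane is always inside the vector. */
static void element_shift_xor_model(const int *in, int n, int *out)
{
    for (int lane = 0; lane < LANES; ++lane)
        out[lane] = in[lane ^ n];
}

/* Butterfly all-reduce: after exchanging with partners at distance
 * 32, 16, 8, 4, 2, 1 every lane holds the sum of all input lanes. */
static void butterfly_sum_model(int *v)
{
    int partner[LANES];
    for (int n = LANES / 2; n >= 1; n /= 2) {
        element_shift_xor_model(v, n, partner);
        for (int lane = 0; lane < LANES; ++lane)
            v[lane] += partner[lane];
    }
}
```

Applying the xor shuffle twice with the same N restores the original vector, since (i XOR N) XOR N equals i.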
Intrinsics¶
Every TPC instruction is wrapped with an intrinsic for every supported data type and scalar/vector argument combination.
The intrinsic function name is usually derived from the instruction name, instruction data type, return data type width, scalar/vector properties of its arguments and predicate values.
The intrinsic naming convention adheres to the following pattern:
<return type width>_<instruction data type>_<instruction name>_<arg1 width>_<arg2 width>_<b|bv>(arguments…);
The return type width can be:
<return type width> | Description |
---|---|
V | Vector type |
AV | Augmented vector (4096-bit or 8192-bit vectors) |
S | Scalar type |
B | Boolean data type |
BV | Boolean vector data type |
The instruction data type can be:
<instruction data type> | Description |
---|---|
F32 | Single-precision floating point |
I32 | 32-bit signed integer |
U32 | 32-bit unsigned integer |
BF16 | Brain floating point |
I16 | 16-bit signed integer |
U16 | 16-bit unsigned integer |
I8 | 8-bit signed integer |
U8 | 8-bit unsigned integer |
I | INT5 data type |
The argument width can be:
<arg width> | Description |
---|---|
S | Scalar data type |
V | Vector data type |
Predicate arguments can be:
Predicate Argument | Description |
---|---|
B | Scalar Boolean |
BV | Vector Boolean |
Intrinsic usage example:
bool256 bv_u16_cmp_leq_v_v_b(ushort128 a, ushort128 b, bool predicate, bool predicatePolarity);
bool256 bv_f32_cmp_leq_v_s_vb(float64 a, float b, bool256 predicate, bool predicatePolarity);
float64 v_f32_mul_v_v_b(float64 a, float64 b, bool predicate, bool predicatePolarity);
Built-in Vector Reduction Intrinsics¶
Vector reduction intrinsics provide an easy way to compute the sum, product, minimum, maximum, argmin, or argmax of a vector. The vector is reduced to a single value, which is then broadcast to all lanes of the result vector.
The table below describes the available built-in reduction intrinsics for different data types.
Table 3: Built-in Reduction Intrinsics
Reduction Intrinsics | Description |
---|---|
float64 v_f32_reduce_add(float64 x) | Summation of all elements of the F32 vector |
float64 v_f32_reduce_mul(float64 x) | Product of all elements of the F32 vector |
float64 v_f32_reduce_min(float64 x) | Minimum value of all elements of the F32 vector |
float64 v_f32_reduce_max(float64 x) | Maximum value of all elements of the F32 vector |
uint64_float64_pair_t v_f32_reduce_argmin(float64 x) | Index of the minimum value of all elements of the F32 vector |
uint64_float64_pair_t v_f32_reduce_argmax(float64 x) | Index of the maximum value of all elements of the F32 vector |
int64 v_i32_reduce_add(int64 x) | Summation of all elements of the I32 vector |
int64 v_i32_reduce_max(int64 x) | Maximum value of all elements of the I32 vector |
uint64_int64_pair_t v_i32_reduce_argmin(int64 x) | Index of the minimum value of all elements of the I32 vector |
uint64_int64_pair_t v_i32_reduce_argmax(int64 x) | Index of the maximum value of all elements of the I32 vector |
bfloat128 v_bf16_reduce_add(bfloat128 x) | Summation of all elements of the BF16 vector |
bfloat128 v_bf16_reduce_min(bfloat128 x) | Minimum value of all elements of the BF16 vector |
bfloat128 v_bf16_reduce_max(bfloat128 x) | Maximum value of all elements of the BF16 vector |
short128 v_i16_reduce_min(short128 x) | Minimum value of all elements of the I16 vector |
short128 v_i16_reduce_max(short128 x) | Maximum value of all elements of the I16 vector |
char256 v_i8_reduce_min(char256 x) | Minimum value of all elements of the I8 vector |
char256 v_i8_reduce_max(char256 x) | Maximum value of all elements of the I8 vector |
uchar256 v_u8_reduce_min(uchar256 x) | Minimum value of all elements of the U8 vector |
uchar256 v_u8_reduce_max(uchar256 x) | Maximum value of all elements of the U8 vector |
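The reduce-then-broadcast behavior of the intrinsics in Table 3 can be illustrated with a scalar reference model. The sketch below is plain C for a 64-lane F32 vector, not TPC-C code. `argmax_pair_model` is a hypothetical stand-in for the pair return type, and its member names, the choice to carry both index and value, and the tie-breaking rule (lowest lane wins) are assumptions.

```c
#define LANES 64 /* lane count of a float64 vector (64 x F32 elements) */

/* Hypothetical stand-in for the index/value pair returned by argmax. */
typedef struct {
    unsigned int index[LANES];
    float value[LANES];
} argmax_pair_model;

/* Reduce to the maximum, then broadcast it to every output lane. */
static void reduce_max_model(const float *x, float *out)
{
    float m = x[0];
    for (int lane = 1; lane < LANES; ++lane)
        if (x[lane] > m)
            m = x[lane];
    for (int lane = 0; lane < LANES; ++lane)
        out[lane] = m;
}

/* Find the lane holding the maximum (lowest lane on ties, by assumption)
 * and broadcast both its index and its value to every output lane. */
static void reduce_argmax_model(const float *x, argmax_pair_model *out)
{
    unsigned int best = 0;
    for (unsigned int lane = 1; lane < LANES; ++lane)
        if (x[lane] > x[best])
            best = lane;
    for (int lane = 0; lane < LANES; ++lane) {
        out->index[lane] = best;
        out->value[lane] = x[best];
    }
}
```

Because the result is broadcast, any lane of the returned vector can be read to obtain the reduced value.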
Exceptions to C99 standard¶
Initialization of Bool256 Variable¶
The compiler treats bool256 as an array of 32 chars. To initialize all bits of the array to one, use the following syntax:
bool256 a = {0xff};
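For contrast, standard C99 applies the same brace initializer differently: on a 32-byte array, only the first element receives 0xff and the remaining elements are zero-initialized. The sketch below is plain C99 compiled with an ordinary host compiler, not TPC-C; both function names are hypothetical and exist only to make the difference measurable.

```c
#include <string.h>

/* Standard C99: {0xff} sets element 0 only; the other 31 bytes become zero.
 * Returns how many bytes end up equal to 0xff. */
static int count_ff_bytes_c99_initializer(void)
{
    unsigned char a[32] = {0xff};
    int n = 0;
    for (int i = 0; i < 32; ++i)
        if (a[i] == 0xff)
            ++n;
    return n;
}

/* Standard C99 needs an explicit fill to set every bit of the array. */
static int count_ff_bytes_after_memset(void)
{
    unsigned char a[32];
    int n = 0;
    memset(a, 0xff, sizeof a);
    for (int i = 0; i < 32; ++i)
        if (a[i] == 0xff)
            ++n;
    return n;
}
```

The first function counts one byte and the second counts all 32, which is why the bool256 initializer is listed here as an exception to the C99 rule.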
Initialization of Local Memory¶
According to C99, “If an object that has static or thread storage duration is not initialized explicitly and if it has arithmetic type, it is initialized to (positive or unsigned) zero”.
For performance reasons, local memory is left uninitialized at the beginning of a program, even though it has static storage duration.