Built-in Functions

Program Management Special Functions

The following lists the available program management special functions:

int5 get_index_space_offset()

Parameter       Description
return value    Returns the index space offset for the current program invocation.

int5 get_index_space_size()

Parameter       Description
return value    Returns the index space size for the current program invocation.
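The two functions above are typically combined to find the range of index-space members that one invocation should process. The following is a minimal host-side sketch in plain C, not TPC-C: `int5_model` and the `model_*` stubs are hypothetical stand-ins with made-up values, and the half-open range [offset, offset + size) per dimension is an assumption about how the two results are meant to be combined.

```c
#include <assert.h>

#define INDEX_SPACE_DIMS 5

/* Host-side stand-in for the TPC-C int5 type (assumption: five int coordinates). */
typedef struct { int v[INDEX_SPACE_DIMS]; } int5_model;

/* Hypothetical stubs: on the device these values are supplied per invocation. */
static int5_model model_get_index_space_offset(void) {
    int5_model r = {{0, 4, 0, 0, 0}};
    return r;
}

static int5_model model_get_index_space_size(void) {
    int5_model r = {{2, 3, 1, 1, 1}};
    return r;
}

/* Visits every index-space member in [offset, offset + size) and counts them. */
static long count_visited_members(void) {
    const int5_model start = model_get_index_space_offset();
    const int5_model size = model_get_index_space_size();
    int5_model end;
    long visited = 0;

    for (int d = 0; d < INDEX_SPACE_DIMS; d++) {
        assert(size.v[d] >= 0);
        end.v[d] = start.v[d] + size.v[d];
    }
    for (int i4 = start.v[4]; i4 < end.v[4]; i4++)
        for (int i3 = start.v[3]; i3 < end.v[3]; i3++)
            for (int i2 = start.v[2]; i2 < end.v[2]; i2++)
                for (int i1 = start.v[1]; i1 < end.v[1]; i1++)
                    for (int i0 = start.v[0]; i0 < end.v[0]; i0++)
                        visited++; /* a real kernel would process member (i0..i4) here */
    return visited;
}
```

With the stub values shown, the loop nest visits 2 x 3 x 1 x 1 x 1 members.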

unsigned int get_dim_size(tensor a, unsigned int dim)

Parameter       Description
a               [in] Tensor handle.
dim             [in] Tensor dimension index to be queried.
return value    Tensor dimension size, in elements.

unsigned int get_dim_stride(tensor a, unsigned int dim)

Parameter       Description
a               [in] Tensor handle.
dim             [in] Tensor dimension index to be queried.
return value    Tensor dimension stride, in elements.
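Sizes and strides expressed in elements are commonly used together to turn a coordinate into a linear element offset. The sketch below is a host-side model in plain C, not TPC-C: `tensor_model` and the `model_*` accessors are hypothetical, and it assumes that a stride is the element distance between consecutive indices along that dimension.

```c
#include <assert.h>

#define MAX_DIMS 5

/* Hypothetical host-side tensor descriptor; the real tensor handle is opaque. */
typedef struct {
    unsigned int size[MAX_DIMS];   /* elements per dimension */
    unsigned int stride[MAX_DIMS]; /* element stride per dimension */
} tensor_model;

static unsigned int model_get_dim_size(const tensor_model *t, unsigned int dim) {
    assert(dim < MAX_DIMS);
    return t->size[dim];
}

static unsigned int model_get_dim_stride(const tensor_model *t, unsigned int dim) {
    assert(dim < MAX_DIMS);
    return t->stride[dim];
}

/* Linear element offset of a coordinate: sum over d of coord[d] * stride[d]. */
static unsigned int element_offset(const tensor_model *t,
                                   const unsigned int coord[MAX_DIMS]) {
    unsigned int offset = 0;
    for (unsigned int d = 0; d < MAX_DIMS; d++) {
        assert(coord[d] < model_get_dim_size(t, d)); /* stay inside the tensor */
        offset += coord[d] * model_get_dim_stride(t, d);
    }
    return offset;
}
```

For a densely packed tensor with sizes {4, 3, 2, 1, 1} and strides {1, 4, 12, 24, 24}, the coordinate {1, 2, 1, 0, 0} maps to element 1 + 8 + 12 = 21.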

unsigned int get_pad_value_<tensor data type>(tensor a)

Parameter       Description
a               [in] Tensor handle.
return value    Tensor’s pad value.

This function supports the following data types:

  • uint

  • int

  • float

  • bf16

  • short

  • ushort

  • char

  • uchar

void set_pad_value_<tensor data type>(tensor a, <tensor data type> val)

Parameter       Description
a               [in] Tensor handle.
val             [in] New pad value to set.

This function supports the following data types:

  • uint

  • int

  • float

  • bf16

  • short

  • ushort

  • char

  • uchar

Built-in Special Functions

Table 2 describes the available built-in special functions.

To use special functions in your TPC-C code, set the following flag in the glue code:

specialFunctionsUsed = 1

Table 2: Built-in Special Functions

Function                                    Single-precision Floating Point – max ULPs
float64 v_reciprocal_f32(float64 x)         2
float64 v_sqrt_f32(float64 x)               2
float64 v_exp_f32(float64 x)                2
float64 v_exp_cephes_f32(float64 x)         1
float64 v_log_f32(float64 x)                3
float64 v_log2_f32(float64 x)               3
float64 v_tanh_f32(float64 x)               3
float64 v_pow_f32(float64 x, float64 y)     20
float64 v_pow2_f32(float64 x)               2
float64 v_rsqrt_f32(float64 x)              3
float64 v_div_f32(float64 x, float64 y)     2
float64 v_sin_f32(float64 x)                2
float64 v_cos_f32(float64 x)                2
float64 v_tan_f32(float64 x)                3
float64 v_sigmoid_f32(float64 input)        16
float64 v_asin_cephes_f32(float64 input)    2
float64 v_acos_cephes_f32(float64 input)    2
float64 v_atan_cephes_f32(float64 input)    3
float64 v_asinh_f32(float64 input)          6
float64 v_acosh_f32(float64 input)          10
float64 v_atanh_f32(float64 input)          3
float64 v_sinh_cephes_f32(float64 input)    3
float64 v_cosh_cephes_f32(float64 input)    3
float64 v_mod_f32(float64 input)            70
float64 v_expm1_f32(float64 input)          10
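The error bounds in Table 2 are given in ULPs (units in the last place). One common way to check a result against such a bound is to count how many representable single-precision values lie between the computed result and a reference value. The following is a minimal host-side sketch in plain C; `ulp_distance` and its helpers are hypothetical names, and the document does not state that this is the exact metric used to produce the table.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Reinterprets a 32-bit pattern as a float without violating aliasing rules. */
static float float_from_bits(uint32_t bits) {
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}

/* Maps a float onto an integer line on which adjacent floats differ by 1.
   IEEE 754 floats are sign-magnitude, so negative values are mirrored. */
static int64_t ordered_bits(float f) {
    uint32_t u;
    memcpy(&u, &f, sizeof u);
    return (u & 0x80000000u) ? -(int64_t)(u & 0x7fffffffu) : (int64_t)u;
}

/* Number of representable floats separating a and b (0 if they are equal). */
static uint64_t ulp_distance(float a, float b) {
    assert(a == a && b == b); /* NaN inputs are not supported by this sketch */
    int64_t d = ordered_bits(a) - ordered_bits(b);
    return (uint64_t)(d < 0 ? -d : d);
}
```

A result would satisfy a bound of 2 ULPs under this metric when `ulp_distance(result, reference) <= 2`.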

INT8/INT16 Built-in Special Functions

The following lists the available INT8/INT16 built-in special functions:

  • int8 tanh(int8 a);

  • int16 tanh(int16 a);

  • int8 sigmoid(int8 a);

  • int16 sigmoid(int16 a);

  • int8 exp(int8 a); // for X < 0

  • int16 exp(int16 a); // for X < 0

  • 1/x for x in [0.5, 1)

Built-in Shuffle Functions

Shuffle functions perform element-wise operations on a vector and are available for 8-bit, 16-bit, and 32-bit elements.

The following sections list the available shuffle functions.

v_broadcast_element

Parameter       Description
input           [in] Vector.
N               [in] LANE_ID of the element to broadcast. 0 <= N <= max lane id.
return value    Vector with the selected element broadcast to all lanes.

  • char256 v_broadcast_element_i8(char256 input, int N)

  • uchar256 v_broadcast_element_u8(uchar256 input, int N)

  • short128 v_broadcast_element_i16(short128 input, int N)

  • ushort128 v_broadcast_element_u16(ushort128 input, int N)

  • bfloat128 v_broadcast_element_bf16(bfloat128 input, int N)

  • half128 v_broadcast_element_f16(half128 input, int N) (for supported devices only)

  • int64 v_broadcast_element_i32(int64 input, int N)

  • uint64 v_broadcast_element_u32(uint64 input, int N)
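The broadcast behaviour can be modelled on the host with a plain array. The sketch below is ordinary C, not TPC-C: `model_broadcast_element_i32` is a hypothetical stand-in for the 32-bit variant, and it assumes 64 lanes for the `int64` vector type, as the type name suggests.

```c
#include <assert.h>

#define LANES 64 /* assumption: an int64 vector holds 64 lanes of 32 bits */

/* Every output lane receives the element stored in lane n of the input. */
static void model_broadcast_element_i32(const int in[LANES], int n, int out[LANES]) {
    assert(n >= 0 && n < LANES); /* 0 <= N <= max lane id */
    for (int lane = 0; lane < LANES; lane++)
        out[lane] = in[n];
}
```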

v_element_shift_up

Parameter       Description
input           [in] Vector.
N               [in] The shift value. 0 < N <= max lane id.
return value    Vector with the elements shifted up to the Nth neighbor. The shift is cyclic.

  • char256 v_element_shift_up_i8(char256 input, int N)

  • uchar256 v_element_shift_up_u8(uchar256 input, int N)

  • short128 v_element_shift_up_i16(short128 input, int N)

  • ushort128 v_element_shift_up_u16(ushort128 input, int N)

  • bfloat128 v_element_shift_up_bf16(bfloat128 input, int N)

  • half128 v_element_shift_up_f16(half128 input, int N) (for supported devices only)

  • int64 v_element_shift_up_i32(int64 input, int N)

  • uint64 v_element_shift_up_u32(uint64 input, int N)
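A cyclic shift up can be modelled on the host as follows. This is a plain C sketch, not TPC-C: `model_element_shift_up_i32` is a hypothetical stand-in for the 32-bit variant, and it assumes that "up" means toward higher lane ids, so the element in lane i lands in lane (i + N) modulo the lane count.

```c
#include <assert.h>

#define LANES 64 /* assumption: an int64 vector holds 64 lanes of 32 bits */

/* Cyclic shift toward higher lane ids (assumed meaning of "up"):
   the element in lane i moves to lane (i + n) % LANES. */
static void model_element_shift_up_i32(const int in[LANES], int n, int out[LANES]) {
    assert(n > 0 && n < LANES); /* 0 < N <= max lane id */
    for (int lane = 0; lane < LANES; lane++)
        out[(lane + n) % LANES] = in[lane];
}
```

Because the shift is cyclic, the elements that leave the top of the vector re-enter at lane 0.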

v_element_shift_down

Parameter       Description
input           [in] Vector.
N               [in] The shift value. 0 < N <= max lane id.
return value    Vector with the elements shifted down to the Nth neighbor. The shift is cyclic.

  • char256 v_element_shift_down_i8(char256 input, int N)

  • uchar256 v_element_shift_down_u8(uchar256 input, int N)

  • short128 v_element_shift_down_i16(short128 input, int N)

  • ushort128 v_element_shift_down_u16(ushort128 input, int N)

  • bfloat128 v_element_shift_down_bf16(bfloat128 input, int N)

  • half128 v_element_shift_down_f16(half128 input, int N) (for supported devices only)

  • int64 v_element_shift_down_i32(int64 input, int N)

  • uint64 v_element_shift_down_u32(uint64 input, int N)
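Under the assumption that "down" means toward lower lane ids and "up" toward higher ones, a cyclic shift down by N undoes a cyclic shift up by N. The host-side C sketch below models both directions with hypothetical `model_*` functions to make that relationship concrete; it is not TPC-C.

```c
#include <assert.h>

#define LANES 64 /* assumption: an int64 vector holds 64 lanes of 32 bits */

/* Assumed "up": the element in lane i moves to lane (i + n) % LANES. */
static void model_element_shift_up_i32(const int in[LANES], int n, int out[LANES]) {
    assert(n > 0 && n < LANES);
    for (int lane = 0; lane < LANES; lane++)
        out[(lane + n) % LANES] = in[lane];
}

/* Assumed "down": lane i receives the element from lane (i + n) % LANES. */
static void model_element_shift_down_i32(const int in[LANES], int n, int out[LANES]) {
    assert(n > 0 && n < LANES);
    for (int lane = 0; lane < LANES; lane++)
        out[lane] = in[(lane + n) % LANES];
}
```

Shifting a vector up by 7 and then down by 7 with these models returns the original lane order.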

v_element_shift_xor

Parameter       Description
input           [in] Vector.
N               [in] The lane id for xor operation. 0 < N <= max lane id.
return value    Vector with elements copied from a lane based on bitwise xor with own lane id.

  • char256 v_element_shift_xor_i8(char256 input, int N)

  • uchar256 v_element_shift_xor_u8(uchar256 input, int N)

  • short128 v_element_shift_xor_i16(short128 input, int N)

  • ushort128 v_element_shift_xor_u16(ushort128 input, int N)

  • bfloat128 v_element_shift_xor_bf16(bfloat128 input, int N)

  • half128 v_element_shift_xor_f16(half128 input, int N) (for supported devices only)

  • int64 v_element_shift_xor_i32(int64 input, int N)

  • uint64 v_element_shift_xor_u32(uint64 input, int N)
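An xor shuffle pairs each lane with a partner lane, which makes it a natural building block for butterfly-style exchanges. The sketch below is a host-side model in plain C, not TPC-C: `model_element_shift_xor_i32` is a hypothetical stand-in that assumes lane i receives the element from lane (i xor N), and `butterfly_sum_i32` is an illustrative use of it, not a description of how any built-in is implemented.

```c
#include <assert.h>

#define LANES 64 /* assumption: 64 lanes of 32 bits; must be a power of two here */

/* Lane i receives the element held by lane (i ^ n). */
static void model_element_shift_xor_i32(const int in[LANES], int n, int out[LANES]) {
    assert(n > 0 && n < LANES); /* 0 < N <= max lane id */
    for (int lane = 0; lane < LANES; lane++)
        out[lane] = in[lane ^ n];
}

/* Butterfly all-reduce: after log2(LANES) exchange steps with partners at
   distance 1, 2, 4, ..., every lane holds the sum of all original elements. */
static void butterfly_sum_i32(int v[LANES]) {
    int partner[LANES];
    for (int step = 1; step < LANES; step <<= 1) {
        model_element_shift_xor_i32(v, step, partner);
        for (int lane = 0; lane < LANES; lane++)
            v[lane] += partner[lane];
    }
}
```

Applying the same xor shuffle twice restores the original vector, since (i ^ N) ^ N equals i.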

Intrinsics

Every TPC instruction is wrapped with an intrinsic for every supported data type and scalar/vector argument combination.

The intrinsic function name is usually derived from the instruction name, instruction data type, return data type width, scalar/vector properties of its arguments and predicate values.

The intrinsic naming convention adheres to the following pattern:

<return type width>_<instruction datatype>_<instruction name>_<arg1 width>_<arg2 width>_<b|bv>( arguments… );

  • The return type width can be:

<return type width>     Description
V                       Vector type
AV                      Augmented vector (4096-bit or 8192-bit vectors)
S                       Scalar type
B                       Boolean data type
BV                      Boolean vector data type

  • The instruction type can be:

<instruction datatype>  Description
F32                     Single-precision floating point
I32                     32-bit signed integer
U32                     32-bit unsigned integer
BF16                    Brain floating point
I16                     16-bit signed integer
U16                     16-bit unsigned integer
I8                      8-bit signed integer
U8                      8-bit unsigned integer
I                       INT5 data type

  • The argument width can be:

<arg width>             Description
S                       Scalar data type
V                       Vector data type

  • Predicate arguments can be:

Predicate Argument      Description
B                       Scalar Boolean
BV                      Vector Boolean

Intrinsic declaration examples:

bool256 bv_u16_cmp_leq_v_v_b(ushort128 a, ushort128 b, bool predicate, bool predicatePolarity);

bool256 bv_f32_cmp_leq_v_s_vb(float64 a, float b, bool256 predicate, bool predicatePolarity);

float64 v_f32_mul_v_v_b(float64 a, float64 b, bool predicate, bool predicatePolarity);
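To make the naming pattern concrete, the following host-side C sketch splits a two-argument intrinsic name into the fields listed above. `intrinsic_name` and `parse_intrinsic_name` are hypothetical helpers written for illustration only. The sketch assumes the name follows the two-argument pattern exactly: the first two tokens are the return width and data type, the last three are the two argument widths and the predicate suffix, and everything in between is the instruction name, which may itself contain underscores.

```c
#include <assert.h>
#include <string.h>

#define FIELD_MAX 32
#define MAX_TOKENS 16

typedef struct {
    char return_width[FIELD_MAX];
    char datatype[FIELD_MAX];
    char instruction[FIELD_MAX];
    char arg1_width[FIELD_MAX];
    char arg2_width[FIELD_MAX];
    char predicate[FIELD_MAX];
} intrinsic_name;

/* Splits a two-argument intrinsic name. Returns 0 on success, -1 if malformed. */
static int parse_intrinsic_name(const char *name, intrinsic_name *out) {
    char buf[128];
    char *tok[MAX_TOKENS];
    int n = 0;

    assert(name != 0 && out != 0);
    if (strlen(name) >= sizeof buf) return -1;
    strcpy(buf, name);
    for (char *p = strtok(buf, "_"); p != 0; p = strtok(0, "_")) {
        if (n == MAX_TOKENS || strlen(p) >= FIELD_MAX) return -1;
        tok[n++] = p;
    }
    if (n < 6) return -1; /* width, type, at least one name token, arg1, arg2, predicate */

    strcpy(out->return_width, tok[0]);
    strcpy(out->datatype, tok[1]);
    strcpy(out->arg1_width, tok[n - 3]);
    strcpy(out->arg2_width, tok[n - 2]);
    strcpy(out->predicate, tok[n - 1]);

    /* Rejoin the middle tokens, e.g. "cmp" + "leq" -> "cmp_leq". */
    out->instruction[0] = '\0';
    for (int i = 2; i <= n - 4; i++) {
        if (strlen(out->instruction) + strlen(tok[i]) + 2 > FIELD_MAX) return -1;
        if (i > 2) strcat(out->instruction, "_");
        strcat(out->instruction, tok[i]);
    }
    return 0;
}
```

For `bv_u16_cmp_leq_v_v_b`, this yields return width `bv`, data type `u16`, instruction `cmp_leq`, argument widths `v` and `v`, and predicate `b`.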

Built-in Vector Reduction Intrinsics

Vector reduction intrinsics provide an easy way to compute the sum, product, minimum, maximum, argmin, and argmax of a vector. The vector is reduced to a single value, which is then broadcast to all lanes of the result vector.

The table below describes the available built-in reduction intrinsics for different data types.

Table 3: Built-in Reduction Intrinsics

Reduction Intrinsics                                    Description
float64 v_f32_reduce_add(float64 x)                     Summation of all elements of the F32 vector
float64 v_f32_reduce_mul(float64 x)                     Product of all elements of the F32 vector
float64 v_f32_reduce_min(float64 x)                     Minimum value of all elements of the F32 vector
float64 v_f32_reduce_max(float64 x)                     Maximum value of all elements of the F32 vector
uint64_float64_pair_t v_f32_reduce_argmin(float64 x)    Index of the minimum value of all elements of the F32 vector
uint64_float64_pair_t v_f32_reduce_argmax(float64 x)    Index of the maximum value of all elements of the F32 vector
int64 v_i32_reduce_add(int64 x)                         Summation of all elements of the I32 vector
int64 v_i32_reduce_max(int64 x)                         Maximum value of all elements of the I32 vector
uint64_int64_pair_t v_i32_reduce_argmin(int64 x)        Index of the minimum value of all elements of the I32 vector
uint64_int64_pair_t v_i32_reduce_argmax(int64 x)        Index of the maximum value of all elements of the I32 vector
bfloat128 v_bf16_reduce_add(bfloat128 x)                Summation of all elements of the BF16 vector
bfloat128 v_bf16_reduce_min(bfloat128 x)                Minimum value of all elements of the BF16 vector
bfloat128 v_bf16_reduce_max(bfloat128 x)                Maximum value of all elements of the BF16 vector
short128 v_i16_reduce_min(short128 x)                   Minimum value of all elements of the I16 vector
short128 v_i16_reduce_max(short128 x)                   Maximum value of all elements of the I16 vector
char256 v_i8_reduce_min(char256 x)                      Minimum value of all elements of the I8 vector
char256 v_i8_reduce_max(char256 x)                      Maximum value of all elements of the I8 vector
uchar256 v_u8_reduce_min(uchar256 x)                    Minimum value of all elements of the U8 vector
uchar256 v_u8_reduce_max(uchar256 x)                    Maximum value of all elements of the U8 vector
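The reduce-then-broadcast behaviour can be modelled on the host as shown below. This is a plain C sketch, not TPC-C. The `model_*` functions and `argmax_pair_model` are hypothetical; in particular, the layout of the pair result (an index vector plus a value vector), the first-occurrence tie rule, and the left-to-right summation order are all assumptions that the table does not specify.

```c
#include <assert.h>

#define LANES 64 /* assumption: a float64 vector holds 64 lanes of 32 bits */

/* Hypothetical model of the pair result: index and value in every lane. */
typedef struct {
    unsigned int index[LANES];
    float value[LANES];
} argmax_pair_model;

/* Sums all lanes, then writes the sum to every lane of the result. */
static void model_f32_reduce_add(const float x[LANES], float out[LANES]) {
    float sum = 0.0f;
    assert(x != 0 && out != 0);
    for (int lane = 0; lane < LANES; lane++)
        sum += x[lane]; /* summation order is an assumption and affects rounding */
    for (int lane = 0; lane < LANES; lane++)
        out[lane] = sum;
}

/* Finds the first lane holding the maximum, then broadcasts index and value. */
static argmax_pair_model model_f32_reduce_argmax(const float x[LANES]) {
    argmax_pair_model r;
    unsigned int best = 0;
    assert(x != 0);
    for (unsigned int lane = 1; lane < LANES; lane++)
        if (x[lane] > x[best])
            best = lane;
    for (int lane = 0; lane < LANES; lane++) {
        r.index[lane] = best;
        r.value[lane] = x[best];
    }
    return r;
}
```

Since the result is the same in every lane, a caller can read it from any lane of the returned vector.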

Exceptions to C99 standard

Initialization of Bool256 Variable

The compiler treats bool256 as an array of 32 chars. To initialize all bits of the array to one, use the following syntax:

bool256 a = {0xff};
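This syntax is listed as an exception because, in standard C99, a partial initializer list sets only the listed elements and zero-fills the rest. The host-side C sketch below shows what the same initializer does to an ordinary 32-byte array under C99 rules, and contrasts it with an explicit fill. The function names are hypothetical, and the array is only a stand-in for bool256.

```c
#include <assert.h>
#include <string.h>

#define BOOL256_BYTES 32 /* 256 bits stored as 32 chars */

/* Counts the bits that are set in a byte buffer. */
static int count_set_bits(const unsigned char *bytes, int len) {
    int bits = 0;
    assert(bytes != 0 && len >= 0);
    for (int i = 0; i < len; i++)
        for (unsigned char b = bytes[i]; b != 0; b >>= 1)
            bits += b & 1;
    return bits;
}

/* Standard C99: {0xff} sets element 0 only; the other 31 bytes become zero. */
static int bits_after_c99_initializer(void) {
    unsigned char a[BOOL256_BYTES] = {0xff};
    return count_set_bits(a, BOOL256_BYTES);
}

/* Setting all 256 bits in standard C needs an explicit fill. */
static int bits_after_explicit_fill(void) {
    unsigned char a[BOOL256_BYTES];
    memset(a, 0xff, sizeof a);
    return count_set_bits(a, BOOL256_BYTES);
}
```

Under C99 rules the initializer sets 8 of the 256 bits, whereas the explicit fill sets all 256, which is the result the TPC-C syntax above is documented to produce.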

Initialization of Local Memory

According to C99, “If an object that has static or thread storage duration is not initialized explicitly and if it has arithmetic type, it is initialized to (positive or unsigned) zero”.

For performance reasons, local memory is left uninitialized at the start of a program, even though it has static storage duration.