TPC-C Language
On this Page
TPC-C Language¶
The TPC compiler accepts a derivative of the C99 standard C language as input.
Built-in Types¶
The language now has several new built-in types to support the SIMD capabilities of the processor. The table below describes them.
Table 1: Extended TPC Data Types
Type |
Description |
---|---|
tensor |
Opaque handle pointing to a tensor object. |
int5 |
5-dimensional Cartesian coordinates pointing into a tensor. |
float64 |
64-element vector; each element is a 4-byte single precision value. |
bfloat128 |
128-element vector; each element is a 2-byte floating point value. |
ushort128 |
128-element vector; each element is a 2-byte signed/unsigned integer value. |
int64 |
64-element vector; each element is a 4-byte integer value. |
uint64 |
64-element vector; each element is a 4-byte unsigned integer value. |
char256 |
256-element vector; each element is a 1-byte integer value. |
uchar256 |
256-element vector; each element is a 1-byte integer value. |
bool256 |
256-element vector; each element is a 1-bit value. Only logical operations are supported with this type. |
Global Memory Space¶
The global memory space maps to memory external to the TPC processor.
The following apply to global memory:
Tensor objects are always nested in global memory.
Only built-in
gen_addr
intrinsics can initialize pointers to global memory. They are immutable. Their pointed address cannot be changed after initialization.The
__global__
address space qualifier appends pointers to global memory.Only pointers to scalar data types can be initialized.
Global memory is not coherent. Call the
aso
intrinsic when performing a read-after-write operation.Global memory cannot be statically allocated at compile time nor dynamically allocated using C runtime functions, such as malloc/free. Intel Gaudi runtime pre-allocates tensors before program execution.
__local__ int localArray[5];
void main (tensor t1)
{
int5 offset = {0,1,2,3,3};
__global__ int* pointer = a_gen_addr_i_b(t1, offset);
int tmp = *(pointer);
tmp = tmp + localArray[0];
*pointer = tmp;
}
// Illegal syntax global pointer cannot point to local memory.
__global__ int* pointer = &(localArray[1]);
// Illegal declaration - the program cannot statically allocate global
memory.
__global__ int64 array [64];
Local Memory Space¶
The local memory space is a private, adjacent memory space to the TPC processor memory. Each TPC processor has its own copy of local memory. Local memory offers improved latency and bandwidth on repetitive read/write operations.
Local memory is statically allocated at compile time through definition of global variables bearing the __local__ address space qualifier. See the example below for reference. The following apply to local memory:
Local memory is sequentially consistent with program instance execution. Read-after‑write memory barrier instructions are not needed.
Local memory can only be allocated statically at compile time.
There are two banks of local memory:
The local memory size for scalar types is 1 KB.
The local memory size for vector types is either 16 KB or 80 KB. If the program uses special functions, the available VLM size is reduced to 16 KB.
__local__ float64 polynom_constants[3];
void main(tensor inputA, tensor inputB, tensor outputC)
{
int5 targetCoord = { 0 };
polynom_constants[0]=v_f32_ld_tnsr_i_b(targetCoord,inputA,1,0);
targetCoord[0] += 1;
polynom_constants[1]=v_f32_ld_tnsr_i_b(targetCoord,inputA,1,0);
targetCoord[0] += 1;
polynom_constants[2]=v_f32_ld_tnsr_i_b(targetCoord,inputA,1,0);
// use 'polynom_constants’ here
}
Built-in Global Variables¶
The built-in global variables can be accessed using the following functions:
LFSR global variable - Accessing this register causes a destructive read. Reading from this variable yields a different pseudo-random result on each access.
char256 lfsr = read_lfsr_b();
It is possible to write to this variable, therefore seeding is supported.
volatile char256 LFSR; volatile int S_LFSR;
Accessing these registers (LFSR is for vector, S_LFSR is for scalar) causes a destructive read. Reading from this variable yields a different, uniformly pseudo-random result on each access. It is possible to write to this variable, seeding is therefore supported.
char256 seed; write_lfsr_b(seed);
LANE_ID global variables - 3 read only vector registers that are assigned with the lane IDs of 8-bit, 16-bit, and 32-bit vectors.
const volatile char256 LFSR_NO_CHANGE; const volatile int S_LFSR_NO_CHANGE
These variables (LFSR_NO_CHANGE is for vector, S_LFSR_NO_CHANGE is for scalar) return the next value to be returned from LFSR. Reading from the variable does not affect LFSR content. Updated the doc, fixed some errors and added a few contents.
uint64 lane_id_32 = read_lane_id_4b_b();
read_lane_id_4b_b() retuns a 32-bit-wide lane_id vector.
This vector has 64 elements, 32 bits each, and assigned with 0,1,2… and gets to 63.
0
1
2
…
61
62
63
This function generates a predicate bitmask for a single element in a 32-bit-wide vector type. For example:
bool256 mask = bv_u32_cmp_eq_v_s(lane_id_32, j); // 0 <= j <= 63
In conjunction with move:
float64 tmpV = v_f32_mov_s_vb(tmpS, tmpV, mask, 0);
ushort128 lane_id_16 = read_lane_id_2b_b();
read_lane_id_2b_b() retuns a 16-bit-wide lane_id vector.
This vector has 128 elements, 16 bits each, and assigned with 0,1,2… and gets to 127.
0
1
2
…
125
126
127
uchar256 lane_id_8 = read_lane_id_1b_b();
read_lane_id_1b_b() retuns a 8-bit-wide lane_id vector.
This vector has 256 elements, 8 bits each, and assigned with 0,1,2… and gets to 255.
0
1
2
…
253
254
255