TPC-C Language¶

The TPC compiler accepts a derivative of the C99 standard C language as input.

Built-in Types¶

The language now has several new built-in types to support the SIMD capabilities of the processor. The table below describes them.

Table 1: Extended TPC Data Types

Type	Description
tensor	Opaque handle pointing to a tensor object.
int5	5-dimensional Cartesian coordinates pointing into a tensor.
float64	64-element vector; each element is a 4-byte single precision value.
bfloat128	128-element vector; each element is a 2-byte floating point value.
ushort128	128-element vector; each element is a 2-byte signed/unsigned integer value.
int64	64-element vector; each element is a 4-byte integer value.
uint64	64-element vector; each element is a 4-byte unsigned integer value.
char256	256-element vector; each element is a 1-byte integer value.
uchar256	256-element vector; each element is a 1-byte integer value.
bool256	256-element vector; each element is a 1-bit value. Only logical operations are supported with this type.

Global Memory Space¶

The global memory space maps to memory external to the TPC processor.

The following apply to global memory:

Tensor objects are always nested in global memory.
Only built-in gen_addr intrinsics can initialize pointers to global memory. They are immutable. Their pointed address cannot be changed after initialization.
The __global__ address space qualifier appends pointers to global memory.
Only pointers to scalar data types can be initialized.
Global memory is not coherent. Call the aso intrinsic when performing a read-after-write operation.
Global memory cannot be statically allocated at compile time nor dynamically allocated using C runtime functions, such as malloc/free. Intel Gaudi runtime pre-allocates tensors before program execution.

__local__ int localArray[5];
void main (tensor t1)

{
 int5 offset = {0,1,2,3,3};
 __global__ int* pointer = a_gen_addr_i_b(t1, offset);
 int tmp = *(pointer);
 tmp = tmp + localArray[0];
 *pointer = tmp;
}

// Illegal syntax global pointer cannot point to local memory.
__global__ int* pointer = &(localArray[1]);
// Illegal declaration - the program cannot statically allocate global
  memory.
__global__ int64 array [64];

Local Memory Space¶

The local memory space is a private, adjacent memory space to the TPC processor memory. Each TPC processor has its own copy of local memory. Local memory offers improved latency and bandwidth on repetitive read/write operations.

Local memory is statically allocated at compile time through definition of global variables bearing the __local__ address space qualifier. See the example below for reference. The following apply to local memory:

Local memory is sequentially consistent with program instance execution. Read-after‑write memory barrier instructions are not needed.
Local memory can only be allocated statically at compile time.
There are two banks of local memory:
- The local memory size for scalar types is 1 KB.
- The local memory size for vector types is either 16 KB or 80 KB. If the program uses special functions, the available VLM size is reduced to 16 KB.

__local__ float64 polynom_constants[3];

void main(tensor inputA, tensor inputB, tensor outputC)
{
    int5 targetCoord = { 0 };
    polynom_constants[0]=v_f32_ld_tnsr_i_b(targetCoord,inputA,1,0);
    targetCoord[0] += 1;
    polynom_constants[1]=v_f32_ld_tnsr_i_b(targetCoord,inputA,1,0);
    targetCoord[0] += 1;
    polynom_constants[2]=v_f32_ld_tnsr_i_b(targetCoord,inputA,1,0);
    // use 'polynom_constants’ here
}

Built-in Global Variables¶

The built-in global variables can be accessed using the following functions:

Accessing these registers (LFSR is for vector, S_LFSR is for scalar) causes a destructive read. Reading from this variable yields a different, uniformly pseudo-random result on each access. It is possible to write to this variable, seeding is therefore supported.
char256 seed; write_lfsr_b(seed);

These variables (LFSR_NO_CHANGE is for vector, S_LFSR_NO_CHANGE is for scalar) return the next value to be returned from LFSR. Reading from the variable does not affect LFSR content. Updated the doc, fixed some errors and added a few contents.
uint64 lane_id_32 = read_lane_id_4b_b();
- read_lane_id_4b_b() retuns a 32-bit-wide lane_id vector.
  
  This vector has 64 elements, 32 bits each, and assigned with 0,1,2… and gets to 63.
  
  0
  
  1
  
  2
  
  …
  
  61
  
  62
  
  63
  
  This function generates a predicate bitmask for a single element in a 32-bit-wide vector type. For example:
  bool256 mask = bv_u32_cmp_eq_v_s(lane_id_32, j); // 0 <= j <= 63
  In conjunction with move:
  float64 tmpV = v_f32_mov_s_vb(tmpS, tmpV, mask, 0);
  ushort128 lane_id_16 = read_lane_id_2b_b();
- read_lane_id_2b_b() retuns a 16-bit-wide lane_id vector.
  
  This vector has 128 elements, 16 bits each, and assigned with 0,1,2… and gets to 127.
  
  0
  
  1
  
  2
  
  …
  
  125
  
  126
  
  127
  uchar256 lane_id_8 = read_lane_id_1b_b();
- read_lane_id_1b_b() retuns a 8-bit-wide lane_id vector.
  
  This vector has 256 elements, 8 bits each, and assigned with 0,1,2… and gets to 255.
  
  0
  
  1
  
  2
  
  …
  
  253
  
  254
  
  255

Gaudi Documentation 1.15.1 documentation

TPC-C Language

On this Page

TPC-C Language¶

Built-in Types¶

Global Memory Space¶

Local Memory Space¶

Built-in Global Variables¶