TPC-C Language

The TPC compiler accepts a derivative of the C99 standard C language as input.

Built-in Types

The language now has several new built-in types to support the SIMD capabilities of the processor. The table below describes them.

Table 1: Extended TPC Data Types

Type

Description

tensor

Opaque handle pointing to a tensor object.

int5

5-dimensional Cartesian coordinates pointing into a tensor.

float64

64-element vector; each element is a 4-byte single precision value.

bfloat128

128-element vector; each element is a 2-byte floating point value.

ushort128

128-element vector; each element is a 2-byte signed/unsigned integer value.

int64

64-element vector; each element is a 4-byte integer value.

uint64

64-element vector; each element is a 4-byte unsigned integer value.

char256

256-element vector; each element is a 1-byte integer value.

uchar256

256-element vector; each element is a 1-byte integer value.

bool256

256-element vector; each element is a 1-bit value. Only logical operations are supported with this type.

Global Memory Space

The global memory space maps to memory external to the TPC processor.

The following apply to global memory:

  • Tensor objects are always nested in global memory.

  • Only built-in gen_addr intrinsics can initialize pointers to global memory. They are immutable. Their pointed address cannot be changed after initialization.

  • The __global__ address space qualifier appends pointers to global memory.

  • Only pointers to scalar data types can be initialized.

  • Global memory is not coherent. Call the aso intrinsic when performing a read-after-write operation.

  • Global memory cannot be statically allocated at compile time nor dynamically allocated using C runtime functions, such as malloc/free. Intel Gaudi runtime pre-allocates tensors before program execution.

__local__ int localArray[5];
void main (tensor t1)

{
 int5 offset = {0,1,2,3,3};
 __global__ int* pointer = a_gen_addr_i_b(t1, offset);
 int tmp = *(pointer);
 tmp = tmp + localArray[0];
 *pointer = tmp;
}

// Illegal syntax global pointer cannot point to local memory.
__global__ int* pointer = &(localArray[1]);
// Illegal declaration - the program cannot statically allocate global
  memory.
__global__ int64 array [64];

Local Memory Space

The local memory space is a private, adjacent memory space to the TPC processor memory. Each TPC processor has its own copy of local memory. Local memory offers improved latency and bandwidth on repetitive read/write operations.

Local memory is statically allocated at compile time through definition of global variables bearing the __local__ address space qualifier. See the example below for reference. The following apply to local memory:

  • Local memory is sequentially consistent with program instance execution. Read-after‑write memory barrier instructions are not needed.

  • Local memory can only be allocated statically at compile time.

  • There are two banks of local memory:

    • The local memory size for scalar types is 1 KB.

    • The local memory size for vector types is either 16 KB or 80 KB. If the program uses special functions, the available VLM size is reduced to 16 KB.

__local__ float64 polynom_constants[3];

void main(tensor inputA, tensor inputB, tensor outputC)
{
    int5 targetCoord = { 0 };
    polynom_constants[0]=v_f32_ld_tnsr_i_b(targetCoord,inputA,1,0);
    targetCoord[0] += 1;
    polynom_constants[1]=v_f32_ld_tnsr_i_b(targetCoord,inputA,1,0);
    targetCoord[0] += 1;
    polynom_constants[2]=v_f32_ld_tnsr_i_b(targetCoord,inputA,1,0);
    // use 'polynom_constants’ here
}

Built-in Global Variables

The built-in global variables can be accessed using the following functions:

  • Accessing these registers (LFSR is for vector, S_LFSR is for scalar) causes a destructive read. Reading from this variable yields a different, uniformly pseudo-random result on each access. It is possible to write to this variable, seeding is therefore supported.

    char256 seed;
    write_lfsr_b(seed);
    
  • These variables (LFSR_NO_CHANGE is for vector, S_LFSR_NO_CHANGE is for scalar) return the next value to be returned from LFSR. Reading from the variable does not affect LFSR content. Updated the doc, fixed some errors and added a few contents.

    uint64 lane_id_32 = read_lane_id_4b_b();

    • read_lane_id_4b_b() retuns a 32-bit-wide lane_id vector.

      This vector has 64 elements, 32 bits each, and assigned with 0,1,2… and gets to 63.

      0

      1

      2

      61

      62

      63

      This function generates a predicate bitmask for a single element in a 32-bit-wide vector type. For example:

      bool256 mask = bv_u32_cmp_eq_v_s(lane_id_32, j); // 0 <= j <= 63
      

      In conjunction with move:

      float64 tmpV = v_f32_mov_s_vb(tmpS, tmpV, mask, 0);
      
      ushort128 lane_id_16 = read_lane_id_2b_b();
      
    • read_lane_id_2b_b() retuns a 16-bit-wide lane_id vector.

      This vector has 128 elements, 16 bits each, and assigned with 0,1,2… and gets to 127.

      0

      1

      2

      125

      126

      127

      uchar256 lane_id_8 = read_lane_id_1b_b();
      
    • read_lane_id_1b_b() retuns a 8-bit-wide lane_id vector.

      This vector has 256 elements, 8 bits each, and assigned with 0,1,2… and gets to 255.

      0

      1

      2

      253

      254

      255