Processor Architectural Overview

Instruction Slots and Processor Pipeline

The TPC processor has four execution slots:

Load slot - loads from memory, moves and sets values.
SPU slot - performs scalar arithmetic.
VPU slot - performs vector arithmetic.
Store slot - stores to memory, moves and sets values.

Figure 26 Example TPC Instruction Assembly – LOAD, SPU, VPU and STORE

TPC has an exposed pipeline architecture. Each instruction has a predefined latency; four cycles is the most common. An instruction's result becomes visible to software as soon as that latency has elapsed, and not before.

For example, the latency of the multiplication instruction (MUL) is four cycles. In this case, the following code is legal:

Initial values are V0 = 0, V1 = 1, V2 = 2.
⇓ MUL V0, V1, V2 // V0 = V1*V2 -> V0 == 2 once the four-cycle latency has elapsed.
⇓ MUL V3, V0, 4 // V3 is equal to 0. V0 has not yet been updated.
⇓ MUL V4, V0, 4 // V4 is equal to 0. V0 has not yet been updated.
⇓ MUL V5, V0, 4 // V5 is equal to 0. V0 has not yet been updated.
⇓ MUL V6, V0, 4 // V6 is equal to 8. The first multiplication result is visible.
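
The sequence above can be reproduced with a small software model. The following is only an illustrative sketch, not a description of the hardware: it assumes one instruction is issued per cycle, that operands are read at issue time, and that every instruction is a MUL with the same latency. The names `run`, `program` and `MUL_LATENCY` are hypothetical.

```python
# Toy model of an exposed pipeline: a write issued at cycle t becomes
# visible to instructions issued at cycle t + latency, not before.
MUL_LATENCY = 4  # cycles, per the text above


def run(program, regs, latency=MUL_LATENCY):
    """Issue one MUL per cycle and return the final register values.

    program: list of (dst, src_a, src_b); a source is a register name or an int.
    regs: dict of initial register values (not modified).
    """
    regs = dict(regs)
    pending = []  # (ready_cycle, dst, value) for results still in flight
    for cycle, (dst, a, b) in enumerate(program):
        # Commit every result whose latency has elapsed by this cycle.
        for ready, d, v in [p for p in pending if p[0] <= cycle]:
            regs[d] = v
        pending = [p for p in pending if p[0] > cycle]
        # Operands are read at issue time, so stale values are possible.
        va = regs[a] if isinstance(a, str) else a
        vb = regs[b] if isinstance(b, str) else b
        pending.append((cycle + latency, dst, va * vb))
    # Drain the pipeline after the last instruction has been issued.
    for _, d, v in sorted(pending):
        regs[d] = v
    return regs


program = [
    ("V0", "V1", "V2"),
    ("V3", "V0", 4),
    ("V4", "V0", 4),
    ("V5", "V0", 4),
    ("V6", "V0", 4),
]
final = run(program, {"V0": 0, "V1": 1, "V2": 2})
print(final)
```

In this model V3, V4 and V5 are computed from the old value of V0, and only V6 sees the product, which matches the comments in the listing above.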

Predication

All instructions in the TPC core can be predicated. Each VLIW slot is predicated in a different way:

The SPU and Store slots support only scalar predication.
The VPU and Load slots can be predicated either by a single scalar value or by a bit array that masks individual vector elements.

Predication is exposed to the TPC-C programmer through intrinsics.
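
The difference between the two predication modes can be sketched in plain Python. This is a conceptual model only and does not use the TPC-C intrinsics; the function name `predicated_add` is hypothetical, and the sketch assumes that elements whose predicate bit is off keep the destination's previous value.

```python
def predicated_add(dst, a, b, pred):
    """Return a + b element-wise, written over a copy of dst under pred.

    pred is either a single bool (scalar predication: all or nothing)
    or a list of bools, one per element (vector predication: a bit mask).
    Assumption: elements that are masked off keep their old dst value.
    """
    if isinstance(pred, bool):
        pred = [pred] * len(dst)
    if not (len(dst) == len(a) == len(b) == len(pred)):
        raise ValueError("operands and predicate must have the same length")
    return [x + y if p else old for old, x, y, p in zip(dst, a, b, pred)]


dst = [0, 0, 0, 0]
a = [1, 2, 3, 4]
b = [10, 20, 30, 40]

all_on = predicated_add(dst, a, b, True)    # scalar predicate, enabled
all_off = predicated_add(dst, a, b, False)  # scalar predicate, disabled
masked = predicated_add(dst, a, b, [True, False, True, False])  # per-element mask
```

With a scalar predicate the whole operation either happens or is skipped; with a bit array only the selected elements are updated.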

Memory Spaces

The TPC processor has four memory spaces:

Scalar Local Memory
Vector Local Memory
Global Memory
Configuration space

Global Memory

Global memory is accessed using dedicated accessors called tensors. For more details about tensors, see TPC Programming Model.

Global memory is not coherent with program execution. A program that reads back data it has written must therefore issue an atomic semaphore operation between the write and the read, to guarantee that the written value is visible before it is read. On average, one 2,048-bit vector can be loaded from or written to global memory every four cycles.
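
The average rate quoted above works out to 64 bytes per cycle. The short calculation below shows the arithmetic; the helper `estimate_cycles` is hypothetical and gives only a rough estimate based on that average, ignoring any other latency or contention.

```python
VECTOR_BITS = 2048      # vector width, from the text above
CYCLES_PER_VECTOR = 4   # average interval between vector accesses, from the text above

bytes_per_vector = VECTOR_BITS // 8                     # 256 bytes
bytes_per_cycle = bytes_per_vector / CYCLES_PER_VECTOR  # 64.0 bytes per cycle


def estimate_cycles(num_bytes):
    """Rough cycle estimate for moving num_bytes to or from global memory."""
    vectors = -(-num_bytes // bytes_per_vector)  # ceiling division
    return vectors * CYCLES_PER_VECTOR
```

For instance, `estimate_cycles(1024)` covers four full vectors and therefore evaluates to 16 cycles under this model.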

Local Memory

Each TPC processor has its own instance of local memory and can access only that instance. That is, TPC A cannot access TPC B's local memory.

Local memory is coherent with program execution and is divided into two banks:

Scalar local memory:
- Size is 1 KB.
- Reads and writes to this memory are allowed in aligned 4-byte chunks.
Vector local memory:
- Size is 80 KB. If the program utilizes special functions such as tanh, sin, or cos, only 16 KB is available.
- Reads and writes to this memory are allowed in aligned 128-byte or 256-byte chunks.

Local memory can be either read from or written to on every cycle with no bandwidth constraint.
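
The size and alignment rules listed above can be expressed as simple checks. This is a minimal sketch with hypothetical helper names; it assumes addresses are byte offsets from the start of each bank and that an access must be aligned to its own chunk size.

```python
SCALAR_LOCAL_BYTES = 1 * 1024           # scalar local memory size
VECTOR_LOCAL_BYTES = 80 * 1024          # vector local memory size
VECTOR_LOCAL_BYTES_SPECIAL = 16 * 1024  # available when special functions are used


def is_valid_scalar_access(addr):
    """True if a 4-byte access at byte offset addr is aligned and in range."""
    return addr % 4 == 0 and 0 <= addr <= SCALAR_LOCAL_BYTES - 4


def is_valid_vector_access(addr, chunk=256, uses_special_functions=False):
    """True if a chunk-sized access at byte offset addr is aligned and in range.

    Assumption: alignment equals the chunk size (128 or 256 bytes).
    """
    if chunk not in (128, 256):
        raise ValueError("vector local memory is accessed in 128- or 256-byte chunks")
    limit = VECTOR_LOCAL_BYTES_SPECIAL if uses_special_functions else VECTOR_LOCAL_BYTES
    return addr % chunk == 0 and 0 <= addr <= limit - chunk


# Number of 256-byte vectors that fit in vector local memory.
vectors_total = VECTOR_LOCAL_BYTES // 256
vectors_with_special = VECTOR_LOCAL_BYTES_SPECIAL // 256
```

Under these assumptions the full bank holds 320 vectors of 256 bytes, and 64 when special functions are in use, which is worth keeping in mind when sizing local buffers.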

Configuration Space

The TPC configuration space holds the definitions required to execute a program, such as tensor descriptors and the program binary location. The TPC Programming Reference Manual further describes the structure of the configuration space. Under normal circumstances, a program should not modify the content of the configuration space.

Gaudi Documentation 1.21.1 documentation
