TPC Coherency¶

TPC has a scalar data-cache to keep coherency between its scalar load/store accesses to global memory, but it does not have a vector data-cache for its vector accesses to global memory. Vector loads are issued in scalar-pipe, a special HW mechanism hiding the global memory latency, and the data is read later during the vector-pipe. If a load arrives in the vector pipe before the data is ready, the vector pipe stalls. Vector stores are issued in vector-pipe after the vector data is ready. Therefore, store that comes after load will always keep data coherency (because it is issued only after the load data returns to the TPC). A vector load that comes in the code after a vector store to the same address is not coherent. It is unknown whether the load data that returns is the old data (before the store) or the new data (after the store, as it should). In that case, SPU pipe must be stalled until the vector store is complete (the data is written back to the global memory). The ASO (Atomic Semaphore Operation) instruction performs the stall. It ensures the TPC commits all older writes (at vector pipe) prior to updating the semaphore (at the scalar pipe). When coherency between vector load and scalar store accesses is required, use explicit fencing (cache_invalidate).

The following table summarizes all global load and store instructions, and the HW support for coherency in each case.

Note

st_tnsr* refers to st_tnsr, st_tnsr_low, and st_tnsr_high.

Table 4: Different Coherency Cases

Older instruction	Younger instruction	Coherency kept by HW	Comment
ld_tnsr/ ld_g to vrf	st_tnsr*	Yes	Pipeline structure (ld_tnsr will always retire before younger st_tnsr is issued)
st_tnsr*	ld_tnsr/ ld_g to vrf	No
st_tnsr*	ld_g scalar/ prefetch	No

Gaudi Documentation 1.21.1 documentation

TPC Coherency

TPC Coherency¶