Gaudi Architecture

The Gaudi architecture comprises three main subsystems (compute, memory, and networking) and is designed from the ground up to accelerate deep learning (DL) training workloads.

The compute architecture is heterogeneous and includes two compute engines: a Matrix Multiplication Engine (MME) and a cluster of fully programmable Tensor Processor Cores (TPCs). The MME handles all operations that can be lowered to matrix multiplication (fully connected layers, convolutions, batched GEMM), while the TPC, a VLIW SIMD processor tailor-made for deep learning operations, accelerates everything else.
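
To make "lowered to matrix multiplication" concrete, the sketch below rewrites a small 2-D convolution as a single matrix product using the generic im2col technique. It is an illustration only: the function names are hypothetical, it uses plain Python lists with stride 1 and no padding, and it does not describe how the Gaudi software stack actually performs the lowering.

```python
def im2col(image, kh, kw):
    """Unfold every kh x kw patch of a 2-D image into one row (stride 1, no padding)."""
    h, w = len(image), len(image[0])
    rows = []
    for i in range(h - kh + 1):
        for j in range(w - kw + 1):
            rows.append([image[i + di][j + dj] for di in range(kh) for dj in range(kw)])
    return rows


def matmul(a, b):
    """Plain matrix product of two nested-list matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def conv2d_direct(image, kernel):
    """Reference sliding-window convolution (cross-correlation, as DL frameworks define it)."""
    kh, kw = len(kernel), len(kernel[0])
    h, w = len(image), len(image[0])
    return [
        [
            sum(image[i + di][j + dj] * kernel[di][dj] for di in range(kh) for dj in range(kw))
            for j in range(w - kw + 1)
        ]
        for i in range(h - kh + 1)
    ]


def conv2d_as_gemm(image, kernel):
    """Same convolution, expressed as (patch matrix) x (flattened kernel)."""
    kh, kw = len(kernel), len(kernel[0])
    out_w = len(image[0]) - kw + 1
    patches = im2col(image, kh, kw)                 # (out_h * out_w) x (kh * kw)
    weights = [[v] for row in kernel for v in row]  # (kh * kw) x 1
    flat = [r[0] for r in matmul(patches, weights)]
    # Fold the flat result back into out_h rows of out_w values.
    return [flat[k:k + out_w] for k in range(0, len(flat), out_w)]


image = [[1, 2, 3, 0], [4, 5, 6, 1], [7, 8, 9, 2]]
kernel = [[1, 0], [0, -1]]
assert conv2d_as_gemm(image, kernel) == conv2d_direct(image, kernel)
print(conv2d_as_gemm(image, kernel))  # [[-4, -4, 2], [-4, -4, 4]]
```

Once the convolution is in this form, all of the arithmetic sits in one matrix product, which is the kind of work the text assigns to the MME.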

The TPC cluster is supported by its own development tools and libraries, and the MME is configurable. Each TPC is a VLIW SIMD processor with an instruction set and hardware tailored to serve training workloads efficiently. It is programmable, giving the user maximum flexibility to innovate, and offers many workload-oriented features, such as:

  • GEMM operation acceleration

  • Tensor addressing

  • Latency hiding capabilities

  • Random number generation

  • Advanced implementation of special functions

Gaudi is the first DL training processor with RDMA over Converged Ethernet (RoCE v2) engines integrated on-chip. These engines play a critical role in the inter-processor communication needed during training. Native RoCE integration lets customers use the same scaling technology both inside the server and rack (scale-up) and across racks (scale-out). Gaudi processors can be connected to each other directly or through any number of standard Ethernet switches.

First-gen Gaudi Processor

The first-gen Gaudi processor offers up to 2 Tb/s of networking bandwidth through ten natively integrated on-chip 100 Gb Ethernet ports with RDMA over Converged Ethernet (RoCE). The memory architecture includes 24 MB of on-die SRAM and 1 TB/s of local-memory bandwidth in each TPC. In addition, the chip package integrates four HBM devices with a total capacity of 32 GB. The PCIe interface serves as the host interface and supports both Gen 3.0 and Gen 4.0 modes.

The TPC natively supports the following data types: FP32, BF16, INT32, INT16, INT8, UINT32, UINT16, and UINT8.
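
As background on two of the listed types: BF16 keeps the sign bit and 8-bit exponent of FP32 but only 7 fraction bits, so an FP32 value can be converted by rounding away its low 16 bits. The sketch below shows that conversion with round-to-nearest-even. The helper names are hypothetical, it handles finite inputs only, and it is a host-side illustration in Python rather than TPC code.

```python
import struct


def fp32_bits(x):
    """Return the 32-bit pattern of x stored as an IEEE 754 single-precision float."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def bits_to_fp32(bits):
    """Interpret a 32-bit pattern as an IEEE 754 single-precision float."""
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def to_bf16(x):
    """Round a finite value to BF16 precision and return it as a Python float.

    BF16 is the upper 16 bits of the FP32 encoding, so rounding means
    adding a bias to the low 16 bits and then clearing them.
    """
    bits = fp32_bits(x)
    lowest_kept_bit = (bits >> 16) & 1
    bias = 0x7FFF + lowest_kept_bit  # ties go to the even neighbour
    return bits_to_fp32((bits + bias) & 0xFFFF0000)


print(to_bf16(1.0))         # 1.0
print(to_bf16(3.14159))     # 3.140625
print(to_bf16(1.00390625))  # 1.0 (exact tie, rounds to even)
```

The second result shows the cost of the shorter fraction: only about two to three significant decimal digits survive, while the exponent range stays the same as FP32.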


Figure 1 First-gen Gaudi Processor High-level Architecture

Gaudi2 Processor

The Gaudi2 processor offers 2.4 Tb/s of networking bandwidth through 24 natively integrated on-chip 100 Gbps RoCE v2 RDMA NICs, which enable inter-Gaudi communication via direct routing or standard Ethernet switching. The Gaudi2 memory subsystem includes 96 GB of HBM2E memory delivering 2.45 TB/s of bandwidth, in addition to 48 MB of local SRAM with sufficient bandwidth to allow the MME, TPCs, DMAs, and RDMA NICs to operate in parallel.
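
The networking figures quoted for the two generations can be sanity-checked from port count and line rate. The helper below is a hypothetical back-of-the-envelope calculation, not part of any Gaudi tooling. Note that ten 100 Gb ports give 1 Tb/s in one direction, so the first-gen "up to two terabits" figure matches only if both directions are counted; that reading is an assumption made here, not something the text states.

```python
def aggregate_tbps(ports, gbps_per_port, count_both_directions=False):
    """Aggregate line rate in Tb/s for `ports` links of `gbps_per_port` each."""
    one_way_gbps = ports * gbps_per_port
    total_gbps = 2 * one_way_gbps if count_both_directions else one_way_gbps
    return total_gbps / 1000


gaudi2 = aggregate_tbps(24, 100)
gaudi1_one_way = aggregate_tbps(10, 100)
# Assumption: the first-gen headline number counts transmit plus receive.
gaudi1_both_ways = aggregate_tbps(10, 100, count_both_directions=True)

print(gaudi2)            # 2.4
print(gaudi1_one_way)    # 1.0
print(gaudi1_both_ways)  # 2.0
```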

For vision applications specifically, Gaudi2 integrates media decoders that operate independently and can handle the entire pre-processing pipeline in all popular formats (HEVC, H.264, VP9, and JPEG), as well as the post-decode image transformations needed to prepare the data for the AI pipeline.

Gaudi2 supports all popular data types required for deep learning: FP32, TF32, BF16, FP16, and FP8 (both E4M3 and E5M2). In the MME, all data types are accumulated into an FP32 accumulator.
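
A small numerical experiment suggests why accumulating into FP32 matters when the inputs are low precision. The sketch below simulates an accumulator with a given number of fraction bits (7 for BF16, 23 for FP32) and sums the same inputs both ways. It is a generic floating-point illustration with hypothetical helper names; it ignores exponent-range limits and is not a model of the MME hardware.

```python
import math


def round_to_fraction_bits(x, fraction_bits):
    """Round x to a float with `fraction_bits` explicit fraction bits.

    Exponent range is ignored; Python's round() resolves ties to even.
    """
    if x == 0.0:
        return 0.0
    m, e = math.frexp(x)               # x = m * 2**e with 0.5 <= |m| < 1
    scale = 2 ** (fraction_bits + 1)   # implicit leading bit + fraction bits
    return math.ldexp(round(m * scale) / scale, e)


def accumulate(values, fraction_bits):
    """Sum values, rounding the running total after every addition."""
    acc = 0.0
    for v in values:
        acc = round_to_fraction_bits(acc + v, fraction_bits)
    return acc


# Every input is a power of two, so it is exactly representable in BF16.
values = [1.0] + [2.0 ** -10] * 1024   # exact sum is 2.0

print(accumulate(values, 7))    # 1.0 (BF16-width accumulator drops every small term)
print(accumulate(values, 23))   # 2.0 (FP32-width accumulator keeps them all)
```

With the narrow accumulator, each small addend is less than half the spacing between representable values near 1.0, so it is rounded away every time. The wide accumulator preserves the contributions even though the inputs themselves carry only BF16 precision.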


Figure 2 Gaudi2 Processor High-level Architecture