Gaudi Architecture

The Intel® Gaudi® AI accelerator architecture comprises three main subsystems (compute, memory, and networking) and is designed from the ground up to accelerate DL training workloads.

The compute architecture is heterogeneous and includes two compute engines – a Matrix Multiplication Engine (MME) and a fully programmable Tensor Processor Core (TPC) cluster. The MME handles all operations that can be lowered to matrix multiplication (fully connected layers, convolutions, batched GEMM), while the TPC, a VLIW SIMD processor tailor-made for deep learning operations, accelerates everything else.
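
As an illustration, the following sketch builds a small PyTorch model on the HPU device and notes in comments which engine each layer is typically lowered to. It assumes the Intel Gaudi PyTorch bridge (habana_frameworks.torch) is installed; the actual MME/TPC assignment is made by the graph compiler, not by user code.

  import torch
  import torch.nn as nn
  import habana_frameworks.torch.core as htcore  # registers the "hpu" device (Gaudi PyTorch bridge, assumed installed)

  device = torch.device("hpu")

  model = nn.Sequential(
      nn.Conv2d(3, 64, kernel_size=3, padding=1),  # convolution -> typically lowered to the MME
      nn.ReLU(),                                   # elementwise activation -> runs on the TPC cluster
      nn.Flatten(),
      nn.Linear(64 * 32 * 32, 10),                 # fully connected layer (GEMM) -> MME
  ).to(device)

  x = torch.randn(8, 3, 32, 32, device=device)
  out = model(x)
  out.sum().backward()
  htcore.mark_step()  # in lazy mode, triggers graph compilation and execution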

The TPC cluster ships with its associated development tools and libraries and works alongside the configurable Matrix Math engine (the MME). The TPC core is a VLIW SIMD processor with an instruction set and hardware tailored to serve training workloads efficiently. It is fully programmable, giving the user maximum flexibility to innovate, and couples this with many workload-oriented features, such as the following (illustrated in the sketch after this list):

  • GEMM operation acceleration

  • Tensor addressing

  • Latency hiding capabilities

  • Random number generation

  • Advanced implementation of special functions
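
A minimal sketch of TPC-resident work, again assuming the Intel Gaudi PyTorch bridge is installed: GELU exercises the special-function support and dropout the random number generation listed above; the engine assignment itself is made by the graph compiler.

  import torch
  import habana_frameworks.torch.core as htcore  # Gaudi PyTorch bridge (assumed installed)

  device = torch.device("hpu")
  x = torch.randn(1024, 1024, device=device)

  y = torch.nn.functional.gelu(x)                           # special function -> TPC
  y = torch.nn.functional.dropout(y, p=0.1, training=True)  # exercises TPC random number generation
  htcore.mark_step()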

Gaudi is the first DL training processor with RDMA over Converged Ethernet (RoCE v2) engines integrated on-chip. These engines play a critical role in the inter-processor communication needed during the training process. The native integration of RoCE allows customers to use the same scaling technology both inside the server and rack (scale-up) and across racks (scale-out). The ports can be connected directly between Gaudi processors or through any number of standard Ethernet switches.
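
From the framework's point of view, scale-up and scale-out traffic are reached the same way. The sketch below assumes the Intel Gaudi PyTorch bridge and its HCCL plugin are installed and that a launcher such as mpirun or torchrun starts one process per Gaudi card and provides the usual rank and world-size environment variables; whether the data stays inside the server or crosses Ethernet switches is transparent to the script.

  import torch
  import torch.distributed as dist
  import habana_frameworks.torch.core as htcore
  import habana_frameworks.torch.distributed.hccl  # registers the "hccl" backend (assumed installed)

  dist.init_process_group(backend="hccl")  # RANK/WORLD_SIZE/MASTER_ADDR come from the launcher
  device = torch.device("hpu")

  t = torch.ones(4, device=device) * dist.get_rank()
  dist.all_reduce(t)     # collective carried over the integrated RoCE v2 engines
  htcore.mark_step()
  print(dist.get_rank(), t.cpu())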

First-gen Gaudi Processor

The first-gen Gaudi processor offers up to 2 Tb/s of networking bandwidth through ten 100 Gb RDMA over Converged Ethernet ports integrated on-chip. The memory architecture includes 24 MB of on-die SRAM and 1 TB/s of bandwidth to the local memories in each TPC. In addition, the chip package integrates four HBM devices with a total of 32 GB of capacity. The PCIe interface provides the host interface and supports both generation 3.0 and 4.0 modes.

The TPC core natively supports the following data types: FP32, BF16, INT32, INT16, INT8, UINT32, UINT16 and UINT8.
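
In practice, mixed-precision training keeps weights in FP32 and runs matrix operations in BF16. Below is a minimal sketch using native PyTorch autocast on the HPU device, assuming a Gaudi PyTorch bridge release that supports autocast for device_type="hpu".

  import torch
  import torch.nn as nn
  import habana_frameworks.torch.core as htcore  # Gaudi PyTorch bridge (assumed installed)

  device = torch.device("hpu")
  model = nn.Linear(256, 256).to(device)         # weights stay in FP32
  x = torch.randn(32, 256, device=device)

  with torch.autocast(device_type="hpu", dtype=torch.bfloat16):
      out = model(x)                             # the GEMM executes in BF16

  out.float().sum().backward()
  htcore.mark_step()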

Figure 1: First-gen Gaudi Processor High-level Architecture

Gaudi 2 Processor

The Gaudi 2 processor offers 2.4 Tb/s of networking bandwidth through the native on-chip integration of 24 x 100 Gbps RoCE v2 RDMA NICs, which enable inter-Gaudi communication via direct routing or via standard Ethernet switching. The Gaudi 2 memory subsystem includes 96 GB of HBM2E memory delivering 2.45 TB/s of bandwidth, in addition to 48 MB of local SRAM with sufficient bandwidth to allow the MME, TPC, DMAs, and RDMA NICs to operate in parallel.
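
A quick way to confirm from a script which Gaudi generation and how many devices are visible is the HPU utility module; the calls below are a sketch that assumes the habana_frameworks.torch.hpu helper API is available.

  import habana_frameworks.torch.hpu as hthpu  # HPU utility API (assumed available)

  if hthpu.is_available():
      print("devices:", hthpu.device_count())
      print("name:", hthpu.get_device_name())  # reports the device generation, e.g. Gaudi 2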

Specifically for vision applications, Gaudi 2 integrates media decoders that operate independently and can handle the entire pre-processing pipeline in all popular formats – HEVC, H.264, VP9, and JPEG – as well as the post-decode image transformations needed to prepare the data for the AI pipeline.

Gaudi 2 supports all popular data types required for deep learning: FP32, TF32, BF16, FP16, and FP8 (both E4M3 and E5M2). In the MME, all data types are accumulated into an FP32 accumulator.

Figure 2: Gaudi 2 Processor High-level Architecture