TensorFlow Gaudi Integration Architecture

Overview

Habana integrates the TensorFlow framework with the SynapseAI compiler as a plugin: the integration library and its custom ops/kernels are loaded through tf.load_library and tf.load_op_library.
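
The following sketch only illustrates this loading mechanism; the library paths are hypothetical, and in practice the Habana Python module performs the loading for you (see Porting a Simple TensorFlow Model to Gaudi).

  import tensorflow as tf

  # Hypothetical library paths; the real ones are resolved by the Habana
  # integration module rather than hard-coded by the user.
  tf.load_library("/path/to/habana_device.so")                # registers the HPU device
  habana_ops = tf.load_op_library("/path/to/habana_ops.so")   # registers custom ops/kernels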

The framework integration includes three main components:

  • SynapseAI helpers

  • Device

  • Graph passes

The publicly available TensorFlow version can be used without any changes, allowing you to run models on Gaudi using this integration library. After you launch training on the HPU (Habana Processing Unit) with a few minor changes in your Python scripts (see Porting a Simple TensorFlow Model to Gaudi for details), the software stack compiles the graph and saves the resulting recipe to a cache. Unless the graph changes or a new graph arrives, no recompilation is needed during training.
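
A minimal sketch of that porting step, assuming the habana_frameworks.tensorflow package described in Porting a Simple TensorFlow Model to Gaudi, looks as follows:

  import tensorflow as tf
  # Assumption: the integration is exposed through habana_frameworks.tensorflow,
  # as described in Porting a Simple TensorFlow Model to Gaudi.
  from habana_frameworks.tensorflow import load_habana_module

  load_habana_module()  # registers the HPU device and Habana ops with TensorFlow

  # From here on, supported ops are placed on the HPU automatically.
  model = tf.keras.Sequential([tf.keras.layers.Dense(10)])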

The SynapseAI helpers library wraps common flows and constructs in an RAII-style interface and serves as a bridge between the framework and the SynapseAI library (C-API). The HPU integration registers Habana ops as TensorFlow custom ops on the HPU device. The SynapseAI helpers library also manages memory allocations on the device, mapping of host memory to the device, DMA transfers between the device and the host, and streams. It uses the TensorFlow BFC allocator for fast Gaudi memory allocation and deallocation.

Supported Data Types

Gaudi supports TensorFlow ops with the following data types:

  • FP32

  • BF16

  • Int32

  • Int8

  • Boolean

Data type support is specified during op registration in TensorFlow. For the list of TensorFlow ops currently supported on HPU, refer to TensorFlow Operators.

You can convert FP32 to the BF16 data type either manually in the Python model code or automatically for selected ops that can be computed in low precision. The automatic approach is similar to the Auto Mixed Precision conversion pass in TensorFlow. To enable automatic FP32-to-BF16 conversion, set the runtime environment variable TF_BF16_CONVERSION. For further details, see Runtime Environment Variables.
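
Both approaches are sketched below; the TF_BF16_CONVERSION value shown is only illustrative, and the accepted values are listed in Runtime Environment Variables.

  import os
  import tensorflow as tf

  # Automatic conversion: set the variable before the graph is built.
  # "1" is an illustrative value; see Runtime Environment Variables for the
  # actual accepted settings.
  os.environ["TF_BF16_CONVERSION"] = "1"

  # Manual conversion in the model code: cast selected tensors to bfloat16.
  x = tf.random.uniform([4, 4], dtype=tf.float32)
  x_bf16 = tf.cast(x, tf.bfloat16)                      # computed in low precision
  y = tf.cast(tf.matmul(x_bf16, x_bf16), tf.float32)    # cast back where FP32 is needed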

Graph Compilation

The TensorFlow framework controls most of the objects required for graph building and graph execution. SynapseAI allows users to create, compile, and launch graphs on the device. The Graph passes library optimizes the TensorFlow graph through Pattern Matching, Marking, Segmentation, and Encapsulation (PAMSEN). It is designed to manipulate the TensorFlow graph in order to maximally utilize Gaudi’s HW resources. Given a collection of graph nodes that have an implementation for Gaudi, PAMSEN tries to merge as many graph nodes as possible while maintaining graph correctness. By preserving graph semantics and automatically discovering subgraphs that can be fused into one entity, PAMSEN delivers performance that should be on a par with (or exceed) native TensorFlow performance. Like XLA (Accelerated Linear Algebra), PAMSEN takes graph_cycles and deadness_analysis into account when deciding which nodes to merge, to maintain graph correctness and to ensure the graph is not executed differently than expected.

In addition, the optimization pass determines op placement on devices (CPU or HPU), data precision downcasts (such as int64->int32 and FP32->BF16), and runtime constant folding. It also rewrites the TensorFlow size op to the Habana size op, converts TensorFlow collective ops to HPU collective ops, and adds control edges between collectives.

The HPU collective ops are implemented using the Habana Collective Communications Library (HCCL), which is used to perform communication among different Gaudi cards. For further details, see Habana Collective Communications Library (HCCL) API Reference.

Distributed training on Habana Gaudi cards is supported with Horovod and HPUStrategy. For more details about TensorFlow distributed training on Gaudi, see Distributed Training with TensorFlow.
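
The following is a minimal sketch of the HPUStrategy flow; the habana_frameworks.tensorflow.distribute import path is an assumption here, and Distributed Training with TensorFlow describes the supported setup (including the Horovod alternative).

  import tensorflow as tf
  from habana_frameworks.tensorflow import load_habana_module
  # Assumption: HPUStrategy is provided by the Habana distribute package;
  # refer to Distributed Training with TensorFlow for the exact API.
  from habana_frameworks.tensorflow.distribute import HPUStrategy

  load_habana_module()
  strategy = HPUStrategy()  # collectives between Gaudi cards run over HCCL

  with strategy.scope():
      model = tf.keras.Sequential([tf.keras.layers.Dense(10)])
      model.compile(optimizer="sgd", loss="mse")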

TensorFlow Keras

Keras is an open-source Python library that provides many common building blocks to ease the development of deep neural network code. See TensorFlow Keras for further details.

Delegating Computations to CPU

In some cases, such as unsupported tensor dimensionality, a subgraph collected by PAMSEN cannot be compiled for the accelerator. The computation is delegated to the CPU instead. In such cases, a warning is emitted to the logs:

2021-05-24 23:16:34.331557: W simple_fallback_runner.cpp:39] Delegating node=HABANA_GRAPH_SPECIFIC_NAME to CPU

Such situations can introduce a performance penalty. Computations moved to the CPU may also have their precision changed from bfloat16 to float32.
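
If a specific op is known to fall back, you can also place it on the CPU explicitly with standard TensorFlow device placement; this is an illustrative workaround rather than part of the integration library itself.

  import tensorflow as tf

  # Example of a tensor shape an HPU kernel might not support.
  x = tf.random.uniform([2, 3, 5, 7, 11, 13])

  # Standard TensorFlow device placement: run the op on the CPU explicitly
  # instead of relying on the automatic fallback.
  with tf.device("/device:CPU:0"):
      y = tf.reduce_sum(x, axis=[3, 4])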