TensorFlow Gaudi Integration Architecture
Habana integrates the TensorFlow framework with the SynapseAI compiler in plugin form, through library modules and custom ops/kernels. The framework integration includes three main components:
The publicly available TensorFlow version can be used without any changes, allowing you to run models on Gaudi using this integration library. After you launch the training model on the HPU (Habana Processing Unit) with some minor changes to your Python scripts (see more details in Porting a Simple TensorFlow Model to Gaudi), the software stack compiles the graph and saves the recipe to cache. Unless the graph changes or a new graph comes in, no recompilation is needed during training.
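The porting step mentioned above typically amounts to loading the Habana module before building the model; a minimal, environment-guarded sketch (the `load_habana_module` name comes from Habana's porting guide, and the CPU fallback branch here is purely illustrative):

```python
# Minimal port sketch: loading the Habana module registers the HPU device
# with TensorFlow. Guarded so the snippet degrades gracefully on machines
# without the SynapseAI stack installed.
try:
    from habana_frameworks.tensorflow import load_habana_module
    load_habana_module()             # registers the "HPU" device
    target_device = "/device:HPU:0"  # place ops on Gaudi
except ImportError:
    target_device = "/device:CPU:0"  # no Habana stack; run on CPU instead
```

The rest of the model code stays standard TensorFlow; the software stack handles compilation and recipe caching behind the scenes.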
The SynapseAI helpers library wraps some common flows and constructions in a RAII-style interface and serves as a bridge between the framework and the SynapseAI library (C-API). The HPU integration registers Habana ops as TensorFlow custom ops on the HPU device. The SynapseAI helpers library also manages memory allocations on the device, mapping of host memory to the device, DMA transfers between device and host, and streams. It uses the TensorFlow BFC allocator for fast allocation and deallocation of Gaudi memory.
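The RAII style mentioned here ties a resource's lifetime to an object's lifetime, so release happens automatically and deterministically. A toy Python analogue of the idea (the class and names are purely illustrative, not the helpers library's API):

```python
# Illustrative RAII-style wrapper: the buffer is "allocated" when the object
# is created and "freed" deterministically when the scope exits, mirroring
# how a helpers library can wrap raw C-API handles.
class DeviceBuffer:
    def __init__(self, nbytes, log):
        self.nbytes, self.log = nbytes, log
        self.log.append(f"alloc {self.nbytes}")  # stands in for a C-API alloc

    def __enter__(self):
        return self

    def __exit__(self, *exc):
        self.log.append(f"free {self.nbytes}")   # release on scope exit
        return False

log = []
with DeviceBuffer(1024, log):
    log.append("use buffer")
# log is now ["alloc 1024", "use buffer", "free 1024"]
```

The point of the pattern is that the free call cannot be forgotten or reordered past the end of the buffer's scope, which matters when handles map to real device memory.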
Gaudi supports TensorFlow ops with a specific set of data types. The data type support is specified during op registration in TensorFlow. To see the currently supported TensorFlow op list on HPU, refer to TensorFlow Operators.
You can convert FP32 to the BF16 data type either manually in the Python model code or automatically for selected ops that can be computed in low precision. The second approach is similar to the Auto Mixed Precision conversion pass conducted in TensorFlow. For automatic conversion from FP32 to BF16, enable the default conversion using the runtime environment variable. For further details, see Runtime Environment Variables.
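BF16 keeps FP32's 8-bit exponent (and thus its dynamic range) but only 7 explicit mantissa bits, so a bfloat16 value is simply the top 16 bits of the IEEE-754 FP32 bit pattern. A small Python sketch of the conversion, independent of TensorFlow, that makes the precision loss concrete:

```python
import struct

def fp32_to_bf16_bits(x: float) -> int:
    """bfloat16 bit pattern: the top 16 bits of the FP32 encoding (truncation)."""
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    return bits >> 16

def bf16_to_fp32(bits16: int) -> float:
    """Widen a bfloat16 pattern back to FP32 by zero-filling the low mantissa bits."""
    return struct.unpack("<f", struct.pack("<I", bits16 << 16))[0]

exact = bf16_to_fp32(fp32_to_bf16_bits(1.0))      # representable: survives intact
lossy = bf16_to_fp32(fp32_to_bf16_bits(3.14159))  # low mantissa bits dropped
```

Hardware conversion typically rounds to nearest rather than truncating as this sketch does, but the magnitude of the rounding error is the same, which is why only ops tolerant of low precision are selected for automatic conversion.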
The TensorFlow framework controls most of the objects required for graph building and graph execution, while SynapseAI allows users to create, compile, and launch graphs on the device. The graph passes library optimizes the TensorFlow graph through Pattern Matching, Marking, Segmentation, and Encapsulation (PAMSEN). It is designed to manipulate the TensorFlow graph in order to maximally utilize Gaudi's HW resources. Given a collection of graph nodes that have an implementation for Gaudi, PAMSEN tries to merge as many graph nodes as possible while maintaining graph correctness. By preserving graph semantics and automatically discovering subgraphs that can be fused into one entity, PAMSEN delivers performance that should be on a par with (or exceed) native TensorFlow. Like XLA (Accelerated Linear Algebra), PAMSEN takes the structure of the whole graph into account when making decisions about merging nodes, both to maintain graph correctness and to make sure the graph is not executed differently than expected.
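As a toy illustration of the segmentation idea (not PAMSEN's actual algorithm, and the op support set below is made up), one can greedily group maximal runs of device-supported ops in a topologically ordered chain, leaving unsupported ops as CPU singletons:

```python
# Toy segmentation sketch: group maximal runs of HPU-supported ops in a
# topologically ordered op chain so each cluster can be handed to the
# compiler as one fused entity; unsupported ops break the run.
SUPPORTED_ON_HPU = {"MatMul", "Add", "Relu", "Mul"}  # illustrative set

def segment(topo_ordered_ops):
    clusters, current = [], []
    for op in topo_ordered_ops:
        if op in SUPPORTED_ON_HPU:
            current.append(op)           # extend the current fusable cluster
        else:
            if current:
                clusters.append(current) # close the cluster before the gap
                current = []
            clusters.append([op])        # unsupported op stays a singleton
    if current:
        clusters.append(current)
    return clusters
```

For a linear chain, grouping adjacent runs trivially preserves execution order; a real pass operating on a general DAG must additionally check that a proposed fusion does not create a cycle between clusters, which is part of what "maintaining graph correctness" entails.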
In addition, the optimization pass determines op placement on devices (CPU or HPU), data precision downcasts (such as int64->int32 and FP32->BF16), and runtime constant folding. It also rewrites the TensorFlow size op to the Habana size op, converts TensorFlow collective ops to HPU collective ops, and adds control edges between collectives.
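The int64->int32 downcast is only lossless when every value fits in 32 bits, which is typically true for shapes and indices; a minimal sketch of the range check such a pass has to perform (illustrative, not the pass's actual code):

```python
# A narrowing int64 -> int32 cast is lossless only when each value lies in
# int32 range; this illustrative helper mirrors that precondition.
INT32_MIN, INT32_MAX = -2**31, 2**31 - 1

def can_downcast_i64_to_i32(values):
    """True when every int64 value fits in int32, so narrowing loses nothing."""
    return all(INT32_MIN <= v <= INT32_MAX for v in values)
```

Shape and index tensors rarely exceed 2**31 - 1 elements in practice, which is why this downcast is usually safe and profitable on accelerators.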
The HPU collective ops are implemented using the Habana Collective Communications Library (HCCL), which is used to perform communication among different Gaudi cards. For further details, see Habana Collective Communications Library (HCCL) API Reference.
Distributed training on Habana Gaudi cards is supported with Horovod and HPUStrategy. See more details about the TensorFlow distributed training on Gaudi in Distributed Training with TensorFlow.
Keras is an open-source Python library that provides many common building blocks to ease development of deep neural network code. See TensorFlow Keras for further details.
In some cases, such as unsupported tensor dimensionality, subgraphs collected by PAMSEN cannot be compiled for the accelerator. The computation is delegated to the CPU instead. In such cases, a warning is emitted to the logs:
2021-05-24 23:16:34.331557: W simple_fallback_runner.cpp:39] Delegating node=HABANA_GRAPH_SPECIFIC_NAME to CPU
Such situations can introduce a performance penalty. In addition, computations moved to the CPU may run in float32 rather than bfloat16.