1. TensorFlow User Guide

1.1. Introduction

This document describes how to run TensorFlow models on the Habana Gaudi infrastructure. It provides guidelines for modifying existing models to run on the platform and uses a basic example to demonstrate functionality.

This document does not provide installation instructions. The requirements needed to set up the environment are provided in the TensorFlow Installation section.

1.2. TensorFlow Habana Processing Unit (HPU) Integration Architecture

Habana integrates the TensorFlow framework with the SynapseAI compiler in plugin form through tf.load_library and tf.load_op_library, calling library modules and custom ops/kernels.

The framework integration includes three main components:

  • SynapseAI helpers

  • Device

  • Graph passes

The publicly available TensorFlow version can be used without any changes, allowing you to run models on Gaudi using this integration library. After you launch the training model on the HPU (Habana Processing Unit) with some minor changes in your Python scripts (see more details in Porting a Simple TensorFlow Model to Gaudi), the software stack compiles the graph and saves the recipe to cache. Unless the graph changes or a new graph comes in, no recompilation is needed during the training.
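
For reference, the minor change typically amounts to loading the Habana integration module before the model is built. The following is a minimal sketch; the module name habana_frameworks.tensorflow and the load_habana_module helper are assumed from the Habana Python package and should be verified against your installed release:

import tensorflow as tf
from habana_frameworks.tensorflow import load_habana_module

load_habana_module()  # registers the HPU device and Habana ops with TensorFlow

# From this point on, supported ops are placed on the HPU automatically.
model = tf.keras.Sequential([tf.keras.layers.Dense(10)])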

The SynapseAI helpers library wraps some common flows and constructs in a RAII-style interface and serves as a bridge between the framework and the SynapseAI library (C-API). The HPU integration registers Habana ops as TensorFlow custom ops on the HPU device. The SynapseAI helpers library also manages memory allocations on the device, mapping of host memory to the device, DMA transfers between device and host, and streams. It uses the TensorFlow BFC allocator for fast allocation and deallocation of Gaudi memory.

1.2.1. Supported Data Types

Gaudi supports TensorFlow ops with the following data types:

  • FP32

  • BF16

  • Int32

  • Int8

  • Boolean

The data type support is specified during op registration in TensorFlow. To see the currently supported TensorFlow op list on the HPU, refer to TensorFlow Operators.

You can convert the FP32 data type to BF16 either explicitly in the Python model or automatically for selected ops that can be computed in low precision. The second approach is similar to the Auto Mixed Precision conversion pass in TensorFlow. For automatic conversion from FP32 to BF16, enable the conversion pass with a single runtime flag, TF_BF16_CONVERSION (see more details in Runtime Flags).
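
As an illustration of the first (manual) approach, the sketch below keeps the weights in FP32 while computing a matmul in BF16 using only standard TensorFlow casts; the shapes and names are illustrative:

import tensorflow as tf

# Manual mixed precision: keep the weights in FP32, but compute the matmul
# in BF16 by inserting explicit casts around it.
x = tf.random.uniform([8, 128], dtype=tf.float32)
w = tf.Variable(tf.random.normal([128, 64]))

y_bf16 = tf.matmul(tf.cast(x, tf.bfloat16), tf.cast(w, tf.bfloat16))
y = tf.cast(y_bf16, tf.float32)  # cast back for numerically sensitive consumers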

1.2.2. Graph Compilation

TensorFlow, as the framework, controls most of the objects required for graph building and graph execution. SynapseAI allows users to create, compile, and launch graphs on the device. The Graph passes library optimizes the TensorFlow graph through Pattern Matching, Marking, Segmentation, and Encapsulation (PAMSEN). It is designed to manipulate the TensorFlow graph in order to maximally utilize Gaudi’s hardware resources. Given a collection of graph nodes that have an implementation for Gaudi, PAMSEN tries to merge as many graph nodes as possible while maintaining graph correctness. By preserving graph semantics and automatically discovering subgraphs that can be fused into one entity, PAMSEN delivers performance that should be on par with, or exceed, that of native TensorFlow. Like XLA (Accelerated Linear Algebra), PAMSEN takes graph_cycles and deadness_analysis into account when deciding which nodes to merge, in order to maintain graph correctness and to make sure the graph is not executed in a different way than expected.

In addition, the optimization pass determines op placement on devices (CPU or HPU), data precision downcasts (such as int64->int32 and FP32->BF16), and runtime constant folding. It also rewrites the size to the Habana size, converts TensorFlow collective ops to HPU collective ops, and adds control edges between collectives.

The collective ops are implemented using the Habana Communication Library (HCL), which is used to perform communication among different Gaudi cards. For further details, see Habana Communication Library (HCL) API Reference. The TensorFlow HPU integration also supports NCCL-compatible APIs through the Habana Collective Communication Library (HCCL). For further details, see Habana Collective Communications Library (HCCL) API Reference.

Distributed support is provided through Horovod and the HPUStrategy class. See more details about distributed TensorFlow training on Gaudi in Distributed Training with TensorFlow.
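
A minimal sketch of the HPUStrategy-based approach is shown below; the import path habana_frameworks.tensorflow.distribute.HPUStrategy is assumed from the Habana package, and multi-worker launch details (for example, TF_CONFIG) are omitted:

import tensorflow as tf
from habana_frameworks.tensorflow import load_habana_module
from habana_frameworks.tensorflow.distribute import HPUStrategy

load_habana_module()      # register the HPU device with TensorFlow
strategy = HPUStrategy()  # distribution strategy driving one Gaudi card per worker

with strategy.scope():
    model = tf.keras.Sequential([tf.keras.layers.Dense(10)])
    model.compile(optimizer="sgd", loss="mse")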

1.2.3. Delegating Computations to CPU

In some cases, such as unsupported tensor dimensionality, subgraphs collected by PAMSEN cannot be compiled for the accelerator. The computation is then delegated to the CPU instead. In such cases, a warning is emitted to the logs:

2021-05-24 23:16:34.331557: W simple_fallback_runner.cpp:39] Delegating node=HABANA_GRAPH_SPECIFIC_NAME to CPU

Such situations can introduce a performance penalty. Computations moved to the CPU may also have their precision changed from bfloat16 to float32.

1.2.4. TensorFlow Mixed Precision Training on Gaudi

This section describes how to run mixed precision training of TensorFlow models on Gaudi.

1.2.4.1. Op Lists for BF16 Conversion Pass

Gaudi supports mixed precision of float32 and bfloat16. Mixed precision in general can reduce memory size as well as memory bandwidth requirements and accelerate math operations.

To enable BF16 computations instead of FP32, you can:

  • Explicitly modify the Python model, or

  • Automatically convert selected ops to be computed in lower precision using Habana’s automatic BF16 conversion pass

The conversion pass uses a notion of allow-, conditional- and blocklists. It is also possible to provide certain exceptions. Below is an empty template for defining your own BF16 configuration:

{
    "allow_list": [],
    "conditional_list": [],
    "strict_conditional_list": [],
    "non_convertible_exceptions": [],
    "convertible_exceptions": []
}

  • Allowlists contain ops that are 100% numerically safe, which means they can always be converted to and computed in BF16.

  • Blocklists contain ops that are not numerically safe for reduced precision computations. Such lists do not actually appear anywhere explicitly. Any operation that is not present in allow- or conditional lists is blocked by default.

  • Conditional lists contain ops that may behave in an unstable manner if paired with blocked ones. Ops found in these lists are marked for conversion if any input or output is to be converted.

  • Strict conditional lists differ from conditional lists in that their ops are converted only if either all of their inputs are to be converted or the inputs are Variables or Consts.

All nodes that are found suitable for reduced precision computations are divided into groups (based on adjacency) and converted to BF16 in such a manner that Cast nodes are inserted before the first and after the last to-be-converted node in the group.

1.2.4.2. Exception Lists

In addition, there are two other lists, Non convertible exceptions and Convertible exceptions, that allow for more fine-grained control over the precision of specific instances of ops. This feature allows you to mark specified instances as suitable or unsuitable for BF16 conversion, regardless of the ops placed in the allow-, conditional- or blocklists. For example, it is possible to run some isolated Mul operations in BF16 even if Mul does not appear in either the allow- or conditional lists. Conversely, you can exclude specific Conv2D instances, for example, from BF16 conversion even if Conv2D appears in the allowlist.

Specific op instances are selected by providing a name/op-type pair in the convertible_exceptions or non_convertible_exceptions lists. For example:

{
    "allow_list": [
        "BatchMatMul",
        "BatchMatMulV2",
        "MatMul"
    ],
    "conditional_list": [],
    "strict_conditional_list": [],
    "non_convertible_exceptions": [
        ["gradients/bert/encoder/layer_0/attention/self/key/MatMul_grad/MatMul_1", ""]
    ],
    "convertible_exceptions": [
        ["bert/encoder/layer_[0-9]+/attention/self/add", "AddV2"]
    ]
}

In the above example, BatchMatMul(V2) and MatMul are allowed, and there are no ops in the conditional or strict conditional lists. There is also a single pair in each of the exception lists. In this scenario, all MatMul operations except gradients/bert/encoder/layer_0/attention/self/key/MatMul_grad/MatMul_1 will be converted. In addition, all AddV2 ops matching the name bert/encoder/layer_[0-9]+/attention/self/add will run in BF16, even though AddV2 is not mentioned in either the allow or conditional lists.

Note that entries in the two exception lists are pairs. The first element is a regex for the node name. The second element is a string defining the operation type and is optional. If the second element is left empty, the mechanism matches all operations whose names match the regex, regardless of type.

1.2.4.3. JSON Recipe Files for BF16 Configuration

The BF16 configuration files with the op lists and exception lists specifications need to be provided in JSON format. Example JSON mixed precision recipe files can be found on the Model References GitHub page, in the Model-References/TensorFlow/common/bf16_config directory. The following describes the default configurations:

  • full.json – Contains all convertible ops.

  • basic.json – Only general matrix multiplications and convolutions are converted.

  • bert.json – Specific for use in BERT and ALBERT.

  • unet2d.json – Specific for use in UNet2D and UNet3D.

These example conversion configs define a “KEEP_FP32_PRECISION” string in the non_convertible_exceptions lists. This prevents nodes whose names contain this infix from being converted. The string can be injected into node names using tf.name_scope.
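
For example, a numerically sensitive part of the model can be wrapped in such a name scope so that configs listing the infix keep it in FP32. A minimal sketch using standard TensorFlow APIs:

import tensorflow as tf

# Nodes created inside this scope get "KEEP_FP32_PRECISION" in their graph
# names, so a config listing that infix under non_convertible_exceptions
# leaves them in FP32.
@tf.function
def sensitive_part(x, w):
    with tf.name_scope("KEEP_FP32_PRECISION"):
        return tf.matmul(x, w)

y = sensitive_part(tf.random.uniform([4, 16]), tf.random.normal([16, 8]))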

Set the following environment variable to point to the path of the JSON recipe file used for running mixed precision training on Habana:

TF_BF16_CONVERSION=/path/to/model/mixed_precision_config.json
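
The variable can be exported in the shell as shown above or, as a minimal sketch, set from Python before TensorFlow builds and optimizes the graph (the path is a placeholder):

import os

# Placeholder path: point this at one of the example configs, for example
# Model-References/TensorFlow/common/bf16_config/basic.json.
os.environ["TF_BF16_CONVERSION"] = "/path/to/model/mixed_precision_config.json"

import tensorflow as tf  # import after setting the flag so the conversion pass sees it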

Note

The unet and bert configuration files will be replaced with application-specific configurations (vision and nlp, respectively), instead of model-specific ones, in subsequent releases.

1.2.5. Additional Tools

The TensorFlow HPU integration also provides user-friendly tools for hardware monitoring, performance profiling and error debugging. For these tools, refer to Profiler User Guide.

1.3. TensorFlow Examples

This section describes how to train models using Habana TensorFlow with Gaudi.

1.3.1. Run Models in Habana Model Repository

After successfully setting up the system, perform the following:

  1. Clone the models from the Model References GitHub page using git clone.

  2. Launch runs on Gaudi using the README instructions located in TensorFlow Models on GitHub.

1.3.2. Migrate Your Own Model to Gaudi

To port your own models to Gaudi, refer to the Migration Guide and make sure to review the TensorFlow section of the Release Notes.

1.3.3. Host and Device Ops Placement

When the model is ported to run on the HPU, the software stack decides which ops are placed on the CPU and which are placed on the HPU.

Note

The optimization pass automatically places unsupported ops on the CPU.

You may receive an error if some supported ops with a limited parameter setup are placed on the HPU. In that case, place those ops on the CPU using the TF_PLACE_ON_CPU flag with the following syntax: TF_PLACE_ON_CPU=[OP1_name],[OP2_name].
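
As with the other runtime flags, a minimal sketch is to set the variable before TensorFlow is imported; the op type placeholders below should be replaced with the op names reported in the error:

import os

# "OP1_name" and "OP2_name" are placeholders; substitute the op types
# reported in the error message from your own run.
os.environ["TF_PLACE_ON_CPU"] = "OP1_name,OP2_name"

import tensorflow as tf  # the placement pass reads the flag during graph optimization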

1.4. Runtime Flags

The following describes runtime flags that can be set in the environment to change runtime behavior and to enable or disable features. Among the flags below, TF_NUM_INTEROP_THREADS, TF_CPP_MIN_LOG_LEVEL, and TF_CPP_MIN_VLOG_LEVEL are native TensorFlow flags. All other flags are SynapseAI-specific.

TF_PLACE_ON_CPU

Default: Unset

Accepts a comma-separated list of op types to be placed on the CPU by the PlaceUnsupportedOpsOnCpu pass. If set to "all_nodes", all nodes in the graph are placed on the CPU.

HBN_TF_GRAPH_DUMP

Default: 1

Controls dumping of TensorFlow graphs after different graph transformation phases.

  • 1 (default) - dumps only from POST_REWRITE_FOR_EXEC

  • 0 - disables dumping

  • Values above 1 - enable dumping from all phases

TF_DUMP_GRAPH_PREFIX

Default: Unset

Sets the path that TensorFlow graph dumps are saved to. If unset, graphs are not dumped, and a warning message is shown for built-in TF graph dumping.

ENABLE_CONSOLE

Default: False

If set to true, enables printing SynapseAI logs to the console.

LOG_LEVEL_ALL

Default: 5

Logging level from SynapseAI and perf_lib.

  • 6 - no logs

  • 0 - verbose

By default, logs are placed either in the console (if ENABLE_CONSOLE=true) or under ~/.habana_logs/.

TF_BF16_CONVERSION

Default: Unset

Enables the FP32 to BF16 conversion pass for mixed precision training. Currently supported settings:

  • '0' or unset - conversion is disabled

  • /path/to/model/mixed_precision_config.json

Example JSON recipe files are available on the Model-References GitHub page in the Model-References/TensorFlow/common/bf16_config directory:

  • full.json - Contains all convertible ops

  • basic.json - Only general matrix multiplications and convolutions are converted

  • bert.json - Specific for use in BERT and ALBERT

  • unet2d.json - Specific for use in UNet2D and UNet3D

TF_ENABLE_PATTERN_MATCHER

Default: Unset

If set to '0', the Pattern Matcher optimization pass is disabled. If set to '1', the Pattern Matcher optimization pass is enabled regardless of the TF_DISABLE_PAMSEN flag.

HABANA_INITIAL_WORKSPACE_SIZE_MB

Default: Unset

Sets the initial allocated memory size for the workspace buffer, in MB. This option is mainly for cases in which dynamic workspace allocation does not work properly.

TF_CPU_ALLOCATOR_SIZE_G

Default: Unset

If unset, the default allocation strategy allocates host memory using the following minimum values:

  • 64G (for machines with more than that)

  • 80% of available memory size

  • Available memory size - 16G

If this flag is set to any value, it instructs the Habana CPU allocator to override the default configuration of the CPU memory pool size with the given size in Gigabytes.

TF_NUM_INTEROP_THREADS

Default: Unset

If set to a non-zero value, this flag enforces the thread count for TensorFlow op execution. Otherwise, TensorFlow selects the count based on the available cores and MKL/OpenMP configurations.

TF_CPP_MIN_LOG_LEVEL

Default: 0

Logging level from native TensorFlow. A lower value means more logs. The valid value range is [0-4].

TF_CPP_MIN_VLOG_LEVEL

Default: 0

Another logging level from native TensorFlow. A higher value means more logs. The valid value range is [0-10].

TF_HABANA_ALLOW_LEGACY_VARIABLES_ON_CPU

Default: False

If set to 'True', disables legacy Variables registration on the HPU and allows them to be executed on the CPU. Otherwise, legacy Variables are registered on the HPU, which prevents them from being executed at all.