1. TensorFlow User Guide

1.1. Introduction

This document describes how to run TensorFlow models on the Habana® Gaudi® infrastructure. It provides guidelines for modifying existing models to run on the platform and uses a basic example to show functionality.

The requirements needed to set up and install the environment are provided in the TensorFlow Installation section.


Please make sure that the version of the SynapseAI software stack installation matches the version of the Docker images you are using. Our documentation on docs.habana.ai is also versioned, so select the appropriate version.

The Setup and Install GitHub repository as well as the Model-References GitHub repository have branches for each release version. Make sure you select the branch that matches the version of your SynapseAI software installation. For example, if SynapseAI software version 0.15.4 is installed, clone the Model-References repository as follows:

% git clone -b 0.15.4 https://github.com/HabanaAI/Model-References

To confirm the SynapseAI software version on your build, run the hl-smi tool and look at the “Driver Version” (see the figure below).


Figure 1.3 SynapseAI Version Check

1.2. TensorFlow Gaudi Integration Architecture

Habana integrates the TensorFlow framework with SynapseAI compiler in a plugin form through tf.load_library and tf.load_op_library, calling library modules and custom ops/kernels.

The framework integration includes three main components:

  • SynapseAI helpers

  • Device

  • Graph passes

The publicly available TensorFlow version can be used without any changes, allowing you to run models on Gaudi using this integration library. After you launch the training model on the HPU (Habana Processing Unit) with some minor changes in your Python scripts (see more details in porting_simple_TF_model), the software stack compiles the graph and saves the recipe to cache. Unless the graph changes or a new graph comes in, no recompilation is needed during the training.

SynapseAI helpers library wraps some common flows and constructions in a RAII style interface and serves as a bridge library between the framework and the SynapseAI library (C-API). The HPU integration registers Habana ops as TensorFlow custom ops on the HPU device. The SynapseAI helpers library also manages memory allocations on device, mapping host memory to the device, DMA transfers between device and host, and streams. It uses the TensorFlow BFC Allocator for fast access to Gaudi memory allocation and deallocation.

1.2.1. Supported Data Types

Gaudi supports TensorFlow ops with the following data types:

  • FP32

  • BF16

  • Int32

  • Int8

  • Boolean

The data type support is specified during op registration in TensorFlow. To see the currently supported TensorFlow op list on HPU, refer to TensorFlow Operators.

You can convert FP32 to the BF16 data type either manually in the Python model code or automatically for selected ops that can be computed in low precision. The latter approach is similar to the Auto Mixed Precision conversion pass conducted in TensorFlow. For automatic conversion from FP32 to BF16, enable the default conversion recipe via the runtime environment variable TF_ENABLE_BF16_CONVERSION (see more details in Runtime Environment Variables).

1.2.2. Graph Compilation

The TensorFlow framework controls most of the objects required for graph build or graph execution. SynapseAI allows users to create, compile, and launch graphs on the device. The Graph passes library optimizes the TensorFlow graph with the operations of Pattern Matching, Marking, Segmentation, and Encapsulation (PAMSEN). It is designed to manipulate the TensorFlow graph in order to maximally utilize Gaudi’s HW resources. Given a collection of graph nodes that have an implementation for Gaudi, PAMSEN tries to merge as many graph nodes as possible while maintaining graph correctness. By preserving graph semantics and automatically discovering subgraphs that can be fused into one entity, PAMSEN delivers performance that should be on a par with (or exceed) the native TensorFlow level. Like XLA (Accelerated Linear Algebra), PAMSEN takes graph_cycles and deadness_analysis into account when deciding which nodes in the graph to merge, to maintain graph correctness and to make sure the graph is not executed in a different way than expected.

In addition, the optimization pass determines op placement on devices (CPU or HPU), data precision down-casting (such as int64->int32 and FP32->BF16), and runtime constant folding. It also rewrites the TensorFlow size op to the Habana size op, converts TensorFlow collective ops to HPU collective ops, and adds control edges between collectives.

The HPU collective ops are implemented using the Habana Communication Library (HCL), which is used to perform communication among different Gaudi cards. For further details, see Habana Communication Library (HCL) API Reference. The TensorFlow HPU integration also supports NCCL-compatible APIs through the Habana Collective Communication Library (HCCL). For further details, see Habana Collective Communications Library (HCCL) API Reference.

Distributed training on Habana Gaudi cards is supported with Horovod and HPUStrategy. See more details about the TensorFlow distributed training on Gaudi in Distributed Training with TensorFlow.

1.2.3. TensorFlow Keras

Keras is an open-source Python library which provides many common building blocks to ease development of deep neural network code.


Keras was previously a separate project. It is now part of TensorFlow, available as the tf.keras module. This is the only Keras version supported on Gaudi.

Keras API Support

The following Keras APIs are supported on Gaudi:

  • tf.keras.activations.*,

  • tf.keras.applications.*,

  • tf.keras.backend.*,

  • tf.keras.callbacks.*,

  • tf.keras.constraints.*,

  • tf.keras.estimator.*,

  • tf.keras.initializers.*,

  • tf.keras.layers.*,

  • tf.keras.losses.*,

  • tf.keras.metrics.*,

  • tf.keras.mixed_precision.*,

  • tf.keras.models.*,

  • tf.keras.optimizers.*,

  • tf.keras.regularizers.*,

  • tf.keras.utils.*,

  • tf.keras.wrappers.*,

The following APIs can be used, but some operations may be delegated to the CPU:

  • tf.keras.datasets.*,

  • tf.keras.preprocessing.*,

  • all experimental APIs including tf.keras.experimental.*, tf.keras.mixed_precision

tf.keras.mixed_precision is the recommended mixed precision mechanism for Keras models on Gaudi. To start using tf.keras.mixed_precision, set the mixed_bfloat16 policy and the float32 data type for the last layer in the model, as described in the TensorFlow Mixed Precision Guide.

tf.keras.applications

tf.keras.applications contains several models that can be used “as is” with pre-trained weights, used as a base, or trained from scratch.


Training from scratch has been verified only on a limited number of models from tf.keras.applications.

1.2.4. Delegating Computations to CPU

In some cases, such as unsupported tensor dimensionality, subgraphs collected by PAMSEN cannot be compiled for the accelerator. The computational graph is delegated to the CPU instead. In such cases, a warning is emitted to the logs:

2021-05-24 23:16:34.331557: W simple_fallback_runner.cpp:39] Delegating node=HABANA_GRAPH_SPECIFIC_NAME to CPU

Such situations can introduce a performance penalty. Note that computations moved to the CPU may change precision from bfloat16 to float32.

1.2.5. TensorFlow Mixed Precision Training on Gaudi

This section describes how to run mixed precision training of TensorFlow models on Gaudi.


For Keras models, the recommended mixed precision mechanism is tf.keras.mixed_precision.
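As a minimal sketch of the tf.keras.mixed_precision approach (assuming the standard TensorFlow 2.4+ API; the layer sizes are arbitrary placeholders, and the snippet is guarded so it also runs where TensorFlow is not installed):

```python
import importlib.util

tf_available = importlib.util.find_spec("tensorflow") is not None
if tf_available:
    import tensorflow as tf

    # Compute in bfloat16 while keeping variables in float32.
    tf.keras.mixed_precision.set_global_policy("mixed_bfloat16")

    model = tf.keras.Sequential([
        tf.keras.layers.Dense(128, activation="relu"),
        # Keep the last layer in float32 for numerically stable outputs.
        tf.keras.layers.Dense(10, dtype="float32"),
    ])
```

With the mixed_bfloat16 policy, inner layers compute in BF16 while the explicit dtype on the final layer preserves FP32 outputs.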


The result of enabling both mixed precision mechanisms at the same time is undefined, so the BF16 Conversion Pass and tf.keras.mixed_precision should not be used together.

Op Lists for BF16 Conversion Pass

Gaudi supports mixed precision of float32 and bfloat16. Mixed precision in general can reduce memory size as well as memory bandwidth requirements and accelerate math operations.

To enable BF16 computations instead of FP32, you can:

  • Explicitly modify the python script containing the model as in the example below, or:

# change op's dtype based on input param to script
if params['dtype'] == 'bf16':
    op = tf.cast(op, dtype=tf.bfloat16)
  • Automatically convert selected ops to be computed in lower precision using Habana’s automatic BF16 conversion pass.

The conversion pass uses a notion of Allowlists, Conditional Lists, and Blocklists. It is also possible to provide certain exceptions. Below is an empty template for defining your own BF16 configuration:


{
  "allow_list": [],

  "conditional_list": [],

  "strict_conditional_list": [],

  "non_convertible_exceptions": [],

  "convertible_exceptions": []
}

  • Allowlists contain ops that are 100% numerically safe, which means they can always be converted to and computed in BF16.

  • Blocklists contain ops that are not numerically safe for reduced precision computations. Such lists do not actually appear anywhere explicitly. Any operation that is not present in allow-, conditional or strict conditional lists is blocked by default.

  • Conditional lists contain ops that may behave in an unstable manner if paired with blocked ones. Ops found in these lists are marked for conversion if at least one input or output is to be converted.

  • Strict conditional lists differ from conditional lists in that their ops are converted only if either all of their inputs are to be converted or the inputs are Variables or Consts.
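The classification rules above can be sketched in plain Python. This is a hypothetical illustration of the decision logic, not Habana's actual implementation; the op names in the lists are arbitrary examples, and the real lists come from the JSON recipe file:

```python
# Hypothetical op lists; in practice these come from the JSON recipe.
ALLOW_LIST = {"MatMul", "Conv2D"}          # always numerically safe in BF16
CONDITIONAL_LIST = {"AddV2"}               # convert if any neighbor converts
STRICT_CONDITIONAL_LIST = {"ConcatV2"}     # convert only if all inputs convert

def convert_to_bf16(op_type, neighbor_converted, inputs_are_const=False):
    """Decide whether a node is converted, given its neighbors' decisions."""
    if op_type in ALLOW_LIST:
        return True
    if op_type in CONDITIONAL_LIST:
        return any(neighbor_converted)
    if op_type in STRICT_CONDITIONAL_LIST:
        return all(neighbor_converted) or inputs_are_const
    return False  # not listed anywhere: blocked by default
```

For instance, under these hypothetical lists an AddV2 feeding a converted MatMul would be converted, while a ConcatV2 with one blocked input would not.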

All nodes found suitable for reduced precision computations are divided into groups (based on adjacency) and converted to BF16 in such a manner that Cast nodes are inserted before the first and after the last to-be-converted node in each group.

Exception Lists

In addition, there are two other lists, Non convertible exceptions and Convertible exceptions, that allow more fine-grained control over the precision of specific instances of ops. This feature allows you to mark specified instances as suitable or unsuitable for BF16 conversion, regardless of the ops placed in the allow, conditional, or blocklists. For example, it is possible to run some isolated Mul operations in BF16 even if Mul does not appear in either the allow or conditional lists. On the other hand, you can exclude specific Conv2D instances, for example, from BF16 conversion even if Conv2D appears in the allowlist.

Specific op instances can be selected by means of providing a name/op-type pair in the convertible or non_convertible exception lists of ops. For example:

{
  "allow_list": [
      "BatchMatMul",
      "BatchMatMulV2",
      "MatMul"
  ],

  "conditional_list": [],

  "strict_conditional_list": [],

  "non_convertible_exceptions": [
      ["gradients/bert/encoder/layer_0/attention/self/key/MatMul_grad/MatMul_1", ""]
  ],

  "convertible_exceptions": [
      ["bert/encoder/layer_[0-9]+/attention/self/add", "AddV2"]
  ]
}
In the above example, BatchMatMul(V2) and MatMul are allowed, and there are no ops in the conditional or strict conditional lists. There is also a single pair in each of the convertible and non_convertible exception lists. In this scenario, all MatMul operations except for gradients/bert/encoder/layer_0/attention/self/key/MatMul_grad/MatMul_1 will be converted. Also, all AddV2 ops matching the name bert/encoder/layer_[0-9]+/attention/self/add will be run in BF16, even though AddV2 is not mentioned in either the allow or conditional lists.
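A minimal sketch of how such an exception entry might be matched against a graph node (a hypothetical illustration; whether the pass uses full or partial regex matching is an implementation detail, and full matching is assumed here):

```python
import re

def matches_exception(node_name, node_type, entry):
    """entry is a [name_regex, op_type] pair; an empty op_type matches any type."""
    name_regex, op_type = entry
    if re.fullmatch(name_regex, node_name) is None:
        return False
    return op_type == "" or op_type == node_type
```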

Note that the two exception lists require pairs. The first element is a regular expression for the node name. The second element is a string defining the operation type and is optional; if it is left empty, the mechanism takes all operations matching the name regex, regardless of type.

JSON Recipe Files for BF16 Configuration

The BF16 configuration files with the op lists and exception lists specifications need to be provided in JSON format. Example JSON mixed precision recipe files can be found in the Model References GitHub repository located in the Model-References/TensorFlow/common/bf16_config directory. The following describes the default configurations:

  • full.json – Aims at achieving the best performance while still reaching state-of-the-art accuracy for most models.

  • basic.json – Only general matrix multiplications and convolutions are converted.

  • bert.json – Specific for use in BERT and ALBERT.

  • unet2d.json – Specific for use in UNet2D and UNet3D.

These conversion configs also define two strings: KEEP_FP32_PRECISION in non_convertible_exceptions and FORCE_BF16_PRECISION in convertible_exceptions. Adding KEEP_FP32_PRECISION to the name scope prevents nodes containing this infix from being converted from FP32 to BF16. Similarly, adding FORCE_BF16_PRECISION forces the affected nodes to be converted to BF16. These strings can be injected using tf.name_scope.
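For example, a model function can wrap numerically sensitive ops in a name scope carrying the infix (the reduction op inside the scope is an arbitrary placeholder; the snippet is guarded so it also runs where TensorFlow is not installed):

```python
import importlib.util

tf_available = importlib.util.find_spec("tensorflow") is not None
if tf_available:
    import tensorflow as tf

    def fp32_head(x):
        # Nodes created inside this scope get names containing the
        # KEEP_FP32_PRECISION infix, so the conversion pass leaves
        # them in FP32.
        with tf.name_scope("KEEP_FP32_PRECISION"):
            return tf.reduce_sum(x * x)
```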

To run mixed precision training on Habana, set the TF_ENABLE_BF16_CONVERSION environment variable to point to the path of the JSON recipe file.
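For instance, the variable can be set from within the training script before graph construction (the config path below is a placeholder; point it at one of the shipped recipes, e.g. bert.json):

```python
import os

# Placeholder path; substitute one of the JSON recipes shipped in
# Model-References/TensorFlow/common/bf16_config.
os.environ["TF_ENABLE_BF16_CONVERSION"] = "/path/to/mixed_precision_config.json"
```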


1.2.6. Additional Tools

For performance profiling, refer to Profiler User Guide.

1.3. TensorFlow Examples

This section describes how to train models using TensorFlow with Gaudi.

1.3.1. Run Models in Habana Model Repository

After successfully setting up the system, perform the following:

  1. Clone the models located in the Model-References GitHub repository using Git clone.

  2. Launch runs on Gaudi using the README instructions located in the GitHub Model-References repository.

1.3.2. Migrate Your Own Model to Gaudi

To port your own models on Gaudi, refer to the Migration Guide and make sure to review the TensorFlow section of the Release Notes.

1.3.3. Host and Device Ops Placement

When the model is ported to run on the HPU, the software stack decides which ops are placed on the CPU and which are placed on the HPU.


The optimization pass automatically places unsupported ops on the CPU.

Some supported ops with a limited parameter setup may produce an error if placed on the HPU. In that case, place those ops on the CPU using the TF_PLACE_ON_CPU flag. Use the following syntax: TF_PLACE_ON_CPU=[OP1_name],[OP2_name].
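For example, the flag can be set in the script before the TensorFlow bridge is loaded (the op type names below are hypothetical placeholders; substitute the ops reported in the error):

```python
import os

# Hypothetical op type names used only for illustration.
os.environ["TF_PLACE_ON_CPU"] = "TopKV2,NonMaxSuppressionV5"
```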

1.4. Runtime Environment Variables

The following table describes runtime environment variables that can be set to change behavior and to enable or disable certain features. Among the flags below, TF_NUM_INTEROP_THREADS, TF_CPP_MIN_LOG_LEVEL, and TF_CPP_MIN_VLOG_LEVEL are native TensorFlow flags; all other flags are SynapseAI specific.






TF_PLACE_ON_CPU

Accepts a comma-separated list of op types to be placed on the CPU by the PlaceUnsupportedOpsOnCpu pass. If set to “all_nodes”, all nodes in the graph are placed on the CPU.



Controls dumping of TensorFlow graphs after different graph transformation phases.

  • 1 (default) - dumps only from POST_REWRITE_FOR_EXEC

  • 0 - disable dumping

  • Value above 1 - enables dumping from all phases



Sets the path that TensorFlow dumps are saved to.

If unset, graphs will not be dumped. A warning message is shown for built-in TF graph dumping.



ENABLE_CONSOLE

If set to ‘true’, enables printing SynapseAI logs to the console.



Controls the logging level from SynapseAI and perf_lib.

  • 6 is no logs

  • 0 is verbose

By default, logs are placed either in the console (if ENABLE_CONSOLE=true) or under ~/.habana_logs/.



TF_ENABLE_BF16_CONVERSION

Enables the FP32 to BF16 conversion pass for mixed precision training. Currently supported settings:

  • ‘0’ or unset - conversion is disabled

  • /path/to/model/mixed_precision_config.json

Example JSON recipe files are in the Model-References GitHub repository in the Model-References/TensorFlow/common/bf16_config directory:

  • full.json - Contains all convertible ops

  • basic.json - Only general matrix multiplications and convolutions are converted

  • bert.json - Specific for use in BERT and ALBERT

  • unet2d.json - Specific for use in UNet2D and UNet3D



If set to ‘0’, the Pattern Matcher optimization pass is disabled.



Allows setting the initial allocated memory size, in MB, for the workspace buffer. This option is mainly for cases in which dynamic workspace allocation does not work properly.



By default, the allocation strategy allocates host memory with the below minimum values:

  • 64 GB (for machines with more memory than that)

  • 80% of the available memory size

  • Available memory size minus 16 GB

If this flag is set to any value, it instructs the Habana CPU allocator to override the default configuration of the CPU memory pool size with the given size in gigabytes.



TF_NUM_INTEROP_THREADS

If set to a non-zero value, this flag enforces the thread count for TensorFlow op execution. Otherwise, TensorFlow selects the count based on the available cores and the MKL/OpenMP configuration.



TF_CPP_MIN_LOG_LEVEL

Logging level from native TensorFlow. A lower value means more logs. The valid value range is [0-4].



TF_CPP_MIN_VLOG_LEVEL

Another logging level from native TensorFlow. A higher value means more logs. The valid value range is [0-10].



If set to ‘True’, disables legacy Variables registration on the HPU and allows them to be executed on the CPU. Otherwise, legacy Variables registered on the HPU are prevented from being executed at all.

1.5. Python Package (habana_frameworks.tensorflow)

This package provides the Python-level interface of the TensorFlow bridge for training on Gaudi.

The following sections provide a brief description of each module in the package.

1.5.1. library_loader

This is the main entry point to the Python package. It contains the load_habana_module() function, which must be called to initialize the TensorFlow bridge library and enable training on Gaudi.
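A minimal usage sketch (guarded so it degrades to a CPU device string where the habana-tensorflow package is not installed):

```python
import importlib.util

hpu_available = importlib.util.find_spec("habana_frameworks") is not None
if hpu_available:
    from habana_frameworks.tensorflow import load_habana_module
    load_habana_module()  # registers the HPU device with TensorFlow
    device = "/device:HPU:0"
else:
    device = "/device:CPU:0"  # fallback for non-Gaudi environments
```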

1.5.2. distribute

It contains the HPUStrategy class, a drop-in replacement for the tensorflow.distribute.MultiWorkerMirroredStrategy class. To use it, add from habana_frameworks.tensorflow.distribute import HPUStrategy to your script. For more details, see Distributed Training with TensorFlow.
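A usage sketch (the model-building step is left as a placeholder; the snippet is guarded so it is inert where the habana-tensorflow package is not installed):

```python
import importlib.util

hpu_available = importlib.util.find_spec("habana_frameworks") is not None
if hpu_available:
    from habana_frameworks.tensorflow.distribute import HPUStrategy

    # Used exactly like tf.distribute.MultiWorkerMirroredStrategy.
    strategy = HPUStrategy()
    with strategy.scope():
        ...  # build and compile the Keras model here
```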

1.5.3. ops

It contains public custom Ops implemented inside the TensorFlow bridge library.

1.5.4. habana_estimator

It is a custom tf.estimator.Estimator that allows data pre-fetching to Gaudi. See habana_frameworks.tf_patches.preloading_patch.

1.5.5. hccl

(internal) Python wrapper for HCCL API used internally for testing.

1.5.6. hpu_grads

(internal) It contains registration functions for custom gradients in TensorFlow. User does not need to use this file - it’s invoked at load_habana_module() time.

1.5.7. hw_profiler_helpers

It contains utility profiling functions that can be enabled via environment variables.

1.5.8. lib_utils

It validates the environment of the installed habana-tensorflow package and searches for the libraries needed for initialization.

1.5.9. multinode_helpers

It initializes the multinode environment during the call to load_habana_module().

1.5.10. sysconfig

Similar to tf.sysconfig, it contains __version__ and functions to retrieve information needed for compiling custom ops:

  • Library location

  • Include location

  • Compiler flags

  • Linker flags

For more details on CustomOp API, see TensorFlow CustomOp API.

1.5.11. util

(internal) It contains utility functions.

1.5.12. version_getter

(internal) Its purpose is to retrieve the version of the TensorFlow bridge library via a C API. The user does not need to use this file; it is invoked to retrieve __version__.

1.5.13. Additional utility packages (habana_frameworks.tf_patches)

In addition to habana_frameworks.tensorflow, the tf_patches package is also provided. It contains modules that should be imported into the script before habana_frameworks.tensorflow. Currently, it contains preloading_patch, which is needed to enable data preloading to Gaudi.