1. TensorFlow User Guide

1.1. Introduction

This document describes how to run TensorFlow models on the Habana® Gaudi® infrastructure. It provides guidelines for modifying existing models to run on the platform and basic examples that demonstrate functionality.

The requirements needed to set up and install the environment are provided in the Setup and Install GitHub page.

1.2. TensorFlow Gaudi Integration Architecture

Habana integrates the TensorFlow framework with the SynapseAI compiler in plugin form: the integration library and its custom ops/kernels are loaded through tf.load_library and tf.load_op_library.

The framework integration includes three main components:

  • SynapseAI helpers

  • Device

  • Graph passes

The publicly available TensorFlow version can be used without any changes, allowing you to run models on Gaudi using this integration library. After you launch training on the HPU (Habana Processing Unit) with some minor changes to your Python scripts (see more details in Porting a Simple TensorFlow Model to Gaudi), the software stack compiles the graph and saves the recipe to cache. Unless the graph changes or a new graph comes in, no recompilation is needed during training.
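
For illustration, a minimal sketch of such a script is shown below. The only Gaudi-specific step is loading the Habana module (described in Python Package (habana_frameworks.tensorflow)); the model itself is a placeholder used only for illustration:

import tensorflow as tf
import habana_frameworks.tensorflow as htf

# Register the HPU device and initialize the TensorFlow bridge.
htf.load_habana_module()

# From this point on, standard TensorFlow/Keras code is placed on the HPU by default.
model = tf.keras.Sequential([tf.keras.layers.Dense(10)])
model.compile(optimizer="sgd", loss="mse")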

The SynapseAI helpers library wraps some common flows and constructions in a RAII style interface and serves as a bridge library between the framework and the SynapseAI library (C-API). The HPU integration registers Habana ops as TensorFlow custom ops on the HPU device. The SynapseAI helpers library also manages memory allocations on the device, mapping host memory to the device, DMA transfers between device and host, and streams. It uses the TensorFlow BFC Allocator for fast access to Gaudi memory allocation and deallocation.

1.2.1. Supported Data Types

Gaudi supports TensorFlow ops with the following data types:

  • FP32

  • BF16

  • Int32

  • Int8

  • Boolean

Data type support is specified during op registration in TensorFlow. For the list of TensorFlow ops currently supported on the HPU, refer to TensorFlow Operators.

You can convert the FP32 data type to BF16 either directly in the Python model code or automatically for selected ops that can be computed in low precision. The second approach is similar to the Auto Mixed Precision conversion pass conducted in TensorFlow. For automatic conversion from FP32 to BF16, enable a conversion recipe using the runtime environment variable TF_BF16_CONVERSION (see more details in Runtime Environment Variables).

1.2.2. Graph Compilation

The TensorFlow framework controls most of the objects required for graph building and graph execution. SynapseAI allows users to create, compile, and launch graphs on the device. The Graph passes library optimizes the TensorFlow graph with Pattern Matching, Marking, Segmentation, and Encapsulation (PAMSEN) operations. It is designed to manipulate the TensorFlow graph in order to fully utilize Gaudi's hardware resources. Given a collection of graph nodes that have an implementation for Gaudi, PAMSEN tries to merge as many graph nodes as possible while maintaining graph correctness. By preserving graph semantics and automatically discovering subgraphs that can be fused into one entity, PAMSEN delivers performance that should be on par with (or exceed) the native TensorFlow level. Like XLA (Accelerated Linear Algebra), PAMSEN takes graph_cycles and deadness_analysis into account when deciding which nodes to merge, to maintain graph correctness and to make sure the graph is not executed differently than expected.

In addition, the optimization pass determines op placement on devices (CPU or HPU), data precision downcasts (such as int64->int32 and FP32->BF16), and runtime constant folding. It also rewrites the TensorFlow size op to the Habana size op, converts TensorFlow collective ops to HPU collective ops, and adds control edges between collectives.

The HPU collective ops are implemented using the Habana Communication Library (HCL), which is used to perform communication among different Gaudi cards. For further details, see Habana Communication Library (HCL) API Reference. The TensorFlow HPU integration also supports NCCL-compatible APIs through the Habana Collective Communication Library (HCCL). For further details, see Habana Collective Communications Library (HCCL) API Reference.

Distributed training on Habana Gaudi cards is supported with Horovod and HPUStrategy. See more details about the TensorFlow distributed training on Gaudi in Distributed Training with TensorFlow.
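
As a sketch (assuming the multi-worker cluster setup described in Distributed Training with TensorFlow is already in place; the model is a placeholder), HPUStrategy can be used like any other tf.distribute strategy:

import tensorflow as tf
import habana_frameworks.tensorflow as htf
from habana_frameworks.tensorflow.distribute import HPUStrategy

htf.load_habana_module()

# Drop-in replacement for tf.distribute.MultiWorkerMirroredStrategy.
strategy = HPUStrategy()
with strategy.scope():
    # Placeholder model used only for illustration.
    model = tf.keras.Sequential([tf.keras.layers.Dense(10)])
    model.compile(optimizer="sgd", loss="mse")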

1.2.3. TensorFlow Keras

Keras is an open-source Python library which provides many common building blocks to ease development of deep neural network code.

Note

In the past, Keras was a separate project. It is now part of TensorFlow and is available as the tf.keras module. This is the only Keras version supported on Gaudi.

1.2.3.1. Keras API Support

The following Keras APIs are supported on Gaudi:

  • tf.keras.activations.*

  • tf.keras.applications.*

  • tf.keras.backend.*

  • tf.keras.callbacks.*

  • tf.keras.constraints.*

  • tf.keras.estimator.*

  • tf.keras.initializers.*

  • tf.keras.layers.*

  • tf.keras.losses.*

  • tf.keras.metrics.*

  • tf.keras.mixed_precision.*

  • tf.keras.models.*

  • tf.keras.optimizers.*

  • tf.keras.regularizers.*

  • tf.keras.utils.*

  • tf.keras.wrappers.*

The following APIs can be used, but some operations may be delegated to CPU:

  • tf.keras.datasets.*

  • tf.keras.preprocessing.*

  • all experimental APIs, including tf.keras.experimental.*

1.2.3.2. tf.keras.mixed_precision

tf.keras.mixed_precision is the recommended mixed precision mechanism for Keras models on Gaudi. To start using tf.keras.mixed_precision, set the mixed_bfloat16 policy and keep the float32 data type for the last layer of the model, as described in the TensorFlow Mixed Precision Guide.
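
A minimal sketch of this setup is shown below; the layer shapes are placeholders chosen only for illustration:

import tensorflow as tf

# Compute in bfloat16 where possible while keeping variables in float32.
tf.keras.mixed_precision.set_global_policy("mixed_bfloat16")

model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu"),
    # Keep the last layer in float32 for numerical stability,
    # as described in the TensorFlow Mixed Precision Guide.
    tf.keras.layers.Dense(10, activation="softmax", dtype="float32"),
])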

1.2.3.3. tf.keras.applications

tf.keras.applications contains several models that can be used “as is” with pre-trained weights, used as a base for transfer learning, or trained from scratch.

Note

Training from scratch was verified only on a limited number of models from tf.keras.applications.
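
For example, a pre-trained model can be instantiated in the usual way (a sketch; ResNet50 and the ImageNet weights are used here only as an illustration):

import tensorflow as tf
import habana_frameworks.tensorflow as htf

htf.load_habana_module()

# Load ResNet50 with pre-trained ImageNet weights; subsequent inference or
# fine-tuning runs on the HPU.
model = tf.keras.applications.ResNet50(weights="imagenet")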

1.2.4. Delegating Computations to CPU

In some cases, such as unsupported dimensionality of tensors, a subgraph collected by PAMSEN cannot be compiled for the accelerator. The computation is then delegated to the CPU instead, and a warning is emitted to the logs:

2021-05-24 23:16:34.331557: W simple_fallback_runner.cpp:39] Delegating node=HABANA_GRAPH_SPECIFIC_NAME to CPU

Such situations can introduce a performance penalty. Computations moved to the CPU may also change precision from bfloat16 to float32.

1.2.5. TensorFlow Mixed Precision Training on Gaudi

This section describes how to run mixed precision training of TensorFlow models on Gaudi.

Note

For Keras models, the recommended mixed precision mechanism is tf.keras.mixed_precision.

Warning

The result of enabling both mixed precision mechanisms is undefined, so BF16 Conversion Pass and tf.keras.mixed_precision should not be used together.

1.2.5.1. Op Lists for BF16 Conversion Pass

Gaudi supports mixed precision of float32 and bfloat16. Mixed precision in general can reduce memory size as well as memory bandwidth requirements and accelerate math operations.

To enable BF16 computations instead of FP32, you can:

  • Explicitly modify the Python script containing the model, as in the example below, or:

# Change the op's dtype based on an input parameter to the script;
# 'params' and 'op' are defined elsewhere in the model script.
if params['dtype'] == 'bf16':
    op = tf.cast(op, dtype=tf.bfloat16)
  • Automatically convert selected ops to be computed in lower precision using Habana’s automatic BF16 conversion pass.

The conversion pass uses the notion of allowlists, conditional lists, and blocklists, and also allows certain exceptions. Below is an empty template for defining your own BF16 configuration:

{
  "allow_list": [],
  "conditional_list": [],
  "strict_conditional_list": [],
  "non_convertible_exceptions": [],
  "convertible_exceptions": []
}
  • Allowlists contain ops that are 100% numerically safe, which means they can always be converted to and computed in BF16.

  • Blocklists contain ops that are not numerically safe for reduced precision computations. Such lists do not appear anywhere explicitly; any operation that is not present in the allow, conditional, or strict conditional lists is blocked by default.

  • Conditional lists contain ops that may behave in an unstable manner if paired with blocked ones. Ops found in these lists are marked for conversion if at least one input or output is to be converted.

  • Strict conditional lists differ from conditional lists in that their ops are converted only if either all of their inputs are to be converted or the inputs are Variables or Consts.

All nodes that are found suitable for reduced precision computations are divided into groups (based on adjacency) and converted to BF16 in such a manner that Cast nodes are inserted before the first and after the last to-be-converted node in the group.

1.2.5.2. Exception Lists

In addition, there are two other lists, Non convertible exceptions and Convertible exceptions, that allow for more fine-grained control over the precision of specific instances of ops. This feature allows you to mark specified instances as suitable or unsuitable for BF16 conversion, regardless of whether the op type appears in the allow, conditional, or blocklists. For example, it is possible to run some isolated Mul operations in BF16 even if Mul does not appear in the allow or conditional lists. On the other hand, you can exclude specific Conv2D instances, for example, from BF16 conversion even if Conv2D appears in the allowlist.

Specific op instances can be selected by means of providing a name/op-type pair in the convertible or non_convertible exception lists of ops. For example:

"allowlist": [

      "BatchMatMul",

      "BatchMatMulV2",

      "MatMul”

  ],

  "conditional_list": [],

  "strict_conditional_list": [],

  "non_convertible_exceptions": [

      ["gradients/bert/encoder/layer_0/attention/self/key/MatMul_grad/MatMul_1", ""]

  ],

  "convertible_exceptions": [

      ["bert/encoder/layer_[0-9]+/attention/self/add", "AddV2"]

  ]

}

In the above example, BatchMatMul(V2) and MatMul are allowed, and there are no ops in the conditional or strict conditional lists. There is also a single entry in each of the convertible and non_convertible exception lists. In this scenario, all MatMul operations except gradients/bert/encoder/layer_0/attention/self/key/MatMul_grad/MatMul_1 will be converted. Also, all AddV2 ops matching the name bert/encoder/layer_[0-9]+/attention/self/add will be run in BF16, even though AddV2 does not appear in the allow or conditional lists.

Note that the two additional lists require pairs. The first element is a regex for the name. The second element is a string defining the operation type, and is optional. If the second element is left empty, the mechanism will take all the operations matching the name regex, regardless of the type.

1.2.5.3. JSON Recipe Files for BF16 Configuration

The BF16 configuration files with the op lists and exception lists specifications need to be provided in JSON format. Example JSON mixed precision recipe files can be found in the Model References GitHub repository located in the Model-References/TensorFlow/common/bf16_config directory. The following describes the default configurations:

  • full.json – Aims at achieving the best performance while still reaching state-of-the-art accuracy for most models.

  • basic.json – Only general matrix multiplications and convolutions are converted.

  • bert.json – Specific for use in BERT and ALBERT.

  • unet2d.json – Specific for use in UNet2D and UNet3D.

These conversion configs also define two strings: KEEP_FP32_PRECISION in non_convertible_exceptions and FORCE_BF16_PRECISION in convertible_exceptions. Adding KEEP_FP32_PRECISION to a node’s name scope prevents nodes containing this infix from being converted from FP32 to BF16; similarly, adding FORCE_BF16_PRECISION forces the affected nodes to be converted to BF16. These strings can be injected using tf.name_scope, as shown below.
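
For example, the following sketch keeps one block of a model in FP32 during the conversion pass (the scope name and the computation are hypothetical and used only for illustration):

import tensorflow as tf

def attention_scores(q, k):
    # Nodes created under this scope carry the KEEP_FP32_PRECISION infix in
    # their names, so the BF16 conversion pass leaves them in FP32.
    with tf.name_scope("scores_KEEP_FP32_PRECISION"):
        return tf.nn.softmax(tf.matmul(q, k, transpose_b=True))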

To run mixed precision training on Habana, set the following environment variable to the path of the JSON recipe file:

TF_BF16_CONVERSION=/path/to/model/mixed_precision_config.json
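
For example, a training run using the default full conversion recipe could be launched as follows (train.py and the repository path are placeholders):

TF_BF16_CONVERSION=Model-References/TensorFlow/common/bf16_config/full.json python3 train.py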

1.2.6. Additional Tools

For performance profiling, refer to Profiler User Guide.

1.3. TensorFlow Examples

This section describes how to train models using TensorFlow with Gaudi.

1.3.1. Run Models in Habana Model Repository

After successfully setting up the system, perform the following:

  1. Clone the models located in the Model-References GitHub repository using Git clone.

  2. Launch runs on Gaudi using the README instructions located in the Model-References GitHub repository.

1.3.2. Migrate Your Own Model to Gaudi

To port your own models to Gaudi, refer to the Migration Guide and make sure to review the TensorFlow section of the Release Notes.

1.3.3. Host and Device Ops Placement

When the model is ported to run on the HPU, the software stack decides which ops are placed on the CPU and which are placed on the HPU.

Note

The optimization pass automatically places unsupported ops on the CPU.

You may receive an error if ops that are supported only for a limited parameter setup are placed on the HPU. In that case, place those ops on the CPU using the TF_PLACE_ON_CPU flag, with the following syntax: TF_PLACE_ON_CPU=[OP1_name],[OP2_name].
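
For example (the op type names and the script name below are placeholders only):

TF_PLACE_ON_CPU=OP1_name,OP2_name python3 your_training_script.py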

1.4. Runtime Environment Variables

The following describes the runtime environment variables that can be set to change behavior or to enable or disable certain features. Among these flags, TF_NUM_INTEROP_THREADS, TF_CPP_MIN_LOG_LEVEL, and TF_CPP_MIN_VLOG_LEVEL are native TensorFlow flags. All other flags are SynapseAI specific.

TF_PLACE_ON_CPU (Default: Unset)

Accepts a comma-separated list of op types to be placed on the CPU by the PlaceUnsupportedOpsOnCpu pass. If set to “all_nodes”, all nodes in the graph are placed on the CPU.

HBN_TF_GRAPH_DUMP (Default: 1)

Controls dumping of TensorFlow graphs after different graph transformation phases.

  • 1 (default) - dumps only from POST_REWRITE_FOR_EXEC

  • 0 - disables dumping

  • Values above 1 - enable dumping from all phases

TF_DUMP_GRAPH_PREFIX (Default: Unset)

Sets the path that TensorFlow graph dumps are saved to. If unset, graphs are not dumped and a warning message is shown by the built-in TensorFlow graph dumping.

ENABLE_CONSOLE (Default: false)

If set to ‘true’, enables printing SynapseAI logs to the console.

LOG_LEVEL_ALL (Default: 5)

Logging level for SynapseAI and perf_lib.

  • 6 is no logs

  • 0 is verbose

By default, logs are placed either in the console (if ENABLE_CONSOLE=true) or under ~/.habana_logs/.

TF_BF16_CONVERSION (Default: Unset)

Enables the FP32 to BF16 conversion pass for mixed precision training. Currently supported settings:

  • ‘0’ or unset - conversion is disabled

  • /path/to/model/mixed_precision_config.json - path to a JSON recipe file

Example JSON recipe files are in the Model-References GitHub repository in the Model-References/TensorFlow/common/bf16_config directory:

  • full.json - Contains all convertible ops

  • basic.json - Only general matrix multiplications and convolutions are converted

  • bert.json - Specific for use in BERT and ALBERT

  • unet2d.json - Specific for use in UNet2D and UNet3D

TF_ENABLE_PATTERN_MATCHER (Default: Unset)

If set to ‘0’, the Pattern Matcher optimization pass is disabled.

HABANA_INITIAL_WORKSPACE_SIZE_MB (Default: Unset)

Sets the initial allocated memory size for the workspace buffer, in MB. This option is mainly for cases in which dynamic workspace allocation does not work properly.

TF_CPU_ALLOCATOR_SIZE_G (Default: Unset)

If set to any value, instructs the Habana CPU allocator to override the default configuration of the CPU memory pool size with the given size, in gigabytes. The default allocation strategy sizes the host memory pool using the following minimum values:

  • 64 GB (for machines with more memory than that)

  • 80% of the available memory size

  • Available memory size minus 16 GB

TF_NUM_INTEROP_THREADS (Default: Unset)

If set to a non-zero value, enforces the thread count for TensorFlow op execution. Otherwise, TensorFlow selects the count based on the available cores and MKL/OpenMP configurations.

TF_CPP_MIN_LOG_LEVEL (Default: 0)

Logging level of native TensorFlow. A lower value means more logs. The valid value range is [0-4].

TF_CPP_MIN_VLOG_LEVEL (Default: 0)

Another logging level of native TensorFlow. A higher value means more logs. The valid value range is [0-10].

TF_HABANA_ALLOW_LEGACY_VARIABLES_ON_CPU (Default: False)

If set to ‘True’, disables legacy Variables registration on the HPU and allows them to be executed on the CPU. Otherwise, legacy Variables are registered on the HPU, which prevents them from being executed at all.

TF_RECIPE_CACHE_PATH (Default: Unset)

Path (directory) where compiled graph recipes are stored between different runs of the same model (this accelerates the first iteration). If unset, compiled graph recipes are not stored on disk (recipe disk caching is disabled).

In a scale-up scenario, different processes on one platform may share the same directory for the recipe cache. Only one process compiles a given recipe, and the other processes read it from disk.

Note: The recipe cache directory is not cleared automatically and can grow in size over time.

Note: If a recipe cache is shared among several processes (scale-up), it must be stored on a local physical disk. Avoid remote drives (such as NFS) where file locks are not supported, as this may lead to instability and unpredictable behavior.
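
For example, several of these variables can be combined on the command line for a single run (the path, log level, and script name below are placeholders):

TF_RECIPE_CACHE_PATH=/tmp/tf_recipe_cache LOG_LEVEL_ALL=3 python3 train.py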

1.5. Python Package (habana_frameworks.tensorflow)

This package provides the Python-level interface of the TensorFlow bridge for training on Gaudi. The most significant module inside is library_loader, which contains the load_habana_module() function. This function is also exposed directly at the package level and properly initializes the module.

Example:

import habana_frameworks.tensorflow as htf
htf.load_habana_module()

The following sections provide a brief description of each module:

1.5.1. distribute

distribute module contains HPUStrategy class, a drop-in replacement for tensorflow.distribute.MultiWorkerMirroredStrategy class. See Distributed Training with TensorFlow.

1.5.2. grads

grads module contains gradients for public Ops implemented inside the TensorFlow bridge library. These gradients are automatically registered in TensorFlow when load_habana_module() is called.

1.5.3. habana_device

habana_device module contains Python interface for extra features of habana_device library. It also contains custom Events handling support.

1.5.4. habana_estimator

habana_estimator module is a custom tf.estimator.Estimator that allows data pre-fetching to Gaudi. See an example of tf.data.prefetch_to_device in the Model Performance Optimization Guide for TensorFlow.
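
A sketch of such prefetching with the tf.data API is shown below; the HPU device string is assumed to be "/device:HPU:0", and the dataset is a placeholder:

import tensorflow as tf

dataset = tf.data.Dataset.from_tensor_slices(([1.0, 2.0, 3.0], [0, 1, 0]))
dataset = dataset.batch(2)
# Prefetch batches directly to the Gaudi device (device string assumed).
dataset = dataset.apply(tf.data.experimental.prefetch_to_device("/device:HPU:0"))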

1.5.5. hccl

hccl module (internal) provides Python wrapper for HCCL API used internally for testing.

1.5.6. horovod_helpers

horovod_helpers module (internal) is a helper module for habana-horovod package.

1.5.7. hw_profiler_helpers

hw_profiler_helpers module contains utility profiling functions that can be enabled via environment variables.

1.5.8. library_loader

library_loader module is the main entry point to the Python package. It contains the load_habana_module() function, which needs to be called to initialize the TensorFlow bridge library and enable training on Gaudi. Additionally, it provides the load_op_library() function to be used with the TensorFlow CustomOp API.
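
As a sketch (the library path is a placeholder, and the call is assumed to mirror tf.load_op_library by taking the path to a compiled custom-op library; see TensorFlow CustomOp API for the authoritative flow):

import habana_frameworks.tensorflow as htf

htf.load_habana_module()
# Load a custom-op library built with the TensorFlow CustomOp API (path is hypothetical).
custom_ops = htf.library_loader.load_op_library("/path/to/libmy_custom_op.so")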

1.5.9. lib_utils

lib_utils module validates the environment of the installed habana-tensorflow package and searches for the libraries needed for initialization.

1.5.10. multinode_helpers

multinode_helpers module initializes multinode environment during call to load_habana_module().

1.5.11. ops

ops module contains public custom Ops implemented inside the TensorFlow bridge library.

1.5.12. py_synapse_logger

py_synapse_logger module contains a Python wrapper for synapse logger library.

1.5.13. synapse_logger_helpers

synapse_logger_helpers module contains helper functions to use py_synapse_logger module.

1.5.14. sysconfig

sysconfig module, similarly to tf.sysconfig, contains __version__ and functions to retrieve information needed for compilation of custom ops:

  • Library location

  • Include location

  • Compiler flags

  • Linker flags

For more details on CustomOp API, see TensorFlow CustomOp API.

1.5.15. tb_utils

tb_utils module contains extensions to default tf.Estimator/keras hooks and TensorBoard visualization classes.

1.5.16. util

util module (internal) contains utility functions.

1.5.17. version_getter

version_getter module (internal) retrieves the version of the TensorFlow bridge library via its C API. You do not need to use this module directly; it is invoked automatically to retrieve __version__.