TensorFlow Mixed Precision Training on Gaudi

This section describes how to run mixed precision training of TensorFlow models on the Intel® Gaudi® AI accelerator.

Note

For Keras models, the recommended mixed precision mechanism is tf.keras.mixed_precision.

Warning

The result of enabling both mixed precision mechanisms is undefined, so the BF16 Conversion Pass and tf.keras.mixed_precision should not be used together.

Op Lists for BF16 Conversion Pass

Gaudi supports mixed precision of float32 and bfloat16. Mixed precision generally reduces memory footprint and memory bandwidth requirements, and accelerates math operations.

To enable BF16 computations instead of FP32, you can:

  • Explicitly modify the Python script containing the model, as in the example below:

# Cast the op to bfloat16 when the script's dtype parameter is 'bf16'
if params['dtype'] == 'bf16':
    op = tf.cast(op, dtype=tf.bfloat16)

  • Or, automatically convert selected ops to be computed in lower precision using Intel Gaudi’s automatic BF16 conversion pass.

The conversion pass uses a notion of Allowlists, Conditional Lists and Blocklists. We also make it possible to provide certain exceptions. Below, you can find an empty template for defining your own BF16 configuration:

{
  "allow_list": [],
  "conditional_list": [],
  "strict_conditional_list": [],
  "non_convertible_exceptions": [],
  "convertible_exceptions": []
}

  • Allowlists contain ops that are 100% numerically safe, which means they can always be converted to and computed in BF16.

  • Blocklists contain ops that are not numerically safe for reduced precision computations. Such lists do not actually appear anywhere explicitly. Any operation that is not present in the allowlist, conditional list or strict conditional list is blocked by default.

  • Conditional lists contain ops that may behave in an unstable manner if paired with blocked ones. Ops found in these lists are marked for conversion if at least one input or output is to be converted.

  • Strict conditional lists differ from conditional lists in that their ops are converted only if either all of their inputs are to be converted or the inputs are Variables or Consts.

All nodes that are found suitable for reduced precision computations are divided into groups (based on adjacency) and converted to BF16 in such a manner that Cast nodes are inserted before the first and after the last to-be-converted node in the group.
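The list semantics above can be sketched as a small fixed-point computation over a toy graph. This is only a simplified model of the rules as described here, not the actual conversion pass. The function names, the `VARIABLE_LIKE` set, the graph, and the assignment of `Relu` and `BiasAdd` to particular lists are all assumptions made for illustration. The strict conditional rule is read here as "every input is either converted or a Variable/Const".

```python
# Simplified model of the decision rules described above; NOT the real pass.
# Op types treated as "Variables or Consts" (an assumption for this sketch).
VARIABLE_LIKE = {"Const", "VariableV2", "VarHandleOp"}


def mark_for_bf16(graph, allow, conditional, strict):
    """Return the set of node names that would be computed in BF16.

    graph maps node name -> (op type, list of input node names).
    """
    # Build reverse edges so each node can inspect its consumers (outputs).
    consumers = {name: [] for name in graph}
    for name, (_, inputs) in graph.items():
        for src in inputs:
            consumers[src].append(name)

    # Allowlisted ops are always converted.
    converted = {n for n, (op, _) in graph.items() if op in allow}

    changed = True
    while changed:  # repeat until no node changes state
        changed = False
        for name, (op, inputs) in graph.items():
            if name in converted:
                continue
            if op in conditional:
                # At least one input or output is to be converted.
                neighbours = inputs + consumers[name]
                ok = any(n in converted for n in neighbours)
            elif op in strict:
                # Every input is converted or is a Variable/Const.
                ok = all(n in converted or graph[n][0] in VARIABLE_LIKE
                         for n in inputs)
            else:
                ok = False  # implicit blocklist
            if ok:
                converted.add(name)
                changed = True
    return converted


def cast_boundaries(graph, converted):
    """List edges crossing between FP32 and BF16 regions (where Casts go)."""
    edges = []
    for name, (_, inputs) in graph.items():
        for src in inputs:
            if (src in converted) != (name in converted):
                edges.append((src, name))
    return edges


# Hypothetical toy graph: x @ w + b -> Relu -> Softmax
graph = {
    "x": ("Placeholder", []),
    "w": ("Const", []),
    "b": ("Const", []),
    "mm": ("MatMul", ["x", "w"]),
    "bias": ("BiasAdd", ["mm", "b"]),
    "relu": ("Relu", ["bias"]),
    "loss": ("Softmax", ["relu"]),
}
converted = mark_for_bf16(graph, allow={"MatMul"},
                          conditional={"Relu"}, strict={"BiasAdd"})
boundaries = cast_boundaries(graph, converted)
```

In this toy graph the three adjacent nodes `mm`, `bias` and `relu` form one BF16 group, and `boundaries` lists the edges entering and leaving that group, which is where Cast nodes would sit in this simplified model.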

Exception Lists

In addition, there are two other lists, Non convertible exceptions and Convertible exceptions, that allow for more fine-grained control over the precision of specific instances of ops. This feature allows you to mark specified instances as suitable or unsuitable for BF16 conversion, regardless of which list their op type belongs to. For example, it is possible to run some isolated Mul operations in BF16 even if Mul does not appear in either the allowlist or the conditional lists. Conversely, you can exclude specific instances of an op, for example Conv2D, from BF16 conversion even if Conv2D appears in the allowlist.

Specific op instances can be selected by means of providing a name/op-type pair in the convertible or non_convertible exception lists of ops. For example:

{
  "allowlist": [
    "BatchMatMul",
    "BatchMatMulV2",
    "MatMul"
  ],
  "conditional_list": [],
  "strict_conditional_list": [],
  "non_convertible_exceptions": [
    ["gradients/bert/encoder/layer_0/attention/self/key/MatMul_grad/MatMul_1", ""]
  ],
  "convertible_exceptions": [
    ["bert/encoder/layer_[0-9]+/attention/self/add", "AddV2"]
  ]
}

In the above example, BatchMatMul, BatchMatMulV2 and MatMul are allowed, and there are no ops in the conditional or strict conditional lists. Each of the two exception lists contains a single pair. In this scenario, all MatMul operations except for gradients/bert/encoder/layer_0/attention/self/key/MatMul_grad/MatMul_1 will be converted. Also, all AddV2 ops matching the name bert/encoder/layer_[0-9]+/attention/self/add will be run in BF16, even though AddV2 is not mentioned in either the allowlist or the conditional lists.

Note that the two additional lists require pairs. The first element is a regex for the name. The second element is a string defining the operation type, and is optional. If the second element is left empty, the mechanism will take all the operations matching the name regex, regardless of the type.
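As a rough illustration of how a name/op-type pair selects nodes, the sketch below applies the exception lists from the BERT example above to a few node names. The helper names `is_exception` and `decide` are hypothetical, conditional lists are ignored, and two details are assumptions not confirmed by this document: that the regex must match the whole node name, and that a non-convertible exception wins if a node matches both lists.

```python
import re


def is_exception(node_name, node_op, exceptions):
    """Return True if the node matches any [name_regex, op_type] pair.

    An empty op_type matches any op type, as described above.
    Full-name matching is an assumption of this sketch.
    """
    for name_regex, op_type in exceptions:
        if re.fullmatch(name_regex, node_name) and op_type in ("", node_op):
            return True
    return False


def decide(node_name, node_op, allowlist, non_convertible, convertible):
    """Pick a precision for one node using the allowlist and exceptions only."""
    if is_exception(node_name, node_op, non_convertible):
        return "fp32"  # assumed to take priority over convertible exceptions
    if is_exception(node_name, node_op, convertible):
        return "bf16"
    return "bf16" if node_op in allowlist else "fp32"


# Lists copied from the BERT example above.
allowlist = {"BatchMatMul", "BatchMatMulV2", "MatMul"}
non_convertible = [
    ["gradients/bert/encoder/layer_0/attention/self/key/MatMul_grad/MatMul_1", ""],
]
convertible = [
    ["bert/encoder/layer_[0-9]+/attention/self/add", "AddV2"],
]

# Node names other than the excluded MatMul are made up for illustration.
nodes = [
    ("gradients/bert/encoder/layer_0/attention/self/key/MatMul_grad/MatMul_1", "MatMul"),
    ("bert/encoder/layer_0/attention/self/key/MatMul", "MatMul"),
    ("bert/encoder/layer_11/attention/self/add", "AddV2"),
    ("bert/encoder/layer_3/output/add", "AddV2"),
]
decisions = [decide(name, op, allowlist, non_convertible, convertible)
             for name, op in nodes]
for (name, op), precision in zip(nodes, decisions):
    print(f"{precision}  {op:8s} {name}")
```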

JSON Recipe Files for BF16 Configuration

There are two ways to provide a BF16 conversion list:

  1. Using predefined lists:

  • Full – Aims at achieving the best performance while still reaching the state of the art accuracy for most models.

  • Basic – Only general matrix multiplications and convolutions are converted.

These lists can be dumped to a JSON file, modified and reused by setting the TF_BF16_DUMP_PATH flag:

TF_BF16_DUMP_PATH=/path/to/output/file.json

  2. Using existing topology-specific configs provided in the Model-References GitHub repository, for example bert.json.

These conversion configs also define two strings: KEEP_FP32_PRECISION in non_convertible_exceptions and FORCE_BF16_PRECISION in convertible_exceptions. Adding KEEP_FP32_PRECISION to the name scope prevents nodes whose names contain this infix from being converted from FP32 to BF16. Similarly, adding FORCE_BF16_PRECISION forces the affected nodes to be converted to BF16. These strings can be injected using tf.name_scope.

Set the following environment variable to point to the path to the JSON recipe file for running mixed precision training on Gaudi:

TF_BF16_CONVERSION=/path/to/model/mixed_precision_config.json
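If the training script is launched from Python rather than from a shell, the recipe can also be written and the variable set programmatically. The following is a minimal sketch under two assumptions: the variable is presumably read when the Gaudi TensorFlow integration is loaded, so it is set before anything else is imported, and the key spelling follows the BERT example earlier in this section (it is worth confirming against a file dumped with TF_BF16_DUMP_PATH). The temporary directory is only for illustration.

```python
import json
import os
import tempfile

# Recipe contents; key names copied from the BERT example in this section.
recipe = {
    "allowlist": ["BatchMatMul", "BatchMatMulV2", "MatMul"],
    "conditional_list": [],
    "strict_conditional_list": [],
    "non_convertible_exceptions": [],
    "convertible_exceptions": [],
}

# Write the recipe to a file (a temporary directory is used for illustration).
recipe_path = os.path.join(tempfile.mkdtemp(), "mixed_precision_config.json")
with open(recipe_path, "w") as f:
    json.dump(recipe, f, indent=2)

# Point the conversion pass at the recipe. Assumed to be needed before
# TensorFlow and the Gaudi integration are imported in this process.
os.environ["TF_BF16_CONVERSION"] = recipe_path
```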

Additional Tools

For performance profiling, refer to the Profiler User Guide.