Optimization in TensorFlow Models

Batch Size

A large batch size generally improves throughput. However, the following limitations apply when using a large batch size:

  1. Batch size is limited by the size of Gaudi’s device memory (HBM), which is fixed. A larger batch size usually consumes more device memory.

  2. A large batch size cannot be used when low latency, rather than throughput, is required.

  3. A large batch size on each Gaudi device may impact convergence in data-parallel distributed training. For example, the highest global batch size at which RN50 converges is around 32K. This means that as the number of Gaudi devices increases, the batch size on each device should be reduced.
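
The third limitation can be sketched as simple arithmetic. The helper below is hypothetical (it is not part of any Habana or TensorFlow API) and assumes that "around 32K" means a global cap of 32 * 1024 samples and that the global batch size is the per-device batch size multiplied by the number of devices.

```python
def per_device_batch_size(preferred: int, num_devices: int,
                          global_cap: int = 32 * 1024) -> int:
    """Largest per-device batch size that keeps the global batch within global_cap.

    Hypothetical helper: global_cap defaults to an assumed RN50 limit of 32K.
    """
    if preferred < 1 or num_devices < 1:
        raise ValueError("preferred and num_devices must be positive")
    # Global batch = per-device batch * number of devices (data parallelism).
    return max(1, min(preferred, global_cap // num_devices))


for n in (8, 128, 256, 512):
    bs = per_device_batch_size(256, n)
    print(f"{n:4d} devices -> per-device batch {bs:3d}, global batch {bs * n}")
```

With a preferred per-device batch size of 256, the per-device value only has to shrink once more than 128 devices are used under these assumptions.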

The table below lists example batch sizes used in different models, all using mixed precision.

Model                              Batch Size
ResNet50                           256
BERT Large pre-training Phase 1    64
BERT Large pre-training Phase 2    8
MaskRCNN                           4

TensorFlow Mixed Precision

For details on how to run mixed precision training of TensorFlow models on Gaudi, refer to TensorFlow Mixed Precision Training on Gaudi.

(non-)Optimal Constructs

TensorFlow ops are implemented as Synapse graphs, which usually contain a single node (an HPU op) that invokes a TPC or MME kernel. The graph compiler clusters HPU ops and compiles them together, applying various optimizations to boost performance.

The following sections highlight methods that may improve performance.

Avoid Non-Logical Transpose Operations

A transpose that does not involve the Fastest-Changing-Dimension is a logical operation and only manipulates tensor metadata. Otherwise, it is a DMA operation that copies the entire tensor to another address, which takes time.

If a transpose is required only to match the data order of the ground truth, consider modifying the dataloader to produce the data in that order instead.

Example 1 – Transpose as logical operation:

in_tensor = tf.compat.v1.placeholder(shape=[32, 8000, 64],
                                     dtype=tf.float32)
mul = in_tensor * 1.234
# The last axis (2) stays in place, so this transpose is logical.
mul = tf.transpose(mul, perm=(1, 0, 2))
out = mul + 1.0

Figure 6 Transpose as Logical Operation

Example 2 – Transpose as non-logical operation:

in_tensor = tf.compat.v1.placeholder(shape=[32, 8000, 64],
                                     dtype=tf.float32)
mul = in_tensor * 1.234
# The last axis is moved, so this transpose is a DMA copy.
mul = tf.transpose(mul, perm=(0, 2, 1))
out = mul + 1.0

Figure 7 Transpose as Non-Logical Operation
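
The rule illustrated by the two examples above can be written as a small check on the perm argument. This is a minimal sketch with a hypothetical helper name, and it assumes that the Fastest-Changing-Dimension is the last axis of the tensor, as in both examples. It does not query the graph compiler, so treat the result as a hint rather than a guarantee.

```python
def is_logical_transpose(perm):
    """Return True when perm leaves the last (fastest-changing) axis in place.

    Hypothetical helper; assumes the last axis is the Fastest-Changing-Dimension.
    """
    rank = len(perm)
    if sorted(perm) != list(range(rank)):
        raise ValueError(f"not a permutation of 0..{rank - 1}: {perm}")
    return rank == 0 or perm[-1] == rank - 1


print(is_logical_transpose((1, 0, 2)))  # Example 1 -> True
print(is_logical_transpose((0, 2, 1)))  # Example 2 -> False
```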

Avoid Using tf.pad

Gaudi supports padding in hardware. Some TensorFlow models use tf.pad, which should be avoided whenever possible because, in rare cases, the graph optimizations in the SynapseAI® Software Stack do not remove it automatically.

(Recommended): The following example does not use tf.pad:

BS, H, W, CH_in, CH_out, F_H, F_W = 16, 128, 128, 16, 8, 3, 3
input_tensor = tf.compat.v1.placeholder(shape=[BS, H, W, CH_in],
                                        dtype=tf.float32, name="input")
filters = tf.compat.v1.placeholder(shape=[F_H, F_W, CH_in, CH_out],
                                   dtype=tf.float32)
# "SAME" padding is applied by the convolution itself.
out = tf.nn.conv2d(input=input_tensor, filters=filters, strides=1,
                   padding="SAME")

(Not Recommended): The following example uses tf.pad:

BS, H, W, CH_in, CH_out, F_H, F_W = 16, 128, 128, 16, 8, 3, 3
input_tensor = tf.compat.v1.placeholder(shape=[BS, H, W, CH_in],
                                        dtype=tf.float32, name="input")
filters = tf.compat.v1.placeholder(shape=[F_H, F_W, CH_in, CH_out],
                                   dtype=tf.float32)
# Explicit padding followed by a "VALID" convolution.
padded = tf.pad(input_tensor, [[0, 0], [1, 1], [1, 1], [0, 0]])
out = tf.nn.conv2d(input=padded, filters=filters, strides=1,
                   padding="VALID")

Figure 8 Hardware profiling - tf.pad + tf.nn.conv2d
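
To see why the two versions above compute the same result, the padding that "SAME" applies can be worked out by hand. The sketch below uses the padding formula documented for TensorFlow convolutions (without dilation). The helper name same_padding is hypothetical.

```python
import math


def same_padding(size: int, kernel: int, stride: int = 1):
    """(before, after) padding that "SAME" applies along one spatial dimension."""
    out = math.ceil(size / stride)
    total = max((out - 1) * stride + kernel - size, 0)
    before = total // 2
    return before, total - before


# Values from the examples above: 128x128 input, 3x3 filter, stride 1.
H, W, F_H, F_W = 128, 128, 3, 3
paddings = [[0, 0], list(same_padding(H, F_H)), list(same_padding(W, F_W)), [0, 0]]
print(paddings)  # [[0, 0], [1, 1], [1, 1], [0, 0]]
```

The result matches the paddings passed to tf.pad in the not-recommended version, so for these shapes the explicit tf.pad plus "VALID" can be replaced with padding="SAME".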

Avoid Calculating L2 Loss in Forward Path

Calculating L2-loss slightly slows down training. Below is an example where L2-loss is used:

total_loss = loss + params['weight_decay'] * tf.add_n(
    [tf.nn.l2_loss(v) for v in tf.trainable_variables()])
...
train_op = optimizer.minimize(total_loss, global_step)
return model_fn_lib.EstimatorSpec(
    loss=total_loss,
    train_op=train_op, … )

The L2-loss does not need to be in the forward path. Replacing total_loss with loss improves performance, and the regularization can be applied by modifying the gradients instead:

# Add the weight regularization gradient
grad = grad + self.weight_decay * var

This achieves the same effect from a convergence perspective, but avoids calculating the regularization loss in the forward pass. The downside is that the scalar value of the total loss (loss from topology + regularization loss) no longer reflects the regularization loss term, which in many cases is acceptable.
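
The equivalence follows from the definition of tf.nn.l2_loss(v), which is sum(v ** 2) / 2, so the gradient of weight_decay * l2_loss(v) with respect to v is weight_decay * v. The sketch below checks this numerically in plain Python. The quadratic data_loss, the target values, and the finite-difference helper are illustrative assumptions, not part of the original model.

```python
def data_loss(w, target):
    # Illustrative stand-in for the topology loss.
    return sum((wi - ti) ** 2 for wi, ti in zip(w, target))


def l2_loss(w):
    # Same definition as tf.nn.l2_loss: sum(w ** 2) / 2.
    return sum(wi ** 2 for wi in w) / 2.0


def numeric_grad(f, w, eps=1e-5):
    # Central finite differences, one coordinate at a time.
    grads = []
    for i in range(len(w)):
        hi = w[:i] + [w[i] + eps] + w[i + 1:]
        lo = w[:i] + [w[i] - eps] + w[i + 1:]
        grads.append((f(hi) - f(lo)) / (2 * eps))
    return grads


weight_decay = 1e-2
w = [0.5, -1.5, 2.0]
target = [1.0, 0.0, -1.0]

# Gradient with the L2 term in the forward path.
grad_total = numeric_grad(
    lambda v: data_loss(v, target) + weight_decay * l2_loss(v), w)

# Gradient of the data loss alone, with weight decay added to the gradient.
grad_data = numeric_grad(lambda v: data_loss(v, target), w)
grad_modified = [g + weight_decay * wi for g, wi in zip(grad_data, w)]

max_diff = max(abs(a - b) for a, b in zip(grad_total, grad_modified))
print(max_diff < 1e-6)
```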

Use tf.data.experimental.prefetch_to_device

tf.data.experimental.prefetch_to_device should be placed at the end of your input pipeline:

device = "/device:HPU:0"

dataset = dataset.apply(tf.data.experimental.prefetch_to_device(device))

Refer to https://www.tensorflow.org/api_docs/python/tf/data/experimental/prefetch_to_device for additional details.

Note

If the topology is trained with tf.estimator.Estimator, use habana_frameworks.tensorflow.habana_estimator.

Place Ops in HPU

Use Ops from the TensorFlow version supported by Habana. If you take Ops from an old TensorFlow version and use them with a new TensorFlow version, the Ops may fall back to the CPU. To check whether an Op runs on the CPU or on the HPU, set the following two environment variables to dump the graph:

HBN_TF_GRAPH_DUMP=2

TF_DUMP_GRAPH_PREFIX=<path to save the dumped file>

The generated log file is called habana_POST_PLACEMENT_00_Before.pbtxt. The example below is taken from this file:

node {
  name: "MobilenetV2/Conv/BatchNorm/gamma/Assign"
  op: "AssignVariableOp"
  input: "MobilenetV2/Conv/BatchNorm/gamma"
  input: "MobilenetV2/Conv/BatchNorm/gamma/Initializer/ones"
  device: "/job:localhost/replica:0/task:0/device:HPU:0"
  attr {
    key: "dtype"
    value {
      type: DT_FLOAT
    }
  }
}
node {
  name: "activations/layer_1"
  op: "HistogramSummary"
  input: "activations/layer_1/tag"
  input: "MobilenetV2/Conv/CustomRelu6Op"
  device: "/job:localhost/replica:0/task:0/device:CPU:0"
  attr {
    key: "T"
    value {
      type: DT_FLOAT
    }
  }
}

TensorFlow native logging for device placement, tf.debugging.set_log_device_placement, may provide similar information.
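
For large graphs, the dumped pbtxt file shown above can be scanned with a short script instead of being read by hand. The following is a minimal sketch that only relies on the text layout visible in the excerpt (top-level node blocks with name, op and device fields). The helper name cpu_fallback_ops and the embedded SAMPLE text are illustrative, and the script is not a Habana tool.

```python
import re

FIELD_RE = re.compile(r'^\s*(name|op|device):\s*"(.*)"\s*$')


def cpu_fallback_ops(pbtxt_text):
    """Return (name, op) for every top-level node placed on a CPU device."""
    found, node, depth = [], None, 0
    for line in pbtxt_text.splitlines():
        stripped = line.strip()
        if depth == 0 and stripped == "node {":
            node, depth = {}, 1
            continue
        if node is None:
            continue  # ignore anything outside a node block
        if stripped.endswith("{"):
            depth += 1  # nested block such as attr { ... }
        elif stripped == "}":
            depth -= 1
            if depth == 0:
                if "device:CPU" in node.get("device", ""):
                    found.append((node.get("name"), node.get("op")))
                node = None
        elif depth == 1:
            match = FIELD_RE.match(line)
            if match:
                node[match.group(1)] = match.group(2)
    return found


# Shortened copy of the excerpt above, used as sample input.
SAMPLE = '''
node {
  name: "MobilenetV2/Conv/BatchNorm/gamma/Assign"
  op: "AssignVariableOp"
  device: "/job:localhost/replica:0/task:0/device:HPU:0"
  attr {
    key: "dtype"
    value {
      type: DT_FLOAT
    }
  }
}
node {
  name: "activations/layer_1"
  op: "HistogramSummary"
  device: "/job:localhost/replica:0/task:0/device:CPU:0"
  attr {
    key: "T"
    value {
      type: DT_FLOAT
    }
  }
}
'''

print(cpu_fallback_ops(SAMPLE))  # [('activations/layer_1', 'HistogramSummary')]
```

To use it on a real dump, read the contents of habana_POST_PLACEMENT_00_Before.pbtxt from the directory given in TF_DUMP_GRAPH_PREFIX and pass the text to cpu_fallback_ops.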

You can also use the TPC SDK to implement Custom Ops and TF Custom Ops for use in the model. Examples are available on the Custom Op GitHub page, and a MobileNetV2 example is available on the MobileNetV2 Custom Op GitHub page.

Advanced: Replace some TPC operations with their MME equivalents

In some topologies there are large gaps between MME operations because of long TPC blocks in between. In such cases, it may help to run some TPC operations (for example, reduce_sum) on the otherwise idle MME so that they execute in parallel with the remaining TPC work. Below is an example from the SSD_ResNet34 GitHub page.

# MME was idle during classification_loss computation.
# In this case, conv2d is equivalent to reduce_sum but reduce_sum is
# executed on TPC while conv2d on MME.
sum_exp = tf.nn.conv2d(exp_shifted_logits, reduce_sum_filter, strides=1,
                       padding="VALID", name="sum_exp")

Profiling shows which Ops run on the TPC and which run on the MME. Detailed profiling instructions can be found in the Profiler User Guide.
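
The equivalence used in the conv2d example above can be checked on a tiny input. The sketch below is written in plain Python so that it runs without TensorFlow or a Gaudi device. It assumes that reduce_sum_filter is a 1x1 filter of ones with a single output channel, which the original snippet does not show, so that the convolution sums over the channel axis. The function names are illustrative.

```python
def conv2d_1x1(x, filt):
    """1x1 convolution, stride 1, "VALID" padding, on NHWC nested lists.

    x has shape [N][H][W][C_in]; filt has shape [C_in][C_out]
    (the two leading 1x1 filter dimensions are dropped).
    """
    c_out = len(filt[0])
    return [[[[sum(px[c] * filt[c][o] for c in range(len(px)))
               for o in range(c_out)]
              for px in row] for row in img] for img in x]


def reduce_sum_channels(x):
    """Sum over the last axis, keeping it as a dimension of size 1."""
    return [[[[sum(px)] for px in row] for row in img] for img in x]


x = [[[[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]],
      [[0.5, 0.5, 0.5], [-1.0, 0.0, 1.0]]]]  # shape [1, 2, 2, 3]
ones_filter = [[1.0], [1.0], [1.0]]           # assumed filter: [C_in=3, C_out=1]

print(conv2d_1x1(x, ones_filter))
print(conv2d_1x1(x, ones_filter) == reduce_sum_channels(x))
```

Both functions produce the same values, which is why the reduction can be moved to the MME when the MME would otherwise be idle.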