Optimization in TensorFlow Models
A large batch size is, in general, beneficial for throughput. However, the following limitations apply when using a large batch size:
Batch size is limited by the size of Gaudi's device memory (HBM), which is fixed. A larger batch size usually means higher device memory consumption.
A large batch size cannot be used when low latency, rather than throughput, is required.
A large batch size on each Gaudi device may impact convergence in data-parallel distributed training. For example, the highest global batch size that gives RN50 convergence is around 32K. This means that as the number of Gaudi devices increases, the batch size on each device should be reduced.
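The last limitation is simple arithmetic: in data parallelism, the global batch size is the per-device batch size multiplied by the number of devices. The sketch below is illustrative only. The helper name is hypothetical, and it assumes "32K" means 32 * 1024 samples.

```python
def per_device_batch_size(global_batch_cap: int, num_devices: int) -> int:
    """Largest per-device batch size that keeps the global batch under the cap."""
    if num_devices < 1:
        raise ValueError("num_devices must be at least 1")
    return global_batch_cap // num_devices


# Assumption: "around 32K" is taken as 32 * 1024 samples.
GLOBAL_BATCH_CAP = 32 * 1024

for num_devices in (8, 16, 32):
    print(num_devices, per_device_batch_size(GLOBAL_BATCH_CAP, num_devices))
```

The result is only an upper bound from the convergence point of view. The per-device value must still fit in device memory.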
The table below provides some examples of batch sizes used in different models, all using mixed precision.
Bert Large pre-training Phase 1
Bert Large pre-training Phase 2
TensorFlow Mixed Precision¶
For details on how to run mixed precision training of TensorFlow models on Gaudi, refer to TensorFlow Mixed Precision Training on Gaudi.
TensorFlow ops are implemented as SynapseAI graphs which usually contain one node (aka HPU op) with TPC or MME kernel invocation. HPU ops are clustered and compiled together by the graph compiler which implements various optimizations to boost performance.
The following sections highlight methods that may improve performance.
Avoid Non-Logical Transpose Operations¶
A transpose that does not involve the Fastest-Changing-Dimension (FCD) is a logical operation and only manipulates tensor metadata. Otherwise, it is a DMA operation that copies the entire tensor to another address, which takes time.
If a transpose is required only to match the data order of the ground truth, consider modifying the dataloader to produce the data in that order instead.
Example 1 – Transpose as logical operation:
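    import tensorflow as tf

    tf.compat.v1.disable_eager_execution()  # placeholders require graph mode

    in_tensor = tf.compat.v1.placeholder(shape=[32, 8000, 64], dtype=tf.float32)
    mul = in_tensor * 1.234
    # The last axis stays in place, so only tensor metadata changes.
    mul = tf.transpose(mul, perm=(1, 0, 2))
    out = mul + 1.0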
Example 2 – Transpose as non-logical operation:
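    import tensorflow as tf

    tf.compat.v1.disable_eager_execution()  # placeholders require graph mode

    in_tensor = tf.compat.v1.placeholder(shape=[32, 8000, 64], dtype=tf.float32)
    mul = in_tensor * 1.234
    # The last axis is moved, so the whole tensor must be copied.
    mul = tf.transpose(mul, perm=(0, 2, 1))
    out = mul + 1.0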
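The rule illustrated by the two examples can be expressed as a small check on the permutation. This is a minimal sketch, not a SynapseAI API: the function name is hypothetical, and it assumes the last axis is the fastest-changing dimension, as in the examples above.

```python
def is_logical_transpose(perm):
    """Return True when the permutation leaves the last axis in place.

    Assumption: the last axis is the fastest-changing dimension.
    """
    rank = len(perm)
    if sorted(perm) != list(range(rank)):
        raise ValueError(f"not a valid permutation: {perm!r}")
    if rank == 0:
        return True  # nothing to move
    return perm[-1] == rank - 1


print(is_logical_transpose((1, 0, 2)))  # Example 1
print(is_logical_transpose((0, 2, 1)))  # Example 2
```

A check like this can be used while reviewing a model to flag transposes that are likely to cost a full tensor copy.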
Gaudi hardware supports hardware padding. Some TensorFlow models use tf.pad. It should be avoided whenever possible since, in some rare cases, tf.pad is not removed automatically by graph optimizations in the SynapseAI Software Stack.
(Recommended): The example code below does not include tf.pad:
    import tensorflow as tf

    tf.compat.v1.disable_eager_execution()  # placeholders require graph mode

    BS, H, W, CH_in, CH_out, F_H, F_W = 16, 128, 128, 16, 8, 3, 3
    input_tensor = tf.compat.v1.placeholder(
        shape=[BS, H, W, CH_in], dtype=tf.float32, name="input")
    filters = tf.compat.v1.placeholder(
        shape=[F_H, F_W, CH_in, CH_out], dtype=tf.float32)
    # Padding is requested from the convolution itself.
    out = tf.nn.conv2d(input=input_tensor, filters=filters, strides=1, padding="SAME")
(Not Recommended): The example code below includes tf.pad:
    import tensorflow as tf

    tf.compat.v1.disable_eager_execution()  # placeholders require graph mode

    BS, H, W, CH_in, CH_out, F_H, F_W = 16, 128, 128, 16, 8, 3, 3
    input_tensor = tf.compat.v1.placeholder(
        shape=[BS, H, W, CH_in], dtype=tf.float32, name="input")
    filters = tf.compat.v1.placeholder(
        shape=[F_H, F_W, CH_in, CH_out], dtype=tf.float32)
    # Explicit padding followed by a "VALID" convolution.
    padded = tf.pad(input_tensor, [[0, 0], [1, 1], [1, 1], [0, 0]])
    out = tf.nn.conv2d(input=padded, filters=filters, strides=1, padding="VALID")
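To see why the two snippets are interchangeable, it helps to compute the padding that "SAME" applies. The sketch below uses the padding rule documented for TensorFlow convolutions. The helper name is hypothetical and the code does not call TensorFlow.

```python
import math


def same_padding(input_size, filter_size, stride):
    """Return (pad_before, pad_after) along one spatial axis for "SAME" padding."""
    output_size = math.ceil(input_size / stride)
    total = max((output_size - 1) * stride + filter_size - input_size, 0)
    before = total // 2
    return before, total - before


H, W, F_H, F_W = 128, 128, 3, 3
explicit_paddings = [
    [0, 0],                          # batch axis
    list(same_padding(H, F_H, 1)),   # height
    list(same_padding(W, F_W, 1)),   # width
    [0, 0],                          # channel axis
]
print(explicit_paddings)  # [[0, 0], [1, 1], [1, 1], [0, 0]]
```

For a 3x3 filter with stride 1, the computed padding matches the explicit tf.pad arguments in the not-recommended snippet, so padding="SAME" can replace the tf.pad call there.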
Avoid Calculating L2 Loss in Forward Path¶
Calculating L2-loss slightly slows down training. Below is an example where L2-loss is used:
    total_loss = loss + params['weight_decay'] * tf.add_n(
        [tf.nn.l2_loss(v) for v in tf.trainable_variables()])
    ...
    train_op = optimizer.minimize(total_loss, global_step)
    return model_fn_lib.EstimatorSpec(
        loss=total_loss,
        train_op=train_op,
        …
    )
It is not necessary to have the L2-loss in the forward path. Replacing total_loss with loss will improve performance. To keep the regularization effect, modify the gradients instead, using the code below:
    # Add the weight regularization gradient
    grad = grad + self.weight_decay * var
This will achieve the same effect from a convergence perspective, but will avoid calculation of the loss value in the forward pass. The downside is that the scalar value of the total loss (loss from topology + regularization loss) will not reflect the regularization loss term – which in many cases is acceptable.
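The equivalence follows from the definition of tf.nn.l2_loss, which is sum(v ** 2) / 2, so its gradient with respect to v is v itself. The sketch below checks this numerically in plain Python. The task loss, the weight decay value, and the variable values are hypothetical stand-ins chosen for illustration.

```python
def l2_loss(values):
    """Same definition as tf.nn.l2_loss: sum(v ** 2) / 2."""
    return sum(v * v for v in values) / 2.0


def numerical_grad(fn, values, eps=1e-6):
    """Central finite-difference gradient of fn at values."""
    grads = []
    for i in range(len(values)):
        up, down = list(values), list(values)
        up[i] += eps
        down[i] -= eps
        grads.append((fn(up) - fn(down)) / (2 * eps))
    return grads


weight_decay = 1e-2        # hypothetical value
var = [0.5, -1.5, 2.0]     # hypothetical variable values


def task_loss(values):
    """Hypothetical stand-in for the loss coming from the topology."""
    return sum((v - 1.0) ** 2 for v in values)


def total_loss(values):
    """Loss with the L2 term in the forward path."""
    return task_loss(values) + weight_decay * l2_loss(values)


# Gradient when the L2 term is part of the forward path.
grad_forward = numerical_grad(total_loss, var)

# Gradient of the task loss alone, with the regularization gradient added afterwards.
grad = numerical_grad(task_loss, var)
grad_modified = [g + weight_decay * v for g, v in zip(grad, var)]

print(grad_forward)
print(grad_modified)
```

Both lists agree up to floating-point error, which is why the gradient-side update can replace the forward-path term.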
tf.data.experimental.prefetch_to_device should be the final transformation in your input pipeline:
    device = "/device:HPU:0"
    dataset = dataset.apply(tf.data.experimental.prefetch_to_device(device))
Refer to https://www.tensorflow.org/api_docs/python/tf/data/experimental/prefetch_to_device for additional details.
In case of topology trained with
Place Ops in HPU¶
Use Ops from the TensorFlow version supported by Habana. If you use Ops from an old TensorFlow version in a new TensorFlow version, the Ops may fall back to the CPU. You can set the following two environment variables to dump logs that show whether an Op runs on the CPU or on the HPU.
    HBN_TF_GRAPH_DUMP=2
    TF_DUMP_GRAPH_PREFIX=<path to save the dumped file>
The generated log file is called
TensorFlow native logging for device placement, tf.debugging.set_log_device_placement, may provide similar information.
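When a training script is launched from Python rather than a shell, the two variables can also be set through os.environ. This is a minimal sketch under two assumptions: the dump directory is a hypothetical temporary location, and the variables need to be in place before TensorFlow and the Habana libraries are loaded.

```python
import os
import tempfile

# Hypothetical dump location; replace with the directory you want to inspect.
dump_dir = tempfile.mkdtemp(prefix="hpu_graph_dump_")

# Variable names and the value 2 are taken from the text above.
os.environ["HBN_TF_GRAPH_DUMP"] = "2"
os.environ["TF_DUMP_GRAPH_PREFIX"] = dump_dir

# Assumption: import TensorFlow and load the Habana module only after this point.
print("Graph dumps will be written under", os.environ["TF_DUMP_GRAPH_PREFIX"])
```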
You can also use TPC SDK to implement Custom Ops and TF Custom Ops to use in the model. You can find examples in the Custom Op GitHub page.
Advanced: Replace some TPC operations with their MME equivalents¶
In some topologies there are large gaps between MME operations due to long TPC blocks in between. It may be a good idea to move some TPC operations (for example, reduce_sum) to the MME so that they run in parallel with the remaining TPC work. Below is an example from the SSD_ResNet34 GitHub page.
    # MME was idle during classification_loss computation.
    # In this case, conv2d is equivalent to reduce_sum, but reduce_sum is executed on TPC while conv2d runs on MME.
    sum_exp = tf.nn.conv2d(exp_shifted_logits, reduce_sum_filter, strides=1, padding="VALID", name="sum_exp")
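The equivalence relies on the fact that a 1x1 convolution with an all-ones filter adds up the input channels at every position. The sketch below shows this in plain Python on a tiny input. It assumes reduce_sum_filter is an all-ones filter with a single output channel, which the snippet above does not show, and the conv1x1 helper is hypothetical.

```python
def conv1x1(pixels, filt):
    """1x1 convolution over a list of pixels.

    pixels: [num_pixels][c_in], filt: [c_in][c_out].
    """
    c_out = len(filt[0])
    return [
        [sum(p[c] * filt[c][o] for c in range(len(p))) for o in range(c_out)]
        for p in pixels
    ]


# Two pixels with three channels each (hypothetical values).
pixels = [[1.0, 2.0, 3.0], [0.5, 0.25, 0.25]]

# Assumption: the filter is all ones, with one output channel.
ones_filter = [[1.0], [1.0], [1.0]]

conv_out = conv1x1(pixels, ones_filter)
reduce_sum_out = [[sum(p)] for p in pixels]  # sum over the channel axis, keeping the axis

print(conv_out)
print(reduce_sum_out)
```

Both results are identical. Only the engine that executes the operation differs.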
Information about Ops running in TPC and Ops running in MME can be found using profiling. The detailed profiling instructions can be found in the Profiler User Guide.