# 8. Model Performance Optimization Guide

## 8.1. Introduction

This document describes several methods for achieving the best training performance for your models on the Habana® Gaudi® accelerator.

## 8.2. Optimization in Models

### 8.2.1. Batch Size

A large batch size is generally beneficial for throughput. However, the following limitations apply when using a large batch size:

1. Batch size is limited by the size of Gaudi’s device memory (HBM), which is fixed. A larger batch size usually consumes more device memory.

2. A large batch size cannot be used when low latency, rather than throughput, is required.

3. A large batch size on each Gaudi device may impact convergence in data-parallel distributed training. For example, the highest global batch size at which RN50 converges is around 32K. This means that as the number of Gaudi devices increases, the batch size on each device should be reduced.
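
As a rough illustration of item 3, the sketch below divides a global batch-size cap across data-parallel workers. The cap of 32 * 1024 is an assumed reading of "around 32K", and max_per_device_batch is a hypothetical helper, not part of any Habana API.

```python
def max_per_device_batch(global_batch_cap: int, num_devices: int) -> int:
    """Largest per-device batch that keeps the global batch within the cap."""
    if num_devices < 1:
        raise ValueError("num_devices must be at least 1")
    return global_batch_cap // num_devices


# Assumption: "around 32K" is taken as 32 * 1024 for illustration only.
GLOBAL_BATCH_CAP = 32 * 1024

for num_devices in (8, 64, 256):
    per_device = max_per_device_batch(GLOBAL_BATCH_CAP, num_devices)
    print(f"{num_devices} devices: per-device batch size <= {per_device}")
```

The result is only an upper bound from the convergence side; the device memory limit in item 1 may require a smaller value.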

The table below lists example batch sizes used for different models, all with mixed precision.

| Models | Batch Size |
| --- | --- |
| Resnet50 | 256 |
| Bert Large pre-training Phase 1 | 64 |
| Bert Large pre-training Phase 2 | 8 |
|  | 4 |

### 8.2.2. Mixed Precision

#### 8.2.2.1. TensorFlow Models

For details on how to run mixed precision training of TensorFlow models on Gaudi, refer to TensorFlow Mixed Precision Training on Gaudi.

#### 8.2.2.2. PyTorch Models

For details on how to run mixed precision training of PyTorch models on Gaudi, refer to PyTorch Mixed Precision Training on Gaudi.

### 8.2.3. (non-)Optimal Constructs

TensorFlow ops are implemented as Synapse graphs, which usually contain a single node (an HPU op) that invokes a TPC or MME kernel. The graph compiler clusters HPU ops and compiles them together, applying various optimizations to boost performance.

The following sections highlight methods that may improve performance.

#### 8.2.3.1. Avoid Non-Logical Transpose Operations

A transpose that does not involve the Fastest-Changing-Dimension (FCD) axis is a logical operation and only manipulates tensor metadata. Otherwise, it is a DMA operation that copies the entire tensor to another address, which takes time.

If a transpose is required only to match the data order of the ground truth, consider modifying the dataloader accordingly instead.

Example 1 – Transpose as logical operation:

    import tensorflow as tf

    in_tensor = tf.compat.v1.placeholder(shape=[32, 8000, 64], dtype=tf.float32)
    mul = in_tensor * 1.234
    # The last axis (2) stays in place, so only metadata changes.
    mul = tf.transpose(mul, perm=(1, 0, 2))
    out = mul + 1.0


Figure 8.1 Transpose as Logical Operation

Example 2 – Transpose as non-logical operation:

    import tensorflow as tf

    in_tensor = tf.compat.v1.placeholder(shape=[32, 8000, 64], dtype=tf.float32)
    mul = in_tensor * 1.234
    # The last axis is swapped with axis 1, so the data must be copied.
    mul = tf.transpose(mul, perm=(0, 2, 1))
    out = mul + 1.0


Figure 8.2 Transpose as Non-Logical Operation
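
The rule illustrated by the two examples can be sketched as a small check on the permutation. This sketch assumes that the Fastest-Changing-Dimension is the last axis of the tensor, which matches both examples; is_logical_transpose is a hypothetical helper, and the actual decision is made by the graph compiler.

```python
def is_logical_transpose(perm):
    """Return True when the permutation leaves the last axis in place.

    Assumption: the last axis is the Fastest-Changing-Dimension.
    """
    rank = len(perm)
    if sorted(perm) != list(range(rank)):
        raise ValueError(f"not a permutation of 0..{rank - 1}: {perm}")
    return rank == 0 or perm[-1] == rank - 1


print(is_logical_transpose((1, 0, 2)))  # permutation from Example 1
print(is_logical_transpose((0, 2, 1)))  # permutation from Example 2
```

A check like this can be run over the transposes in a model to find candidates worth moving into the dataloader.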

#### 8.2.3.2. Avoid Using tf.pad

Gaudi supports padding in hardware. Some TensorFlow models use tf.pad. Avoid it whenever possible, because in some rare cases the graph optimizations in the SynapseAI® Software Stack do not remove tf.pad automatically.

(Recommended) Example code that does not use tf.pad:

    BS, H, W, CH_in, CH_out, F_H, F_W = 16, 128, 128, 16, 8, 3, 3
    input_tensor = tf.compat.v1.placeholder(shape=[BS, H, W, CH_in],
                                            dtype=tf.float32, name="input")
    filters = tf.compat.v1.placeholder(shape=[F_H, F_W, CH_in, CH_out],
                                       dtype=tf.float32)
    out = tf.nn.conv2d(input=input_tensor, filters=filters, strides=1,


(Not Recommended) Example code that uses tf.pad:

    BS, H, W, CH_in, CH_out, F_H, F_W = 16, 128, 128, 16, 8, 3, 3
    input_tensor = tf.compat.v1.placeholder(shape=[BS, H, W, CH_in],
                                            dtype=tf.float32, name="input")
    filters = tf.compat.v1.placeholder(shape=[F_H, F_W, CH_in, CH_out],
                                       dtype=tf.float32)
    out = tf.nn.conv2d(input=padded, filters=filters, strides=1,


Figure 8.3 Hardware profiling - tf.pad + tf.nn.conv2d
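
To see why a separate padding op is redundant, the pure-Python sketch below compares two ways of computing a padded 1-D convolution: building a zero-padded copy first, and folding the padding into the convolution by treating out-of-range reads as zeros. This is a conceptual illustration only; it is not TensorFlow or Gaudi code, and the function names are hypothetical. In TensorFlow, the padding argument of tf.nn.conv2d plays the role of the second variant.

```python
def conv1d_valid(signal, kernel):
    """'VALID' sliding window: the kernel never leaves the signal."""
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal) - k + 1)]


def conv1d_implicit_pad(signal, kernel, pad):
    """Same window, but out-of-range reads count as zero (no padded copy)."""
    k, n = len(kernel), len(signal)
    out = []
    for i in range(-pad, n + pad - k + 1):
        acc = 0.0
        for j in range(k):
            idx = i + j
            if 0 <= idx < n:
                acc += signal[idx] * kernel[j]
        out.append(acc)
    return out


signal = [1.0, 2.0, 3.0, 4.0]
kernel = [0.5, 1.0, -1.0]
pad = 1

# Variant 1: explicit padded copy, then a 'VALID' convolution.
explicit = conv1d_valid([0.0] * pad + signal + [0.0] * pad, kernel)
# Variant 2: padding folded into the convolution itself.
implicit = conv1d_implicit_pad(signal, kernel, pad)

print(explicit == implicit)
```

Both variants produce the same values; the second avoids materializing an extra padded tensor.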

#### 8.2.3.3. Avoid Calculating L2 Loss in Forward Path

Calculating L2-loss slightly slows down training. Below is an example where L2-loss is used:

    total_loss = loss + params['weight_decay'] * tf.add_n(
        [tf.nn.l2_loss(v) for v in tf.trainable_variables()])
    ...
    train_op = optimizer.minimize(total_loss, global_step)
    return model_fn_lib.EstimatorSpec(
        loss=total_loss,
        train_op=train_op, … )


It is not necessary to have the L2-loss in the forward path. Replacing total_loss with loss improves performance, and the regularization can instead be applied by modifying the gradients using the code below:

# Add the weight regularization gradient



This will achieve the same effect from a convergence perspective, but will avoid calculation of the loss value in the forward pass. The downside is that the scalar value of the total loss (loss from topology + regularization loss) will not reflect the regularization loss term – which in many cases is acceptable.
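
The equivalence rests on the fact that tf.nn.l2_loss(v) is defined as sum(v ** 2) / 2, so the gradient of weight_decay * l2_loss(v) with respect to v is weight_decay * v. The pure-Python sketch below checks this numerically. The task loss and all function names here are hypothetical stand-ins, not the model code from the example above.

```python
def l2_loss(w):
    """Mirrors the definition of tf.nn.l2_loss: sum(w ** 2) / 2."""
    return sum(x * x for x in w) / 2.0


def task_loss(w):
    """Hypothetical stand-in for the topology loss."""
    return sum((x - 1.0) ** 2 for x in w)


def task_grad(w):
    """Analytic gradient of the stand-in task loss."""
    return [2.0 * (x - 1.0) for x in w]


def numerical_grad(fn, w, eps=1e-6):
    """Central-difference gradient, used only to verify the shortcut."""
    grads = []
    for i in range(len(w)):
        hi, lo = list(w), list(w)
        hi[i] += eps
        lo[i] -= eps
        grads.append((fn(hi) - fn(lo)) / (2 * eps))
    return grads


weight_decay = 1e-2
w = [0.5, -2.0, 3.0]

# Variant 1: regularization term computed in the forward path.
grad_forward = numerical_grad(
    lambda v: task_loss(v) + weight_decay * l2_loss(v), w)

# Variant 2: regularization added directly to the gradients.
grad_direct = [g + weight_decay * x for g, x in zip(task_grad(w), w)]

print(max(abs(a - b) for a, b in zip(grad_forward, grad_direct)))
```

The printed difference should be close to zero, which shows that the weight update is the same while the forward pass skips the L2 computation.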

#### 8.2.3.4. Use tf.data.experimental.prefetch_to_device

tf.data.experimental.prefetch_to_device should be applied as the last transformation in your input pipeline:

    device = "/device:HPU:0"
    dataset = dataset.apply(tf.data.experimental.prefetch_to_device(device))


Refer to https://www.tensorflow.org/api_docs/python/tf/data/experimental/prefetch_to_device for additional details.

Note

For topologies trained with tf.estimator.Estimator, use habana_frameworks.tensorflow.habana_estimator.

#### 8.2.3.5. Place Ops in HPU

Use Ops from the TensorFlow version supported by Habana. Ops taken from an older TensorFlow version and used with a newer TensorFlow version may fall back to the CPU. To see whether an Op runs on the CPU or on the HPU, set the following two environment variables to dump logs:

    HBN_TF_GRAPH_DUMP=2
    TF_DUMP_GRAPH_PREFIX=<path to save the dumped file>


The generated log file is called habana_POST_PLACEMENT_00_Before.pbtxt. The example below is taken from this file:

    node {
      name: "MobilenetV2/Conv/BatchNorm/gamma/Assign"
      op: "AssignVariableOp"
      input: "MobilenetV2/Conv/BatchNorm/gamma"
      input: "MobilenetV2/Conv/BatchNorm/gamma/Initializer/ones"
      attr {
        key: "dtype"
        value {
          type: DT_FLOAT
        }
      }
    }
    node {
      name: "activations/layer_1"
      op: "HistogramSummary"
      input: "activations/layer_1/tag"
      input: "MobilenetV2/Conv/CustomRelu6Op"
      attr {
        key: "T"
        value {
          type: DT_FLOAT
        }
      }
    }


TensorFlow native logging for device placement, tf.debugging.set_log_device_placement, may provide similar information.
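
For large dumps, a short script can summarize where each Op was placed. The sketch below is a minimal line-based scan, not a full pbtxt parser. It assumes that each node in the dump carries the standard GraphDef device field; the excerpt above does not show that field, and the device strings in the sample are hypothetical.

```python
import re

# Top-level string fields of interest inside each node block.
FIELD_RE = re.compile(r'^(name|op|device):\s*"(.*)"$')


def list_nodes(pbtxt_text):
    """Collect name/op/device for every top-level node block."""
    nodes, current, depth = [], None, 0
    for raw in pbtxt_text.splitlines():
        line = raw.strip()
        if depth == 0 and line.startswith("node") and line.endswith("{"):
            current, depth = {}, 1
            continue
        if current is None:
            continue
        if line.endswith("{"):
            depth += 1          # entering a nested block such as attr { ... }
        elif line == "}":
            depth -= 1
            if depth == 0:      # end of the node block
                nodes.append(current)
                current = None
        elif depth == 1:
            match = FIELD_RE.match(line)
            if match:
                current[match.group(1)] = match.group(2)
    return nodes


# Hypothetical sample; the device values are assumptions for illustration.
SAMPLE = '''
node {
  name: "MobilenetV2/Conv/BatchNorm/gamma/Assign"
  op: "AssignVariableOp"
  device: "/job:localhost/replica:0/task:0/device:HPU:0"
  attr {
    key: "dtype"
    value {
      type: DT_FLOAT
    }
  }
}
node {
  name: "activations/layer_1"
  op: "HistogramSummary"
  device: "/job:localhost/replica:0/task:0/device:CPU:0"
}
'''

nodes = list_nodes(SAMPLE)
ops_on_cpu = [n["op"] for n in nodes if "CPU" in n.get("device", "")]
print(ops_on_cpu)
```

To use it on a real dump, read the contents of the .pbtxt file into a string and pass it to list_nodes.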

You can also use the TPC SDK to implement Custom Ops and TF Custom Ops for use in the model. Examples are available in the Custom Op GitHub page and in the MobileNetV2 Custom Op GitHub page.

#### 8.2.3.6. Advanced: Replace some TPC operations with their MME equivalents

In some topologies, there are large gaps between MME operations due to long TPC blocks in between. It may help to run some TPC operations (for example, reduce_sum) on the MME instead, so that both engines work in parallel. Below is an example from the SSD_ResNet34 GitHub page.

    # MME was idle during classification_loss computation.
    # In this case, conv2d is equivalent to reduce_sum, but reduce_sum is
    # executed on TPC while conv2d is executed on MME.
    sum_exp = tf.nn.conv2d(exp_shifted_logits, reduce_sum_filter, strides=1,


Information about Ops running in TPC and Ops running in MME can be found using profiling. The detailed profiling instructions can be found in the Profiler User Guide.
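
The equivalence between conv2d and reduce_sum used in the example above can be illustrated without TensorFlow. The pure-Python sketch below assumes that reduce_sum_filter is an all-ones 1x1 filter with a single output channel; the excerpt does not show its definition, so this is an assumption, and the function names are hypothetical.

```python
def reduce_sum_channels(x):
    """x is [positions][channels]; sum over the channel axis."""
    return [sum(row) for row in x]


def conv1x1_single_output(x, weights):
    """1x1 convolution with one output channel: a per-position dot product."""
    return [sum(v * w for v, w in zip(row, weights)) for row in x]


x = [
    [1.0, 2.0, 3.0],
    [0.5, 0.25, 0.25],
    [-1.0, 4.0, 2.0],
]
ones_filter = [1.0] * 3  # assumed content of the reduce-sum filter

print(conv1x1_single_output(x, ones_filter) == reduce_sum_channels(x))
```

Because a dot product with a vector of ones is a plain sum, the convolution returns the same values as the reduction while running on a different engine.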

## 8.3. Optimization in Training Platform

### 8.3.1. Ensure MPI Run Command Includes Core/Socket Binding/Mapping to the Right Number of Cores

    --bind-to core --map-by socket:PE=7


This ensures that CPU affinity is set correctly for your processes and that the cores are distributed among them. In this example, 7 cores are allocated to each process, and each process controls one Gaudi in the same socket as those 7 cores. CPU affinity can improve performance not only for distributed training using multiple Gaudi devices in one server, but also for training using a single Gaudi device.

The PE value is calculated from the number of CPU(s) and the Thread(s) per core (both reported by the lscpu command), and from the number of Gaudi devices (for example, 8 for Habana’s HLS-1; other servers may have 4 devices). In the example below, there are 112 CPU(s) and 2 Thread(s) per core; therefore, 112 / 2 / 8 = 7 cores per Gaudi.

    $ lscpu
    …
    CPU(s):              112
    On-line CPU(s) list: 0-111
    Thread(s) per core:  2
    …
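
The calculation above can be scripted. The following sketch parses lscpu-style text and derives the PE value; cores_per_device is a hypothetical helper, and the sample text simply repeats the values from the example.

```python
import re


def cores_per_device(lscpu_text, num_devices):
    """Physical cores available per Gaudi: CPUs / threads per core / devices."""
    cpus = int(re.search(
        r"^\s*CPU\(s\):\s*(\d+)", lscpu_text, re.MULTILINE).group(1))
    threads = int(re.search(
        r"^\s*Thread\(s\) per core:\s*(\d+)", lscpu_text, re.MULTILINE).group(1))
    return cpus // threads // num_devices


LSCPU_SAMPLE = """\
CPU(s):              112
On-line CPU(s) list: 0-111
Thread(s) per core:  2
"""

pe = cores_per_device(LSCPU_SAMPLE, num_devices=8)
mpi_flags = f"--bind-to core --map-by socket:PE={pe}"
print(mpi_flags)
```

On a real server, the text could be obtained with subprocess.run(["lscpu"], capture_output=True, text=True).stdout, and num_devices set to the number of Gaudi devices installed.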

### 8.3.2. Set CPU Setting to Performance

Below is an example of setting the CPU to performance on Ubuntu:

Get setting:

    cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor

Set setting:

    echo performance | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
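
Before launching a long training run, the governor setting can also be checked from a script. The sketch below reads the same sysfs files as the cat command above; read_governors and all_performance are hypothetical helper names.

```python
import glob

GOVERNOR_GLOB = "/sys/devices/system/cpu/cpu*/cpufreq/scaling_governor"


def read_governors(pattern=GOVERNOR_GLOB):
    """Map each scaling_governor file to its current value."""
    governors = {}
    for path in sorted(glob.glob(pattern)):
        try:
            with open(path) as handle:
                governors[path] = handle.read().strip()
        except OSError:
            # Skip entries that cannot be read (for example, offline cores).
            continue
    return governors


def all_performance(governors):
    """True only if at least one core was found and all use 'performance'."""
    return bool(governors) and all(
        value == "performance" for value in governors.values())


current = read_governors()
print("all cores set to performance:", all_performance(current))
```

On systems without cpufreq support the glob matches nothing, and the check reports False rather than raising an error.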


### 8.3.3. Use Low Latency Kernel

For BERT, the generic kernel shows a 15-20% performance drop compared with the low latency kernel (for example, 5.4.0-72-lowlatency on Ubuntu 18.04). Use the command below to install the low latency kernel:

    sudo apt install linux-lowlatency-hwe-18.04
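
After rebooting, the running kernel can be verified from Python. This sketch assumes that the low latency kernel is identified by a -lowlatency suffix in the release string, as in the 5.4.0-72-lowlatency example above; is_lowlatency is a hypothetical helper.

```python
import platform


def is_lowlatency(release):
    """Assumption: low latency kernels end with '-lowlatency'."""
    return release.endswith("-lowlatency")


running = platform.release()
print(running, "low latency:", is_lowlatency(running))
```

The same information is available on the command line through uname -r.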


### 8.3.4. Ensure FW/SW Packages/BMC/CPLD are the Latest

Refer to Check Habana Package Installation for no Docker to check for the latest FW/SW Packages/BMC/CPLD versions and make sure they are installed properly.

### 8.3.5. Ensure Docker Run Command Used is the Latest

Refer to Run Docker Command to make sure the docker run command you use is up to date.

### 8.3.7. Ensure Gaudi Card Frequency is Set Correctly

Refer to Gaudi Clock Freq to check Gaudi card frequency using hl-smi.

### 8.3.8. Ensure Dataset/Model Code/Output are Placed on High Performing Hard Drive (NVMe/SSD)

For best performance, use NVMe storage for all datasets, model code, and output locations when running the training.

### 8.3.9. Ensure Network ifs for Habana Cards Have Static IP

Set the network interfaces (ifs) of the Habana cards to use static IPs by running the command below:

    ./manage_network_ifs.sh --set-ip


For further details, refer to manage_network_ifs.sh.