9. Model Performance Optimization Guide for PyTorch

9.1. Introduction

This document describes multiple methods you can implement to achieve the best performance when training your PyTorch models on the Habana® Gaudi® accelerator.

9.2. Optimization in Models

9.2.1. Batch Size

A large batch size is, in general, beneficial for throughput. However, some limitations apply when using a large batch size. For a list of limitations, see Batch Size.

9.2.2. PyTorch Mixed Precision

For details on how to run mixed precision training of PyTorch models on Gaudi, refer to PyTorch Mixed Precision Training on Gaudi.

9.2.3. Convolution Weight Ordering in PyTorch Habana Vision Topologies

Convolution operations are central to vision topologies like ResNet. Gaudi hardware performs convolution operations with the filters (weights) in a ‘filters last’ format, RSCK, where:

  • R = height of the filter

  • S = width of the filter

  • C = number of channels per filter

  • K = number of filters

The default PyTorch convolution weight ordering is ‘filters first’ (KCRS). Therefore, a re-ordering/permutation of all convolution weights from KCRS to RSCK format is required before convolution operations. This permutation is done once at the beginning of training in the PyTorch Habana vision topologies. However, because the weights remain in RSCK format during training, a conversion back to KCRS format is necessary when saving intermediate checkpoints or the final trained weights. This brings the weights back to the default PyTorch format (KCRS), for example for use across DL training platforms.

Due to the permutation of the weights to RSCK format, the gradients of these weights will automatically be in the same format on the HPU. Any other tensors that are calculated as a function of the convolution weights (or their gradients) on the HPU will also be in RSCK format; an example is the ‘momentum’ tensors corresponding to convolution weights in a ResNet model trained with the Stochastic Gradient Descent with Momentum optimizer. If these tensors (convolution weights, gradients, momentum, etc.) are transferred between the CPU and the HPU, apply the appropriate permutations so they are aligned with the destination's default format (for example, CPU (KCRS) <–> (RSCK) HPU).
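For illustration only, the reordering itself can be expressed with a plain torch.permute. The following is a minimal sketch, not the permute_params/permute_momentum helpers used in the reference scripts:

import torch

weight_kcrs = torch.randn(64, 3, 7, 7)         # K, C, R, S (default PyTorch layout)
weight_rsck = weight_kcrs.permute(2, 3, 1, 0)  # R, S, C, K (Gaudi filters-last layout)
weight_back = weight_rsck.permute(3, 2, 0, 1)  # back to K, C, R, S
assert weight_back.shape == weight_kcrs.shape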

The following sections list the various scenarios in which such permutations need to be handled and provide recommendations on how to handle them. The instructions refer to the permutations done in the ResNet training script located in the PyTorch Model Reference GitHub page.

9.2.3.1. Scenario 1: Initializing Training from the Beginning

  1. Initialize the weights on the CPU for the entire model.

  2. Move the model to ‘hpu’ device, for example, model.to("hpu").

  3. Permute the convolution weights and any other dependent tensors like ‘momentum’ to RSCK format. For example:

permute_params(model, True...)  # permute conv weights; True -> KCRS to RSCK format
permute_momentum(optimizer, True ...) # permute momentum; True -> KCRS to RSCK format

  4. Start the training.

9.2.3.2. Scenario 2: Initializing Training from a Checkpoint

  1. If checkpoint loading follows the weight permutation described in Scenario 1, first permute the weights and dependent tensors back to the default PyTorch format (if not, go to step 2). For example:

permute_params(model, False...)  # permute conv weights; False ->  RSCK to KCRS format
permute_momentum(optimizer, False ...)  # permute momentum; False ->  RSCK to KCRS format

  2. Load the checkpoint and optimizer state dictionary.

  3. Move the model to ‘hpu’ device (if not already done).

  4. Permute the weights and dependent tensors to RSCK format. For example:

permute_params(model, True...)  # permute conv weights; True -> KCRS to RSCK format
permute_momentum(optimizer, True ...) # permute momentum; True -> KCRS to RSCK format

  5. Start the training.

9.2.3.3. Scenario 3: Saving a Checkpoint

The convolution weights and dependent tensors on the ‘hpu’ device are in RSCK format.

  1. Permute the weights and dependent tensors to KCRS format. For example:

permute_params(model, False...)  # permute conv weights; False ->  RSCK to KCRS format
permute_momentum(optimizer, False ...) # permute momentum; False ->  RSCK to KCRS format

  2. Bring the trainable parameters of the model and optimizer tensors to the CPU and save.

  3. Move the trainable params and optimizer tensors to ‘hpu’.

  4. Permute the conv weight tensors and dependent tensors to RSCK format. For example:

permute_params(model, True...)  # permute conv weights; True -> KCRS to RSCK format
permute_momentum(optimizer, True ...) # permute momentum; True -> KCRS to RSCK format

The requirement for explicit addition of permutes with permute_params and permute_momentum in the model script will be removed in future releases.

9.2.4. Placement of Ops on HPU

To get optimal performance on HPU, avoid executing ops on the CPU. When a model is ported to run on HPU, the software stack decides which ops are placed on the CPU and which are placed on the HPU.

This decision is based on whether the op is registered with PyTorch with HPU as the backend and whether the requested datatype is supported on HPU. Execution of an op automatically falls back to the CPU if the op is not registered with HPU as its backend, or if the op is registered but the requested datatype is not supported on HPU.

To enable CPU fallback logs to check whether op execution fell back to CPU, set the environment variables as shown below:

PT_HPU_LOG_MOD_MASK=0x80 PT_HPU_LOG_TYPE_MASK=0x8

For example, when the aten::digamma op falls back once to the CPU, you will see logs as shown below:

CPU fallback digamma : self=HPUBFloat16Type

Frequency of op and op name that were executed on CPU:
1       aten::digamma
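These variables are typically set when launching the run; for example (train.py is a hypothetical script name):

PT_HPU_LOG_MOD_MASK=0x80 PT_HPU_LOG_TYPE_MASK=0x8 python train.py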

9.2.5. Use of Channels Last Memory Format in PyTorch Habana Vision Topologies

PyTorch supports the channels last memory format, which generally performs better on HPU than the default memory format. For this reason, the channels last memory format is supported by default for models such as ResNet, ResNext101, Unet, etc.

Example:

  • To convert a tensor img from NCHW to NHWC format, run:

img = img.contiguous(memory_format=torch.channels_last)

  • To convert a tensor img from NCDHW to NDHWC format, run:

img = img.contiguous(memory_format=torch.channels_last_3d)
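
In addition to the input tensors, the model itself can be converted so that its convolution weights are laid out in channels last format. A minimal sketch, assuming the model and input batch already exist:

model = model.to(memory_format=torch.channels_last)      # convert parameters to channels last
img = img.contiguous(memory_format=torch.channels_last)  # convert the input batch
output = model(img)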

9.2.6. Usage of Fused Operators

Create a custom op for the optimizer (e.g., FusedSGD, FusedAdamW) and for other complex ops (e.g., FusedClipNorm) to minimize the host performance overhead of running many small ops. This can improve the overlap of execution between the host and the device.

The Habana PyTorch package provides several custom Habana ops for PyTorch.

Example:

Refer to the custom operator FusedSGD in ResNet50 FusedSGD.
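
A minimal sketch of using the fused optimizer, assuming FusedSGD is importable from habana_frameworks.torch.hpex.optimizers in your release (verify against the installed Habana PyTorch package):

from habana_frameworks.torch.hpex.optimizers import FusedSGD  # assumed import path

# Drop-in replacement for torch.optim.SGD in the training loop:
optimizer = FusedSGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=1e-4)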

9.2.7. Adjust the Gradient Bucket Size in Multi-card/Multi-node Training

Based on the size of the model, the size of the gradient bucket can be adjusted to minimize the number of all-reduce invocations in the backward pass of every training iteration. Documentation is available in PyTorch DDP.

Example:

In ResNet50, a bucket size of 100MB is optimal, whereas ResNext101 requires a bucket size of 200MB. Refer to the implementation here.
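
A minimal sketch using the standard bucket_cap_mb argument of torch.nn.parallel.DistributedDataParallel (assuming the process group is already initialized):

import torch

model = torch.nn.parallel.DistributedDataParallel(
    model,
    bucket_cap_mb=100,  # gradient bucket size in MB; tune per model (e.g. 200 for ResNext101)
)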

9.2.8. Setting Gradients as View of Gradient Buckets in Multi-card/Multi-node Training

PyTorch DDP allows parameter gradient tensors to be views of the gradient buckets. This improves performance by reducing device-to-device copies and also reduces the device memory requirement. Documentation is available in PyTorch DDP.

Example:

Refer to the implementation for ResNet50.
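
A minimal sketch using the gradient_as_bucket_view argument of torch.nn.parallel.DistributedDataParallel:

model = torch.nn.parallel.DistributedDataParallel(
    model,
    gradient_as_bucket_view=True,  # parameter .grad tensors become views of the allreduce buckets
)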

9.2.9. Reducing the Frequency of Printing Quantities (like loss, accuracy etc)

Printing these tensors in the training script requires the device tensors to be pulled to the host, which in turn requires device execution to finish. This can result in non-overlapped execution between the host and the device, leading to sub-optimal performance.
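
For example, only synchronize and print every print_freq steps (print_freq, model, criterion, and the loop variables are placeholders; loss.item() is the call that pulls the tensor to the host):

print_freq = 100  # hypothetical reporting interval
for step, (images, target) in enumerate(data_loader):
    loss = criterion(model(images), target)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad(set_to_none=True)
    if (step + 1) % print_freq == 0:
        print(f"step {step + 1}: loss = {loss.item():.4f}")  # host sync happens only here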

9.2.10. Pinning Memory For Dataloader

Pinning the memory while instantiating the dataloader avoids a redundant copy on the host during the training iteration. Refer to the support in PyTorch Dataloader.

Example:

Refer to the implementation for ResNet50 Dataloader.
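
A minimal sketch (the dataset, batch size, and worker count are placeholders):

from torch.utils.data import DataLoader

data_loader = DataLoader(
    dataset,
    batch_size=256,
    num_workers=8,
    pin_memory=True,  # page-locked host memory; avoids an extra host-side copy per batch
)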

9.3. Optimizations for Training using Pytorch Lightning

The DDPPlugin provided in Habana’s PyTorch Lightning package supports features such as setting the size of the gradient bucket, setting gradients as views of the allreduce buckets, and static_graph. The first two are described in the sections above.

Setting static_graph when instantiating the Trainer avoids allreduce on parameters that are unused in the graph. This also avoids the overhead of copying them from host to device and vice versa after performing the allreduce.

Example:

Refer to the implementation for Unet2D.
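
A hedged sketch; the exact import path and Trainer arguments (plugins vs. strategy, devices vs. hpus) depend on the installed Lightning and Habana package versions:

from pytorch_lightning import Trainer
from pytorch_lightning.plugins import DDPPlugin  # use the DDPPlugin shipped in Habana's PyTorch Lightning package

ddp_plugin = DDPPlugin(
    bucket_cap_mb=100,             # gradient bucket size
    gradient_as_bucket_view=True,  # gradients as views of the allreduce buckets
    static_graph=True,             # skip allreduce on parameters unused in the graph
)
trainer = Trainer(accelerator="hpu", devices=8, plugins=[ddp_plugin])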

9.4. Optimization in Training Platform

See Optimization in Training Platform.