Optimizations of PyTorch Models

The following optimization methods can be applied to PyTorch models run on the Intel® Gaudi® AI accelerator to enhance their performance.

General Model Optimizations

The optimization methods below can be used with all PyTorch models.

Placement of Ops on HPU

When a model is ported to HPU, the Intel Gaudi software stack distributes ops between CPU and HPU. To achieve optimal performance on HPU, avoid executing ops on CPU.

The distribution depends on two conditions: whether the op is registered in PyTorch with the HPU backend, and whether the requested data type is supported on HPU. Execution of an op automatically falls back to CPU if the op is not registered with the HPU backend, or if it is registered but the requested data type is not supported on HPU.

To enable CPU fallback logs to check whether ops were executed on CPU, set the environment variable as shown below:

LOG_LEVEL_PT_FALLBACK=1

Example:

When the aten::digamma op falls back to CPU, the logs display the following:

CPU fallback digamma : self=HPUBFloat16Type

Frequency of op and op name that were executed on CPU:
1       aten::digamma
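When many ops fall back, the summary at the end of the log is easier to act on once it is tallied. The sketch below parses that summary into a per-op count. It assumes the log keeps exactly the layout shown above (the header line followed by count and op name pairs); parse_fallback_summary and sample_log are illustrative names, not part of the Intel Gaudi software.

```python
import re
from collections import Counter

# Header line copied from the log excerpt above.
SUMMARY_HEADER = "Frequency of op and op name that were executed on CPU:"


def parse_fallback_summary(log_text: str) -> Counter:
    """Return {op name: count} from the CPU fallback summary in a log."""
    counts = Counter()
    in_summary = False
    for line in log_text.splitlines():
        stripped = line.strip()
        if stripped.startswith(SUMMARY_HEADER):
            in_summary = True
            continue
        if in_summary:
            match = re.fullmatch(r"(\d+)\s+(\S+)", stripped)
            if match:
                counts[match.group(2)] += int(match.group(1))
            elif stripped:
                # Any other non-empty line ends the summary (assumption).
                in_summary = False
    return counts


sample_log = """CPU fallback digamma : self=HPUBFloat16Type

Frequency of op and op name that were executed on CPU:
1       aten::digamma
"""

fallbacks = parse_fallback_summary(sample_log)
for op_name, count in fallbacks.most_common():
    print(f"{op_name}: {count} CPU fallback(s)")
```

Ops that appear with a high count are the first candidates for a data type change or a replacement op.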

Usage of mark_step

mark_step is added after the backward pass and after the optimizer step. Adding further mark_step calls can optimize your model in some cases. The following are examples of when adding mark_step is beneficial:

  • mark_step is added to avoid Out of Memory issues. When a graph is too large to fit in the available memory, mark_step can be used to break it into smaller graphs. This reduces memory consumption and overcomes the Out of Memory issue. See DeepSpeed Bert for an example.

  • mark_step can also be used when the graph has both static and dynamic shapes. Because of the dynamic shapes, the graph is recompiled, which degrades performance. Adding mark_step after the static part of the graph may reduce recompilations, or limit recompilation to smaller dynamic graphs.
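The placement described above can be sketched as a minimal training loop. The sketch uses htcore.mark_step() from habana_frameworks.torch.core when that package is importable, and otherwise substitutes a no-op so the loop also runs on a machine without Gaudi. The two-stage model and the point where the forward pass is split are hypothetical; where an extra mark_step helps depends on the model.

```python
import torch

try:
    # Present only where the Intel Gaudi PyTorch package is installed.
    import habana_frameworks.torch.core as htcore
    mark_step = htcore.mark_step
    device = torch.device("hpu")
except ImportError:
    # Assumption for this sketch: fall back to a no-op on CPU.
    def mark_step() -> None:
        pass
    device = torch.device("cpu")

# Hypothetical two-stage model; the split point is arbitrary.
stage_1 = torch.nn.Sequential(torch.nn.Linear(16, 32), torch.nn.ReLU()).to(device)
stage_2 = torch.nn.Linear(32, 4).to(device)
params = list(stage_1.parameters()) + list(stage_2.parameters())
optimizer = torch.optim.SGD(params, lr=0.1)
loss_fn = torch.nn.MSELoss()

steps_run = 0
for _ in range(3):
    inputs = torch.randn(8, 16, device=device)
    targets = torch.randn(8, 4, device=device)

    optimizer.zero_grad()
    hidden = stage_1(inputs)
    mark_step()  # extra call: breaks the forward graph into two smaller graphs
    loss = loss_fn(stage_2(hidden), targets)
    loss.backward()
    mark_step()  # after the backward pass
    optimizer.step()
    mark_step()  # after the optimizer step
    steps_run += 1
```

Each extra call adds a graph launch, so it is worth measuring memory use and throughput before and after adding one.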

Batch Size

Throughput is usually improved when the batch sizes are large. However, there are limitations that apply when using large batch size:

  • Batch size is limited by the fixed size of Gaudi’s device memory (HBM). A larger batch size usually consumes more device memory.

  • A large batch size cannot be used when low latency, rather than throughput, is required.

  • Large batch size in each Gaudi device may impact the convergence in data parallelism distributed training. For example, the highest global batch size that gives ResNet50 convergence is around 32K. This means that with an increasing number of Gaudi devices, batch size should be reduced in each device.
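The data parallelism constraint above is simple arithmetic: the per-device batch size times the number of devices must stay within the global batch size at which the model still converges. The sketch below is a hypothetical helper, not part of any Intel Gaudi package. It assumes that "around 32K" means 32 * 1024 samples and uses the ResNet50 per-device batch size of 256 quoted in this section.

```python
def per_device_batch_size(num_devices: int,
                          preferred_per_device: int,
                          max_global_batch: int) -> int:
    """Largest per-device batch size that keeps the global batch within the cap."""
    if num_devices < 1:
        raise ValueError("num_devices must be at least 1")
    return min(preferred_per_device, max_global_batch // num_devices)


MAX_GLOBAL_BATCH = 32 * 1024  # assumption: "around 32K" for ResNet50

for devices in (8, 128, 256, 512):
    size = per_device_batch_size(devices, 256, MAX_GLOBAL_BATCH)
    print(f"{devices} devices -> per-device batch size {size}")
```

Under these assumptions the preferred batch size of 256 can be kept up to 128 devices, and has to be halved for every doubling of the device count beyond that.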

The table below provides examples of batch sizes used in different models, all using mixed precision.

Model                              Batch Size
ResNet50                           256
Bert Large pre-training Phase 1    64
Bert Large pre-training Phase 2    8
MaskRCNN                           4

PyTorch Mixed Precision

For details on how to run mixed precision training of PyTorch models on Gaudi, refer to Mixed Precision Training with PyTorch Autocast.

Usage of Fused Optimizers and Custom Ops

Create a custom op for PyTorch optimizers (FusedSGD, FusedAdamW) and other complex ops (FusedClipNorm) to minimize host performance overheads of running many small ops. This can improve the overlap of execution between host and device.

The Intel Gaudi PyTorch package provides its own implementation of PyTorch ops customized for Gaudi. For more details, see Fused Optimizers and Custom Ops for Intel Gaudi.

Example:

For the custom FusedSGD operator, refer to ResNet50 FusedSGD.

Perf Tool and TensorBoard Model Scanning

The habana_perf_tool scans and provides guidance on existing log files generated for TensorBoard, without having to run the TensorBoard UI. The tool scans the log file, shows a list of metrics that it measures, and then provides specific guidance for optimization, such as increasing batch size or MME and TPC usages and timings. This analysis capability is also built directly into TensorBoard. The tool can be initiated with the following command:

root@ubuntu2204:~/traces# habana_perf_tool --trace trace_example.json

Model-specific Optimizations

The optimization methods below are supported on specific PyTorch models.

Using Fused Scaled Dot Product Attention (FusedSDPA)

FusedSDPA is a fused implementation of torch.nn.functional.scaled_dot_product_attention() for Gaudi. It maintains the same functionality as the original function with reduced memory usage and implements selected Flash Attention optimization approaches. For further information on the original functionality and parameters, refer to Scaled Dot Product Attention.

Note

  • FusedSDPA is designed to accelerate training and inference of Transformer-based models.

  • The supported data types are FP32 and BF16. Running inference with FP8 data type is possible with Intel Neural Compressor (INC).

  • Memory usage profiling to characterize memory reduction on standard topologies is in progress. Users are advised to try both modes and choose the optimal mode for a given topology.

FusedSDPA Custom Features

The features described in the following sections are configured via FusedSDPA custom API parameters. For further details, see hpex/kernels/FusedSDPA.

Operation Modes

FusedSDPA has two operation modes: Recompute mode (default) and No-recompute mode. The operation mode can be selected either using the context API variable or the custom API parameter.

  • Recompute mode (recompute_mode=None) - This is the default mode. In this mode, the necessary parts of the forward pass are recomputed during the backward pass to reduce memory usage, which helps topologies run with a higher batch size and/or sequence length. Beyond the savings from recomputation, this mode includes additional memory optimizations, so it is worth trying in inference scenarios that hit Out of Memory issues with non-fused attention implementations. This mode does not support broadcasting on the batch size dimension: query, key and value tensors must have the same batch size.

  • No-recompute mode (recompute_mode=False) - In this mode, recomputing is not done. Therefore, this mode can have larger memory needs compared to recompute mode. This mode still has memory benefits compared to non-fused attention implementations.

Example:

import torch
from habana_frameworks.torch.hpex.kernels import FusedSDPA
import habana_frameworks.torch.hpu as ht

# Query, key and value tensors of shape (batch, heads, seq_len, head_dim) on HPU
query = torch.rand(32, 8, 128, 64, dtype=torch.bfloat16, device="hpu")
key = torch.rand(32, 8, 128, 64, dtype=torch.bfloat16, device="hpu")
value = torch.rand(32, 8, 128, 64, dtype=torch.bfloat16, device="hpu")
sdpa_out = FusedSDPA.apply(query, key, value, None, 0.0, True, None, False)
print(sdpa_out.to("cpu"))

Fast Softmax

FusedSDPA runs a fast Softmax implementation when softmax_mode='fast' is set. With the default softmax_mode='None', the default Softmax is used. The feature is supported in both non-triangular and triangular masking modes.

Example:

import torch
from habana_frameworks.torch.hpex.kernels import FusedSDPA
import habana_frameworks.torch.hpu as ht

query = torch.rand(32, 8, 128, 64, dtype=torch.bfloat16, device="hpu")
key = torch.rand(32, 8, 128, 64, dtype=torch.bfloat16, device="hpu")
value = torch.rand(32, 8, 128, 64, dtype=torch.bfloat16, device="hpu")
with ht.sdp_kernel(enable_recompute = True):
    sdpa_out = FusedSDPA.apply(query, key, value, None, 0.0, True, None, 'fast')
    print(sdpa_out.to("cpu"))

Note

  • Using fast Softmax may affect inference accuracy.

  • Only BF16 data type is supported with fast Softmax.

  • Fast Softmax is not supported when running training in recompute mode with is_causal = False.

Valid Sequence Length

The valid sequence length represents the actual length of the sequence, excluding any padding. Sequences with varying lengths can be padded to a common maximum length, either at the beginning (left padding) or the end (right padding). The region corresponding to the padding is ignored during Softmax calculations when attention computations are run. In certain topologies, combining triangular masking (is_causal=True) with a specified valid sequence length allows the invalid areas to be ignored more efficiently. FusedSDPA’s valid_seq_len and seq_padding_type API parameters facilitate this optimization.

The example below illustrates a case involving a batch of three sequences with a maximum length of 128. The actual sequence lengths are 100, 120, and 80, with padding added after each sequence:

import torch
from habana_frameworks.torch.hpex.kernels import FusedSDPA
import habana_frameworks.torch.hpu as ht

query = torch.rand(3, 8, 128, 64, dtype=torch.bfloat16, device="hpu") # seq len=128 after padding
key = torch.rand(3, 8, 128, 64, dtype=torch.bfloat16, device="hpu")   # seq len=128 after padding
value = torch.rand(3, 8, 128, 64, dtype=torch.bfloat16, device="hpu") # seq len=128 after padding
valid_s_len = torch.tensor([100, 120, 80], dtype=torch.int32, device="hpu") # actual seq len of 100,120,80
sdpa_out = FusedSDPA.apply(query, key, value, None, 0.0, True, None, 'fast', False, valid_s_len, "right")
print(sdpa_out.to("cpu"))

Note

  • This feature is supported only with is_causal=True and attn_mask=None.

  • seq_padding_type is relevant only when valid_seq_len is not None.

Returning Dropout Mask

FusedSDPA returns the dropout mask when return_dropout_mask=True. This parameter is intended for debugging only.

Example:

import torch
from habana_frameworks.torch.hpex.kernels import FusedSDPA
import habana_frameworks.torch.hpu as ht

query = torch.rand(3, 8, 128, 64, dtype=torch.bfloat16, device="hpu")
key = torch.rand(3, 8, 128, 64, dtype=torch.bfloat16, device="hpu")
value = torch.rand(3, 8, 128, 64, dtype=torch.bfloat16, device="hpu")

# With return_dropout_mask=True, the dropout mask is returned alongside the output
sdpa_out, drp_out_mask = FusedSDPA.apply(query, key, value, None, 0.1, True, None, 'None', False, None, "right", True)
print(sdpa_out.to("cpu"))
print(drp_out_mask.to("cpu"))

Note

Returning dropout mask is supported only in no-recompute mode.

Disk Caching Eviction Policy

Disk caching is a mechanism that limits the number of graph compilations for both training and inference workloads. The Intel Gaudi PyTorch bridge first checks whether a recipe is already cached in memory, and then checks the disk cache. Refer to the Runtime Environment Variables section to configure the disk caching variables. To keep the disk cache size under a predefined threshold, an eviction policy can be applied.

When a compiled recipe is added to the cache, the algorithm checks whether the total size of all recipes fits the cache directory max size that is specified in PT_HPU_RECIPE_CACHE_CONFIG. When the total size of the cache directory, including new recipes, exceeds the defined maximum size, the Intel Gaudi PyTorch bridge iterates through the recipes in the cache. It removes the oldest recipes on the file system first until the total size is under the limit.

Highlights:

  • Eviction is performed after the recipe is serialized and stored on disk by every worker.

  • To ensure that the eviction logic removes recipes coherently, only one process may perform eviction at a time. This is implemented using an eviction.lock file in the disk cache directory, which is locked using flock (https://linux.die.net/man/2/flock). A worker locks the cache directory only for the duration of the eviction.

  • Both serialization and eviction are performed in a separate thread, so graph launch is not delayed.

  • Since the size of recipe being stored is unknown prior to serialization, the eviction tries to keep the size of cache directory <= 0.99 * <RECIPE_CACHE_SIZE_MB>. It limits the possibility of exceeding the specified cache dir size during next serialization.

  • If info logs from PT_HABHELPER are enabled (LOG_LEVEL_PT_HABHELPER=2), the following PyTorch log message is displayed: “Removed <recipe id> successfully. Disk cache size after removal: <size>”. If too many eviction messages are observed, it is recommended to increase the recipe cache directory size. For specific models, you can fine-tune this size to get the best performance.
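The eviction behavior described above can be illustrated with a small standalone sketch. This is not the Intel Gaudi PyTorch bridge implementation. It assumes that "oldest" means oldest file modification time, that one MB is 1024 * 1024 bytes, and that every regular file in the directory other than eviction.lock is a recipe. evict_oldest and the demo file names are hypothetical.

```python
import fcntl
import os
import tempfile
from pathlib import Path
from typing import List

LOCK_NAME = "eviction.lock"


def evict_oldest(cache_dir: str, max_size_mb: float, headroom: float = 0.99) -> List[str]:
    """Remove the oldest files until the directory fits headroom * max size."""
    cache = Path(cache_dir)
    limit_bytes = headroom * max_size_mb * 1024 * 1024
    removed: List[str] = []
    with open(cache / LOCK_NAME, "w") as lock_file:
        fcntl.flock(lock_file, fcntl.LOCK_EX)  # one evicting process at a time
        try:
            recipes = [p for p in cache.iterdir()
                       if p.is_file() and p.name != LOCK_NAME]
            recipes.sort(key=lambda p: p.stat().st_mtime)  # oldest first
            total = sum(p.stat().st_size for p in recipes)
            for recipe in recipes:
                if total <= limit_bytes:
                    break
                total -= recipe.stat().st_size
                recipe.unlink()
                removed.append(recipe.name)
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)
    return removed


# Demo: three 400 KiB files against a 1 MB limit.
demo_dir = tempfile.mkdtemp()
for index in range(3):
    recipe_path = Path(demo_dir) / f"recipe_{index}"
    recipe_path.write_bytes(b"\0" * 400 * 1024)
    os.utime(recipe_path, (1000 + index, 1000 + index))  # recipe_0 is the oldest

removed = evict_oldest(demo_dir, max_size_mb=1)
print(removed)
```

In the demo only the oldest file has to go: removing it brings the directory from 1200 KiB down to 800 KiB, which is below 99% of the 1 MB limit.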

Example:

PT_HPU_RECIPE_CACHE_CONFIG=/tmp/iter1_recipe_cache/,true,1024 \
python your_model.py

In the above example, the recipes are stored in /tmp/iter1_recipe_cache/. The cache is cleared at the beginning of each script execution and the size of the recipe cache is limited to 1024MB.
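For launch scripts that need to inspect or log this setting, the three comma-separated fields from the example can be split as sketched below. The field names path, clear_on_start and max_size_mb are labels chosen here from the explanation above, not official names, and the sketch only handles the three-field form shown in the example.

```python
from typing import NamedTuple


class RecipeCacheConfig(NamedTuple):
    path: str             # directory where recipes are stored
    clear_on_start: bool  # whether the cache is cleared when the script starts
    max_size_mb: int      # size limit of the cache directory in MB


def parse_recipe_cache_config(value: str) -> RecipeCacheConfig:
    """Split a '<path>,<true|false>,<size in MB>' string into typed fields."""
    path, clear_on_start, max_size_mb = (field.strip() for field in value.split(","))
    return RecipeCacheConfig(path, clear_on_start.lower() == "true", int(max_size_mb))


config = parse_recipe_cache_config("/tmp/iter1_recipe_cache/,true,1024")
print(config)
```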

Adjust the Gradient Bucket Size in Multi-Card/Server Training

Based on the size of the model, the size of the gradient bucket can be adjusted to minimize the number of allreduce invocations in the backward pass of every training iteration. Refer to PyTorch DDP for more details.

Example:

In ResNet50, bucket size of 100MB is optimal whereas ResNext101 requires bucket size of 200MB. Refer to the implementation here.
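A rough way to see why a larger bucket reduces allreduce calls is to divide the total gradient size by the bucket size. The sketch below does only that arithmetic. It assumes FP32 gradients, a ResNet50 parameter count of 25,557,032 (the torchvision model), and that the bucket size is given in units of 1024 * 1024 bytes, as with the bucket_cap_mb argument of torch.nn.parallel.DistributedDataParallel. The real bucket assignment in DDP is more involved, so treat the result as a lower bound.

```python
import math


def estimate_bucket_count(num_params: int, bytes_per_grad: int, bucket_cap_mb: int) -> int:
    """Lower bound on the number of gradient buckets (one allreduce each)."""
    total_bytes = num_params * bytes_per_grad
    bucket_bytes = bucket_cap_mb * 1024 * 1024
    return math.ceil(total_bytes / bucket_bytes)


RESNET50_PARAMS = 25_557_032  # assumption: torchvision ResNet50

for cap_mb in (25, 100):  # 25 is the documented PyTorch DDP default
    buckets = estimate_bucket_count(RESNET50_PARAMS, 4, cap_mb)
    print(f"bucket_cap_mb={cap_mb}: at least {buckets} bucket(s)")
```

Under these assumptions the ResNet50 gradients (about 97.5 MB) fit into a single 100 MB bucket, whereas a 25 MB bucket needs at least four.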

Setting Gradients as View of Gradient Buckets in Multi-Card/Server Training

PyTorch DDP allows parameter gradient tensors to be views of the gradient bucket. This improves performance as device-to-device copies can be reduced and also reduces device memory requirement. Refer to PyTorch DDP for more details.

Example:

Refer to ResNet50.

Reducing the Frequency of Printing Quantities

Some output messaging should be reduced or eliminated for enhanced performance when models have been fully optimized and set up for production use. Two examples are provided below:

  • Reporting loss using loss.item() or calculating loss to display.

  • Showing the progress bar (using tqdm or other libraries) during runtime.

Both of these items rely on additional communication between the host CPU and Gaudi to calculate loss or progress and then display the results. Printing these tensors in the training script requires pulling the device tensors to the host CPU and, therefore, requires device execution to finish. This can result in non-overlapped execution between host and device leading to sub-optimal performance.

To reduce how often the loss is calculated or the progress bar is updated, set the print frequency --print-freq to a high value, or eliminate printing altogether. You can set the --print-freq variable in the model run command to a size similar to the optimizer step size. For the progress bar, it is recommended to wait until a run completes 20 or more iterations before updating, to minimize unnecessary synchronization.
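One way to apply this in a training script is to keep the running loss on the device and call .item() only once every print_freq steps. The sketch below is a hypothetical loop that counts how many host synchronizations the loss reporting causes. The train function, its toy model, and the device argument are illustrative; it runs on CPU by default, and "hpu" would be passed on Gaudi.

```python
import torch


def train(num_steps: int, print_freq: int, device: str = "cpu") -> int:
    """Run a toy training loop and return the number of .item() calls made."""
    model = torch.nn.Linear(16, 1).to(device)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = torch.nn.MSELoss()
    running_loss = torch.zeros((), device=device)
    host_syncs = 0

    for step in range(1, num_steps + 1):
        inputs = torch.randn(32, 16, device=device)
        targets = torch.randn(32, 1, device=device)

        optimizer.zero_grad()
        loss = loss_fn(model(inputs), targets)
        loss.backward()
        optimizer.step()

        running_loss += loss.detach()  # accumulates on the device, no host copy
        if step % print_freq == 0:
            mean_loss = running_loss.item() / print_freq  # the only host copy
            host_syncs += 1
            print(f"step {step}: mean loss {mean_loss:.4f}")
            running_loss.zero_()
    return host_syncs


syncs_every_step = train(num_steps=20, print_freq=1)
syncs_every_10 = train(num_steps=20, print_freq=10)
```

With print_freq=10 the loop pulls a value to the host twice in 20 steps instead of 20 times, while the reported number is still an average over every step.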

Pinning Memory For Dataloader

Pinning the memory while instantiating the dataloader avoids a redundant copy on the host during the training iteration. Refer to PyTorch Dataloader for more details.

Example:

Refer to ResNet50 Dataloader.
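As a minimal sketch of the setting itself, pinning is requested with the pin_memory argument of torch.utils.data.DataLoader. The dataset below is random data and the device is CPU so that the snippet runs anywhere; on Gaudi the device would be "hpu". On a machine without an accelerator, PyTorch is expected to warn and skip the pinning.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Random stand-in data: 64 small images with integer labels.
dataset = TensorDataset(torch.randn(64, 3, 8, 8), torch.randint(0, 10, (64,)))

# pin_memory=True asks the loader to place each batch in page-locked host memory.
loader = DataLoader(dataset, batch_size=16, shuffle=True, pin_memory=True)

device = torch.device("cpu")  # assumption: torch.device("hpu") on Gaudi
batches = 0
for images, labels in loader:
    # non_blocking=True lets the copy from pinned memory overlap with host work.
    images = images.to(device, non_blocking=True)
    labels = labels.to(device, non_blocking=True)
    batches += 1
```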

Avoiding Constant Variables in Loops

Avoiding the use of the loop iterator variable inside a loop may reduce recompilations in consecutive iterations. When the iterator variable is used, it can create a different constant operator in the execution graph each time the loop body is executed.

For example, in the original V-Diffusion code the value of the iterator variable changes each time the loop iterates. To avoid triggering recompilations after each iteration, the loop iterator variable i is not used in the Intel Gaudi V-Diffusion model. See the example below:

for i in range(4, num_steps):
    # The following 3 lines remove graph recompilation (variable "i" is not used)
    t_1 = steps[0] # before: steps[i]
    t_2 = steps[1] # before: steps[i+1]
    steps = torch.roll(steps, shifts=(-1), dims=(0))

Example:

Refer to the implementation for the Intel Gaudi V-Diffusion model and compare it with the original V-Diffusion code.
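To see that rolling the tensor yields the same values as indexing with the iterator, the two approaches can be compared on CPU. The sketch below uses a simplified loop that starts at index 0 and a made-up steps tensor, so it is not the V-Diffusion code itself. The .item() calls are only there to compare values and would not belong in a performance-sensitive loop.

```python
import torch

steps = torch.linspace(1.0, 0.0, 6)  # made-up schedule with 6 entries
num_steps = steps.numel()

# Indexed version: the index changes in every iteration.
indexed_pairs = [(steps[i].item(), steps[i + 1].item())
                 for i in range(num_steps - 1)]

# Rolled version: the indices 0 and 1 are the same in every iteration.
rolled = steps.clone()
rolled_pairs = []
for _ in range(num_steps - 1):
    t_1 = rolled[0]
    t_2 = rolled[1]
    rolled_pairs.append((t_1.item(), t_2.item()))
    rolled = torch.roll(rolled, shifts=-1, dims=0)  # move the next pair to the front

print(indexed_pairs == rolled_pairs)
```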

Weight Sharing

Weight sharing is a technique in which the module weights are shared among two or more layers. Weights can be shared using PyTorch with Gaudi only if they are created inside the module. See the example below:

import torch
import habana_frameworks.torch.core as ht

# Example module
class WeightShareModule(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.a = torch.nn.Parameter(torch.ones([2]))
        self.b = torch.nn.Parameter(torch.ones([2]))

    def forward(self, input):
        c = self.a * input + self.b * input
        return c

module = WeightShareModule()
# module.a and module.b are shared
module.a = module.b
# Move the module to HPU device
module.to("hpu")

Example:

Refer to BERT Pre-Training on GitHub.
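A quick way to confirm that sharing is still in place after the module has been moved to the device is to check parameter identity. The sketch below uses a hypothetical module with two tied linear layers and runs on CPU; on Gaudi the device would be "hpu", and the same checks can be used there.

```python
import torch


class TiedAutoencoder(torch.nn.Module):
    """Hypothetical module whose decoder reuses the encoder weight."""

    def __init__(self, features: int = 8):
        super().__init__()
        self.encoder = torch.nn.Linear(features, features, bias=False)
        self.decoder = torch.nn.Linear(features, features, bias=False)
        # Share the weight inside the module, before any device move.
        self.decoder.weight = self.encoder.weight

    def forward(self, x):
        return self.decoder(self.encoder(x))


device = torch.device("cpu")  # assumption: torch.device("hpu") on Gaudi
module = TiedAutoencoder().to(device)

same_object = module.decoder.weight is module.encoder.weight
same_storage = module.decoder.weight.data_ptr() == module.encoder.weight.data_ptr()
unique_parameters = len(list(module.parameters()))  # shared weights count once
print(same_object, same_storage, unique_parameters)
```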

Switch Host Memory Allocator

For deep learning workloads, jemalloc and TCMalloc achieve better performance by reusing memory as much as possible. Both jemalloc and TCMalloc are pre-installed in the Intel Gaudi Docker images.

  • jemalloc - A general-purpose malloc implementation that emphasizes fragmentation avoidance and scalable concurrency support.

  • TCMalloc - Features optimizations to speed up program executions including holding memory in caches to speed up access of commonly-used objects. Holding such caches even after deallocation also helps avoid costly system calls if such memory is later re-allocated.

By default, HPU uses the TCMalloc allocator for host memory. For some workloads, this can cause host Out-Of-Memory issues because TCMalloc holds memory in its cache. This can be mitigated by adjusting the TCMalloc cache size via its config, or by switching to the jemalloc allocator using the LD_PRELOAD environment variable.

To switch to the jemalloc allocator:

  • Clear the existing allocator from LD_PRELOAD

  • export LD_PRELOAD=/lib/x86_64-linux-gnu/libjemalloc.so.2:$LD_PRELOAD
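When the model is launched from a Python wrapper rather than a shell, the two steps above can be applied to the environment of the child process. The sketch below is a hypothetical helper. It assumes that LD_PRELOAD entries are separated by colons and that an existing allocator can be recognized by "tcmalloc" or "jemalloc" in the library file name; the jemalloc path is the one given in the steps above.

```python
import os
from typing import Dict, Mapping

JEMALLOC_PATH = "/lib/x86_64-linux-gnu/libjemalloc.so.2"  # path from the steps above


def env_with_jemalloc(base_env: Mapping[str, str]) -> Dict[str, str]:
    """Return a copy of base_env whose LD_PRELOAD starts with jemalloc."""
    env = dict(base_env)
    kept = [
        lib for lib in env.get("LD_PRELOAD", "").split(":")
        # Assumption: allocator libraries carry the allocator name in the file name.
        if lib and "tcmalloc" not in lib and "jemalloc" not in lib
    ]
    env["LD_PRELOAD"] = ":".join([JEMALLOC_PATH] + kept)
    return env


example_env = env_with_jemalloc(
    {"LD_PRELOAD": "/usr/lib/x86_64-linux-gnu/libtcmalloc.so.4:/opt/lib/libfoo.so"}
)
print(example_env["LD_PRELOAD"])

# Usage sketch (not executed here):
#   import subprocess
#   subprocess.run(["python", "your_model.py"], env=env_with_jemalloc(os.environ))
```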

Optimizing Training Using PyTorch Lightning

HPUParallelStrategy, provided by the PyTorch Lightning package, supports the following features:

  • Setting size of gradient bucket

  • Setting gradients as a view of allreduce buckets

  • static_graph

By setting static_graph when instantiating the Trainer, allreduce on unused parameters in the graph can be avoided. This also bypasses the overhead of copying them from host to device and vice versa after performing allreduce.

Example:

Refer to Unet2D for the implementation.