General Model Optimizations

Placement of Ops on HPU

Avoid execution of ops on CPU to get optimal performance on HPU. When a model is ported to run on HPU, the Intel® Gaudi® software stack decides which ops are placed on CPU and which are placed on the HPU.

This decision is based on whether the op is registered with PyTorch with HPU as the backend and whether the requested datatype is supported on HPU. Execution of an op automatically falls back to CPU if the op is not registered with HPU as its backend, or if the op is registered but the requested datatype is not supported on HPU.

To enable CPU fallback logs to check whether op execution fell back to CPU, set the environment variables as shown below:


For example, when the aten::digamma op falls back to CPU once, you will see logs as shown below:

CPU fallback digamma : self=HPUBFloat16Type

Frequency of op and op name that were executed on CPU:
1       aten::digamma
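When many ops fall back, reading the summary by eye becomes tedious. The following is a minimal sketch of collecting the per-op counts from a captured log. It assumes the summary lines keep the "count, whitespace, aten::name" layout of the sample above; the helper name parse_cpu_fallbacks is hypothetical and not part of the Intel Gaudi software.

```python
import re

# Assumed layout of a summary line: "<count>  aten::<op_name>", as in the sample log above.
_FALLBACK_LINE = re.compile(r"^\s*(\d+)\s+(aten::\S+)\s*$")


def parse_cpu_fallbacks(log_text):
    """Return {op_name: count} for every summary line found in the log text."""
    counts = {}
    for line in log_text.splitlines():
        match = _FALLBACK_LINE.match(line)
        if match:
            op_name = match.group(2)
            counts[op_name] = counts.get(op_name, 0) + int(match.group(1))
    return counts


sample_log = """CPU fallback digamma : self=HPUBFloat16Type

Frequency of op and op name that were executed on CPU:
1       aten::digamma
"""

fallbacks = parse_cpu_fallbacks(sample_log)
print(fallbacks)
```

Any op that appears in the result is a candidate for a datatype change or a replacement op that is supported on HPU.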

Usage of mark_step

In general, mark_step is added after the backward pass and after the optimizer step. Additional mark_step calls can be used to further optimize your model:

  • mark_step can be added to avoid out of memory (OOM) issues. When the graph is too large to fit in the available memory, it can be broken into smaller graphs using mark_step. This reduces memory consumption and overcomes the out of memory issue. See DeepSpeed Bert for an example.

  • mark_step can also be used when the graph has both static and dynamic shapes. Because of the dynamic shapes, the graph is recompiled, which degrades performance. Adding mark_step after the static part of the graph may reduce recompilations, or limit recompilation to small dynamic graphs.
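The usual placement, after the backward pass and after the optimizer step, can be sketched as follows. This is an illustration rather than a reference implementation: the tiny model and random data are placeholders, and the CPU fallback with a no-op mark_step is an assumption added only so that the sketch also runs on a machine without the Intel Gaudi software.

```python
import torch

try:
    # Available only where the Intel Gaudi software stack is installed.
    import habana_frameworks.torch.core as htcore

    device = torch.device("hpu")
    mark_step = htcore.mark_step
except ImportError:
    # Assumption for this sketch: run on CPU and treat mark_step as a no-op.
    device = torch.device("cpu")

    def mark_step():
        pass


model = torch.nn.Linear(8, 2).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = torch.nn.MSELoss()

steps_run = 0
for _ in range(3):
    inputs = torch.randn(4, 8).to(device)
    targets = torch.randn(4, 2).to(device)

    optimizer.zero_grad()
    loss = loss_fn(model(inputs), targets)
    loss.backward()
    mark_step()  # close the graph accumulated for forward + backward

    optimizer.step()
    mark_step()  # close the graph accumulated for the optimizer step
    steps_run += 1
```

An extra mark_step inside the forward pass would split the graph further, which is the mechanism the two bullets above rely on.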

Batch Size

A large batch size is, in general, beneficial for throughput. However, the following limitations apply when using a large batch size:

  1. Batch size is limited by the size of Gaudi's device memory (HBM), which is fixed. Usually, a larger batch size means more memory consumption on the device.

  2. A large batch size cannot be used when low latency, rather than throughput, is required.

  3. A large batch size on each Gaudi device may affect convergence in data-parallel distributed training. For example, the highest global batch size that gives RN50 convergence is around 32K. This means that as the number of Gaudi devices increases, the batch size on each device should be reduced.
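The third limitation is simple arithmetic: in data parallelism the global batch size is the per-device batch size multiplied by the number of devices. A small sketch, assuming "around 32K" means 32 * 1024 and using a hypothetical helper name:

```python
def per_device_batch_size(max_global_batch, num_devices):
    """Largest per-device batch that keeps the global batch within the limit."""
    return max_global_batch // num_devices


# "around 32K" from the RN50 example above, taken here as 32 * 1024 (an assumption).
MAX_GLOBAL_BATCH = 32 * 1024

for num_devices in (8, 64, 256):
    print(num_devices, per_device_batch_size(MAX_GLOBAL_BATCH, num_devices))
```

The per-device value is only an upper bound from the convergence side; device memory may impose a lower limit.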

The table below provides some examples of batch sizes used in different models, all using mixed precision.


Model                              Batch Size
Bert Large pre-training Phase 1
Bert Large pre-training Phase 2




PyTorch Mixed Precision

For details on how to run mixed precision training of PyTorch models on Gaudi, refer to Mixed Precision Training with PyTorch Autocast.
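As a rough illustration of what autocast does, the sketch below runs a linear layer under torch.autocast with bfloat16. The CPU fallback is an assumption added so that the snippet also runs without Gaudi hardware; the referenced guide remains the authority on the supported ops and settings.

```python
import torch

try:
    import habana_frameworks.torch.core  # noqa: F401  (makes the "hpu" device available)

    device_type = "hpu"
except ImportError:
    device_type = "cpu"  # assumption for this sketch so it also runs without Gaudi

model = torch.nn.Linear(16, 4).to(device_type)
inputs = torch.randn(2, 16).to(device_type)

# Inside the context, eligible ops run in bfloat16; the parameters stay in float32.
with torch.autocast(device_type=device_type, dtype=torch.bfloat16):
    outputs = model(inputs)

print(outputs.dtype, model.weight.dtype)
```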

Usage of Fused Operators

Create a custom op for optimizers (e.g., FusedSGD, FusedAdamW) and other complex ops (e.g., FusedClipNorm) to minimize the host performance overhead of running many small ops. This can improve the overlap of execution between host and device.

The Intel Gaudi PyTorch package provides some of these fused ops. See Handling Custom Ops.

For an example, refer to the custom operator FusedSGD in ResNet50 FusedSGD.

Using Fused Scaled Dot Product Attention (FusedSDPA)

FusedSDPA is a fused implementation of torch.nn.functional.scaled_dot_product_attention() for Gaudi. It maintains the same functionality and interface as the original function but with reduced memory usage.

FusedSDPA implements selected Flash Attention optimization approaches applicable to HPU.

The FusedSDPA class takes several parameters (query, key, value, attn_mask=None, dropout_p=0.0, is_causal=False, scale=None) and produces the output of scaled dot product attention. For further information on the functionality and parameters, refer to Scaled Dot Product Attention.

attn_mask supports shape (N,…,1,S), in addition to the PyTorch specified shape (N,…,L,S).
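To make the two mask shapes concrete, the sketch below uses the stock torch.nn.functional.scaled_dot_product_attention as a stand-in, since the text states that FusedSDPA keeps the same functionality and interface. It does not call FusedSDPA itself, and the tensor sizes are arbitrary. With stock PyTorch a mask whose second-to-last dimension is 1 is broadcast over the target length L, so both calls produce the same result.

```python
import torch
import torch.nn.functional as F

N, H, L, S, E = 2, 4, 5, 7, 8  # batch, heads, target length, source length, head dim
query = torch.randn(N, H, L, E)
key = torch.randn(N, H, S, E)
value = torch.randn(N, H, S, E)

# Compact mask of shape (N, H, 1, S): True means "attend to this source position".
compact_mask = torch.ones(N, H, 1, S, dtype=torch.bool)
compact_mask[..., -2:] = False  # hide the last two source positions

# The same mask materialized in the full PyTorch shape (N, H, L, S).
full_mask = compact_mask.expand(N, H, L, S).contiguous()

out_compact = F.scaled_dot_product_attention(query, key, value, attn_mask=compact_mask)
out_full = F.scaled_dot_product_attention(query, key, value, attn_mask=full_mask)

print(out_compact.shape)
```

The compact form avoids allocating L copies of the same mask row, which matters for long sequences.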

The following describes FusedSDPA operation modes:

  • Recompute mode: This is the default mode. In this mode, the necessary parts of the forward pass are recomputed during the backward pass to reduce memory usage. This helps topologies run with a higher batch size and/or sequence length. Beyond the savings from recomputation, this mode includes additional memory optimizations, so you can also try it in inference scenarios that run out of memory with non-fused attention implementations. This mode does not support broadcasting on the batch size dimension: the query, key and value tensors must have the same batch size.

  • No-recompute mode: In this mode, recomputing is not done. Therefore, this mode can have larger memory needs compared to recompute mode. This mode still has memory benefits compared to non-fused attention implementations.

For further details on FusedSDPA APIs, see hpex/kernels/FusedSDPA.


  • The supported data types are FP32 and BF16.

  • GPU Migration Toolkit can only be used with FusedSDPA no-recompute mode. Using FusedSDPA recompute mode with this package is currently not supported. This will be changed in future releases.

Memory usage profiling to characterize memory reduction on standard topologies is in progress. Users are advised to try both modes and choose the optimal mode for a given topology.

Perf Tool and TensorBoard Model Scanning

The habana_perf_tool scans existing log files generated for TensorBoard and provides guidance on them, without having to run the TensorBoard UI. The tool scans the log file, shows a list of the metrics that it measures, and then provides specific guidance for optimization, such as increasing the batch size, together with MME and TPC usage and timings. This analysis capability is also built directly into TensorBoard. The tool can be run with the following command:

root@ubuntu2204:~/traces# habana_perf_tool --trace trace_example.json

Additional General Optimizations

Disk Caching Eviction Policy

Disk caching is a mechanism that can limit the number of graph compilations for both training and inference workloads. The Intel Gaudi PyTorch bridge verifies whether a recipe is already in cache (first in the memory cache, then in the disk cache). Refer to the Runtime Environment Variables section to configure the disk caching variables. To keep the disk cache size under a predefined threshold, an eviction policy has been implemented.

When a compiled recipe is added to the cache, the algorithm checks whether the total size of all recipes fits the cache directory max size that is specified in PT_HPU_RECIPE_CACHE_CONFIG. If the total cache directory size (including new recipe) exceeds the defined max size, the PyTorch bridge iterates over recipes in cache and removes the oldest recipes (by creation date on file system) first until the total size is under the limit.

  • Eviction is performed after the recipe is serialized and stored on disk by every worker.

  • To ensure that the eviction logic removes recipes in a coherent way, only one process may perform eviction at a time. This is implemented using an eviction.lock file in the disk cache directory, which is locked using flock. The cache directory is locked by a particular worker only for the duration of the eviction.

  • Both serialization and eviction are performed in a separate thread, so graph launch is not delayed.

  • Since the size of the recipe being stored is unknown prior to serialization, the eviction tries to keep the size of the cache directory <= 0.99 * <RECIPE_CACHE_SIZE_MB>. This limits the possibility of exceeding the specified cache directory size during the next serialization.

  • If info logs from PT_HABHELPER are enabled (LOG_LEVEL_PT_HABHELPER=2), you should see the following PyTorch log message: “Removed <recipe id> successfully. Disk cache size after removal: <size>”. If too many eviction messages are observed, consider increasing the recipe cache directory size. For specific models, you can fine-tune this size to get the best performance.


PT_HPU_RECIPE_CACHE_CONFIG=/tmp/iter1_recipe_cache/,true,1024 \

In the above example, recipes will be stored in /tmp/iter1_recipe_cache/. The cache will be cleared at the beginning of each script execution and the size of recipe cache will be limited to 1024MB.
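The eviction policy described in this section can be sketched as an oldest-first removal loop. This is a simplified model, not the PyTorch bridge implementation: it ignores locking and threading, applies the 0.99 headroom as the only threshold, and the Recipe class and evict_oldest function are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Recipe:
    recipe_id: str
    size_mb: float
    created_at: float  # stands in for the creation date on the file system


def evict_oldest(recipes, max_size_mb, headroom=0.99):
    """Remove the oldest recipes until the total size is <= headroom * max_size_mb.

    Returns (kept_recipes, removed_ids).
    """
    limit = headroom * max_size_mb
    kept = sorted(recipes, key=lambda r: r.created_at)  # oldest first
    total = sum(r.size_mb for r in kept)
    removed = []
    while kept and total > limit:
        oldest = kept.pop(0)
        total -= oldest.size_mb
        removed.append(oldest.recipe_id)
    return kept, removed


cache = [Recipe("a", 400, 1.0), Recipe("b", 400, 2.0), Recipe("c", 300, 3.0)]
kept, removed = evict_oldest(cache, max_size_mb=1024)
print([r.recipe_id for r in kept], removed)
```

With a 1024 MB limit the three recipes total 1100 MB, so the oldest one is removed and the remaining 700 MB fits under the threshold.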

Adjust the Gradient Bucket Size in Multi-card/Multi-node Training

Based on the size of the model, the size of the gradient bucket can be adjusted to minimize the number of invocations of all-reduce in the backward pass of every training iteration. Documentation is available in PyTorch DDP.


In ResNet50, bucket size of 100MB is optimal whereas ResNext101 requires bucket size of 200MB. Refer to the implementation here.
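In PyTorch DDP the bucket size is set through the bucket_cap_mb argument. The sketch below is a minimal illustration: the single-process "gloo" group on CPU and the tiny model are assumptions made only so that the snippet runs anywhere, while a real Gaudi job would use its own launcher, backend, and device.

```python
import os
import tempfile

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Single-process CPU group, used here only to make the sketch self-contained.
if not dist.is_initialized():
    store_file = os.path.join(tempfile.mkdtemp(), "ddp_store")
    dist.init_process_group(
        "gloo", init_method=f"file://{store_file}", rank=0, world_size=1
    )

model = torch.nn.Linear(32, 8)
bucket_size_mb = 100  # the value the text reports as optimal for ResNet50

# Gradients are grouped into buckets of up to bucket_size_mb before all-reduce.
ddp_model = DDP(model, bucket_cap_mb=bucket_size_mb)

loss = ddp_model(torch.randn(4, 32)).sum()
loss.backward()

dist.destroy_process_group()
```

A larger bucket means fewer all-reduce calls per iteration, at the cost of waiting longer before the first all-reduce can start.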

Setting Gradients as View of Gradient Buckets in Multi-card/Multi-node Training

PyTorch DDP allows parameter gradient tensors to be views of the gradient bucket. This improves performance as device-to-device copies can be reduced and also reduces device memory requirement. Documentation is available in PyTorch DDP.


Refer to the implementation for ResNet50.
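In PyTorch DDP this behavior is enabled with the gradient_as_bucket_view argument. As a hedged sketch, the snippet below uses a single-process "gloo" group on CPU and a placeholder model, both assumptions made only so that it runs without Gaudi hardware.

```python
import os
import tempfile

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Single-process CPU group, used here only to make the sketch self-contained.
if not dist.is_initialized():
    store_file = os.path.join(tempfile.mkdtemp(), "ddp_store")
    dist.init_process_group(
        "gloo", init_method=f"file://{store_file}", rank=0, world_size=1
    )

model = torch.nn.Linear(32, 8)

# Parameter gradients become views into the all-reduce buckets,
# which avoids a copy between the gradients and the buckets.
ddp_model = DDP(model, gradient_as_bucket_view=True)

loss = ddp_model(torch.randn(4, 32)).sum()
loss.backward()
grads_ready = all(p.grad is not None for p in model.parameters())

dist.destroy_process_group()
```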

Reducing the Frequency of Printing Quantities

In cases where models have been fully optimized and set for production usage, some output messaging should be reduced or eliminated for best performance. The following are two specific examples:

  • Reporting loss using loss.item() or calculating loss to display to the user

  • Showing the progress bar (using tqdm or other libraries) during runtime

Both of these items rely on additional communication between the host CPU and the Gaudi HPU to calculate loss or progress and then display the results. Printing these tensors in the training script requires these device tensors to be pulled to the host CPU and therefore needs the device execution to finish. This can result in non-overlapped execution between host and device leading to sub-optimal performance.

To reduce how often the loss is calculated or the progress bar is updated, set the print frequency --print-freq to a high value or eliminate printing altogether. You can set the --print-freq variable in the model run command to a size similar to the optimizer step size. For the progress bar, it is recommended to wait until a run completes 20 or more iterations before updating it, to minimize unnecessary synchronization.
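One way to apply this in a training loop is to accumulate the loss as a tensor and call .item() only every print_freq steps. The sketch below runs on CPU with a placeholder model; the print_freq variable is a hypothetical stand-in for the --print-freq option, and on Gaudi the tensors would live on the HPU.

```python
import torch

model = torch.nn.Linear(8, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

print_freq = 20  # hypothetical value playing the role of --print-freq
num_steps = 60
running_loss = torch.zeros(())  # kept as a tensor, so no host sync per step
sync_count = 0

for step in range(1, num_steps + 1):
    inputs, targets = torch.randn(16, 8), torch.randn(16, 1)
    optimizer.zero_grad()
    loss = torch.nn.functional.mse_loss(model(inputs), targets)
    loss.backward()
    optimizer.step()

    running_loss += loss.detach()  # stays on the device, no synchronization
    if step % print_freq == 0:
        # .item() copies the value to the host and waits for the device to finish.
        print(f"step {step}: mean loss {running_loss.item() / print_freq:.4f}")
        running_loss.zero_()
        sync_count += 1
```

With 60 steps and a print frequency of 20, the host synchronizes three times instead of sixty.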

Pinning Memory For Dataloader

Pinning the memory when instantiating the dataloader avoids a redundant copy on the host during each training iteration. Refer to the pinned memory support in PyTorch Dataloader.


Refer to the implementation for ResNet50 Dataloader.
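In PyTorch this is the pin_memory argument of torch.utils.data.DataLoader. A minimal sketch with a placeholder in-memory dataset follows; on a host without an accelerator PyTorch may emit a warning and skip the pinning, but the loop still runs.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Placeholder dataset: 64 samples with 3 features and a binary label.
dataset = TensorDataset(torch.randn(64, 3), torch.randint(0, 2, (64,)))

# pin_memory=True asks the loader to place batches in page-locked host memory,
# so the later copy to the device does not need an extra staging copy.
loader = DataLoader(dataset, batch_size=16, shuffle=True, pin_memory=True)

batch_shapes = [tuple(features.shape) for features, _ in loader]
print(batch_shapes)
```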

Avoiding Constant Variables in Loops

Avoiding the use of loop iterator variables within a loop may reduce the recompilations that happen in consecutive iterations. A loop iterator variable may cause a different constant operator to be created in the execution graph in every iteration.

For example, in the original V-Diffusion code the value of the iterator variable changes each time the loop iterates. To avoid triggering recompilations after each iteration, the loop iterator variable i is not used in the Intel Gaudi V-Diffusion model.

for i in range(4, num_steps):
    # The following 3 lines remove graph recompilation (variable "i" is not used)
    t_1 = steps[0]  # before: steps[i]
    t_2 = steps[1]  # before: steps[i+1]
    steps = torch.roll(steps, shifts=-1, dims=0)


Refer to the implementation for the Intel Gaudi V-Diffusion model and compare it with the original V-Diffusion code.
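The rolling trick can be checked in isolation. The sketch below is a standalone toy, not taken from the V-Diffusion code: it starts the loop at index 0 for simplicity and compares the values read with a changing index against the values read at fixed positions 0 and 1 after each torch.roll. The .item() calls are there only to compare results on the host.

```python
import torch

steps = torch.linspace(1.0, 0.0, 6)

# Variant 1: index with the loop variable, so the index constant changes every iteration.
indexed = [(steps[i].item(), steps[i + 1].item()) for i in range(len(steps) - 1)]

# Variant 2: fixed indices plus torch.roll, so the loop body is identical every iteration.
rolled = []
work = steps.clone()
for _ in range(len(steps) - 1):
    rolled.append((work[0].item(), work[1].item()))
    work = torch.roll(work, shifts=-1, dims=0)

print(indexed == rolled)
```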

Weight Sharing

Weight sharing is a technique in which the module weights are shared among two or more layers. With PyTorch on Gaudi, weights can be shared only if they are created inside the module. You can find an example of weight sharing in the BERT Pre-Training example on GitHub.

import torch
import habana_frameworks.torch.core as ht

# Example module
class WeightShareModule(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.a = torch.nn.Parameter(torch.ones([2]))
        self.b = torch.nn.Parameter(torch.ones([2]))

    def forward(self, input):
        c = self.a * input + self.b * input
        return c

module = WeightShareModule()
# Share the weights: module.a and module.b now refer to the same parameter
module.a = module.b
# Move the module to the HPU device
module.to("hpu")

Optimizing Training Using PyTorch Lightning

HPUParallelStrategy, provided by the PyTorch Lightning package, supports features such as setting the size of the gradient bucket, setting gradients as views of the allreduce buckets, and static_graph.

By setting static_graph when instantiating the Trainer, allreduce on unused parameters in the graph can be avoided. This also avoids the overhead of copying them from host to device and vice versa after performing the allreduce.


Refer to the implementation for Unet2D.