General Model Optimizations
General Model Optimizations¶
Placement of Ops on HPU¶
Avoid execution of ops on CPU to get optimal performance on HPU. When a model is ported to run on HPU, the Intel® Gaudi® software stack decides which ops are placed on CPU and which are placed on the HPU.
This decision is based on whether the op is registered with PyTorch with HPU as the backend and whether the requested datatype is supported on HPU. Execution of an op automatically falls back to CPU if the op is not registered with HPU as its backend, or if the op is registered but the requested datatype is not supported on HPU.
To enable CPU fallback logs to check whether op execution fell back to CPU, set the environment variables as shown below:
For example, when the aten::digamma op falls back to CPU once, you will see logs as shown below:
CPU fallback digamma : self=HPUBFloat16Type
The logs also report the name of each op that was executed on CPU and how often it was executed there.
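Fallback lines in the format shown above can also be tallied with a short script. The following is a minimal sketch, assuming every fallback line has the shape "CPU fallback <op name> : <argument types>" as in the digamma sample; the helper name count_cpu_fallbacks and the regular expression are illustrative and not part of the Intel Gaudi software.

```python
import re
from collections import Counter

# Assumed line shape, taken from the sample line in the text:
#   "CPU fallback <op name> : <argument types>"
FALLBACK_RE = re.compile(r"CPU fallback\s+(\S+)\s*:")


def count_cpu_fallbacks(log_lines):
    """Return a Counter mapping op name -> number of CPU fallbacks seen."""
    counts = Counter()
    for line in log_lines:
        match = FALLBACK_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


sample = [
    "CPU fallback digamma : self=HPUBFloat16Type",
    "unrelated log line",
    "CPU fallback digamma : self=HPUBFloat16Type",
]
for op, n in count_cpu_fallbacks(sample).most_common():
    print(f"{op}: {n}")
```

Ops that appear often in such a tally are the first candidates for a supported datatype or an HPU-registered replacement.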
mark_step is added after the backward pass and the optimizer step; however, additional mark_step calls may be used to further optimize your model.
mark_step is added to avoid out of memory (OOM) issues. When a graph is too large to fit in device memory, it is broken into smaller graphs using mark_step. This reduces memory consumption and overcomes the out of memory issues. See DeepSpeed Bert for an example.
mark_step can also be used when the graph has both static and dynamic shapes. Because of the dynamicity, the graph is recompiled, which degrades performance. Adding mark_step after the static part of the graph may reduce recompilations, or limit recompilation to small dynamic graphs.
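The idea of breaking a large graph at chosen points can be sketched as follows. This is an illustration only: run_in_segments and segment_size are hypothetical names, and mark_step is passed in as a callable so the sketch runs without Gaudi hardware. On a Gaudi system the callable would typically be mark_step from habana_frameworks.torch.core.

```python
def run_in_segments(layers, x, segment_size, mark_step):
    """Apply layers in order, calling mark_step() after every segment_size layers."""
    if segment_size < 1:
        raise ValueError("segment_size must be at least 1")
    for index, layer in enumerate(layers, start=1):
        x = layer(x)
        if index % segment_size == 0:
            # On Gaudi this is where the accumulated graph would be cut.
            mark_step()
    return x


# Stand-in demo: plain functions instead of torch modules, and a call
# recorder instead of the real mark_step, so the sketch runs anywhere.
calls = []
layers = [lambda v: v + 1] * 6
result = run_in_segments(layers, 0, segment_size=2, mark_step=lambda: calls.append(1))
print(result, len(calls))
```

A smaller segment_size lowers peak memory per graph but adds more graph launches, so the value is worth tuning per model.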
A large batch size is, in general, beneficial for throughput. However, the following limitations apply when using a large batch size:
Batch size is limited by the size of Gaudi’s device memory (HBM), which is fixed. A larger batch size usually means higher memory consumption on the device.
A large batch size cannot be used when low latency, rather than throughput, is required.
A large batch size on each Gaudi device may impact convergence in data-parallel distributed training. For example, the highest global batch size that gives RN50 convergence is around 32K. This means that as the number of Gaudi devices increases, the batch size on each device should be reduced.
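The convergence limit above translates into simple arithmetic for the per-device batch size. The sketch below assumes that "around 32K" means 32768 and ignores the separate device memory limit; per_device_batch_size is a hypothetical helper name.

```python
def per_device_batch_size(global_batch_limit, num_devices):
    """Largest per-device batch size whose total stays within global_batch_limit."""
    if num_devices < 1:
        raise ValueError("num_devices must be at least 1")
    return global_batch_limit // num_devices


GLOBAL_LIMIT = 32 * 1024  # assumption: "around 32K" read as 32768

for devices in (8, 64, 256):
    print(devices, per_device_batch_size(GLOBAL_LIMIT, devices))
```

The value actually used is the smaller of this result and the largest batch that fits in device memory.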
The table below provides some examples of batch sizes used in different models, all using mixed precision.
Bert Large pre-training Phase 1
Bert Large pre-training Phase 2
PyTorch Mixed Precision¶
For details on how to run mixed precision training of PyTorch models on Gaudi, refer to Mixed Precision Training with PyTorch Autocast.
Usage of Fused Operators¶
Create a custom op for optimizers (e.g., FusedSGD, FusedAdamW) and other complex ops (e.g., FusedClipNorm) to minimize the host performance overhead of running many small ops. This can improve the overlap of execution between host and device.
The Intel Gaudi PyTorch package provides some of these ops; see Handling Custom Ops.
Refer to the custom operator FusedSGD in ResNet50 FusedSGD.
Using Fused Scaled Dot Product Attention (FusedSDPA)¶
FusedSDPA is a fused implementation of torch.nn.functional.scaled_dot_product_attention() for Gaudi. It maintains the same functionality and interface as the original function but with reduced memory usage. FusedSDPA implements selected Flash Attention optimization approaches applicable to HPU.
The FusedSDPA class takes several parameters (query, key, value, attn_mask=None, dropout_p=0.0, is_causal=False, scale=None) and produces the output of scaled dot product attention. For further information on the functionality and parameters, refer to Scaled Dot Product Attention.
attn_mask supports shape (N,…,1,S), in addition to the PyTorch specified shape (N,…,L,S).
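For reference, the computation behind these parameters can be written as follows. This is the standard scaled dot product attention formula as documented by PyTorch, with dropout omitted; the symbols Q, K, V, M, s and E are chosen here for illustration and do not come from the FusedSDPA sources.

```latex
% Q: (N, ..., L, E)   K: (N, ..., S, E)   V: (N, ..., S, E_v)
% M: additive form of attn_mask, shape (N, ..., L, S) or (N, ..., 1, S);
%    the size-1 dimension is broadcast over the L query positions.
\[
  \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(s\, Q K^{\top} + M\right) V,
  \qquad
  s = \frac{1}{\sqrt{E}} \quad \text{when scale is None}
\]
% A boolean attn_mask corresponds to M = 0 where the mask is True
% and M = -\infty where it is False.
% is_causal=True corresponds to a lower-triangular boolean mask.
```

A mask of shape (N,…,1,S) therefore applies the same key mask to every query position, which is the common case for padding masks.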
The following describes FusedSDPA operation modes:
Recompute mode: This is the default mode. In this mode, the necessary parts of the forward pass are recomputed during the backward pass to reduce memory usage. This helps topologies run with a higher batch size and/or sequence length. In addition to the memory optimizations related to recompute, this mode has further memory optimizations, so you can try it even in inference scenarios that run out of memory with non-fused attention implementations such as torch.nn.functional.scaled_dot_product_attention. This mode does not support broadcasting on the batch size dimension: the query, key and value tensors must have the same batch size.
No-recompute mode: In this mode, recomputing is not done. Therefore, this mode can need more memory than recompute mode. Note that no-recompute mode still has some memory benefits compared to non-fused attention implementations.
For further details on FusedSDPA APIs, see hpex/kernels/FusedSDPA.
The supported data types are FP32 and BF16.
Memory usage profiling to characterize memory reduction on standard topologies is in progress. Users are advised to try both modes and choose the optimal mode for a given topology.
Perf Tool and Tensorboard Model Scanning¶
habana_perf_tool scans existing log files generated for TensorBoard and provides guidance on them, without having to run the TensorBoard UI.
The tool scans the log file, shows a list of the metrics that it measures, and then provides specific guidance for optimization, such as increasing the batch size, or MME and TPC usage and timings. This analysis capability is also built directly into TensorBoard.
The tool can be run with the following command:
root@ubuntu2204:~/traces# habana_perf_tool --trace trace_example.json
Additional General Optimizations¶
Disk Caching Eviction Policy¶
Disk Caching is a mechanism that can limit the number of graph compilations for both training and inference workloads. The Intel Gaudi PyTorch bridge verifies whether a recipe is already in the cache (first in the memory cache, then in the disk cache). Refer to the Runtime Environment Variables section to configure the disk caching variables. If you want to keep the disk cache size under a predefined threshold, an eviction policy has been implemented.
When a compiled recipe is added to the cache, the algorithm checks whether the total size of all recipes fits within the configured maximum cache directory size, <RECIPE_CACHE_SIZE_MB>.
If the total cache directory size (including the new recipe) exceeds the defined maximum size, the PyTorch bridge iterates over the recipes in the cache and removes the oldest recipes first (by creation date on the file system) until the total size is under the limit.
Eviction is performed after the recipe is serialized and stored on disk by every worker.
To ensure that the eviction logic removes recipes in a coherent way, only one process may perform eviction at a time. This is implemented using an eviction.lock file in the disk cache directory, which is locked using flock (https://linux.die.net/man/2/flock). The cache directory is locked by a particular worker only for the duration of the eviction.
Both serialization and eviction are performed in a separate thread, so graph launch is not delayed.
Since the size of the recipe being stored is unknown prior to serialization, the eviction tries to keep the size of the cache directory <= 0.99 * <RECIPE_CACHE_SIZE_MB>. This limits the possibility of exceeding the specified cache directory size during the next serialization.
If info logs from PT_HABHELPER are enabled (LOG_LEVEL_PT_HABHELPER=2), you should see the following PyTorch log message: “Removed <recipe id> successfully. Disk cache size after removal: <size>”. If too many eviction messages are observed, it may be time to increase the recipe cache directory size. For specific models, you can fine-tune this size to get the best performance.
In the above example, recipes will be stored in
/tmp/iter1_recipe_cache/. The cache will be cleared at the beginning of each script execution and the size of recipe cache will be limited to 1024MB.
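The eviction steps described in this section can be sketched in a few lines. This is not the PyTorch bridge's implementation: evict_oldest is a hypothetical helper, the limit is given in bytes rather than MB to keep the demo small, file modification time stands in for the creation date, and every regular file in the directory is treated as a recipe. The eviction.lock name and the 0.99 factor follow the text above.

```python
import fcntl
import os

LOCK_NAME = "eviction.lock"  # lock file name as described in the text


def evict_oldest(cache_dir, max_size_bytes, headroom=0.99):
    """Remove the oldest files until the directory holds <= headroom * max_size_bytes.

    Returns the names of the removed files, oldest first.
    """
    removed = []
    lock_path = os.path.join(cache_dir, LOCK_NAME)
    with open(lock_path, "w") as lock_file:
        # Exclusive lock: only one process evicts at a time.
        fcntl.flock(lock_file, fcntl.LOCK_EX)
        try:
            entries = []
            for name in os.listdir(cache_dir):
                path = os.path.join(cache_dir, name)
                if name == LOCK_NAME or not os.path.isfile(path):
                    continue
                info = os.stat(path)
                # st_mtime stands in for the creation date in this sketch.
                entries.append((info.st_mtime, name, info.st_size))
            entries.sort()  # oldest first
            total = sum(size for _, _, size in entries)
            target = headroom * max_size_bytes
            for _, name, size in entries:
                if total <= target:
                    break
                os.remove(os.path.join(cache_dir, name))
                total -= size
                removed.append(name)
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)
    return removed
```

Calling evict_oldest after each recipe is written mirrors the order described above, where eviction runs only once serialization has finished.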
Adjust the Gradient Bucket Size in Multi-card/Multi-node Training¶
Based on the size of the model, the size of the gradient bucket can be adjusted to minimize the number of invocations of all-reduce in the backward pass of every training iteration. Documentation is available in PyTorch DDP.
In ResNet50, bucket size of 100MB is optimal whereas ResNext101 requires bucket size of 200MB. Refer to the implementation here.
Setting Gradients as View of Gradient Buckets in Multi-card/Multi-node Training¶
PyTorch DDP allows parameter gradient tensors to be views of the gradient bucket. This improves performance as device-to-device copies can be reduced and also reduces device memory requirement. Documentation is available in PyTorch DDP.
Refer to the implementation for ResNet50.
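Both this setting and the gradient bucket size from the previous section are keyword arguments of torch.nn.parallel.DistributedDataParallel (bucket_cap_mb and gradient_as_bucket_view). The sketch below only assembles those arguments; ddp_kwargs and the lookup table are hypothetical helpers, and the bucket sizes are the ones quoted above for ResNet50 and ResNext101.

```python
# Bucket sizes quoted in the text, in MB; other models need their own tuning.
BUCKET_CAP_MB = {"resnet50": 100, "resnext101": 200}
DEFAULT_BUCKET_CAP_MB = 25  # PyTorch DDP's documented default


def ddp_kwargs(model_name):
    """Keyword arguments for torch.nn.parallel.DistributedDataParallel."""
    return {
        "bucket_cap_mb": BUCKET_CAP_MB.get(model_name.lower(), DEFAULT_BUCKET_CAP_MB),
        "gradient_as_bucket_view": True,
    }


# Usage in a training script (requires an initialized process group):
#   model = torch.nn.parallel.DistributedDataParallel(model, **ddp_kwargs("resnet50"))
print(ddp_kwargs("ResNet50"))
```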
Reducing the Frequency of Printing Quantities¶
In cases where models have been fully optimized and set for production usage, some output messaging should be reduced or eliminated for best performance. The following are two specific examples:
Reporting loss using loss.item() or calculating loss to display to the user
Showing the progress bar (using tqdm or other libraries) during runtime
Both of these items rely on additional communication between the host CPU and the Gaudi HPU to calculate loss or progress and then display the results. Printing these tensors in the training script requires these device tensors to be pulled to the host CPU and therefore needs the device execution to finish. This can result in non-overlapped execution between host and device leading to sub-optimal performance.
To reduce loss calculation or progress bar updates, set the print frequency --print-freq to a high value or eliminate it altogether. You can set the --print-freq variable in the model run command to a size similar to the optimizer step size. For the progress bar, it is recommended to wait until a run completes 20 or more iterations to minimize unnecessary synchronization.
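The effect of a print frequency on host-device synchronization can be sketched as follows. The loop is a stand-in: train, run_step and read_loss are hypothetical names, and read_loss plays the role of loss.item(), which in a real run waits for the device to finish.

```python
def train(num_steps, run_step, read_loss, print_freq):
    """Run num_steps iterations, reading the loss on the host every print_freq steps.

    print_freq=0 disables reporting entirely.
    """
    reported = []
    for step in range(1, num_steps + 1):
        loss = run_step(step)  # in a real run this stays a device tensor
        if print_freq and step % print_freq == 0:
            # read_loss stands in for loss.item(), which forces a sync.
            reported.append((step, read_loss(loss)))
    return reported


syncs = []  # one entry per host-side read


def read_loss(value):
    syncs.append(1)
    return value


log = train(100, run_step=lambda s: 1.0 / s, read_loss=read_loss, print_freq=20)
print(len(log), len(syncs))
```

With print_freq=20 the host reads the loss 5 times in 100 steps instead of 100 times, leaving the remaining iterations free to overlap host and device work.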
Pinning Memory For Dataloader¶
Pinning the memory while instantiating the dataloader avoids a redundant copy on the host during the training iteration. Refer to the support in PyTorch Dataloader.
Refer to the implementation for ResNet50 Dataloader.
Avoiding Constant Variables in Loops¶
Avoiding the use of loop iterator variables within a loop may reduce the recompilations that would otherwise happen in consecutive iterations. Such a loop iterator variable may cause a different constant operator to be created in the execution graph on every iteration.
For example, in the original V-Diffusion code the value of the iterator variable changes each time the loop iterates.
To avoid triggering recompilations after each iteration, the loop iterator variable i is not used in the Intel Gaudi V-Diffusion model.
for i in range(4, num_steps):
    # The following 3 lines remove graph recompilation (variable "i" is not used)
    t_1 = steps[0]  # before: steps[i]
    t_2 = steps[1]  # before: steps[i+1]
    steps = torch.roll(steps, shifts=(-1), dims=(0))
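The equivalence between indexing with the loop variable and reading fixed positions after a rotation can be checked with plain Python lists. This is a sketch with hypothetical helper names; list slicing stands in for torch.roll, and the sample values are made up for illustration.

```python
def pairs_by_index(steps, start):
    """Original pattern: the loop variable i selects steps[i] and steps[i + 1]."""
    return [(steps[i], steps[i + 1]) for i in range(start, len(steps) - 1)]


def pairs_by_rotation(steps, start):
    """Rewritten pattern: always read positions 0 and 1, then rotate left by one."""
    # Align position 0 with the element the first iteration should see.
    steps = steps[start:] + steps[:start]
    out = []
    for _ in range(start, len(steps) - 1):
        out.append((steps[0], steps[1]))
        # List analogue of torch.roll(steps, shifts=-1, dims=0).
        steps = steps[1:] + steps[:1]
    return out


steps = [1.0, 0.8, 0.6, 0.4, 0.2, 0.0]
assert pairs_by_index(steps, 2) == pairs_by_rotation(steps, 2)
```

Because the rotated version always reads the same two positions, the indices are the same constants on every iteration, which is what removes the per-iteration change in the graph.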
Weight sharing is a technique in which the module weights are shared among two or more layers. Weights can be shared using PyTorch with Gaudi only if they are created inside the module. You can find an example of weight sharing in BERT Pre-Training example on GitHub.
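import torch
import habana_frameworks.torch.core as ht

# Example module
class WeightShareModule(torch.nn.Module):
    def __init__(self):
        super().__init__()
        # Shape (1,) is chosen for illustration only
        self.a = torch.nn.Parameter(torch.ones(1))
        self.b = torch.nn.Parameter(torch.ones(1))

    def forward(self, input):
        c = self.a*input + self.b*input
        return c

module = WeightShareModule()
# module.a and module.b are shared
module.a = module.b
# Move the module to HPU device
module.to("hpu")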
Optimizing Training Using PyTorch Lightning¶
HPUParallelStrategy, provided by the PyTorch Lightning package, supports features such as setting the size of the gradient bucket, setting gradients as views of allreduce buckets, and static_graph.
By setting static_graph when instantiating the Trainer, allreduce on unused parameters in the graph can be avoided. This also avoids the overhead of copying them from host to device and vice versa after performing the allreduce.
Refer to the implementation for Unet2D.