Optimizations of PyTorch Models
Optimizations of PyTorch Models¶
The following optimization methods can be applied to PyTorch models run on the Intel® Gaudi® AI accelerator to enhance their performance.
General Model Optimizations¶
The optimization methods below can be used with all PyTorch models.
Placement of Ops on HPU¶
When a model is ported to HPU, the Intel Gaudi software stack distributes ops between CPU and HPU. To achieve optimal performance on HPU, avoid executing ops on CPU.
The distribution is based on whether the op is registered in PyTorch with the HPU backend and whether the requested data type is supported on HPU. Execution of an op automatically falls back to CPU if the op is not registered with the HPU backend, or if it is registered but the requested data type is not supported on HPU.
To enable CPU fallback logs to check whether ops were executed on CPU, set the environment variable as shown below:
LOG_LEVEL_PT_FALLBACK=1
Example:
When the aten::digamma op falls back to CPU, the logs display the following:
CPU fallback digamma : self=HPUBFloat16Type
Frequency of op and op name that were executed on CPU:
1 aten::digamma
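To track fallbacks across many runs, the summary can be tallied with a small script. The following is a minimal sketch that assumes the summary always has the layout shown in the example above (a header line followed by "<count> <op name>" lines); the function name cpu_fallback_counts is hypothetical and not part of the Intel Gaudi software.

```python
import re
from collections import Counter

# Assumed layout, taken from the example above: a header line followed by
# one "<count> <op name>" line per op that ran on CPU.
SUMMARY_HEADER = "Frequency of op and op name that were executed on CPU:"
SUMMARY_LINE = re.compile(r"^\s*(\d+)\s+(\S+::\S+)\s*$")


def cpu_fallback_counts(log_text: str) -> Counter:
    """Return a Counter mapping op name to the number of CPU executions."""
    counts = Counter()
    in_summary = False
    for line in log_text.splitlines():
        if SUMMARY_HEADER in line:
            in_summary = True
            continue
        if in_summary:
            match = SUMMARY_LINE.match(line)
            if match is None:
                in_summary = False  # the summary block has ended
                continue
            counts[match.group(2)] += int(match.group(1))
    return counts


sample_log = """CPU fallback digamma : self=HPUBFloat16Type
Frequency of op and op name that were executed on CPU:
1 aten::digamma
"""

fallbacks = cpu_fallback_counts(sample_log)
for op_name, count in fallbacks.most_common():
    print(f"{op_name}: {count}")
```

Ops that appear with a high count are the first candidates for replacing with an HPU-supported op or data type.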
Usage of mark_step¶
mark_step is added after the backward pass and the optimizer step; additional mark_step calls can be used to optimize your model further.
The following are examples of when adding mark_step is beneficial:
mark_step is added to avoid Out of Memory issues. In cases where the graph requires more memory than is available, the graph is broken using mark_step. This reduces memory consumption, overcoming Out of Memory issues. See DeepSpeed Bert for example.
mark_step can also be used when the graph has static and dynamic shapes. Due to dynamicity, the graph is recompiled, causing performance degradation. Adding mark_step after the static graph may reduce recompilations or limit recompilation to small dynamic graphs.
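The placement of mark_step after the backward pass and the optimizer step can be sketched as follows. This is an illustration under assumptions: the model, data and hyperparameters are placeholders, and the no-op fallback exists only so that the sketch also runs on a machine without the Intel Gaudi software stack.

```python
import torch

try:
    # Available only where the Intel Gaudi software stack is installed.
    import habana_frameworks.torch.core as htcore

    device = torch.device("hpu")
    mark_step = htcore.mark_step
except ImportError:
    # Fallback (assumption for portability): run on CPU with a no-op.
    device = torch.device("cpu")

    def mark_step():
        pass


# Placeholder model and optimizer, for illustration only.
model = torch.nn.Linear(16, 4).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = torch.nn.MSELoss()

for _ in range(3):
    inputs = torch.randn(8, 16).to(device)
    targets = torch.randn(8, 4).to(device)

    optimizer.zero_grad()
    loss = loss_fn(model(inputs), targets)
    loss.backward()
    mark_step()  # break the graph after the backward pass

    optimizer.step()
    mark_step()  # break the graph after the optimizer step

final_loss = loss.item()
print(final_loss)
```

Additional mark_step calls, for example between a static and a dynamic part of the forward pass, would follow the same pattern.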
Batch Size¶
Throughput usually improves with larger batch sizes. However, the following limitations apply when using a large batch size:
Batch size is limited by the fixed size of Gaudi’s device memory (HBM). A larger batch size usually means higher memory consumption on the device.
A large batch size cannot be used when low latency, rather than throughput, is required.
Large batch size in each Gaudi device may impact the convergence in data parallelism distributed training. For example, the highest global batch size that gives ResNet50 convergence is around 32K. This means that with an increasing number of Gaudi devices, batch size should be reduced in each device.
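The relationship between the global batch size limit and the per-device batch size can be sketched with simple arithmetic. The helper name per_device_batch_size is hypothetical, and "around 32K" is taken as exactly 32 * 1024 purely for illustration.

```python
def per_device_batch_size(preferred: int, num_devices: int, max_global: int) -> int:
    """Largest per-device batch size that keeps the global batch within max_global."""
    if num_devices < 1:
        raise ValueError("num_devices must be at least 1")
    cap = max_global // num_devices
    if cap < 1:
        raise ValueError("max_global is too small for this number of devices")
    return min(preferred, cap)


# Assumption: treat the "around 32K" convergence limit as 32 * 1024 samples.
MAX_GLOBAL_BATCH = 32 * 1024

for devices in (8, 64, 256):
    size = per_device_batch_size(256, devices, MAX_GLOBAL_BATCH)
    print(f"{devices} devices -> per-device batch size {size}")
```

With a preferred per-device batch size of 256, the limit only starts to bind once more than 128 devices are used.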
The below table provides examples of batch sizes used in different models, all using mixed precision.
Models | Batch Size
---|---
ResNet50 | 256
Bert Large pre-training Phase 1 | 64
Bert Large pre-training Phase 2 | 8
MaskRCNN | 4
PyTorch Mixed Precision¶
For details on how to run mixed precision training of PyTorch models on Gaudi, refer to Mixed Precision Training with PyTorch Autocast.
Usage of Fused Optimizers and Custom Ops¶
Create a custom op for PyTorch optimizers (FusedSGD, FusedAdamW) and other complex ops (FusedClipNorm) to minimize host performance overheads of running many small ops. This can improve the overlap of execution between host and device.
The Intel Gaudi PyTorch package provides its own implementation of PyTorch ops customized for Gaudi. For more details, see Fused Optimizers and Custom Ops for Intel Gaudi.
Example:
For the custom FusedSGD operator, refer to ResNet50 FusedSGD.
Perf Tool and TensorBoard Model Scanning¶
The habana_perf_tool scans and provides guidance on existing log files generated for TensorBoard, without having to run the TensorBoard UI.
The tool scans the log file, shows a list of metrics that it measures, and then provides specific guidance for optimization, such as increasing batch size or MME and TPC usages and timings. This analysis capability is also built directly into TensorBoard.
The tool can be initiated with the following command:
root@ubuntu2204:~/traces# habana_perf_tool --trace trace_example.json
Model-specific Optimizations¶
The optimization methods below are supported on specific PyTorch models.
Using Fused Scaled Dot Product Attention (FusedSDPA)¶
FusedSDPA is a fused implementation of torch.nn.functional.scaled_dot_product_attention() for Gaudi.
It maintains the same functionality as the original function with reduced memory usage and implements selected Flash Attention optimization approaches.
For further information on the original functionality and parameters, refer to Scaled Dot Product Attention.
Note
FusedSDPA is designed to accelerate training and inference of Transformer-based models.
The supported data types are FP32 and BF16. Running inference with FP8 data type is possible with Intel Neural Compressor (INC).
Memory usage profiling to characterize memory reduction on standard topologies is in progress. Users are advised to try both modes and choose the optimal mode for a given topology.
FusedSDPA Custom Features¶
The features described in the following sections are configured via FusedSDPA custom API parameters. For further details, see hpex/kernels/FusedSDPA.
Operation Modes¶
FusedSDPA has two operation modes: Recompute mode (default) and No-recompute mode. The operation mode can be selected either using the context API variable or the custom API parameter.
Recompute mode (recompute_mode=None) - This is the default mode. In this mode, necessary parts of the forward pass are recomputed during backward pass to reduce memory usage. This helps topologies to run with a higher batch size and/or sequence length. In addition to memory optimizations related to recompute, this mode has additional memory optimizations. You can try running this mode in inference scenarios that result in Out of Memory issues with non-fused attention implementations. This mode does not support broadcasting on batch size dimension. Query, key and value tensors should have same batch size.
No-recompute mode (recompute_mode=False) - In this mode, recomputing is not done. Therefore, this mode can have larger memory needs compared to recompute mode. This mode still has memory benefits compared to non-fused attention implementations.
Example:
import torch
from habana_frameworks.torch.hpex.kernels import FusedSDPA
import habana_frameworks.torch.hpu as ht
query = torch.rand(32, 8, 128, 64, dtype=torch.bfloat16, device="hpu")
key = torch.rand(32, 8, 128, 64, dtype=torch.bfloat16, device="hpu")
value = torch.rand(32, 8, 128, 64, dtype=torch.bfloat16, device="hpu")
sdpa_out = FusedSDPA.apply(query, key, value, None, 0.0, True, None, False)
print(sdpa_out.to("cpu"))
Fast Softmax¶
FusedSDPA supports fast Softmax function execution with softmax_mode='fast' enabled. If the default softmax_mode='None' is set, the default Softmax is used.
The feature is supported in both non-triangular and triangular masking modes.
Example:
import torch
from habana_frameworks.torch.hpex.kernels import FusedSDPA
import habana_frameworks.torch.hpu as ht
query = torch.rand(32, 8, 128, 64, dtype=torch.bfloat16, device="hpu")
key = torch.rand(32, 8, 128, 64, dtype=torch.bfloat16, device="hpu")
value = torch.rand(32, 8, 128, 64, dtype=torch.bfloat16, device="hpu")
with ht.sdp_kernel(enable_recompute = True):
    sdpa_out = FusedSDPA.apply(query, key, value, None, 0.0, True, None, 'fast')
print(sdpa_out.to("cpu"))
Note
Using fast Softmax may affect inference accuracy.
Only BF16 data type is supported with fast Softmax.
Fast Softmax is not supported when running training in recompute mode with is_causal = False.
Valid Sequence Length¶
The valid sequence length represents the actual length of the sequence, excluding any padding.
Sequences with varying lengths can be padded to a common maximum length, either at the beginning (left padding) or the end (right padding).
The region corresponding to the padding is ignored during Softmax calculations when attention computations are run.
In certain topologies, combining triangular masking (is_causal=True) with specifying the valid sequence length allows the invalid areas to be ignored more efficiently. FusedSDPA’s valid_seq_len and seq_padding_type API parameters facilitate this optimization.
The example below illustrates a case involving a batch of three sequences with a maximum length of 128. The actual sequence lengths are 100, 120, and 80, with padding added after each sequence:
import torch
from habana_frameworks.torch.hpex.kernels import FusedSDPA
import habana_frameworks.torch.hpu as ht
query = torch.rand(3, 8, 128, 64, dtype=torch.bfloat16, device="hpu")  # seq len=128 after padding
key = torch.rand(3, 8, 128, 64, dtype=torch.bfloat16, device="hpu")  # seq len=128 after padding
value = torch.rand(3, 8, 128, 64, dtype=torch.bfloat16, device="hpu")  # seq len=128 after padding
valid_s_len = torch.tensor([100, 120, 80], dtype=torch.int32, device="hpu")  # actual seq len of 100,120,80
sdpa_out = FusedSDPA.apply(query, key, value, None, 0.0, True, None, 'fast', False, valid_s_len, "right")
print(sdpa_out.to("cpu"))
Note
This feature is supported only with is_causal=True and attn_mask=None.
seq_padding_type is relevant only when valid_seq_len is not None.
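In practice, the valid sequence lengths are often not known as literals but can be derived from a padding mask. The sketch below shows one possible way to do this on the host, assuming a hypothetical 0/1 mask named padding_mask in which 1 marks a real token and 0 marks right padding; it uses only standard PyTorch calls and does not require a Gaudi device.

```python
import torch

# Hypothetical padding mask for three sequences padded to length 128:
# 1 marks a real token, 0 marks right padding.
lengths = [100, 120, 80]
max_len = 128
padding_mask = torch.zeros(len(lengths), max_len, dtype=torch.int32)
for row, length in enumerate(lengths):
    padding_mask[row, :length] = 1

# Count the real tokens in each sequence and store them as int32,
# the dtype used for the valid sequence lengths in the example above.
valid_s_len = padding_mask.sum(dim=1).to(torch.int32)
print(valid_s_len.tolist())
```

The resulting tensor could then be moved to the device and passed to FusedSDPA in the same position as valid_s_len in the example above.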
Returning Dropout Mask¶
FusedSDPA returns the dropout mask if return_dropout_mask=True. The parameter is used for debug purposes only.
Example:
import torch
from habana_frameworks.torch.hpex.kernels import FusedSDPA
import habana_frameworks.torch.hpu as ht
query = torch.rand(3, 8, 128, 64, dtype=torch.bfloat16, device="hpu")
key = torch.rand(3, 8, 128, 64, dtype=torch.bfloat16, device="hpu")
value = torch.rand(3, 8, 128, 64, dtype=torch.bfloat16, device="hpu")
sdpa_out, drp_out_mask = FusedSDPA.apply(query, key, value, None, 0.1, True, None, 'None', False, None, "right", True)
print(sdpa_out.to("cpu"))
print(drp_out_mask.to("cpu"))
Note
Returning dropout mask is supported only in no-recompute mode.
Disk Caching Eviction Policy¶
Disk caching is a mechanism that limits the number of graph compilations for both training and inference workloads. Initially, the Intel Gaudi PyTorch bridge checks if a recipe is already cached in memory and then in disk cache. Refer to the Runtime Environment Variables section to configure the disk caching variables. If you want to keep the disk cache size under a predefined threshold, an eviction policy can be implemented.
When a compiled recipe is added to the cache, the algorithm checks whether the total size of all recipes fits the cache directory max size that is specified in PT_HPU_RECIPE_CACHE_CONFIG.
When the total size of the cache directory, including new recipes, exceeds the defined maximum size, the Intel Gaudi PyTorch bridge iterates through the recipes in the cache. It removes the oldest recipes on the file system first until the total size is under the limit.
Highlights:
Eviction is performed after the recipe is serialized and stored on disk by every worker.
To ensure that eviction logic removes recipes in a coherent way, only one process may perform eviction at a time. This is implemented using an eviction.lock file in the disk cache directory and locking it using flock (https://linux.die.net/man/2/flock). The cache directory is locked by a particular worker only for eviction time.
Both serialization and eviction are performed in a separate thread, so graph launch is not delayed.
Since the size of the recipe being stored is unknown prior to serialization, the eviction tries to keep the size of the cache directory <= 0.99 * <RECIPE_CACHE_SIZE_MB>. It limits the possibility of exceeding the specified cache dir size during the next serialization.
If info logs from PT_HABHELPER are enabled, LOG_LEVEL_PT_HABHELPER=2, then you should see the following PyTorch log message: “Removed <recipe id> successfully. Disk cache size after removal: <size>”. If too many eviction messages are observed, it is recommended to reset the recipe cache directory size to a larger number. For specific models, you can fine-tune this size to get the best performance.
Example:
PT_HPU_RECIPE_CACHE_CONFIG=/tmp/iter1_recipe_cache/,true,1024 \
python your_model.py
In the above example, the recipes are stored in /tmp/iter1_recipe_cache/. The cache is cleared at the beginning of each script execution and the size of the recipe cache is limited to 1024 MB.
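The oldest-first eviction described above can be illustrated with a small simulation. This is not the Intel Gaudi PyTorch bridge implementation: the function names are hypothetical, file modification time is assumed to define "oldest", the three comma-separated fields are interpreted as in the example above, and the eviction.lock handling is omitted.

```python
import os
import tempfile
from pathlib import Path


def parse_cache_config(value: str):
    """Split '<directory>,<clear at start>,<max size in MB>' into typed fields."""
    directory, clear_at_start, size_mb = value.split(",")
    return Path(directory), clear_at_start.strip().lower() == "true", int(size_mb)


def evict_oldest(cache_dir: Path, max_bytes: int, headroom: float = 0.99):
    """Delete the oldest files until the total size is <= headroom * max_bytes."""
    files = sorted(
        (path for path in cache_dir.iterdir() if path.is_file()),
        key=lambda path: path.stat().st_mtime,  # assumption: oldest by mtime
    )
    total = sum(path.stat().st_size for path in files)
    removed = []
    for path in files:
        if total <= headroom * max_bytes:
            break
        total -= path.stat().st_size
        path.unlink()
        removed.append(path.name)
    return removed, total


# Demo: three 400-byte dummy recipes in a cache limited to 1000 bytes.
with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp)
    for age, name in enumerate(["recipe_a", "recipe_b", "recipe_c"]):
        recipe = cache / name
        recipe.write_bytes(b"\0" * 400)
        os.utime(recipe, (1_000 + age, 1_000 + age))  # recipe_a is the oldest
    removed, remaining = evict_oldest(cache, max_bytes=1000)
    print(removed, remaining)
```

In the demo, the directory holds 1200 bytes against a 990-byte target (0.99 * 1000), so only the oldest file has to be removed.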
Adjust the Gradient Bucket Size in Multi-Card/Server Training¶
Based on the size of the model, the size of the gradient bucket can be adjusted to minimize the number of allreduce invocations in the backward pass of every training iteration. Refer to PyTorch DDP for more details.
Example:
For ResNet50, a bucket size of 100 MB is optimal, whereas ResNext101 requires a bucket size of 200 MB. Refer to the implementation here.
Setting Gradients as View of Gradient Buckets in Multi-Card/Server Training¶
PyTorch DDP allows parameter gradient tensors to be views of the gradient bucket. This improves performance as device-to-device copies can be reduced and also reduces device memory requirement. Refer to PyTorch DDP for more details.
Example:
Refer to ResNet50.
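Both the gradient bucket size and the gradients-as-view setting map to keyword arguments of PyTorch's DistributedDataParallel (bucket_cap_mb and gradient_as_bucket_view). The sketch below shows how they might be passed; the helper names and the lookup table are hypothetical, and the bucket sizes are the ResNet50 and ResNext101 values quoted in this document. The wrapper is only defined, not called, because it requires an initialized process group.

```python
# Bucket sizes in MB quoted in this document; other models fall back to
# the PyTorch DDP default of 25 MB.
BUCKET_CAP_MB = {"resnet50": 100, "resnext101": 200}


def ddp_kwargs(model_name: str, default_mb: int = 25) -> dict:
    """Build DDP keyword arguments for the given model name."""
    return {
        "bucket_cap_mb": BUCKET_CAP_MB.get(model_name.lower(), default_mb),
        "gradient_as_bucket_view": True,  # gradients become views of the buckets
    }


def wrap_with_ddp(model, model_name: str):
    """Wrap a model in DDP; call only after the process group is initialized."""
    # Imported lazily so that building the kwargs does not require torch.
    from torch.nn.parallel import DistributedDataParallel as DDP

    return DDP(model, **ddp_kwargs(model_name))


print(ddp_kwargs("ResNet50"))
```

The best bucket size depends on the model, so values for other topologies would need to be found by measurement.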
Reducing the Printing Quantities Frequency¶
Some output messaging should be reduced or eliminated for enhanced performance when models have been fully optimized and set up for production use. Two examples are provided below:
Reporting loss using loss.item() or calculating loss to display.
Showing the progress bar (using tqdm or other libraries) during runtime.
Both of these items rely on additional communication between the host CPU and Gaudi to calculate loss or progress and then display the results. Printing these tensors in the training script requires pulling the device tensors to the host CPU and, therefore, requires device execution to finish. This can result in non-overlapped execution between host and device leading to sub-optimal performance.
To reduce loss calculation or progress bar updates, set the print frequency --print-freq to a high value or eliminate it altogether. You can set the --print-freq variable in the model run command to a size similar to the optimizer step size. For the progress bar, it is recommended to update it only after every 20 or more iterations to minimize unnecessary synchronization.
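The effect of a lower print frequency can be sketched as follows. The idea is to accumulate the loss in a device tensor and call item() only every print_freq iterations; print_freq here is a hypothetical variable mirroring --print-freq, the model and data are placeholders, and the sketch runs on CPU so that it does not depend on a Gaudi device.

```python
import torch

device = torch.device("cpu")  # on Gaudi this would be "hpu"
model = torch.nn.Linear(16, 1).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
print_freq = 20  # hypothetical value mirroring --print-freq

running_loss = torch.zeros((), device=device)
host_syncs = 0

for step in range(1, 61):
    inputs = torch.randn(8, 16).to(device)
    targets = torch.randn(8, 1).to(device)

    optimizer.zero_grad()
    loss = torch.nn.functional.mse_loss(model(inputs), targets)
    loss.backward()
    optimizer.step()

    # Accumulate on the device; no host transfer happens here.
    running_loss += loss.detach()

    if step % print_freq == 0:
        # item() is the only call that pulls a value to the host.
        average = running_loss.item() / print_freq
        host_syncs += 1
        print(f"step {step}: average loss {average:.4f}")
        running_loss.zero_()
```

Over 60 iterations, this pulls a value to the host three times instead of 60.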
Pinning Memory For Dataloader¶
Pinning the memory while instantiating the dataloader avoids a redundant copy on the host during the training iteration. Refer to PyTorch Dataloader for more details.
Example:
Refer to ResNet50 Dataloader.
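Pinning is requested through the pin_memory argument of the standard PyTorch DataLoader. The sketch below uses a placeholder in-memory dataset and shows only that argument; any Gaudi-specific DataLoader settings are outside its scope. On a machine without an accelerator, PyTorch may emit a warning and skip pinning, but the loop still runs.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Placeholder dataset: 64 random "images" with integer labels.
dataset = TensorDataset(torch.randn(64, 3, 8, 8), torch.randint(0, 10, (64,)))

# pin_memory=True asks the loader to return batches in page-locked host
# memory, which avoids an extra host-side copy before the device transfer.
loader = DataLoader(dataset, batch_size=32, shuffle=False, pin_memory=True)

device = torch.device("cpu")  # on Gaudi this would be "hpu"
batch_shapes = []
for images, labels in loader:
    # non_blocking=True lets the copy overlap with host work when possible.
    images = images.to(device, non_blocking=True)
    labels = labels.to(device, non_blocking=True)
    batch_shapes.append(tuple(images.shape))

print(batch_shapes)
```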
Avoiding Constant Variables in Loops¶
Avoiding the use of loop iterator variables within a loop may reduce the number of recompilations in consecutive iterations. A loop iterator variable can create different constant operators in the execution graph each time the loop is executed.
For example, in the original V-Diffusion code the value of the iterator variable changes each time the loop iterates.
To avoid triggering recompilations after each iteration, the loop iterator variable i is not used in the Intel Gaudi V-Diffusion model. See the example below:
for i in range(4, num_steps):
    # The following 3 lines remove graph recompilation (variable "i" is not used)
    t_1 = steps[0] # before: steps[i]
    t_2 = steps[1] # before: steps[i+1]
    steps = torch.roll(steps, shifts=(-1), dims=(0))
Example:
Refer to the implementation for the Intel Gaudi V-Diffusion model and compare it with the original V-Diffusion code.
Weight Sharing¶
Weight sharing is a technique in which the module weights are shared among two or more layers. Weights can be shared using PyTorch with Gaudi only if they are created inside the module. See the example below:
import torch
import habana_frameworks.torch.core as ht
# Example module
class WeightShareModule(torch.nn.Module):
    def __init__(self):
        super(WeightShareModule, self).__init__()
        self.a = torch.nn.Parameter(torch.ones([2]))
        self.b = torch.nn.Parameter(torch.ones([2]))
    def forward(self, input):
        c = self.a*input + self.b*input
        return c
module = WeightShareModule()
#module.a and module.b are shared
module.a = module.b
# Move the module to HPU device
module.to("hpu")
Example:
Refer to BERT Pre-Training on GitHub.
Switch Host Memory Allocator¶
For deep learning workloads, jemalloc or TCMalloc can achieve better performance by reusing memory as much as possible. Both jemalloc and TCMalloc are pre-installed in the Intel Gaudi Docker images.
Jemalloc - A general purpose malloc implementation that emphasizes fragmentation avoidance and scalable concurrency support.
TCMalloc - Features optimizations to speed up program executions including holding memory in caches to speed up access of commonly-used objects. Holding such caches even after deallocation also helps avoid costly system calls if such memory is later re-allocated.
By default, HPU uses the TCMalloc allocator for host memory. For some workloads, this can cause host Out-Of-Memory issues as it holds memory in cache. This can be mitigated by adjusting the TCMalloc cache size via its config or by switching to the jemalloc allocator using the LD_PRELOAD environment variable.
To switch to the jemalloc allocator:
Clear the existing allocator from LD_PRELOAD
export LD_PRELOAD=/lib/x86_64-linux-gnu/libjemalloc.so.2:$LD_PRELOAD
Optimizing Training Using PyTorch Lightning¶
HPUParallelStrategy provided by PyTorch Lightning package supports the following features:
Setting size of gradient bucket
Setting gradients view of allreduce buckets
static_graph
By setting static_graph when instantiating the Trainer, allreduce on unused parameters in the graph can be avoided. This also bypasses the overhead of copying them from host to device and vice versa after performing allreduce.
Example:
Refer to Unet2D for the implementation.