2. PyTorch User Guide¶
The purpose of this document is to guide Data Scientists to run PyTorch models on the Habana® Gaudi® infrastructure. It provides guidelines for modifying existing models to run on the platform and uses a basic example to show functionality. Currently, support for PyTorch is under active development and available only in Beta.
Make sure that the version of your SynapseAI software stack installation matches the version of the Docker images you are using. The documentation on docs.habana.ai is also versioned, so select the version that matches your installation.
The Setup and Install GitHub repository, as well as the Model-References GitHub repository, have branches for each release version. Make sure you select the branch that matches the version of your SynapseAI software installation.
For example, if SynapseAI software version 0.15.4 is installed, clone the Model-References repository as follows:

% git clone -b 0.15.4 https://github.com/HabanaAI/Model-References
To confirm the SynapseAI software version on your build, run the hl-smi tool and check the “Driver Version” field (see the figure below).
2.2. PyTorch Gaudi Integration Architecture¶
The PyTorch Habana bridge interfaces between the framework and SynapseAI software stack to drive the execution of deep learning models on the Habana Gaudi device.
The installation package provided by Habana comes with modifications on top of the PyTorch release. The customized framework from this installed package must be used to integrate PyTorch with the Habana bridge. PyTorch deep learning model training scripts need to load the PyTorch Habana plugin library and import the habana_frameworks.torch.core module to integrate with the Habana bridge.
The Habana bridge supports various modes of execution for a PyTorch model. The following modes are supported:
Eager mode - op-by-op execution as defined in standard PyTorch eager mode scripts.
Lazy mode - deferred execution of graphs, comprising ops delivered from the script op by op, similar to Eager mode. It gives the Eager mode experience with performance on Gaudi.
2.2.1. Eager Mode¶
During eager mode execution, the framework executes one op at a time from Python. The Habana bridge registers these ops for the Habana device and drives their execution on Gaudi. For any op that is not supported by the Habana device, the bridge falls back to the CPU to execute that op, and thereafter continues executing further supported ops on the device.
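The fallback behavior can be pictured with a toy dispatcher. This is an illustrative sketch only; all names here are hypothetical, and the real bridge registers its kernels through PyTorch's dispatcher rather than a Python dictionary.

```python
# Toy model of op dispatch with CPU fallback (hypothetical names;
# the real Habana bridge registers kernels via the PyTorch dispatcher).
HPU_KERNELS = {
    "add": lambda a, b: ("hpu", a + b),   # ops supported on the device
    "mul": lambda a, b: ("hpu", a * b),
}

def cpu_kernel(op, a, b):
    # Generic CPU implementations used when the device lacks the op.
    ops = {"add": a + b, "mul": a * b, "pow": a ** b}
    return ("cpu", ops[op])

def dispatch(op, a, b):
    """Run the op on the device if registered, otherwise fall back to CPU."""
    if op in HPU_KERNELS:
        return HPU_KERNELS[op](a, b)
    return cpu_kernel(op, a, b)
```

For example, an op registered for the device runs there, while an unregistered one silently falls back to the CPU path, which matches the behavior described above.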
2.2.2. Graph Mode¶
For TorchScript mode, users need to make changes in the python script to create a TorchScript model using torch.jit.trace as required by the PyTorch framework. In this mode, Habana launches the TorchScript JIT subgraph with a single SynapseAI graph for execution to achieve better compute performance compared to eager mode.
Support for graph mode is minimal and may not be available in future releases.
2.2.3. Lazy Mode¶
With this mode, users retain the flexibility and benefits that come with the PyTorch define-by-run approach of the Eager mode. The Habana bridge internally accumulates these ops in a graph. The execution of the ops in the accumulated graph is triggered in a lazy manner, only when a tensor value is required by the user. This allows the bridge to construct a SynapseAI graph with multiple ops, which provides the SynapseAI graph compiler the opportunity to optimize the device execution for these ops. The figure below shows the architectural diagram for the PyTorch Habana full stack.
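The accumulate-then-execute behavior can be sketched with a toy lazy tensor, where each arithmetic op only records a graph node and the whole graph runs when a concrete value is requested. This is a conceptual sketch, not the bridge's actual implementation.

```python
# Minimal sketch of deferred (lazy) execution: ops are recorded in a
# graph and only executed when a value is actually needed.
class LazyTensor:
    def __init__(self, value=None, op=None, inputs=()):
        self._value = value      # concrete data, once materialized
        self.op = op             # pending operation, if any
        self.inputs = inputs

    def __add__(self, other):
        return LazyTensor(op="add", inputs=(self, other))

    def __mul__(self, other):
        return LazyTensor(op="mul", inputs=(self, other))

    def item(self):
        """Requesting the value triggers execution of the accumulated graph."""
        if self._value is None:
            vals = [t.item() for t in self.inputs]
            self._value = vals[0] + vals[1] if self.op == "add" else vals[0] * vals[1]
        return self._value

a, b = LazyTensor(2), LazyTensor(3)
c = (a + b) * b          # nothing executed yet; graph of two ops recorded
result = c.item()        # the whole graph runs here
```

Because execution is deferred until `item()` is called, a multi-op graph is available to the compiler in one piece, which is what gives the graph compiler its optimization opportunity.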
2.2.4. PyTorch Habana Bridge¶
This section describes the major components of the PyTorch Habana Bridge.
2.2.4.1. Synapse Lowering Module¶
The SynapseAI Lowering module converts the framework-provided op or graph to SynapseAI. The PyTorch framework dispatches the execution to the registered methods in the Habana bridge when an op is invoked on Habana tensors. In lazy evaluation mode, the Habana bridge internally builds a graph with accumulated ops. Once a tensor is required to be evaluated, the graph that needs to be executed for the resulting tensor is identified. Various optimization passes are applied to the graph, such as:
fusion of ops that are beneficial for Gaudi,
optimal placement of permutations for the channels-last memory format of tensors,
identification of persistent and non-persistent tensors, and of tensors with duplicate or aliased memory.
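An op-fusion pass of the kind listed above can be sketched on a toy graph represented as an ordered list of op names. The pattern and the fused op name are hypothetical; the real pass operates on SynapseAI graph structures.

```python
# Toy graph-optimization pass: fuse adjacent ("mm", "add") pairs into a
# single fused node, in the spirit of the op-fusion pass described above.
def fuse_mm_add(graph):
    """graph is a list of op names in execution order."""
    fused, i = [], 0
    while i < len(graph):
        if i + 1 < len(graph) and graph[i] == "mm" and graph[i + 1] == "add":
            fused.append("fused_mm_add")   # one device kernel instead of two
            i += 2
        else:
            fused.append(graph[i])
            i += 1
    return fused
```

Running the pass on `["mm", "add", "relu", "mm", "add"]` collapses both matmul-plus-bias pairs into single fused nodes while leaving the activation untouched.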
2.2.4.2. PyTorch Kernels¶
The PyTorch Kernel module within the Habana bridge provides the functionality to convert a PyTorch op into the appropriate SynapseAI ops. A PyTorch op could be implemented with one or more TPC/MME ops. The PyTorch Kernel module adds this set of SynapseAI ops to the SynapseAI graph and converts the PyTorch tensors to SynapseAI tensors for building the SynapseAI graph.
2.2.4.3. Execution Module¶
The Execution Module in the Habana bridge provides the functionality to compile a SynapseAI graph and launch the resulting recipe asynchronously. The recipes are cached by the Habana bridge to avoid recompilation of the same graph. This caching is done both at the eager op level and at the JIT graph level. During training, graph compilation is only required for the initial iteration; thereafter, the same compiled recipe is re-executed every iteration (with new inputs) unless there is a change in the ops being executed.
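The compile-once, replay-many-times behavior can be sketched with a cache keyed on the graph structure. All names here are illustrative stand-ins; the real bridge compiles SynapseAI recipes, not Python tuples.

```python
# Sketch of recipe caching: a graph is compiled once, keyed by its
# structure, and later iterations reuse the compiled recipe.
compile_count = 0
_recipe_cache = {}

def compile_graph(ops):
    """Stand-in for graph compilation; counts how often it runs."""
    global compile_count
    compile_count += 1
    return ("recipe", tuple(ops))

def run(ops, inputs):
    key = tuple(ops)                 # identical op sequence -> cache hit
    if key not in _recipe_cache:
        _recipe_cache[key] = compile_graph(ops)
    recipe = _recipe_cache[key]      # launch the cached recipe
    return sum(inputs)               # stand-in for the recipe's output

out1 = run(["mm", "add"], [1, 2])   # first iteration: compiles
out2 = run(["mm", "add"], [3, 4])   # later iterations: reuse, new inputs
```

After two iterations with the same op sequence, compilation has happened exactly once, mirroring the "compile on the initial iteration, replay thereafter" behavior described above.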
2.2.4.4. Memory Manager¶
The Habana bridge has a memory manager that optimally serves the allocation and free requests from the device memory. The Habana bridge additionally provides the capability to create tensors with pinned memory, which reduces the time required for doing a DMA by avoiding a copy on the host side. The pinned memory feature can be expressed on a tensor with existing flags provided by the PyTorch framework.
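The reuse of freed device memory can be pictured with a toy free-list pool. This is a conceptual sketch under simplified assumptions (exact-size reuse only), not the bridge's actual allocator.

```python
# Toy free-list device-memory manager: freed blocks are kept and reused
# for later allocations of the same size instead of growing the pool.
class MemoryPool:
    def __init__(self):
        self.free_blocks = {}    # size -> list of free offsets
        self.next_offset = 0     # high-water mark of the pool

    def alloc(self, size):
        blocks = self.free_blocks.get(size)
        if blocks:
            return blocks.pop()              # reuse a freed block
        offset = self.next_offset            # otherwise grow the pool
        self.next_offset += size
        return offset

    def free(self, offset, size):
        self.free_blocks.setdefault(size, []).append(offset)

pool = MemoryPool()
a = pool.alloc(64)
pool.free(a, 64)
b = pool.alloc(64)   # served from the free list; pool did not grow
```

Reusing freed blocks keeps the pool's footprint stable across training iterations, which is the core job of an allocator serving repeated allocation/free requests.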
2.2.4.5. Mixed Precision Support¶
Habana PyTorch supports mixed precision execution using the Habana Mixed Precision (HMP) package. Ops can execute in FP32 or BF16 precision. The HMP package modifies the Python operators to add the appropriate cast operations to the arguments before execution. Refer to PyTorch Mixed Precision Training on Gaudi for further details.
2.2.4.6. Distributed Training¶
The collective ops are implemented using the Habana Communication Library (HCL), which is used to perform communication among different Gaudi cards (see Habana Communication Library (HCL) API Reference). Habana integrates the DistributedDataParallel (DDP) package to provide distributed training support on PyTorch and is integrated with the PyTorch framework to drive distributed communications for Habana tensors over the HCL implementation. DDP is integrated with the Eager, Graph, and Lazy evaluation execution modes.
2.3. PyTorch Mixed Precision Training on Gaudi¶
Habana Mixed Precision (HMP) package is a tool that allows you to run mixed precision training on HPU without extensive modifications to existing FP32 model scripts. You can easily add mixed precision training support to the model script by adding the following lines anywhere in the script before the start of the training loop:
from habana_frameworks.torch.hpex import hmp
hmp.convert()
Any segment of script (e.g. optimizer) in which you want to avoid using mixed precision should be kept under the following Python context:
from habana_frameworks.torch.hpex import hmp
with hmp.disable_casts():
    code line:1
    code line:2
2.3.1. Basic Design Rules¶
Two different lists are maintained: (i) OPs that always run in BF16 only, and (ii) OPs that always run in FP32 only.
Python decorators are used to add the required functionality (BF16 or FP32 casts on OP inputs) to torch functions (refer to the code snippet below).
Any OP not in the above two lists runs with the precision type of its 1st input (except for the exceptions listed below).
For OPs with multiple tensor inputs (maintained in a separate list, e.g. add, sub, cat, stack, etc.), all inputs are cast to the widest precision type among all input precision types. If any of these OPs are in the BF16 or FP32 list, that list takes precedence.
For in-place OPs (where the output and 1st input share storage), all inputs are cast to the precision type of the 1st input.
from functools import wraps

def op_wrap(op, cast_fn):
    """Adds wrapper function to OPs.

    All tensor inputs for the OP are casted to type
    determined by cast_fn provided.

    Args:
        op (torch.nn.functional/torch/torch.Tensor): Input OP
        cast_fn (to_bf16/to_fp32): Fn to cast input tensors

    Returns:
        Wrapper function that shall be inserted back
        to corresponding module for this OP.
    """
    @wraps(op)
    def wrapper(*args, **kwds):
        args_cast = get_new_args(cast_fn, args, kwds)
        return op(*args_cast, **kwds)
    return wrapper
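The snippet above relies on a `get_new_args` helper and cast functions that live inside the HMP package. A self-contained sketch with toy stand-ins for those pieces (the stand-ins are hypothetical; only the `op_wrap` pattern follows the snippet) shows how the decorator casts every input before the wrapped op runs:

```python
# Self-contained sketch of the decorator pattern above, with toy
# get_new_args and cast functions standing in for the HMP internals.
from functools import wraps

def to_bf16(x):
    return ("bf16", x)   # stand-in for a real dtype cast

def get_new_args(cast_fn, args, kwds):
    # Toy version: cast every positional argument.
    return tuple(cast_fn(a) for a in args)

def op_wrap(op, cast_fn):
    @wraps(op)
    def wrapper(*args, **kwds):
        args_cast = get_new_args(cast_fn, args, kwds)
        return op(*args_cast, **kwds)
    return wrapper

def my_op(*args):
    return args          # pretend op: returns exactly what it received

wrapped = op_wrap(my_op, to_bf16)
out = wrapped(1.0, 2.0)  # both inputs are cast before the op runs
```

The wrapped function is a drop-in replacement for the original op, which is why HMP can insert it back into the corresponding torch module transparently.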
2.3.2. Configuration Options¶
HMP provides two modes (opt_level = O1/O2) of mixed precision training to choose from. The mode is selected by passing opt_level as an argument to hmp.convert(). O1 is the default and recommended mode of operation when using HMP. O2 can be used for debugging convergence issues, as well as for the initial iterations of converting a new model to run with mixed precision.
2.3.2.1. Opt_level = O1¶
In this mode, OPs that always run in BF16 and OPs that always run in FP32 are selected from a BF16 list and an FP32 list respectively. The BF16 list contains OPs that are numerically safe to run in lower precision on HPU, whereas the FP32 list contains OPs that should run in higher precision (a conservative choice that works across models).
Default BF16 list = [addmm, bmm, conv1d, conv2d, conv3d, dot, mm, mv]
Default FP32 list = [batch_norm, cross_entropy, log_softmax, softmax, nll_loss, topk]
HMP provides the option of overriding these internal lists, allowing you to provide your own BF16 and FP32 lists (pass bf16_file_path=<.txt> and fp32_file_path=<.txt> as arguments to hmp.convert()). This is particularly useful when customizing mixed precision training for a particular model. For example:
Custom BF16 list for Resnet50 = [ addmm, avg_pool2d, bmm, conv2d, dot, max_pool2d, mm, mv, relu, t, linear]
Custom FP32 list for Resnet50 = [cross_entropy, log_softmax, softmax, nll_loss, topk]
2.3.2.2. Opt_level = O2¶
In this mode, only GEMM- and convolution-type OPs (e.g. conv1d, conv2d, conv3d, addmm, mm, bmm, mv, dot) run in BF16, and all other OPs run in FP32.
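The O2 rule amounts to a simple classification over op names, which can be sketched as follows (the op set is the one listed above; the function name is illustrative):

```python
# Sketch of the O2 rule: only GEMM- and convolution-type ops run in
# BF16; everything else stays in FP32.
GEMM_CONV_OPS = {"conv1d", "conv2d", "conv3d", "addmm", "mm", "bmm", "mv", "dot"}

def o2_precision(op_name):
    """Return the precision O2 mode would choose for the given op."""
    return "bf16" if op_name in GEMM_CONV_OPS else "fp32"
```

Because only the numerically robust GEMM/convolution kernels are lowered to BF16, O2 is a conservative starting point when first converting a model to mixed precision.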
2.3.3. Usage Examples¶
import torch
from habana_frameworks.torch.hpex import hmp

N, D_in, D_out = 64, 1024, 512
x = torch.randn(N, D_in, device="habana")
y = torch.randn(N, D_out, device="habana")

# enable mixed precision training with optimization level O1,
# default BF16 list, default FP32 list and logging disabled
# use opt_level to select desired mode of operation
# use bf16_file_path to provide absolute path to a file with custom BF16 list
# use fp32_file_path to provide absolute path to a file with custom FP32 list
# use isVerbose to disable/enable debug logs
hmp.convert(opt_level="O1", bf16_file_path="", fp32_file_path="", isVerbose=False)

model = torch.nn.Linear(D_in, D_out).to(torch.device("habana"))
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

for t in range(500):
    y_pred = model(x)
    loss = torch.nn.functional.mse_loss(y_pred, y)
    optimizer.zero_grad()
    loss.backward()
    # disable mixed precision for optimizer block
    with hmp.disable_casts():
        optimizer.step()
2.3.4. HMP Logs¶
HMP provides the ability to log the precision decision for each OP for debugging purposes. You can enable verbose logs by passing isVerbose=True as an argument to hmp.convert(). The log prints the precision type each time an OP (covered by a Python decorator) is called in the model being run. See the example below:
casting <method '__mul__' of 'torch._C._TensorBase' objects> to to_fp32
casting <built-in method embedding of type object at 0x7feab47edfa0> to to_fp32
casting <function layer_norm at 0x7feaab2a4320> to to_bf16
casting <function dropout at 0x7feaab2a2e60> to to_bf16
casting <method 'matmul' of 'torch._C._TensorBase' objects> to to_bf16
casting <method '__iadd__' of 'torch._C._TensorBase' objects> to to_bf16
2.4. PyTorch Examples¶
This section describes how to train models using Habana PyTorch with Gaudi.
2.4.1. Run Models in Habana Model Repository¶
Clone the models from the Model-References GitHub page using git clone.
Launch runs on Gaudi using the README instructions located in PyTorch Models on GitHub.
2.5. Porting a PyTorch Model to Gaudi¶
2.6. Host and Device Ops Placement¶
When the model is ported to run on HPU, the software stack decides which ops are placed on CPU and which are placed on the HPU. This decision is based on whether the op is registered with PyTorch with HPU as the backend. Execution of an op automatically falls back to CPU if the op is not registered with its backend as HPU.
To enable CPU fallback logs to check whether op execution fell back to CPU, set the environment variables as shown below:
For example, you will see logs as shown below if execution of ops such as tril and lgamma_ falls back to the CPU:
CPU fallback tril : self=HABANAHalfType
CPU fallback lgamma_ : self=HABANAFloatType
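When debugging performance, it can help to collect which ops fell back. A small helper can scan log text of the form shown above (the line format is assumed from the example output; the helper itself is hypothetical):

```python
# Small helper to collect op names from CPU-fallback log lines of the
# form "CPU fallback <op> : self=<type>" (format assumed from the
# example output above).
def fallback_ops(log_text):
    ops = []
    for line in log_text.splitlines():
        if line.startswith("CPU fallback"):
            ops.append(line.split()[2])   # third token is the op name
    return ops

log = (
    "CPU fallback tril : self=HABANAHalfType\n"
    "CPU fallback lgamma_ : self=HABANAFloatType\n"
)
ops = fallback_ops(log)
```

A short list of fallback ops makes it easy to see which parts of a model run on the CPU instead of the HPU.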
2.7. Runtime Flags¶
The following table describes runtime flags that are set in the environment to change the behavior as well as enable or disable some features.
Enables Lazy Execution mode. (Habana PyTorch Bridge Modules)
Adds HCL_Sync to synchronize the hosts before collective calls. (Habana PyTorch Bridge)
A bitmask specifying which Habana PyTorch Bridge modules enable logging. (Habana PyTorch Bridge)
A bitmask specifying the Habana PyTorch Bridge logging level from SynapseAI and perf_lib. (Habana PyTorch Bridge Modules)
If set to
Logging level from SynapseAI and perf_lib.
By default, logs are placed either in the
Creates graph visualization files. The output graph dumps are placed in the ./.graph_dumps folder.
2.8. Python Package (habana_frameworks.torch)¶
This package provides PyTorch bridge interfaces and modules such as optimizers, mixed precision configuration, fused kernels for training on HPU and so on.
The various modules are organized as listed in the below example:
habana_frameworks.torch
  core
  hpex
    hmp
    kernels
    normalization
    optimizers
  utils
The following sections provide a brief description of each module.
The core module provides Python bindings to PyTorch-Habana bridge interfaces, such as mark_step, which is used to trigger execution of accumulated graphs in Lazy mode.
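The role of mark_step can be mimicked with a toy queue that defers ops until explicitly flushed. This is a conceptual model only; the real mark_step triggers execution of the SynapseAI graph accumulated by the bridge.

```python
# Toy model of Lazy-mode execution: ops accumulate in a queue and only
# run when mark_step() is called (conceptually mirroring the bridge's
# mark_step; the real function lives in habana_frameworks.torch.core).
pending = []
executed = []

def enqueue(op):
    pending.append(op)       # accumulated, not yet executed

def mark_step():
    while pending:
        executed.append(pending.pop(0))   # flush the accumulated graph

enqueue("mm")
enqueue("add")
assert executed == []        # nothing has run yet
mark_step()                  # the accumulated ops execute here
```

In a real Lazy-mode training script, mark_step is typically called once per iteration so that each step's accumulated graph is compiled and launched as a unit.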
The hpex/hmp module contains the habana_mixed_precision (hmp) tool, which can be used to train a model in mixed precision on HPU. Refer to PyTorch Mixed Precision Training on Gaudi for further details.
The hpex/kernels module contains Python interfaces to Habana-only custom operators, such as the EmbeddingBag and EmbeddingBagPreProc operators used in the Habana DLRM model.
The hpex/normalization module contains Python interfaces to the Habana implementation of common normalize-and-clip operations performed on gradients in some models. Using the Habana-provided implementation can give better performance compared to the equivalent operator provided in torch. Refer to Other Custom OPs for further details.
The hpex/optimizers module contains Python interfaces to Habana implementations of some common optimizers used in DL models. Using the Habana implementation can give better performance compared to the corresponding optimizer implementations available in torch. Refer to Custom Optimizers for further details.
The utils module contains general Python utilities required for training on HPU, such as load_habana_module, which is used to load the Habana libraries required for PyTorch to register HPU as one of the available devices.