2. PyTorch User Guide

2.1. Introduction

The purpose of this document is to guide data scientists in running PyTorch models on the Habana® Gaudi® infrastructure. It provides guidelines for modifying existing models to run on the platform and uses a basic example to demonstrate functionality. Support for PyTorch is currently under active development and available only in Beta.


Please make sure that the version of your SynapseAI software stack installation matches the version of the Docker images you are using. Our documentation on docs.habana.ai is also versioned, so select the appropriate version. The Setup and Install GitHub repository as well as the Model-References GitHub repository have branches for each release version; make sure you select the branch that matches the version of your SynapseAI software installation. For example, if SynapseAI software version 0.15.4 is installed, clone the Model-References repository like this: % git clone -b 0.15.4 https://github.com/HabanaAI/Model-References. To confirm the SynapseAI software version on your build, run the hl-smi tool and check the “Driver Version” field (see the figure below).


Figure 2.1 SynapseAI Version Check

2.2. PyTorch Gaudi Integration Architecture

The PyTorch Habana bridge interfaces between the framework and the SynapseAI software stack to drive the execution of deep learning models on the Habana Gaudi device. The installation package provided by Habana includes modifications on top of the standard PyTorch release, and this customized framework must be used to integrate PyTorch with the Habana bridge. PyTorch deep learning model training scripts need to load the PyTorch Habana plugin library and import the habana_frameworks.torch.core module to integrate with the Habana bridge.

Further integration details can be found at PyTorch Examples and Porting a PyTorch Model to Gaudi.

The Habana bridge supports various modes of execution for a PyTorch model. The following modes are supported:

  • Eager mode - op-by-op execution as defined in standard PyTorch eager mode scripts.

  • Lazy mode - deferred execution of graphs, comprising ops delivered from the script op by op, similar to Eager mode. It gives the Eager mode experience with performance on Gaudi.

2.2.1. Eager Mode

During eager mode execution, the framework executes one op at a time from Python. The Habana bridge registers these ops for the Habana device and drives their execution on Gaudi. For any op that is not supported by the Habana device, the bridge falls back to the CPU to execute that op, then continues executing subsequent supported ops on the device.

2.2.2. Graph Mode

In TorchScript mode, users need to modify the Python script to create a TorchScript model using torch.jit.trace, as required by the PyTorch framework. In this mode, Habana launches the TorchScript JIT subgraph as a single SynapseAI graph for execution, achieving better compute performance than eager mode.


Support for graph mode is minimal and may not be available in future releases.

2.2.3. Lazy Mode

With this mode, users retain the flexibility and benefits that come with the PyTorch define-by-run approach of the Eager mode. The Habana bridge internally accumulates these ops in a graph. The execution of the ops in the accumulated graph is triggered in a lazy manner, only when a tensor value is required by the user. This allows the bridge to construct a SynapseAI graph with multiple ops, which provides the SynapseAI graph compiler the opportunity to optimize the device execution for these ops. The figure below shows the architectural diagram for the PyTorch Habana full stack.


Figure 2.2 PyTorch Habana Full Stack Architecture
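The deferred execution described above can be illustrated with a minimal pure-Python sketch. This is an illustrative model only, not the actual bridge (all names here are invented): ops are accumulated into a graph and computed only when a value is requested.

```python
class LazyValue:
    """Records ops into a graph instead of executing them immediately;
    evaluates the accumulated graph only on demand (illustrative sketch)."""

    def __init__(self, value=None, op=None, inputs=()):
        self._value = value
        self._op = op
        self._inputs = inputs

    def __add__(self, other):
        # Nothing is computed here; a graph node is recorded.
        return LazyValue(op=lambda a, b: a + b, inputs=(self, other))

    def __mul__(self, other):
        return LazyValue(op=lambda a, b: a * b, inputs=(self, other))

    def item(self):
        # Requesting the value triggers evaluation of the whole
        # accumulated graph; a real bridge would hand the graph to the
        # graph compiler at this point.
        if self._op is not None and self._value is None:
            self._value = self._op(*(i.item() for i in self._inputs))
        return self._value

a, b = LazyValue(2), LazyValue(3)
c = (a + b) * a        # two ops accumulated, nothing executed yet
print(c.item())        # 10 -- evaluation triggered when the value is needed
```

The point of the deferral is visible in `item()`: because execution happens only when a value is required, the bridge sees the whole multi-op graph at once and can optimize it, rather than executing op by op.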

2.2.4. PyTorch Habana Bridge

This section describes the major components of the PyTorch Habana Bridge.

Synapse Lowering Module

The SynapseAI Lowering module converts the framework-provided op or graph to SynapseAI. When an op is invoked on Habana tensors, the PyTorch framework dispatches execution to the registered methods in the Habana bridge. In lazy evaluation mode, the Habana bridge internally builds a graph of accumulated ops. Once a tensor must be evaluated, the graph that produces the resulting tensor is identified for execution. Various optimization passes are applied to the graph, such as:

  • fusion of ops that are beneficial for Gaudi,

  • optimal placement of permutations for the channels-last memory format of tensors,

  • identification of persistent and non-persistent tensors, and of tensors with duplicate or aliased memory.

PyTorch Kernels

The PyTorch Kernel module within the Habana bridge provides the functionality to convert a PyTorch op into the appropriate SynapseAI ops. A PyTorch op may be implemented with one or more TPC/MME ops. The PyTorch Kernel module adds this set of SynapseAI ops to the SynapseAI graph and converts the PyTorch tensors to SynapseAI tensors for building the SynapseAI graph.

Execution Module

The Execution Module in the Habana bridge provides the functionality to compile a SynapseAI graph and launch the resulting recipe asynchronously. The recipes are cached by the Habana bridge to avoid recompilation of the same graph. This caching is done at the eager op level as well as at the JIT graph level. During training, graph compilation is only required for the initial iteration; thereafter, the same compiled recipe is re-executed every iteration (with new inputs) unless the ops being executed change.

Memory Manager

The Habana bridge has a memory manager that optimally serves allocation and free requests from the device memory. The bridge additionally provides the capability to create tensors with pinned memory, which reduces the time required for a DMA by avoiding a copy on the host side. The pinned memory feature can be expressed on a tensor with the existing flags provided by the PyTorch framework.

Mixed Precision Support

Habana PyTorch supports mixed precision execution using the Habana Mixed Precision (HMP) package. You can execute ops in FP32 or BF16 precision. The HMP package modifies the Python operators to add the appropriate cast operations to their arguments before execution. Refer to PyTorch Mixed Precision Training on Gaudi for further details.

Distributed Training

The collective ops are implemented using the Habana Communication Library (HCL), which is used to perform communication among different Gaudi cards (see Habana Communication Library (HCL) API Reference). Habana integrates the DistributedDataParallel (DDP) package to provide distributed training support on PyTorch, driving distributed communications for Habana tensors over the HCL implementation. DDP is integrated with the Eager, Graph, and Lazy evaluation execution modes.
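The recipe caching performed by the Execution Module can be sketched in pure Python. This is an illustrative model only (the class and method names are invented, not the bridge's API): recipes are keyed by a signature of the ops and input shapes, so recompilation happens only when the executed graph changes.

```python
# Illustrative sketch of recipe caching keyed by a graph signature.
# RecipeCache and its methods are hypothetical; the real cache lives
# inside the Habana bridge.
class RecipeCache:
    def __init__(self):
        self._cache = {}
        self.compile_count = 0

    def _compile(self, signature):
        # Stand-in for SynapseAI graph compilation.
        self.compile_count += 1
        return f"recipe<{signature}>"

    def get(self, ops, input_shapes):
        # A change in the ops or shapes yields a new signature, forcing
        # recompilation; otherwise the cached recipe is reused.
        signature = (tuple(ops), tuple(input_shapes))
        if signature not in self._cache:
            self._cache[signature] = self._compile(signature)
        return self._cache[signature]

cache = RecipeCache()
r1 = cache.get(["mm", "relu"], [(64, 1024)])
r2 = cache.get(["mm", "relu"], [(64, 1024)])   # cache hit: no recompilation
r3 = cache.get(["mm", "gelu"], [(64, 1024)])   # different ops: recompiled
print(cache.compile_count)  # 2
```

This mirrors the behavior described above: after the first training iteration compiles the graph, subsequent iterations reuse the cached recipe as long as the ops being executed do not change.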

2.3. PyTorch Mixed Precision Training on Gaudi

Habana Mixed Precision (HMP) package is a tool that allows you to run mixed precision training on HPU without extensive modifications to existing FP32 model scripts. You can easily add mixed precision training support to the model script by adding the following lines anywhere in the script before the start of the training loop:

from habana_frameworks.torch.hpex import hmp

Any segment of script (e.g. optimizer) in which you want to avoid using mixed precision should be kept under the following Python context:

from habana_frameworks.torch.hpex import hmp
with hmp.disable_casts():
  code line:1
  code line:2

2.3.1. Basic Design Rules

  • Two different lists are maintained: (i) OPs that always run in BF16 only, (ii) OPs that always run in FP32 only.

  • Python decorators are used to add required functionality (bf16 or fp32 casts on OP input(s)) to torch functions (refer to code snippet below).

  • Any OPs not in the above two lists will run with the precision type of their 1st input (except for the exceptions listed below).

  • For OPs with multiple tensor inputs (maintained in a separate list, e.g. add, sub, cat, stack etc.), cast all inputs to the widest precision type among all input precision types. If any of these OPs are in BF16 or FP32 list, that list has a higher precedence.

  • For in-place OPs (output & 1st input share storage), cast all inputs to precision type of 1st input.
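The precedence rules above can be sketched in pure Python. This is an illustrative model only; the lists and the function are invented stand-ins, not HMP internals, and "bf16"/"fp32" strings stand in for real dtypes.

```python
# Illustrative precision-resolution sketch following the rules above.
BF16_LIST = {"mm", "conv2d"}                      # always run in BF16
FP32_LIST = {"softmax", "log_softmax"}            # always run in FP32
MULTI_INPUT_OPS = {"add", "sub", "cat", "stack"}  # promote to widest input

def resolve_precision(op, input_dtypes, in_place=False):
    # The BF16/FP32 lists take the highest precedence.
    if op in BF16_LIST:
        return "bf16"
    if op in FP32_LIST:
        return "fp32"
    # In-place OPs follow the precision of the 1st input.
    if in_place:
        return input_dtypes[0]
    # Multi-input OPs cast all inputs to the widest precision present.
    if op in MULTI_INPUT_OPS:
        return "fp32" if "fp32" in input_dtypes else "bf16"
    # Otherwise, follow the precision of the 1st input.
    return input_dtypes[0]

print(resolve_precision("mm", ["fp32", "fp32"]))          # bf16
print(resolve_precision("add", ["bf16", "fp32"]))         # fp32
print(resolve_precision("relu", ["bf16"]))                # bf16
print(resolve_precision("add_", ["bf16", "fp32"], True))  # bf16
```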

from functools import wraps

def op_wrap(op, cast_fn):
    """Adds wrapper function to OPs. All tensor inputs
    for the OP are casted to the type determined by cast_fn.

    Args:
        op (torch.nn.functional/torch/torch.Tensor): Input OP
        cast_fn (to_bf16/to_fp32): Fn to cast input tensors

    Returns:
        Wrapper function that shall be inserted back into the
        corresponding module for this OP.
    """
    @wraps(op)
    def wrapper(*args, **kwds):
        args_cast = get_new_args(cast_fn, args, kwds)
        return op(*args_cast, **kwds)

    return wrapper
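A hypothetical usage of op_wrap is sketched below. The block repeats op_wrap so it is self-contained; get_new_args, to_bf16, and FakeTensor are invented stand-ins for illustration only (HMP's real helpers operate on torch tensors and dtypes).

```python
from dataclasses import dataclass, replace
from functools import wraps

@dataclass
class FakeTensor:            # stand-in for a torch tensor
    data: float
    dtype: str = "fp32"

def to_bf16(x):
    # Stand-in cast function: only "tensors" are cast.
    return replace(x, dtype="bf16") if isinstance(x, FakeTensor) else x

def get_new_args(cast_fn, args, kwds):
    # Stand-in for HMP's argument-casting helper.
    return tuple(cast_fn(a) for a in args)

def op_wrap(op, cast_fn):
    @wraps(op)
    def wrapper(*args, **kwds):
        args_cast = get_new_args(cast_fn, args, kwds)
        return op(*args_cast, **kwds)
    return wrapper

def mm(a, b):                # stand-in op; real HMP wraps torch.mm etc.
    assert a.dtype == b.dtype
    return FakeTensor(a.data * b.data, a.dtype)

# The wrapped op now casts its inputs before executing.
mm_bf16 = op_wrap(mm, to_bf16)
out = mm_bf16(FakeTensor(2.0), FakeTensor(3.0))
print(out.dtype, out.data)   # bf16 6.0
```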

2.3.2. Configuration Options

HMP provides two modes of mixed precision training to choose from (opt_level = O1/O2). Select a mode by passing opt_level as an argument to hmp.convert().

O1 is the default and recommended mode of operation when using HMP. O2 can be used for debugging convergence issues as well as for initial iterations of converting a new model to run with mixed precision.

Opt_level = O1

In this mode, OPs that always run in BF16 and OPs that always run in FP32 are selected from a BF16 list and FP32 list respectively. BF16 list contains OPs that are numerically safe to run in lower precision on HPU, whereas FP32 list contains OPs that should be run in higher precision (conservative choice that works across models).

  • Default BF16 list = [addmm, bmm, conv1d, conv2d, conv3d, dot, mm, mv]

  • Default FP32 list = [batch_norm, cross_entropy, log_softmax, softmax, nll_loss, topk]

HMP provides the option of overriding these internal lists, allowing you to provide your own BF16 and FP32 lists (pass bf16_file_path=<.txt> and fp32_file_path=<.txt> as arguments to hmp.convert()). This is particularly useful when customizing mixed precision training for a particular model. For example:

  • Custom BF16 list for Resnet50 = [ addmm, avg_pool2d, bmm, conv2d, dot, max_pool2d, mm, mv, relu, t, linear]

  • Custom FP32 list for Resnet50 = [cross_entropy, log_softmax, softmax, nll_loss, topk]

Opt_level = O2

In this mode, only GEMM and Convolution type OPs (e.g. conv1d, conv2d, conv3d, addmm, mm, bmm, mv, dot) run in BF16, and all other OPs run in FP32.
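The custom BF16/FP32 lists described above are supplied as text files. Assuming the file format is one OP name per line, generating them can be sketched as follows (paths and file names are illustrative):

```python
import os
import tempfile

# Example custom lists (the Resnet50 lists from this section).
bf16_ops = ["addmm", "avg_pool2d", "bmm", "conv2d", "dot",
            "max_pool2d", "mm", "mv", "relu", "t", "linear"]
fp32_ops = ["cross_entropy", "log_softmax", "softmax", "nll_loss", "topk"]

# Write one OP name per line (assumed list-file format).
out_dir = tempfile.mkdtemp()
bf16_path = os.path.join(out_dir, "ops_bf16.txt")
fp32_path = os.path.join(out_dir, "ops_fp32.txt")
for path, ops in ((bf16_path, bf16_ops), (fp32_path, fp32_ops)):
    with open(path, "w") as f:
        f.write("\n".join(ops) + "\n")

# On a Gaudi system the lists would then be passed to HMP, e.g.:
# hmp.convert(opt_level="O1", bf16_file_path=bf16_path,
#             fp32_file_path=fp32_path, isVerbose=False)
```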

2.3.3. Usage Examples

import torch
from habana_frameworks.torch.hpex import hmp

N, D_in, D_out = 64, 1024, 512
x = torch.randn(N, D_in, device="habana")
y = torch.randn(N, D_out, device="habana")

# enable mixed precision training with optimization level O1, default BF16 list, default FP32 list and logging disabled
# use opt_level to select desired mode of operation
# use bf16_file_path to provide absolute path to a file with custom BF16 list
# use fp32_file_path to provide absolute path to a file with custom FP32 list
# use isVerbose to disable/enable debug logs
hmp.convert(opt_level="O1", bf16_file_path="", fp32_file_path="", isVerbose=False)
model = torch.nn.Linear(D_in, D_out).to(torch.device("habana"))
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

for t in range(500):
   y_pred = model(x)
   loss = torch.nn.functional.mse_loss(y_pred, y)

   optimizer.zero_grad()
   loss.backward()

   # disable mixed precision for optimizer block
   with hmp.disable_casts():
      optimizer.step()

2.3.4. Debugging HMP Logs

HMP provides the ability to log precision decisions for each OP for debugging purposes. You can enable verbose logs by passing isVerbose = True as an argument to hmp.convert(). The log prints the precision type each time an OP (covered by a Python decorator) is called in the model being run. See the example below:

casting  <method '__mul__' of 'torch._C._TensorBase' objects>  to  to_fp32
casting  <built-in method embedding of type object at 0x7feab47edfa0>  to  to_fp32
casting  <function layer_norm at 0x7feaab2a4320>  to  to_bf16
casting  <function dropout at 0x7feaab2a2e60>  to  to_bf16
casting  <method 'matmul' of 'torch._C._TensorBase' objects>  to  to_bf16
casting  <method '__iadd__' of 'torch._C._TensorBase' objects>  to  to_bf16

Visualizing Torch Graph

Use torchviz (https://github.com/szagoruyko/pytorchviz) to visualize the model graph and check if cast nodes are inserted at expected positions to convert portions of the model to BF16.


Cast nodes show up as “CopyBackwards” in the graph.

2.4. PyTorch Examples

This section describes how to train models using Habana PyTorch with Gaudi.

2.4.1. Run Models in Habana Model Repository

  1. Make sure the drivers and firmware required for Gaudi are installed. See the Installation Guide. For Docker setup and installation details, refer to the Setup and Install GitHub page.

  2. Clone the models located in Model References GitHub page using Git clone.

  3. Launch runs on Gaudi using the README instructions located in PyTorch Models on GitHub.

2.5. Porting a PyTorch Model to Gaudi

To port your own models on Gaudi, refer to the Porting a Simple PyTorch Model to Gaudi section located in the Migration Guide.

2.6. Host and Device Ops Placement

When the model is ported to run on HPU, the software stack decides which ops are placed on CPU and which are placed on the HPU. This decision is based on whether the op is registered with PyTorch with HPU as the backend. Execution of an op automatically falls back to CPU if the op is not registered with its backend as HPU.
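The placement decision can be sketched in pure Python. This is an illustrative model only (the registry and function names are invented); the real dispatch happens inside PyTorch's op registration machinery.

```python
# Illustrative sketch: ops registered for the HPU backend run on the
# device; anything else falls back to a CPU implementation.
HPU_REGISTRY = {
    "mm": lambda *args: ("hpu", "mm"),       # stand-in HPU kernel
    "relu": lambda *args: ("hpu", "relu"),
}

def cpu_impl(op, *args):
    # Stand-in for the stock CPU implementation of an op.
    return ("cpu", op)

def dispatch(op, *args):
    fn = HPU_REGISTRY.get(op)
    if fn is None:
        # Op not registered with HPU as its backend: fall back to CPU.
        return cpu_impl(op, *args)
    return fn(*args)

print(dispatch("mm"))      # ('hpu', 'mm')
print(dispatch("tril"))    # ('cpu', 'tril')  -- mirrors the fallback log below
```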

To enable CPU fallback logs to check whether op execution fell back to CPU, set the environment variables as shown below:


For example, you will see logs as shown below if execution of the ops tril and lgamma_ falls back to the CPU:

CPU fallback tril : self=HABANAHalfType
CPU fallback lgamma_ : self=HABANAFloatType

2.7. Runtime Flags

The following list describes runtime flags that are set in the environment to change the behavior as well as enable or disable some features.

  • Enables Lazy Execution mode. (Habana PyTorch Bridge Modules)

  • Adds HCL_Sync to synchronize the hosts before collective calls. (Habana PyTorch Bridge)

  • A bitmask specifying the Habana PyTorch Bridge modules for which to enable logging; the default is All Modules. (Habana PyTorch Bridge)

      • 0x1 - Device logs
      • 0x2 - PT kernel/ops logs
      • 0x4 - Bridge logs
      • 0x8 - Synapse helper logs
      • 0x10 - Distributed module logs
      • 0x20 - Lazy mode logs
      • 0x40 - Pinned memory logs
      • 0x80 - CPU fallback logs

  • A bitmask specifying the Habana PyTorch Bridge logging level from SynapseAI and perf_lib. (Habana PyTorch Bridge Modules)

      • FATAL = 1
      • WARNING = 2
      • TRACE = 4
      • DEBUG = 8

  • If set to true, enables printing SynapseAI logs to the console.

  • Logging level from SynapseAI and perf_lib: 6 is no logs, 0 is verbose. By default, logs are placed either in the console (if ENABLE_CONSOLE=true) or under ~/.habana_logs/.

  • Creates graph visualization files. The output graph dumps are placed in the ./.graph_dumps folder.


2.8. Python Package (habana_frameworks.torch)

This package provides PyTorch bridge interfaces and modules such as optimizers, mixed precision configuration, fused kernels for training on HPU and so on.

The various modules are organized as listed in the below example:

habana_frameworks.torch/
├── core
├── hpex
│   ├── hmp
│   ├── kernels
│   ├── normalization
│   └── optimizers
└── utils

The following sections provide a brief description of each module.

2.8.1. core

The core module provides Python bindings to the PyTorch-Habana bridge interfaces, for example mark_step, which is used to trigger execution of the accumulated graphs in Lazy mode.

2.8.2. hpex/hmp

The hpex/hmp module contains the habana_mixed_precision (hmp) tool, which can be used to train a model in mixed precision on HPU. Refer to PyTorch Mixed Precision Training on Gaudi for further details.

2.8.3. hpex/kernels

The hpex/kernels module contains Python interfaces to Habana-only custom operators, such as the EmbeddingBag and EmbeddingBagPreProc operators used in the Habana DLRM model.

2.8.4. hpex/normalization

The hpex/normalization module contains Python interfaces to the Habana implementation of the common normalize and clip operations performed on gradients in some models. Using the Habana-provided implementation can give better performance (compared to the equivalent operators provided in torch). Refer to Other Custom OPs for further details.
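As an illustration of the kind of operation this module accelerates, here is a generic clip-by-global-norm sketch in pure Python. This is not the Habana API; on HPU, the fused implementation performs this on device.

```python
import math

def clip_by_global_norm(grads, max_norm, eps=1e-6):
    """Generic sketch: scale all gradients down when their global
    L2 norm exceeds max_norm (illustrative only)."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    scale = max_norm / (total_norm + eps)
    if scale < 1.0:
        grads = [g * scale for g in grads]
    return grads, total_norm

# Global norm of [3, 4] is 5; with max_norm=1 the gradients are scaled
# down by roughly 1/5.
grads, norm = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
print(norm)  # 5.0
```

A fused device-side version avoids launching a separate op per gradient tensor, which is where the performance benefit over the stock torch operators comes from.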

2.8.5. hpex/optimizers

The hpex/optimizers module contains Python interfaces to the Habana implementation of some common optimizers used in DL models. Using the Habana implementation can give better performance (compared to the corresponding optimizer implementations available in torch). Refer to Custom Optimizers for further details.

2.8.6. utils

The utils module contains general Python utilities required for training on HPU, such as load_habana_module, which is used to load the Habana libraries required for PyTorch to register HPU as one of the available devices.