2. PyTorch User Guide

2.1. Introduction

This document guides data scientists in running PyTorch models on the Habana® Gaudi® infrastructure. It provides guidelines for modifying existing models to run on the platform and uses a basic example to demonstrate functionality.

2.2. PyTorch Gaudi Integration Architecture

The PyTorch Habana bridge interfaces between the framework and the SynapseAI software stack to drive the execution of deep learning models on the Habana Gaudi device. The installation package provided by Habana includes modifications on top of the standard PyTorch release; this customized framework must be used to integrate PyTorch with the Habana bridge. PyTorch deep learning model training scripts need to load the PyTorch Habana plugin library and import the habana_frameworks.torch.core module to integrate with the Habana bridge.

Further integration details can be found at PyTorch Examples and Porting a PyTorch Model to Gaudi.

The Habana bridge supports the following modes of execution for a PyTorch model:

  • Eager mode - op-by-op execution as defined in standard PyTorch eager mode scripts.

  • Lazy mode - deferred execution of graphs, comprising ops delivered from the script op by op, similar to Eager mode. It gives the Eager mode experience with improved performance on Gaudi.

2.2.1. Eager Mode

During eager mode execution, the framework executes one op at a time from Python. The Habana bridge registers these ops for the Habana device and drives their execution on Gaudi. For any op that is not supported by the Habana device, the bridge falls back to the CPU to execute that op, and then resumes execution on the device for subsequent supported ops.
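The dispatch-with-fallback flow can be sketched in plain Python. The op tables and dispatch function below are purely illustrative (the real bridge registers kernels with the PyTorch dispatcher; these names are not its actual API):

```python
# Hypothetical op tables -- purely illustrative of the fallback flow.
DEVICE_OPS = {"add": lambda a, b: a + b}                        # ops supported on the device
CPU_OPS = {"add": lambda a, b: a + b, "sub": lambda a, b: a - b}  # CPU can run everything

def dispatch(op, *args):
    """Run op on the device if supported, otherwise fall back to the CPU."""
    table = DEVICE_OPS if op in DEVICE_OPS else CPU_OPS
    return table[op](*args)

print(dispatch("add", 2, 3))  # supported: executes on the device -> 5
print(dispatch("sub", 2, 3))  # unsupported: falls back to the CPU -> -1
```

After a CPU fallback, subsequent supported ops are dispatched to the device again, exactly as described above.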

2.2.2. Lazy Mode

With this mode, users retain the flexibility and benefits that come with the PyTorch define-by-run approach of the Eager mode. The Habana bridge internally accumulates these ops in a graph. The execution of the ops in the accumulated graph is triggered in a lazy manner, only when a tensor value is required by the user. This allows the bridge to construct a SynapseAI graph with multiple ops, which provides the SynapseAI graph compiler the opportunity to optimize the device execution for these ops. The figure below shows the architectural diagram for the PyTorch Habana full stack.


Figure 2.1 PyTorch Habana Full Stack Architecture
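The deferred-execution idea described above can be sketched in plain Python. The LazyGraph class below is purely illustrative (it is not the bridge's actual data structure); it only shows how ops accumulate and run as a batch when a value is demanded:

```python
# Conceptual sketch of lazy accumulation: ops are queued as they are
# issued and executed only when a value is actually needed, which lets
# a compiler see (and optimize) the whole graph at once.
class LazyGraph:
    def __init__(self, value):
        self.value = value
        self.ops = []          # accumulated, not-yet-executed ops

    def apply(self, fn):
        self.ops.append(fn)    # record the op; nothing runs yet
        return self

    def item(self):
        # Triggered only when the user needs the value: the whole
        # accumulated graph is executed in one go.
        for fn in self.ops:
            self.value = fn(self.value)
        self.ops = []
        return self.value

t = LazyGraph(2).apply(lambda x: x * 3).apply(lambda x: x + 1)
print(t.item())  # execution is triggered here -> 7
```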

2.2.3. PyTorch Habana Bridge

This section describes the major components of the PyTorch Habana Bridge.

Synapse Lowering Module

The SynapseAI Lowering module converts the framework-provided op or graph to SynapseAI. The PyTorch framework dispatches execution to the registered methods in the Habana bridge when an op is invoked on Habana tensors. In lazy evaluation mode, the Habana bridge internally builds a graph of accumulated ops. Once a tensor needs to be evaluated, the graph that produces it is identified for execution. Various optimization passes are applied to the graph, such as:

  • fusion of ops that are beneficial for Gaudi,

  • optimal placement of permutations for channel last memory format of tensors,

  • identification of persistent and non-persistent tensors, and of tensors with duplicate or aliased memory.

PyTorch Kernels

The PyTorch Kernel module within the Habana bridge converts a PyTorch op into the appropriate SynapseAI ops. A PyTorch op may be implemented with one or more TPC/MME ops. The PyTorch Kernel module adds this set of SynapseAI ops to the SynapseAI graph and converts the PyTorch tensors to SynapseAI tensors for building the graph.

Execution Module

The Execution Module in the Habana bridge compiles a SynapseAI graph and launches the resulting recipe asynchronously. Recipes are cached by the Habana bridge to avoid recompilation of the same graph. This caching is done at an eager op level as well as at a JIT graph level. During training, graph compilation is only required for the initial iteration; thereafter, the same compiled recipe is re-executed every iteration (with new inputs) unless the set of ops being executed changes.

Memory Manager

The Habana bridge has a memory manager that optimally serves allocation and free requests for device memory. The bridge additionally provides the capability to create tensors with pinned memory, which reduces the time required for a DMA transfer by avoiding a copy on the host side. The pinned memory feature can be expressed on a tensor with the existing flags provided by the PyTorch framework.

Mixed Precision Support

Habana PyTorch supports mixed precision execution using the Habana Mixed Precision (HMP) package. You can execute ops in FP32 or BF16 precision. The HMP package modifies the Python operators to add the appropriate cast operations to their arguments before execution. Refer to PyTorch Mixed Precision Training on Gaudi for further details.

Distributed Training

Habana PyTorch implements the HCCL communication backend to support scale-up and scale-out.

Habana Collective Communications Library (HCCL)

Collective ops are implemented using the Habana Collective Communications Library (HCCL) and used to perform communication among different Gaudi cards (see Habana Collective Communications Library (HCCL) API Reference). HCCL is integrated with torch.distributed as a communication backend. This is used to enable DistributedDataParallel (DDP) and the different PyTorch distributed collectives.

HCCL supports scale-up using RDMA and scale-out via RDMA or host NIC. It is enabled by default for all the supported topologies.

The supported scale-out topologies and distributed frameworks are:

  • Gaudi based, RDMA based scaling - torch Distributed; scale-up using RDMA, scale-out using RDMA.

  • Host NIC scaling, TCP/IP based scaling - torch Distributed; enabled with HCCL_OVER_TCP=1 (see Configuration Knobs); scale-up using RDMA, scale-out using HCCL based host NIC.

  • Host NIC scaling, Libfabric based scaling - torch Distributed; see Scale-Out via Host-NIC over OFI; scale-up using RDMA, scale-out using HCCL based host NIC.

The PyTorch HPU integration also provides user-friendly tools for hardware monitoring, performance profiling, and error debugging. For these tools, refer to the Profiler User Guide.

Habana Data Loader

Habana data loader is a CPU-based accelerated data loader for the ImageNet dataset. It inherits from the native torch.utils.data.DataLoader and maintains the same interface from the user's perspective. Internally, the Habana data loader falls back to the native torch data loader if the provided parameters are not supported.

The data loader is imported and used similarly to the torch DataLoader. For example:

import habana_dataloader
data_loader = habana_dataloader.HabanaDataLoader(
    dataset, batch_size=args.batch_size, sampler=train_sampler,
    num_workers=args.workers, pin_memory=True, drop_last=True)

See the ResNet Model References GitHub page for the full example.

Fallback

When the provided input parameters are not eligible for CPU acceleration (see Current Limitations), the native torch data loader is initialized and used. In such a case, the following message will be printed:

Failed to initialize Habana Dataloader, error: {error message}
Running with PyTorch Dataloader

Current Limitations
  • Acceleration takes place only with the following parameters (ResNet-based configuration):

    • shuffle=False

    • batch_sampler=None

    • num_workers=8

    • collate_fn=None

    • pin_memory=True

    • timeout=0

    • worker_init_fn=None

    • multiprocessing_context=None

    • generator=None

    • prefetch_factor=2

    • persistent_workers=False

    • dataset is torchvision.datasets.ImageFolder

  • The dataset should contain only .jpg or .jpeg files.

  • Acceleration can take place only with the following dataset torchvision transforms (packed as transforms.Compose):

    • RandomResizedCrop

    • CenterCrop

    • Resize

    • ToTensor

    • RandomHorizontalFlip, only with p=0.5

    • Normalize, only with mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]
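As an illustration, the parameter rules above can be collected into a small eligibility check. The helper below is hypothetical (it is not part of the habana_dataloader API, which performs an equivalent check internally before falling back), and it only covers the keyword-argument rules, not the dataset and transform rules:

```python
# Hypothetical helper mirroring the parameter rules listed above.
# Dataset type and transform checks are intentionally omitted.
_REQUIRED = {
    "shuffle": False, "batch_sampler": None, "num_workers": 8,
    "collate_fn": None, "pin_memory": True, "timeout": 0,
    "worker_init_fn": None, "multiprocessing_context": None,
    "generator": None, "prefetch_factor": 2, "persistent_workers": False,
}

def is_accelerated(**kwargs):
    """True if every provided DataLoader kwarg matches the supported value."""
    return all(kwargs[k] == v for k, v in _REQUIRED.items() if k in kwargs)

print(is_accelerated(num_workers=8, pin_memory=True))  # True
print(is_accelerated(shuffle=True))                    # False: shuffle must be False
print(is_accelerated(prefetch_factor=4))               # False: must be 2
```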

2.3. PyTorch Mixed Precision Training on Gaudi

The Habana Mixed Precision (HMP) package allows you to run mixed precision training on HPU without extensive modifications to existing FP32 model scripts. You can add mixed precision training support to a model script by adding the following lines anywhere in the script before the start of the training loop:

from habana_frameworks.torch.hpex import hmp
hmp.convert()

Any segment of the script (e.g. the optimizer) in which you want to avoid using mixed precision should be kept under the following Python context:

from habana_frameworks.torch.hpex import hmp
with hmp.disable_casts():
  # code to run without mixed precision, e.g. optimizer.step()
  ...

2.3.1. Basic Design Rules

  • Two different lists are maintained: (i) OPs that always run in BF16 only, (ii) OPs that always run in FP32 only.

  • Python decorators are used to add required functionality (bf16 or fp32 casts on OP input(s)) to torch functions (refer to code snippet below).

  • Any OP not in the above two lists runs with the precision type of its first input (except for the cases listed below).

  • For OPs with multiple tensor inputs (maintained in a separate list, e.g. add, sub, cat, stack etc.), cast all inputs to the widest precision type among all input precision types. If any of these OPs are in BF16 or FP32 list, that list has a higher precedence.

  • For in-place OPs (where the output and first input share storage), all inputs are cast to the precision type of the first input.
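The widest-precision rule for multi-input OPs can be sketched in a few lines (illustrative only, assuming just the BF16 and FP32 types used on HPU):

```python
# Illustrative sketch of the widest-type promotion rule for
# multi-input OPs: the only property that matters here is bit width.
WIDTH = {"bf16": 16, "fp32": 32}

def promote(input_dtypes):
    """Pick the widest precision type among all input precision types."""
    return max(input_dtypes, key=WIDTH.get)

print(promote(["bf16", "fp32", "bf16"]))  # -> fp32
print(promote(["bf16", "bf16"]))          # -> bf16
```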

from functools import wraps

def op_wrap(op, cast_fn):
    """Adds a wrapper function to OPs. All tensor inputs
    for the OP are cast to the type determined by cast_fn.

    Args:
        op (torch.nn.functional/torch/torch.Tensor): Input OP
        cast_fn (to_bf16/to_fp32): Fn to cast input tensors

    Returns:
        Wrapper function that shall be inserted back into the
        corresponding module for this OP.
    """
    @wraps(op)
    def wrapper(*args, **kwds):
        args_cast = get_new_args(cast_fn, args, kwds)
        return op(*args_cast, **kwds)

    return wrapper
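A self-contained toy version of the same pattern shows the wrapper's effect. The stand-in "tensors", the to_bf16 cast, and the get_new_args helper below are all hypothetical simplifications (the real bridge casts torch tensors):

```python
from functools import wraps

# Toy "tensors" carry a dtype tag; the real bridge casts torch tensors.
def to_bf16(t):
    return {"dtype": "bf16", "data": t["data"]}

def get_new_args(cast_fn, args, kwds):
    # Stand-in for the helper referenced in the snippet above:
    # cast every tensor-like positional argument.
    return tuple(cast_fn(a) if isinstance(a, dict) else a for a in args)

def op_wrap(op, cast_fn):
    @wraps(op)
    def wrapper(*args, **kwds):
        args_cast = get_new_args(cast_fn, args, kwds)
        return op(*args_cast, **kwds)
    return wrapper

def add(a, b):
    assert a["dtype"] == b["dtype"]  # inputs now agree on dtype
    return {"dtype": a["dtype"], "data": a["data"] + b["data"]}

bf16_add = op_wrap(add, to_bf16)
out = bf16_add({"dtype": "fp32", "data": 1.0}, {"dtype": "fp32", "data": 2.0})
print(out["dtype"], out["data"])  # bf16 3.0
```

Both FP32 inputs are cast to BF16 before the op runs, which is the behavior the decorator adds to the torch functions in the BF16 list.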

2.3.2. Configuration Options

HMP provides two modes (opt_level = O1/O2) of mixed precision training to choose from. These modes can be chosen by passing opt_level= as an argument to hmp.convert().

O1 is the default and recommended mode of operation when using HMP. O2 can be used for debugging convergence issues as well as for the initial iterations of converting a new model to run with mixed precision.

Opt_level = O1

In this mode, OPs that always run in BF16 and OPs that always run in FP32 are selected from a BF16 list and an FP32 list respectively. The BF16 list contains OPs that are numerically safe to run in lower precision on HPU, whereas the FP32 list contains OPs that should run in higher precision (a conservative choice that works across models).

  • Default BF16 list = [addmm, bmm, conv1d, conv2d, conv3d, dot, mm, mv]

  • Default FP32 list = [batch_norm, cross_entropy, log_softmax, softmax, nll_loss, topk]

HMP provides the option of overriding these internal lists, allowing you to provide your own BF16 and FP32 lists (pass bf16_file_path=<.txt> and fp32_file_path=<.txt> as arguments to hmp.convert()). This is particularly useful when customizing mixed precision training for a particular model. For example:

  • Custom BF16 list for Resnet50 = [ addmm, avg_pool2d, bmm, conv2d, dot, max_pool2d, mm, mv, relu, t, linear]

  • Custom FP32 list for Resnet50 = [cross_entropy, log_softmax, softmax, nll_loss, topk]

Opt_level = O2

In this mode, only GEMM and Convolution type OPs (e.g. conv1d, conv2d, conv3d, addmm, mm, bmm, mv, dot) run in BF16, while all other OPs run in FP32.

2.3.3. Usage Examples

import torch
from habana_frameworks.torch.hpex import hmp

N, D_in, D_out = 64, 1024, 512
x = torch.randn(N, D_in, device="hpu")
y = torch.randn(N, D_out, device="hpu")

# enable mixed precision training with optimization level O1, default BF16 list, default FP32 list and logging disabled
# use opt_level to select desired mode of operation
# use bf16_file_path to provide absolute path to a file with custom BF16 list
# use fp32_file_path to provide absolute path to a file with custom FP32 list
# use isVerbose to disable/enable debug logs
hmp.convert(opt_level="O1", bf16_file_path="", fp32_file_path="", isVerbose=False)
model = torch.nn.Linear(D_in, D_out).to(torch.device("hpu"))
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

for t in range(500):
   y_pred = model(x)
   loss = torch.nn.functional.mse_loss(y_pred, y)

   optimizer.zero_grad()
   loss.backward()

   # disable mixed precision for optimizer block
   with hmp.disable_casts():
      optimizer.step()

2.3.4. Debugging HMP Logs

HMP provides the ability to log precision decisions for each OP for debugging purposes. You can enable verbose logs by passing isVerbose = True as an argument to hmp.convert(). The log prints the precision type each time an OP (covered by a Python decorator) is called in the model being run. See the example below:

casting  <method '__mul__' of 'torch._C._TensorBase' objects>  to  to_fp32
casting  <built-in method embedding of type object at 0x7feab47edfa0> to_fp32
casting  <function layer_norm at 0x7feaab2a4320> to_bf16
casting  <function dropout at 0x7feaab2a2e60> to_bf16
casting  <method 'matmul' of 'torch._C._TensorBase' objects> to_bf16
casting  <method '__iadd__' of 'torch._C._TensorBase' objects>  to  to_bf16

Visualizing Torch Graph

Use torchviz (https://github.com/szagoruyko/pytorchviz) to visualize the model graph and check if cast nodes are inserted at expected positions to convert portions of the model to BF16.


Cast nodes show up as "CopyBackwards" in the graph.

2.4. PyTorch Examples

This section describes how to train models using Habana PyTorch with Gaudi.

2.4.1. Run Models in Habana Model Repository

  1. Make sure the drivers and firmware required for Gaudi are installed. For Docker setup and installation details, refer to the Setup and Install GitHub page.

  2. Clone the models from the Model-References GitHub repository using git clone.

  3. Launch runs on Gaudi using the README instructions located in PyTorch Models on GitHub.

2.4.2. Model Specific Requirement files

The PyTorch Docker image comes pre-installed with common Python packages. You may need to install model-specific Python packages using the provided requirements files. Refer to the model-specific README in the PyTorch Models repository on GitHub.

2.5. Porting a PyTorch Model to Gaudi

To port your own models on Gaudi, refer to the Porting a Simple PyTorch Model to Gaudi section located in the Migration Guide.

2.6. Host and Device Ops Placement

For details on placement of Ops on HPU, refer to Placement of Ops on HPU.

2.7. Runtime Flags

The following table describes runtime flags that can be set in the environment to change behavior and to enable or disable features.







  • Enables Lazy Execution mode. Scope: Habana PyTorch Bridge modules.

  • A bitmask specifying which Habana PyTorch Bridge modules enable logging (default: all modules). Scope: Habana PyTorch Bridge.

    • 0x1 - Device logs

    • 0x2 - PT kernel/ops logs

    • 0x4 - Bridge logs

    • 0x8 - Synapse helper logs

    • 0x10 - Distributed module logs

    • 0x20 - Lazy mode logs

    • 0x40 - Pinned memory logs

    • 0x80 - CPU fallback logs

  • A bitmask specifying the Habana PyTorch Bridge logging level from SynapseAI and perf_lib. Scope: Habana PyTorch Bridge modules.

    • FATAL = 1

    • WARNING = 2

    • TRACE = 4

    • DEBUG = 8

  • ENABLE_CONSOLE - If set to true, enables printing SynapseAI logs to the console.

  • Logging level from SynapseAI and perf_lib: 6 disables logging, 0 is the most verbose. By default, logs are written either to the console (if ENABLE_CONSOLE=true) or under ~/.habana_logs/.

  • Creates graph visualization files. The output graph dumps are placed in the ./.graph_dumps folder.


2.8. Python Package (habana_frameworks.torch)

This package provides PyTorch bridge interfaces and modules such as optimizers, mixed precision configuration, fused kernels for training on HPU and so on.

The various modules are organized as listed in the below example:


The following sections provide a brief description of each module.

2.8.1. core

The core module provides Python bindings to the PyTorch-Habana bridge interfaces, for example mark_step, which is used to trigger execution of accumulated graphs in Lazy mode.

2.8.2. hccl

The hccl module registers and adds support for the HCCL communication backend.

2.8.3. hpex/hmp

The hpex/hmp module contains the habana_mixed_precision (hmp) tool, which can be used to train a model in mixed precision on HPU. Refer to PyTorch Mixed Precision Training on Gaudi for further details.

2.8.4. hpex/kernels

The hpex/kernels module contains Python interfaces to Habana-only custom operators, such as the EmbeddingBag and EmbeddingBagPreProc operators.

2.8.5. hpex/normalization

The hpex/normalization module contains Python interfaces to the Habana implementation of the common gradient normalize and clip operations performed in some models. Using the Habana-provided implementation can yield better performance than the equivalent operators provided in torch. Refer to Other Custom OPs for further details.

2.8.6. hpex/optimizers

The hpex/optimizers module contains Python interfaces to the Habana implementation of some common optimizers used in DL models. Using the Habana implementation can yield better performance than the corresponding optimizer implementations available in torch. Refer to Custom Optimizers for further details.

2.8.7. utils

The utils module contains general Python utilities required for training on HPU, such as load_habana_module, which loads the Habana libraries required for PyTorch to register HPU as one of the available devices.