2. PyTorch User Guide

2.1. Introduction

This document guides Data Scientists in running PyTorch models on the Habana Gaudi infrastructure. It provides guidelines for modifying existing models to run on the platform and uses a basic example to demonstrate functionality. Support for PyTorch is currently under active development and is available only in Beta.

2.2. PyTorch Habana Processing Unit (HPU) Integration Architecture

The PyTorch Habana bridge interfaces between the PyTorch framework and the SynapseAI software stack to drive the execution of deep learning models on the Habana Gaudi device. The installation package provided by Habana contains modifications on top of the PyTorch release, and this customized framework must be used to integrate PyTorch with the Habana bridge. PyTorch deep learning model training scripts need to load the PyTorch Habana plugin library and import the habana_frameworks.torch.core module to integrate with the Habana bridge.
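A minimal sketch of this integration step is shown below. It assumes the Habana-provided PyTorch build is installed and that importing habana_frameworks.torch.core is sufficient to register the device in that build; the "habana" device string follows the usage example later in this guide.

import torch
import habana_frameworks.torch.core  # integrates the script with the Habana bridge, per the guidance above

# Move the model and data to the Habana device before training
device = torch.device("habana")
model = torch.nn.Linear(1024, 512).to(device)
inputs = torch.randn(64, 1024, device=device)
outputs = model(inputs)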

Further integration details can be found at PyTorch Examples and Porting a PyTorch Model to Gaudi.

The Habana bridge supports various modes of execution for a PyTorch DL model. The following modes are supported:

  • Eager mode - op-by-op execution as defined in standard PyTorch eager mode scripts.

  • Graph mode (TorchScript torch.jit.trace) - TorchScript based execution as defined in PyTorch.

  • Lazy mode - deferred execution of graphs, consisting of ops delivered from the script op by op, similar to Eager mode. It gives the Eager mode experience with Gaudi-level performance.

2.2.1. Eager Mode

During eager mode execution, the framework executes one op at a time from Python. The Habana bridge registers these ops for the Habana device and drives their execution on Gaudi. For any op that is not supported by the Habana device, the bridge falls back to the CPU to execute that op and then continues execution of subsequent supported ops on the device.
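As an illustration, each line in the sketch below dispatches a single op to the Habana bridge as it is executed; the model and shapes are illustrative only, and the "habana" device string follows the usage example later in this guide.

import torch
import habana_frameworks.torch.core  # integrate with the Habana bridge

a = torch.randn(8, 8, device="habana")
b = torch.randn(8, 8, device="habana")
c = a + b          # executed op by op on the device
d = torch.relu(c)  # an unsupported op would fall back to the CPU transparently
print(d.sum())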

2.2.2. Graph Mode

For TorchScript mode, users need to modify the Python script to create a TorchScript model using torch.jit.trace, as required by the PyTorch framework. In this mode, Habana launches the TorchScript JIT subgraph as a single SynapseAI graph for execution to achieve better compute performance compared to eager mode.
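A minimal sketch of creating such a TorchScript model with torch.jit.trace is shown below; the model and input shapes are illustrative only.

import torch
import habana_frameworks.torch.core

model = torch.nn.Sequential(torch.nn.Linear(1024, 512), torch.nn.ReLU())
example_input = torch.randn(64, 1024)

# Trace the model to produce a TorchScript module, then move it to the Habana device
traced_model = torch.jit.trace(model, example_input)
traced_model = traced_model.to(torch.device("habana"))
output = traced_model(example_input.to(torch.device("habana")))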

Note

Support for graph mode is minimal and may not be available in future releases.

2.2.3. Lazy Evaluation Mode

With this mode, users retain the flexibility and benefits that come with the PyTorch define-by-run approach of the Eager mode. The Habana bridge internally accumulates these ops in a graph. The execution of the ops in the accumulated graph is triggered in a lazy manner, only when a tensor value is required by the user. This allows the bridge to construct a SynapseAI graph with multiple ops, which provides the SynapseAI graph compiler the opportunity to optimize the device execution for these ops. The figure below shows the architectural diagram for the PyTorch Habana full stack.

Figure 2.1 PyTorch Habana Full Stack Architecture (image: PyTorch_SW_Stack.png)
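To illustrate the lazy evaluation described above, the following conceptual sketch accumulates several ops and triggers execution only when a tensor value is actually needed. It assumes Lazy mode is enabled (see PT_HPU_LAZY_MODE under Runtime Flags).

import torch
import habana_frameworks.torch.core

x = torch.randn(64, 1024, device="habana")
w = torch.randn(1024, 512, device="habana")

# These ops are accumulated into a graph by the bridge rather than executed immediately
y = torch.relu(x @ w)
z = y * 2.0

# Requesting the value (copying to CPU / printing) triggers compilation and execution
# of the accumulated ops as a single SynapseAI graph
print(z.sum().to("cpu"))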

2.2.4. PyTorch Habana Bridge

This section describes the major components of the PyTorch Habana Bridge.

2.2.4.1. Synapse Lowering Module

The SynapseAI Lowering module converts the framework-provided op or graph to SynapseAI. The PyTorch framework dispatches execution to the registered methods in the Habana bridge when an op is invoked on Habana tensors. In Graph mode, the Habana bridge applies a fusion pass on the JIT graph to encapsulate the differentiable JIT subgraph into a single HabanaFusedOp containing the subgraph ops. In lazy evaluation mode, the Habana bridge internally builds a graph from the accumulated ops. Once a tensor is required to be evaluated, the graph that must be executed to produce that tensor is identified. Various optimization passes are then applied to the graph, such as fusion of ops that are beneficial on Gaudi, optimal placement of permutations for the channels-last memory format of tensors, and identification of persistent tensors, non-persistent tensors, and tensors with duplicate or aliased memory.

2.2.4.2. PyTorch Kernels

The PyTorch Kernel module within the Habana bridge provides the functionality to convert a PyTorch op into the appropriate SynapseAI ops. A PyTorch op may be implemented with a single TPC/MME op or with multiple ops. The PyTorch Kernel module adds this set of SynapseAI ops to the SynapseAI graph and converts the PyTorch tensors to SynapseAI tensors for building the SynapseAI graph.

2.2.4.3. Execution Module

The Execution Module in the Habana bridge provides the functionality to compile a SynapseAI graph and launch the resulting recipe asynchronously. Recipes are cached by the Habana bridge to avoid recompiling the same graph. This caching is done both at the eager op level and at the JIT graph level. For models that execute the same set of ops every iteration, graph compilation is only required for the initial training iteration; thereafter, the same recipe is re-executed with new inputs.

2.2.4.4. Memory Manager

The Habana bridge has a memory manager that optimally serves allocation and free requests from the device memory. The Habana bridge additionally provides the capability to create tensors with pinned memory, which reduces the time required for DMA transfers by avoiding a copy on the host side. The pinned memory feature can be expressed on a tensor with the existing flags provided by the PyTorch framework.
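For example, pinned host memory can be requested with the standard PyTorch flags, such as pin_memory=True on a DataLoader or Tensor.pin_memory(). The sketch below is illustrative and assumes the Habana build honors these flags as described above.

import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(1000, 1024), torch.randint(0, 10, (1000,)))

# pin_memory=True asks PyTorch to place host-side batches in pinned memory,
# which can speed up host-to-device DMA transfers as described above
loader = DataLoader(dataset, batch_size=64, pin_memory=True)

# The same can be requested on an individual tensor
pinned = torch.randn(64, 1024).pin_memory()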

2.2.4.5. Mixed Precision Support

Habana PyTorch supports mixed precision execution using the Habana Mixed Precision (HMP) package. You can execute ops in FP32 or BF16 precision. The HMP package modifies the Python operators to add the appropriate cast operations on the arguments before execution. Refer to PyTorch Mixed Precision Training on Gaudi for further details.

2.2.4.6. Distributed Training

The collective ops are implemented using the Habana Communication Library (HCL), which is used to perform communication among different Gaudi cards (see Habana Communication Library (HCL) API Reference). Habana integrates the DistributedDataParallel (DDP) package to provide distributed training support on PyTorch and is integrated with the PyTorch framework to drive distributed communications for Habana tensors over the HCL implementation. DDP is supported in Eager, Graph, and Lazy evaluation execution modes. The PyTorch HPU integration also provides user-friendly tools for hardware monitoring, performance profiling, and error debugging. For these tools, refer to the Profiler User Guide.
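A heavily simplified sketch of wrapping a model in DistributedDataParallel is shown below. The process-group backend name and initialization details here are assumptions for illustration only; take the exact initialization from the Habana distributed training documentation.

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
import habana_frameworks.torch.core

# Typically provided by the launcher; read here only to make the sketch self-contained
rank = int(os.environ.get("RANK", 0))
world_size = int(os.environ.get("WORLD_SIZE", 1))

# The backend name ("hcl") is an assumption for illustration; consult the
# HCL / distributed training documentation for the exact initialization
dist.init_process_group(backend="hcl", rank=rank, world_size=world_size)

device = torch.device("habana")
model = torch.nn.Linear(1024, 512).to(device)
ddp_model = DDP(model)  # gradient reduction across Gaudi cards goes through HCL

output = ddp_model(torch.randn(64, 1024, device=device))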

2.3. PyTorch Mixed Precision Training on Gaudi

The Habana Mixed Precision (HMP) package is a tool that allows you to run mixed precision training on HPU without extensive modifications to existing FP32 model scripts. You can easily add mixed precision training support to the model script by adding the following lines anywhere in the script before the start of the training loop:

from hmp import hmp
hmp.convert()

Any segment of the script (e.g. the optimizer) in which you want to avoid using mixed precision should be kept under the following Python context:

from hmp import hmp
with hmp.disable_casts():
  code line:1
  code line:2

2.3.1. Basic Design Rules

  • Maintain two different lists: (i) OPs that always run in BF16 only, and (ii) OPs that always run in FP32 only.

  • Use Python decorators to add the required functionality (BF16 or FP32 casts on OP inputs) to torch functions (see the illustrative sketch after this list).

  • Any OP not in the above two lists runs with the precision type of its 1st input (except for the exceptions listed below).

  • For OPs with multiple tensor inputs (maintained in a separate list, e.g. add, sub, cat, stack etc.), cast all inputs to the widest precision type among all input precision types. If any of these OPs are also in the BF16 or FP32 list, that list takes precedence.

  • For in-place OPs (where the output and 1st input share storage), cast all inputs to the precision type of the 1st input.
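The sketch below illustrates the decorator-based casting approach from the rules above with a simplified, hypothetical cast_inputs decorator. It is not the actual HMP implementation, only a conceptual example of wrapping a torch function so its tensor inputs are cast before the call.

import functools
import torch

def cast_inputs(dtype):
    # Simplified illustration: cast all tensor arguments to `dtype` before calling the op
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            args = [a.to(dtype) if isinstance(a, torch.Tensor) else a for a in args]
            kwargs = {k: (v.to(dtype) if isinstance(v, torch.Tensor) else v)
                      for k, v in kwargs.items()}
            return fn(*args, **kwargs)
        return wrapper
    return decorator

# Conceptually, an op on the BF16 list is wrapped to cast its inputs to bfloat16,
# while an op on the FP32 list is wrapped to cast its inputs to float32
mm_bf16 = cast_inputs(torch.bfloat16)(torch.mm)
softmax_fp32 = cast_inputs(torch.float32)(torch.softmax)

out = softmax_fp32(torch.randn(4, 4, dtype=torch.bfloat16), dim=-1)  # inputs cast up to FP32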

2.3.2. Configuration Options

HMP provides two modes of mixed precision training (opt_level = O1/O2) to choose from. The mode is selected by passing opt_level as an argument to hmp.convert().

O1 is the default and recommended mode of operation when using HMP. O2 can be used for debugging convergence issues as well as for initial iterations of converting a new model to run with mixed precision.

2.3.2.1. Opt_level = O1

In this mode, OPs that always run in BF16 and OPs that always run in FP32 are selected from a BF16 list and an FP32 list respectively. The BF16 list contains OPs that are numerically safe to run in lower precision on HPU, whereas the FP32 list contains OPs that should run in higher precision (a conservative choice that works across models).

  • Default BF16 list = [addmm, bmm, conv1d, conv2d, conv3d, dot, mm, mv]

  • Default FP32 list = [batch_norm, cross_entropy, log_softmax, softmax, nll_loss, topk]

HMP provides the option of overriding these internal lists, allowing you to provide your own BF16 and FP32 lists (pass bf16_file_path=<.txt> and fp32_file_path=<.txt> as arguments to hmp.convert()). This is particularly useful when customizing mixed precision training for a particular model. For example:

  • Custom BF16 list for Resnet50 = [ addmm, avg_pool2d, bmm, conv2d, dot, max_pool2d, mm, mv, relu, t, linear]

  • Custom FP32 list for Resnet50 = [cross_entropy, log_softmax, softmax, nll_loss, topk]
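As an illustration, such a custom list could be supplied as a plain text file. The format assumed here (one op name per line) is for illustration only; check the file format expected by bf16_file_path / fp32_file_path.

addmm
avg_pool2d
bmm
conv2d
relu
linear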

2.3.2.2. Opt_level = O2

In this mode, only GEMM and Convolution type OPs (e.g. conv1d, conv2d, conv3d, addmm, mm, bmm, mv, dot) run in BF16, and all other OPs run in FP32.
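For example, O2 is selected the same way as O1, by passing the opt_level argument:

from hmp import hmp

# run GEMM/Convolution type OPs in BF16 and everything else in FP32
hmp.convert(opt_level="O2")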

2.3.3. Usage Examples

import torch
from hmp import hmp

N, D_in, D_out = 64, 1024, 512
x = torch.randn(N, D_in, device="habana")
y = torch.randn(N, D_out, device="habana")

# enable mixed precision training with optimization level O1
hmp.convert(opt_level="O1", bf16_file_path="..", fp32_file_path="..")

model = torch.nn.Linear(D_in, D_out).to(torch.device("habana"))
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

for t in range(500):
   y_pred = model(x)
   loss = torch.nn.functional.mse_loss(y_pred, y)
   optimizer.zero_grad()
   loss.backward()

   # disable mixed precision for optimizer block
   with hmp.disable_casts():
      optimizer.step()

2.3.4. Debugging

2.3.4.1. HMP Logs

HMP provides the ability to log precision decisions for each OP for debugging purposes. You can enable verbose logs by passing isVerbose = True as an argument to hmp.convert(). The log prints the precision type each time an OP (covered by a Python decorator) is called in the model being run. See the example below:

casting  <method '__mul__' of 'torch._C._TensorBase' objects>  to  to_fp32
casting  <built-in method embedding of type object at 0x7feab47edfa0> to_fp32
casting  <function layer_norm at 0x7feaab2a4320> to_bf16
casting  <function dropout at 0x7feaab2a2e60> to_bf16
casting  <method 'matmul' of 'torch._C._TensorBase' objects> to_bf16
casting  <method '__iadd__' of 'torch._C._TensorBase' objects>  to  to_bf16

2.3.4.2. Visualizing Torch Graph

Use torchviz (https://github.com/szagoruyko/pytorchviz) to visualize the model graph and check if cast nodes are inserted at expected positions to convert portions of the model to BF16.
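A brief sketch of generating such a visualization with torchviz is shown below; the model here is illustrative only, and the rendered graph should be inspected for the inserted cast nodes when HMP is enabled.

import torch
from torchviz import make_dot

model = torch.nn.Sequential(torch.nn.Linear(1024, 512), torch.nn.ReLU())
x = torch.randn(64, 1024)
y = model(x)

# Render the autograd graph to a file; cast nodes inserted by HMP appear in this graph
dot = make_dot(y, params=dict(model.named_parameters()))
dot.render("model_graph")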

Note

Cast nodes show up as “CopyBackwards” in the graph.

2.4. PyTorch Examples

This section describes how to train models using Habana PyTorch with Gaudi.

2.4.1. Run Models in Habana Model Repository

  1. Make sure the drivers and firmware required for Gaudi are installed. See the Installation Guide. For Docker setup and installation details, refer to the Setup and Install GitHub page.

  2. Clone the models located on the Model References GitHub page using git clone.

  3. Launch runs on Gaudi using the README instructions located in PyTorch Models on GitHub.

2.5. Porting a PyTorch Model to Gaudi

To port your own models on Gaudi, refer to the Porting a Simple PyTorch Model to Gaudi section located in the Migration Guide.

2.6. CPU vs HPU Usage Definition

When the model is ported to run on HPU, the software stack decides which ops are placed on the CPU and which are placed on the HPU. This decision is based on whether the op is registered with PyTorch with HPU as the backend. Execution of an op automatically falls back to the CPU if the op is not registered with HPU as its backend.

To enable CPU fallback logs to check whether op execution fell back to CPU, set the environment variables as shown below:

PT_HPU_LOG_MOD_MASK=80 PT_HPU_LOG_TYPE_MASK=8

For example, you will see logs as shown below if execution of the ops tril and lgamma_ falls back to the CPU:

CPU fallback tril : self=HABANAHalfType
CPU fallback lgamma_ : self=HABANAFloatType

2.7. Runtime Flags

The following table describes runtime flags that can be set in the environment to change behavior and to enable or disable certain features.

Flag: PT_HPU_LAZY_MODE
  Default: Unset
  Description: Enables Lazy Execution mode.
  Consumer: Habana PyTorch Bridge Modules

Flag: PT_HPU_GRAPH_FUSION_OPS_FILE
  Default: Unset
  Description: Specifies the fusion ops list for Trace Mode Execution. Ops specified here will be fused into a single graph.
  Consumer: Habana PyTorch Bridge Modules

Flag: PT_HPU_LOG_MOD_MASK
  Default: All Modules
  Description: A bitmask specifying which Habana PyTorch Bridge modules have logging enabled.
    • 0x1 - Device logs
    • 0x2 - PT kernel/ops logs
    • 0x4 - Bridge logs
    • 0x8 - Synapse helper logs
    • 0x10 - Distributed module logs
    • 0x20 - Lazy mode logs
    • 0x40 - Pinned memory logs
    • 0x80 - CPU fallback logs
  Consumer: Habana PyTorch Bridge Modules

Flag: PT_HPU_LOG_TYPE_MASK
  Default: 3
  Description: A bitmask specifying the Habana PyTorch Bridge logging level from SynapseAI and perf_lib.
    • FATAL = 1
    • WARNING = 2
    • TRACE = 4
    • DEBUG = 8
  Consumer: Habana PyTorch Bridge Modules

Flag: ENABLE_CONSOLE
  Default: False
  Description: If set to true, enables printing SynapseAI logs to the console.
  Consumer: SynapseAI

Flag: LOG_LEVEL_ALL
  Default: 5
  Description: Logging level from SynapseAI and perf_lib.
    • 6 is no logs
    • 0 is verbose
    By default, logs are placed either in the console (if ENABLE_CONSOLE=true) or under ~/.habana_logs/.
  Consumer: SynapseAI

Flag: GRAPH_VISUALIZATION
  Default: False
  Description: Enables creation of graph visualization files in SynapseAI.
  Consumer: SynapseAI