PyTorch Gaudi Integration Architecture

The PyTorch Habana bridge interfaces between the framework and the SynapseAI software stack to drive the execution of deep learning models on the Habana Gaudi device. The installation package provided by Habana includes modifications on top of the PyTorch release, and this customized framework must be used to integrate PyTorch with the Habana bridge. PyTorch training scripts need to load the PyTorch Habana plugin library and import the habana_frameworks.torch.core module to integrate with the Habana bridge.

Further integration details can be found at Porting a Simple PyTorch Model to Gaudi.

The Habana bridge supports various modes of execution for a PyTorch model. The following modes are supported:

  • Eager mode - op-by-op execution as defined in standard PyTorch eager mode scripts.

  • Lazy mode - deferred execution of graphs comprising ops delivered from the script op by op, as in Eager mode. It combines the Eager mode experience with graph-level performance on Gaudi.

Eager Mode

During Eager mode execution, the framework executes one op at a time from Python. The Habana bridge registers these ops for the Habana device and drives their execution on Gaudi. For any op that the Habana device does not support, the bridge falls back to the CPU to execute that op, then continues execution on the device for the supported ops that follow.

Habana’s integration with PyTorch supports Eager execution mode. Although the default execution mode is Lazy, Eager mode can be enabled from within a Python script by setting an environment variable as shown below. Alternatively, you can set the environment variable on the command line.

The software stack reads this environment variable once at the time of initialization.

import os
os.environ["PT_HPU_LAZY_MODE"] = "2"
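Because the variable is read only once at initialization, it has to be set before the bridge is loaded. The following is a minimal sketch of that ordering, not an official recipe. The try/except guard and the HABANA_AVAILABLE flag are assumptions added so the snippet also runs on a machine without the Habana package installed.

```python
import os

# Select Eager mode ("2", per the text above) BEFORE the bridge initializes,
# since the software stack reads the variable only once.
os.environ["PT_HPU_LAZY_MODE"] = "2"

try:
    # Importing this module integrates the script with the Habana bridge.
    import habana_frameworks.torch.core as htcore
    HABANA_AVAILABLE = True
except ImportError:
    # Assumption for this sketch: tolerate a missing Habana installation.
    htcore = None
    HABANA_AVAILABLE = False

print("PT_HPU_LAZY_MODE =", os.environ["PT_HPU_LAZY_MODE"])
print("Habana bridge loaded:", HABANA_AVAILABLE)
```

On the command line, the equivalent is to prefix the launch command with the variable assignment, for example PT_HPU_LAZY_MODE=2 python train.py, where train.py is a hypothetical script name.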

Lazy Mode

With this mode, users retain the flexibility and benefits that come with the PyTorch define-by-run approach of the Eager mode. The Habana bridge internally accumulates these ops in a graph. The execution of the ops in the accumulated graph is triggered in a lazy manner, only when a tensor value is required by the user. This allows the bridge to construct a SynapseAI graph with multiple ops, which provides the SynapseAI graph compiler the opportunity to optimize the device execution for these ops. The figure below shows the architectural diagram for the PyTorch Habana full stack.


Figure 9 PyTorch Habana Full Stack Architecture
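The deferred execution that Lazy mode relies on can be illustrated with a toy example. The sketch below is purely conceptual and is not the bridge's implementation: LazyTensor and all of its methods are hypothetical names, and only scalar add and mul ops are modeled. It shows ops being recorded into a graph and executed only when a value is requested.

```python
class LazyTensor:
    """Toy stand-in for a tensor whose value is computed on demand."""

    def __init__(self, op, inputs=(), value=None):
        self.op = op          # name of the op that produces this tensor
        self.inputs = inputs  # upstream LazyTensor nodes
        self._value = value   # stays None until the node is executed

    @staticmethod
    def constant(value):
        return LazyTensor("const", value=value)

    def __add__(self, other):
        return LazyTensor("add", (self, other))  # record the op, do not run it

    def __mul__(self, other):
        return LazyTensor("mul", (self, other))  # record the op, do not run it

    def pending_ops(self):
        """Return the unexecuted nodes feeding this tensor, in execution order."""
        order, seen = [], set()

        def visit(node):
            if id(node) in seen or node._value is not None:
                return
            seen.add(id(node))
            for inp in node.inputs:
                visit(inp)
            order.append(node)

        visit(self)
        return order

    def item(self):
        """Requesting the value triggers execution of the accumulated graph."""
        for node in self.pending_ops():
            a, b = (inp._value for inp in node.inputs)
            node._value = a + b if node.op == "add" else a * b
        return self._value


x = LazyTensor.constant(2.0)
y = LazyTensor.constant(3.0)
z = (x + y) * y                              # two ops recorded, none executed
recorded = [n.op for n in z.pending_ops()]   # ["add", "mul"]
result = z.item()                            # the whole graph runs here
print(recorded, result)
```

In the real stack the accumulated ops are handed to the SynapseAI graph compiler as one graph rather than interpreted one by one, which is where the optimization opportunity comes from.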

PyTorch Habana Bridge

This section describes the major components of the PyTorch Habana Bridge.

Synapse Lowering Module

The SynapseAI Lowering module converts the framework-provided op or graph to SynapseAI. When an op is invoked on Habana tensors, the PyTorch framework dispatches the execution to the registered methods in the Habana bridge. In lazy evaluation mode, the Habana bridge internally builds a graph from the accumulated ops. Once a tensor needs to be evaluated, the bridge identifies the associated graph that must be executed to produce it. Various optimization passes are then applied to the graph, such as:

  • fusion of ops that are beneficial for Gaudi,

  • optimal placement of permutations for channel last memory format of tensors,

  • identification of persistent tensors, non-persistent tensors, and tensors with duplicate or aliased memory.

PyTorch Kernels

The PyTorch Kernel module within the Habana bridge provides the functionality to convert a PyTorch op into the appropriate SynapseAI ops. A PyTorch op can be implemented with one or more TPC/MME ops. The PyTorch Kernel module adds this set of SynapseAI ops to the SynapseAI graph and converts the PyTorch tensors to SynapseAI tensors for building the graph.

Execution Module

The Execution Module in the Habana bridge provides the functionality to compile a SynapseAI graph and launch the resulting recipe asynchronously. The Habana bridge caches recipes to avoid recompiling the same graph. This caching is done at the eager op level as well as at the JIT graph level. During training, graph compilation is required only for the initial iteration. Thereafter, the same compiled recipe is re-executed every iteration (with new inputs) unless the ops being executed change.
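The recipe caching described above can be sketched as a compile-once lookup. This is a conceptual illustration only: RecipeCache, toy_compile, and the op names are hypothetical, and keying the cache on the op sequence alone is an assumption made for brevity (a real key would likely include more, such as tensor shapes and data types).

```python
class RecipeCache:
    """Toy cache that compiles each distinct op sequence only once."""

    def __init__(self, compile_fn):
        self._compile_fn = compile_fn
        self._recipes = {}
        self.compile_count = 0  # how many times compilation actually ran

    def get(self, ops):
        key = tuple(ops)  # assumption: the op sequence identifies the graph
        if key not in self._recipes:
            self.compile_count += 1
            self._recipes[key] = self._compile_fn(ops)
        return self._recipes[key]


def toy_compile(ops):
    """Hypothetical 'compiler': turns op names into one callable recipe."""
    table = {"double": lambda v: v * 2, "inc": lambda v: v + 1}
    fns = [table[name] for name in ops]

    def recipe(value):
        for fn in fns:
            value = fn(value)
        return value

    return recipe


cache = RecipeCache(toy_compile)

# Three "training iterations" with the same ops but new inputs:
# compilation happens on the first iteration only.
outputs = [cache.get(["double", "inc"])(step) for step in range(3)]
compiles_after_loop = cache.compile_count

# A change in the ops being executed forces a new compilation.
cache.get(["inc"])(0)
print(outputs, compiles_after_loop, cache.compile_count)
```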

Memory Manager

The Habana bridge has a memory manager that optimally serves allocation and free requests for device memory. The Habana bridge additionally provides the capability to create tensors with pinned memory, which reduces the time required for a DMA transfer by avoiding a copy on the host side. Pinned memory can be requested for a tensor with the existing flags provided by the PyTorch framework.

Mixed Precision Support

Habana PyTorch supports mixed precision execution using the Habana Mixed Precision (HMP) package. You can execute the ops in FP32 or BF16 precision. The HMP package modifies the Python operators to add the appropriate cast operations for the arguments before execution. Refer to PyTorch Mixed Precision Training on Gaudi for further details.

Distributed Training

Habana PyTorch implements the HCCL communication backend to support scale-up and scale-out. See Distributed Training with PyTorch.

Habana Data Loader

The Habana data loader is a CPU-based accelerated data loader for the ImageNet and COCO datasets. It inherits from the native torch data loader and maintains the same interface from the user's perspective. Internally, the Habana data loader falls back to the native torch data loader if the provided parameters are not supported.

The data loader is imported and used in the same way as the torch DataLoader. For example:

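import habana_dataloader

# Takes the same arguments as torch.utils.data.DataLoader
train_loader = habana_dataloader.HabanaDataLoader(
    dataset, batch_size=args.batch_size, sampler=train_sampler,
    num_workers=args.workers, pin_memory=True, drop_last=True)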

See ResNet Model References GitHub page for the full example.

See SSD Model References GitHub page for the full example.


When the provided input parameters are not eligible for CPU acceleration (see Current Limitations), the native torch data loader is initialized and used instead. In that case, the following message is printed:

Failed to initialize Habana Dataloader, error: {error message}
Running with PyTorch Dataloader

Current Limitations

  • ResNet Limitations:

    • Acceleration takes place only with the following parameters:

      • shuffle=False

      • batch_sampler=None

      • num_workers=8

      • collate_fn=None

      • pin_memory=True

      • timeout=0

      • worker_init_fn=None

      • multiprocessing_context=None

      • generator=None

      • prefetch_factor=2

      • persistent_workers=False

      • dataset is torchvision.datasets.ImageFolder

    • The dataset should contain only .jpg or .jpeg files.

    • Acceleration can take place only with the following dataset torchvision transforms (packed as transforms.Compose):

      • RandomResizedCrop

      • CenterCrop

      • Resize

      • ToTensor

      • RandomHorizontalFlip, only with p=0.5

      • Normalize, only with mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]

  • SSD Limitations:

    • Acceleration takes place only with the following parameters:

      • dataset is an instance of COCODetection. See SSD Model References GitHub page for the full example.

      • batch_sampler=None

      • num_workers=12

      • pin_memory=True

      • timeout=0

      • worker_init_fn=None

      • drop_last=True

      • prefetch_factor=2

      • persistent_workers=False

    • The dataset should be taken from the COCO Dataset webpage.

    • Acceleration can take place only with the following dataset transforms:

      • SSDCropping. See SSD Model References GitHub page for the full example.

      • Resize

      • ColorJitter, only with brightness=0.125, contrast=0.5, saturation=0.5, hue=0.05

      • ToTensor

      • RandomHorizontalFlip, only with p=0.5

      • Normalize, only with mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]

      • Encoder. See SSD Model References GitHub page for the full example.
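The parameter constraints listed above can be checked before constructing the loader, which helps explain why a run fell back to the native torch data loader. The sketch below is a hypothetical helper and not part of the Habana package: RESNET_REQUIRED, SSD_REQUIRED, and unmet_requirements are illustrative names, only the keyword parameters are covered (not the dataset type or transforms), and a parameter that is not passed explicitly is reported as unmet, so callers should pass the effective value of every listed parameter.

```python
# Required keyword values, copied from the limitation lists above.
RESNET_REQUIRED = {
    "shuffle": False, "batch_sampler": None, "num_workers": 8,
    "collate_fn": None, "pin_memory": True, "timeout": 0,
    "worker_init_fn": None, "multiprocessing_context": None,
    "generator": None, "prefetch_factor": 2, "persistent_workers": False,
}
SSD_REQUIRED = {
    "batch_sampler": None, "num_workers": 12, "pin_memory": True,
    "timeout": 0, "worker_init_fn": None, "drop_last": True,
    "prefetch_factor": 2, "persistent_workers": False,
}

_NOT_PASSED = "<not passed>"  # sentinel for parameters the caller omitted


def unmet_requirements(kwargs, required):
    """Return {name: (expected, actual)} for each listed parameter that differs."""
    unmet = {}
    for name, expected in required.items():
        actual = kwargs.get(name, _NOT_PASSED)
        if actual is _NOT_PASSED or actual != expected:
            unmet[name] = (expected, actual)
    return unmet


# Example: a ResNet configuration that differs only in num_workers.
resnet_kwargs = dict(RESNET_REQUIRED, num_workers=4)
problems = unmet_requirements(resnet_kwargs, RESNET_REQUIRED)
print(problems)
```

An empty result means the keyword parameters match the listed constraints. The dataset and transform constraints still need to be verified separately.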