PyTorch Gaudi Theory of Operations

The Intel® Gaudi® PyTorch bridge interfaces between the PyTorch framework and the Intel Gaudi software stack to drive the execution of deep learning models on the Intel Gaudi AI accelerator. The installation package provided by Intel Gaudi contains modifications on top of the standard PyTorch release, and this customized framework must be used to integrate PyTorch with the Intel Gaudi PyTorch bridge. PyTorch training scripts need to load the Intel Gaudi PyTorch plugin library and import the habana_frameworks.torch.core module to integrate with the Intel Gaudi PyTorch bridge.

Further integration details can be found at Importing PyTorch Models Manually.

The Intel Gaudi PyTorch bridge supports various modes of execution for a PyTorch model. The following modes are supported:

  • Eager mode - Op-by-op execution as defined in standard PyTorch Eager mode scripts.

  • Eager mode extended with torch.compile - Similar to Eager mode, but extended with wrapping the complete model, or parts of it (such as a function), into a graph. Parts that are not wrapped are executed eagerly.

  • Lazy mode - Deferred execution of graphs comprising ops delivered from the script op by op, similar to Eager mode. It gives the Eager mode experience combined with Gaudi performance. Unlike Eager mode with torch.compile, the graph is analyzed in each iteration, leading to higher CPU usage.

Note

Eager mode as a subset of Lazy mode is deprecated.

Eager Mode

During Eager mode execution, the framework executes one op at a time from Python. The Intel Gaudi PyTorch bridge registers these ops for the Gaudi device and drives their execution on Gaudi. For any op that is not supported by the Gaudi device, the Intel Gaudi PyTorch bridge falls back to the CPU to execute that op, and continues execution on the device for subsequent supported ops.

Starting from the v1.13.0 release, a preview of PyTorch Eager mode is supported. This is the default mode if Lazy mode is disabled using the following environment variable:

PT_HPU_LAZY_MODE=0

This flag needs to be set before invoking a Python script that imports HPU (import habana_frameworks.torch), since Eager mode uses a separate backend library that replaces the Lazy backend.
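For example, the variable can be set at the top of the script before any HPU import (a minimal sketch; the rest of the training script is assumed):

import os
os.environ["PT_HPU_LAZY_MODE"] = "0"  # must be set before the HPU import below

import torch
import habana_frameworks.torch.core as htcore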

Eager mode support, as a subset of Lazy mode, is deprecated. The functionality can be emulated by using the PT_HPU_MAX_COMPOUND_OP_SIZE environment variable to limit cluster sizes to 1. This results in compiling a small graph (one op) after each op.

The software stack reads this environment variable once at the time of initialization:

import os
os.environ["PT_HPU_MAX_COMPOUND_OP_SIZE"] = "1"

Note

Currently, Eager mode is supported on PyTorch ResNet50 for Gaudi 2. See Model References GitHub repository.

Eager Mode with torch.compile

torch.compile, introduced in PyTorch 2.0, allows a model (or part of a model, such as a function) to be captured and compiled into a graph.

The model script requires additional changes marking the parts that should be handled by torch.compile. See the torch.compile documentation and torch.compile tutorial for more details.

Model parts wrapped with torch.compile are compiled once at the beginning, and the compiled version is invoked thereafter. Model parts without such wrapping run in pure Eager mode, each op executed separately, which affects overall performance.

In upcoming Intel Gaudi software releases, Eager mode extended with torch.compile will replace Lazy mode. Unlike Lazy mode, Eager mode with torch.compile does not require rebuilding a graph in each iteration, which reduces host computation overhead.

The following shows an example of MNIST extended to use torch.compile:

def train(args, model, device, train_loader, optimizer, epoch):
    model.train()
    # Compile the model once with the HPU backend; subsequent calls reuse the compiled graph.
    model = torch.compile(model, backend="hpu_backend")

    def train_function(data, target):
        optimizer.zero_grad()
        output = model(data)
        loss = F.nll_loss(output, target)
        loss.backward()
        optimizer.step()
        return loss

    training_step = 0
    for batch_idx, (data, target) in enumerate(train_loader):
        data, target = data.to(device), target.to(device)
        loss = train_function(data, target)
        if batch_idx % args.log_interval == 0:
            print('Train Epoch: {} [{}/{} ({:.0f}%)]\tLoss: {:.6f}'.format(
                epoch, batch_idx * len(data),
                len(train_loader.dataset) / args.world_size,
                100. * batch_idx / len(train_loader), loss.item()))
            if batch_idx != 0 and args.dry_run:
                break
        if args.max_training_step != 0:
            training_step += 1
            if training_step == args.max_training_step:
                break

For HPU, hpu_backend must be provided as the backend parameter to torch.compile, for both training and inference.
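For inference, the wrapping looks the same (a minimal sketch; the model and inputs are assumed to already be on the HPU device):

compiled_model = torch.compile(model, backend="hpu_backend")
with torch.no_grad():
    output = compiled_model(inputs)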

Note

Currently, Eager mode with torch.compile support is also available for a limited number of models on Gaudi 2:

  • PyTorch Lightning ResNet50

  • BERT Pretraining phase1

  • BERT Nvidia FineTuning

See Model References GitHub repository.

Lazy Mode

With this mode, users retain the flexibility and benefits that come with the PyTorch define-by-run approach of Eager mode. The Intel Gaudi PyTorch bridge internally accumulates these ops in a graph. Execution of the ops in the accumulated graph is triggered in a lazy manner, only when a tensor value is required by the user. This allows the bridge to construct an Intel Gaudi graph with multiple ops, which gives the Intel Gaudi graph compiler the opportunity to optimize device execution for these ops. The figure below shows the architectural diagram for the PyTorch Intel Gaudi full stack.

[Figure: PyTorch_SW_Stack_Intel.png]

Figure 6 PyTorch Intel Gaudi Full Stack Architecture
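In Lazy mode, execution can also be triggered explicitly with htcore.mark_step(). A minimal sketch of a typical training step follows (model, optimizer, loss_fn, and train_loader setup are assumed):

import torch
import habana_frameworks.torch.core as htcore

device = torch.device("hpu")
for data, target in train_loader:
    data, target = data.to(device), target.to(device)
    optimizer.zero_grad()
    output = model(data)
    loss = loss_fn(output, target)
    loss.backward()
    htcore.mark_step()  # trigger execution of the accumulated forward/backward graph
    optimizer.step()
    htcore.mark_step()  # trigger execution of the optimizer graph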

Intel Gaudi PyTorch Bridge

This section describes the major components of the Intel Gaudi PyTorch bridge.

Intel Gaudi Software Lowering Module

The Lowering module converts the framework-provided ops or graphs into the Intel Gaudi software representation. The PyTorch framework dispatches execution to the registered methods in the Intel Gaudi PyTorch bridge when an op is invoked on Intel Gaudi tensors. In Lazy evaluation mode, the Intel Gaudi bridge internally builds a graph with the accumulated ops. Once a tensor is required to be evaluated, the graph that needs to be executed to produce that tensor is identified. Various optimization passes are then applied to the graph, such as:

  • fusion of ops that are beneficial for Gaudi,

  • optimal placement of permutations for channel last memory format of tensors,

  • identification of persistent and non-persistent tensors, as well as tensors with duplicate or aliased memory.

PyTorch Kernels

The PyTorch Kernel module within the Intel Gaudi PyTorch bridge provides the functionality to convert a PyTorch op into the appropriate Intel Gaudi software ops. A single PyTorch op may be implemented with one or multiple TPC/MME ops. The PyTorch Kernel module adds this set of software ops to the graph and converts the PyTorch tensors to Intel Gaudi tensors for building the Intel Gaudi graph.

Execution Module

The Execution module in the Intel Gaudi PyTorch bridge provides the functionality to compile an Intel Gaudi graph and launch the resulting recipe asynchronously. The recipes are cached by the Intel Gaudi PyTorch bridge to avoid recompiling the same graph. This caching is done at the eager op level as well as at the JIT graph level. During training, graph compilation is only required for the initial iteration; thereafter, the same compiled recipe is re-executed every iteration (with new inputs) unless the set of ops being executed changes.

Memory Manager

The Intel Gaudi PyTorch bridge has a memory manager that optimally serves allocation and free requests for device memory. The Intel Gaudi PyTorch bridge additionally provides the capability to create tensors with pinned memory, which reduces the time required for a DMA by avoiding a copy on the host side. The pinned memory feature can be requested for a tensor with the existing flags provided by the PyTorch framework.
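Pinned memory is requested through the standard PyTorch interfaces, for example (a minimal sketch; the dataset and batch size are assumptions):

import torch

# Request pinned host memory via the DataLoader flag...
loader = torch.utils.data.DataLoader(dataset, batch_size=32, pin_memory=True)

# ...or directly on a tensor.
t = torch.randn(1024).pin_memory()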

Mixed Precision Support

Gaudi supports mixed precision training using PyTorch autocast. Autocast is a native PyTorch module that allows running mixed precision training without extensive modifications to an existing FP32 model script. It executes operations registered to autocast using a lower-precision floating-point datatype. The module is provided through the torch.amp package.
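A typical autocast region on Gaudi follows the native PyTorch pattern (a minimal sketch; model, data, target, and loss_fn are assumed to be defined and placed on the HPU device):

import torch

with torch.autocast(device_type="hpu", dtype=torch.bfloat16):
    output = model(data)
    loss = loss_fn(output, target)
loss.backward()  # the backward pass runs outside the autocast region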

For more details on PyTorch autocast, see Mixed Precision Training with PyTorch Autocast.

Distributed Training

Intel Gaudi PyTorch implements the HCCL communication backend to support scale-up and scale-out. See Distributed Training with PyTorch.
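Initializing the HCCL backend follows the standard torch.distributed flow (a minimal sketch; rank and world size are assumed to be provided by the launcher environment):

import torch
import habana_frameworks.torch.distributed.hccl  # registers the hccl backend

torch.distributed.init_process_group(backend="hccl")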

Intel Gaudi Media Loader

habana_dataloader is an accelerated dataloader which can operate in different modes. The optimal one is selected based on the underlying hardware:

  • In Gaudi 2, the dataloader uses hardware-based decoders for acceleration, lowering the load on the host CPU.

  • In first-gen Gaudi, it uses either the framework default dataloader or AEON based dataloader, depending on the use case. Both are done on the host CPU.

For further details on habana_dataloader setup and usage, refer to Intel Gaudi Media Loader.

The habana_dataloader inherits from the native torch.utils.data.DataLoader and maintains the same interface from the user perspective. Internally, habana_dataloader falls back to the native torch data loader if the provided parameters are not supported.

The dataloader is imported and used similarly to the torch DataLoader. For example:

import habana_dataloader

loader = habana_dataloader.HabanaDataLoader(
    dataset, batch_size=args.batch_size, sampler=train_sampler,
    num_workers=args.workers, pin_memory=True, drop_last=True)

The following are full examples of models using habana_dataloader with PyTorch: