PyTorch CustomOp API

This document describes the API exposed for writing custom PyTorch operators for the Intel® Gaudi® AI accelerator. The API provides the ability to implement a custom HPU kernel for new PyTorch operators, allowing PyTorch to execute a given operator on a Gaudi device.

Note

The API described in this document allows creating a CustomOp in all PyTorch modes: Lazy, Eager, and torch.compile. The legacy API, which supports only Lazy mode, is now deprecated but can still be used. For the legacy PyTorch CustomOp API documentation, see CustomOp API Legacy.

Prerequisites

The TPC (Tensor Processor Core) is a fully programmable core designed for workloads that do not map to matrix multiplication operations. A TPC kernel is a concrete implementation that performs a desired operation. Before working with the PyTorch CustomOp API, you must prepare the TPC kernel.

This document does not describe how to implement custom TPC kernels. For information on how to write TPC kernels, refer to the following:

API Overview

The main part of the public interface resides in hpu_custom_op_pt2.h header file. It contains all the necessary declarations to define a custom Intel Gaudi PyTorch kernel.

The following lists the most important classes and structs to interact with:

Basic Workflow for Writing CustomOp

  1. To define a custom UserCustomOpDescriptor, call the registerUserCustomOp function:

    1. Define the output metadata callback function, OutputMetaFn. See execution example here.

    2. If necessary, define the kernel user-params callback function, FillParamsFn. See execution example here.

    3. Call registerUserCustomOp with the schema name, the TPC kernel GUID, and the callback function pointers.

  2. Create the main execution function for the CustomOp:

    1. Access the UserCustomOpDescriptor registered in the previous step using getUserCustomOpDescriptor.

    2. Call execute with a vector of IValue inputs.

  3. Define the PyTorch schema for the CustomOp using TORCH_LIBRARY and TORCH_LIBRARY_IMPL:

    1. Define the op schema using TORCH_LIBRARY.

    2. Register the execution function from the previous step with the PyTorch dispatcher using TORCH_LIBRARY_IMPL.

    3. Define a Meta implementation for torch.compile support using TORCH_LIBRARY_IMPL. See examples in Meta Implementation and Real Life PyTorch Example.
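Putting the steps together, the resulting C++ file might look like the following C++-style pseudocode sketch. This is illustrative only and is not compilable as-is: the habana::custom_op names and signatures are assumptions inferred from the step descriptions above (check hpu_custom_op_pt2.h for the exact declarations); only TORCH_LIBRARY and TORCH_LIBRARY_IMPL are standard PyTorch macros, and the custom_topk op mirrors the Python usage shown later in this document.

```
// Step 1: register the descriptor (schema name, TPC kernel GUID, callbacks).
// OutputMetaFn computes output shapes/dtypes; FillParamsFn fills the TPC
// kernel's user-params struct. Signatures here are illustrative.
habana::custom_op::registerUserCustomOp(
    "custom_op::custom_topk",   // schema name
    "topk",                     // TPC kernel GUID (assumed)
    output_meta_fn,             // OutputMetaFn
    fill_params_fn);            // FillParamsFn (optional)

// Step 2: main execution function, used as the dispatcher entry point.
std::tuple<at::Tensor, at::Tensor> custom_topk(
    const at::Tensor& input, int64_t k, int64_t dim, bool sorted) {
  auto& op = habana::custom_op::getUserCustomOpDescriptor("custom_op::custom_topk");
  std::vector<c10::IValue> inputs{input, k, dim, sorted};
  auto outputs = op.execute(inputs);
  return {outputs[0], outputs[1]};
}

// Step 3: expose the op to PyTorch (standard macros).
TORCH_LIBRARY(custom_op, m) {
  m.def("custom_topk(Tensor input, int k, int dim, bool sorted) -> (Tensor, Tensor)");
}
TORCH_LIBRARY_IMPL(custom_op, HPU, m) {
  m.impl("custom_topk", custom_topk);        // HPU execution function
}
TORCH_LIBRARY_IMPL(custom_op, Meta, m) {
  m.impl("custom_topk", custom_topk_meta);   // Meta impl for torch.compile
}
```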

API Limitations

Single TPC Kernel Definition per UserCustomOpDescriptor

A UserCustomOpDescriptor can define only a single TPC kernel within its implementation.

If a given complex operation requires more than one TPC kernel to represent it, there are two options:

  • Implement a new TPC kernel that combines the functionality of the simpler TPC kernels.

  • If possible, represent the complex operation as a series of simple operations at the Python level.
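As an illustration of the second option, consider a hypothetical complex operation such as a "scaled top-k" (the op and the names here are invented for illustration, not part of the API): rather than writing one fused TPC kernel, it can be composed from existing simple ops at the Python level, with each underlying op still dispatched individually.

```python
import torch

# Hypothetical complex op "scaled top-k": scale the input, then take top-k.
# Composing existing simple ops at the Python level avoids writing a fused
# TPC kernel; each underlying op can still run on the accelerator.
def scaled_topk(x: torch.Tensor, scale: float, k: int):
    return torch.topk(x * scale, k)

values, indices = scaled_topk(torch.tensor([1.0, 4.0, 2.0, 3.0]), 2.0, 2)
```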

Memory Layout

Currently, the memory layout is set to Contiguous.
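Since the layout is contiguous, a common precaution (an assumption here, not a documented requirement) is to make non-contiguous inputs, such as transposed views, contiguous before passing them to the op:

```python
import torch

# A transpose produces a non-contiguous view; .contiguous() materializes
# a contiguous copy with the same values, matching the Contiguous layout.
x = torch.arange(6.0).reshape(2, 3).t()   # non-contiguous view
x_ready = x.contiguous()                  # contiguous copy, same values
```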

torch.compile Support

A CustomOp must be registered in the custom_op namespace to be fully supported in torch.compile mode. There are no restrictions on the namespace name in Lazy and Eager modes.

CustomOp Installation

To create a library containing a new CustomOp implementation, compile the op.cpp file against the appropriate Habana PyTorch plugin. There are two plugins, one for each set of execution modes:

  • habana_pytorch2_plugin - for Eager and torch.compile modes.

  • habana_pytorch_plugin - for Lazy mode.
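A minimal build sketch, assuming a CMake-based setup (the target name and overall layout are hypothetical, and include/link paths depend on your installation; only the plugin library names come from the list above):

```cmake
# Hypothetical CMakeLists.txt fragment for building the CustomOp library.
add_library(custom_topk SHARED op.cpp)

# Link against the plugin matching your execution mode:
#   habana_pytorch2_plugin - Eager and torch.compile modes
#   habana_pytorch_plugin  - Lazy mode
target_link_libraries(custom_topk PRIVATE habana_pytorch2_plugin)
```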

CustomOp Loading

Once the CustomOp is built, it needs to be loaded into the topology in Python. PyTorch provides a utility function to load the library:

import torch
# it is important to load the module before loading custom op libs
torch.ops.load_library(custom_op_lib_path)
# output = torch.ops.<custom_op_schema>(<inputs>)
a_topk_hpu, a_topk_indices_hpu = torch.ops.custom_op.custom_topk(a_hpu, 3, 1, False)

API Usage Examples

An example of how to use the API can be found on the PyTorch Model References GitHub page.