PyTorch CustomOp API
This document describes the API exposed for writing custom PyTorch operators for the Intel® Gaudi® AI accelerator. The API provides the ability to implement a custom HPU kernel for a new PyTorch operator, allowing PyTorch to execute the operator on a Gaudi device.
Note
The API described in this document allows creating CustomOps in all PyTorch modes: Lazy, Eager, and torch.compile. The legacy API, supported only in Lazy mode, is now deprecated but can still be used. For the legacy PyTorch CustomOp API documentation, see CustomOp API Legacy.
Prerequisites
TPC is a fully programmable core designed for workloads that do not map to matrix multiplication operations. A TPC kernel refers to a concrete implementation that performs a desired operation. Before working with the PyTorch CustomOp API, you must prepare the TPC kernel.
This document does not describe how to implement custom TPC kernels. For information on how to write TPC kernels, refer to the dedicated TPC documentation.
API Overview
The main part of the public interface resides in the hpu_custom_op_pt2.h header file. It contains all the necessary declarations to define a custom Intel Gaudi PyTorch kernel. The most important classes and structs to interact with are:
UserCustomOpDescriptor - Descriptor holding all the information needed for the custom kernel.
PartialOutputMetaData - Parameters of the PyTorch CustomOp output tensors.
Basic Workflow for Writing CustomOp
To define a custom UserCustomOpDescriptor, call the registerUserCustomOp function:
Define the output metadata callback function OutputMetaFn. See the sketch after this list.
Define the kernel user params callback function FillParamsFn, if necessary. See the sketch after this list.
Call registerUserCustomOp with the schema name, the TPC GUID, and the callback function pointers.
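The sketch below illustrates this registration step, modeled on the custom_topk op used in the loading example later in this document. The habana::custom_op namespace, the exact callback signatures, the field layout of PartialOutputMetaData, and the synBeamParams struct from perf_lib_layer_params.h are assumptions for illustration; consult hpu_custom_op_pt2.h and your TPC kernel's params header for the authoritative definitions.
#include "hpu_custom_op_pt2.h"
#include <torch/extension.h>
#include <perf_lib_layer_params.h> // TPC kernel params structs (assumed location)

// Output metadata callback (OutputMetaFn): describes the dtype and shape of
// every output tensor, given the op inputs in schema order.
std::vector<habana::custom_op::PartialOutputMetaData> topk_output_meta(
    const at::Stack& inputs) {
  auto self = inputs[0].toTensor();
  auto k = inputs[1].toInt();
  auto axis = inputs[2].toInt();

  std::vector<int64_t> output_shape = self.sizes().vec();
  if (!output_shape.empty())
    output_shape[axis] = k;

  // Two outputs: top-k values and their indices.
  return {
      {self.scalar_type(), output_shape},      // values
      {c10::ScalarType::Long, output_shape}};  // indices
}

// Kernel user params callback (FillParamsFn): fills the TPC kernel's params
// struct. synBeamParams and its fields are assumptions matching a topk kernel.
std::shared_ptr<void> topk_fill_params(const at::Stack& inputs, size_t& size) {
  auto params = std::make_shared<synBeamParams>();
  size = sizeof(synBeamParams);
  params->bsw = inputs[1].toInt();       // k
  params->axis = inputs[2].toInt();      // reduction axis
  params->bottomK = inputs[3].toBool();  // smallest-k instead of largest-k
  return params;
}

// Register the op: schema name, TPC kernel GUID, and the two callbacks.
static const bool topk_registered = []() {
  habana::custom_op::registerUserCustomOp(
      "custom_op::custom_topk", "topk", topk_output_meta, topk_fill_params);
  return true;
}();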
Create the main execution function for the CustomOp:
Access the UserCustomOpDescriptor registered in the previous step using getUserCustomOpDescriptor.
Call execute with a vector of IValue inputs. See the sketch below.
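A matching execution function might look like the following sketch. The descriptor lookup and the execute call follow the two sub-steps above; whether getUserCustomOpDescriptor is a static member and the exact return types are assumptions.
// Main execution function, later registered as the HPU dispatcher entry.
std::tuple<at::Tensor, at::Tensor> custom_topk(
    at::Tensor input, int64_t k, int64_t axis, bool bottom) {
  // Access the descriptor registered under the schema name.
  auto op_desc = habana::custom_op::UserCustomOpDescriptor::
      getUserCustomOpDescriptor("custom_op::custom_topk");
  // Pack the inputs as IValues in schema order and run the TPC kernel.
  std::vector<c10::IValue> inputs{input, k, axis, bottom};
  std::vector<at::Tensor> outputs = op_desc.execute(inputs);
  return {outputs[0], outputs[1]};
}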
Define the PyTorch schema for the CustomOp using TORCH_LIBRARY and TORCH_LIBRARY_IMPL:
Define the op schema using TORCH_LIBRARY.
Register the PyTorch dispatcher function from the previous step using TORCH_LIBRARY_IMPL.
Define a Meta implementation for torch.compile usage using TORCH_LIBRARY_IMPL. See examples in Meta Implementation and Real life PyTorch example, and the sketch after this list.
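This last step could look like the following sketch. TORCH_LIBRARY and TORCH_LIBRARY_IMPL are standard PyTorch registration macros; the Meta implementation runs on meta tensors so that torch.compile can infer output shapes and dtypes without executing the kernel, and it must mirror the shapes produced by the output metadata callback.
// Meta implementation: shape/dtype inference only, no kernel execution.
std::tuple<at::Tensor, at::Tensor> custom_topk_meta(
    at::Tensor input, int64_t k, int64_t axis, bool bottom) {
  auto output_shape = input.sizes().vec();
  if (!output_shape.empty())
    output_shape[axis] = k;
  auto values = at::empty(output_shape, input.options());
  auto indices = at::empty(output_shape, input.options().dtype(at::kLong));
  return {values, indices};
}

TORCH_LIBRARY(custom_op, m) {
  // Op schema: operator name, argument types, and outputs.
  m.def("custom_topk(Tensor self, int k, int axis, bool bottom) -> (Tensor, Tensor)");
}

TORCH_LIBRARY_IMPL(custom_op, HPU, m) {
  // Dispatch HPU tensors to the execution function from the previous step.
  m.impl("custom_topk", custom_topk);
}

TORCH_LIBRARY_IMPL(custom_op, Meta, m) {
  // Used by torch.compile when tracing the op.
  m.impl("custom_topk", custom_topk_meta);
}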
API Limitations
Single TPC Kernel Definition per UserCustomOpDescriptor
UserCustomOpDescriptor can define only a single TPC kernel within its implementation.
If a given complex operation requires more than one TPC kernel to represent it, there are two options:
Implement a new TPC kernel that combines the functionality of the simple TPC kernels.
If possible, represent the complex operation as a series of simple operations at the Python level.
Memory Layout
Currently, the memory layout is set to contiguous.
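This means the kernel sees inputs in contiguous layout; if an input may be non-contiguous, one defensive option (a sketch, not a requirement stated by the API) is to copy it before execution:
// In the execution function, force a contiguous copy of potentially
// non-contiguous inputs before packing them for execute().
std::vector<c10::IValue> inputs{input.contiguous(), k, axis, bottom};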
torch.compile Support
A CustomOp must be registered in the custom_op namespace to be fully supported in torch.compile mode.
There are no restrictions for the namespace name in Lazy and Eager modes.
CustomOp Installation
To create a library containing a new CustomOp implementation, compile the op.cpp file against the appropriate Habana PyTorch plugin. There are two plugins for the different execution modes:
habana_pytorch2_plugin - for Eager and torch.compile modes.
habana_pytorch_plugin - for Lazy mode.
CustomOp Loading
Once the CustomOp is built, it needs to be loaded in the topology in Python. PyTorch provides a util function to load the library:
import torch
# it is important to load the module before loading custom op libs
torch.ops.load_library(custom_op_lib_path)
# output = torch.ops.<custom_op_schema>(<inputs>)
a_topk_hpu, a_topk_indices_hpu = torch.ops.custom_op.custom_topk(a_hpu, 3, 1, False)
API Usage Examples
An example of how to use the API can be found on the PyTorch Model References GitHub page.