PyTorch CustomOp API
This document describes the API exposed for writing custom PyTorch operators for the Habana Accelerator.
The API makes it possible to implement a custom HPU (Habana Processing Unit) kernel for new PyTorch operators. This allows PyTorch to execute a given operator on a Habana Gaudi.
TPC is a fully programmable core designed for workloads that do not map to matrix multiplication operations. A TPC kernel is a concrete implementation that performs a desired operation. It is the user’s responsibility to prepare the TPC kernel before working with the PyTorch CustomOp API.
This document does not describe how to implement custom TPC kernels.
For information on how to write TPC kernels, please refer to the following:
The main part of the public interface resides in the hpu_custom_op.h header file. It contains all the declarations needed to define a custom HPU PyTorch kernel.
The following lists the most important classes and structs to interact with:
HabanaCustomOpDescriptor - Descriptor holding all the information needed for a custom kernel.
NodeDesc - PyTorch CustomOp info description.
InputDesc - PyTorch CustomOp inputs info.
OutputDesc - PyTorch CustomOp outputs info.
To define a custom HabanaCustomOpDescriptor, call the REGISTER_CUSTOM_OP_ATTRIBUTES macro:
Define the InputDesc vector for all inputs of the kernel.
Define the OutputDesc vector for all outputs of the kernel.
Call the macro with the schema name, TPC GUID, inputs, outputs and user param callback function.
Create the main execution function for CustomOp:
Access the HabanaCustomOpDescriptor registered in the previous step using getCustomOpDescriptor.
Call execute with a vector of IValue inputs.
Define the PyTorch schema for CustomOp using TORCH_LIBRARY and TORCH_LIBRARY_IMPL:
Define the op schema using TORCH_LIBRARY.
Register the execution function from the previous step as the PyTorch dispatcher function using TORCH_LIBRARY_IMPL.
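The real API is C++ and lives in hpu_custom_op.h, so the following is only a pure-Python model of the three steps above (register a descriptor, look it up and execute it, expose the function under a schema name). Every name in it (`register_custom_op`, `get_custom_op_descriptor`, the `kernel` field, the `"topk"` GUID, the `DISPATCH` table) is a hypothetical stand-in chosen for illustration, and a plain Python function plays the role of the TPC kernel.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence


@dataclass(frozen=True)
class InputDesc:
    kind: str   # "tensor" or "scalar" (illustrative labels, not the real enum)
    index: int  # position of the input in the op signature


@dataclass(frozen=True)
class OutputDesc:
    index: int  # position of the output in the result list


@dataclass
class CustomOpDescriptor:
    schema_name: str
    guid: str
    inputs: List[InputDesc]
    outputs: List[OutputDesc]
    kernel: Callable[..., Sequence]  # plain function standing in for the TPC kernel

    def execute(self, values: list) -> list:
        """Run the single kernel owned by this descriptor on the given inputs."""
        if len(values) != len(self.inputs):
            raise ValueError("input count does not match the descriptor")
        results = list(self.kernel(*values))
        if len(results) != len(self.outputs):
            raise ValueError("output count does not match the descriptor")
        return results


_REGISTRY: Dict[str, CustomOpDescriptor] = {}


def register_custom_op(desc: CustomOpDescriptor) -> None:
    _REGISTRY[desc.schema_name] = desc


def get_custom_op_descriptor(schema_name: str) -> CustomOpDescriptor:
    return _REGISTRY[schema_name]


def _topk_kernel(values, k):
    # Indices of the k largest values, largest first.
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)[:k]
    return [values[i] for i in order], order


# Step 1: describe inputs and outputs, then register the descriptor.
register_custom_op(CustomOpDescriptor(
    schema_name="custom_op::custom_topk",
    guid="topk",  # placeholder GUID, an assumption for this sketch
    inputs=[InputDesc("tensor", 0), InputDesc("scalar", 1)],
    outputs=[OutputDesc(0), OutputDesc(1)],
    kernel=_topk_kernel,
))


# Step 2: the main execution function looks up the descriptor and executes it.
def custom_topk(values, k):
    desc = get_custom_op_descriptor("custom_op::custom_topk")
    top_values, top_indices = desc.execute([values, k])
    return top_values, top_indices


# Step 3: bind the schema name to the execution function.
DISPATCH = {"custom_op::custom_topk": custom_topk}
```

In the actual C++ workflow the registry and dispatch table are provided by the Habana bridge and the PyTorch dispatcher. The sketch only shows how the three pieces relate to each other.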
Single TPC Kernel Definition per HabanaCustomOpDescriptor¶
This is the main assumption of the API: a HabanaCustomOpDescriptor can define only a single TPC kernel within its implementation.
If a given complex operation requires more than one TPC kernel to represent it, there are two options:
Implement a new TPC kernel that combines the functionality of the simpler TPC kernels.
Or, if possible, represent the complex operation as a series of simple operations at the Python level.
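The second option can be pictured with a minimal sketch. The two functions `custom_relu` and `custom_scale` are hypothetical stand-ins for custom ops that would each be backed by exactly one TPC kernel. Here they are plain Python functions on lists so that the example runs anywhere.

```python
from typing import List


def custom_relu(x: List[float]) -> List[float]:
    """Stand-in for a custom op backed by a single TPC kernel."""
    return [max(v, 0.0) for v in x]


def custom_scale(x: List[float], factor: float) -> List[float]:
    """Stand-in for a second custom op, also backed by a single kernel."""
    return [v * factor for v in x]


def scaled_relu(x: List[float], factor: float) -> List[float]:
    """Complex operation expressed as a chain of single-kernel ops."""
    return custom_scale(custom_relu(x), factor)
```

With real custom ops, each stand-in would presumably be replaced by the corresponding `torch.ops.<custom_op_schema>` call, while the composition itself stays in Python.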
Currently, the memory layout is taken from the memory layout of the input 0 tensor.
If the user does not set the output shape callback function, the output shape is the same as the shape of the input 0 tensor.
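The output shape rule can be modelled in a few lines. This is a sketch under assumptions: `resolve_output_shape` and `topk_shape` are hypothetical helpers, and the callback here receives a list of input shapes, which may differ from what the real C++ callback receives.

```python
from typing import Callable, List, Optional

Shape = List[int]


def resolve_output_shape(
    input_shapes: List[Shape],
    shape_callback: Optional[Callable[[List[Shape]], Shape]] = None,
) -> Shape:
    """Use the callback if one is set, otherwise fall back to the shape of input 0."""
    if shape_callback is None:
        return list(input_shapes[0])
    return list(shape_callback(input_shapes))


def topk_shape(k: int, axis: int) -> Callable[[List[Shape]], Shape]:
    """Build a callback that shrinks `axis` of input 0 to length k."""
    def callback(input_shapes: List[Shape]) -> Shape:
        shape = list(input_shapes[0])
        shape[axis] = k
        return shape
    return callback
```

For an input of shape `[4, 8]`, calling `resolve_output_shape([[4, 8]])` falls back to the input shape, while passing `topk_shape(3, 1)` as the callback changes the second dimension to 3.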
Inputs Types to CustomOp¶
Currently, only Tensor and Scalar are supported as input types to CustomOp. Arrays of any type are not supported.
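One way to surface this restriction early is to check the arguments on the Python side before calling the op. The sketch below is illustrative only: `FakeTensor` is a hypothetical stand-in for a tensor type, `classify_inputs` is not part of the API, and treating any Python number (including `bool`) as a scalar is an assumption.

```python
from numbers import Number
from typing import List


class FakeTensor:
    """Hypothetical stand-in for a tensor; holds its data without interpreting it."""

    def __init__(self, data):
        self.data = data


def classify_inputs(args: list) -> List[str]:
    """Label each argument as 'tensor' or 'scalar'; reject everything else."""
    kinds = []
    for position, arg in enumerate(args):
        if isinstance(arg, FakeTensor):
            kinds.append("tensor")
        elif isinstance(arg, Number):
            kinds.append("scalar")
        else:
            raise TypeError(
                f"input {position}: {type(arg).__name__} is not supported; "
                "pass a tensor or a scalar"
            )
    return kinds
```

A call such as `classify_inputs([FakeTensor([1, 2]), 3, 1, False])` passes, while a list or tuple argument raises `TypeError`.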
Habana Mixed Precision (HMP)¶
CustomOp is not integrated with the Habana Mixed Precision (HMP) package, so mixed precision training support via the HMP package does not apply to CustomOp. If a CustomOp needs to be executed with BF16, it can be written explicitly with the BF16 data type and invoked by the user.
Once the CustomOp is built, it needs to be loaded in the topology in Python. PyTorch has a utility function to load the library:
    import torch
    # it is important to load the module before loading custom op libs
    torch.ops.load_library(custom_op_lib_path)
    # output = torch.ops.<custom_op_schema>(<inputs>)
    a_topk_hpu, a_topk_indices_hpu = torch.ops.custom_op.custom_topk(a_hpu, 3, 1, False)
An example of how to use the API can be found on the PyTorch Model References GitHub page.