PyTorch CustomOp API

This document describes the API exposed for writing custom PyTorch operators for the Intel® Gaudi® AI accelerator.


The API lets you implement a custom HPU kernel for a new PyTorch operator, so that PyTorch can execute that operator on a Gaudi device.


The Tensor Processor Core (TPC) is a fully programmable core designed for workloads that do not map to matrix multiplication operations. A TPC kernel is a concrete implementation that performs a desired operation. It is the user's responsibility to prepare the TPC kernel before working with the PyTorch CustomOp API.


This document does not describe how to implement custom TPC kernels.

For information on how to write TPC kernels, please refer to the following:

API Overview

The main part of the public interface resides in the hpu_custom_op.h header file. It contains all the declarations needed to define a custom HPU PyTorch kernel.

The following lists the most important classes and structs to interact with:

Basic Workflow

  1. To define a custom HabanaCustomOpDescriptor, call the REGISTER_CUSTOM_OP_ATTRIBUTES macro:

    • Define an InputDesc vector describing all inputs of the kernel.

    • Define an OutputDesc vector describing all outputs of the kernel.

    • Call the macro with the schema name, TPC GUID, inputs, outputs and the user param callback function.

  2. Create the main execution function for the CustomOp.

  3. Define the PyTorch schema for the CustomOp using TORCH_LIBRARY and TORCH_LIBRARY_IMPL.

    • Define the op schema using TORCH_LIBRARY.

    • Register the execution function from step 2 as the PyTorch dispatcher function using TORCH_LIBRARY_IMPL.
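The split between TORCH_LIBRARY and TORCH_LIBRARY_IMPL in step 3 may be easier to follow with a toy model. The sketch below is plain Python and does not use the PyTorch or Gaudi APIs; ToyDispatcher, custom_add_one and the schema string are hypothetical names chosen for illustration. It only shows the idea that a schema is declared once and a backend-specific function is then attached to it.

```python
class ToyDispatcher:
    """Toy model of schema definition and per-backend registration.

    Illustration only: this is NOT the PyTorch dispatcher API.
    """

    def __init__(self):
        self.schemas = {}   # op name -> schema string
        self.kernels = {}   # (op name, backend) -> callable

    def define(self, name, schema):
        # Plays the role of declaring the op schema (TORCH_LIBRARY).
        self.schemas[name] = schema

    def impl(self, name, backend, fn):
        # Plays the role of binding a backend function (TORCH_LIBRARY_IMPL).
        if name not in self.schemas:
            raise KeyError(f"schema for {name!r} must be defined first")
        self.kernels[(name, backend)] = fn

    def call(self, name, backend, *args):
        fn = self.kernels.get((name, backend))
        if fn is None:
            raise NotImplementedError(f"{name!r} has no {backend} implementation")
        return fn(*args)


dispatcher = ToyDispatcher()
# Hypothetical op name and schema string.
dispatcher.define("custom_add_one", "custom_add_one(Tensor x) -> Tensor")
# A Python list stands in for a tensor in this toy model.
dispatcher.impl("custom_add_one", "HPU", lambda xs: [x + 1 for x in xs])

hpu_result = dispatcher.call("custom_add_one", "HPU", [1, 2, 3])
```

In this model, calling the op for a backend that has no registered function raises an error, which is the reason the schema definition alone is not enough to run the op.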

API Limitations

Single TPC Kernel Definition per HabanaCustomOpDescriptor

This is the main assumption of the API: a HabanaCustomOpDescriptor can define only a single TPC kernel within its implementation.

If a complex operation requires more than one TPC kernel, there are two options:

  • Implement a new TPC kernel that combines the functionality of the simpler TPC kernels.

  • Or, if possible, represent the complex operation as a series of simple operations at the Python level.
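The second option can be sketched in a few lines. This is a minimal illustration in plain Python, not Gaudi code: scale_by_two and add_one are hypothetical stand-ins for two single-kernel custom ops, and Python lists stand in for tensors. In real code each stand-in would be a call of the form torch.ops.<custom_op_schema>(<inputs>).

```python
def run_in_sequence(value, *ops):
    """Apply single-kernel ops one after another, feeding each result forward."""
    for op in ops:
        value = op(value)
    return value


# Hypothetical stand-ins for two custom ops, each backed by one TPC kernel.
def scale_by_two(values):
    return [2 * v for v in values]


def add_one(values):
    return [v + 1 for v in values]


# The "complex" operation 2*x + 1 is expressed as two simple ops in Python.
result = run_in_sequence([1, 2, 3], scale_by_two, add_one)
```

Each step stays within the one-kernel-per-descriptor limit, at the cost of materializing an intermediate result between the two ops.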

Memory Layout

Currently, the memory layout is taken from the memory layout of the input 0 Tensor.

Output Shape

If the user does not set an output shape callback function, the output shape is the same as the shape of the input 0 Tensor.
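The default rule can be written out as a short sketch. The code below is plain Python that models the rule only; resolve_output_shape and topk_shape are hypothetical helpers, and the callback signature used here is an assumption made for illustration, not the signature declared in hpu_custom_op.h.

```python
def resolve_output_shape(input_shapes, shape_callback=None):
    """Model of the rule: without a callback, the output copies input 0's shape."""
    if shape_callback is None:
        return tuple(input_shapes[0])
    return tuple(shape_callback(input_shapes))


def topk_shape(k, axis):
    """Build a hypothetical callback for a top-k style op.

    The output keeps input 0's shape except along `axis`, which shrinks to k.
    """
    def callback(input_shapes):
        shape = list(input_shapes[0])
        shape[axis] = k
        return shape
    return callback


default_shape = resolve_output_shape([(4, 10)])
topk_output_shape = resolve_output_shape([(4, 10)], topk_shape(k=3, axis=1))
```

A top-k style op is a case where the default is not sufficient: selecting 3 elements along axis 1 of a (4, 10) input yields a (4, 3) output, so a callback would be needed.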

Input Types to CustomOp

Currently, only Tensor and Scalar are supported as input types to CustomOp. Arrays of any type are not supported.
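One way to catch an unsupported argument early is to check types on the Python side before calling the op. The helper below is a hypothetical sketch, not part of the CustomOp API; it takes the tensor type as a parameter so that it runs without PyTorch installed. With PyTorch, tensor_types would be (torch.Tensor,).

```python
from numbers import Number


def check_custom_op_args(args, tensor_types=()):
    """Raise TypeError unless every argument is a tensor or a scalar.

    Hypothetical helper: `tensor_types` is a tuple of classes to treat as tensors.
    Python bools count as scalars because bool is a subclass of int.
    """
    for index, arg in enumerate(args):
        if isinstance(arg, tensor_types) or isinstance(arg, Number):
            continue
        raise TypeError(
            f"argument {index} has unsupported type {type(arg).__name__}; "
            "only tensors and scalars are accepted"
        )


class FakeTensor:
    """Stand-in for torch.Tensor so the sketch runs without PyTorch."""


# Tensor plus scalars: accepted.
check_custom_op_args([FakeTensor(), 3, 1, False], tensor_types=(FakeTensor,))

# A list argument is rejected.
try:
    check_custom_op_args([FakeTensor(), [3, 1]], tensor_types=(FakeTensor,))
    list_rejected = False
except TypeError:
    list_rejected = True
```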


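Once the CustomOp is built, its library needs to be loaded in the Python script that defines the topology. PyTorch provides a utility function, torch.ops.load_library, for this:

import torch
# It is important to load the Habana module before loading custom op libs.
import habana_frameworks.torch.core  # assumed to be the module the comment above refers to
# custom_op_lib_path is a placeholder for the path to the built CustomOp library.
torch.ops.load_library(custom_op_lib_path)
# output = torch.ops.<custom_op_schema>(<inputs>)
a_topk_hpu, a_topk_indices_hpu = torch.ops.custom_op.custom_topk(a_hpu, 3, 1, False)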

An example of how to use the API can be found on the PyTorch Model References GitHub page.