4. TensorFlow CustomOp API

This document describes the API for writing custom TensorFlow operators for the Habana accelerator.

4.1. Overview

The API lets you implement a custom HPU (Habana Processing Unit) kernel for new or existing TensorFlow operators. This allows TensorFlow to execute a given operator on a Habana Gaudi device.

4.1.1. Prerequisites

The core prerequisite of the TensorFlow CustomOp API is a Habana TPC (Tensor Processor Core) kernel that is ready to be executed. The TPC is a fully programmable core designed for workloads that do not map to matrix multiplication operations. A TPC kernel is a concrete implementation that performs a desired operation. It is the user's responsibility to prepare the TPC kernel before working with the TensorFlow CustomOp API. This document does not describe how to implement custom TPC kernels.

For information on how to write TPC kernels, please refer to the following:

4.2. API Overview

The main part of the public interface resides in the hpu_kernel.h and hpu_kernel_context.h header files. They contain all the declarations necessary to define a custom TensorFlow kernel.

Note: All classes exposed by this API reside in the inline namespace v1. This namespace should not be used explicitly when working with the API; it exists to allow potential API versioning.

The following lists the most important classes to interact with:

4.2.1. Basic Workflow

  1. To define a custom HpuKernel, implement the BaseFunctor interface.

  2. Within BaseFunctor::DefineNode(…) method, perform the following:

  3. To implement the kernel, BaseFunctor::DefineNode(…) receives an HpuKernelContext. With its help, you can define the two necessary components in one of two ways:

  4. Once the kernel is defined, register it with the REGISTER_KERNEL_BUILDER macro (from tensorflow/core/framework/op_kernel.h) combined with the dedicated class ForHpuWithName.

4.3. Clustering

Clustering is a feature designed to optimize performance. It was developed specifically for Habana® accelerators and is a main distinction from the GPU approach.

The key aspect of Clustering is transforming the input TensorFlow graph into an intermediate graph representation that is compiled with a dedicated internal Graph Compiler. The goal of the transformation is to cluster many operations and form bigger execution graphs. These graphs are fed into the compiler, which produces programs that maximize usage of all the distinct hardware components of the accelerator device. This is also why users of the API implement habana::BaseFunctor rather than tensorflow::OpKernel: a functor does not perform computation in each iteration, but is invoked only once to form a bigger execution graph.

On GPU, there is no such concept as clustering; hence, tensorflow::OpKernel classes are implemented directly.
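The difference between the two models can be pictured with a toy sketch in plain Python. It is a conceptual illustration only: PerStepKernel, DefineOnceFunctor, and compile_cluster are hypothetical names that exist neither in TensorFlow nor in the Habana API, and a real Graph Compiler does far more than chain callables. The sketch shows only that a per-step kernel is called on every iteration, while a functor is asked once to describe its node and the resulting program is then reused.

```python
# Toy model only: none of these classes exist in TensorFlow or in the Habana API.


class PerStepKernel:
    """Mimics an OpKernel-style op: compute() runs on every iteration."""

    def __init__(self, fn):
        self.fn = fn
        self.calls = 0

    def compute(self, x):
        self.calls += 1
        return self.fn(x)


class DefineOnceFunctor:
    """Mimics a functor that only describes its node and computes nothing itself."""

    def __init__(self, fn):
        self.fn = fn
        self.define_calls = 0

    def define_node(self, graph):
        self.define_calls += 1
        graph.append(self.fn)


def compile_cluster(functors):
    """Ask every functor for its node once, then return one reusable program."""
    graph = []
    for functor in functors:
        functor.define_node(graph)

    def program(x):
        for node in graph:
            x = node(x)
        return x

    return program


# Clustered path: two nodes are defined once and executed as one program.
double = DefineOnceFunctor(lambda v: v * 2)
add_one = DefineOnceFunctor(lambda v: v + 1)
program = compile_cluster([double, add_one])
results = [program(x) for x in range(3)]  # three "iterations"

# Direct path: the kernel itself is called on each iteration.
kernel = PerStepKernel(lambda v: v * 2)
kernel_results = [kernel.compute(x) for x in range(3)]

print(results, double.define_calls, add_one.define_calls)
print(kernel_results, kernel.calls)
```

After three iterations, each functor has been asked to define its node exactly once, whereas the per-step kernel has been called three times.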

4.3.1. How it Works with TensorFlow

TensorFlow operates on computation graphs that are constructed from a model provided in Python with the tensorflow module.

There are two execution phases in TensorFlow, called graph-time and compute-time. In graph-time, the graph is created, partitioned, and assigned to specific devices, and all the optimization passes (e.g. Clustering) take place. In compute-time, the actual computations are performed on a given input. Compute-time is the phase in which all tensor shapes are known; therefore, BaseFunctor::DefineNode(…) is invoked only in compute-time.

In version 1.x of TensorFlow, graphs are executed in a special construct (tf.Session) that provides an encapsulated environment for executing Ops. In a session, all the optimizations are performed and graphs are clustered.

In version 2.x of TensorFlow, tf.Session is replaced with Eager execution. You no longer have to define a session to run a graph, but can run Ops directly one by one (eagerly). Such a flow prevents TensorFlow from generating computation graphs, so clustering is disabled in Eager execution. However, another construct, tf.function, allows grouping Ops together in Eager execution and forming computation graphs that can be clustered.

4.4. API Limitations

4.4.1. Single TPC Kernel Definition per HpuKernel

This is the main assumption of the API. A BaseFunctor inside an HpuKernel can define only a single TPC kernel within its implementation. The HpuKernelContext interface prevents defining more than one node.

If a given complex operation requires more than one TPC kernel to represent it, there are two options:

  • Implement a new TPC kernel that combines the functionality of the simpler TPC kernels.

  • Or, if possible, represent the complex operation as a series of simple operations at the Python level.

4.4.2. Device Addresses are not Accessible

When a kernel is executed on the HPU device, the input and output tensors do not reside in CPU-accessible memory. They are placed on the HPU device. The API does not allow users to access any device-specific addresses. It only allows inspecting the metadata of a given tensor (DataType and Shape). The exception is Host-memory tensors, which by definition reside on the host and can be accessed with the dedicated function HpuKernelContext::HostMemoryInput(…).

4.5. Future Extensions

4.5.1. Dynamic Shapes

In TensorFlow, there is a distinction between static and dynamic shapes. A static shape is an output shape that can be computed from the input shapes alone (for a given Op). A dynamic shape is an output shape that can only be computed from the concrete input values of an Op. Hence, TensorFlow can only know such shapes after all the preceding computations are done.

An example of an Op that produces a dynamic-shape tensor is WhereOp. Its output shape depends on the input values, specifically on how many non-zero elements it finds.
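The distinction can be made concrete with a small sketch in plain Python. The two helper functions are hypothetical and stand in for shape inference only. They assume equal-shaped inputs for the addition and a 1-D input for the where-style Op, whose output holds one row of indices per non-zero element.

```python
def add_output_shape(shape_a, shape_b):
    """Static shape: the result depends on the input shapes only."""
    if shape_a != shape_b:
        raise ValueError("this sketch does not handle broadcasting")
    return shape_a


def where_output_shape(values):
    """Dynamic shape: the result depends on the input values (1-D input assumed)."""
    nonzero = sum(1 for v in values if v)
    return (nonzero, 1)  # one row per non-zero element, one column per input dimension


# The shape of an elementwise add is known before any data exists.
static_shape = add_output_shape((4,), (4,))

# Two inputs of the same shape (4,) produce different where-style output shapes.
sparse_shape = where_output_shape([0, 3, 0, 7])
dense_shape = where_output_shape([1, 1, 1, 1])

print(static_shape, sparse_shape, dense_shape)
```

Both where-style inputs have shape (4,), yet the output shapes differ, which is why such a shape cannot be determined before the preceding computations have produced the values.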

Dynamic shapes require special handling to work properly on Habana® accelerators. They are currently only partially supported, meaning:

  • If a CustomOp runs in a topology with Ops that produce dynamic shapes (e.g. WhereOp), the topology will run sub-optimally.

  • CustomOps that need to produce a dynamic shape tensor are not supported.

4.5.2. In-place Updates

In-place update is a feature mainly useful for optimizers (e.g. SGD, AdamOptimizer). TensorFlow provides an API to define a set of Ops that update tensors linked to tf.Variables. Such operators share a common convention: the updated tensor is not produced as an output, since the input is updated in-place.

Defining an in-place update on a given tf.Variable input is currently not supported.

4.5.3. HostMemory Output Tensors

Custom TensorFlow Ops with HostMemory outputs are not supported.

4.6. Examples

An example of how to use the API can be found in the Model References repository.