2. Habana Collective Communications Library (HCCL) API Reference

2.1. Overview

Habana Collective Communications Library (HCCL) is Habana’s emulation layer of the NVIDIA Collective Communication Library (NCCL). It operates on top of the Habana Communication Library (HCL), which is included in the Synapse library. See Habana Communication Library (HCL) API Reference.

HCCL provides the same list of collective communication primitives as NCCL, imitating the NCCL interface for user convenience. HCCL also allows for point-to-point communication.

The following lists the supported collective primitives:

  • AllReduce

  • Broadcast

  • Reduce

  • AllGather

  • ReduceScatter

  • Send (P2P)

  • Recv (P2P)

HCCL has been written for seamless integration with the Habana TensorFlow modules, sharing device ownership and the synchronization mechanism with them. It also provides a set of functions for interacting with the device (moving data, allocating memory, etc.).

Warning

All HCL-related restrictions and usage guidelines also apply to HCCL. See Habana Communication Library (HCL) API Reference.

2.2. Using HCCL

2.2.1. HCCL Library

The HCCL library is provided as a dynamically-linked library named libhccl.so.X.Y.Z, where X.Y.Z denotes the TensorFlow version whose Abseil ABI the library is compatible with. For instance, a library named libhccl.so.2.4.1 has an Abseil ABI compatible with TensorFlow 2.4.1. However, the HCCL library does not depend on TensorFlow and technically may be used without it.

To access the HCCL C API from C/C++:

#include <hccl/hccl.h>

This header defines HCCL symbols, for example the hcclGetVersion function. For symbols that have a counterpart in the NCCL API, portability macros are defined as well:

#define ncclGetVersion hcclGetVersion
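
As a minimal sketch of how the portability macro can be used (this assumes hcclGetVersion follows the NCCL-compatible signature taking an int* out-parameter, which is an assumption rather than something stated above):

#include <hccl/hccl.h>
#include <iostream>

int main() {
  int version = 0;
  // Thanks to the portability macro, ncclGetVersion expands to hcclGetVersion.
  if (ncclGetVersion(&version) != hcclSuccess) {  // Assumed NCCL-compatible signature.
    std::cerr << "ncclGetVersion failed" << std::endl;
    return 1;
  }
  std::cout << "HCCL version: " << version << std::endl;
  return 0;
}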

2.2.2. MPI

The HCCL library is built with OpenMPI 4.0.5. Any program using HCCL must be invoked using mpiexec or mpirun. For example:

mpirun -np 8 --tag-output my_program

In addition, any program using HCCL must initialize the MPI execution context prior to calling hcclDeviceInit:

MPI_Init(&argc, &argv);

2.2.3. HCCL Runtime Initialization

To initialize HCCL runtime, call hcclDeviceInit function:

int device_handle{};
const int device_cardinal_id = 0;  // Always 0.

hcclResult_t res = hcclDeviceInit(&device_handle, device_cardinal_id);

This call also opens an HPU device, or reuses a previously acquired device if another agent has already acquired one within the process.

Prior to exiting, the program should unload HCCL by calling hcclDeviceFree:

hcclDeviceFree(device_handle);
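
For reference, a minimal initialization and teardown sketch with status checking (hcclGetErrorString is used here the same way as in the full example in Section 2.3.4):

#include <hccl/hccl.h>
#include "mpi.h"
#include <iostream>

int main(int argc, char* argv[]) {
  MPI_Init(&argc, &argv);  // Required before hcclDeviceInit (see the MPI section above).

  int device_handle{};
  const int device_cardinal_id = 0;  // Always 0.

  // Initialize the HCCL runtime and acquire (or reuse) the HPU device.
  hcclResult_t res = hcclDeviceInit(&device_handle, device_cardinal_id);
  if (res != hcclSuccess) {
    std::cerr << "hcclDeviceInit failed: " << hcclGetErrorString(res) << std::endl;
    return 1;
  }

  // ... issue memory copies and collective operations here ...

  // Release the device and clean up the HCCL runtime before exiting.
  res = hcclDeviceFree(device_handle);
  if (res != hcclSuccess) {
    std::cerr << "hcclDeviceFree failed: " << hcclGetErrorString(res) << std::endl;
    return 1;
  }

  MPI_Finalize();
  return 0;
}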

2.2.4. Interactions with TensorFlow and HPU Support Modules

HCCL has been designed to cooperate with TensorFlow and the HPU support modules. A typical Python script first loads TensorFlow, then loads the support modules, and finally uses HCCL in custom collective operations defined in C++.

For instance, let’s assume your Python program imports TensorFlow and loads Habana support for HPU accelerators:

import tensorflow
tensorflow.load_library(['graph_writer.so', 'habana_device.so'])

with tensorflow.device('/device:HPU:0'):
    # ...

In such a case, an HPU device will be opened and set up to use a sophisticated memory allocator called BFC Allocator. This memory allocator is implemented in the Habana Device integration component for TensorFlow (habana_device.so).

If hcclDeviceInit is called after an HPU device has been acquired by TensorFlow, the device_handle returned by this call designates the device already acquired by TensorFlow, with the BFC Allocator enabled.

Warning

If hcclDeviceInit is called without an HPU device having been opened earlier by the support modules, a new device will be opened using a simple memory allocator implementation. That allocator is not compatible with the HPU support for TensorFlow.

2.2.5. HCCL Environment Variables

During the HCCL initialization phase, the environment variables presented in the table below are used.

  • HCL_TYPE - Type of HLS box used. Value if not set: “HLS1”.

  • HCCL_BASE_PORT - Port range used for sideband communication. Value if not set: 45000.

  • HCCL_SOCKET_IFNAME - Prefix of the name of the network interface used for sideband communication. Value if not set: auto-detected (see description).

  • HCL_CONFIG_PATH - Path to the configuration file for HCL. See Habana Communication Library (HCL) API Reference. Value if not set: ignored.

2.2.5.1. HCL_TYPE

HCL_TYPE defines the type of HLS box available in the server. Currently “HLS1” and “HLS1-H” are supported. If this environment variable is not set, the default value is HLS1.

This value is used for auto-generating an HCL configuration file during the HCCL initialization phase. If HCL_CONFIG_PATH is provided, the value is ignored.

2.2.5.2. HCCL_BASE_PORT

HCCL_BASE_PORT defines the beginning of the port range that should be used for HCCL sideband TCP communication. The ports used by HCCL are in range [HCCL_BASE_PORT, HCCL_BASE_PORT+100].

If HCL_CONFIG_PATH is not provided, ports from this range are also assigned for the underlying HCL communicator.

2.2.5.3. HCCL_SOCKET_IFNAME

HCCL_SOCKET_IFNAME defines the prefix of the network interface name that is used for HCCL sideband TCP communication. If not set, the first network interface with a name that does not start with lo or docker will be used.

2.2.5.4. HCL_CONFIG_PATH

HCL_CONFIG_PATH defines the location of the HCL configuration file (see Habana Communication Library (HCL) API Reference). If set, make sure that the topology described in the HCL configuration file matches the communicator you want to create in HCCL.

Warning

If HCL_CONFIG_PATH is set, the value of HCL_TYPE will be ignored. Additionally, the port provided in the HCL config file will be used for HCL sideband communication but not for HCCL.
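
These variables are read from the process environment during the HCCL initialization phase, so one common approach is to export them in the launching shell (or pass them via the MPI launcher). As a minimal sketch, they may also be set programmatically before hcclDeviceInit; the values below, including the "eth" interface prefix, are illustrative only, and setenv is standard POSIX:

#include <cstdlib>      // setenv (POSIX)
#include <hccl/hccl.h>

void configure_hccl_env() {
  // The last argument (1) overwrites any value inherited from the shell.
  setenv("HCL_TYPE", "HLS1", 1);            // Type of HLS box.
  setenv("HCCL_BASE_PORT", "45000", 1);     // Start of the sideband TCP port range.
  setenv("HCCL_SOCKET_IFNAME", "eth", 1);   // Illustrative interface name prefix.
  // HCL_CONFIG_PATH is left unset, so an HCL configuration file is auto-generated.
}

// Call configure_hccl_env() before hcclDeviceInit() so the values take effect
// during HCCL initialization.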

2.2.6. HCCL Streams and Asynchronicity of Issued Operations

To issue a memory copy or a collective operation, the program creates a stream using hcclStreamCreate:

hcclStream_t stream{};
hcclStreamCreate(&stream);

The program may call this function multiple times to obtain multiple stream handles. However, in the current implementation, HCCL ignores all stream handles passed to any function that receives one. Instead, it implicitly offers a dependency tracking mechanism which automatically tracks all device addresses used as inputs and outputs of collective and memory copy operations. Therefore, manual synchronization is in the majority of cases neither necessary nor optimal. However, to block the caller thread until all pending operations complete, the hcclWaitForCompletion function may be called.

  • All collective operations are asynchronous - implemented as non-blocking calls.

  • Memory copy operations host->device and device->device are asynchronous.

  • Memory copy operations host->host and device->host are blocking (so the destination pointer to process memory may be immediately used).

  • Multiple consecutive device->device copy operations are internally coalesced into a single underlying Synapse API call. This speeds up the process of forming a Fusion Buffer in Horovod.

After an asynchronous call, a device pointer used as a destination may be immediately used as a source of another collective operation or memory copy operation.

Warning

Exception: Calling hcclSend right after hcclRecv on the same address. An additional call to hcclWaitForCompletion must take place between them.

Warning

Exception: When one operation produces its result to a buffer at some device pointer but another operation consumes the data using an offset pointer, the dependency tracking mechanism will not synchronize these operations properly. An additional call to hcclWaitForCompletion must take place between the producer and consumer operations. For example:

… -> Memory Copy of size 8 bytes (producer) -> X_addr

(X_addr + 4 bytes) -> All-Reduce of size 4 bytes (consumer) -> …

However, the following example will work without any additional (manual) synchronization:

… -> Memory Copy of size 4 bytes (producer) -> (X_addr + 4 bytes)

… -> Memory Copy of size 4 bytes (producer) -> X_addr

X_addr -> All-Reduce of size 8 bytes (consumer) -> …
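
The following sketch illustrates the offset-pointer exception above: the producer writes 8 bytes starting at x_addr, while the consumer reads starting at x_addr + 4 bytes, so an explicit hcclWaitForCompletion call is placed between them. The communicator hccl_comm and an initialized HCCL runtime are assumed to exist; sizes and values are illustrative only.

// Create a stream handle (ignored by the current implementation, but required by the API).
hcclStream_t stream{};
hcclStreamCreate(&stream);

// Allocate an 8-byte device buffer.
void* x_addr{};
hcclMalloc(&x_addr, 8);

float host_data[2] = {1.0f, 2.0f};

// Producer: asynchronous host->device copy of 8 bytes to x_addr.
hcclMemcpy(x_addr, host_data, 8, hcclMemcpyHostToDevice, stream);

// The consumer below reads from an offset pointer (x_addr + 4 bytes), which the implicit
// dependency tracking does not associate with the producer's destination address,
// so an explicit synchronization point is required here.
hcclWaitForCompletion(stream);

// Consumer: All-Reduce of one float (4 bytes) starting at the offset pointer.
void* x_offset = static_cast<char*>(x_addr) + 4;
hcclAllReduce(x_offset, x_offset, 1, hcclFloat32, hcclSum, hccl_comm, stream);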

2.3. C API

2.3.1. Infrastructure

2.3.1.1. hcclStream_t

Opaque handle to HCCL stream.

2.3.1.2. hcclComm_t

Opaque handle to HCCL communicator.

2.3.1.3. hcclUniqueId

Unique Id to be used with hcclCommInitRank. All HCCL communicators within a communicator clique have the same Unique Id.

2.3.1.4. hcclDataType_t

The following HCCL data types are currently supported by the collective operations:

  • hcclFloat32, hcclFloat

  • hcclBfloat16 (added by Habana)

2.3.1.5. hcclResult_t

The following HCCL result types are currently supported:

  • hcclSuccess

  • hcclSystemError

  • hcclInvalidArgument

  • hcclInternalError

  • hcclInvalidUsage

2.3.1.6. hcclRedOp_t

The following Ops are supported:

  • hcclSum

Warning

hcclProd, hcclMax, and hcclMin are not supported by HCCL.

2.3.1.7. hcclMemcpyKind_t

Specifies the direction of memory copy transfer used in function hcclMemcpy. The following values are supported:

  • hcclMemcpyHostToDevice - Source is in the process memory, destination is a device address.

  • hcclMemcpyDeviceToHost - Source is in a device address, destination is in the process memory.

  • hcclMemcpyDeviceToDevice - Both source and destination are device addresses.

Warning

hcclMemcpyDefault (inferring the address kind from pointers) is not supported by HCCL.

2.3.2. HCCL-specific API

Habana supports one device per process and one or multiple HLS-1 servers. The supported API functions constitute the minimum set of API functions needed to run HCCL on Gaudi devices.

2.3.2.1. hcclDeviceInit

Operation:

Initializes the HCCL runtime and opens an HPU device. The device is reused if it was already opened earlier in the process. If a device has already been opened by TensorFlow with the Habana support modules, HCCL will reuse the memory allocator provided by those modules. Otherwise, HCCL will use a simple memory allocator implementation. The ordinal parameter is currently unused and must be set to 0.

Parameters:

  • device_handle - [out] Device handle of an acquired device.

  • ordinal - [in] Reserved for future use; must be 0.

Return Value:

The status of the operation.

2.3.2.2. hcclDeviceFree

Operation:

Releases the HPU device and cleans up HCCL runtime.

Parameters:

  • device_handle - [in] Device handle to be closed.

Return Value:

The status of the operation.

2.3.2.3. hcclMalloc

Operation:

Allocates memory on the HPU device.

Parameters:

  • dev_ptr - [out] Allocated device address.

  • size - [in] Size in bytes to be allocated.

Return Value:

The status of the operation.

2.3.2.4. hcclFree

Operation:

Frees memory from the HPU device.

Parameters:

  • dev_ptr - [in] Device address to be freed.

Return Value:

The status of the operation.

2.3.2.5. hcclMemcpy

Operation:

Copies data between host and device.

Parameters:

  • dst - [out] Destination address. Can be a host or device address depending on the kind parameter.

  • src - [in] Source address. Can be a host or device address depending on the kind parameter.

  • count - [in] Size in bytes to be copied.

  • kind - [in] Direction of the copy operation, i.e. placement of the dst and src addresses on host/device.

Return Value:

The status of the operation.
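
A short sketch of the three supported copy directions, assuming dev_src and dev_dst were allocated with hcclMalloc and stream was created with hcclStreamCreate:

float host_in[4] = {0.f, 1.f, 2.f, 3.f};
float host_out[4] = {};

// Host -> device: asynchronous; completion is tracked on the device address.
hcclMemcpy(dev_src, host_in, sizeof(host_in), hcclMemcpyHostToDevice, stream);

// Device -> device: asynchronous; consecutive copies may be coalesced internally.
hcclMemcpy(dev_dst, dev_src, sizeof(host_in), hcclMemcpyDeviceToDevice, stream);

// Device -> host: blocking, so host_out may be read immediately after the call returns.
hcclMemcpy(host_out, dev_dst, sizeof(host_out), hcclMemcpyDeviceToHost, stream);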

2.3.2.6. hcclWaitForCompletion

Operation:

A blocking call which waits until issued asynchronous memory copy and collective operations are finished.

Parameters:

  • stream_handle - [in] HCCL stream handle.

Return Value:

The status of the operation.

2.3.3. NCCL-Compatible API

Habana supports one device per process and one HLS-1 server (up to 8 devices). The supported API functions are listed below. Other HCCL API functions are no-ops. The supported API functions constitute the minimum set of API functions needed to run HCCL on Gaudi devices.

2.3.3.1. hcclGetUniqueId

hcclResult_t hcclGetUniqueId(hcclUniqueId* uniqueId)

Operation:

One of the two functions used for creating an HCCL communicator. Creates a uniqueId required for hcclCommInitRank.

This function should be called by only one process within a new communicator. The obtained uniqueId is required for HCCL communicator initialization and must be distributed to the other ranks using a separate communication channel (e.g. MPI).

Parameters:

  • uniqueId - [out] commId used in hcclCommInitRank.

Return Value:

The status of the operation.

2.3.3.2. hcclCommInitRank

hcclResult_t hcclCommInitRank(hcclComm_t* comm, int nranks,
hcclUniqueId commId, int rank)

Operation:

Creates an HCCL communicator with the given commId. This call is blocking and will return only after being called on every rank of the new communicator.

commId should be obtained by hcclGetUniqueId on one of the ranks and distributed using a separate communication channel (e.g. MPI).

Each HCCL rank of the created communicator is associated with a single Gaudi device.

Parameters:

  • comm - [out] Handle of the initialized HCCL communicator.

  • nranks - [in] Total number of ranks in the created communicator clique. It can be 2, 4, or 8.

  • commId - [in] Communicator Id returned by hcclGetUniqueId. Communicators created using the same Id belong to the same communicator clique.

  • rank - [in] The rank of the current device/process.

Return Value:

The status of the operation.
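
A minimal bootstrap sketch mirroring the full example in Section 2.3.4, where rank 0 creates the unique Id and distributes it over MPI:

int mpi_rank{}, mpi_size{};
MPI_Comm_rank(MPI_COMM_WORLD, &mpi_rank);
MPI_Comm_size(MPI_COMM_WORLD, &mpi_size);

// Only one rank creates the unique Id.
hcclUniqueId unique_id{};
if (mpi_rank == 0) {
  hcclGetUniqueId(&unique_id);
}

// Distribute the Id to all ranks over a separate channel (here: MPI).
MPI_Bcast(&unique_id, sizeof(unique_id), MPI_BYTE, 0, MPI_COMM_WORLD);

// Every rank joins the communicator; the call blocks until all ranks have called it.
hcclComm_t hccl_comm{};
hcclCommInitRank(&hccl_comm, mpi_size, unique_id, mpi_rank);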

2.3.3.3. hcclCommCuDevice

hcclResult_t hcclCommCuDevice(const hcclComm_t comm, int* device)

Operation:

Returns the Gaudi device ID associated with the current process in the HCCL communicator comm. The device ID can be used, for example, to allocate device memory. As there may be only one Gaudi device opened within a process, the returned value will always be 0.

Parameters:

  • comm - [in] The communicator created by hcclCommInitRank.

  • device - [out] Gaudi device ID associated with comm.

2.3.3.4. hcclCommCount

hcclResult_t hcclCommCount(const hcclComm_t comm, int* count)

Operation:

Returns the total number of ranks in the HCCL communicator comm.

Parameters:

  • comm - [in] The communicator.

  • count - [out] The total number of ranks.

2.3.3.5. hcclCommUserRank

hcclResult_t hcclCommUserRank(const hcclComm_t comm, int* rank)

Operation:

Returns the rank assigned to the current process in the HCCL communicator comm.

Parameters:

  • comm - [in] The communicator.

  • rank - [out] The rank.

2.3.3.6. hcclCommDestroy

hcclResult_t hcclCommDestroy(hcclComm_t comm)

Operation:

Destroys the HCCL communicator.

Parameters:

  • comm - [in] The communicator to destroy.

Return Value:

The status of the operation.

2.3.3.7. hcclBroadcast

hcclResult_t hcclBroadcast(const void* sendbuff, void* recvbuff,
size_t count, hcclDataType_t datatype, int root, hcclComm_t comm,
hcclStream_t stream_handle)

Operation:

Broadcasts data from the sendbuff of a single rank to the recvbuff of all other ranks that are part of the HCCL communicator comm.

Parameters:

  • sendbuff - [in] Address of the send buffer of the root rank.

  • recvbuff - [out] Address of the receive buffer of the receiving rank.

  • count - [in] The number of elements.

  • datatype - [in] The datatype of the operation.

  • root - [in] The rank of the root of the broadcast operation.

  • comm - [in] The communicator on which to broadcast.

  • stream_handle - [in] HCCL stream handle.

Return Value:

The status of the operation.
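
A usage sketch; dev_send, dev_recv, hccl_comm, and stream are assumed to have been created earlier with hcclMalloc, hcclCommInitRank, and hcclStreamCreate:

const int root = 0;
const size_t count = 4;  // Number of float elements.

// dev_send holds valid data on the root rank; dev_recv receives the broadcast on every rank.
hcclBroadcast(dev_send, dev_recv, count, hcclFloat32, root, hccl_comm, stream);

// Thanks to the implicit dependency tracking (Section 2.2.6), dev_recv may be used immediately
// as an input to another collective or memory copy operation.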

2.3.3.8. hcclAllReduce

hcclResult_t hcclAllReduce(const void* sendbuff, void* recvbuff,
size_t count, hcclDataType_t datatype, hcclRedOp_t op, hcclComm_t comm,
hcclStream_t stream_handle)

Operation:

Reduces data from sendbuff across the entire HCCL communicator comm and distributes the results to recvbuff of every rank that is part of comm.

Parameters:

  • sendbuff - [in] Buffer that will be sent.

  • recvbuff - [out] Buffer that will be written into.

  • count - [in] The number of elements of the buffer that is reduced.

  • datatype - [in] The datatype of the operation.

  • op - [in] The reduction operation to perform (summation for example).

  • comm - [in] The communicator on which to perform the reduction operation.

  • stream_handle - [in] HCCL stream handle.

Return Value:

The status of the operation.

2.3.3.9. hcclReduce

hcclResult_t hcclReduce(const void* sendbuff, void* recvbuff, size_t
count, hcclDataType_t datatype, hcclRedOp_t op, int root, hcclComm_t
comm, hcclStream_t stream_handle)

Operation:

Reduces data from sendbuff across the entire HCCL communicator comm. The result will be placed only in the recvbuff of the rank designated as root.

Parameters:

  • sendbuff - [in] Buffer that will be sent.

  • recvbuff - [out] Buffer that will be written into on the root rank.

  • count - [in] The number of elements of the buffer that is reduced.

  • datatype - [in] The datatype of the operation.

  • op - [in] The reduction operation to perform (summation for example).

  • root - [in] The rank that will contain the result.

  • comm - [in] The communicator on which to perform the reduction operation.

  • stream_handle - [in] HCCL stream handle.

Return Value:

The status of the operation.

2.3.3.10. hcclReduceScatter

hcclResult_t hcclReduceScatter(const void* sendbuff, void* recvbuff,
size_t recvcount, hcclDataType_t datatype, hcclRedOp_t op, hcclComm_t
comm, hcclStream_t stream_handle)

Operation:

Performs a Reduce-Scatter operation on data across the entire HCCL communicator comm.

Parameters:

  • sendbuff - [in] Buffer that will be sent.

  • recvbuff - [out] Buffer that will be written into.

  • recvcount - [in] The number of elements of the buffer that is reduced.

  • datatype - [in] The datatype of the operation.

  • op - [in] The reduction operation to perform (summation for example).

  • comm - [in] The communicator on which to perform the reduction operation.

  • stream_handle - [in] HCCL stream handle.

Return Value:

The status of the operation.
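
A hedged buffer-sizing sketch, assuming NCCL-compatible semantics in which each rank receives recvcount elements and the send buffer holds recvcount elements per rank (this sizing is an assumption based on the NCCL interface that HCCL imitates); hccl_comm and stream are assumed to exist:

const size_t recvcount = 4;          // Elements received by each rank.
int nranks = 0;
hcclCommCount(hccl_comm, &nranks);   // Total number of ranks in the communicator.

void* dev_send{};
void* dev_recv{};
hcclMalloc(&dev_send, recvcount * nranks * sizeof(float));  // One chunk per rank.
hcclMalloc(&dev_recv, recvcount * sizeof(float));           // This rank's reduced chunk.

hcclReduceScatter(dev_send, dev_recv, recvcount, hcclFloat32, hcclSum, hccl_comm, stream);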

2.3.3.11. hcclAllGather

hcclResult_t hcclAllGather(const void* sendbuff, void* recvbuff,
size_t sendcount, hcclDataType_t datatype, hcclComm_t comm, hcclStream_t
stream_handle)

Operation:

Performs allgather across the entire HCCL communicator comm.

Parameters:

  • sendbuff - [in] Buffer that will be sent.

  • recvbuff - [out] Buffer that will be written into.

  • sendcount - [in] The number of elements of the buffer that is moved.

  • datatype - [in] The datatype of the operation.

  • comm - [in] The communicator on which to perform the operation.

  • stream_handle - [in] HCCL stream handle.

Return Value:

The status of the operation.
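
A hedged buffer-sizing sketch, assuming NCCL-compatible semantics in which each rank contributes sendcount elements and recvbuff on every rank holds sendcount * nranks elements (again, an assumption based on the NCCL interface that HCCL imitates); hccl_comm and stream are assumed to exist:

const size_t sendcount = 4;          // Elements contributed by each rank.
int nranks = 0;
hcclCommCount(hccl_comm, &nranks);   // Total number of ranks in the communicator.

void* dev_send{};
void* dev_recv{};
hcclMalloc(&dev_send, sendcount * sizeof(float));
hcclMalloc(&dev_recv, sendcount * nranks * sizeof(float));  // Gathered result from all ranks.

// ... fill dev_send with this rank's contribution (e.g. via hcclMemcpy host->device) ...

hcclAllGather(dev_send, dev_recv, sendcount, hcclFloat32, hccl_comm, stream);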

2.3.4. Example

The following C++ code is an example of setting up HCCL and performing hcclAllReduce on a buffer of 4 floats:

// C++ Standard Libraries
#include <iostream>
#include <exception>
#include <cmath>    // std::abs
#include <string>   // std::string, std::to_string
#include <vector>   // std::vector

// Open MPI
#include "mpi.h"

// HCCL :: Habana Collective Communications Library
#include <hccl/hccl.h>

#define CHECK_MPI_STATUS(x)                                                             \
  {                                                                                     \
    const auto _res = (x);                                                              \
    if (_res != MPI_SUCCESS)                                                            \
      throw std::runtime_error{"In function " + std::string{__FUNCTION__} +             \
                               "(): " #x " failed with code: " + std::to_string(_res)}; \
  }

#define CHECK_HCCL_STATUS(x)                                                      \
  {                                                                               \
    const auto _res = (x);                                                        \
    if (_res != hcclSuccess)                                                      \
      throw std::runtime_error{"In function " + std::string{__FUNCTION__} +       \
                               "(): " #x " failed: " + hcclGetErrorString(_res)}; \
  }

std::ostream& log() { return std::cout; }

template <typename T>
std::ostream& operator<<(std::ostream& out, const std::vector<T>& data) {
  out << '[';
  if (!data.empty()) {
    out << data[0];
    for (size_t i = 1; i < data.size(); ++i) {
      out << " " << data[i];
    }
  }
  return out << ']';
}

int main(int argc, char* argv[]) {
  try {
    log() << "Running HCCL Example :: A simple program demonstrating HCCL usage from C++" << std::endl;

    // Initialize the Open MPI execution context.
    CHECK_MPI_STATUS(MPI_Init(&argc, &argv));

    // Get MPI rank and size.
    int mpi_rank{};
    CHECK_MPI_STATUS(MPI_Comm_rank(MPI_COMM_WORLD, &mpi_rank));

    int mpi_size{};
    CHECK_MPI_STATUS(MPI_Comm_size(MPI_COMM_WORLD, &mpi_size));

    if (mpi_size < 2) {
      throw std::runtime_error{"There is only one MPI rank. Run this program using 'mpirun' or 'mpiexec'"};
    }

    // Initialize HCCL.
    // As TensorFlow is not used here, HCCL will create a new HPU device and use its own HPU allocator.
    // This step can be omitted only if there is another agent that has already initialized HCCL.
    // Habana's version of Horovod with HCCL backend does that by instantiating hccl::gpu_context class.
    // hcclDeviceInit() is a C function which internally creates a new instance of hccl::gpu_context class.
    // If you prefer the C++ API, you can use it directly, either by creating a new instance of this class,
    // or accessing an existing instance using gpu_context::instance().
    //
    int device_handle{};
    const int device_cardinal_id = 0;  // There may be only one HPU device per process, so it is always 0.

    CHECK_HCCL_STATUS(hcclDeviceInit(&device_handle, device_cardinal_id));

    // Generate Unique Id on rank 0 and propagate it to other ranks using Open MPI.
    //
    hcclUniqueId unique_id{};
    constexpr int master_mpi_rank = 0;

    if (mpi_rank == master_mpi_rank) {
      CHECK_HCCL_STATUS(hcclGetUniqueId(&unique_id));
    }

    CHECK_MPI_STATUS(MPI_Bcast(&unique_id, sizeof(unique_id), MPI_BYTE, master_mpi_rank, MPI_COMM_WORLD));

    // Create a new HCCL communicator.
    hcclComm_t hccl_comm{};
    CHECK_HCCL_STATUS(hcclCommInitRank(&hccl_comm, mpi_size, unique_id, mpi_rank));

    // Create a new stream handle. There is no need to destroy it afterwards.
    hcclStream_t stream{};
    CHECK_HCCL_STATUS(hcclStreamCreate(&stream));

    // Allocate some buffer on the HPU device.
    // NOLINTNEXTLINE(cppcoreguidelines-avoid-magic-numbers)
    auto input_host_data = std::vector<float>{1.5f, -2.0f, -0.5f, 3.0f};
    const size_t data_size = 4 * sizeof(float);

    void* dev_ptr{};
    CHECK_HCCL_STATUS(hcclMalloc(&dev_ptr, data_size));
    CHECK_HCCL_STATUS(hcclMemcpy(dev_ptr, &input_host_data.front(), data_size, hcclMemcpyHostToDevice, stream));

    // There is no need to call hcclWaitForCompletion(stream).
    // A synchronization on dev_ptr device address will be automatic.

    // Perform an All-Reduce operation on the device buffer.
    // Note: the count argument is the number of elements (4 floats here), not the size in bytes.
    CHECK_HCCL_STATUS(hcclAllReduce(dev_ptr, dev_ptr, input_host_data.size(), hcclFloat32, hcclSum, hccl_comm, stream));

    // There is no need to call hcclWaitForCompletion(stream).
    // A synchronization on dev_ptr device address will be automatic.

    // Copy the data back to the host memory.
    auto output_host_data = std::vector<float>(4);
    CHECK_HCCL_STATUS(hcclMemcpy(&output_host_data.front(), dev_ptr, data_size, hcclMemcpyDeviceToHost, stream));

    // There is no need to call hcclWaitForCompletion(stream), as device->host memcopy is blocking.

    // Check if the data has been reduced correctly.
    bool is_ok = true;
    for (size_t i = 0; i < input_host_data.size(); ++i) {
      // NOLINTNEXTLINE(cppcoreguidelines-avoid-magic-numbers)
      if (std::abs(output_host_data[i] - static_cast<float>(mpi_size) * input_host_data[i]) > 1e-10f) {
        is_ok = false;
      }
    }

    log() << "Buffer " << input_host_data << " reduced to " << output_host_data << " which is "
          << (is_ok ? "fine." : "bad.") << std::endl;

    // Free up resources.
    CHECK_HCCL_STATUS(hcclFree(dev_ptr));

    // Destroy the HCCL communicator.
    CHECK_HCCL_STATUS(hcclCommDestroy(hccl_comm));

    // Clean up HCCL.
    CHECK_HCCL_STATUS(hcclDeviceFree(device_handle));

    // Clean up Open MPI.
    CHECK_MPI_STATUS(MPI_Finalize());

  } catch (const std::exception& ex) {
    log() << "error: " << ex.what() << std::endl;
    return -1;
  }

  return 0;
}