1. Habana Collective Communications Library (HCCL) API Reference

1.1. Overview

Habana Collective Communications Library (HCCL) is Habana’s emulation layer of the NVIDIA Collective Communication Library (NCCL). It operates on top of the Habana Communication Library (HCL), which is included in the SynapseAI® Software library. See Habana Communication Library (HCL) API Reference.

HCCL provides the same list of collective communication primitives as NCCL, imitating the NCCL interface for user convenience. HCCL also allows for point-to-point communication.

The following lists the supported collective primitives:

  • AllReduce

  • Broadcast

  • Reduce

  • AllGather

  • ReduceScatter

  • Send (P2P)

  • Recv (P2P)

HCCL has been written for seamless integration with the Habana TensorFlow modules, sharing device ownership and the synchronization mechanism with them. It also provides a set of functions for interacting with the device (moving data, allocating memory, etc.).

Warning

All HCL-related restrictions and usage guidelines also apply to HCCL. See Habana Communication Library (HCL) API Reference.

1.2. Using HCCL

1.2.1. HCCL Library

The HCCL library is provided as a dynamically-linked library named libhccl.so.X.Y.Z, where X.Y.Z denotes the TensorFlow version whose Abseil ABI the library is compatible with. For instance, a library named libhccl.so.2.4.1 has an Abseil ABI compatible with TensorFlow 2.4.1. However, the HCCL library does not depend on TensorFlow and technically may be used without it.

Note

The library name suffix, as well as the Abseil dependency, will be removed in future releases.

To access HCCL C API from C/C++:

#include <hccl/hccl.h>

This header defines HCCL symbols, for example the hcclGetVersion function. For symbols which have a counterpart in the NCCL API, portability macros are defined as well:

#define ncclGetVersion hcclGetVersion
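
As a minimal sketch of how these macros can be used, existing NCCL-style code can be recompiled against HCCL. The snippet below assumes hcclGetVersion mirrors the int* signature of NCCL's ncclGetVersion, which is not spelled out in this document:

#include <hccl/hccl.h>
#include <cstdio>

int main() {
  int version = 0;
  // ncclGetVersion expands to hcclGetVersion through the portability macro.
  if (ncclGetVersion(&version) != hcclSuccess) {
    std::printf("Failed to query the HCCL version\n");
    return 1;
  }
  std::printf("HCCL (NCCL-compatible) version: %d\n", version);
  return 0;
}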

1.2.2. MPI

Throughout this document, OpenMPI is used in all examples. However, HCCL does not depend on OpenMPI and can be built and used without it.

In an HCCL application, MPI (or an equivalent mechanism) is required for:

  • Broadcasting unique_id to all workers.

  • Spawning all the worker processes on multiple nodes.

The following is an example of running an HCCL application with MPI:

mpirun -np 8 --tag-output my_program

In addition, any program using both HCCL and MPI must initialize the MPI execution context before using MPI to broadcast unique_id:

MPI_Init(&argc, &argv);
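
A condensed sketch of this flow, based on the full example at the end of this document (error checking omitted for brevity):

MPI_Init(&argc, &argv);

int mpi_rank = 0, mpi_size = 0;
MPI_Comm_rank(MPI_COMM_WORLD, &mpi_rank);
MPI_Comm_size(MPI_COMM_WORLD, &mpi_size);

// Rank 0 generates unique_id; MPI broadcasts it to all other workers.
hcclUniqueId unique_id{};
if (mpi_rank == 0) {
    hcclGetUniqueId(&unique_id);
}
MPI_Bcast(&unique_id, sizeof(unique_id), MPI_BYTE, 0, MPI_COMM_WORLD);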

1.2.3. HCCL Runtime Initialization

Before the HCCL communicator is created, the Gaudi device needs to be acquired using Synapse API calls.

synDeviceId device_handle{};
const synModuleId device_module_id {mpi_rank % MODULES_PER_NODE};
synDeviceAcquireByModuleId(&device_handle, device_module_id);

This is a simple flow for opening the device. device_module_id should be a number in the range [0, MODULES_PER_NODE-1], and mpi_rank is used to assign a different device to every process.

MODULES_PER_NODE is the number of devices available on a single host.

Prior to exiting, the program should release the device in Synapse by calling synDeviceRelease with the handle received from the synDeviceAcquireByModuleId call.

synDeviceRelease(device_handle);

Warning

Currently, HCCL supports only a single device per process, and by default all HCCL calls use the device that was acquired first within the given process context.

1.2.4. HCCL Environment Variables

During the HCCL initialization phase, the environment variables listed below are used.

  • HCCL_BASE_PORT - Port range used for sideband communication. Default if not set: 45000.

  • HCCL_SOCKET_IFNAME - Prefix of the name of the network interface used for sideband communication. Default if not set: auto-detected (see description).

  • HCL_CONFIG_PATH - Path of the configuration file for HCL. See Habana Communication Library (HCL) API Reference. If not set: ignored.

  • HCL_COMM_ID - IP address and port of the rank used as a coordinator for the bootstrap network. Syntax: “{IP}:{PORT}”. If not set: ignored.

1.2.4.1. HCCL_BASE_PORT

HCCL_BASE_PORT defines the beginning of the port range that should be used for HCCL sideband TCP communication. The ports used by HCCL are in range [HCCL_BASE_PORT, HCCL_BASE_PORT+100].

If HCL_CONFIG_PATH is not provided, ports from this range are also assigned for the underlying HCL communicator.

1.2.4.2. HCCL_SOCKET_IFNAME

HCCL_SOCKET_IFNAME defines the prefix of the network interface name that is used for HCCL sideband TCP communication. If not set, the first network interface with a name that does not start with lo or docker will be used.
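
For example, to restrict sideband communication to interfaces whose names start with a given prefix, the variable can be set on the launch command line (the interface prefix and port below are only illustrative):

HCCL_SOCKET_IFNAME=eth HCCL_BASE_PORT=50000 mpirun -np 8 --tag-output my_program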

1.2.4.3. HCL_CONFIG_PATH

HCL_CONFIG_PATH defines the location of the HCL configuration file (see Habana Communication Library (HCL) API Reference). If set, make sure that the topology described in the HCL config file matches the communicator you want to create in HCCL.

Warning

If HCL_CONFIG_PATH is set, HCL uses either the port provided in the JSON config file or its own default if no HCL_PORT is provided. The port will not be taken from the range based on HCCL_BASE_PORT.

1.2.4.4. HCL_COMM_ID

HCL_COMM_ID defines the IP address and port used for the bootstrap network. The value represents the socket on which the coordinator rank will wait for other ranks to join.

If HCL_COMM_ID is not defined, HCCL will rely on the socket provided by unique_id.

Note

If this variable is used, an identical value must be set for every worker.
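
For example, the coordinator socket can be passed on the launch command line (the address and port below are only illustrative):

HCL_COMM_ID=10.111.14.155:9696 mpirun -np 8 --tag-output my_program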

1.2.5. HCCL Streams and Asynchronicity of Issued Operations

Synchronization between communication and compute is done using Synapse API stream calls. Before calling HCCL, you first need to acquire streams from Synapse:

synStreamHandle collective_stream{};
synStreamHandle device_to_host_stream{};
synStreamHandle host_to_device_stream{};

synStreamCreate(&collective_stream, device_handle, STREAM_TYPE_NETWORK_COLLECTIVE, 0);
synStreamCreate(&device_to_host_stream, device_handle, STREAM_TYPE_COPY_DEVICE_TO_HOST, 0);
synStreamCreate(&host_to_device_stream, device_handle, STREAM_TYPE_COPY_HOST_TO_DEVICE, 0);

In the above example, synStreamCreate is called multiple times to obtain different streams for different purposes.

All collective operations are asynchronous, implemented as non-blocking calls. After an asynchronous call, another collective operation may be issued immediately, as long as it uses the same Synapse stream.

When the next operation uses a different stream, synchronization needs to be added. It can be done either in a blocking manner, using synStreamSynchronize, or in a non-blocking manner, using synEventRecord and synStreamWaitEvent pair.

...
hcclAllReduce(input_dev_ptr, output_dev_ptr, elem_cnt,
    hcclFloat32, hcclSum, hccl_comm, collective_stream);
// Create event that will mark end of *hcclAllReduce* operation
synEventHandle allreduce_event;
synEventCreate(&allreduce_event, device_handle, 0);
synEventRecord(allreduce_event, collective_stream, 0);
// Signal that all the work on *device_to_host_stream* should wait for *allreduce_event*
synStreamWaitEvent(device_to_host_stream, allreduce_event, 0);
// Schedule copy request from device to host
synMemCopyAsync(device_to_host_stream, output_dev_ptr, data_size, host_buffer_ptr, DRAM_TO_HOST);
// Wait (in blocking manner) for data to reach the host
synStreamSynchronize(device_to_host_stream);
...

The above example shows how to synchronize calls to hcclAllReduce with a copy of data to the host. After all operations are submitted on the streams, the blocking synStreamSynchronize is called to wait until all the data has been copied to the host. More information can be found in the SynapseAI Training API Reference documentation.

1.2.6. Scale-Out via Host-NIC over TCP

Multi-node scale-out for Gaudi accelerator devices is supported via host NIC interfaces. This functionality is provided by the HCCL library only. The data transfer between nodes happens over TCP connections between host NIC interfaces. HCCL can use multiple TCP connections between communicating nodes in order to maximize bandwidth utilization.

1.2.6.1. Configuration Knobs

HCCL provides several environment variables that control the behavior of the TCP connections used for scale-out communication over host NICs. These environment variables are listed below.

  • HCCL_OVER_TCP - Enables scale-out communication over TCP. Possible values are 0 (disable) or 1 (enable). Default value is 0.

  • HCCL_SOCKET_IFNAME - Identifies the network interface(s) that should be used for scale-out communication.

  • HCCL_DEFAULT_NIC_COUNT - Limits the number of network interfaces used by HCCL. Selects a subset (the first N) of all network interfaces detected based on the HCCL_SOCKET_IFNAME setting. Default value is 4.

  • HCL_COMM_ID - Identifies the root process (rank 0) of the global communicator group. Typically set to <IP address>:<port>, where the IP address is that of the network interface used by the root process. This must be set for all HCCL processes when there is no alternate network over which to broadcast it.

  • HCCL_SOCKET_NTHREADS - Specifies the total number of CPU threads (per HCCL process) created to handle the data communication over TCP sockets.

  • HCCL_NSOCKS_PERTHREAD - Specifies the number of TCP socket connections served by a single CPU thread.

1.2.6.2. Usage

To use scale-out communication over host NICs, at least 2 nodes are required in the global communicator. One HCCL process is run for every Gaudi accelerator device; in a setup with 2 nodes and 8 Gaudi devices on each node, there would be a total of 16 HCCL processes (ranks), 8 on each node. The recommended method is to launch each process with the appropriate environment settings; some variables, such as HCCL_SOCKET_IFNAME, may take different values for different processes.

1.2.6.3. Example

Run all-reduce on 2 nodes with 8 ranks each:

HCCL_SOCKET_IFNAME=eth1,eth2,eth3,eth4 HCCL_DEFAULT_NIC_COUNT=4 \
HCCL_COMM_ID=10.111.14.155:9696 HCCL_SOCKET_NTHREADS=2 HCCL_NSOCKS_PERTHREAD=3 \
HCCL_OVER_TCP=1 python3.6 run_hccl_demo.py -nranks 16 -node_id 1 -test all_reduce
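
Assuming the same demo script and flags, the second node is launched with the same settings except for a different node ID, for example:

HCCL_SOCKET_IFNAME=eth1,eth2,eth3,eth4 HCCL_DEFAULT_NIC_COUNT=4 \
HCCL_COMM_ID=10.111.14.155:9696 HCCL_SOCKET_NTHREADS=2 HCCL_NSOCKS_PERTHREAD=3 \
HCCL_OVER_TCP=1 python3.6 run_hccl_demo.py -nranks 16 -node_id 0 -test all_reduce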

1.3. C API

1.3.1. Infrastructure

1.3.1.1. hcclStream_t

Opaque handle to HCCL stream. Translates directly to synStreamHandle from synapse_api.h

typedef synStreamHandle hcclStream_t;

1.3.1.2. hcclComm_t

Opaque handle to HCCL communicator.

1.3.1.3. hcclUniqueId

Unique Id to be used with hcclCommInitRank. All HCCL communicators within a communicator clique have the same Unique Id.

1.3.1.4. hcclDataType_t

The following HCCL data types are currently supported by the collective operations:

  • hcclFloat32, hcclFloat

  • hcclBfloat16 (added by Habana)

1.3.1.5. hcclResult_t

The following HCCL result types are currently supported:

  • hcclSuccess

  • hcclSystemError

  • hcclInvalidArgument

  • hcclInternalError

  • hcclInvalidUsage

1.3.1.6. hcclRedOp_t

The following Ops are supported:

  • hcclSum

Warning

hcclProd, hcclMax, and hcclMin are not supported by HCCL.

1.3.1.7. hcclMemcpyKind_t

Specifies the direction of memory copy transfer used in function hcclMemcpy. The following values are supported:

  • hcclMemcpyHostToDevice - Source is in the process memory, destination is a device address.

  • hcclMemcpyDeviceToHost - Source is in a device address, destination is in the process memory.

  • hcclMemcpyDeviceToDevice - Both source and destination are device addresses.

Warning

hcclMemcpyDefault (inferring the address kind from pointers) is not supported by HCCL.

1.3.2. HCCL-specific API

Habana supports one device per process, and one or multiple nodes. The supported API functions listed below make up the minimum set of functions needed to run HCCL on Habana® Gaudi® devices.

1.3.2.1. hcclDeviceInit

Operation:

Initializes the HCCL runtime and opens an HPU device. The device is reused if it was already opened earlier in the process. If a device has already been opened by TensorFlow with the Habana support modules, HCCL will reuse the memory allocator provided by those modules. Otherwise, HCCL will use a simple implementation of a memory allocator. The ‘ordinal’ parameter is currently unused and must be set to 0.

Parameters:

  • device_handle - [out] Device handle of the acquired device.

  • ordinal - [in] Reserved for future use; must be 0.

Return Value:

The status of the operation.

1.3.2.2. hcclDeviceFree

Operation:

Releases the HPU device and cleans up HCCL runtime.

Parameters:

  • device_handle - [in] Device handle to be closed.

Return Value:

The status of the operation.

1.3.2.3. hcclMalloc

Operation:

Allocates memory on the HPU device.

Parameters:

  • dev_ptr - [out] Allocated device address.

  • size - [in] Size in bytes to be allocated.

Return Value:

The status of the operation.

1.3.2.4. hcclFree

Operation:

Frees memory from the HPU device.

Parameters:

  • dev_ptr - [in] Device address to be freed.

Return Value:

The status of the operation.

1.3.2.5. hcclMemcpy

Operation:

Copies data between host and device.

Parameters:

  • dst - [out] Destination address. Can be a host or device address, depending on the kind parameter.

  • src - [in] Source address. Can be a host or device address, depending on the kind parameter.

  • count - [in] Size in bytes to be copied.

  • kind - [in] Direction of the copy operation, i.e. placement of the dst and src addresses on host/device.

Return Value:

The status of the operation.

1.3.2.6. hcclWaitForCompletion

Operation:

A blocking call which waits until issued asynchronous memory copy and collective operations are finished.

Parameters:

  • stream_handle - [in] HCCL stream handle.

Return Value:

The status of the operation.

1.3.3. NCCL-Compatible API

Habana supports one device per process and a single node (up to 8 devices). The supported API functions are listed below; other HCCL API functions are no-ops. The supported API functions make up the minimum set of functions needed to run HCCL on Gaudi devices.

1.3.3.1. hcclGetUniqueId

hcclResult_t hcclGetUniqueId(hcclUniqueId* uniqueId)

Operation:

One of the two functions used for creating an HCCL communicator. Creates a uniqueId required by hcclCommInitRank.

This function should be called by only one process within a new communicator. The obtained uniqueId is required for HCCL communicator initialization and must be distributed to the other ranks using a separate communication channel (e.g. MPI).

Parameters:

  • uniqueId - [out] commId used in hcclCommInitRank.

Return Value:

The status of the operation.

1.3.3.2. hcclCommInitRank

hcclResult_t hcclCommInitRank(hcclComm_t* comm, int nranks,
hcclUniqueId commId, int rank)

Operation:

Creates an HCCL communicator with the given commId. This call is blocking and will return only after being called on every rank of the new communicator.

commId should be obtained by hcclGetUniqueId on one of the ranks and distributed using a separate communication channel (e.g. MPI).

Each HCCL rank of the created communicator is associated with a single Gaudi device.

Parameters:

  • comm - [out] Handle of the initialized HCCL communicator.

  • nranks - [in] Total number of ranks in the created communicator clique. It can be 2, 4, or 8.

  • commId - [in] Communicator Id returned by hcclGetUniqueId. Communicators created using the same Id belong to the same communicator clique.

  • rank - [in] The rank of the current device/process.

Return Value:

The status of the operation.
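
A condensed sketch of the initialization performed on every rank, assuming unique_id has already been generated by rank 0 with hcclGetUniqueId and distributed to all ranks (e.g. with MPI_Bcast, as shown in the MPI section):

// Every rank passes the same commId; the call blocks until all
// nranks ranks of the new communicator have joined.
hcclComm_t hccl_comm{};
hcclCommInitRank(&hccl_comm, nranks, unique_id, rank);

// ... issue collective operations on hccl_comm ...

// Release the communicator when it is no longer needed.
hcclCommDestroy(hccl_comm);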

1.3.3.3. hcclCommCuDevice

hcclResult_t hcclCommCuDevice(const hcclComm_t comm, int* device)

Operation:

Returns the Gaudi device ID associated with the current process in the HCCL communicator comm. The device ID can be used, for example, to allocate device memory. As there may be only one Gaudi device opened within a process, the returned value will always be 0.

Parameters:

  • comm - [in] The communicator created by hcclCommInitRank.

  • device - [out] Gaudi device ID associated with comm.

1.3.3.4. hcclCommCount

hcclResult_t hcclCommCount(const hcclComm_t comm, int* count)

Operation:

Returns the total number of ranks in the HCCL communicator comm.

Parameters:

  • comm - [in] The communicator.

  • count - [out] The total number of ranks.

1.3.3.5. hcclCommUserRank

hcclResult_t hcclCommUserRank(const hcclComm_t comm, int* rank)

Operation:

Returns the rank assigned to the current process in the HCCL communicator comm.

Parameters:

  • comm - [in] The communicator.

  • rank - [out] The rank.
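
As a minimal sketch, both query functions can be combined to report the position of the current process within an existing communicator hccl_comm:

int nranks = 0;
int my_rank = 0;
hcclCommCount(hccl_comm, &nranks);
hcclCommUserRank(hccl_comm, &my_rank);
printf("HCCL rank %d out of %d ranks\n", my_rank, nranks);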

1.3.3.6. hcclCommDestroy

hcclResult_t hcclCommDestroy(hcclComm_t comm)

Operation:

Destroys the HCCL communicator.

Parameters:

  • comm - [in] The communicator to destroy.

Return Value:

The status of the operation.

1.3.3.7. hcclBroadcast

hcclResult_t hcclBroadcast(const void* sendbuff, void* recvbuff,
size_t count, hcclDataType_t datatype, int root, hcclComm_t comm,
hcclStream_t stream_handle)

Operation:

Broadcasts data from the sendbuff of a single rank (the root) to the recvbuff of all other ranks that are part of the HCCL communicator comm.

Parameters:

  • sendbuff - [in] Address of the send buffer on the root rank.

  • recvbuff - [out] Address of the receive buffer on each receiving rank.

  • count - [in] The number of elements.

  • datatype - [in] The datatype of the operation.

  • root - [in] The rank of the root of the broadcast operation.

  • comm - [in] The communicator on which to broadcast.

  • stream_handle - [in] HCCL stream handle.

Return Value:

The status of the operation.
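
A single call, shown here in condensed form from the example at the end of this document (the device buffers and the stream are assumed to have been prepared as in the earlier sections):

// Rank `root` sends the contents of input_dev_ptr; every rank receives
// count elements of type float into output_dev_ptr.
hcclBroadcast((const void*)input_dev_ptr, (void*)output_dev_ptr, count,
              hcclFloat32, root, hccl_comm, collective_stream);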

1.3.3.8. hcclAllReduce

hcclResult_t hcclAllReduce(const void* sendbuff, void* recvbuff,
size_t count, hcclDataType_t datatype, hcclRedOp_t op, hcclComm_t comm,
hcclStream_t stream_handle)

Operation:

Reduces data from sendbuff across the entire HCCL communicator comm and distributes the results to recvbuff of every rank that is part of comm.

Parameters:

  • sendbuff - [in] Buffer that will be sent.

  • recvbuff - [out] Buffer that will be written into.

  • count - [in] The number of elements of the buffer that is reduced.

  • datatype - [in] The datatype of the operation.

  • op - [in] The reduction operation to perform (summation for example).

  • comm - [in] The communicator on which to perform the reduction operation.

  • stream_handle - [in] HCCL stream handle.

Return Value:

The status of the operation.

1.3.3.9. hcclReduce

hcclResult_t hcclReduce(const void* sendbuff, void* recvbuff, size_t
count, hcclDataType_t datatype, hcclRedOp_t op, int root, hcclComm_t
comm, hcclStream_t stream_handle)

Operation:

Reduces data from sendbuff across the entire HCCL communicator comm. The result will be placed only in the recvbuff of the rank designated as root.

Parameters:

  • sendbuff - [in] Buffer that will be sent.

  • recvbuff - [out] Buffer that will be written into on the root rank.

  • count - [in] The number of elements of the buffer that is reduced.

  • datatype - [in] The datatype of the operation.

  • op - [in] The reduction operation to perform (summation for example).

  • root - [in] The rank that will contain the result.

  • comm - [in] The communicator on which to perform the reduction operation.

  • stream_handle - [in] HCCL stream handle.

Return Value:

The status of the operation.

1.3.3.10. hcclReduceScatter

hcclResult_t hcclReduceScatter(const void* sendbuff, void* recvbuff,
size_t recvcount, hcclDataType_t datatype, hcclRedOp_t op, hcclComm_t
comm, hcclStream_t stream_handle)

Operation:

Performs a Reduce-Scatter operation across the entire HCCL communicator comm.

Parameters:

  • sendbuff - [in] Buffer that will be sent.

  • recvbuff - [out] Buffer that will be written into.

  • recvcount - [in] The number of elements of the buffer that is reduced.

  • datatype - [in] The datatype of the operation.

  • op - [in] The reduction operation to perform (summation for example).

  • comm - [in] The communicator on which to perform the reduction operation.

  • stream_handle - [in] HCCL stream handle.

Return Value:

The status of the operation.
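
Assuming the NCCL convention that HCCL imitates, recvcount is the number of elements each rank receives, so sendbuff must hold recvcount * nranks elements. A minimal sketch:

// Each rank contributes recvcount * nranks elements and receives the reduced
// chunk corresponding to its own rank index (total_elements assumed divisible).
size_t recvcount = total_elements / nranks;
hcclReduceScatter((const void*)input_dev_ptr, (void*)output_dev_ptr, recvcount,
                  hcclFloat32, hcclSum, hccl_comm, collective_stream);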

1.3.3.11. hcclAllGather

hcclResult_t hcclAllGather(const void* sendbuff, void* recvbuff,
size_t sendcount, hcclDataType_t datatype, hcclComm_t comm, hcclStream_t
stream_handle)

Operation:

Performs allgather across the entire HCCL communicator comm.

Parameters:

  • sendbuff - [in] Buffer that will be sent.

  • recvbuff - [out] Buffer that will be written into.

  • sendcount - [in] The number of elements of the buffer that is moved.

  • datatype - [in] The datatype of the operation.

  • comm - [in] The communicator on which to perform the operation.

  • stream_handle - [in] HCCL stream handle.

Return Value:

The status of the operation.
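
Assuming the NCCL convention that HCCL imitates, sendcount is the per-rank element count, and recvbuff must be large enough to hold sendcount * nranks elements. A minimal sketch:

// Every rank contributes sendcount elements; the gathered result of
// sendcount * nranks elements is written to recvbuff on every rank.
hcclAllGather((const void*)input_dev_ptr, (void*)output_dev_ptr, sendcount,
              hcclFloat32, hccl_comm, collective_stream);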

1.3.4. Example

The following C++ code is an example of setting up HCCL and performing hcclBroadcast or hcclAllReduce on a device buffer of floats:

// C++ Standard Libraries
#include <iostream>
#include <exception>
#include <vector>
#include <cstdlib>
#include <chrono>
#include <iomanip>
#include <unistd.h>
// Open MPI (v4.0.2)
#include <mpi.h>

// HCCL :: Habana Collective Communications Library
#include <hccl/hccl.h>

// Synapse :: Habana Synapse training API
#include <synapse_api.h>

using namespace std;

#define HLS1H_MODULE_CNT 4
#define HLS1_MODULE_CNT 8

// ------------------------------------------------------------------------------------------------

#define CHECK_MPI_STATUS(x)                                                             \
{                                                                                       \
    const auto _res = (x);                                                              \
    if (_res != MPI_SUCCESS)                                                            \
    throw std::runtime_error{"In function " + std::string{__FUNCTION__} +               \
                            "(): " #x " failed with code: " + std::to_string(_res)};    \
}

#define CHECK_HCCL_STATUS(x)                                                      \
{                                                                                 \
    const auto _res = (x);                                                        \
    if (_res != hcclSuccess)                                                      \
    throw std::runtime_error{"In function " + std::string{__FUNCTION__} +         \
                            "(): " #x " failed: " + hcclGetErrorString(_res)};    \
}

#define CHECK_SYNAPSE_STATUS(x)                                                                    \
{                                                                                                  \
    const auto _res = (x);                                                                         \
    if (_res != synSuccess)                                                                        \
    throw std::runtime_error{"In function " + std::string{__FUNCTION__} +                          \
                            "(): " #x " failed with synapse error: " + std::to_string((_res))};    \
}

std::ostream& log() { return std::cout; }

template <typename T>
std::ostream& operator<<(std::ostream& out, const std::vector<T>& data) {
out << '[';
if (!data.empty()) {
    out << data[0];
    for (size_t i = 1; i < data.size(); ++i) {
    out << " " << data[i];
    }
}
return out << ']';
}

std::string get_test_type() {
static const auto default_hccl_test_type = std::string{"Broadcast"};
const char* env_value = getenv("HCCL_EXAMPLE_TEST");
return (env_value != nullptr) ? std::string(env_value) : default_hccl_test_type;
}

std::string get_example_hcl_type() {
static const auto default_hcl_type = std::string{"HLS1"};
const char* env_value = getenv("HCL_TYPE");
return (env_value != nullptr) ? std::string(env_value) : default_hcl_type;
}

int get_example_test_root() {
static const auto default_hccl_test_root = 1;
const char* env_value = getenv("HCCL_EXAMPLE_TEST_ROOT");
return (env_value != nullptr) ? atoi(env_value) : default_hccl_test_root;
}

int get_example_test_size() {
static const auto default_hccl_test_size = 25; /* 32M */
const char* env_value = getenv("HCCL_EXAMPLE_TEST_SIZE");
return (env_value != nullptr) ? atoi(env_value) : default_hccl_test_size;
}

int get_example_test_count() {
const char* env_value = getenv("HCCL_EXAMPLE_TEST_COUNT");
return (env_value != nullptr) ? atoi(env_value) : -1;
}

int get_example_test_loop() {
static const auto default_hccl_test_loop = 1;
const char* env_value = getenv("HCCL_EXAMPLE_TEST_LOOP");
return (env_value != nullptr) ? atoi(env_value) : default_hccl_test_loop;
}

bool get_example_test_profile() {
static const auto default_hccl_test_profile = false;
const char* env_value = getenv("HABANA_PROFILE");
return (env_value != nullptr) ? atoi(env_value) : default_hccl_test_profile;
}

// ------------------------------------------------------------------------------------------------

int main(int argc, char* argv[]) {
try {
    log() << "Running HCCL Example :: A simple program demonstrating HCCL usage from C++" << std::endl;

    // Initialize the Open MPI execution context.
    CHECK_MPI_STATUS(MPI_Init(&argc, &argv));

    // Get MPI rank and size.
    int mpi_rank{};
    CHECK_MPI_STATUS(MPI_Comm_rank(MPI_COMM_WORLD, &mpi_rank));

    int mpi_size{};
    CHECK_MPI_STATUS(MPI_Comm_size(MPI_COMM_WORLD, &mpi_size));

    if (mpi_size < 2) {
    throw std::runtime_error{"There is only one MPI rank. Run this program using 'mpirun' or 'mpiexec'"};
    }

    // First acquire the device and create streams. In most real-world implementations this is done by the framework,
    // but here we use the Synapse API directly.
    synDeviceId device_handle{};
    const synModuleId device_module_id =
        mpi_rank % (get_example_hcl_type() == "HLS1" ? HLS1_MODULE_CNT : HLS1H_MODULE_CNT);
    CHECK_SYNAPSE_STATUS(synDeviceAcquireByModuleId(&device_handle, device_module_id));

    // Create new stream handles. There is no need to destroy them.
    synStreamHandle collective_stream{};
    synStreamHandle device_to_host_stream{};
    synStreamHandle host_to_device_stream{};

    CHECK_SYNAPSE_STATUS(synStreamCreate(&collective_stream, device_handle, STREAM_TYPE_NETWORK_COLLECTIVE, 0));
    CHECK_SYNAPSE_STATUS(synStreamCreate(&device_to_host_stream, device_handle, STREAM_TYPE_COPY_DEVICE_TO_HOST, 0));
    CHECK_SYNAPSE_STATUS(synStreamCreate(&host_to_device_stream, device_handle, STREAM_TYPE_COPY_HOST_TO_DEVICE, 0));

    // Generate Unique Id on rank 0 and propagate it to other ranks using Open MPI.
    hcclUniqueId unique_id{}, hccl_comm_world_id{};
    constexpr int master_mpi_rank = 0;

    hcclComm_t hccl_comm_world{};
    CHECK_HCCL_STATUS(hcclCommInitRank(&hccl_comm_world, mpi_size, hccl_comm_world_id, mpi_rank));

    if (mpi_rank == master_mpi_rank) {
    CHECK_HCCL_STATUS(hcclGetUniqueId(&unique_id));
    }

    CHECK_MPI_STATUS(MPI_Bcast(&unique_id, sizeof(unique_id), MPI_BYTE, master_mpi_rank, MPI_COMM_WORLD));

    // Create a new HCCL communicator.
    hcclComm_t hccl_comm{};
    CHECK_HCCL_STATUS(hcclCommInitRank(&hccl_comm, mpi_size, unique_id, mpi_rank));

    // Allocate some buffer on the HPU device.
    // NOLINTNEXTLINE(cppcoreguidelines-avoid-magic-numbers)

    /* If the user provides count env variable - use that for size calculation
    * Else, use default or provided test size env variable.
    */
    uint64_t count = get_example_test_count();
    size_t data_size;
    int size;
    if (count == -1) {
    size = get_example_test_size();
    data_size = 1 << size;
    count = data_size / sizeof(float);
    } else {
    data_size = count * sizeof(float);
    size = -1;
    }
    auto input_host_data = std::vector<float>(count, mpi_rank);
    const void* input_host_data_ptr = reinterpret_cast<void*>(input_host_data.data());

    uint64_t input_dev_ptr{};
    uint64_t output_dev_ptr{};

    CHECK_SYNAPSE_STATUS(synDeviceMalloc(device_handle, data_size, 0, 0, &input_dev_ptr));
    CHECK_SYNAPSE_STATUS(synDeviceMalloc(device_handle, data_size, 0, 0, &output_dev_ptr));
    CHECK_SYNAPSE_STATUS(synHostMap(device_handle, data_size, input_host_data_ptr));
    CHECK_SYNAPSE_STATUS(
        synMemCopyAsync(host_to_device_stream, (uint64_t)input_host_data_ptr, data_size, input_dev_ptr, HOST_TO_DRAM));
    CHECK_SYNAPSE_STATUS(synStreamSynchronize(host_to_device_stream));

    const int LOOP = get_example_test_loop();
    chrono::time_point<chrono::high_resolution_clock> t_begin, t_end;

    // There is no need to call hcclWaitForCompletion(stream).
    // A synchronization on dev_ptr device address will be automatic.
    std::string test_type = get_test_type();
    if (test_type == "Broadcast") {
    int root = get_example_test_root();

    log() << "mpi_rank=" << mpi_rank << " Running Broadcast Test root=" << root << std::endl;
    for (int i = 0; i < LOOP; ++i) {
        CHECK_HCCL_STATUS(hcclBroadcast((const void*)input_dev_ptr, (void*)output_dev_ptr, input_host_data.size(),
                                        hcclFloat32, root, hccl_comm, collective_stream));
    }
    } else if (test_type == "Allreduce") {
    log() << "mpi_rank=" << mpi_rank << " Running Allreduce Test" << std::endl;

    if (get_example_test_profile()) {
        std::cout << "Profile On" << std::endl;
        CHECK_SYNAPSE_STATUS(synProfilerStart(synTraceDevice, device_handle));
    }

    t_begin = chrono::high_resolution_clock::now();
    for (int i = 0; i < LOOP; ++i) {
        // Perform an All-Reduce operation on the device buffer.
        CHECK_HCCL_STATUS(hcclAllReduce((const void*)input_dev_ptr, (void*)output_dev_ptr, input_host_data.size(),
                                        hcclFloat32, hcclSum, hccl_comm, collective_stream));
    }
    }

    CHECK_SYNAPSE_STATUS(synStreamSynchronize(collective_stream));
    t_end = chrono::high_resolution_clock::now();
    auto dt = chrono::duration<double>(t_end - t_begin).count();

    if (get_example_test_profile()) {
    CHECK_SYNAPSE_STATUS(synProfilerStop(synTraceDevice, device_handle));
    CHECK_SYNAPSE_STATUS(
        synProfilerGetTrace(synTraceDevice, device_handle, synTraceFormatTEF, nullptr, nullptr, nullptr));
    }

    // There is no need to call hcclWaitForCompletion(stream).
    // A synchronization on dev_ptr device address will be automatic.

    // Copy the data back to the host memory.
    auto output_host_data = std::vector<float>(input_host_data.size());
    const void* output_host_data_ptr = reinterpret_cast<void*>(output_host_data.data());

    CHECK_SYNAPSE_STATUS(synHostMap(device_handle, data_size, output_host_data_ptr));
    CHECK_SYNAPSE_STATUS(synMemCopyAsync(device_to_host_stream, output_dev_ptr, data_size,
                                        (uint64_t)output_host_data_ptr, DRAM_TO_HOST));
    CHECK_SYNAPSE_STATUS(synStreamSynchronize(device_to_host_stream));

    // There is no need to call hcclWaitForCompletion(stream), as device->host memcopy is blocking.

    // Check if the data has been reduced correctly.
    bool is_ok = true;
    if (test_type == "Broadcast") {
    int root = get_example_test_root();

    for (size_t i = 0; i < input_host_data.size(); ++i) {
        // NOLINTNEXTLINE(cppcoreguidelines-avoid-magic-numbers)
        if (std::abs(output_host_data[i] - (float)(root)) != 0) {
        is_ok = false;
        }
    }

    log() << "Broadcast mpi_rank=" << mpi_rank << " loop=" << LOOP << " size=" << size << " count=" << count
            << " <float>"
            << " Input Buffer [" << input_host_data[0] << " " << input_host_data[1] << " " << input_host_data[2] << " "
            << input_host_data[3] << " ...]"
            << " Output Buffer [" << output_host_data[0] << " " << output_host_data[1] << " " << output_host_data[2]
            << " " << output_host_data[3] << " ...]"
            << " which is " << (is_ok ? "fine." : "bad.") << std::endl;
    } else if (test_type == "Allreduce") {
    for (size_t i = 0; i < input_host_data.size(); ++i) {
        // NOLINTNEXTLINE(cppcoreguidelines-avoid-magic-numbers)

        if (std::abs(output_host_data[i] - static_cast<float>(mpi_size) * (mpi_size - 1) / 2) > 1e-10f) {
        is_ok = false;
        }
    }

    log() << "Allreduce mpi_rank=" << mpi_rank << " size=" << size << " count=" << count << " <float>"
            << " Input Buffer [" << input_host_data[0] << " " << input_host_data[1] << " " << input_host_data[2] << " "
            << input_host_data[3] << " ...]"
            << " reduced to Output Buffer [" << output_host_data[0] << " " << output_host_data[1] << " "
            << output_host_data[2] << " " << output_host_data[3] << " ...]"
            << " which is " << (is_ok ? "fine." : "bad.") << std::endl;

    cout << "AllReduce[" << mpi_rank << "] with data_size=" << data_size << ", loop=" << LOOP << ": " << std::fixed
        << std::setprecision(3)
        << ((((double)LOOP * (2.0 * (double)data_size / mpi_size * (mpi_size - 1))) / 1e6) / dt) << " MB/s/card in "
        << dt << " seconds" << endl;
    }

    // Destroy a HCCL communicator.
    CHECK_HCCL_STATUS(hcclCommDestroy(hccl_comm));
    CHECK_HCCL_STATUS(hcclCommDestroy(hccl_comm_world));

    CHECK_SYNAPSE_STATUS(synHostUnmap(device_handle, input_host_data_ptr));
    CHECK_SYNAPSE_STATUS(synHostUnmap(device_handle, output_host_data_ptr));
    CHECK_SYNAPSE_STATUS(synDeviceFree(device_handle, input_dev_ptr, 0));
    CHECK_SYNAPSE_STATUS(synDeviceFree(device_handle, output_dev_ptr, 0));

    // Clean up HCCL.
    CHECK_SYNAPSE_STATUS(synDeviceRelease(device_handle));

    // Clean up Open MPI.
    CHECK_MPI_STATUS(MPI_Finalize());

} catch (const std::exception& ex) {
    log() << "error: " << ex.what() << std::endl;
    return -1;
}
return 0;
}