C API

Infrastructure

hcclStream_t

Opaque handle to HCCL stream. Translates directly to synStreamHandle from synapse_api.h

typedef synStreamHandle hcclStream_t;

hcclComm_t

Opaque handle to HCCL communicator.

hcclUniqueId

Unique Id to be used with hcclCommInitRank. All HCCL communicators within a communicator clique have the same Unique Id.

hcclDataType_t

The following HCCL data types are currently supported by the collective operations:

  • hcclFloat32, hcclFloat

  • hcclBfloat16 (added by Habana)

hcclResult_t

The following HCCL result types are currently supported:

  • hcclSuccess

  • hcclSystemError

  • hcclInvalidArgument

  • hcclInternalError

  • hcclInvalidUsage

hcclRedOp_t

The following Ops are supported:

  • hcclSum

Warning

hcclProd, hcclMax, and hcclMin are not supported by HCCL.

hcclMemcpyKind_t

Specifies the direction of memory copy transfer used in function hcclMemcpy. The following values are supported:

  • hcclMemcpyHostToDevice - Source is in the process memory, destination is a device address.

  • hcclMemcpyDeviceToHost - Source is in a device address, destination is in the process memory.

  • hcclMemcpyDeviceToDevice - Both source and destinations are device addresses.

Warning

hcclMemcpyDefault (inferring the address kind from pointers) is not supported by HCCL.

HCCL-specific API

Habana supports one device per process and one or multiple nodes. HCCL-specific API functions are listed below.

hcclBarrier

hcclResult_t hcclBarrier(hcclComm_t comm, hcclStream_t stream_handle)

Operation:

Hccl barrier provides a synchronization method between all ranks in the provided communicator.

Parameters:

Parameter

Description

comm

[in] The communicator on which to perform the barrier synchronization operation.

stream_handle

[in] HCCL stream handle.

Return Value:

The status of the operation.

NCCL-Compatible API

Habana supports one device per process and one or multiple nodes. The supported API functions are listed below.

hcclGetVersion

hcclResult_t hcclGetVersion(int* version)

Operation:

The hcclGetVersion function returns the version number of the currently linked HCCL library. This integer is coded with the MAJOR, MINOR and PATCH level of the HCCL library.

Parameters:

Parameter

Description

version

[out] Coded HCCL version.

Return Value:

The status of the operation.

hcclGetUniqueId

hcclResult_t hcclGetUniqueId(hcclUniqueId* uniqueId)

Operation:

One of two functions used for creating HCCL communicator. Creates a uniqueId required for hcclCommInitRank.

This function should be called only by one process within a new communicator. Obtained uniqueId is required for HCCL communicator initialization and must be distributed to other ranks using separate communication channel (e.g. MPI).

Parameters:

Parameter

Description

uniqueId

[out] commId used in hcclCommInitRank.

Return Value:

The status of the operation.

hcclCommInitRank

hcclResult_t hcclCommInitRank(hcclComm_t* comm, int nranks,
hcclUniqueId commId, int rank)

Operation:

Creates an HCCL communicator with given commId. This call is blocking and will return only after being called on every rank of new communicator.

commId should be obtained by hcclGetUniqueId on one of the ranks and distributed using separate communication channel (e.g. MPI).

Each HCCL rank of created communicator is associated with a single Gaudi device.

Parameters:

Parameter

Description

comm

[out] Handle of initialized HCCL communicator

nranks

[in] Total number of ranks in the created communicator clique. It can be 2, 4 and 8.

commId

[in] Communicator Id returned by hcclGetUniqueId. Communicators created using same Id belong to same communicator clique.

rank

[in] The rank of the current device/process.

Return Value:

The status of the operation.

hcclCommSynDevice

hcclResult_t hcclCommSynDevice(const hcclComm_t comm, int* device)

Operation:

Returns the Gaudi device ID associated with current process in the HCCL communicator comm. The device can be used to, for example, allocate device memory and so on. As there may be only one Gaudi device opened within a process, the returned value will always be 0.

Parameters:

Parameter

Description

comm

[in] The communicator created by hcclCommInitRank.

device

[out] Gaudi device ID associated with comm.

hcclCommCount

hcclResult_t hcclCommCount(const hcclComm_t comm, int* count)

Operation:

Returns the total number of ranks in the HCCL communicator comm.

Parameters:

Parameter

Description

comm

[in] The communicator.

count

[out] The total number of ranks.

hcclCommUserRank

hcclResult_t hcclCommUserRank(const hcclComm_t comm, int* rank)

Operation:

Returns the rank (assigned to current process in the HCCL communicator comm.

Parameters:

Parameter

Description

comm

[in] The communicator.

rank

[out] The rank.

hcclCommDestroy

hcclResult_t hcclCommDestroy(hcclComm_t comm)

Operation:

Destroys the HCCL communicator.

Parameters:

Parameter

Description

comm

[in] The communicator to destroy.

Return Value:

The status of the operation.

hcclBroadcast

hcclResult_t hcclBroadcast(const void* sendbuff, void* recvbuff,
size_t count, hcclDataType_t datatype, int root, hcclComm_t comm,
hcclStream_t stream_handle))

Operation:

Broadcasts data from a single rank sendbuff to recvbuff of all others ranks that are part of HCCL communicator comm.

Parameters:

Parameter

Description

sendbuff

[in] Address of send buffer of root rank.

recvbuff

[in] Address of receive buffer of recieve rank.

count

[in] The number of elements.

datatype

[in] The datatype of the operation.

root

[in] The rank of the root of the broadcast operation.

comm

[in] The communicator on which to broadcast.

stream_handle

[in] HCCL stream handle.

Return Value:

The status of the operation.

hcclAllreduce

hcclResult_t hcclAllReduce(const void* sendbuff, void* recvbuff,
size_t count, hcclDataType_t datatype, hcclRedOp_t op, hcclComm_t comm,
hcclStream_t stream_handle))

Operation:

Reduces data from sendbuff across the entire HCCL communicator comm and distributes the results to recvbuff of every rank that is part of comm.

Parameters:

Parameter

Description

sendbuff

[in] Buffer that will be sent.

recvbuff

[out] Buffer that will be written into.

count

[in] The number of elements of the buffer that is reduced.

datatype

[in] The datatype of the operation.

op

[in] The reduction operation to perform (summation for example).

comm

[in] The communicator on which to perform the reduction operation.

stream_handle

[in] HCCL stream handle.

Return Value:

The status of the operation.

hcclReduce

hcclResult_t hcclReduce(const void* sendbuff, void* recvbuff, size_t
count, hcclDataType_t datatype, hcclRedOp_t op, int root, hcclComm_t
comm, hcclStream_t stream_handle)

Operation:

Reduces data from sendbuff across the entire HCCL communicator comm. Result will be put only in recvbuffer of rank designed as root.

Parameters:

Parameter

Description

sendbuff

[in] Buffer that will be sent.

recvbuff

[out] Buffer that will be written into in root

count

[in]  The number of elements of the buffer that is reduced.

datatype

[in] The datatype of the operation.

op

[in] The reduction operation to perform (summation for example).

root

[in] The rank who will contain the result.

comm

[in] The com municator on which to perform the reduction operation.

stream_handle

[in] HCCL stream handle.

Return Value:

The status of the operation.

hcclReduceScatter

hcclResult_t hcclReduceScatter(const void* sendbuff, void* recvbuff,
size_t recvcount, hcclDataType_t datatype, hcclRedOp_t op, hcclComm_t
comm, hcclStream_t stream_handle)

Operation:

Reduce-Scatter data across the entire HCCL communicator comm.

Parameters:

Parameter

Description

sendbuff

[in] Buffer that will be sent.

recvbuff

[out] Buffer that will be written into.

recvcount

[in] The number of elements of the buffer that is reduced.

datatype

[in] The datatype of the operation.

op

[in] The reduction operation to perform (summation for example).

comm

[in] The communicator on which to perform the reduction operation.

stream_handle

[in] HCCL stream handle.

Return Value:

The status of the operation.

hcclAllGather

hcclResult_t hcclAllGather(const void* sendbuff, void* recvbuff,
size_t sendcount, hcclDataType_t datatype, hcclComm_t comm, hcclStream_t
stream_handle)

Operation:

Performs allgather across the entire HCCL communicator comm.

Parameters:

Parameter

Description

sendbuff

[in] Buffer that will be sent.

recvbuff

[out] Buffer that will be written into.

sendcount

[in] The number of elements of the buffer that is moved.

datatype

[in] The datatype of the operation.

comm

[in] The communicator on which to perform the reduction operation.

stream_handle

[in] HCCL stream handle.

Return Value:

The status of the operation.

hcclGroupStart

hcclResult_t hcclGroupStart();

Operation:

Start a group call

Parameters:

No parameters

Return Value:

The status of the operation.

hcclGroupEnd

hcclResult_t hcclGroupEnd();

Operation:

End a group call. Start execution of all calls collected since hcclGroupStart.

Parameters:

No parameters

Return Value:

The status of the operation.

hcclSend

hcclResult_t hcclSend(const void* sendbuff, size_t count, hcclDataType_t
datatype, int peer, hcclComm_t comm, hcclStream_t stream);

Operation:

Performs allgather across the entire HCCL communicator comm.

Parameters:

Parameter

Description

sendbuff

[in] Buffer that will be sent.

count

[in] The number of elements of the buffer that will be sent.

datatype

[in] The datatype of the operation.

peer

[in] The rank peer that the data should be sent to.

comm

[in] The communicator on which to perform the reduction operation.

stream

[in] HCCL stream handle.

Return Value:

The status of the operation.

hcclRecv

hcclResult_t hcclSend(const void* sendbuff, size_t count, hcclDataType_t
datatype, int peer, hcclComm_t comm, hcclStream_t stream);

hcclResult_t hcclRecv(void* recvbuff, size_t count, hcclDataType_t
datatype, int peer, hcclComm_t comm, hcclStream_t stream)

Operation:

Performs allgather across the entire HCCL communicator comm.

Parameters:

Parameter

Description

recvbuff

[in] Buffer that received data will be written to.

count

[in] Number of elements that should be received.

datatype

[in] The datatype of the operation.

peer

[in] The rank peer that the data should be received from.

comm

[in] The communicator on which to perform the reduction operation.

stream

[in] HCCL stream handle.

Return Value:

The status of the operation.

Example

The following C++ code is an example of setting up HCCL and performing hcclAllreduce on a buffer of 4 floats:

// C++ Standard Libraries
#include <iostream>
#include <exception>
#include <vector>
#include <cstdlib>
#include <chrono>
#include <iomanip>
#include <unistd.h>
// Open MPI
#include <mpi.h>

// HCCL :: Habana Collective Communications Library
#include <hccl/hccl.h>

// Synapse :: Habana Synapse training API
#include <synapse_api.h>

using namespace std;

#define HLS1H_MODULE_CNT 4
#define HLS1_MODULE_CNT 8

// ------------------------------------------------------------------------------------------------

#define CHECK_MPI_STATUS(x)                                                             \
{                                                                                       \
    const auto _res = (x);                                                              \
    if (_res != MPI_SUCCESS)                                                            \
    throw std::runtime_error{"In function " + std::string{__FUNCTION__} +               \
                            "(): " #x " failed with code: " + std::to_string(_res)};    \
}

#define CHECK_HCCL_STATUS(x)                                                      \
{                                                                                 \
    const auto _res = (x);                                                        \
    if (_res != hcclSuccess)                                                      \
    throw std::runtime_error{"In function " + std::string{__FUNCTION__} +         \
                            "(): " #x " failed: " + hcclGetErrorString(_res)};    \
}

#define CHECK_SYNAPSE_STATUS(x)                                                                    \
{                                                                                                  \
    const auto _res = (x);                                                                         \
    if (_res != synSuccess)                                                                        \
    throw std::runtime_error{"In function " + std::string{__FUNCTION__} +                          \
                            "(): " #x " failed with synapse error: " + std::to_string((_res))};    \
}

std::ostream& log() { return std::cout; }

template <typename T>
std::ostream& operator<<(std::ostream& out, const std::vector<T>& data) {
out << '[';
if (!data.empty()) {
    out << data[0];
    for (size_t i = 1; i < data.size(); ++i) {
    out << " " << data[i];
    }
}
return out << ']';
}

std::string get_test_type() {
static const auto default_hccl_test_type = std::string{"Broadcast"};
const char* env_value = getenv("HCCL_EXAMPLE_TEST");
return (env_value != nullptr) ? std::string(env_value) : default_hccl_test_type;
}

std::string get_example_hcl_type() {
static const auto default_hcl_type = std::string{"HLS1"};
const char* env_value = getenv("HCL_TYPE");
return (env_value != nullptr) ? std::string(env_value) : default_hcl_type;
}

int get_example_test_root() {
static const auto default_hccl_test_root = 1;
const char* env_value = getenv("HCCL_EXAMPLE_TEST_ROOT");
return (env_value != nullptr) ? atoi(env_value) : default_hccl_test_root;
}

int get_example_test_size() {
static const auto default_hccl_test_size = 25; /* 32M */
const char* env_value = getenv("HCCL_EXAMPLE_TEST_SIZE");
return (env_value != nullptr) ? atoi(env_value) : default_hccl_test_size;
}

int get_example_test_count() {
const char* env_value = getenv("HCCL_EXAMPLE_TEST_COUNT");
return (env_value != nullptr) ? atoi(env_value) : -1;
}

int get_example_test_loop() {
static const auto default_hccl_test_loop = 1;
const char* env_value = getenv("HCCL_EXAMPLE_TEST_LOOP");
return (env_value != nullptr) ? atoi(env_value) : default_hccl_test_loop;
}

bool get_example_test_profile() {
static const auto default_hccl_test_profile = false;
const char* env_value = getenv("HABANA_PROFILE");
return (env_value != nullptr) ? atoi(env_value) : default_hccl_test_profile;
}

// ------------------------------------------------------------------------------------------------

int main(int argc, char* argv[]) {
try {
    log() << "Running HCCL Example :: A simple program demonstrating HCCL usage from C++" << std::endl;

    // Initialize the Open MPI execution context.
    CHECK_MPI_STATUS(MPI_Init(&argc, &argv));

    // Get MPI rank and size.
    int mpi_rank{};
    CHECK_MPI_STATUS(MPI_Comm_rank(MPI_COMM_WORLD, &mpi_rank));

    int mpi_size{};
    CHECK_MPI_STATUS(MPI_Comm_size(MPI_COMM_WORLD, &mpi_size));

    if (mpi_size < 2) {
    throw std::runtime_error{"There is only one MPI rank. Run this program using 'mpirun' or 'mpiexec'"};
    }

    // First acquire device and  create streams. In most cases of real world implementation this is done by framework,
    // bit here we are running using poor synapse.
    synDeviceId device_handle{};
    const synModuleId device_module_id =
        mpi_rank % (get_example_hcl_type() == "HLS1" ? HLS1_MODULE_CNT : HLS1H_MODULE_CNT);
    CHECK_SYNAPSE_STATUS(synDeviceAcquireByModuleId(&device_handle, device_module_id));

    // Create a new stream handle. There is no need to destroy it any way.
    synStreamHandle collective_stream{};
    synStreamHandle device_to_host_stream{};
    synStreamHandle host_to_device_stream{};

    CHECK_SYNAPSE_STATUS(synStreamCreate(&collective_stream, device_handle, STREAM_TYPE_NETWORK_COLLECTIVE, 0));
    CHECK_SYNAPSE_STATUS(synStreamCreate(&device_to_host_stream, device_handle, STREAM_TYPE_COPY_DEVICE_TO_HOST, 0));
    CHECK_SYNAPSE_STATUS(synStreamCreate(&host_to_device_stream, device_handle, STREAM_TYPE_COPY_HOST_TO_DEVICE, 0));

    // Generate Unique Id on rank 0 and propagate it to other ranks using Open MPI.
    hcclUniqueId unique_id{}, hccl_comm_world_id{};
    constexpr int master_mpi_rank = 0;

    hcclComm_t hccl_comm_world{};
    CHECK_HCCL_STATUS(hcclCommInitRank(&hccl_comm_world, mpi_size, hccl_comm_world_id, mpi_rank));

    if (mpi_rank == master_mpi_rank) {
    CHECK_HCCL_STATUS(hcclGetUniqueId(&unique_id));
    }

    CHECK_MPI_STATUS(MPI_Bcast(&unique_id, sizeof(unique_id), MPI_BYTE, master_mpi_rank, MPI_COMM_WORLD));

    // Create a new HCCL communicator.
    hcclComm_t hccl_comm{};
    CHECK_HCCL_STATUS(hcclCommInitRank(&hccl_comm, mpi_size, unique_id, mpi_rank));

    // Allocate some buffer on the HPU device.
    // NOLINTNEXTLINE(cppcoreguidelines-avoid-magic-numbers)

    /* If the user provides count env variable - use that for size calculation
    * Else, use default or provided test size env variable.
    */
    uint64_t count = get_example_test_count();
    size_t data_size;
    int size;
    if (count == -1) {
    size = get_example_test_size();
    data_size = 1 << size;
    count = data_size / sizeof(float);
    } else {
    data_size = count * sizeof(float);
    size = -1;
    }
    auto input_host_data = std::vector<float>(count, mpi_rank);
    const void* input_host_data_ptr = reinterpret_cast<void*>(input_host_data.data());

    uint64_t input_dev_ptr{};
    uint64_t output_dev_ptr{};

    CHECK_SYNAPSE_STATUS(synDeviceMalloc(device_handle, data_size, 0, 0, &input_dev_ptr));
    CHECK_SYNAPSE_STATUS(synDeviceMalloc(device_handle, data_size, 0, 0, &output_dev_ptr));
    CHECK_SYNAPSE_STATUS(synHostMap(device_handle, data_size, input_host_data_ptr));
    CHECK_SYNAPSE_STATUS(
        synMemCopyAsync(host_to_device_stream, (uint64_t)input_host_data_ptr, data_size, input_dev_ptr, HOST_TO_DRAM));
    CHECK_SYNAPSE_STATUS(synStreamSynchronize(host_to_device_stream));

    const int LOOP = get_example_test_loop();
    chrono::time_point<chrono::high_resolution_clock> t_begin, t_end;

    // There is no need to call hcclWaitForCompletion(stream).
    // A synchronization on dev_ptr device address will be automatic.
    std::string test_type = get_test_type();
    if (test_type == "Broadcast") {
    int root = get_example_test_root();

    log() << "mpi_rank=" << mpi_rank << " Running Broadcast Test root=" << root << std::endl;
    for (int i = 0; i < LOOP; ++i) {
        CHECK_HCCL_STATUS(hcclBroadcast((const void*)input_dev_ptr, (void*)output_dev_ptr, input_host_data.size(),
                                        hcclFloat32, root, hccl_comm, collective_stream));
    }
    } else if (test_type == "Allreduce") {
    log() << "mpi_rank=" << mpi_rank << " Running Allreduce Test" << std::endl;

    if (get_example_test_profile()) {
        std::cout << "Profile On" << std::endl;
        CHECK_SYNAPSE_STATUS(synProfilerStart(synTraceDevice, device_handle));
    }

    t_begin = chrono::high_resolution_clock::now();
    for (int i = 0; i < LOOP; ++i) {
        // Perform an All-Reduce operation on the device buffer.
        CHECK_HCCL_STATUS(hcclAllReduce((const void*)input_dev_ptr, (void*)output_dev_ptr, input_host_data.size(),
                                        hcclFloat32, hcclSum, hccl_comm, collective_stream));
    }
    }

    CHECK_SYNAPSE_STATUS(synStreamSynchronize(collective_stream));
    t_end = chrono::high_resolution_clock::now();
    auto dt = chrono::duration<double>(t_end - t_begin).count();

    if (get_example_test_profile()) {
    CHECK_SYNAPSE_STATUS(synProfilerStop(synTraceDevice, device_handle));
    CHECK_SYNAPSE_STATUS(
        synProfilerGetTrace(synTraceDevice, device_handle, synTraceFormatTEF, nullptr, nullptr, nullptr));
    }

    // There is no need to call hcclWaitForCompletion(stream).
    // A synchronization on dev_ptr device address will be automatic.

    // Copy the data back to the host memory.
    auto output_host_data = std::vector<float>(input_host_data.size());
    const void* output_host_data_ptr = reinterpret_cast<void*>(output_host_data.data());

    CHECK_SYNAPSE_STATUS(synHostMap(device_handle, data_size, output_host_data_ptr));
    CHECK_SYNAPSE_STATUS(synMemCopyAsync(device_to_host_stream, output_dev_ptr, data_size,
                                        (uint64_t)output_host_data_ptr, DRAM_TO_HOST));
    CHECK_SYNAPSE_STATUS(synStreamSynchronize(device_to_host_stream));

    // There is no need to call hcclWaitForCompletion(stream), as device->host memcopy is blocking.

    // Check if the data has been reduced correctly.
    bool is_ok = true;
    if (test_type == "Broadcast") {
    int root = get_example_test_root();

    for (size_t i = 0; i < input_host_data.size(); ++i) {
        // NOLINTNEXTLINE(cppcoreguidelines-avoid-magic-numbers)
        if (std::abs(output_host_data[i] - (float)(root)) != 0) {
        is_ok = false;
        }
    }

    log() << "Broadcast mpi_rank=" << mpi_rank << " loop=" << LOOP << " size=" << size << " count=" << count
            << " <float>"
            << " Input Buffer [" << input_host_data[0] << " " << input_host_data[1] << " " << input_host_data[2] << " "
            << input_host_data[3] << " ...]"
            << " Output Buffer [" << output_host_data[0] << " " << output_host_data[1] << " " << output_host_data[2]
            << " " << output_host_data[3] << " ...]"
            << " which is " << (is_ok ? "fine." : "bad.") << std::endl;
    } else if (test_type == "Allreduce") {
    for (size_t i = 0; i < input_host_data.size(); ++i) {
        // NOLINTNEXTLINE(cppcoreguidelines-avoid-magic-numbers)

        if (std::abs(output_host_data[i] - static_cast<float>(mpi_size) * (mpi_size - 1) / 2) > 1e-10f) {
        is_ok = false;
        }
    }

    log() << "Allreduce mpi_rank=" << mpi_rank << " size=" << size << " count=" << count << " <float>"
            << " Input Buffer [" << input_host_data[0] << " " << input_host_data[1] << " " << input_host_data[2] << " "
            << input_host_data[3] << " ...]"
            << " reduced to Output Buffer [" << output_host_data[0] << " " << output_host_data[1] << " "
            << output_host_data[2] << " " << output_host_data[3] << " ...]"
            << " which is " << (is_ok ? "fine." : "bad.") << std::endl;

    cout << "AllReduce[" << mpi_rank << "] with data_size=" << data_size << ", loop=" << LOOP << ": " << std::fixed
        << std::setprecision(3)
        << ((((double)LOOP * (2.0 * (double)data_size / mpi_size * (mpi_size - 1))) / 1e6) / dt) << " MB/s/card in "
        << dt << " seconds" << endl;
    }

    // Destroy a HCCL communicator.
    CHECK_HCCL_STATUS(hcclCommDestroy(hccl_comm));
    CHECK_HCCL_STATUS(hcclCommDestroy(hccl_comm_world));

    CHECK_SYNAPSE_STATUS(synHostUnmap(device_handle, input_host_data_ptr));
    CHECK_SYNAPSE_STATUS(synHostUnmap(device_handle, output_host_data_ptr));
    CHECK_SYNAPSE_STATUS(synDeviceFree(device_handle, input_dev_ptr, 0));
    CHECK_SYNAPSE_STATUS(synDeviceFree(device_handle, output_dev_ptr, 0));

    // Clean up HCCL.
    CHECK_SYNAPSE_STATUS(synDeviceRelease(device_handle));

    // Clean up Open MPI.
    CHECK_MPI_STATUS(MPI_Finalize());

} catch (const std::exception& ex) {
    log() << "error: " << ex.what() << std::endl;
    return -1;
}
return 0;
}