C APIs
On this Page
C APIs¶
Infrastructure¶
hcclComm_t¶
Opaque handle to HCCL communicator.
hcclUniqueId¶
Unique Id to be used with hcclCommInitRank.
All HCCL communicators within a communicator clique have the same uniqueId.
hcclDataType_t¶
The following HCCL data types are currently supported by the collective operations:
- hcclFloat32, hcclFloat 
- hcclBfloat16 
- hcclFloat16 
- hcclInt32, hcclInt 
- hcclUint32 
hcclResult_t¶
The following HCCL result types are currently supported:
- hcclSuccess 
- hcclSystemError 
- hcclInvalidArgument 
- hcclInternalError 
- hcclInvalidUsage 
hcclRedOp_t¶
The following ops are supported:
- hcclSum 
- hcclMin 
- hcclMax 
Warning
hcclProd and hcclAvg are not supported by HCCL.
HCCL-specific APIs¶
Intel Gaudi supports one device per process and one or multiple nodes. HCCL-specific API functions are listed below.
hcclAlltoAll¶
hcclResult_t hcclAlltoAll(const void* sendbuff, void* recvbuff,
size_t count, hcclDataType_t datatype, hcclComm_t comm,
synStreamHandle stream_handle))
Operation:
Sends data from all to all processes.
Parameters:
| Parameter | Description | 
|---|---|
| sendbuff | [in] Buffer that will be sent. | 
| recvbuff | [out] Buffer that will be written into. | 
| count | [in] The number of elements to send. | 
| datatype | [in] The data type of the operation. | 
| comm | [in] The communicator on which to perform the all to all operation. | 
| stream_handle | [in] HCCL stream handle. | 
Return Value:
Operation status.
NCCL-compatible APIs¶
Intel Gaudi supports one device per process and one or multiple nodes. The supported API functions are listed below.
hcclGetVersion¶
hcclResult_t hcclGetVersion(int* version)
Operation:
The hcclGetVersion function returns the version number of the currently linked HCCL library.
This integer is coded with the MAJOR, MINOR and PATCH level of the HCCL library.
Parameters:
| Parameter | Description | 
|---|---|
| version | [out] Coded HCCL version. | 
Return Value:
Operation status.
hcclGetUniqueId¶
hcclResult_t hcclGetUniqueId(hcclUniqueId* uniqueId)
Operation:
One of two functions used for creating HCCL communicator. Creates a uniqueId required for hcclCommInitRank.
This function should be called only by one process within a new communicator. Obtained uniqueId is required for HCCL communicator initialization and must be distributed to other ranks using separate communication channel (e.g. MPI).
Parameters:
| Parameter | Description | 
|---|---|
| uniqueId | [out] commId used in hcclCommInitRank. | 
Return Value:
Operation status.
hcclCommInitRank¶
hcclResult_t hcclCommInitRank(hcclComm_t* comm, int nranks,
hcclUniqueId commId, int rank)
Operation:
Creates an HCCL communicator with given commId. This call is blocking and will return only after being called on every rank of new communicator.
commId should be obtained by hcclGetUniqueId on one of the ranks and distributed
using separate communication channel (e.g. MPI).
Each HCCL rank of created communicator is associated with a single Gaudi device.
Parameters:
| Parameter | Description | 
|---|---|
| comm | [out] Handle of initialized HCCL communicator | 
| nranks | [in] Total number of ranks in the created communicator clique. It can be 2, 4 and 8. | 
| commId | [in] Communicator Id returned by hcclGetUniqueId. Communicators created using same Id belong to same communicator clique. | 
| rank | [in] The rank of the current device/process. | 
Return Value:
Operation status.
hcclCommSynDevice¶
hcclResult_t hcclCommSynDevice(const hcclComm_t comm, int* device)
Operation:
Returns the Gaudi device ID associated with current process in the HCCL communicator comm. The device can be used to, for example, allocate device memory and so on. As there may be only one Gaudi device opened within a process, the returned value will always be 0.
Parameters:
| Parameter | Description | 
|---|---|
| comm | [in] The communicator created by hcclCommInitRank. | 
| device | [out] Gaudi device ID associated with comm. | 
hcclCommCount¶
hcclResult_t hcclCommCount(const hcclComm_t comm, int* count)
Operation:
Returns the total number of ranks in the HCCL communicator comm.
Parameters:
| Parameter | Description | 
|---|---|
| comm | [in] The communicator. | 
| count | [out] The total number of ranks. | 
hcclCommUserRank¶
hcclResult_t hcclCommUserRank(const hcclComm_t comm, int* rank)
Operation:
Returns the rank assigned to current process in the HCCL communicator comm.
Parameters:
| Parameter | Description | 
|---|---|
| comm | [in] The communicator. | 
| rank | [out] The rank. | 
hcclCommFinalize¶
hcclResult_t hcclCommFinalize(hcclComm_t comm)
Operation:
Finalizes the HCCL communicator (wait for completion of all operations).
Parameters:
| Parameter | Description | 
|---|---|
| comm | [in] The communicator to finalize. | 
Return Value:
Operation status.
hcclCommDestroy¶
hcclResult_t hcclCommDestroy(hcclComm_t comm)
Operation:
Destroys the HCCL communicator.
Parameters:
| Parameter | Description | 
|---|---|
| comm | [in] The communicator to destroy. | 
Return Value:
Operation status.
hcclBroadcast¶
hcclResult_t hcclBroadcast(const void* sendbuff, void* recvbuff,
size_t count, hcclDataType_t datatype, int root, hcclComm_t comm,
synStreamHandle stream_handle))
Operation:
Broadcasts data from a single rank sendbuff to recvbuff of all others ranks that are part of HCCL communicator comm.
Parameters:
| Parameter | Description | 
|---|---|
| sendbuff | [in] Address of send buffer of root rank. | 
| recvbuff | [in] Address of receive buffer of receive rank. | 
| count | [in] The number of elements. | 
| datatype | [in] The data type of the operation. | 
| root | [in] The rank of the root of the broadcast operation. | 
| comm | [in] The communicator on which to broadcast. | 
| stream_handle | [in] HCCL stream handle. | 
Return Value:
Operation status.
hcclAllreduce¶
hcclResult_t hcclAllReduce(const void* sendbuff, void* recvbuff,
size_t count, hcclDataType_t datatype, hcclRedOp_t op, hcclComm_t comm,
synStreamHandle stream_handle))
Operation:
Reduces data from sendbuff across the entire HCCL communicator comm and distributes the results to recvbuff of every rank that is part of comm.
Parameters:
| Parameter | Description | 
|---|---|
| sendbuff | [in] Buffer that will be sent. | 
| recvbuff | [out] Buffer that will be written into. | 
| count | [in] The number of elements of the buffer that is reduced. | 
| datatype | [in] The data type of the operation. | 
| op | [in] The reduction operation to perform (summation for example). | 
| comm | [in] The communicator on which to perform the reduction operation. | 
| stream_handle | [in] HCCL stream handle. | 
Return Value:
Operation status.
hcclReduce¶
hcclResult_t hcclReduce(const void* sendbuff, void* recvbuff, size_t
count, hcclDataType_t datatype, hcclRedOp_t op, int root, hcclComm_t
comm, synStreamHandle stream_handle)
Operation:
Reduces data from sendbuff across the entire HCCL communicator comm. Result will be put only in recvbuffer of rank designed as root.
Parameters:
| Parameter | Description | 
|---|---|
| sendbuff | [in] Buffer that will be sent. | 
| recvbuff | [out] Buffer that will be written into in root | 
| count | [in] The number of elements of the buffer that is reduced. | 
| datatype | [in] The data type of the operation. | 
| op | [in] The reduction operation to perform (summation for example). | 
| root | [in] The rank who will contain the result. | 
| comm | [in] The communicator on which to perform the reduction operation. | 
| stream_handle | [in] HCCL stream handle. | 
Return Value:
Operation status.
hcclReduceScatter¶
hcclResult_t hcclReduceScatter(const void* sendbuff, void* recvbuff,
size_t recvcount, hcclDataType_t datatype, hcclRedOp_t op, hcclComm_t
comm, synStreamHandle stream_handle)
Operation:
Performs reduce-scatter of the data across the entire HCCL communicator comm.
Parameters:
| Parameter | Description | 
|---|---|
| sendbuff | [in] Buffer that will be sent. | 
| recvbuff | [out] Buffer that will be written into. | 
| recvcount | [in] The number of elements of the buffer that is reduced. | 
| datatype | [in] The data type of the operation. | 
| op | [in] The reduction operation to perform (summation for example). | 
| comm | [in] The communicator on which to perform the reduction operation. | 
| stream_handle | [in] HCCL stream handle. | 
Return Value:
Operation status.
hcclAllGather¶
hcclResult_t hcclAllGather(const void* sendbuff, void* recvbuff,
size_t sendcount, hcclDataType_t datatype, hcclComm_t comm, synStreamHandle
stream_handle)
Operation:
Performs allgather across the entire HCCL communicator comm.
Parameters:
| Parameter | Description | 
|---|---|
| sendbuff | [in] Buffer that will be sent. | 
| recvbuff | [out] Buffer that will be written into. | 
| sendcount | [in] The number of elements of the buffer that is moved. | 
| datatype | [in] The data type of the operation. | 
| comm | [in] The communicator on which to perform the reduction operation. | 
| stream_handle | [in] HCCL stream handle. | 
Return Value:
Operation status.
hcclGroupStart¶
hcclResult_t hcclGroupStart();
Operation:
Starts a group call.
Parameters:
No parameters.
Return Value:
Operation status.
hcclGroupEnd¶
hcclResult_t hcclGroupEnd();
Operation:
Ends a group call. Starts execution of all calls collected since hcclGroupStart.
Parameters:
No parameters.
Return Value:
Operation status.
hcclSend¶
hcclResult_t hcclSend(const void* sendbuff, size_t count, hcclDataType_t
datatype, int peer, hcclComm_t comm, synStreamHandle stream);
Operation:
Performs allgather across the entire HCCL communicator comm.
Parameters:
| Parameter | Description | 
|---|---|
| sendbuff | [in] Buffer that will be sent. | 
| count | [in] The number of elements of the buffer that will be sent. | 
| datatype | [in] The data type of the operation. | 
| peer | [in] The rank peer that the data should be sent to. | 
| comm | [in] The communicator on which to perform the reduction operation. | 
| stream | [in] HCCL stream handle. | 
Return Value:
Operation status.
hcclRecv¶
hcclResult_t hcclSend(const void* sendbuff, size_t count, hcclDataType_t
datatype, int peer, hcclComm_t comm, synStreamHandle stream);
hcclResult_t hcclRecv(void* recvbuff, size_t count, hcclDataType_t
datatype, int peer, hcclComm_t comm, synStreamHandle stream)
Operation:
Performs allgather across the entire HCCL communicator comm.
Parameters:
| Parameter | Description | 
|---|---|
| recvbuff | [in] Buffer that received data will be written to. | 
| count | [in] Number of elements that should be received. | 
| datatype | [in] The data type of the operation. | 
| peer | [in] The rank peer that the data should be received from. | 
| comm | [in] The communicator on which to perform the reduction operation. | 
| stream | [in] HCCL stream handle. | 
Return Value:
Operation status.
Example¶
The following C++ code is an example of setting up HCCL and performing hcclAllreduce on a buffer of 4 floats:
// C++ Standard Libraries
#include <iostream>
#include <exception>
#include <vector>
#include <cstdlib>
#include <chrono>
#include <iomanip>
#include <unistd.h>
// Open MPI
#include <mpi.h>
// HCCL :: Habana Collective Communications Library
#include <hccl/hccl.h>
// Synapse :: Habana Synapse training API
#include <synapse_api.h>
using namespace std;
#define HLS1H_MODULE_CNT 4
#define HLS1_MODULE_CNT 8
// ------------------------------------------------------------------------------------------------
#define CHECK_MPI_STATUS(x)                                                             \
{                                                                                       \
    const auto _res = (x);                                                              \
    if (_res != MPI_SUCCESS)                                                            \
    throw std::runtime_error{"In function " + std::string{__FUNCTION__} +               \
                            "(): " #x " failed with code: " + std::to_string(_res)};    \
}
#define CHECK_HCCL_STATUS(x)                                                      \
{                                                                                 \
    const auto _res = (x);                                                        \
    if (_res != hcclSuccess)                                                      \
    throw std::runtime_error{"In function " + std::string{__FUNCTION__} +         \
                            "(): " #x " failed: " + hcclGetErrorString(_res)};    \
}
#define CHECK_SYNAPSE_STATUS(x)                                                                    \
{                                                                                                  \
    const auto _res = (x);                                                                         \
    if (_res != synSuccess)                                                                        \
    throw std::runtime_error{"In function " + std::string{__FUNCTION__} +                          \
                            "(): " #x " failed with synapse error: " + std::to_string((_res))};    \
}
std::ostream& log() { return std::cout; }
template <typename T>
std::ostream& operator<<(std::ostream& out, const std::vector<T>& data) {
out << '[';
if (!data.empty()) {
    out << data[0];
    for (size_t i = 1; i < data.size(); ++i) {
    out << " " << data[i];
    }
}
return out << ']';
}
std::string get_test_type() {
static const auto default_hccl_test_type = std::string{"Broadcast"};
const char* env_value = getenv("HCCL_EXAMPLE_TEST");
return (env_value != nullptr) ? std::string(env_value) : default_hccl_test_type;
}
std::string get_example_hcl_type() {
static const auto default_hcl_type = std::string{"HLS1"};
const char* env_value = getenv("HCL_TYPE");
return (env_value != nullptr) ? std::string(env_value) : default_hcl_type;
}
int get_example_test_root() {
static const auto default_hccl_test_root = 1;
const char* env_value = getenv("HCCL_EXAMPLE_TEST_ROOT");
return (env_value != nullptr) ? atoi(env_value) : default_hccl_test_root;
}
int get_example_test_size() {
static const auto default_hccl_test_size = 25; /* 32M */
const char* env_value = getenv("HCCL_EXAMPLE_TEST_SIZE");
return (env_value != nullptr) ? atoi(env_value) : default_hccl_test_size;
}
int get_example_test_count() {
const char* env_value = getenv("HCCL_EXAMPLE_TEST_COUNT");
return (env_value != nullptr) ? atoi(env_value) : -1;
}
int get_example_test_loop() {
static const auto default_hccl_test_loop = 1;
const char* env_value = getenv("HCCL_EXAMPLE_TEST_LOOP");
return (env_value != nullptr) ? atoi(env_value) : default_hccl_test_loop;
}
bool get_example_test_profile() {
static const auto default_hccl_test_profile = false;
const char* env_value = getenv("HABANA_PROFILE");
return (env_value != nullptr) ? atoi(env_value) : default_hccl_test_profile;
}
// ------------------------------------------------------------------------------------------------
int main(int argc, char* argv[]) {
try {
    log() << "Running HCCL Example :: A simple program demonstrating HCCL usage from C++" << std::endl;
    // Initialize the Open MPI execution context.
    CHECK_MPI_STATUS(MPI_Init(&argc, &argv));
    // Get MPI rank and size.
    int mpi_rank{};
    CHECK_MPI_STATUS(MPI_Comm_rank(MPI_COMM_WORLD, &mpi_rank));
    int mpi_size{};
    CHECK_MPI_STATUS(MPI_Comm_size(MPI_COMM_WORLD, &mpi_size));
    if (mpi_size < 2) {
    throw std::runtime_error{"There is only one MPI rank. Run this program using 'mpirun' or 'mpiexec'"};
    }
    // First acquire device and  create streams. In most cases of real world implementation this is done by framework,
    // bit here we are running using poor synapse.
    synDeviceId device_handle{};
    const synModuleId device_module_id =
        mpi_rank % (get_example_hcl_type() == "HLS1" ? HLS1_MODULE_CNT : HLS1H_MODULE_CNT);
    CHECK_SYNAPSE_STATUS(synDeviceAcquireByModuleId(&device_handle, device_module_id));
    // Create a new stream handle. There is no need to destroy it any way.
    synStreamHandle collective_stream{};
    synStreamHandle device_to_host_stream{};
    synStreamHandle host_to_device_stream{};
    CHECK_SYNAPSE_STATUS(synStreamCreate(&collective_stream, device_handle, STREAM_TYPE_NETWORK_COLLECTIVE, 0));
    CHECK_SYNAPSE_STATUS(synStreamCreate(&device_to_host_stream, device_handle, STREAM_TYPE_COPY_DEVICE_TO_HOST, 0));
    CHECK_SYNAPSE_STATUS(synStreamCreate(&host_to_device_stream, device_handle, STREAM_TYPE_COPY_HOST_TO_DEVICE, 0));
    // Generate Unique Id on rank 0 and propagate it to other ranks using Open MPI.
    hcclUniqueId unique_id{}, hccl_comm_world_id{};
    constexpr int master_mpi_rank = 0;
    hcclComm_t hccl_comm_world{};
    CHECK_HCCL_STATUS(hcclCommInitRank(&hccl_comm_world, mpi_size, hccl_comm_world_id, mpi_rank));
    if (mpi_rank == master_mpi_rank) {
    CHECK_HCCL_STATUS(hcclGetUniqueId(&unique_id));
    }
    CHECK_MPI_STATUS(MPI_Bcast(&unique_id, sizeof(unique_id), MPI_BYTE, master_mpi_rank, MPI_COMM_WORLD));
    // Create a new HCCL communicator.
    hcclComm_t hccl_comm{};
    CHECK_HCCL_STATUS(hcclCommInitRank(&hccl_comm, mpi_size, unique_id, mpi_rank));
    // Allocate some buffer on the HPU device.
    // NOLINTNEXTLINE(cppcoreguidelines-avoid-magic-numbers)
    /* If the user provides count env variable - use that for size calculation
    * Else, use default or provided test size env variable.
    */
    uint64_t count = get_example_test_count();
    size_t data_size;
    int size;
    if (count == -1) {
    size = get_example_test_size();
    data_size = 1 << size;
    count = data_size / sizeof(float);
    } else {
    data_size = count * sizeof(float);
    size = -1;
    }
    auto input_host_data = std::vector<float>(count, mpi_rank);
    const void* input_host_data_ptr = reinterpret_cast<void*>(input_host_data.data());
    uint64_t input_dev_ptr{};
    uint64_t output_dev_ptr{};
    CHECK_SYNAPSE_STATUS(synDeviceMalloc(device_handle, data_size, 0, 0, &input_dev_ptr));
    CHECK_SYNAPSE_STATUS(synDeviceMalloc(device_handle, data_size, 0, 0, &output_dev_ptr));
    CHECK_SYNAPSE_STATUS(synHostMap(device_handle, data_size, input_host_data_ptr));
    CHECK_SYNAPSE_STATUS(
        synMemCopyAsync(host_to_device_stream, (uint64_t)input_host_data_ptr, data_size, input_dev_ptr, HOST_TO_DRAM));
    CHECK_SYNAPSE_STATUS(synStreamSynchronize(host_to_device_stream));
    const int LOOP = get_example_test_loop();
    chrono::time_point<chrono::high_resolution_clock> t_begin, t_end;
    // There is no need to call hcclWaitForCompletion(stream).
    // A synchronization on dev_ptr device address will be automatic.
    std::string test_type = get_test_type();
    if (test_type == "Broadcast") {
    int root = get_example_test_root();
    log() << "mpi_rank=" << mpi_rank << " Running Broadcast Test root=" << root << std::endl;
    for (int i = 0; i < LOOP; ++i) {
        CHECK_HCCL_STATUS(hcclBroadcast((const void*)input_dev_ptr, (void*)output_dev_ptr, input_host_data.size(),
                                        hcclFloat32, root, hccl_comm, collective_stream));
    }
    } else if (test_type == "Allreduce") {
    log() << "mpi_rank=" << mpi_rank << " Running Allreduce Test" << std::endl;
    if (get_example_test_profile()) {
        std::cout << "Profile On" << std::endl;
        CHECK_SYNAPSE_STATUS(synProfilerStart(synTraceDevice, device_handle));
    }
    t_begin = chrono::high_resolution_clock::now();
    for (int i = 0; i < LOOP; ++i) {
        // Perform an All-Reduce operation on the device buffer.
        CHECK_HCCL_STATUS(hcclAllReduce((const void*)input_dev_ptr, (void*)output_dev_ptr, input_host_data.size(),
                                        hcclFloat32, hcclSum, hccl_comm, collective_stream));
    }
    }
    CHECK_SYNAPSE_STATUS(synStreamSynchronize(collective_stream));
    t_end = chrono::high_resolution_clock::now();
    auto dt = chrono::duration<double>(t_end - t_begin).count();
    if (get_example_test_profile()) {
    CHECK_SYNAPSE_STATUS(synProfilerStop(synTraceDevice, device_handle));
    CHECK_SYNAPSE_STATUS(
        synProfilerGetTrace(synTraceDevice, device_handle, synTraceFormatTEF, nullptr, nullptr, nullptr));
    }
    // There is no need to call hcclWaitForCompletion(stream).
    // A synchronization on dev_ptr device address will be automatic.
    // Copy the data back to the host memory.
    auto output_host_data = std::vector<float>(input_host_data.size());
    const void* output_host_data_ptr = reinterpret_cast<void*>(output_host_data.data());
    CHECK_SYNAPSE_STATUS(synHostMap(device_handle, data_size, output_host_data_ptr));
    CHECK_SYNAPSE_STATUS(synMemCopyAsync(device_to_host_stream, output_dev_ptr, data_size,
                                        (uint64_t)output_host_data_ptr, DRAM_TO_HOST));
    CHECK_SYNAPSE_STATUS(synStreamSynchronize(device_to_host_stream));
    // There is no need to call hcclWaitForCompletion(stream), as device->host memcopy is blocking.
    // Check if the data has been reduced correctly.
    bool is_ok = true;
    if (test_type == "Broadcast") {
    int root = get_example_test_root();
    for (size_t i = 0; i < input_host_data.size(); ++i) {
        // NOLINTNEXTLINE(cppcoreguidelines-avoid-magic-numbers)
        if (std::abs(output_host_data[i] - (float)(root)) != 0) {
        is_ok = false;
        }
    }
    log() << "Broadcast mpi_rank=" << mpi_rank << " loop=" << LOOP << " size=" << size << " count=" << count
            << " <float>"
            << " Input Buffer [" << input_host_data[0] << " " << input_host_data[1] << " " << input_host_data[2] << " "
            << input_host_data[3] << " ...]"
            << " Output Buffer [" << output_host_data[0] << " " << output_host_data[1] << " " << output_host_data[2]
            << " " << output_host_data[3] << " ...]"
            << " which is " << (is_ok ? "fine." : "bad.") << std::endl;
    } else if (test_type == "Allreduce") {
    for (size_t i = 0; i < input_host_data.size(); ++i) {
        // NOLINTNEXTLINE(cppcoreguidelines-avoid-magic-numbers)
        if (std::abs(output_host_data[i] - static_cast<float>(mpi_size) * (mpi_size - 1) / 2) > 1e-10f) {
        is_ok = false;
        }
    }
    log() << "Allreduce mpi_rank=" << mpi_rank << " size=" << size << " count=" << count << " <float>"
            << " Input Buffer [" << input_host_data[0] << " " << input_host_data[1] << " " << input_host_data[2] << " "
            << input_host_data[3] << " ...]"
            << " reduced to Output Buffer [" << output_host_data[0] << " " << output_host_data[1] << " "
            << output_host_data[2] << " " << output_host_data[3] << " ...]"
            << " which is " << (is_ok ? "fine." : "bad.") << std::endl;
    cout << "AllReduce[" << mpi_rank << "] with data_size=" << data_size << ", loop=" << LOOP << ": " << std::fixed
        << std::setprecision(3)
        << ((((double)LOOP * (2.0 * (double)data_size / mpi_size * (mpi_size - 1))) / 1e6) / dt) << " MB/s/card in "
        << dt << " seconds" << endl;
    }
    // Destroy a HCCL communicator.
    CHECK_HCCL_STATUS(hcclCommDestroy(hccl_comm));
    CHECK_HCCL_STATUS(hcclCommDestroy(hccl_comm_world));
    CHECK_SYNAPSE_STATUS(synHostUnmap(device_handle, input_host_data_ptr));
    CHECK_SYNAPSE_STATUS(synHostUnmap(device_handle, output_host_data_ptr));
    CHECK_SYNAPSE_STATUS(synDeviceFree(device_handle, input_dev_ptr, 0));
    CHECK_SYNAPSE_STATUS(synDeviceFree(device_handle, output_dev_ptr, 0));
    // Clean up HCCL.
    CHECK_SYNAPSE_STATUS(synDeviceRelease(device_handle));
    // Clean up Open MPI.
    CHECK_MPI_STATUS(MPI_Finalize());
} catch (const std::exception& ex) {
    log() << "error: " << ex.what() << std::endl;
    return -1;
}
return 0;
}
