Using HCCL

HCCL Library

The HCCL library is provided as a dynamically linked library named libhccl.so.X.Y.Z, where X.Y.Z denotes the TensorFlow version whose Abseil ABI the library is compatible with. For instance, a library named libhccl.so.2.4.1 has an Abseil ABI compatible with TensorFlow 2.4.1. However, the HCCL library does not depend on TensorFlow and can technically be used without it.

Note

The library name suffix, as well as the Abseil dependency, will be removed in future releases.

To access the HCCL C API from C/C++:

#include <hccl/hccl.h>

This header defines the HCCL symbols, for example the hcclGetVersion function. For symbols that have a counterpart in the NCCL API, portability macros are defined as well:

#define ncclGetVersion hcclGetVersion
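
The effect of such a portability macro can be illustrated with a minimal, self-contained sketch. Everything in it is a local stand-in: the hcclResult_t enum, the hcclGetVersion stub, its signature, and the version number 1234 are assumptions made for illustration and do not come from the real header. The helper query_version_with_nccl_names is hypothetical as well.

```cpp
// Stand-in for <hccl/hccl.h>: the declarations below are local stubs
// used only to demonstrate the macro mechanism, not the real HCCL API.
enum hcclResult_t { hcclSuccess = 0 };

// Stub with an assumed signature; reports a made-up version number.
inline hcclResult_t hcclGetVersion(int* version) {
    *version = 1234;
    return hcclSuccess;
}

// Portability macro in the style shown above.
#define ncclGetVersion hcclGetVersion

// Hypothetical caller written against the NCCL name. The preprocessor
// rewrites ncclGetVersion to hcclGetVersion before compilation.
inline int query_version_with_nccl_names() {
    int version = 0;
    const bool ok = (ncclGetVersion(&version) == hcclSuccess);
    return ok ? version : -1;
}
```

Because the aliasing happens in the preprocessor, code that already calls the NCCL name needs no source changes beyond including the HCCL header.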

MPI

All examples in this document use Open MPI. However, HCCL does not have an Open MPI dependency and can be built and used without it.

In an HCCL application, MPI (or an equivalent) is required for:

1. Broadcasting unique_id to all workers.
2. Spawning all the worker processes on multiple nodes.

The following is an example of running an HCCL application with MPI:

mpirun -np 8 --tag-output my_program

In addition, any program using both HCCL and MPI must initialize the MPI execution context before using MPI to broadcast unique_id.

MPI_Init(&argc, &argv);

HCCL Runtime Initialization

Before the HCCL communicator is created, the Gaudi device needs to be acquired using Synapse API calls.

synDeviceId device_handle{};
const synModuleId device_module_id {mpi_rank % MODULES_PER_NODE};
synDeviceAcquireByModuleId(&device_handle, device_module_id);

This is a simple flow for opening the device. device_module_id should be a number in the range [0, MODULES_PER_NODE-1], and mpi_rank is used to assign a different device to every process.

The MODULES_PER_NODE variable is the number of devices available on a single host.
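
The rank-to-module mapping used above can be sketched as a small helper. The function name module_id_for_rank is hypothetical. The sketch assumes that ranks are placed on hosts in consecutive blocks (ranks 0 to 7 on the first host, 8 to 15 on the second, and so on for eight devices per host) and that no host runs more processes than it has devices. Under other placements, two processes on the same host could map to the same module.

```cpp
// Hypothetical helper: maps a global MPI rank to a module id on its host.
// Assumes consecutive rank placement and modules_per_node > 0.
constexpr int module_id_for_rank(int mpi_rank, int modules_per_node) {
    return mpi_rank % modules_per_node;
}

// With 8 devices per host, ranks 0..7 map to modules 0..7 and
// rank 8 (the first process on the second host) wraps around to module 0.
static_assert(module_id_for_rank(0, 8) == 0, "first rank uses module 0");
static_assert(module_id_for_rank(8, 8) == 0, "rank 8 wraps to module 0");
```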

Prior to exiting, the program should release the device in Synapse by calling synDeviceRelease with the handle received from the synDeviceAcquireByModuleId call.

synDeviceRelease(device_handle);
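
One way to make sure the release happens on every exit path is a scope guard. The sketch below is a generic C++ pattern, not part of HCCL or Synapse. The class name ScopedRelease is hypothetical, and the guard takes an arbitrary callback so that the example stays self-contained.

```cpp
#include <functional>
#include <utility>

// Hypothetical scope guard: runs the given callback when it goes out of
// scope, including on early returns and exceptions.
class ScopedRelease {
public:
    explicit ScopedRelease(std::function<void()> release)
        : release_(std::move(release)) {}

    ~ScopedRelease() {
        if (release_) {
            release_();
        }
    }

    // Non-copyable, so the callback runs exactly once.
    ScopedRelease(const ScopedRelease&) = delete;
    ScopedRelease& operator=(const ScopedRelease&) = delete;

private:
    std::function<void()> release_;
};
```

In an HCCL program, the callback would typically be a lambda that calls synDeviceRelease(device_handle), with the guard declared right after the device is acquired.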

Warning

Currently, HCCL supports only a single device per process, and by default all HCCL calls use the device that was acquired first within the given process context.

HCCL Environment Variables

During the HCCL initialization phase, the environment variables presented in the table below are used.

| Parameter | Short Description | Value if not set |
|---|---|---|
| HCCL_BASE_PORT | Port range used for sideband communication. | 45000 |
| HCCL_SOCKET_IFNAME | Prefix of name of network interface used for sideband communication. | Auto-detected (see description) |

HCCL_BASE_PORT

HCCL_BASE_PORT defines the beginning of the port range used for HCCL sideband TCP communication. The ports used by HCCL are in the range [HCCL_BASE_PORT, HCCL_BASE_PORT+100].
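
For tasks such as opening firewall ports, the resulting range can be computed as in the following sketch. The helper name hccl_sideband_port_range is hypothetical and the code only mirrors the rule described here; it is not HCCL's own implementation. It assumes that the variable, when set, holds a valid integer.

```cpp
#include <cstdlib>
#include <string>
#include <utility>

// Hypothetical helper: returns the inclusive port range
// [base, base + 100], where base comes from HCCL_BASE_PORT
// or falls back to the documented default of 45000.
inline std::pair<int, int> hccl_sideband_port_range() {
    constexpr int kDefaultBasePort = 45000;
    constexpr int kRangeWidth = 100;

    int base = kDefaultBasePort;
    if (const char* value = std::getenv("HCCL_BASE_PORT")) {
        // Assumption: the value is a valid integer; std::stoi throws otherwise.
        base = std::stoi(value);
    }
    return {base, base + kRangeWidth};
}
```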

HCCL_SOCKET_IFNAME

HCCL_SOCKET_IFNAME defines the prefix of the network interface name that is used for HCCL sideband TCP communication. If not set, the first network interface with a name that does not start with lo or docker will be used.
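
The selection rule can be sketched as a pure function over a list of interface names. The helper names are hypothetical, and the code illustrates the described behavior rather than HCCL's actual implementation. In particular, taking the first interface that matches a configured prefix is an assumption. An empty ifname_prefix stands for an unset HCCL_SOCKET_IFNAME, and an empty return value means no interface qualified.

```cpp
#include <string>
#include <vector>

// True if `text` begins with `prefix`.
inline bool starts_with(const std::string& text, const std::string& prefix) {
    return text.compare(0, prefix.size(), prefix) == 0;
}

// Hypothetical helper: picks the sideband interface from `interfaces`,
// which is assumed to be listed in system order.
inline std::string select_sideband_interface(
        const std::vector<std::string>& interfaces,
        const std::string& ifname_prefix) {
    for (const std::string& name : interfaces) {
        if (!ifname_prefix.empty()) {
            // Prefix configured: take the first matching interface.
            if (starts_with(name, ifname_prefix)) {
                return name;
            }
        } else if (!starts_with(name, "lo") && !starts_with(name, "docker")) {
            // Auto-detection: skip loopback and Docker interfaces.
            return name;
        }
    }
    return std::string();
}
```

On a host whose interfaces are lo, docker0, eth0, and ens1, this sketch selects eth0 when no prefix is given and ens1 when the prefix is ens.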

HCCL Streams and Asynchronicity of Issued Operations

Synchronization between communication and compute is done using Synapse API stream calls. Before calling HCCL, you first need to acquire a stream from Synapse:

synStreamHandle collective_stream{};
synStreamHandle device_to_host_stream{};
synStreamHandle host_to_device_stream{};

synStreamCreate(&collective_stream, device_handle, STREAM_TYPE_NETWORK_COLLECTIVE, 0);
synStreamCreate(&device_to_host_stream, device_handle, STREAM_TYPE_COPY_DEVICE_TO_HOST, 0);
synStreamCreate(&host_to_device_stream, device_handle, STREAM_TYPE_COPY_HOST_TO_DEVICE, 0);

In the above example, synStreamCreate is called three times to obtain separate streams for different purposes.

All collective operations are asynchronous, that is, they are implemented as non-blocking calls. After an asynchronous call, another collective operation may be issued immediately, as long as it uses the same Synapse stream.

When the next operation uses a different stream, synchronization needs to be added. It can be done either in a blocking manner, using synStreamSynchronize, or in a non-blocking manner, using a synEventRecord and synStreamWaitEvent pair.

...
hcclAllReduce(input_dev_ptr, output_dev_ptr, elem_cnt,
    hcclFloat32, hcclSum, hccl_comm, collective_stream);
// Create event that will mark end of *hcclAllReduce* operation
synEventHandle allreduce_event;
synEventCreate(&allreduce_event, device_handle, 0);
synEventRecord(allreduce_event, collective_stream, 0);
// Signal that all the work on *device_to_host_stream* should wait for *allreduce_event*
synStreamWaitEvent(device_to_host_stream, allreduce_event, 0);
// Schedule copy request from device to host
synMemCopyAsync(device_to_host_stream, output_dev_ptr, data_size, host_buffer_ptr, DRAM_TO_HOST);
// Wait (in blocking manner) for data to reach the host
synStreamSynchronize(device_to_host_stream);
...

The above example shows how to synchronize an hcclAllReduce call with a subsequent copy of its result to the host. After all operations have been submitted to their streams, synStreamSynchronize is called to block until all the data has been copied to the host. More information can be found in the SynapseAI Training API Reference documentation.
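
Note that the collective call takes an element count (elem_cnt), while the memory copy takes a size in bytes (data_size). The snippet does not show how the two are related. A minimal sketch, assuming that both refer to the same buffer of hcclFloat32 elements, is given below; the helper name float32_buffer_bytes is hypothetical.

```cpp
#include <cstddef>
#include <cstdint>

// The sketch relies on float being a 4-byte type, as it is on
// mainstream platforms.
static_assert(sizeof(float) == 4, "float is expected to be 32 bits wide");

// Hypothetical helper: size in bytes of a buffer that holds
// elem_cnt 32-bit floating-point values.
constexpr std::uint64_t float32_buffer_bytes(std::size_t elem_cnt) {
    return static_cast<std::uint64_t>(elem_cnt) * sizeof(float);
}
```

Under that assumption, data_size in the example would be float32_buffer_bytes(elem_cnt).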