Using HCCL¶
HCCL Library¶
To access the HCCL C API from C/C++, include the HCCL header:
#include <hccl/hccl.h>
This header defines the HCCL symbols, for example the hcclGetVersion function.
For symbols that have a counterpart in the NCCL API, portability macros are defined as well:
#define ncclGetVersion hcclGetVersion
MPI¶
All examples in this document use Open MPI. However, HCCL does not depend on Open MPI and can be built and used without it.
In an HCCL application, MPI (or an equivalent) is required for:
1. Broadcasting unique_id to all workers.
2. Spawning all the worker processes on multiple nodes.
Below is an example of running an HCCL application with MPI:
mpirun -np 8 --tag-output my_program
In addition, any program using both HCCL and MPI must initialize the MPI execution context before using MPI to broadcast unique_id:
MPI_Init(&argc, &argv);
HCCL Runtime Initialization¶
Before the HCCL communicator is created, the Gaudi device needs to be acquired using SYN_API calls.
synDeviceId device_handle{};
const synModuleId device_module_id {mpi_rank % MODULES_PER_NODE};
synDeviceAcquireByModuleId(&device_handle, device_module_id);
This is a simple flow for opening the device. module_id should be a number in the range [0, MODULES_PER_NODE-1], and mpi_rank is used to assign a different device to each process. MODULES_PER_NODE is the number of devices available on a single host.
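The rank-to-module mapping used above (mpi_rank % MODULES_PER_NODE) can be sketched as a small standalone function. The helper name module_id_for_rank is hypothetical and is not part of SYN_API or HCCL. The sketch also assumes that ranks are placed on hosts in contiguous blocks, so that no two processes on the same host end up with the same remainder.

```cpp
#include <cassert>

// Hypothetical helper (not part of SYN_API or HCCL): maps a global MPI rank
// to a module ID in the range [0, modules_per_node - 1].
// Assumption: ranks are assigned to hosts in contiguous blocks, so two ranks
// on the same host never share the same remainder.
int module_id_for_rank(int mpi_rank, int modules_per_node) {
    assert(mpi_rank >= 0 && modules_per_node > 0);
    return mpi_rank % modules_per_node;
}
```

For example, with 8 devices per host and 16 ranks spread over two hosts, ranks 0 to 7 and ranks 8 to 15 both map to module IDs 0 to 7, one set per host.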
Before exiting, the program should release the device by calling synDeviceRelease with the handle received from the synDeviceAcquireByModuleId call.
synDeviceRelease(device_handle);
Note
Currently, HCCL supports only a single device per process, and by default all HCCL calls use the device that was acquired first within the given process context.
HCCL Environment Variables¶
During the HCCL initialization phase, the environment variables presented in the table below are used.
| Parameter | Short Description | Value if not set |
|---|---|---|
| HCCL_BASE_PORT | Port range used for sideband communication. | 45000 |
| HCCL_SOCKET_IFNAME | Prefix of the name of the network interface used for sideband communication. | Auto-detected (see description) |
HCCL_BASE_PORT¶
HCCL_BASE_PORT defines the beginning of the port range that should be used for HCCL sideband TCP communication.
The ports used by HCCL are in the range [HCCL_BASE_PORT, HCCL_BASE_PORT+100].
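When planning firewall rules, it can help to compute the resulting range explicitly. The following is a minimal sketch, not HCCL's own parsing code: the helper sideband_port_range and the constants are hypothetical, and it assumes the variable holds a plain decimal number.

```cpp
#include <cassert>
#include <cstdlib>
#include <string>
#include <utility>

// Documented default when HCCL_BASE_PORT is not set.
constexpr int kDefaultBasePort = 45000;
// Documented width of the range: [base, base + 100].
constexpr int kPortSpan = 100;

// Hypothetical helper (not an HCCL API): returns the inclusive port range
// for a given HCCL_BASE_PORT value. Pass nullptr when the variable is unset.
// Assumption: the value is a plain decimal number; std::stoi throws otherwise.
std::pair<int, int> sideband_port_range(const char* env_value) {
    int base = kDefaultBasePort;
    if (env_value != nullptr && *env_value != '\0') {
        base = std::stoi(env_value);
    }
    assert(base > 0 && base + kPortSpan <= 65535);
    return {base, base + kPortSpan};
}
```

A typical call would be sideband_port_range(std::getenv("HCCL_BASE_PORT")), which yields 45000 to 45100 when the variable is not set.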
HCCL_SOCKET_IFNAME¶
HCCL_SOCKET_IFNAME defines the prefix of the network interface name that is used for HCCL sideband TCP communication.
If not set, the first network interface with a name that does not start with lo or docker will be used.
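The selection rule can be modeled on a plain list of interface names. This is an illustrative sketch only, not HCCL's implementation: pick_sideband_interface and starts_with are hypothetical helpers, and choosing the first matching interface when a prefix is set is an assumption (the text above only specifies the first-match behavior for the unset case).

```cpp
#include <cassert>
#include <string>
#include <vector>

// Returns true when `name` begins with `prefix`.
static bool starts_with(const std::string& name, const std::string& prefix) {
    return name.compare(0, prefix.size(), prefix) == 0;
}

// Hypothetical model of the selection rule (not an HCCL API).
// Non-empty prefix: returns the first name starting with that prefix
// (first-match is an assumption).
// Empty prefix: returns the first name that does not start with "lo" or "docker".
// Returns an empty string when no interface qualifies.
std::string pick_sideband_interface(const std::vector<std::string>& names,
                                    const std::string& prefix) {
    for (const auto& name : names) {
        assert(!name.empty());  // interface names are expected to be non-empty
        if (!prefix.empty()) {
            if (starts_with(name, prefix)) return name;
        } else if (!starts_with(name, "lo") && !starts_with(name, "docker")) {
            return name;
        }
    }
    return "";
}
```

In this model, a host with interfaces lo, docker0, eth0, and ens1 would use eth0 when no prefix is given, and ens1 when the prefix is ens.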
HCCL Streams and Asynchronicity of Issued Operations¶
Synchronization between communication and compute is done using SYN_API stream calls.
Before calling HCCL, you first need to acquire a stream:
synStreamHandle collective_stream{};
synStreamHandle device_to_host_stream{};
synStreamHandle host_to_device_stream{};
synStreamCreate(&collective_stream, device_handle, STREAM_TYPE_NETWORK_COLLECTIVE, 0);
synStreamCreate(&device_to_host_stream, device_handle, STREAM_TYPE_COPY_DEVICE_TO_HOST, 0);
synStreamCreate(&host_to_device_stream, device_handle, STREAM_TYPE_COPY_HOST_TO_DEVICE, 0);
In the above example, synStreamCreate is called three times to obtain separate streams for different purposes.
All collective operations are asynchronous and are implemented as non-blocking calls. After an asynchronous call, another collective operation may be issued immediately, as long as it uses the same stream.
When the next operation uses a different stream, synchronization needs to be added. It can be done either in a blocking manner, using synStreamSynchronize, or in a non-blocking manner, using a synEventRecord and synStreamWaitEvent pair.
...
hcclAllReduce(input_dev_ptr, output_dev_ptr, elem_cnt,
hcclFloat32, hcclSum, hccl_comm, collective_stream);
// Create event that will mark end of *hcclAllReduce* operation
synEventHandle allreduce_event;
synEventCreate(&allreduce_event, device_handle, 0);
synEventRecord(allreduce_event, collective_stream, 0);
// Signal that all the work on *device_to_host_stream* should wait for *allreduce_event*
synStreamWaitEvent(device_to_host_stream, allreduce_event, 0);
// Schedule copy request from device to host
synMemCopyAsync(device_to_host_stream, output_dev_ptr, data_size, host_buffer_ptr, DRAM_TO_HOST);
// Wait (in blocking manner) for data to reach the host
synStreamSynchronize(device_to_host_stream);
...
The above example shows how to order a device-to-host copy after an hcclAllReduce call that runs on a different stream. After all operations have been submitted to their streams, the blocking synStreamSynchronize call waits until all the data has been copied to the host.