Using HCCL
HCCL Library¶
To access the HCCL C API from C/C++, include the following header:
#include <hccl/hccl.h>
This header defines the HCCL symbols, for example the hcclGetVersion function.
In addition, portability macros are defined for symbols that have counterparts in the NCCL API:
#define ncclGetVersion hcclGetVersion
MPI¶
All examples in this document use Open MPI. However, HCCL does not depend on Open MPI and can be built and used without it.
An HCCL application requires MPI (or an equivalent mechanism) for:
Broadcasting unique_id to all workers.
Spawning all the worker processes on multiple nodes.
The following is an example of running an HCCL application with MPI:
mpirun -np 8 --tag-output my_program
In addition, any program using both HCCL and MPI must initialize the MPI execution context before using MPI to broadcast unique_id:
MPI_Init(&argc, &argv);
HCCL Runtime Initialization¶
Before the HCCL communicator is created, the Gaudi device must be acquired using SYN_API calls.
The following example shows a simple flow for acquiring the device:
synDeviceId device_handle{};
const synModuleId device_module_id {mpi_rank % MODULES_PER_NODE};
synDeviceAcquireByModuleId(&device_handle, device_module_id);
The module ID should be a number in the range [0, MODULES_PER_NODE-1]. mpi_rank is used to assign a different device to each process. MODULES_PER_NODE is the number of devices available on a single host.
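The rank-to-module mapping used above can be sketched as a small standalone function. This is only an illustration: the function name module_id_for_rank is hypothetical, and it assumes the launcher places ranks on hosts in contiguous blocks (ranks 0 to 7 on the first host, 8 to 15 on the second, and so on for 8 devices per host). With a different rank placement, the modulo rule could assign the same module to two processes on one host.

```cpp
#include <cassert>

// Hypothetical helper (not part of HCCL or SYN_API): maps a global MPI rank
// to a module ID on its host by taking the remainder of the rank divided by
// the number of devices per host. The result is always in
// [0, modules_per_node - 1].
inline int module_id_for_rank(int mpi_rank, int modules_per_node) {
    assert(mpi_rank >= 0 && modules_per_node > 0);
    return mpi_rank % modules_per_node;
}
```

For example, with 8 devices per host, rank 9 maps to module 1 on the second host.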
Prior to exiting, the program should release the device by calling synDeviceRelease, passing the handle received from the synDeviceAcquireByModuleId call:
synDeviceRelease(device_handle);
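One way to make sure the release call runs on every exit path is a scope guard. The sketch below is a generic C++ pattern, not something provided by HCCL or SYN_API; the ScopeGuard class and the run_with_guard function are hypothetical names, and a counter stands in for the real synDeviceRelease call so the snippet compiles without the SDK.

```cpp
#include <cassert>
#include <functional>
#include <utility>

// Generic scope guard: runs the stored callable exactly once, when the guard
// object goes out of scope.
class ScopeGuard {
public:
    explicit ScopeGuard(std::function<void()> fn) : fn_(std::move(fn)) {}
    ScopeGuard(const ScopeGuard&) = delete;
    ScopeGuard& operator=(const ScopeGuard&) = delete;
    ~ScopeGuard() { fn_(); }

private:
    std::function<void()> fn_;
};

// Hypothetical demonstration: returns how many times the "release" action ran.
inline int run_with_guard() {
    int release_calls = 0;
    {
        // In a real program the lambda body would call
        // synDeviceRelease(device_handle) instead of bumping a counter.
        ScopeGuard release_device([&release_calls] { ++release_calls; });
        // ... HCCL work would happen here ...
    }  // guard destructor runs here, on normal exit or during stack unwinding
    return release_calls;
}
```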
Note
Currently, HCCL supports only a single device per process, and by default all HCCL calls use the device that was acquired first within the given process context.
HCCL Environment Variables¶
During the HCCL initialization phase, the environment variables presented in the table below are used.
Parameter | Short Description | Value if not set |
---|---|---|
HCCL_BASE_PORT | Start of the port range used for sideband communication. | 45000 |
HCCL_SOCKET_IFNAME | Prefix of the name of the network interface used for sideband communication. | Auto-detected (see description) |
HCCL_BASE_PORT¶
HCCL_BASE_PORT defines the beginning of the port range that should be used for HCCL sideband TCP communication.
The ports used by HCCL are in the range [HCCL_BASE_PORT, HCCL_BASE_PORT+100].
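When configuring a firewall it can help to compute the resulting range explicitly. The following sketch applies the rule described above, with the default of 45000 taken from the table. The PortRange struct and sideband_port_range function are hypothetical names, the parsing is deliberately simplistic (no validation), and this is not how HCCL itself reads the variable.

```cpp
#include <cassert>
#include <cstdlib>

// Hypothetical type for illustration: inclusive range of TCP ports.
struct PortRange {
    int first;
    int last;
};

// value: the contents of HCCL_BASE_PORT, or nullptr when the variable is
// unset. Returns [base, base + 100], using 45000 as the base when unset.
inline PortRange sideband_port_range(const char* value) {
    const int kDefaultBasePort = 45000;  // documented "value if not set"
    const int kRangeWidth = 100;         // documented upper offset
    int base = kDefaultBasePort;
    if (value != nullptr && *value != '\0') {
        base = std::atoi(value);  // assumption: value is a plain decimal number
    }
    return PortRange{base, base + kRangeWidth};
}
```

A typical call would be sideband_port_range(std::getenv("HCCL_BASE_PORT")).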
HCCL_SOCKET_IFNAME¶
HCCL_SOCKET_IFNAME defines the prefix of the network interface name that is used for HCCL sideband TCP communication.
If not set, the first network interface with a name that does not start with lo or docker will be used.
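The selection rule can be sketched over a plain list of interface names. This is an illustration of the documented behavior only: pick_sideband_interface and starts_with are hypothetical helpers, the list of names is supplied by the caller rather than queried from the system, and returning the first match when a prefix is set is an assumption, since the text above does not say how ties are resolved.

```cpp
#include <cassert>
#include <string>
#include <vector>

// True when `s` begins with `prefix`.
inline bool starts_with(const std::string& s, const std::string& prefix) {
    return s.compare(0, prefix.size(), prefix) == 0;
}

// names:  interface names in system enumeration order.
// prefix: contents of HCCL_SOCKET_IFNAME, or an empty string when unset.
// Returns the chosen interface name, or an empty string when none qualifies.
inline std::string pick_sideband_interface(const std::vector<std::string>& names,
                                           const std::string& prefix) {
    for (const std::string& name : names) {
        if (!prefix.empty()) {
            // Prefix given: assumed to select the first matching interface.
            if (starts_with(name, prefix)) return name;
        } else if (!starts_with(name, "lo") && !starts_with(name, "docker")) {
            // No prefix: skip loopback and docker interfaces.
            return name;
        }
    }
    return std::string();
}
```

For the names lo, docker0, eth0, ens3, an empty prefix selects eth0, while the prefix ens selects ens3.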
HCCL Streams and Asynchronicity of Issued Operations¶
Synchronization between communication and compute is done using SYN_API stream calls.
First, acquire the streams that the HCCL calls will use:
synStreamHandle collective_stream{};
synStreamHandle device_to_host_stream{};
synStreamHandle host_to_device_stream{};
synStreamCreate(&collective_stream, device_handle, STREAM_TYPE_NETWORK_COLLECTIVE, 0);
synStreamCreate(&device_to_host_stream, device_handle, STREAM_TYPE_COPY_DEVICE_TO_HOST, 0);
synStreamCreate(&host_to_device_stream, device_handle, STREAM_TYPE_COPY_HOST_TO_DEVICE, 0);
In the above example, synStreamCreate is called three times to obtain separate streams for different purposes.
All collective operations are asynchronous and implemented as non-blocking calls. After an asynchronous call returns, another collective operation may be issued immediately, as long as it uses the same stream.
When the next operation uses a different stream, synchronization needs to be added.
It can be done either in a blocking manner, using synStreamSynchronize, or in a non-blocking manner, using the synEventRecord and synStreamWaitEvent pair. For example:
...
hcclAllReduce(input_dev_ptr, output_dev_ptr, elem_cnt,
hcclFloat32, hcclSum, hccl_comm, collective_stream);
// Create event that will mark end of *hcclAllReduce* operation
synEventHandle allreduce_event;
synEventCreate(&allreduce_event, device_handle, 0);
synEventRecord(allreduce_event, collective_stream, 0);
// Signal that all the work on *device_to_host_stream* should wait for *allreduce_event*
synStreamWaitEvent(device_to_host_stream, allreduce_event, 0);
// Schedule copy request from device to host
synMemCopyAsync(device_to_host_stream, output_dev_ptr, data_size, host_buffer_ptr, DRAM_TO_HOST);
// Wait (in blocking manner) for data to reach the host
synStreamSynchronize(device_to_host_stream);
...
The above example shows how to order a device-to-host copy after an hcclAllReduce call that runs on a different stream.
After all operations are submitted on their streams, the blocking synStreamSynchronize call makes the host wait until all the data has been copied to the host.