Distributed Backend Initialization

PyTorch supports distributed communication through the torch.distributed and torch.nn.parallel.DistributedDataParallel APIs for both data and model parallelism. PyTorch natively supports several communication backends, such as MPI, Gloo, and NCCL. Habana support for distributed communication is enabled through the HCCL (Habana Collective Communication Library) backend.

Device Mapping

For PyTorch distributed to work correctly, you need to export the ID environment variable. ID is mapped to the local rank, which is used to acquire the Gaudi card for a particular process in multi-node scenarios.

import os
os.environ["ID"] = str(local_rank)
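
In practice, the local rank is typically provided by the process launcher. Below is a minimal sketch that assumes the workers were started with a launcher such as torchrun, which sets LOCAL_RANK for each process; adjust the variable name to match your launcher.

import os

# LOCAL_RANK is set by launchers such as torchrun; adapt the name to your launcher.
local_rank = int(os.environ.get("LOCAL_RANK", "0"))

# Map this process to its Gaudi card.
os.environ["ID"] = str(local_rank)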

Alternatively, you can call initialize_distributed_hpu() in the script. See Gaudi-to-process Assignment for reference.

HCCL Initialization

Support for the HCCL communication backend is loaded, and the process group communication backend is initialized as "hccl", using the following script changes:

import torch
import habana_frameworks.torch.distributed.hccl  # registers the HCCL backend with torch.distributed
torch.distributed.init_process_group(backend='hccl', rank=rank, world_size=world_size)

In the example above, it is assumed that the rank and world_size variables are collected at the beginning of the script. If needed, the optional initialize_distributed_hpu() helper can be used to obtain them:

world_size, rank, local_rank = initialize_distributed_hpu()
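
Putting these pieces together, a minimal initialization sketch is shown below. The helper is imported from habana_frameworks.torch.distributed.hccl; the rendezvous settings (MASTER_ADDR and MASTER_PORT) are assumed to be provided by the launcher or set earlier in the script.

import torch
from habana_frameworks.torch.distributed.hccl import initialize_distributed_hpu

# Query world size, global rank, and local rank for this process.
world_size, rank, local_rank = initialize_distributed_hpu()

# Initialize the process group with the HCCL backend.
# MASTER_ADDR and MASTER_PORT must be set in the environment for the
# default env:// rendezvous.
torch.distributed.init_process_group(backend='hccl', rank=rank, world_size=world_size)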

To add DDP communication hooks, users can follow the PyTorch Model References.
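
As an illustration only, the sketch below registers one of PyTorch's built-in DDP communication hooks (fp16_compress_hook) on a placeholder model; the exact integration for the reference models is described in the PyTorch Model References.

import torch
import habana_frameworks.torch.core  # registers the "hpu" device
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.distributed.algorithms.ddp_comm_hooks import default_hooks

# Placeholder model; move it to the Gaudi device before wrapping with DDP.
model = torch.nn.Linear(16, 16).to("hpu")
ddp_model = DDP(model)  # device_ids is left unset for non-CUDA devices

# Compress gradients to FP16 during allreduce; state=None selects the default process group.
ddp_model.register_comm_hook(state=None, hook=default_hooks.fp16_compress_hook)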