Distributed Backend Initialization

PyTorch supports distributed communication through the torch.distributed and torch.nn.parallel.DistributedDataParallel APIs for both data and model parallelism. PyTorch natively supports several communication backends such as MPI, Gloo, and NCCL. On Habana Gaudi, distributed communication is enabled through the HCCL (Habana Collective Communication Library) backend.

Device Mapping

For PyTorch distributed to work correctly, you need to export the ID environment variable. ID is mapped to the local rank, which is used to acquire the Gaudi card for a particular process in multi-node runs.

os.environ["ID"] = local_rank

HCCL Initialization

Support for the HCCL communication backend is loaded and the process group is initialized with the "hccl" backend using the following script changes:

import torch
import habana_frameworks.torch.distributed.hccl  # registers the HCCL backend with torch.distributed

torch.distributed.init_process_group(backend='hccl', rank=rank, world_size=world_size)
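
As a minimal sketch, assuming a launcher that exports RANK, WORLD_SIZE, and LOCAL_RANK (for example torchrun), the device mapping and process group initialization can be driven entirely from the environment:

import os
import torch
import habana_frameworks.torch.distributed.hccl  # registers the HCCL backend

# Assumption: RANK, WORLD_SIZE, and LOCAL_RANK are exported by the launcher
world_size = int(os.environ["WORLD_SIZE"])
rank = int(os.environ["RANK"])
os.environ["ID"] = os.environ["LOCAL_RANK"]  # device mapping, as described above

torch.distributed.init_process_group(backend='hccl', rank=rank, world_size=world_size)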

For examples of wrapping models with DistributedDataParallel (DDP) and adding communication hooks, users can follow the PyTorch Model References.
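
As an illustrative sketch (not the Model References code itself), a model can be moved to the hpu device and wrapped with DDP so that gradients are reduced over HCCL; the fp16 compression hook registered below is one of PyTorch's built-in communication hooks and is shown only as an example:

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.distributed.algorithms.ddp_comm_hooks import default_hooks
import habana_frameworks.torch.core  # loads the HPU device backend

device = torch.device("hpu")
model = torch.nn.Linear(128, 10).to(device)  # placeholder model for illustration
ddp_model = DDP(model)                       # gradients are all-reduced via HCCL

# Optional: register a built-in DDP communication hook (example only)
ddp_model.register_comm_hook(dist.group.WORLD, default_hooks.fp16_compress_hook)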