Distributed Backend Initialization
PyTorch provides distributed communication APIs for both data and model parallelism.
It natively supports several communication backends, such as MPI, Gloo, and NCCL.
On Habana Gaudi, distributed communication is enabled through the HCCL (Habana Collective Communication Library) backend.
For PyTorch distributed to work correctly, you need to export the environment variable ID. ID is mapped to the local rank, which is used to acquire the Gaudi card for a particular process in the multi-node case:
os.environ["ID"] = str(local_rank)  # environment values must be strings
Support for the HCCL communication backend is loaded, and the process group communication backend is initialized as "hccl", using the following script changes:
import habana_frameworks.torch.distributed.hccl
torch.distributed.init_process_group(backend='hccl', rank=rank, world_size=world_size)
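Putting the pieces together, a minimal single-node initialization sketch might look like the following. The helper name, master address, and port are illustrative placeholders; the default env:// rendezvous used by init_process_group requires MASTER_ADDR and MASTER_PORT to be set.

import os
import torch.distributed as dist
import habana_frameworks.torch.distributed.hccl  # loads HCCL backend support

def init_distributed(rank, world_size, local_rank):
    # Rendezvous settings for the default env:// init method (placeholder values).
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "12355")

    # Map the local rank to ID so this process acquires its Gaudi card.
    os.environ["ID"] = str(local_rank)

    # Initialize the process group over the HCCL backend.
    dist.init_process_group(backend="hccl", rank=rank, world_size=world_size)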
For examples of adding communication hooks and training with DDP, users can follow the PyTorch Model References.
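As an illustrative sketch (not taken from the Model References), assuming the Gaudi device is exposed as "hpu" and the process group has already been initialized, a model can be wrapped in DistributedDataParallel and, optionally, a built-in communication hook can be registered:

import torch
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.distributed.algorithms.ddp_comm_hooks import default_hooks

# Placeholder model; any nn.Module works here.
model = nn.Linear(128, 10).to(torch.device("hpu"))

# Wrap with DDP so gradients are all-reduced over the HCCL backend.
ddp_model = DDP(model)

# Optional: register a built-in fp16 gradient-compression hook.
# Support for specific hooks with HCCL is an assumption to verify.
ddp_model.register_comm_hook(state=None, hook=default_hooks.fp16_compress_hook)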