Setting Up Distributed Training

HCCL Initialization

Distributed communication on Habana devices is enabled through the HCCL (Habana Collective Communication Library) backend. The backend is loaded and the process group is initialized with backend='hccl' using the following script changes:

import habana_frameworks.torch.distributed.hccl  # loads the HCCL backend
import torch.distributed

torch.distributed.init_process_group(backend='hccl')

The example above assumes that either torchrun or mpirun was used to launch training and that all required environment variables are set before habana_frameworks.torch.distributed.hccl is imported.
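
As a rough end-to-end illustration, the sketch below combines the initialization above with a simple all-reduce. The 'hpu' device string, the collective call, and the script name are illustrative additions, not part of the snippet above; the sketch assumes the Habana PyTorch bridge is installed and that the script is launched with torchrun, which sets RANK, WORLD_SIZE, and LOCAL_RANK in the environment.

# minimal_hccl_example.py -- minimal sketch, assuming a torchrun launch
import torch
import torch.distributed as dist

# Must be imported before init_process_group so the HCCL backend is registered.
import habana_frameworks.torch.distributed.hccl  # noqa: F401

dist.init_process_group(backend='hccl')

rank = dist.get_rank()
world_size = dist.get_world_size()

# Illustrative collective: each rank contributes its rank as a tensor on the
# HPU device, and the all-reduce sums the contributions across all ranks.
t = torch.tensor([float(rank)], device='hpu')
dist.all_reduce(t, op=dist.ReduceOp.SUM)
print(f"rank {rank}/{world_size}: sum of ranks = {t.item()}")

dist.destroy_process_group()

With torchrun, an 8-process single-node run of this sketch could be started with, for example, torchrun --nproc_per_node=8 minimal_hccl_example.py.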

For further details on distributed training, see Distributed Training with PyTorch.