Setting Up Distributed Training
Habana supports distributed communication through the HCCL (Habana Collective Communication Library) backend. The following script changes load the HCCL backend and initialize the process group with "hccl" as its communication backend:
import torch
import habana_frameworks.torch.distributed.hccl  # registers the "hccl" backend

torch.distributed.init_process_group(backend='hccl')
The example above assumes that training was launched with torchrun or mpirun and that all necessary environment variables were set beforehand.
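The environment variables in question are the standard torch.distributed rendezvous variables (RANK, WORLD_SIZE, MASTER_ADDR, MASTER_PORT), which torchrun exports for each worker process. As a minimal sketch of what a script relies on before calling init_process_group (the helper name read_dist_env is hypothetical, introduced here for illustration):

```python
import os

def read_dist_env():
    # Collects the standard torch.distributed environment variables
    # that a launcher such as torchrun exports for each worker.
    return {
        "rank": int(os.environ["RANK"]),              # global rank of this process
        "world_size": int(os.environ["WORLD_SIZE"]),  # total number of processes
        "master_addr": os.environ["MASTER_ADDR"],     # rendezvous host
        "master_port": os.environ["MASTER_PORT"],     # rendezvous port
    }

if __name__ == "__main__":
    # Simulate what torchrun would export for a single-process run.
    os.environ.setdefault("RANK", "0")
    os.environ.setdefault("WORLD_SIZE", "1")
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    cfg = read_dist_env()
    print(cfg["rank"], cfg["world_size"])
```

If any of these variables is missing, init_process_group cannot complete the rendezvous, which is why the launcher must set them before the script runs.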
For further details on distributed training, see Distributed Training with PyTorch.