Setting Up Distributed Training

HCCL Initialization
Distributed communication on Habana devices is enabled through the HCCL (Habana Collective Communication Library) backend. The HCCL communication backend is loaded and the process group communication backend is initialized as "hccl" using the following script changes:
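The snippet below is a minimal sketch of such an initialization. It assumes the script is launched with torchrun or mpirun, which populate the standard WORLD_SIZE and RANK environment variables; only the habana_frameworks.torch.distributed.hccl import and the "hccl" backend name are taken from this section, the rest is illustrative.

```python
import os

import torch.distributed as dist

# Importing this module makes the HCCL backend available to torch.distributed.
# It must be imported before init_process_group() is called.
import habana_frameworks.torch.distributed.hccl  # noqa: F401

# torchrun/mpirun are expected to have set these environment variables
# (assumption for this sketch; adapt to your launcher's conventions).
world_size = int(os.environ.get("WORLD_SIZE", "1"))
rank = int(os.environ.get("RANK", "0"))

# Initialize the process group with the "hccl" communication backend.
dist.init_process_group(backend="hccl", world_size=world_size, rank=rank)
```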
In the example above, it is assumed that either torchrun or mpirun was used to start training and that all necessary environment variables are set before the habana_frameworks.torch.distributed.hccl import.
For further details on distributed training, see Distributed Training with PyTorch.