Distributed Backend Initialization
PyTorch supports distributed communication through the torch.distributed APIs for both data and model parallelism.
PyTorch natively supports several communication backends, such as MPI, Gloo, and NCCL.
Distributed communication on Habana devices is enabled through the HCCL (Habana Collective Communication Library) backend.
Habana PyTorch resolves the device-to-process mapping automatically based on the environment variables set by the launcher.
In more advanced scenarios where manual device-to-process mapping is required, see Gaudi-to-process Assignment for reference.
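As a minimal sketch of what the launcher-provided mapping looks like (assuming a torchrun-style launcher that exports RANK, WORLD_SIZE, and LOCAL_RANK; the variable names and defaults here are illustrative, not part of the Habana API), each process can simply read its assignment from the environment:

    import os

    # Values exported by the launcher; names assume a torchrun-style launcher.
    rank = int(os.environ.get("RANK", "0"))
    world_size = int(os.environ.get("WORLD_SIZE", "1"))
    local_rank = int(os.environ.get("LOCAL_RANK", "0"))

    print(f"global rank {rank} of {world_size}, local rank {local_rank}")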
The following script loads the HCCL communication backend and initializes the process group with the "hccl" backend:
    import torch.distributed
    import habana_frameworks.torch.distributed.hccl  # registers the HCCL backend

    torch.distributed.init_process_group(backend='hccl')
The above example assumes training is launched with a launcher that sets all necessary environment variables before the process group is initialized.
For examples that wrap models in DDP (DistributedDataParallel) and add communication hooks, refer to the PyTorch Model References GitHub page.
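As a rough, hedged sketch (not Habana's reference code), wrapping a model with DistributedDataParallel on top of the "hccl" process group could look like the following; the toy model, the tensor shapes, and the "hpu" device placement are assumptions for illustration:

    import torch
    import torch.distributed
    import habana_frameworks.torch.distributed.hccl  # registers the HCCL backend

    torch.distributed.init_process_group(backend='hccl')

    # Hypothetical model for illustration; replace with your own module.
    model = torch.nn.Linear(16, 4).to("hpu")

    # DDP uses the default ("hccl") process group for gradient all-reduce.
    ddp_model = torch.nn.parallel.DistributedDataParallel(model)

    x = torch.randn(8, 16).to("hpu")
    loss = ddp_model(x).sum()
    loss.backward()  # gradients are averaged across all processes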
Setting Environment Variables for Custom Launchers
For custom launchers, use initialize_distributed_hpu(world_size, rank, local_rank) to set all environment variables necessary for training:
    import torch.distributed
    import habana_frameworks.torch.distributed.hccl as hccl

    # world_size, rank, and local_rank are provided by your custom launcher.
    hccl.initialize_distributed_hpu(world_size=world_size, rank=rank, local_rank=local_rank)
    torch.distributed.init_process_group(backend='hccl')
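For context, a custom launcher can be as simple as spawning one process per card and handing each process its rank. The sketch below is an assumption-laden illustration, not Habana's reference launcher: torch.multiprocessing.spawn, the single-node world size of 8, and the rendezvous address/port are all hypothetical choices made for the example.

    import os
    import torch.distributed
    import torch.multiprocessing as mp
    import habana_frameworks.torch.distributed.hccl as hccl

    WORLD_SIZE = 8  # hypothetical: one process per Gaudi card on a single node

    def worker(local_rank):
        # Assumed rendezvous settings for the default env:// init method.
        os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
        os.environ.setdefault("MASTER_PORT", "12355")

        # Single-node case: the global rank equals the local rank.
        hccl.initialize_distributed_hpu(world_size=WORLD_SIZE,
                                        rank=local_rank,
                                        local_rank=local_rank)
        torch.distributed.init_process_group(backend='hccl')
        # ... training loop ...
        torch.distributed.destroy_process_group()

    if __name__ == "__main__":
        mp.spawn(worker, nprocs=WORLD_SIZE)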
Alternatively, you can manually set the following environment variables before importing habana_frameworks.torch.distributed.hccl:
    import os

    # Environment variable values must be strings.
    os.environ['WORLD_SIZE'] = str(world_size)
    os.environ['RANK'] = str(rank)
    os.environ['LOCAL_RANK'] = str(local_rank)
    ...
    import torch.distributed
    import habana_frameworks.torch.distributed.hccl

    torch.distributed.init_process_group(backend='hccl')
To obtain the variable values set in the environment by the launcher, call initialize_distributed_hpu() without any parameters:
    import habana_frameworks.torch.distributed.hccl as hccl

    world_size, rank, local_rank = hccl.initialize_distributed_hpu()
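The returned values are typically fed into standard PyTorch utilities. The sketch below is a hedged usage example, not part of the Habana API: the TensorDataset, batch size, and shapes are assumptions, and DistributedSampler is the stock PyTorch class used here to shard data by rank.

    import torch
    import torch.distributed
    import habana_frameworks.torch.distributed.hccl as hccl
    from torch.utils.data import DataLoader, TensorDataset
    from torch.utils.data.distributed import DistributedSampler

    world_size, rank, local_rank = hccl.initialize_distributed_hpu()
    torch.distributed.init_process_group(backend='hccl')

    # Hypothetical dataset for illustration.
    dataset = TensorDataset(torch.randn(1024, 16))

    # Each process iterates over a disjoint shard of the data.
    sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank)
    loader = DataLoader(dataset, batch_size=32, sampler=sampler)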