Distributed Backend Initialization

PyTorch supports distributed communication through the torch.distributed and torch.nn.parallel.DistributedDataParallel APIs for both data and model parallelism, and natively provides several communication backends such as MPI, Gloo, and NCCL. On Intel® Gaudi® AI accelerators, distributed communication is enabled through the HCCL (Habana Collective Communication Library) backend.

Device Mapping

Intel Gaudi PyTorch resolves device mapping automatically based on the environment variables set by the launcher, either torchrun or mpirun. For advanced scenarios that require manual device-to-process mapping, see Gaudi-to-process Assignment.
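For reference, torchrun exports the standard per-process rank variables that the Gaudi integration reads; a minimal sketch of inspecting them from within a worker process (the variable names below are the standard torchrun/PyTorch ones, not Gaudi-specific additions; with mpirun the equivalent values come from the MPI environment instead):

import os

# torchrun exports these per-process variables; the Gaudi PyTorch bridge
# uses them to select the local HPU device for this process.
world_size = int(os.environ.get('WORLD_SIZE', '1'))
rank = int(os.environ.get('RANK', '0'))
local_rank = int(os.environ.get('LOCAL_RANK', '0'))
print(f'rank {rank}/{world_size}, local_rank {local_rank}')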

HCCL Initialization

The following script loads the HCCL communication backend and initializes the process group with "hccl" as the backend:

import torch
import habana_frameworks.torch.distributed.hccl  # registers the HCCL backend with torch.distributed
torch.distributed.init_process_group(backend='hccl')

The above example assumes training is launched with either torchrun or mpirun and that all necessary environment variables are set before habana_frameworks.torch.distributed.hccl is imported.

To add communication hooks, such as DDP communication hooks, refer to the PyTorch Model References GitHub page.
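As a minimal illustration of where init_process_group fits in a DDP setup on Gaudi (a sketch only; MyModel is a placeholder for your own module, and the DDP call uses default arguments rather than any Gaudi-specific settings):

import torch
import habana_frameworks.torch.core as htcore          # registers the 'hpu' device type
import habana_frameworks.torch.distributed.hccl        # registers the HCCL backend

torch.distributed.init_process_group(backend='hccl')

device = torch.device('hpu')                 # Gaudi device
model = MyModel().to(device)                 # MyModel is a placeholder for your model
model = torch.nn.parallel.DistributedDataParallel(model)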

Setting Environment Variables for Custom Launchers

For custom launchers, use initialize_distributed_hpu(world_size, rank, local_rank) to set all environment variables necessary for training:

import torch
import habana_frameworks.torch.distributed.hccl as hccl

# Populate WORLD_SIZE, RANK, LOCAL_RANK (and related variables) for this process
hccl.initialize_distributed_hpu(world_size=world_size, rank=rank, local_rank=local_rank)
torch.distributed.init_process_group(backend='hccl')
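For example, with a custom launch built on plain mpirun, the per-process values could be taken from the Open MPI environment variables (a sketch, assuming Open MPI; other MPI implementations and schedulers expose different variable names):

import os
import torch
import habana_frameworks.torch.distributed.hccl as hccl

# Open MPI exports these per process; adjust the names for other launchers.
world_size = int(os.environ['OMPI_COMM_WORLD_SIZE'])
rank = int(os.environ['OMPI_COMM_WORLD_RANK'])
local_rank = int(os.environ['OMPI_COMM_WORLD_LOCAL_RANK'])

# MASTER_ADDR / MASTER_PORT must also be set (or an init_method/store passed to
# init_process_group), as in any standard PyTorch env:// initialization.
hccl.initialize_distributed_hpu(world_size=world_size, rank=rank, local_rank=local_rank)
torch.distributed.init_process_group(backend='hccl')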

Alternatively, you can manually set the following environment variables before importing habana_frameworks.torch.distributed.hccl. Note that environment variable values must be strings:

import os

os.environ['WORLD_SIZE'] = str(world_size)
os.environ['RANK'] = str(rank)
os.environ['LOCAL_RANK'] = str(local_rank)
...
import torch
import habana_frameworks.torch.distributed.hccl
torch.distributed.init_process_group(backend='hccl')

To obtain the variable values set in the environment by the launcher, call initialize_distributed_hpu() without any parameters:

import habana_frameworks.torch.distributed.hccl as hccl

# Returns the WORLD_SIZE, RANK, and LOCAL_RANK values set by the launcher
world_size, rank, local_rank = hccl.initialize_distributed_hpu()
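A typical use of the returned values is per-rank setup, for example restricting progress printing or checkpoint writes to rank 0 (a sketch; the rank-0 policy is just an illustration, not part of the API):

import habana_frameworks.torch.distributed.hccl as hccl

world_size, rank, local_rank = hccl.initialize_distributed_hpu()

# Example per-rank policy: only rank 0 reports progress and saves checkpoints.
is_main_process = (rank == 0)
if is_main_process:
    print(f'Running on {world_size} Gaudi devices (local_rank={local_rank})')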