Scale-Out Topology

The following table summarizes the supported scale-out topologies, which are described further in DDP-based Scaling of Gaudi on PyTorch.

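| Scale-Out Topology         | Distributed Framework | Scaling Via   | Flags           | Notes                                    |
|----------------------------|-----------------------|---------------|-----------------|------------------------------------------|
| AWS DL1 / Host NIC Scaling | PyTorch DDP           | Libfabric     | HCCL_OVER_OFI=1 | Refer to Scale-Out via Host-NIC over OFI |
| AWS DL1 / Host NIC Scaling | PyTorch DDP           | Native TCP/IP | HCCL_OVER_TCP=1 | Refer to Scale-Out via Host-NIC over TCP |
| Gaudi based                | PyTorch DDP           | RDMA          | None            | Default mode = RDMA                      |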

Note

  • Setting HCCL_OVER_OFI=1 requires disabling TCP mode by setting HCCL_OVER_TCP=0.

  • Setting HCCL_OVER_TCP=1 requires disabling OFI mode by setting HCCL_OVER_OFI=0.
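
The flag combinations from the table and the note above can be sketched as a small helper that builds the environment for a training launch. This is a minimal illustration, not part of any Habana API: the function name `scale_out_env` and the mode labels `"ofi"`, `"tcp"`, and `"rdma"` are hypothetical, and the sketch assumes that leaving both flags unset selects the default RDMA mode, as the table suggests.

```python
import os

# Flag values come from the table and note above.
# The mode labels ("ofi", "tcp", "rdma") are hypothetical names for this sketch.
_SCALE_OUT_FLAGS = {
    "ofi": {"HCCL_OVER_OFI": "1", "HCCL_OVER_TCP": "0"},   # Host NIC via Libfabric
    "tcp": {"HCCL_OVER_TCP": "1", "HCCL_OVER_OFI": "0"},   # Host NIC via native TCP/IP
    "rdma": {},                                            # Gaudi based: no flags (assumed default)
}


def scale_out_env(mode, base_env=None):
    """Return a copy of base_env (default: os.environ) with HCCL flags set for mode."""
    if mode not in _SCALE_OUT_FLAGS:
        raise ValueError(f"unknown scale-out mode: {mode!r}")
    env = dict(os.environ if base_env is None else base_env)
    # Drop stale settings first so the two flags can never both be enabled.
    env.pop("HCCL_OVER_OFI", None)
    env.pop("HCCL_OVER_TCP", None)
    env.update(_SCALE_OUT_FLAGS[mode])
    return env
```

One way to use it is to pass the returned dictionary as the `env` argument of `subprocess.run` when starting the training script, so that the flags are in place before the process initializes its distributed backend.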