Scale-Out Topology

The following table summarizes the supported scale-out topologies, which are described further in DDP-based Scaling of Gaudi on PyTorch.

| Scale-Out Topology | Distributed Framework | Scaling Via | Flags | Notes |
|---|---|---|---|---|
| Verbs Host NIC Scaling | PyTorch DDP | Libfabric | RDMAV_FORK_SAFE=1 MLX5_SCATTER_TO_CQE=0 | See Scale-out Using Host NICs for more details. |
| AWS EFA Host NIC Scaling | PyTorch DDP | Libfabric | RDMAV_FORK_SAFE=1 FI_EFA_USE_DEVICE_RDMA=1 | |
| Gaudi based | PyTorch DDP | RDMA | None | Default mode = RDMA |
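
The flags in the table above are environment variables, so one way to apply them is to set them in the process environment before the distributed process group is initialized. The following is a minimal sketch of that idea using only the Python standard library. The topology labels (`verbs`, `aws_efa`, `gaudi`) and the helper `apply_scale_out_flags` are hypothetical names chosen for illustration and are not part of any Gaudi or PyTorch API. Only the flag names and values come from the table.

```python
import os

# Flags copied from the table above. The dictionary keys are hypothetical
# labels invented for this sketch, not identifiers defined by any library.
SCALE_OUT_FLAGS = {
    "verbs": {"RDMAV_FORK_SAFE": "1", "MLX5_SCATTER_TO_CQE": "0"},
    "aws_efa": {"RDMAV_FORK_SAFE": "1", "FI_EFA_USE_DEVICE_RDMA": "1"},
    "gaudi": {},  # Gaudi-based scale-out lists no flags; RDMA is the default mode.
}


def apply_scale_out_flags(topology, env=None):
    """Set the flags for `topology` in `env` (os.environ by default).

    Returns a copy of the flags that were applied.
    """
    if topology not in SCALE_OUT_FLAGS:
        raise ValueError(f"unknown topology: {topology!r}")
    target = os.environ if env is None else env
    flags = SCALE_OUT_FLAGS[topology]
    target.update(flags)
    return dict(flags)


if __name__ == "__main__":
    # Apply the Verbs host NIC flags to this process, then report them.
    for name, value in apply_scale_out_flags("verbs").items():
        print(f"{name}={value}")
```

In a training script, a call such as `apply_scale_out_flags("aws_efa")` would need to run before the PyTorch DDP process group is created so that the flags are already in the environment when the communication library starts. Exporting the same variables in the launch shell is an equivalent alternative.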