Scale-out Topology
The following table summarizes the scale-out topologies that are described in more detail in DDP-based Scaling of Gaudi on PyTorch.
| Scale-out Topology | Distributed Framework | Scaling Via | Flags | Notes |
|---|---|---|---|---|
| Verbs Host NIC Scaling | PyTorch DDP | Libfabric | RDMAV_FORK_SAFE=1 MLX5_SCATTER_TO_CQE=0 | See Scale-out Using Host NICs for more details. |
| AWS EFA Host NIC Scaling | PyTorch DDP | Libfabric | RDMAV_FORK_SAFE=1 FI_EFA_USE_DEVICE_RDMA=1 | |
| Gaudi based | PyTorch DDP | RDMA | None | Default mode = RDMA |
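
The flags in the table are environment variables, so they need to be present in the environment of the training processes. As a minimal sketch, the helper below builds such an environment for a chosen topology. The dictionary keys (`verbs_host_nic`, `aws_efa_host_nic`, `gaudi`) and the function name `build_env` are hypothetical labels introduced for illustration only; the variable names and values are copied from the table.

```python
import os

# Flags copied from the table above. The dictionary keys are hypothetical
# labels chosen for this example, not names defined by any library.
SCALE_OUT_FLAGS = {
    "verbs_host_nic": {"RDMAV_FORK_SAFE": "1", "MLX5_SCATTER_TO_CQE": "0"},
    "aws_efa_host_nic": {"RDMAV_FORK_SAFE": "1", "FI_EFA_USE_DEVICE_RDMA": "1"},
    "gaudi": {},  # Gaudi-based scale-out lists no flags (default mode is RDMA)
}


def build_env(topology, base_env=None):
    """Return a copy of base_env (default: os.environ) with the flags applied."""
    if topology not in SCALE_OUT_FLAGS:
        raise ValueError(f"unknown scale-out topology: {topology!r}")
    env = dict(os.environ if base_env is None else base_env)
    env.update(SCALE_OUT_FLAGS[topology])
    return env


if __name__ == "__main__":
    # Print only the variables this topology adds to the environment.
    for name, value in SCALE_OUT_FLAGS["aws_efa_host_nic"].items():
        print(f"{name}={value}")
```

The returned dictionary can be passed as the `env` argument of `subprocess.run` when starting the training launcher, or the same variables can simply be exported in the shell before launching.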