Scale-out Topology

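Scale-out Topology       | Distributed Framework           | Scaling                 | Flags                                      | Notes
-------------------------|---------------------------------|-------------------------|--------------------------------------------|------------------------------------------------
Verbs Host NIC Scaling   | Horovod, TensorFlow Distributed | Libfabric based scaling | RDMAV_FORK_SAFE=1 MLX5_SCATTER_TO_CQE=0    | See Scale-out Using Host NICs for more details.
AWS EFA Host NIC Scaling | Horovod, TensorFlow Distributed | Libfabric based scaling | RDMAV_FORK_SAFE=1 FI_EFA_USE_DEVICE_RDMA=1 |
Gaudi based              | Horovod, TensorFlow Distributed | RDMA based scaling      | None                                       | Default mode = RDMA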
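
The flags listed above are environment variables, so one way to apply them is to build the environment for the training process before launching it. The following is a minimal sketch, not an official tool: the flag names and values come from the table, while the dictionary keys (`verbs_host_nic`, `aws_efa_host_nic`, `gaudi`) and the helper `scale_out_env` are hypothetical names introduced for illustration.

```python
import os

# Flag names and values are taken from the table in this section.
# The dictionary keys are labels made up for this sketch.
SCALE_OUT_FLAGS = {
    "verbs_host_nic": {"RDMAV_FORK_SAFE": "1", "MLX5_SCATTER_TO_CQE": "0"},
    "aws_efa_host_nic": {"RDMAV_FORK_SAFE": "1", "FI_EFA_USE_DEVICE_RDMA": "1"},
    "gaudi": {},  # no flags listed; RDMA is the default mode
}


def scale_out_env(topology, base_env=None):
    """Return a copy of base_env (default: os.environ) with the topology's flags set."""
    if topology not in SCALE_OUT_FLAGS:
        raise ValueError(f"unknown topology: {topology!r}")
    env = dict(os.environ if base_env is None else base_env)
    env.update(SCALE_OUT_FLAGS[topology])
    return env


if __name__ == "__main__":
    # Print only the flags themselves by starting from an empty environment.
    env = scale_out_env("aws_efa_host_nic", base_env={})
    for name, value in sorted(env.items()):
        print(f"{name}={value}")
```

The returned dictionary can be passed as the `env` argument of `subprocess.run` when starting the training command, which keeps the flags scoped to that process instead of changing the current shell.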

Note

  • Host NIC scaling performance is determined by the network connectivity between the hosts.

  • Whether to use Intel Gaudi Horovod or TensorFlow Distributed for scaling is the user’s decision, based on preference and previous usage. Users who are starting to scale from a single card to multiple cards or multiple servers should start with Intel Gaudi Horovod. Intel Gaudi Horovod has broader coverage, and similar scaling efficiency is observed when scaling up to 8 Gaudi cards.