Scale-out Topology

| Scale Out Topology         | Distributed Framework  | Scaling                 | Flags                                      | Notes                           |
|----------------------------|------------------------|-------------------------|--------------------------------------------|---------------------------------|
| AWS DL1 / Host NIC Scaling | Horovod                | MPI based scaling       | HOROVOD_HIERARCHICAL_ALLREDUCE=1           | MPI based Host NIC Scale Out    |
|                            | Horovod                | Libfabric based scaling | RDMAV_FORK_SAFE=1 FI_EFA_USE_DEVICE_RDMA=1 | Refer to Scale-Out via Host-NIC |
|                            | TensorFlow Distributed | Libfabric based scaling | RDMAV_FORK_SAFE=1 FI_EFA_USE_DEVICE_RDMA=1 | Refer to Scale-Out via Host-NIC |
| Gaudi based                | Horovod                | RDMA based scaling      | None                                       | Default mode = RDMA             |
|                            | TensorFlow Distributed |                         |                                            |                                 |
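As a rough illustration of how the flags in the topology table above might be applied, the following Python sketch maps each row to its environment variables and builds the environment for a training launch. The dictionary keys (`"host_nic"`, `"tf_distributed"`, and so on) and both helper functions are hypothetical names invented for this example; only the environment variable names and values come from the table. The table lists no separate values for TensorFlow Distributed on Gaudi based scale-out, so the sketch assumes that row shares the Horovod row's values (RDMA, no flags). The `-x NAME=value` form assumes an Open MPI `mpirun` launcher.

```python
import os

# Flags for each (topology, framework, scaling) row of the table.
# The tuple labels are hypothetical identifiers chosen for this sketch.
SCALE_OUT_FLAGS = {
    ("host_nic", "horovod", "mpi"): {"HOROVOD_HIERARCHICAL_ALLREDUCE": "1"},
    ("host_nic", "horovod", "libfabric"): {
        "RDMAV_FORK_SAFE": "1",
        "FI_EFA_USE_DEVICE_RDMA": "1",
    },
    ("host_nic", "tf_distributed", "libfabric"): {
        "RDMAV_FORK_SAFE": "1",
        "FI_EFA_USE_DEVICE_RDMA": "1",
    },
    # Gaudi based scale-out defaults to RDMA and needs no flags.
    ("gaudi", "horovod", "rdma"): {},
    # Assumption: same as the Horovod row, since the table gives no
    # separate values for this combination.
    ("gaudi", "tf_distributed", "rdma"): {},
}


def scale_out_env(topology, framework, scaling, base_env=None):
    """Return a copy of base_env (default: os.environ) with the row's flags set."""
    key = (topology, framework, scaling)
    if key not in SCALE_OUT_FLAGS:
        raise ValueError(f"unsupported combination: {key}")
    env = dict(os.environ if base_env is None else base_env)
    env.update(SCALE_OUT_FLAGS[key])
    return env


def mpirun_env_args(flags):
    """Render flags as Open MPI `-x NAME=value` arguments, sorted by name."""
    args = []
    for name, value in sorted(flags.items()):
        args += ["-x", f"{name}={value}"]
    return args
```

The dictionary returned by `scale_out_env` can be passed as the `env` argument of `subprocess.run` when starting the training script, while `mpirun_env_args` is meant for launches where the variables have to be exported to remote hosts.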

Note

  • Host NIC scaling performance is determined by the network connectivity between the hosts.

  • Whether to scale with Habana Horovod or TensorFlow Distributed is the user’s choice, based on preference and previous usage. Users starting to scale from a single card to multi-card or multi-server should begin with Habana Horovod: it has broader coverage, and similar scaling efficiency is observed when scaling up to 8 Gaudi cards.