Scale-out Topology
| Scale-out Topology | Distributed Framework | Scaling | Flags | Notes |
|---|---|---|---|---|
| AWS DL1 / Host NIC Scaling | Horovod | MPI based scaling | HOROVOD_HIERARCHICAL_ALLREDUCE=1 | MPI based Host NIC Scale Out |
| AWS DL1 / Host NIC Scaling | Horovod | Libfabric based scaling | RDMAV_FORK_SAFE=1 FI_EFA_USE_DEVICE_RDMA=1 | Refer to Scale-Out via Host-NIC |
| AWS DL1 / Host NIC Scaling | TensorFlow Distributed | Libfabric based scaling | RDMAV_FORK_SAFE=1 FI_EFA_USE_DEVICE_RDMA=1 | Refer to Scale-Out via Host-NIC |
| Gaudi based | Horovod | RDMA based scaling | None | Default mode = RDMA |
| Gaudi based | TensorFlow Distributed | RDMA based scaling | None | Default mode = RDMA |
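As an illustration of how the flags in the table might be applied, the following minimal sketch maps each table row to its environment variables and merges them into an environment for a training process. The names `SCALE_OUT_FLAGS` and `scale_out_env`, and the short keys such as `"host_nic"`, are hypothetical helpers introduced here and are not part of any Habana, Horovod, or TensorFlow API. The sketch also assumes that the Gaudi based TensorFlow Distributed row shares the settings of the Gaudi based Horovod row.

```python
import os

# Flags transcribed from the table, keyed by
# (scale-out topology, distributed framework, scaling method).
# The key names are illustrative only.
SCALE_OUT_FLAGS = {
    ("host_nic", "horovod", "mpi"): {
        "HOROVOD_HIERARCHICAL_ALLREDUCE": "1",
    },
    ("host_nic", "horovod", "libfabric"): {
        "RDMAV_FORK_SAFE": "1",
        "FI_EFA_USE_DEVICE_RDMA": "1",
    },
    ("host_nic", "tensorflow_distributed", "libfabric"): {
        "RDMAV_FORK_SAFE": "1",
        "FI_EFA_USE_DEVICE_RDMA": "1",
    },
    # Gaudi based scale-out defaults to RDMA and needs no flags.
    ("gaudi", "horovod", "rdma"): {},
    # Assumption: same settings as the Gaudi based Horovod row.
    ("gaudi", "tensorflow_distributed", "rdma"): {},
}


def scale_out_env(topology, framework, scaling, base_env=None):
    """Return a copy of base_env (default: os.environ) with the row's flags set."""
    key = (topology, framework, scaling)
    if key not in SCALE_OUT_FLAGS:
        raise ValueError(f"unsupported combination: {key}")
    env = dict(os.environ if base_env is None else base_env)
    env.update(SCALE_OUT_FLAGS[key])
    return env
```

The returned dictionary could then be passed as the `env` argument of `subprocess.run` when launching the training command, so the flags reach the workers without modifying the parent shell.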
Note
Host NIC scaling performance is determined by the network connectivity between the hosts.
Whether to scale with Habana Horovod or TensorFlow Distributed is the user's decision, based on preference and prior experience. Users moving from a single card to multiple cards or multiple servers should start with Habana Horovod. Habana Horovod has broader coverage, and similar scaling efficiency has been observed when scaling up to 8 Gaudi cards.