Scale-out Topology

| Scale-Out Topology | Distributed Framework | Scaling | Flags | Notes |
|---|---|---|---|---|
| AWS DL1 / Host NIC Scaling | Horovod | TCP/IP based scaling | HOROVOD_HIERARCHICAL_ALLREDUCE=1 | MPI based Host NIC Scale-Out |
|  | Horovod | Libfabric based scaling | HCCL_OVER_OFI=1 | Refer to Scale-Out via Host-NIC over OFI |
|  | Horovod | TCP/IP based scaling | HCCL_OVER_TCP=1 | Refer to Scale-Out via Host-NIC over TCP |
|  | TensorFlow Distributed | Libfabric based scaling | HCCL_OVER_OFI=1 | Refer to Scale-Out via Host-NIC over OFI |
|  | TensorFlow Distributed | TCP/IP based scaling | HCCL_OVER_TCP=1 | Refer to Scale-Out via Host-NIC over TCP |
| Gaudi based | Horovod | RDMA based scaling | None | Default mode = RDMA |
|  | TensorFlow Distributed | RDMA based scaling | None | Default mode = RDMA |
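As an illustration, the following minimal sketch shows one way the host-NIC flags from the table might be selected before Horovod and HCCL initialize. In practice these variables are usually exported in the launch environment (for example, by the MPI launcher); the in-script assignment, import order, and print statement below are illustrative assumptions, not a recipe taken from this document.

```python
import os

# Select exactly one host-NIC transport from the table above.
# Combining either flag with HOROVOD_HIERARCHICAL_ALLREDUCE=1 is undefined (see Note below).
os.environ["HCCL_OVER_TCP"] = "1"     # TCP/IP based host-NIC scaling
# os.environ["HCCL_OVER_OFI"] = "1"   # alternatively: Libfabric (OFI) based scaling

import horovod.tensorflow as hvd

hvd.init()
print(f"Worker {hvd.rank()} of {hvd.size()} initialized")
```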

Note

  • Host NIC scaling performance is determined by the network connectivity between the hosts.

  • The behavior of setting either HCCL_OVER_TCP=1 or HCCL_OVER_OFI=1 with HOROVOD_HIERARCHICAL_ALLREDUCE=1 is undefined.

  • The choice between Habana Horovod and TensorFlow Distributed is up to the user, based on preference and prior experience. Users moving from a single card to multi-card or multi-server training should start with Habana Horovod, which has broader coverage and shows comparable scaling efficiency when scaling up to 8 Gaudi cards; a minimal sketch of this pattern is shown below.
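The sketch below shows the usual Horovod Keras pattern referenced in the last note: initialize Horovod, scale the learning rate by the number of workers, wrap the optimizer so gradients are averaged across workers, and broadcast the initial weights from rank 0. The model, dataset, and hyperparameters are placeholders, and Habana-specific device setup is omitted; this illustrates the general pattern rather than a verbatim example from this document.

```python
import tensorflow as tf
import horovod.tensorflow.keras as hvd

hvd.init()  # one process per worker / Gaudi card

# Placeholder model and data; real workloads substitute their own.
model = tf.keras.Sequential([tf.keras.layers.Dense(10, input_shape=(20,))])
dataset = tf.data.Dataset.from_tensor_slices(
    (tf.random.normal([256, 20]), tf.random.normal([256, 10]))).batch(32)

# Scale the learning rate by the number of workers and wrap the optimizer
# so gradients are averaged across workers via allreduce.
opt = hvd.DistributedOptimizer(tf.keras.optimizers.SGD(0.01 * hvd.size()))
model.compile(loss="mse", optimizer=opt)

callbacks = [hvd.callbacks.BroadcastGlobalVariablesCallback(0)]  # sync weights from rank 0
model.fit(dataset, epochs=1, callbacks=callbacks,
          verbose=1 if hvd.rank() == 0 else 0)
```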