2. Quick Start for Distributed Training across Multiple AWS DL1 Instances

2.1. Overview

This document describes how to run distributed training workloads on multiple DL1 instances for AWS users. It covers the distributed training support available across recent SynapseAI releases as well as how to configure workloads and instances for distributed training.

2.2. Distributed Training

Habana supports distributed training across multiple servers with both the TensorFlow and PyTorch frameworks. Cross-server communication is achieved through a 400 Gb/s network connection between server hosts. Distributed training using Gaudi NICs across multiple servers is not offered on AWS. Instead, host NICs are used for scaling across multiple instances.

For TensorFlow, SynapseAI supports both Horovod and HPUStrategy integrated with the tf.distribute API. For PyTorch, only DistributedDataParallel (DDP) is supported. Detailed information on distributed training across multiple servers (scale-out) and within a single server (scale-up) can be found in Distributed Training with TensorFlow and the PyTorch User Guide.

2.3. Distributed Training Support Matrix

SynapseAI support for distributed training across multiple instances is continuously evolving. Each release adds new capabilities, as the tables below outline. AWS users can choose the implementation that best suits their needs.

The support matrix table below outlines the different capabilities available in the recent SynapseAI releases for TensorFlow.

SynapseAI Version                                    0.15.4   1.0.1   1.1.0   1.1.1

Horovod Hierarchical AllReduce using HCCL and MPI    Yes      Yes     Yes     Yes
Horovod using HCCL over TCP                          No       Yes     Yes     Yes
tf.distribute using HCCL over TCP                    No       No      Yes     Yes

The support matrix table below outlines the different capabilities available in the recent SynapseAI releases for PyTorch.

SynapseAI Version          0.15.4   1.0.1   1.1.0   1.1.1

DDP using HCCL over TCP    No       No      Yes     Yes

2.4. Environment Variables for Different Implementations

AWS users can easily switch between different implementations by setting the environment variables listed in the tables below. The environment variables must be set on all server instances at runtime; an illustrative launch command follows the tables.

TensorFlow Distributed Framework   Scale-Out            Flags                              Notes

Horovod                            Host NIC using MPI   HOROVOD_HIERARCHICAL_ALLREDUCE=1
Horovod                            Host NIC using TCP   HCCL_OVER_TCP=1                    Refer to Scale-Out via Host-NIC over TCP
TensorFlow Distributed             Host NIC using TCP   HCCL_OVER_TCP=1                    Refer to Scale-Out via Host-NIC over TCP

PyTorch Distributed Framework   Scale-Out            Flags             Notes

torch.distributed               Host NIC using TCP   HCCL_OVER_TCP=1   Refer to Configuration Knobs
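
For example, when launching a Horovod or tf.distribute workload with mpirun, standard Open MPI options can forward these variables to every rank on every instance. The sketch below is illustrative only: the training script name, IP addresses, and process counts are placeholders, and it assumes the docker SSH port 3022 described in the next section.

    # Hypothetical two-instance launch (8 Gaudi devices per DL1 instance, 16 ranks total).
    # -x exports the environment variable from the tables above to all remote ranks;
    # the script name and host IPs are placeholders for your own setup.
    mpirun --allow-run-as-root \
           -np 16 \
           --host 10.232.192.127:8,10.232.192.227:8 \
           --mca plm_rsh_args "-p 3022" \
           -x HCCL_OVER_TCP=1 \
           python3 train_model.py

HOROVOD_HIERARCHICAL_ALLREDUCE=1 would be exported in the same way when using the MPI-based Horovod implementation instead.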

2.5. Running Distributed Training over Multiple DL1 Instances

To execute training across multiple servers, launch multiple DL1 instances. The screenshot below illustrates how to select four instances at the same time.

[Screenshot: Configure_Instance_Details.png]
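
If you prefer to launch the instances from the AWS CLI rather than the console, a command along the lines of the sketch below can start several DL1 instances at once. The AMI, key pair, subnet, and security group IDs are placeholders; only the dl1.24xlarge instance type is DL1-specific.

    # Hypothetical CLI launch of four DL1 instances; replace the placeholder IDs
    # (ami-..., my-key, subnet-..., sg-...) with your own values.
    aws ec2 run-instances \
        --image-id ami-0123456789abcdef0 \
        --instance-type dl1.24xlarge \
        --count 4 \
        --key-name my-key \
        --subnet-id subnet-0123456789abcdef0 \
        --security-group-ids sg-0123456789abcdef0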

The instances need to communicate with each other via SSH during training. To allow the instances to communicate, follow the steps below:

  1. Make sure all server instances are in the same security group and that all traffic is allowed among them. The screenshot below shows an example of the security group settings for two instances. The two IP addresses, 10.232.192.127 and 10.232.192.227, are the private IP addresses of the two instances. You will only see these IP addresses once you launch the instances; therefore, after launch, you need to go back and add the two rules for these IP addresses (see the AWS CLI sketch after this procedure for one way to do this).

[Screenshot: Security_Group_Settings.png]
  2. Set up each instance to allow all other instances to SSH into it:

    1. Within the docker container, run the commands below to generate an SSH key:

    mkdir -p ~/.ssh
    
    cd ~/.ssh
    
    ssh-keygen -t rsa -b 4096
    
    2. Copy the generated id_rsa.pub content to authorized_keys using the command below:

    cat id_rsa.pub > authorized_keys
    
    3. Copy the id_rsa.pub content from each instance into authorized_keys on all other instances, i.e. 10.232.192.227 in this example (see the sketch after this procedure for one way to do this).

    4. Add all servers, including the instance itself, to known_hosts using the commands below:

    ssh-keyscan -p 3022 -H 10.232.192.127 > ~/.ssh/known_hosts
    
    ssh-keyscan -p 3022 -H 10.232.192.227 >> ~/.ssh/known_hosts
    
    5. By default, the Habana TensorFlow docker uses port 3022 for SSH and the Habana PyTorch docker uses port 22 for SSH; these are the default ports configured in the respective training scripts. Sometimes mpirun can fail to establish the remote connection when there is more than one Habana docker session running. If this happens, you need to set up a different SSH port. This can be done within the docker container by editing the file below to add a different port number, e.g. 4022.

    vi /etc/ssh/sshd_config
    
    6. Then restart the SSH service within the docker container:

    service ssh stop
    
    service ssh start
    
  3. The above steps need to be done on all instances. After that, you can verify the connectivity between instances by SSHing from one instance to any other instance using the following command:

ssh -p 3022 10.232.192.227
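
As referenced in step 1 above, the allow-all rules between the instances' private IP addresses can also be added from the AWS CLI instead of the security group console. This is a minimal sketch assuming a placeholder security group ID and the two example addresses used throughout this section.

    # Hypothetical commands to allow all traffic from the two example private IPs;
    # replace sg-0123456789abcdef0 with the security group shared by your instances.
    aws ec2 authorize-security-group-ingress \
        --group-id sg-0123456789abcdef0 \
        --ip-permissions 'IpProtocol=-1,IpRanges=[{CidrIp=10.232.192.127/32}]'

    aws ec2 authorize-security-group-ingress \
        --group-id sg-0123456789abcdef0 \
        --ip-permissions 'IpProtocol=-1,IpRanges=[{CidrIp=10.232.192.227/32}]'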
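
As referenced in step 3 of the SSH setup, the public key can be appended to the other instance's authorized_keys either by pasting it manually or, if some form of SSH access to the peer already works (for example, through the EC2 key pair), with a one-liner such as the sketch below, shown here from 10.232.192.127 to 10.232.192.227.

    # Hypothetical key distribution; assumes the peer already accepts an SSH
    # connection on port 3022 by some other means.
    cat ~/.ssh/id_rsa.pub | ssh -p 3022 10.232.192.227 'cat >> ~/.ssh/authorized_keys'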

2.6. Examples

You can find all steps to run distributed training across multiple servers using host NICs in the following examples: