Distributed Training across Multiple AWS DL1 Instances User Guide

This document describes how to run distributed training workloads on multiple DL1 Instances for AWS users. It includes distributed training support across various SynapseAI releases as well as how to configure workloads and instances for distributed training.

Distributed Training

Habana supports distributed training across multiple servers over TensorFlow and PyTorch frameworks. Cross-server communication is achieved through 400Gbs network connection between server hosts. Distributed training using Gaudi NICs across multiple servers is not offered in AWS. Instead, Host NICs is used for scaling across multiple instances. By default, DL1 instances get one Elastic Network Adapter (ENA) network interface. To receive better performance with libFabric, it is recommended to allocate DL1 instances with Elastic Fabric Adapter (EFA) network interfaces. DL1 instances support up to four network interfaces. For additional details on EFA usage and limitations, refer to EFA User Guide and EFA Limitations.

For TensorFlow, SynapseAI supports both Horovod and HPUStrategy integrated with the tf.distribute API. For PyTorch, only Distributed Data Parallel (DDP) is supported. Detailed information on distributed training across multiple servers (scale-out) and within single server (scale-up) can be found in Distributed Training with TensorFlow and Distributed Training with PyTorch.

Note

Having more than one EFA network interface prevents allocating a public IP address. To connect to the instance, one option is to assign an Elastic IP address.

Environment Variables for Different Implementations

AWS users can easily switch between different implementations by setting environment variables. The environment variables need to be set in all server instances during runtime. For further details refer to TensorFlow Scale Out Topology and PyTorch Scale Out Topology.

Running Distributed Training over Multiple DL1 Instances

An EFA requires a security group that allows all inbound and outbound traffic to and from the security group itself, similar to the below example. Follow the instructions in Prepare an EFA-enabled security group for more details.

../../_images/EFA_enabled_security_group.png

To launch multiple EFA-enabled DL1 instances follow the instructions in Launch EFA-enabled instance.

Note

To run workloads on EFA-enabled AWS instances with Linux Kernel lower than 5.13, the environment variable FI_EFA_FORK_SAFE must be set to 1. It is required by Libfabric EFA provider to work safely with software utilizing fork(). However, on Linux Kernel higher than 5.13 and rdma-core v35.0+, applications are always fork-safe where setting this variable is not required.

Each instance needs to communicate with each other via SSH during training. To allow for instances to communicate, follow the below steps to enable password-less SSH:

  1. Disable strictHostKeyChecking and enable ForwardAgent. Open ~/.ssh/config using your preferred text editor and add the following:

    Note

    This step is not required when running in a container.

    Host *
       ForwardAgent yes
       StrictHostKeyChecking no
    
  2. Generate an RSA key pair:

    ssh-keygen -t rsa -b 4096
    
  3. Change the permission of the private key:

    chmod 600 ~/.ssh/id_rsa
    chmod 600 ~/.ssh/config
    
  4. Copy the generated id_rsa.pub content to authorized_keys for each node in the cluster:

    cat id_rsa.pub > authorized_keys
    
  5. (Optional) Habana TensorFlow and Habana PyTorch dockers use port 3022 for SSH. This is the default port number configured in the training scripts respectively. mpirun might fail to establish the remote connection when there is more than one Habana docker session running. In this case, you need to set up a different SSH port. This can be done within the docker by editing the file below and adding a different port number:

    sed -i 's/#Port 22/Port <my port number>/g' /etc/ssh/sshd_config
    

    The below is an example using port 4022:

    sed -i 's/#Port 22/Port 4022/g' /etc/ssh/sshd_config
    
  6. Restart sshd service within docker:

    service ssh stop
    service ssh start
    
  7. The above steps need to be done for all instances. After that, you can verify the connectivity among instances by SSH from one instance to any other instance using the following command without being prompted for a key or password:

    ssh -p 3022 10.232.192.227
    

Build and Store Custom Docker Image for Training

To train through EFA, hccl_ofi_wrapper should be installed. This package interacts with libFabric and utilizes the underlying hardware and networking mode. For further information, refer to Scale out Host NIC OFI

The following is an example Dockerfile that builds the hccl_ofi_wrapper:

 FROM vault.habana.ai/gaudi-docker/1.7.1/ubuntu20.04/habanalabs/tensorflow-installer-tf-cpu-2.10.1

 # Installs hccl_ofi_wrapper to interact with libfabric to utilize HW and networking mode (EFA)
 ARG OFI_WRAPPER_WS="/root/hccl_ofi_wrapper"
 RUN git clone "https://github.com/HabanaAI/hccl_ofi_wrapper.git" "${OFI_WRAPPER_WS}" && \
    cd "${OFI_WRAPPER_WS}" && \
    ln -s /opt/amazon/efa/lib64 /opt/amazon/efa/lib && \
    LIBFABRIC_ROOT=/opt/amazon/efa make

 ENV LD_LIBRARY_PATH=$LD_LIBRARY_PATH:"${OFI_WRAPPER_WS}"

Build and push image to AWS’s Elastic Container Registry (ECR) for ease of access on EC2 instances. For further information on how to build and push an image to ECR, refer to Create Elastic Container Registry (ECR) and Upload Images or to the Amazon ECR Getting Started Guide.

Examples

You can find all steps to run distributed training across multiple servers using host NICs in the following examples: