Creating an Multi-Node Parallel (MNP) Compatible Docker Image

A key requirement to run AWS Batch is a Dockerized image containing the application, code, and scripts. Batch provides the benefit of Docker-to-Docker communication across nodes through ECS task networking. Multi-node parallel jobs set up environment variables in the ECS Containers that establish meta data for determining the number of nodes and master child relationships.

To run MNP in the ECS containers, create a Dockerfile with the following key elements:

  1. Prepare a hostfile, run_batch.sh, with the IP addresses of the cluster.

  2. Prepare MNIST training scripts, run_mnist.sh, to launch a simple distributed MNIST training example through mpirun.

  3. Set up passwordless SSH to bypass ssh authentication for containers using this image.

  4. Install hccl_ofi_wrapper package to train through EFA. This package interacts with libFabric and utilizes the underlying hardware and networking mode. For further information, refer to Scale out Host NIC OFI.

  5. Set up Supervisord, supervisord.conf, to run SSH on startup and launch run_batch.sh as the CMD.

Preparing a Hostfile

The following code block gathers the IP addresses of the cluster into a hostfile on the master node and starts training. This automates collecting IP address to run distributed training and can be added as part of the CMD structure of a Docker Image.

The script launches a simple distributed MNIST training example. You can run other distributed training models by updating the wait_for_nodes() function as highlighted in the below code example.

Save the following code as run_batch.sh.

  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
 #!/bin/bash

 BASENAME="${0##*/}"
 log () {
     echo "${BASENAME} - ${1}"
 }

 HOST_FILE_PATH="/tmp/hostfile"
 AWS_BATCH_EXIT_CODE_FILE="/tmp/batch-exit-code"

 usage () {
     if [ "${#@}" -ne 0 ]; then
         log "* ${*}"
         log
     fi
     cat <<ENDUSAGE
     Usage:
     export AWS_BATCH_JOB_NODE_INDEX=0
     export AWS_BATCH_JOB_NUM_NODES=10
     export AWS_BATCH_JOB_MAIN_NODE_INDEX=0
     export AWS_BATCH_JOB_ID=string
     ./run_batch.sh
     ENDUSAGE

     error_exit
 }

 # Standard function to print an error and exit with a failing return code
 error_exit () {
     log "${BASENAME} - ${1}" >&2
     log "${2:-1}" > $AWS_BATCH_EXIT_CODE_FILE
 }

 # Set child by default switch to main if on main node container
 NODE_TYPE="child"
 if [ "${AWS_BATCH_JOB_MAIN_NODE_INDEX}" == "${AWS_BATCH_JOB_NODE_INDEX}" ]; then
     log "Running synchronize as the main node"
     NODE_TYPE="main"
 fi

 # wait for all nodes to report
 wait_for_nodes () {
     log "Running as master node"

     touch $HOST_FILE_PATH
     ip=$(/sbin/ip -o -4 addr list eth0 | awk '{print $4}' | cut -d/ -f1)

     if [ -x "$(command -v hl-smi)" ] ; then
         NUM_HPUS=$(ls -l /dev/hl[0-9] | wc -l)
         availablecores=$NUM_HPUS
     else
         availablecores=$(nproc)
     fi

     log "master details -> $ip:$availablecores"
     echo "$ip" >> $HOST_FILE_PATH

     lines=$(sort $HOST_FILE_PATH|uniq|wc -l)
     while [ "$AWS_BATCH_JOB_NUM_NODES" -gt "$lines" ]
     do
         log "$lines out of $AWS_BATCH_JOB_NUM_NODES nodes joined, check again in 1 second"
         sleep 1
         lines=$(sort $HOST_FILE_PATH|uniq|wc -l)
     done
     log "All nodes successfully joined"

     # remove duplicates if there are any.
     awk '!a[$0]++' $HOST_FILE_PATH > ${HOST_FILE_PATH}-deduped
     cat $HOST_FILE_PATH-deduped
     log "executing main training workflow"

     # Run MNIST Training
     MASTER_IP=$ip HOSTSFILE=$HOST_FILE_PATH-deduped NUM_NODES=$lines /run_mnist.sh

     log "done! goodbye, writing exit code to $AWS_BATCH_EXIT_CODE_FILE and shutting down my supervisord"
     echo "0" > $AWS_BATCH_EXIT_CODE_FILE
     kill  $(cat /tmp/supervisord.pid)
     exit 0
 }

 # Sends IPs to Master Hostfile
 report_to_master () {
     # get own ip
     ip=$(/sbin/ip -o -4 addr list eth0 | awk '{print $4}' | cut -d/ -f1)

     if [ -x "$(command -v hl-smi)" ] ; then
         NUM_HPUS=$(ls -l /dev/hl[0-9] | wc -l)
         availablecores=$NUM_HPUS

     else
         availablecores=$(nproc)
     fi

     log "I am a child node -> $ip:$availablecores, reporting to the master node -> ${AWS_BATCH_JOB_MAIN_NODE_PRIVATE_IPV4_ADDRESS}"
     until echo "$ip" | ssh ${AWS_BATCH_JOB_MAIN_NODE_PRIVATE_IPV4_ADDRESS} "cat >> $HOST_FILE_PATH"
     do
         echo "Sleeping 5 seconds and trying again"
         sleep 5
     done
     log "done! goodbye"
     exit 0
 }

 # Main - dispatch user request to appropriate function
 log $NODE_TYPE
 case $NODE_TYPE in
 main)
     wait_for_nodes "${@}"
     ;;

 child)
     report_to_master "${@}"
     ;;

 *)
     log $NODE_TYPE
     usage "Could not determine node type. Expected (main/child)"
     ;;
 esac

Preparing MNIST Training Scripts

Save the following code as run_mnist.sh. This will launch a simple distributed MNIST training example through mpirun.

#!/bin/bash

###############################################################################################
# Example: Training MNIST on multinode with 8 card each
###############################################################################################

HOSTSFILE=$HOSTSFILE

NUM_NODES=$NUM_NODES
NGPU_PER_NODE=8
let N_CARDS=$NUM_NODES*$NGPU_PER_NODE

echo "Starting MNIST Training"

mpirun --allow-run-as-root \
    --hostfile $HOSTSFILE \
    -np ${N_CARDS} \
    -x LD_LIBRARY_PATH="/root/hccl_ofi_wrapper:/opt/amazon/openmpi/lib:/opt/amazon/efa/lib:${LD_LIBRARY_PATH}" \
    -x PYTHONPATH="/usr/lib/habanalabs:/Model-References" \
    -x MPI_ROOT="/opt/amazon/openmpi" \
    -x HCCL_SOCKET_IFNAME="eth0" \
    -x HOROVOD_HIERARCHICAL_ALLREDUCE=1 \
    --mca btl_tcp_if_include "eth0" \
    --prefix /opt/amazon/openmpi \
    --mca plm_rsh_args "-p3022" \
    --map-by ppr:4:socket:PE=6 \
    python3 /Model-References/TensorFlow/examples/hello_world/example_hvd.py

Setting up Passwordless SSH

To achieve passwordless ssh, generate a SSH key and append it to the ~/.ssh/authorized_keys file inside a Docker Image. This bypasses ssh authentication for containers using this image.

#################################
# SSH Setup
#################################
ENV SSHDIR $HOME/.ssh
RUN mkdir -p ${SSHDIR} \
&& touch ${SSHDIR}/sshd_config \
&& ssh-keygen -t rsa -f ${SSHDIR}/ssh_host_rsa_key -N '' \
&& cp ${SSHDIR}/ssh_host_rsa_key.pub ${SSHDIR}/authorized_keys \
&& cp ${SSHDIR}/ssh_host_rsa_key ${SSHDIR}/id_rsa \
&& echo "       IdentityFile ${SSHDIR}/id_rsa" >> ${SSHDIR}/config \
&& echo "       StrictHostKeyChecking no" >> ${SSHDIR}/config \
&& echo "       Port 3022" >> ${SSHDIR}/config \
&& echo 'Port 3022' >> ${SSHDIR}/sshd_config \
&& echo "HostKey ${SSHDIR}/ssh_host_rsa_key" >> ${SSHDIR}/sshd_config \
&& echo "PidFile ${SSHDIR}/sshd.pid" >> ${SSHDIR}/sshd_config \
&& chmod -R 600 ${SSHDIR}/*
RUN eval `ssh-agent -s` && ssh-add ${SSHDIR}/id_rsa

Setting up Supervisord

Save the following as supervisord.conf. This configuration kicks off the sshd service, creates a Hostfile with IPs of the cluster, and kicks off our distributed training script.

[supervisord]
logfile = /tmp/supervisord.log
logfile_maxbytes = 50MB
logfile_backups=10
loglevel = info
pidfile = /tmp/supervisord.pid
nodaemon = false
minfds = 1024
minprocs = 200
umask = 022
identifier = supervisor
directory = /tmp
nocleanup = true
childlogdir = /tmp
strip_ansi = false

[program:sshd]
command=/usr/sbin/sshd -D -f /root/.ssh/sshd_config -h /root/.ssh/ssh_host_rsa_key
stdout_logfile=/dev/fd/1
stdout_logfile_maxbytes=0
redirect_stderr=true
autorestart=true
stopsignal=INT


[program:synchronize]
command=/run_batch.sh
stdout_logfile=/dev/fd/1
stdout_logfile_maxbytes=0
redirect_stderr=true
autorestart=false
startsecs=0
stopsignal=INT
exitcodes=0,2

Creating a Dockerfile

To kick off the scripts when the container gets launched, create a entry-point.sh with the following:

#!/bin/bash

# Launch supervisor
BASENAME="${0##*/}"

log () {
    echo "${BASENAME} - ${1}"
}

AWS_BATCH_EXIT_CODE_FILE="/tmp/batch-exit-code"
supervisord -n -c "/conf/supervisord.conf"

# if supervisor dies then read exit code from file we don't want to return the supervisors exit code
log "Reading exit code from batch script stored at $AWS_BATCH_EXIT_CODE_FILE"
if [ ! -f $AWS_BATCH_EXIT_CODE_FILE ]; then
    echo "Exit code file not found , returning with exit code 1!" >&2
    exit 1
fi

exit $(($(cat $AWS_BATCH_EXIT_CODE_FILE)))

In a file named Dockerfile, bring all the above scripts together:

FROM vault.habana.ai/gaudi-docker/1.7.1/ubuntu20.04/habanalabs/tensorflow-installer-tf-cpu-2.10.1:latest

ENV HOME /root
RUN echo $HOME

#################################
# Clone Model References
#################################
RUN git clone -b 1.7.1 https://github.com/HabanaAI/Model-References.git /Model-References
ENV PYTHONPATH=$PYTHONPATH:/Model-References

######################################################################################################
# Installs hccl_ofi_wrapper to interact with libfabric to utilize HW and networking mode (EFA)
######################################################################################################
ARG OFI_WRAPPER_WS="$HOME/hccl_ofi_wrapper"
ENV LD_LIBRARY_PATH=$LD_LIBRARY_PATH:"${OFI_WRAPPER_WS}"

RUN git clone "https://github.com/HabanaAI/hccl_ofi_wrapper.git" "${OFI_WRAPPER_WS}" && \
    cd "${OFI_WRAPPER_WS}" && \
    ln -s /opt/amazon/efa/lib64 /opt/amazon/efa/lib && \
    LIBFABRIC_ROOT=/opt/amazon/efa make

#################################
# Supervisord Install
#################################
RUN apt-get update
RUN pip install supervisor

#################################
# SSH Setup
#################################
ENV SSHDIR $HOME/.ssh
RUN mkdir -p ${SSHDIR} \
&& touch ${SSHDIR}/sshd_config \
&& ssh-keygen -t rsa -f ${SSHDIR}/ssh_host_rsa_key -N '' \
&& cp ${SSHDIR}/ssh_host_rsa_key.pub ${SSHDIR}/authorized_keys \
&& cp ${SSHDIR}/ssh_host_rsa_key ${SSHDIR}/id_rsa \
&& echo "       IdentityFile ${SSHDIR}/id_rsa" >> ${SSHDIR}/config \
&& echo "       StrictHostKeyChecking no" >> ${SSHDIR}/config \
&& echo "       Port 3022" >> ${SSHDIR}/config \
&& echo 'Port 3022' >> ${SSHDIR}/sshd_config \
&& echo "HostKey ${SSHDIR}/ssh_host_rsa_key" >> ${SSHDIR}/sshd_config \
&& echo "PidFile ${SSHDIR}/sshd.pid" >> ${SSHDIR}/sshd_config \
&& chmod -R 600 ${SSHDIR}/*
RUN eval `ssh-agent -s` && ssh-add ${SSHDIR}/id_rsa

#################################
# Copy Necessary Scripts
#################################
COPY entry-point.sh /entry-point.sh
COPY run_batch.sh /run_batch.sh
COPY run_mnist.sh /run_mnist.sh
COPY supervisord.conf /conf/supervisord.conf

CMD /entry-point.sh

Build and push image to AWS’s Elastic Container Registry (ECR) for ease of access on EC2 instances. For further information on how to build and push an image to ECR, refer to Create Elastic Container Registry (ECR) and Upload Images or to the Amazon ECR Getting Started Guide.