Creating a Multi-Node Parallel (MNP) Compatible Docker Image
Running AWS Batch requires a Docker image that contains the application, code, and scripts. AWS Batch provides Docker-to-Docker communication across nodes through ECS task networking. Multi-node parallel jobs set environment variables in the ECS containers that provide the metadata needed to determine the number of nodes and the master/child relationships.
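As a minimal illustration, the following sketch shows how a container can read these variables to determine whether it is the main node or a child node; the run_batch.sh script described below performs the same check:

#!/bin/bash
# Variables injected by AWS Batch into every container of a multi-node parallel job
# (the names are the documented AWS Batch variables; the values shown are examples for a 2-node job):
#   AWS_BATCH_JOB_NUM_NODES=2
#   AWS_BATCH_JOB_NODE_INDEX=0 or 1
#   AWS_BATCH_JOB_MAIN_NODE_INDEX=0
#   AWS_BATCH_JOB_MAIN_NODE_PRIVATE_IPV4_ADDRESS is set on child nodes
if [ "${AWS_BATCH_JOB_NODE_INDEX}" == "${AWS_BATCH_JOB_MAIN_NODE_INDEX}" ]; then
  echo "This container is the main node"
else
  echo "This container is a child node"
fi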
To run MNP in the ECS containers, create a Dockerfile with the following key elements:

1. Prepare a run_batch.sh script that gathers the IP addresses of the cluster into a hostfile.
2. Prepare a run_mnist.sh training script that launches a simple distributed MNIST training example through mpirun.
3. Set up passwordless SSH to bypass SSH authentication for containers using this image.
4. Set up supervisord.conf to run SSH on startup and launch run_batch.sh as the CMD.
Preparing a Hostfile
The following code block gathers the IP addresses of the cluster into a hostfile on the master node and starts training. It automates collecting the IP addresses needed to run distributed training and can be added as part of the CMD structure of a Docker image.

The script launches a simple distributed MNIST training example. You can run other distributed training models by updating the wait_for_nodes() function, as shown in the code example below.

Save the following code as run_batch.sh:
#!/bin/bash

BASENAME="${0##*/}"
log () {
  echo "${BASENAME} - ${1}"
}

HOST_FILE_PATH="/tmp/hostfile"
AWS_BATCH_EXIT_CODE_FILE="/tmp/batch-exit-code"

usage () {
  if [ "${#@}" -ne 0 ]; then
    log "* ${*}"
    log
  fi
  cat <<ENDUSAGE
Usage:
export AWS_BATCH_JOB_NODE_INDEX=0
export AWS_BATCH_JOB_NUM_NODES=10
export AWS_BATCH_JOB_MAIN_NODE_INDEX=0
export AWS_BATCH_JOB_ID=string
./run_batch.sh
ENDUSAGE
  error_exit
}

# Standard function to print an error and exit with a failing return code
error_exit () {
  log "${BASENAME} - ${1}" >&2
  log "${2:-1}" > $AWS_BATCH_EXIT_CODE_FILE
}

# Set child by default switch to main if on main node container
NODE_TYPE="child"
if [ "${AWS_BATCH_JOB_MAIN_NODE_INDEX}" == "${AWS_BATCH_JOB_NODE_INDEX}" ]; then
  log "Running synchronize as the main node"
  NODE_TYPE="main"
fi

# wait for all nodes to report
wait_for_nodes () {
  log "Running as master node"
  touch $HOST_FILE_PATH
  ip=$(/sbin/ip -o -4 addr list eth0 | awk '{print $4}' | cut -d/ -f1)
  if [ -x "$(command -v hl-smi)" ] ; then
    NUM_HPUS=$(ls -l /dev/hl[0-9] | wc -l)
    availablecores=$NUM_HPUS
  else
    availablecores=$(nproc)
  fi
  log "master details -> $ip:$availablecores"
  echo "$ip" >> $HOST_FILE_PATH
  lines=$(sort $HOST_FILE_PATH|uniq|wc -l)
  while [ "$AWS_BATCH_JOB_NUM_NODES" -gt "$lines" ]
  do
    log "$lines out of $AWS_BATCH_JOB_NUM_NODES nodes joined, check again in 1 second"
    sleep 1
    lines=$(sort $HOST_FILE_PATH|uniq|wc -l)
  done
  log "All nodes successfully joined"

  # remove duplicates if there are any.
  awk '!a[$0]++' $HOST_FILE_PATH > ${HOST_FILE_PATH}-deduped
  cat $HOST_FILE_PATH-deduped
  log "executing main training workflow"

  # Run MNIST training
  MASTER_IP=$ip HOSTSFILE=$HOST_FILE_PATH-deduped NUM_NODES=$lines /run_mnist.sh

  log "done! goodbye, writing exit code to $AWS_BATCH_EXIT_CODE_FILE and shutting down my supervisord"
  echo "0" > $AWS_BATCH_EXIT_CODE_FILE
  kill $(cat /tmp/supervisord.pid)
  exit 0
}

# Sends IPs to Master Hostfile
report_to_master () {
  # get own ip
  ip=$(/sbin/ip -o -4 addr list eth0 | awk '{print $4}' | cut -d/ -f1)
  if [ -x "$(command -v hl-smi)" ] ; then
    NUM_HPUS=$(ls -l /dev/hl[0-9] | wc -l)
    availablecores=$NUM_HPUS
  else
    availablecores=$(nproc)
  fi
  log "I am a child node -> $ip:$availablecores, reporting to the master node -> ${AWS_BATCH_JOB_MAIN_NODE_PRIVATE_IPV4_ADDRESS}"
  until echo "$ip" | ssh ${AWS_BATCH_JOB_MAIN_NODE_PRIVATE_IPV4_ADDRESS} "cat >> $HOST_FILE_PATH"
  do
    echo "Sleeping 5 seconds and trying again"
    sleep 5
  done
  log "done! goodbye"
  exit 0
}

# Main - dispatch user request to appropriate function
log $NODE_TYPE
case $NODE_TYPE in
  main)
    wait_for_nodes "${@}"
    ;;
  child)
    report_to_master "${@}"
    ;;
  *)
    log $NODE_TYPE
    usage "Could not determine node type. Expected (main/child)"
    ;;
esac
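On a two-node job, for example, the deduped hostfile that the main node builds simply lists one private IP address per node (the addresses below are illustrative only):

10.0.1.23
10.0.1.87

wait_for_nodes() then passes these values to the training script through the MASTER_IP, HOSTSFILE, and NUM_NODES environment variables. To run a different distributed training model, replace the /run_mnist.sh invocation inside wait_for_nodes() with your own launch script.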
Preparing MNIST Training Scripts

Save the following code as run_mnist.sh. This will launch a simple distributed MNIST training example through mpirun.
#!/bin/bash
###############################################################################################
# Example: Training MNIST on multiple nodes with 8 cards each
###############################################################################################
HOSTSFILE=$HOSTSFILE;
MASTER_ADDR=$MASTER_IP;
NUM_NODES=$NUM_NODES;
CARDS_PER_NODE=8;
N_CARDS=$((NUM_NODES*CARDS_PER_NODE));
echo "Starting MNIST Training";
MODEL_PATH=/Model-References/PyTorch/examples/computer_vision/hello_world;
MNIST_CMD="python $MODEL_PATH/mnist.py \
--batch-size=64 \
--epochs=1 \
--lr=1.0 \
--gamma=0.7 \
--hpu";
mpirun -np ${N_CARDS} \
--allow-run-as-root \
--hostfile $HOSTSFILE \
--mca plm_rsh_args "-p3022" \
--map-by ppr:4:socket:PE=6 \
--prefix $MPI_ROOT \
-x RDMAV_FORK_SAFE=1 \
-x FI_EFA_USE_DEVICE_RDMA=1 \
-x MASTER_ADDR=$MASTER_ADDR \
$MNIST_CMD;
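For reference, run_batch.sh invokes this script on the main node roughly as follows; the IP address and node count shown here are illustrative only:

MASTER_IP=10.0.1.23 HOSTSFILE=/tmp/hostfile-deduped NUM_NODES=2 /run_mnist.sh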
Setting up Passwordless SSH

To achieve passwordless SSH, generate an SSH key and append it to the ~/.ssh/authorized_keys file inside the Docker image. This bypasses SSH authentication for containers using this image.
#################################
# SSH Setup
#################################
ENV SSHDIR $HOME/.ssh
RUN mkdir -p ${SSHDIR} \
&& touch ${SSHDIR}/sshd_config \
&& ssh-keygen -t rsa -f ${SSHDIR}/ssh_host_rsa_key -N '' \
&& cp ${SSHDIR}/ssh_host_rsa_key.pub ${SSHDIR}/authorized_keys \
&& cp ${SSHDIR}/ssh_host_rsa_key ${SSHDIR}/id_rsa \
&& echo " IdentityFile ${SSHDIR}/id_rsa" >> ${SSHDIR}/config \
&& echo " StrictHostKeyChecking no" >> ${SSHDIR}/config \
&& echo " Port 3022" >> ${SSHDIR}/config \
&& echo 'Port 3022' >> ${SSHDIR}/sshd_config \
&& echo "HostKey ${SSHDIR}/ssh_host_rsa_key" >> ${SSHDIR}/sshd_config \
&& echo "PidFile ${SSHDIR}/sshd.pid" >> ${SSHDIR}/sshd_config \
&& chmod -R 600 ${SSHDIR}/*
RUN eval `ssh-agent -s` && ssh-add ${SSHDIR}/id_rsa
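Because every container built from this image shares the same key pair and client configuration, SSH between nodes succeeds without a password prompt. Once the containers are running, a quick way to check this from any node is the following; the IP address is illustrative only:

ssh 10.0.1.87 hostname

The port (3022) and identity file are picked up from ${SSHDIR}/config; run_mnist.sh additionally passes "-p3022" to the SSH launcher via --mca plm_rsh_args so that mpirun connects on the same port.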
Setting up Supervisord

Save the following as supervisord.conf. This configuration starts the sshd service, creates a hostfile with the IPs of the cluster, and launches the distributed training script.
[supervisord]
logfile = /tmp/supervisord.log
logfile_maxbytes = 50MB
logfile_backups=10
loglevel = info
pidfile = /tmp/supervisord.pid
nodaemon = false
minfds = 1024
minprocs = 200
umask = 022
identifier = supervisor
directory = /tmp
nocleanup = true
childlogdir = /tmp
strip_ansi = false
[program:sshd]
command=/usr/sbin/sshd -D -f /root/.ssh/sshd_config -h /root/.ssh/ssh_host_rsa_key
stdout_logfile=/dev/fd/1
stdout_logfile_maxbytes=0
redirect_stderr=true
autorestart=true
stopsignal=INT
[program:synchronize]
command=/run_batch.sh
stdout_logfile=/dev/fd/1
stdout_logfile_maxbytes=0
redirect_stderr=true
autorestart=false
startsecs=0
stopsignal=INT
exitcodes=0,2
Creating a Dockerfile

To launch the scripts when the container starts, create an entry-point.sh file with the following:
#!/bin/bash
# Launch supervisor
BASENAME="${0##*/}"
log () {
  echo "${BASENAME} - ${1}"
}

AWS_BATCH_EXIT_CODE_FILE="/tmp/batch-exit-code"

supervisord -n -c "/conf/supervisord.conf"

# When supervisord exits, read the exit code that run_batch.sh wrote to the file;
# we don't want to return supervisord's own exit code
log "Reading exit code from batch script stored at $AWS_BATCH_EXIT_CODE_FILE"
if [ ! -f $AWS_BATCH_EXIT_CODE_FILE ]; then
  echo "Exit code file not found, returning with exit code 1!" >&2
  exit 1
fi
exit $(($(cat $AWS_BATCH_EXIT_CODE_FILE)))
In a file named Dockerfile, bring all of the above scripts together:
FROM vault.habana.ai/gaudi-docker/1.19.0/ubuntu22.04/habanalabs/pytorch-installer-2.5.1:latest
ENV HOME /root
RUN echo $HOME
#################################
# Clone Model References
#################################
RUN git clone -b 1.19.0 https://github.com/HabanaAI/Model-References.git /Model-References
ENV PYTHONPATH=$PYTHONPATH:/Model-References
#################################
# Supervisord Install
#################################
RUN apt-get update
RUN pip install supervisor
#################################
# SSH Setup
#################################
RUN /usr/bin/ssh-keygen -A
ENV SSHDIR $HOME/.ssh
RUN mkdir -p ${SSHDIR} \
&& touch ${SSHDIR}/sshd_config \
&& ssh-keygen -t rsa -f ${SSHDIR}/ssh_host_rsa_key -N '' \
&& cp ${SSHDIR}/ssh_host_rsa_key.pub ${SSHDIR}/authorized_keys \
&& cp ${SSHDIR}/ssh_host_rsa_key ${SSHDIR}/id_rsa \
&& echo " IdentityFile ${SSHDIR}/id_rsa" >> ${SSHDIR}/config \
&& echo " StrictHostKeyChecking no" >> ${SSHDIR}/config \
&& echo " Port 3022" >> ${SSHDIR}/config \
&& echo 'Port 3022' >> ${SSHDIR}/sshd_config \
&& echo "HostKey ${SSHDIR}/ssh_host_rsa_key" >> ${SSHDIR}/sshd_config \
&& echo "PidFile ${SSHDIR}/sshd.pid" >> ${SSHDIR}/sshd_config \
&& chmod -R 600 ${SSHDIR}/*
RUN eval `ssh-agent -s` && ssh-add ${SSHDIR}/id_rsa
#################################
# Copy Necessary Scripts
#################################
COPY entry-point.sh /entry-point.sh
COPY run_batch.sh /run_batch.sh
COPY run_mnist.sh /run_mnist.sh
COPY supervisord.conf /conf/supervisord.conf
CMD /entry-point.sh
Build and push the image to Amazon Elastic Container Registry (ECR) for ease of access on EC2 instances. For further information on how to build and push an image to ECR, refer to Create Elastic Container Registry (ECR) and Upload Images or to the Amazon ECR Getting Started Guide.
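As a minimal sketch, the typical build-and-push flow looks like the following; the repository name (gaudi-batch-mnp), account ID, and region are placeholders to replace with your own values:

docker build -t gaudi-batch-mnp .
# Create the ECR repository first if it does not already exist:
# aws ecr create-repository --repository-name gaudi-batch-mnp
aws ecr get-login-password --region us-west-2 | docker login --username AWS --password-stdin 123456789012.dkr.ecr.us-west-2.amazonaws.com
docker tag gaudi-batch-mnp:latest 123456789012.dkr.ecr.us-west-2.amazonaws.com/gaudi-batch-mnp:latest
docker push 123456789012.dkr.ecr.us-west-2.amazonaws.com/gaudi-batch-mnp:latest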