Advanced Model Training Batch Example: ResNet50
An advanced use case involves a large dataset. Amazon Elastic File System (EFS) provides simple, serverless, scalable storage that is well suited to hosting datasets. This section focuses on:
Creating a Dockerfile with ResNet50 dependencies
Updating AWS Batch shell scripts to run ResNet50 training scripts from Model References
Building and pushing the Batch ResNet50 image to ECR
Creating and submitting a Job Definition
The ResNet50 model requires additional dependencies to be installed. The instructions below cover setting up a ResNet dataset and creating and submitting a ResNet50 job on AWS Batch only.
ResNet50 Docker Image¶
The standard Dockerfile example needs modifications to run ResNet50 properly. The modified Dockerfile is shown below.
FROM vault.habana.ai/gaudi-docker/1.19.1/ubuntu22.04/habanalabs/pytorch-installer-2.5.1:latest
ENV HOME /root
RUN echo $HOME
#################################
# Clone Model References
#################################
RUN git clone -b 1.19.0 https://github.com/HabanaAI/Model-References.git /Model-References && \
python3 -m pip install -r /Model-References/PyTorch/computer_vision/classification/torchvision/requirements.txt
ENV PYTHONPATH=$PYTHONPATH:/Model-References
#################################
# Supervisord Install
#################################
RUN apt-get update
RUN pip install supervisor
#################################
# SSH Setup
#################################
ENV SSHDIR $HOME/.ssh
RUN mkdir -p ${SSHDIR} \
&& touch ${SSHDIR}/sshd_config \
&& ssh-keygen -t rsa -f ${SSHDIR}/ssh_host_rsa_key -N '' \
&& cp ${SSHDIR}/ssh_host_rsa_key.pub ${SSHDIR}/authorized_keys \
&& cp ${SSHDIR}/ssh_host_rsa_key ${SSHDIR}/id_rsa \
&& echo " IdentityFile ${SSHDIR}/id_rsa" >> ${SSHDIR}/config \
&& echo " StrictHostKeyChecking no" >> ${SSHDIR}/config \
&& echo " Port 3022" >> ${SSHDIR}/config \
&& echo 'Port 3022' >> ${SSHDIR}/sshd_config \
&& echo "HostKey ${SSHDIR}/ssh_host_rsa_key" >> ${SSHDIR}/sshd_config \
&& echo "PidFile ${SSHDIR}/sshd.pid" >> ${SSHDIR}/sshd_config \
&& chmod -R 600 ${SSHDIR}/*
RUN eval `ssh-agent -s` && ssh-add ${SSHDIR}/id_rsa
#################################
# Copy Necessary Scripts
#################################
COPY entry-point.sh /entry-point.sh
COPY run_batch.sh /run_batch.sh
COPY run_resnet50.sh /run_resnet50.sh
COPY supervisord.conf /conf/supervisord.conf
CMD /entry-point.sh
ResNet50 Training Script¶
Save the following code as run_resnet50.sh. This launches a distributed ResNet50 training example through mpirun.
#!/bin/bash
###############################################################################################
# Example: Training ResNet50 on multiple nodes with 8 cards each
###############################################################################################
HOSTSFILE=$HOSTSFILE;
MASTER_ADDR=$MASTER_IP;
NUM_NODES=$NUM_NODES;
CARDS_PER_NODE=8;
N_CARDS=$((NUM_NODES*CARDS_PER_NODE));
MODEL_PATH=/Model-References/PyTorch/computer_vision/classification/torchvision/;
DATA_DIR=/data/pytorch/imagenet/ILSVRC2012;
CMD="python $MODEL_PATH/train.py \
--data-path=${DATA_DIR} \
--model=resnet50 \
--device=hpu \
--batch-size=256 \
--epochs=90 \
--print-freq=1 \
--output-dir=. \
--seed=123 \
--autocast \
--custom-lr-values 0.275 0.45 0.625 0.8 0.08 0.008 0.0008 \
--custom-lr-milestones 1 2 3 4 30 60 80 \
--deterministic \
--dl-time-exclude=False";
# Configure multinode
if [ "$NUM_NODES" -ne "1" -a -f "$HOSTSFILE" ]
then
MULTINODE_CMD="--hostfile $HOSTSFILE";
fi
mpirun -np ${N_CARDS} \
--allow-run-as-root \
$MULTINODE_CMD \
--prefix $MPI_ROOT \
--mca plm_rsh_args "-p 3022" \
--map-by ppr:4:socket:PE=6 \
--bind-to core \
-x PYTHONPATH="/usr/lib/habanalabs:/Model-References" \
-x RDMAV_FORK_SAFE=1 \
-x FI_EFA_USE_DEVICE_RDMA=1 \
-x MASTER_ADDR=$MASTER_ADDR \
$CMD;
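In the script above, MULTINODE_CMD is only set when NUM_NODES is not 1 and the hosts file exists, so a missing hosts file leaves mpirun without a --hostfile argument even for a multi-node run. The following is a minimal sketch of a pre-flight check that could be called before mpirun to fail early in that case. The function name check_launch_env and its data-directory argument are hypothetical and not part of the original script.

```shell
#!/bin/bash
# Hypothetical pre-flight check (not part of the original run_resnet50.sh).
# Usage: check_launch_env "$DATA_DIR" || exit 1
check_launch_env() {
    local data_dir="$1"

    # run_batch.sh is expected to pass these in the environment.
    if [ -z "$NUM_NODES" ] || [ -z "$MASTER_IP" ]; then
        echo "NUM_NODES and MASTER_IP must be set" >&2
        return 1
    fi

    # A multi-node run needs the hosts file so mpirun can reach the other nodes.
    if [ "$NUM_NODES" -ne 1 ] && [ ! -f "$HOSTSFILE" ]; then
        echo "HOSTSFILE '$HOSTSFILE' not found for a ${NUM_NODES}-node run" >&2
        return 1
    fi

    # The dataset directory must be visible inside the container.
    if [ ! -d "$data_dir" ]; then
        echo "Dataset directory '$data_dir' not found" >&2
        return 1
    fi
}
```

Calling the check right after the variable assignments keeps a misconfigured job from consuming instance time before it fails.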
Update run_batch.sh script to run ResNet50 Training¶
The wait_for_nodes() function in run_batch.sh has an area that can be treated as a template for running other distributed training models. Update this area to run run_resnet50.sh:
MASTER_IP=$ip HOSTSFILE=$HOST_FILE_PATH-deduped NUM_NODES=$lines /run_resnet50.sh
Build and Push Image to ECR¶
Build and push a resnet50_batch_training image to Amazon Elastic Container Registry (ECR) so that it is easily accessible from EC2 instances. For further information on how to build and push an image to ECR, refer to Create Elastic Container Registry (ECR) and Upload Images or to the Amazon ECR Getting Started Guide.
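The build-and-push step can be sketched as follows. This is a minimal sketch, assuming Docker and AWS CLI v2 are installed, the ECR repository already exists, and the Dockerfile and scripts are in the current directory. The helper names ecr_image_uri and build_and_push are hypothetical, and the account ID, region, and tag in the example call are placeholders.

```shell
#!/bin/bash
# Hypothetical helper: compose an ECR image URI from its parts.
ecr_image_uri() {
    local account_id="$1" region="$2" repo="$3" tag="$4"
    echo "${account_id}.dkr.ecr.${region}.amazonaws.com/${repo}:${tag}"
}

# Hypothetical helper: build the image from the current directory and push it.
build_and_push() {
    local account_id="$1" region="$2" repo="$3" tag="$4"
    local registry="${account_id}.dkr.ecr.${region}.amazonaws.com"
    local uri
    uri=$(ecr_image_uri "$account_id" "$region" "$repo" "$tag")

    # Authenticate Docker against the registry.
    aws ecr get-login-password --region "$region" \
        | docker login --username AWS --password-stdin "$registry" || return 1

    docker build -t "$uri" . || return 1
    docker push "$uri"
}

# Example call (placeholder values):
# build_and_push 123456789012 us-west-2 resnet50_batch_training v1
```

The URI printed by ecr_image_uri is the value that later replaces IMAGE_NAME in the job definition.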
Upload Data to Elastic File System (EFS)¶
Follow the instructions for ResNet Data Generation and upload the resulting dataset to Amazon EFS. For more details on how to use EFS, refer to the Amazon EFS Getting Started Guide. Hosting the dataset on EFS gives every node in the cluster access to the training data.
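One way to upload the data is to mount the file system over NFS on an instance and copy the dataset into it. The following is a rough sketch under several assumptions: an NFS client and rsync are installed, the instance can reach an EFS mount target, and the job definition mounts the file system root at /data, so that the copied files appear at the DATA_DIR path used by run_resnet50.sh. The helper names efs_dns_name and upload_dataset and the default mount directory /mnt/efs are hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: regional DNS name of an EFS file system.
efs_dns_name() {
    local fs_id="$1" region="$2"
    echo "${fs_id}.efs.${region}.amazonaws.com"
}

# Hypothetical helper: mount the file system and copy a local dataset into it.
upload_dataset() {
    local fs_id="$1" region="$2" src_dir="$3" mount_dir="${4:-/mnt/efs}"

    sudo mkdir -p "$mount_dir" || return 1
    sudo mount -t nfs4 \
        -o nfsvers=4.1,rsize=1048576,wsize=1048576,hard,timeo=600,retrans=2,noresvport \
        "$(efs_dns_name "$fs_id" "$region"):/" "$mount_dir" || return 1

    # Mirror the layout that DATA_DIR expects below the /data mount point.
    sudo mkdir -p "$mount_dir/pytorch/imagenet" || return 1
    sudo rsync -a "$src_dir/" "$mount_dir/pytorch/imagenet/ILSVRC2012/"
}

# Example call (placeholder values):
# upload_dataset fs-xxxxxxxx us-west-2 /local/path/to/ILSVRC2012
```

If the job definition uses a rootDirectory other than the file system root, the target path in the copy step has to be adjusted to sit below that directory.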
Create AWS Batch Job Definition¶
Using Amazon EFS with AWS Batch requires specifying the volume and mount point configurations in the job definition.
Create resnet50_jd.json with the following configuration and update the placeholders:
{
  "jobDefinitionName": "resnet50_jd",
  "type": "multinode",
  "nodeProperties": {
    "numNodes": 2,
    "mainNode": 0,
    "nodeRangeProperties": [
      {
        "targetNodes": "0:",
        "container": {
          "image": "IMAGE_NAME",
          "command": [],
          "jobRoleArn": "TASK_EXEC_ROLE",
          "resourceRequirements": [
            { "type": "MEMORY", "value": "760000" },
            { "type": "VCPU", "value": "96" }
          ],
          "mountPoints": [
            {
              "sourceVolume": "myEfsVolume",
              "containerPath": "/data",
              "readOnly": true
            }
          ],
          "volumes": [
            {
              "name": "myEfsVolume",
              "efsVolumeConfiguration": {
                "fileSystemId": "EFS_ID",
                "rootDirectory": "/path/to/my/data"
              }
            }
          ],
          "environment": [],
          "ulimits": [],
          "instanceType": "dl1.24xlarge",
          "linuxParameters": {
            "devices": [
              {
                "hostPath": "/dev/infiniband/uverbs0",
                "containerPath": "/dev/infiniband/uverbs0",
                "permissions": [ "READ", "WRITE", "MKNOD" ]
              },
              {
                "hostPath": "/dev/infiniband/uverbs1",
                "containerPath": "/dev/infiniband/uverbs1",
                "permissions": [ "READ", "WRITE", "MKNOD" ]
              },
              {
                "hostPath": "/dev/infiniband/uverbs2",
                "containerPath": "/dev/infiniband/uverbs2",
                "permissions": [ "READ", "WRITE", "MKNOD" ]
              },
              {
                "hostPath": "/dev/infiniband/uverbs3",
                "containerPath": "/dev/infiniband/uverbs3",
                "permissions": [ "READ", "WRITE", "MKNOD" ]
              }
            ]
          },
          "privileged": true
        }
      }
    ]
  }
}
Placeholder       Replace with
IMAGE_NAME        xxxxxxxxx.dkr.ecr.us-west-2.amazonaws.com/resnet50_batch_training:v1
TASK_EXEC_ROLE    arn:aws:iam::xxxxxxx:role/ecsTaskExecutionRole
EFS_ID            fs-xxxxxxxx
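Filling in the placeholders can be scripted. The following is a minimal sketch, assuming resnet50_jd.json contains the placeholder strings exactly as named in the table above and that the replacement values contain neither "|" nor "&". The function name fill_job_definition is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: substitute the job definition placeholders in a file.
# "|" is used as the sed delimiter because image URIs and ARNs contain "/" and ":".
fill_job_definition() {
    local file="$1" image="$2" role_arn="$3" efs_id="$4"
    local tmp="${file}.tmp"

    sed -e "s|IMAGE_NAME|${image}|" \
        -e "s|TASK_EXEC_ROLE|${role_arn}|" \
        -e "s|EFS_ID|${efs_id}|" \
        "$file" > "$tmp" || return 1

    # Replace the original only after sed has succeeded.
    mv "$tmp" "$file"
}

# Example call (placeholder values):
# fill_job_definition resnet50_jd.json \
#     xxxxxxxxx.dkr.ecr.us-west-2.amazonaws.com/resnet50_batch_training:v1 \
#     arn:aws:iam::xxxxxxx:role/ecsTaskExecutionRole \
#     fs-xxxxxxxx
```

Writing to a temporary file first avoids relying on the in-place option of sed, which differs between GNU and BSD versions.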
Create a job definition:
aws batch register-job-definition --cli-input-json file://resnet50_jd.json
# Expected Results
{
"jobDefinitionName": "resnet50_jd",
"jobDefinitionArn": "arn:aws:batch:us-west-2:xxxxxxxxxxxx:job-definition/resnet50_jd:1",
"revision": 1
}
Submit AWS Batch Job¶
To submit a job, run the following command:
aws batch submit-job --job-name resnet50_batch --job-definition resnet50_jd --job-queue dl1_mnp_jq --node-overrides numNodes=2
# Expected Results
{
"jobArn": "arn:aws:batch:us-west-2:xxxxxxxxxxxx:job/a434b6e9-5fda-415d-befb-079b04c95a97",
"jobName": "resnet50_batch",
"jobId": "a434b6e9-5fda-415d-befb-079b04c95a97"
}
Note
Jobs can also be submitted, and their status viewed, through the AWS Batch console.
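Job status can also be checked from the command line with aws batch describe-jobs. The following is a minimal sketch that polls a job until it reaches a terminal state. The function name wait_for_job and the POLL_SECONDS variable are hypothetical; the job ID is the jobId value returned by submit-job.

```shell
#!/bin/bash
# Hypothetical helper: poll an AWS Batch job until it succeeds or fails.
# Returns 0 on SUCCEEDED, 1 on FAILED, 2 if the status query itself fails.
wait_for_job() {
    local job_id="$1" status

    while true; do
        status=$(aws batch describe-jobs --jobs "$job_id" \
            --query 'jobs[0].status' --output text) || return 2
        echo "Job ${job_id}: ${status}"

        case "$status" in
            SUCCEEDED) return 0 ;;
            FAILED)    return 1 ;;
        esac

        # POLL_SECONDS is an assumed tunable of this sketch, not an AWS setting.
        sleep "${POLL_SECONDS:-30}"
    done
}

# Example call, using the job ID from the expected results above:
# wait_for_job a434b6e9-5fda-415d-befb-079b04c95a97
```

Any non-terminal status, such as RUNNABLE or RUNNING, simply causes another poll after the wait interval.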
Observe Submitted AWS Batch Job Logs¶
AWS Batch sends job logs to Amazon CloudWatch Logs. Follow View Log Data sent to CloudWatch Logs for specific instructions.
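The logs can also be read from the command line. The following is a rough sketch, assuming the job writes to the default /aws/batch/job log group and that describe-jobs reports a log stream name for the given job ID. For a multi-node job, it is assumed here that each node is queried separately with an ID of the form JOB_ID#NODE_INDEX. The function name print_job_logs is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: print the CloudWatch log messages of one Batch job.
print_job_logs() {
    local job_id="$1" stream

    # Look up the log stream that AWS Batch assigned to the job.
    stream=$(aws batch describe-jobs --jobs "$job_id" \
        --query 'jobs[0].container.logStreamName' --output text) || return 1

    # With --output text, a missing value is printed as "None".
    if [ -z "$stream" ] || [ "$stream" = "None" ]; then
        echo "No log stream available yet for ${job_id}" >&2
        return 1
    fi

    aws logs get-log-events \
        --log-group-name /aws/batch/job \
        --log-stream-name "$stream" \
        --query 'events[].message' --output text
}

# Example call for the main node of a multi-node job (assumed ID format):
# print_job_logs 'a434b6e9-5fda-415d-befb-079b04c95a97#0'
```

A log stream only exists once the container has started, so the helper reports an error for jobs that are still queued.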