Advanced Model Training Batch Example: ResNet50 Keras¶
An advanced use case involves a large dataset. Amazon Elastic File System (EFS) provides a simple, serverless, scalable storage service well suited for dataset storage.
This section focuses on:
Creating a Dockerfile with ResNet50 Keras dependencies
Updating AWS Batch shell scripts to run ResNet50 Keras training scripts from Model-References
Building and Pushing Batch ResNet50 Keras Image to ECR
Creating and submitting a Job Definition
The ResNet50 Keras model requires additional dependencies to be installed. Below are instructions for setting up the ResNet50 Keras dataset and for creating and submitting a ResNet50 Keras job on AWS Batch.
ResNet50 Keras Docker Image¶
The standard Dockerfile example needs modifications to properly run ResNet50 Keras. The changes are highlighted below.
FROM vault.habana.ai/gaudi-docker/1.11.0/ubuntu20.04/habanalabs/tensorflow-installer-tf-cpu-2.12.1:latest
ENV HOME /root
RUN echo $HOME
#################################
# Clone Model References
#################################
RUN git clone -b 1.11.0 https://github.com/HabanaAI/Model-References.git /Model-References && \
python3 -m pip install -r /Model-References/TensorFlow/computer_vision/Resnets/resnet_keras/requirements.txt
ENV PYTHONPATH=$PYTHONPATH:/Model-References
#################################
# Supervisord Install
#################################
RUN apt-get update
RUN pip install supervisor
#################################
# SSH Setup
#################################
ENV SSHDIR $HOME/.ssh
RUN mkdir -p ${SSHDIR} \
&& touch ${SSHDIR}/sshd_config \
&& ssh-keygen -t rsa -f ${SSHDIR}/ssh_host_rsa_key -N '' \
&& cp ${SSHDIR}/ssh_host_rsa_key.pub ${SSHDIR}/authorized_keys \
&& cp ${SSHDIR}/ssh_host_rsa_key ${SSHDIR}/id_rsa \
&& echo " IdentityFile ${SSHDIR}/id_rsa" >> ${SSHDIR}/config \
&& echo " StrictHostKeyChecking no" >> ${SSHDIR}/config \
&& echo " Port 3022" >> ${SSHDIR}/config \
&& echo 'Port 3022' >> ${SSHDIR}/sshd_config \
&& echo "HostKey ${SSHDIR}/ssh_host_rsa_key" >> ${SSHDIR}/sshd_config \
&& echo "PidFile ${SSHDIR}/sshd.pid" >> ${SSHDIR}/sshd_config \
&& chmod -R 600 ${SSHDIR}/*
RUN eval `ssh-agent -s` && ssh-add ${SSHDIR}/id_rsa
#################################
# Copy Necessary Scripts
#################################
COPY entry-point.sh /entry-point.sh
COPY run_batch.sh /run_batch.sh
COPY run_resnet50_keras.sh /run_resnet50_keras.sh
COPY supervisord.conf /conf/supervisord.conf
CMD /entry-point.sh
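Optionally, a quick local build and smoke test confirms that the Model-References clone and the requirements install completed. The tag below is an arbitrary example:

# Build the image locally and check that the training entry script is present.
# The tag resnet50-keras-smoke is an arbitrary example name.
docker build -t resnet50-keras-smoke .
docker run --rm resnet50-keras-smoke \
    ls /Model-References/TensorFlow/computer_vision/Resnets/resnet_keras/resnet_ctl_imagenet_main.py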
ResNet50 Keras Training Script¶
Save the following code as run_resnet50_keras.sh. This will launch a distributed ResNet50 Keras training example through mpirun.
#!/bin/bash
###############################################################################################
# Example: Training ResNet50 on multiple nodes with 8 cards each
###############################################################################################
model_path=/Model-References/TensorFlow/computer_vision/Resnets/resnet_keras/
# NUM_NODES, MASTER_IP, and HOSTSFILE are provided by the caller (run_batch.sh)
NUM_NODES=$NUM_NODES
NGPU_PER_NODE=8
MASTER_ADDR=$MASTER_IP
let N_CARDS=$NUM_NODES*$NGPU_PER_NODE
DATA_DIR=/data/tensorflow/imagenet2012/tf_records
HOSTSFILE=$HOSTSFILE
CMD="/usr/bin/python3 -u ${model_path}/resnet_ctl_imagenet_main.py \
-dt bf16 \
-dlit bf16 \
-bs 256 \
-te 40 \
-ebe 40 \
--use_horovod \
--data_dir ${DATA_DIR} \
--optimizer LARS \
--base_learning_rate 9.5 \
--warmup_epochs 3 \
--lr_schedule polynomial \
--label_smoothing 0.1 \
--weight_decay 0.0001 \
--single_l2_loss_op "
# Configure multinode
if [ "$NUM_NODES" -ne 1 ] && [ -f "$HOSTSFILE" ]
then
    MULTINODE_CMD="--hostfile $HOSTSFILE"
fi
mpirun --allow-run-as-root \
$MULTINODE_CMD \
-np ${N_CARDS} \
-x LD_LIBRARY_PATH="/opt/amazon/openmpi/lib:/opt/amazon/efa/lib:${LD_LIBRARY_PATH}" \
-x PYTHONPATH="/usr/lib/habanalabs:/Model-References" \
-x HCCL_COMM_ID="$MASTER_ADDR:9696" \
-x HCCL_SOCKET_IFNAME="eth0" \
-x RDMAV_FORK_SAFE=1 \
-x FI_EFA_USE_DEVICE_RDMA=1 \
--mca btl_tcp_if_include "eth0" \
-x MASTER_ADDR=$MASTER_ADDR \
-x MASTER_PORT=12345 \
--prefix /opt/amazon/openmpi \
--mca plm_rsh_args "-p 3022" \
--map-by ppr:4:socket:PE=6 \
--bind-to core \
$CMD
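The script expects NUM_NODES, MASTER_IP, and HOSTSFILE to be set by the caller; on AWS Batch this is done by run_batch.sh. As a hedged illustration, a single-node run inside the container could be started like this (the IP value is an example):

# Example single-node invocation; MASTER_IP value is illustrative.
# With NUM_NODES=1 the hostfile branch is skipped and mpirun starts 8 local ranks.
NUM_NODES=1 MASTER_IP=127.0.0.1 /run_resnet50_keras.sh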
Update run_batch.sh script to run ResNet50 Training¶
The wait_for_nodes() function in run_batch.sh has an area that can be treated as a template for running other distributed training models. Update this area to run run_resnet50_keras.sh:
MASTER_IP=$ip HOSTSFILE=$HOST_FILE_PATH-deduped NUM_NODES=$lines /run_resnet50_keras.sh
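For reference, a minimal sketch of how that area of wait_for_nodes() might look after the change; the variables $ip, $HOST_FILE_PATH, and $lines are assumed to be set earlier in the function, as in the standard AWS Batch multi-node example, and may differ in your copy:

# Sketch of the template area inside wait_for_nodes() in run_batch.sh.
# $ip (main node IP), $HOST_FILE_PATH (collected host file), and $lines
# (number of nodes that checked in) come from earlier in the function.
lines=$(sort "$HOST_FILE_PATH-deduped" | uniq | wc -l)
# Launch the ResNet50 Keras training once all nodes have reported in:
MASTER_IP=$ip HOSTSFILE=$HOST_FILE_PATH-deduped NUM_NODES=$lines /run_resnet50_keras.sh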
Build and Push Image to ECR¶
Update config.properties to reflect the new ResNet50 image:
registry=xxxxxxxxxxxx.dkr.ecr.us-west-2.amazonaws.com
region=us-west-2
image_name=resnet50_keras_batch_training
image_tag=v1
Field | Description
---|---
registry | Update xxxxxxxxxxxx to match your AWS account number
Build and push Image:
./build.sh && ./push.sh
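If the helper scripts are not available, the equivalent steps are roughly as follows, using the registry and image values from config.properties (the account number is a placeholder):

# Approximate equivalent of ./build.sh && ./push.sh
aws ecr get-login-password --region us-west-2 | \
    docker login --username AWS --password-stdin xxxxxxxxxxxx.dkr.ecr.us-west-2.amazonaws.com
docker build -t resnet50_keras_batch_training:v1 .
docker tag resnet50_keras_batch_training:v1 \
    xxxxxxxxxxxx.dkr.ecr.us-west-2.amazonaws.com/resnet50_keras_batch_training:v1
docker push xxxxxxxxxxxx.dkr.ecr.us-west-2.amazonaws.com/resnet50_keras_batch_training:v1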
Upload Data to Elastic File System (EFS)¶
Follow the instructions for ResNet Keras Data Generation and upload the resulting dataset to Amazon EFS. For more details on how to use EFS, refer to the Amazon EFS Getting Started Guide. This gives the cluster easy access to the training dataset.
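As one hedged example, from an EC2 instance in the same VPC with the amazon-efs-utils mount helper installed, the generated TF records could be copied to EFS like this (the file system ID and local path are placeholders):

# Mount the EFS file system and copy the dataset into the expected layout.
# fs-xxxxxxxx and /local/imagenet2012/tf_records are placeholders.
sudo mkdir -p /mnt/efs
sudo mount -t efs fs-xxxxxxxx:/ /mnt/efs
sudo mkdir -p /mnt/efs/tensorflow/imagenet2012
sudo cp -r /local/imagenet2012/tf_records /mnt/efs/tensorflow/imagenet2012/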
Create AWS Batch Job Definition¶
Using Amazon EFS with AWS Batch requires specifying the volume and mount point configurations in the job definition.
Create resnet50_jd.json with the following configuration and update the placeholders:
{
"jobDefinitionName": "resnet50_jd",
"type": "multinode",
"nodeProperties": {
"numNodes": 2,
"mainNode": 0,
"nodeRangeProperties": [
{
"targetNodes": "0:",
"container": {
"image": "IMAGE_NAME",
"command": [],
"jobRoleArn": "TASK_EXEC_ROLE",
"resourceRequirements": [
{
"type": "MEMORY",
"value": "760000"
},
{
"type": "VCPU",
"value": "96"
}
],
"mountPoints": [
{
"sourceVolume": "myEfsVolume",
"containerPath": "/data",
"readOnly": true
}
],
"volumes": [
{
"name": "myEfsVolume",
"efsVolumeConfiguration": {
"fileSystemId": "fs-xxxxxxxx",
"rootDirectory": "/path/to/my/data",
}
}
],
"environment": [],
"ulimits": [],
"instanceType": "dl1.24xlarge",
"linuxParameters": {
"devices": [
{
"hostPath": "/dev/infiniband/uverbs0",
"containerPath": "/dev/infiniband/uverbs0",
"permissions": [
"READ",
"WRITE",
"MKNOD"
]
},
{
"hostPath": "/dev/infiniband/uverbs1",
"containerPath": "/dev/infiniband/uverbs1",
"permissions": [
"READ",
"WRITE",
"MKNOD"
]
},
{
"hostPath": "/dev/infiniband/uverbs2",
"containerPath": "/dev/infiniband/uverbs2",
"permissions": [
"READ",
"WRITE",
"MKNOD"
]
},
{
"hostPath": "/dev/infiniband/uverbs3",
"containerPath": "/dev/infiniband/uverbs3",
"permissions": [
"READ",
"WRITE",
"MKNOD"
]
}
]
},
"privileged": true
}
}
]
}
}
Placeholder | Replace
---|---
IMAGE_NAME | xxxxxxxxx.dkr.ecr.us-west-2.amazonaws.com/resnet50_keras_batch_training:v1
TASK_EXEC_ROLE | arn:aws:iam::xxxxxxx:role/ecsTaskExecutionRole
EFS_ID | fs-xxxxxxxx
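The placeholders can be filled in by hand or scripted; for example (both values below are illustrative):

# Substitute the placeholders in resnet50_jd.json; values are examples only.
sed -i \
    -e 's|IMAGE_NAME|xxxxxxxxx.dkr.ecr.us-west-2.amazonaws.com/resnet50_keras_batch_training:v1|' \
    -e 's|TASK_EXEC_ROLE|arn:aws:iam::xxxxxxx:role/ecsTaskExecutionRole|' \
    resnet50_jd.json
# Also replace fs-xxxxxxxx in efsVolumeConfiguration with your EFS_ID.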
Run the following aws-cli command to create a job definition:
aws batch register-job-definition --cli-input-json file://resnet50_jd.json
# Expected Results
{
"jobDefinitionName": "resnet50_jd",
"jobDefinitionArn": "arn:aws:batch:us-west-2:xxxxxxxxxxxx:job-definition/resnet50_jd:1",
"revision": 1
}
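The registration can be verified with a standard describe call:

# Confirm the job definition is registered and ACTIVE
aws batch describe-job-definitions --job-definition-name resnet50_jd --status ACTIVE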
Submit AWS Batch Job¶
Run the following aws-cli command to submit a job:
aws batch submit-job --job-name resnet50_batch --job-definition resnet50_jd --job-queue dl1_mnp_jq --node-overrides numNodes=2
# Expected Results
{
"jobArn": "arn:aws:batch:us-west-2:xxxxxxxxxxxx:job/a434b6e9-5fda-415d-befb-079b04c95a97",
"jobName": "resnet50_batch",
"jobId": "a434b6e9-5fda-415d-befb-079b04c95a97"
}
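The job's state can be polled with describe-jobs, passing the jobId returned above:

# Poll the job until it reaches RUNNING, SUCCEEDED, or FAILED
aws batch describe-jobs --jobs a434b6e9-5fda-415d-befb-079b04c95a97 \
    --query 'jobs[0].status'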
Note
Jobs can also be submitted and their status viewed through the AWS Batch Console.
Observe Submitted AWS Batch Job Logs¶
AWS Batch creates a log stream hosted in CloudWatch Logs. Follow View Log Data Sent to CloudWatch Logs for specific instructions.
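With AWS CLI v2, the logs can also be streamed from a terminal; /aws/batch/job is the default log group for Batch jobs:

# Stream the job's CloudWatch logs (AWS CLI v2)
aws logs tail /aws/batch/job --follow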