Advanced Model Training Batch Example: ResNet50 Keras

An advanced use case involves a large dataset. Amazon Elastic File System (EFS) provides a simple, serverless, scalable file system that is well suited for dataset storage.

This section focuses on:

  • Creating a Dockerfile with ResNet50 Keras dependencies

  • Updating AWS Batch shell scripts to run ResNet50 Keras training scripts from Model-References

  • Building and Pushing Batch ResNet50 Keras Image to ECR

  • Creating and submitting a Job Definition

The ResNet50 Keras model requires additional dependencies to be installed. The instructions below apply to AWS Batch only and cover setting up the ResNet Keras dataset and creating and submitting a ResNet50 Keras job.

ResNet50 Keras Docker Image

The standard Dockerfile example needs modifications to run ResNet50 Keras properly. The changes are shown in the Dockerfile below.

  FROM vault.habana.ai/gaudi-docker/1.11.0/ubuntu20.04/habanalabs/tensorflow-installer-tf-cpu-2.12.1:latest

  ENV HOME /root
  RUN echo $HOME

  #################################
  # Clone Model References
  #################################
  RUN git clone -b 1.11.0 https://github.com/HabanaAI/Model-References.git /Model-References && \
  python3 -m pip install -r /Model-References/TensorFlow/computer_vision/Resnets/resnet_keras/requirements.txt

  ENV PYTHONPATH=$PYTHONPATH:/Model-References


  #################################
  # Supervisord Install
  #################################
  RUN apt-get update
  RUN pip install supervisor

  #################################
  # SSH Setup
  #################################
  ENV SSHDIR $HOME/.ssh
  RUN mkdir -p ${SSHDIR} \
  && touch ${SSHDIR}/sshd_config \
  && ssh-keygen -t rsa -f ${SSHDIR}/ssh_host_rsa_key -N '' \
  && cp ${SSHDIR}/ssh_host_rsa_key.pub ${SSHDIR}/authorized_keys \
  && cp ${SSHDIR}/ssh_host_rsa_key ${SSHDIR}/id_rsa \
  && echo "       IdentityFile ${SSHDIR}/id_rsa" >> ${SSHDIR}/config \
  && echo "       StrictHostKeyChecking no" >> ${SSHDIR}/config \
  && echo "       Port 3022" >> ${SSHDIR}/config \
  && echo 'Port 3022' >> ${SSHDIR}/sshd_config \
  && echo "HostKey ${SSHDIR}/ssh_host_rsa_key" >> ${SSHDIR}/sshd_config \
  && echo "PidFile ${SSHDIR}/sshd.pid" >> ${SSHDIR}/sshd_config \
  && chmod -R 600 ${SSHDIR}/*
  RUN eval `ssh-agent -s` && ssh-add ${SSHDIR}/id_rsa

  #################################
  # Copy Necessary Scripts
  #################################
  COPY entry-point.sh /entry-point.sh
  COPY run_batch.sh /run_batch.sh
  COPY run_resnet50_keras.sh /run_resnet50_keras.sh
  COPY supervisord.conf /conf/supervisord.conf

  CMD /entry-point.sh

ResNet50 Keras Training Script

Save the following code as run_resnet50_keras.sh. This will launch a distributed ResNet50 Keras training example through mpirun.

#!/bin/bash

###############################################################################################
# Example: Training ResNet50 on multiple nodes with 8 cards each
###############################################################################################

model_path=/Model-References/TensorFlow/computer_vision/Resnets/resnet_keras/

NUM_NODES=$NUM_NODES
NGPU_PER_NODE=8
MASTER_ADDR=$MASTER_IP

let N_CARDS=$NUM_NODES*$NGPU_PER_NODE

DATA_DIR=/data/tensorflow/imagenet2012/tf_records
HOSTSFILE=$HOSTSFILE

CMD="/usr/bin/python3 -u ${model_path}/resnet_ctl_imagenet_main.py \
    -dt bf16 \
    -dlit bf16 \
    -bs 256 \
    -te 40 \
    -ebe 40 \
    --use_horovod \
    --data_dir ${DATA_DIR} \
    --optimizer LARS \
    --base_learning_rate 9.5 \
    --warmup_epochs 3 \
    --lr_schedule polynomial \
    --label_smoothing 0.1 \
    --weight_decay 0.0001 \
    --single_l2_loss_op "

# Configure multinode
if [ "$NUM_NODES" -ne "1" -a -f "$HOSTSFILE" ]
then
    MULTINODE_CMD="--hostfile $HOSTSFILE"
fi

mpirun --allow-run-as-root \
    $MULTINODE_CMD \
    -np ${N_CARDS} \
    -x LD_LIBRARY_PATH="/opt/amazon/openmpi/lib:/opt/amazon/efa/lib:${LD_LIBRARY_PATH}" \
    -x PYTHONPATH="/usr/lib/habanalabs:/Model-References" \
    -x HCCL_COMM_ID="$MASTER_ADDR:9696" \
    -x HCCL_SOCKET_IFNAME="eth0" \
    -x RDMAV_FORK_SAFE=1 \
    -x FI_EFA_USE_DEVICE_RDMA=1 \
    --mca btl_tcp_if_include "eth0" \
    -x MASTER_ADDR=$MASTER_ADDR \
    -x MASTER_PORT=12345 \
    --prefix /opt/amazon/openmpi \
    --mca plm_rsh_args "-p 3022" \
    --map-by ppr:4:socket:PE=6 \
    --bind-to core \
    $CMD

Update run_batch.sh Script to Run ResNet50 Training

The wait_for_nodes() function in run_batch.sh contains a section that can be treated as a template for running other distributed training models. Update that section to launch run_resnet50_keras.sh:

MASTER_IP=$ip HOSTSFILE=$HOST_FILE_PATH-deduped NUM_NODES=$lines /run_resnet50_keras.sh
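
For orientation, here is a minimal sketch of how that invocation sits inside wait_for_nodes(). Everything except the final launch line is an assumption about the standard run_batch.sh layout (only $ip, $HOST_FILE_PATH-deduped, and $lines are taken from the snippet above), so adapt it to your actual script:

# Hypothetical excerpt of wait_for_nodes() in run_batch.sh.
# Only the final launch line changes for ResNet50 Keras.
wait_for_nodes () {
    local lines ip
    # ... wait until every child node has reported its IP into $HOST_FILE_PATH ...
    sort -u "$HOST_FILE_PATH" > "$HOST_FILE_PATH-deduped"
    lines=$(wc -l < "$HOST_FILE_PATH-deduped")
    ip=$(head -n 1 "$HOST_FILE_PATH-deduped" | awk '{print $1}')

    # Launch ResNet50 Keras training instead of the default example:
    MASTER_IP=$ip HOSTSFILE=$HOST_FILE_PATH-deduped NUM_NODES=$lines /run_resnet50_keras.sh
}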

Build and Push Image to ECR

  1. Update config.properties to reflect the new ResNet50 Keras image:

registry=xxxxxxxxxxxx.dkr.ecr.us-west-2.amazonaws.com
region=us-west-2
image_name=resnet50_keras_batch_training
image_tag=v1

Field       Description
-----       -----------
registry    Update xxxx to match your AWS account number.

  2. Build and push the image:

./build.sh && ./push.sh
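
If you are reusing build.sh and push.sh from the basic Batch example, they typically amount to a docker build followed by an ECR login, tag, and push. The sketch below is an assumption about their contents (including the config.properties sourcing), so defer to your actual scripts:

#!/bin/bash
# Sketch only: assumes config.properties holds simple key=value lines
# (registry, region, image_name, image_tag) that can be sourced directly.
source ./config.properties

# build.sh: build the ResNet50 Keras image from the Dockerfile above
docker build -t ${image_name}:${image_tag} .

# push.sh: authenticate to ECR, then tag and push the image
aws ecr get-login-password --region ${region} | \
    docker login --username AWS --password-stdin ${registry}
docker tag ${image_name}:${image_tag} ${registry}/${image_name}:${image_tag}
docker push ${registry}/${image_name}:${image_tag}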

Upload Data to Elastic File System (EFS)

Follow the instructions for ResNet Keras Data Generation and upload the generated dataset to Amazon EFS. For more details on how to use EFS, refer to the Amazon EFS Getting Started Guide. This gives the cluster easy access to the training dataset.
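
As an illustration, the generated TFRecords can be copied onto EFS from any instance in the same VPC. The mount path and dataset locations below are assumptions; lay the files out so that, combined with the /data mount point in the job definition, they end up at the DATA_DIR used by run_resnet50_keras.sh (/data/tensorflow/imagenet2012/tf_records):

# Assumes amazon-efs-utils is installed and fs-xxxxxxxx is your EFS file system ID.
sudo mkdir -p /mnt/efs
sudo mount -t efs fs-xxxxxxxx:/ /mnt/efs

# /path/to/my/data corresponds to rootDirectory in the job definition below;
# it is mounted as /data inside the container.
sudo mkdir -p /mnt/efs/path/to/my/data/tensorflow/imagenet2012
sudo cp -r /path/to/tf_records /mnt/efs/path/to/my/data/tensorflow/imagenet2012/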

Create AWS Batch Job Definition

To use Amazon EFS with AWS Batch, specify the volume and mount point configurations in the job definition.

  1. Create resnet50_keras_jd.json with the following configuration and update the placeholders:

{
    "jobDefinitionName": "resnet50_jd",
    "type": "multinode",
    "nodeProperties": {
        "numNodes": 2,
        "mainNode": 0,
        "nodeRangeProperties": [
            {
                "targetNodes": "0:",
                "container": {
                    "image": "IMAGE_NAME",
                    "command": [],
                    "jobRoleArn": "TASK_EXEC_ROLE",
                    "resourceRequirements": [
                        {
                            "type": "MEMORY",
                            "value": "760000"
                        },
                        {
                            "type": "VCPU",
                            "value": "96"
                        }
                    ],
                    "mountPoints": [
                        {
                            "sourceVolume": "myEfsVolume",
                            "containerPath": "/data",
                            "readOnly": true
                        }
                    ],
                    "volumes": [
                        {
                            "name": "myEfsVolume",
                            "efsVolumeConfiguration": {
                                "fileSystemId": "fs-xxxxxxxx",
                                "rootDirectory": "/path/to/my/data",
                            }
                        }
                    ],
                    "environment": [],
                    "ulimits": [],
                    "instanceType": "dl1.24xlarge",
                    "linuxParameters": {
                        "devices": [
                            {
                                "hostPath": "/dev/infiniband/uverbs0",
                                "containerPath": "/dev/infiniband/uverbs0",
                                "permissions": [
                                    "READ",
                                    "WRITE",
                                    "MKNOD"
                                ]
                            },
                            {
                                "hostPath": "/dev/infiniband/uverbs1",
                                "containerPath": "/dev/infiniband/uverbs1",
                                "permissions": [
                                    "READ",
                                    "WRITE",
                                    "MKNOD"
                                ]
                            },
                            {
                                "hostPath": "/dev/infiniband/uverbs2",
                                "containerPath": "/dev/infiniband/uverbs2",
                                "permissions": [
                                    "READ",
                                    "WRITE",
                                    "MKNOD"
                                ]
                            },
                            {
                                "hostPath": "/dev/infiniband/uverbs3",
                                "containerPath": "/dev/infiniband/uverbs3",
                                "permissions": [
                                    "READ",
                                    "WRITE",
                                    "MKNOD"
                                ]
                            }
                        ]
                    },
                    "privileged": true
                }
            }
        ]
    }
}

Placeholder       Replace with
-----------       ------------
IMAGE_NAME        xxxxxxxxx.dkr.ecr.us-west-2.amazonaws.com/resnet50_keras_batch_training:v1
TASK_EXEC_ROLE    arn:aws:iam::xxxxxxx:role/ecsTaskExecutionRole
EFS_ID            fs-xxxxxxxx

  2. Run the following aws-cli command to create the job definition:

aws batch register-job-definition --cli-input-json file://resnet50_keras_jd.json

# Expected Results
{
    "jobDefinitionName": "resnet50_jd",
    "jobDefinitionArn": "arn:aws:batch:us-west-2:xxxxxxxxxxxx:job-definition/resnet50_jd:1",
    "revision": 1
}

Submit AWS Batch Job

Run the following aws-cli command to submit a job:

aws batch submit-job --job-name resnet50_batch --job-definition resnet50_jd --job-queue dl1_mnp_jq --node-overrides numNodes=2

# Expected Results
{
    "jobArn": "arn:aws:batch:us-west-2:xxxxxxxxxxxx:job/a434b6e9-5fda-415d-befb-079b04c95a97",
    "jobName": "resnet50_batch",
    "jobId": "a434b6e9-5fda-415d-befb-079b04c95a97"
}
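
The returned jobId can also be used to poll the job's status from the CLI, for example:

# Query just the status field of the submitted job (ID taken from the output above)
aws batch describe-jobs --jobs a434b6e9-5fda-415d-befb-079b04c95a97 \
    --query "jobs[0].status"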

Note

Jobs can also be submitted and their status viewed through the AWS Batch console.

Observe Submitted AWS Batch Job Logs

AWS Batch creates a log for each job, hosted in Amazon CloudWatch. Follow View Log Data Sent to CloudWatch Logs for specific instructions.
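
For a quick look from the command line, AWS Batch jobs write to the /aws/batch/job log group by default, which can be tailed with AWS CLI v2 (the region below is an assumption; use the one your job queue runs in):

# Stream the AWS Batch job logs as they arrive
aws logs tail /aws/batch/job --follow --region us-west-2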