Advanced Model Training Batch Example: ResNet50

An advanced use case involves a large dataset. Amazon Elastic File System (EFS) provides simple, serverless, scalable storage that is well suited to holding training datasets.

This section focuses on:

  • Creating a Dockerfile with ResNet50 dependencies

  • Updating AWS Batch shell scripts to run ResNet50 training scripts from Model-References

  • Building and pushing the Batch ResNet50 image to ECR

  • Creating and submitting a Job Definition

The ResNet50 model requires additional dependencies to be installed. The instructions below cover only setting up a ResNet dataset and creating and submitting a ResNet50 job on AWS Batch.

ResNet50 Docker Image

The standard Dockerfile example needs modifications to properly run ResNet50. The modified Dockerfile is shown below.

  FROM vault.habana.ai/gaudi-docker/1.15.1/ubuntu22.04/habanalabs/pytorch-installer-2.2.0:latest

  ENV HOME /root
  RUN echo $HOME

  #################################
  # Clone Model References
  #################################
  RUN git clone -b 1.15.1 https://github.com/HabanaAI/Model-References.git /Model-References && \
  python3 -m pip install -r /Model-References/PyTorch/computer_vision/classification/torchvision/requirements.txt

  ENV PYTHONPATH=$PYTHONPATH:/Model-References


  #################################
  # Supervisord Install
  #################################
  RUN apt-get update
  RUN pip install supervisor

  #################################
  # SSH Setup
  #################################
  ENV SSHDIR $HOME/.ssh
  RUN mkdir -p ${SSHDIR} \
  && touch ${SSHDIR}/sshd_config \
  && ssh-keygen -t rsa -f ${SSHDIR}/ssh_host_rsa_key -N '' \
  && cp ${SSHDIR}/ssh_host_rsa_key.pub ${SSHDIR}/authorized_keys \
  && cp ${SSHDIR}/ssh_host_rsa_key ${SSHDIR}/id_rsa \
  && echo "       IdentityFile ${SSHDIR}/id_rsa" >> ${SSHDIR}/config \
  && echo "       StrictHostKeyChecking no" >> ${SSHDIR}/config \
  && echo "       Port 3022" >> ${SSHDIR}/config \
  && echo 'Port 3022' >> ${SSHDIR}/sshd_config \
  && echo "HostKey ${SSHDIR}/ssh_host_rsa_key" >> ${SSHDIR}/sshd_config \
  && echo "PidFile ${SSHDIR}/sshd.pid" >> ${SSHDIR}/sshd_config \
  && chmod -R 600 ${SSHDIR}/*
  RUN eval `ssh-agent -s` && ssh-add ${SSHDIR}/id_rsa

  #################################
  # Copy Necessary Scripts
  #################################
  COPY entry-point.sh /entry-point.sh
  COPY run_batch.sh /run_batch.sh
  COPY run_resnet50.sh /run_resnet50.sh
  COPY supervisord.conf /conf/supervisord.conf

  CMD /entry-point.sh

ResNet50 Training Script

Save the following code as run_resnet50.sh. This will launch a distributed ResNet50 training example through mpirun.

#!/bin/bash

###############################################################################################
# Example: Training ResNet50 on multiple nodes with 8 cards each
###############################################################################################

HOSTSFILE=$HOSTSFILE;
MASTER_ADDR=$MASTER_IP;

NUM_NODES=$NUM_NODES;
CARDS_PER_NODE=8;
N_CARDS=$((NUM_NODES*CARDS_PER_NODE));

MODEL_PATH=/Model-References/PyTorch/computer_vision/classification/torchvision/;
DATA_DIR=/data/pytorch/imagenet/ILSVRC2012;

CMD="python $MODEL_PATH/train.py \
        --data-path=${DATA_DIR} \
        --model=resnet50 \
        --device=hpu \
        --batch-size=256 \
        --epochs=90 \
        --print-freq=1 \
        --output-dir=. \
        --seed=123 \
        --autocast \
        --custom-lr-values 0.275 0.45 0.625 0.8 0.08 0.008 0.0008 \
        --custom-lr-milestones 1 2 3 4 30 60 80 \
        --deterministic \
        --dl-time-exclude=False";

# Configure multinode
if [ "$NUM_NODES" -ne "1" -a -f "$HOSTSFILE" ]
then
    MULTINODE_CMD="--hostfile $HOSTSFILE";
fi

mpirun -np ${N_CARDS} \
    --allow-run-as-root \
    $MULTINODE_CMD \
    --prefix $MPI_ROOT \
    --mca plm_rsh_args "-p 3022" \
    --map-by ppr:4:socket:PE=6 \
    --bind-to core \
    -x PYTHONPATH="/usr/lib/habanalabs:/Model-References" \
    -x RDMAV_FORK_SAFE=1 \
    -x FI_EFA_USE_DEVICE_RDMA=1 \
    -x MASTER_ADDR=$MASTER_ADDR \
    $CMD;
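
Before calling mpirun, the script derives two values: the total number of ranks and the optional hostfile argument. The following is a minimal sketch of that logic as standalone functions, which may help when checking what a given node count would produce without launching training. The function names are hypothetical and are not part of Model-References.

```shell
#!/bin/bash
# Hypothetical helpers mirroring the arithmetic in run_resnet50.sh.

# Total MPI ranks: one per Gaudi card, 8 cards per node unless overridden.
total_cards() {
    local num_nodes=$1
    local cards_per_node=${2:-8}
    echo $((num_nodes * cards_per_node))
}

# Print the hostfile argument only for multi-node runs with an existing hostfile.
multinode_args() {
    local num_nodes=$1
    local hostsfile=$2
    if [ "$num_nodes" -ne 1 ] && [ -f "$hostsfile" ]; then
        echo "--hostfile $hostsfile"
    fi
}
```

For the two-node job definition used later in this section, total_cards 2 yields 16 ranks.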

Update run_batch.sh Script to Run ResNet50 Training

The wait_for_nodes() function in run_batch.sh has an area that can be treated as a template for running other distributed training models. Update this area to run run_resnet50.sh:

MASTER_IP=$ip HOSTSFILE=$HOST_FILE_PATH-deduped NUM_NODES=$lines /run_resnet50.sh
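
The HOSTSFILE and NUM_NODES values passed here come from a de-duplicated host file and its line count. This section does not show how run_batch.sh produces them, so the following is only an assumed sketch of such a step. The function name and the use of sort -u are illustrative, not taken from the actual script.

```shell
#!/bin/bash
# Assumed sketch: write "<path>-deduped" with unique host entries
# and print how many unique entries (nodes) it contains.
dedupe_hostfile() {
    local host_file_path=$1
    sort -u "$host_file_path" > "${host_file_path}-deduped"
    # Arithmetic expansion strips any padding that wc adds around the number.
    echo $(( $(wc -l < "${host_file_path}-deduped") ))
}
```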

Build and Push Image to ECR

Build a resnet50_batch_training image and push it to Amazon Elastic Container Registry (ECR) so that EC2 instances can pull it easily. For further information on how to build and push an image to ECR, refer to Create Elastic Container Registry (ECR) and Upload Images or to the Amazon ECR Getting Started Guide.
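
As a rough illustration of the build-and-push flow, the sketch below composes the image URI and wraps the usual login, build, and push commands in a function. It assumes the ECR repository already exists, that the Dockerfile and scripts are in the current directory, and that the AWS CLI and Docker are configured. The function names and the account ID, region, and tag arguments are placeholders introduced here.

```shell
#!/bin/bash
# Sketch only: account ID, region, repository and tag are caller-supplied placeholders.

# Compose the ECR image URI: <account>.dkr.ecr.<region>.amazonaws.com/<repo>:<tag>
ecr_image_uri() {
    local account_id=$1 region=$2 repo=$3 tag=$4
    echo "${account_id}.dkr.ecr.${region}.amazonaws.com/${repo}:${tag}"
}

# Log in to ECR, then build and push the image from the current directory.
build_and_push() {
    local account_id=$1 region=$2 repo=$3 tag=$4
    local uri
    uri=$(ecr_image_uri "$account_id" "$region" "$repo" "$tag")
    # ${uri%%/*} keeps only the registry host for docker login.
    aws ecr get-login-password --region "$region" |
        docker login --username AWS --password-stdin "${uri%%/*}" &&
        docker build -t "$uri" . &&
        docker push "$uri"
}

# Example (not executed here):
#   build_and_push <account-id> us-west-2 resnet50_batch_training v1
```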

Upload Data to Elastic File System (EFS)

Follow the instructions for ResNet Data Generation and upload the resulting dataset to Amazon EFS so that every node in the cluster can read the training data. For more details on how to use EFS, refer to the Amazon EFS Getting Started Guide.
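
run_resnet50.sh reads the dataset from /data/pytorch/imagenet/ILSVRC2012, so the uploaded files need to appear at that path once the EFS volume is mounted at /data. The sketch below checks a directory before a job is submitted. It assumes the training script expects train and val subdirectories under the data path, which should be confirmed against the Model-References documentation. The function name is hypothetical.

```shell
#!/bin/bash
# Assumed layout: <data_dir>/train and <data_dir>/val.
# Adjust the list of splits if the Model-References README says otherwise.
check_dataset_layout() {
    local data_dir=$1
    local split
    for split in train val; do
        if [ ! -d "${data_dir}/${split}" ]; then
            echo "missing directory: ${data_dir}/${split}" >&2
            return 1
        fi
    done
    echo "dataset layout OK: ${data_dir}"
}

# Example (not executed here), run on a host with EFS mounted at /data:
#   check_dataset_layout /data/pytorch/imagenet/ILSVRC2012
```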

Create AWS Batch Job Definition

Using Amazon EFS with AWS Batch requires specifying the volume and mount point configurations in the job definition.

  1. Create resnet50_jd.json with the following configuration and update the placeholders:

{
    "jobDefinitionName": "resnet50_jd",
    "type": "multinode",
    "nodeProperties": {
        "numNodes": 2,
        "mainNode": 0,
        "nodeRangeProperties": [
            {
                "targetNodes": "0:",
                "container": {
                    "image": "IMAGE_NAME",
                    "command": [],
                    "jobRoleArn": "TASK_EXEC_ROLE",
                    "resourceRequirements": [
                        {
                            "type": "MEMORY",
                            "value": "760000"
                        },
                        {
                            "type": "VCPU",
                            "value": "96"
                        }
                    ],
                    "mountPoints": [
                        {
                            "sourceVolume": "myEfsVolume",
                            "containerPath": "/data",
                            "readOnly": true
                        }
                    ],
                    "volumes": [
                        {
                            "name": "myEfsVolume",
                            "efsVolumeConfiguration": {
                                "fileSystemId": "fs-xxxxxxxx",
                                "rootDirectory": "/path/to/my/data"
                            }
                        }
                    ],
                    "environment": [],
                    "ulimits": [],
                    "instanceType": "dl1.24xlarge",
                    "linuxParameters": {
                        "devices": [
                            {
                                "hostPath": "/dev/infiniband/uverbs0",
                                "containerPath": "/dev/infiniband/uverbs0",
                                "permissions": [
                                    "READ",
                                    "WRITE",
                                    "MKNOD"
                                ]
                            },
                            {
                                "hostPath": "/dev/infiniband/uverbs1",
                                "containerPath": "/dev/infiniband/uverbs1",
                                "permissions": [
                                    "READ",
                                    "WRITE",
                                    "MKNOD"
                                ]
                            },
                            {
                                "hostPath": "/dev/infiniband/uverbs2",
                                "containerPath": "/dev/infiniband/uverbs2",
                                "permissions": [
                                    "READ",
                                    "WRITE",
                                    "MKNOD"
                                ]
                            },
                            {
                                "hostPath": "/dev/infiniband/uverbs3",
                                "containerPath": "/dev/infiniband/uverbs3",
                                "permissions": [
                                    "READ",
                                    "WRITE",
                                    "MKNOD"
                                ]
                            }
                        ]
                    },
                    "privileged": true
                }
            }
        ]
    }
}

Placeholder        Replace with
IMAGE_NAME         xxxxxxxxx.dkr.ecr.us-west-2.amazonaws.com/resnet50_batch_training:v1
TASK_EXEC_ROLE     arn:aws:iam::xxxxxxx:role/ecsTaskExecutionRole
EFS_ID             fs-xxxxxxxx
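
The placeholders can be edited by hand, or filled in with a small helper such as the sketch below. The function name is hypothetical. The helper replaces every occurrence of a literal token in the file and assumes the replacement value contains no | or & characters, which sed would treat specially.

```shell
#!/bin/bash
# Replace every occurrence of one placeholder token in a file, in place.
# '|' is used as the sed delimiter because image URIs and ARNs contain '/' and ':'.
fill_placeholder() {
    local file=$1 placeholder=$2 value=$3
    sed "s|${placeholder}|${value}|g" "$file" > "${file}.tmp" &&
        mv "${file}.tmp" "$file"
}

# Example (not executed here):
#   fill_placeholder resnet50_jd.json IMAGE_NAME xxxxxxxxx.dkr.ecr.us-west-2.amazonaws.com/resnet50_batch_training:v1
#   fill_placeholder resnet50_jd.json TASK_EXEC_ROLE arn:aws:iam::xxxxxxx:role/ecsTaskExecutionRole
```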

  2. Run the following AWS CLI command to create the job definition:

aws batch register-job-definition --cli-input-json file://resnet50_jd.json

# Expected Results
{
    "jobDefinitionName": "resnet50_jd",
    "jobDefinitionArn": "arn:aws:batch:us-west-2:xxxxxxxxxxxx:job-definition/resnet50_jd:1",
    "revision": 1
}

Submit AWS Batch Job

Run the following AWS CLI command to submit the job:

aws batch submit-job --job-name resnet50_batch --job-definition resnet50_jd --job-queue dl1_mnp_jq --node-overrides numNodes=2

# Expected Results
{
    "jobArn": "arn:aws:batch:us-west-2:xxxxxxxxxxxx:job/a434b6e9-5fda-415d-befb-079b04c95a97",
    "jobName": "resnet50_batch",
    "jobId": "a434b6e9-5fda-415d-befb-079b04c95a97"
}
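
After submission, the job moves through several states before it finishes. The sketch below polls the job status with aws batch describe-jobs until it reaches SUCCEEDED or FAILED. The function names and the default 30-second interval are choices made for this example, and configured AWS credentials are assumed.

```shell
#!/bin/bash
# Query the current status of a Batch job (requires configured AWS credentials).
get_job_status() {
    aws batch describe-jobs --jobs "$1" --query 'jobs[0].status' --output text
}

# Poll until the job reaches a terminal state.
# Returns 0 on SUCCEEDED, 1 on FAILED, 2 if the status query itself fails.
wait_for_job() {
    local job_id=$1
    local interval=${2:-30}
    local status
    while true; do
        status=$(get_job_status "$job_id") || return 2
        echo "job ${job_id}: ${status}"
        case "$status" in
            SUCCEEDED) return 0 ;;
            FAILED) return 1 ;;
        esac
        sleep "$interval"
    done
}

# Example (not executed here):
#   wait_for_job a434b6e9-5fda-415d-befb-079b04c95a97
```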

Note

Jobs can also be submitted, and their status viewed, through the AWS Batch console.

Observe Submitted AWS Batch Job Logs

AWS Batch creates a log for each job, which is hosted in CloudWatch. Follow View Log Data sent to CloudWatch Logs for specific instructions.
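
The logs can also be fetched from the command line. The sketch below rests on two assumptions that should be checked against your setup: that each node of a multi-node job is addressable as <jobId>#<nodeIndex>, and that the job writes to the default /aws/batch/job log group. The function names are hypothetical.

```shell
#!/bin/bash
# Assumption: each node of a multi-node job is addressable as "<jobId>#<nodeIndex>".
node_job_id() {
    echo "${1}#${2}"
}

# Print the log messages for one node of a job (node 0 by default).
# Assumption: the job logs to the default log group /aws/batch/job.
print_node_logs() {
    local job_id=$1 node_index=${2:-0}
    local stream
    stream=$(aws batch describe-jobs \
        --jobs "$(node_job_id "$job_id" "$node_index")" \
        --query 'jobs[0].container.logStreamName' --output text) || return 1
    aws logs get-log-events --log-group-name /aws/batch/job \
        --log-stream-name "$stream" \
        --query 'events[*].message' --output text
}

# Example (not executed here):
#   print_node_logs a434b6e9-5fda-415d-befb-079b04c95a97 0
```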