Advanced Model Training Batch Example: ResNet50

An advanced use case involves a large dataset. Amazon Elastic File System (EFS) provides simple, serverless, scalable file storage that is well suited to holding the training dataset.

This section covers:

Creating a Dockerfile with ResNet50 dependencies
Updating AWS Batch shell scripts to run ResNet50 training scripts from Model-References
Building and pushing the Batch ResNet50 image to ECR
Uploading the dataset to EFS
Creating and submitting a job definition

The ResNet50 model requires additional dependencies to be installed. The instructions below cover setting up a ResNet50 dataset and creating and submitting a ResNet50 job on AWS Batch only.

ResNet50 Docker Image

The standard Dockerfile example needs modifications to run ResNet50. The ResNet50-specific parts are the Model-References clone with its requirements install, the PYTHONPATH entry, and the COPY of run_resnet50.sh.

  FROM vault.habana.ai/gaudi-docker/1.15.1/ubuntu22.04/habanalabs/pytorch-installer-2.2.0:latest

  ENV HOME /root
  RUN echo $HOME

  #################################
  # Clone Model References
  #################################
  RUN git clone -b 1.15.1 https://github.com/HabanaAI/Model-References.git /Model-References && \
  python3 -m pip install -r /Model-References/PyTorch/computer_vision/classification/torchvision/requirements.txt

  ENV PYTHONPATH=$PYTHONPATH:/Model-References


  #################################
  # Supervisord Install
  #################################
  RUN apt-get update
  RUN pip install supervisor

  #################################
  # SSH Setup
  #################################
  ENV SSHDIR $HOME/.ssh
  RUN mkdir -p ${SSHDIR} \
  && touch ${SSHDIR}/sshd_config \
  && ssh-keygen -t rsa -f ${SSHDIR}/ssh_host_rsa_key -N '' \
  && cp ${SSHDIR}/ssh_host_rsa_key.pub ${SSHDIR}/authorized_keys \
  && cp ${SSHDIR}/ssh_host_rsa_key ${SSHDIR}/id_rsa \
  && echo "       IdentityFile ${SSHDIR}/id_rsa" >> ${SSHDIR}/config \
  && echo "       StrictHostKeyChecking no" >> ${SSHDIR}/config \
  && echo "       Port 3022" >> ${SSHDIR}/config \
  && echo 'Port 3022' >> ${SSHDIR}/sshd_config \
  && echo "HostKey ${SSHDIR}/ssh_host_rsa_key" >> ${SSHDIR}/sshd_config \
  && echo "PidFile ${SSHDIR}/sshd.pid" >> ${SSHDIR}/sshd_config \
  && chmod -R 600 ${SSHDIR}/*
  RUN eval `ssh-agent -s` && ssh-add ${SSHDIR}/id_rsa

  #################################
  # Copy Necessary Scripts
  #################################
  COPY entry-point.sh /entry-point.sh
  COPY run_batch.sh /run_batch.sh
  COPY run_resnet50.sh /run_resnet50.sh
  COPY supervisord.conf /conf/supervisord.conf

  CMD /entry-point.sh

ResNet50 Training Script

Save the following script as run_resnet50.sh. It launches a distributed ResNet50 training run through mpirun.

#!/bin/bash

###############################################################################################
# Example: Training ResNet50 on multiple nodes with 8 cards each
###############################################################################################

# HOSTSFILE, MASTER_IP, and NUM_NODES are read from the environment set by run_batch.sh
HOSTSFILE=$HOSTSFILE;
MASTER_ADDR=$MASTER_IP;

NUM_NODES=$NUM_NODES;
CARDS_PER_NODE=8;
N_CARDS=$((NUM_NODES*CARDS_PER_NODE));

MODEL_PATH=/Model-References/PyTorch/computer_vision/classification/torchvision/;
DATA_DIR=/data/pytorch/imagenet/ILSVRC2012;

CMD="python $MODEL_PATH/train.py \
        --data-path=${DATA_DIR} \
        --model=resnet50 \
        --device=hpu \
        --batch-size=256 \
        --epochs=90 \
        --print-freq=1 \
        --output-dir=. \
        --seed=123 \
        --autocast \
        --custom-lr-values 0.275 0.45 0.625 0.8 0.08 0.008 0.0008 \
        --custom-lr-milestones 1 2 3 4 30 60 80 \
        --deterministic \
        --dl-time-exclude=False";

# Configure multinode
if [ "$NUM_NODES" -ne 1 ] && [ -f "$HOSTSFILE" ]
then
    MULTINODE_CMD="--hostfile $HOSTSFILE";
fi

mpirun -np ${N_CARDS} \
    --allow-run-as-root \
    $MULTINODE_CMD \
    --prefix $MPI_ROOT \
    --mca plm_rsh_args "-p 3022" \
    --map-by ppr:4:socket:PE=6 \
    --bind-to core \
    -x PYTHONPATH="/usr/lib/habanalabs:/Model-References" \
    -x RDMAV_FORK_SAFE=1 \
    -x FI_EFA_USE_DEVICE_RDMA=1 \
    -x MASTER_ADDR=$MASTER_ADDR \
    $CMD;
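
The rank count passed to mpirun follows directly from the node count. As a rough sketch of that arithmetic (the function name total_ranks is hypothetical and is not part of run_resnet50.sh), one node gives 8 ranks and two nodes give 16:

```shell
# Hypothetical helper mirroring N_CARDS=$((NUM_NODES*CARDS_PER_NODE)) in run_resnet50.sh.
total_ranks() {
    num_nodes=$1
    cards_per_node=${2:-8}   # the script above fixes this value at 8
    echo $((num_nodes * cards_per_node))
}

total_ranks 1    # single node: 8 ranks, no hostfile needed
total_ranks 2    # two nodes: 16 ranks, one per card
```

With --map-by ppr:4:socket each socket hosts 4 ranks, so reaching 8 ranks per node implies a two-socket host.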

Update run_batch.sh to Run ResNet50 Training

The wait_for_nodes() function in run_batch.sh contains a section that can be treated as a template for running other distributed training models. Update this section to run run_resnet50.sh:

MASTER_IP=$ip HOSTSFILE=$HOST_FILE_PATH-deduped NUM_NODES=$lines /run_resnet50.sh
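
The line above relies on a deduplicated hostfile and on a count of its lines. A minimal sketch of how such values could be produced is shown here; it is an assumption for illustration, since run_batch.sh may build them differently, and the helper name dedupe_hostfile and the sample addresses are hypothetical:

```shell
# Sketch only: run_batch.sh may compute these values differently.
# Write a deduplicated copy of a hostfile and print the number of unique hosts.
dedupe_hostfile() {
    host_file_path=$1
    # awk keeps the first occurrence of each non-empty line and preserves order
    awk 'NF && !seen[$0]++' "$host_file_path" > "${host_file_path}-deduped"
    # count lines; tr strips the padding that some wc implementations add
    wc -l < "${host_file_path}-deduped" | tr -d ' '
}

# Example with a throwaway hostfile that contains a repeated entry
HOST_FILE_PATH=$(mktemp)
printf '10.0.0.1 slots=8\n10.0.0.2 slots=8\n10.0.0.1 slots=8\n' > "$HOST_FILE_PATH"
lines=$(dedupe_hostfile "$HOST_FILE_PATH")
echo "NUM_NODES=$lines HOSTSFILE=${HOST_FILE_PATH}-deduped"
```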

Build and Push Image to ECR

Build and push a resnet50_batch_training image to Amazon Elastic Container Registry (ECR) so that EC2 instances can pull it. For further information on how to build and push an image to ECR, refer to Create Elastic Container Registry (ECR) and Upload Images or to the Amazon ECR Getting Started Guide.
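
As a rough illustration of the build and push step, the sketch below wraps the usual Docker and AWS CLI commands in two helpers. The helper names, the account ID, and the region are placeholders chosen for this example, and the ECR repository is assumed to exist already; the linked guides remain the reference procedure.

```shell
# Hypothetical helpers; account ID, region, repository, and tag are placeholders.
ecr_registry() {
    # $1 = AWS account ID, $2 = region
    echo "$1.dkr.ecr.$2.amazonaws.com"
}

build_and_push() {
    account_id=$1; region=$2; repo=$3; tag=$4
    registry=$(ecr_registry "$account_id" "$region")
    # Authenticate Docker against the registry
    aws ecr get-login-password --region "$region" |
        docker login --username AWS --password-stdin "$registry" || return 1
    # Build from the Dockerfile in the current directory, then push
    docker build -t "$registry/$repo:$tag" . || return 1
    docker push "$registry/$repo:$tag"
}

# Usage (requires Docker, AWS credentials, and an existing repository):
#   build_and_push 123456789012 us-west-2 resnet50_batch_training v1
```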

Upload Data to Elastic File System (EFS)

Follow the instructions in Resnet Data Generation to prepare the dataset, then upload it to Amazon EFS. For more details on how to use EFS, refer to the Amazon EFS Getting Started Guide. Storing the dataset on EFS makes it accessible to every node in the cluster.

Create AWS Batch Job Definition

Using Amazon EFS with AWS Batch requires specifying the volume and mount point configurations in the job definition.

Create resnet50_jd.json with the following configuration and update the placeholders:

{
    "jobDefinitionName": "resnet50_jd",
    "type": "multinode",
    "nodeProperties": {
        "numNodes": 2,
        "mainNode": 0,
        "nodeRangeProperties": [
            {
                "targetNodes": "0:",
                "container": {
                    "image": "IMAGE_NAME",
                    "command": [],
                    "jobRoleArn": "TASK_EXEC_ROLE",
                    "resourceRequirements": [
                        {
                            "type": "MEMORY",
                            "value": "760000"
                        },
                        {
                            "type": "VCPU",
                            "value": "96"
                        }
                    ],
                    "mountPoints": [
                        {
                            "sourceVolume": "myEfsVolume",
                            "containerPath": "/data",
                            "readOnly": true
                        }
                    ],
                    "volumes": [
                        {
                            "name": "myEfsVolume",
                            "efsVolumeConfiguration": {
                                "fileSystemId": "EFS_ID",
                                "rootDirectory": "/path/to/my/data"
                            }
                        }
                    ],
                    "environment": [],
                    "ulimits": [],
                    "instanceType": "dl1.24xlarge",
                    "linuxParameters": {
                        "devices": [
                            {
                                "hostPath": "/dev/infiniband/uverbs0",
                                "containerPath": "/dev/infiniband/uverbs0",
                                "permissions": [
                                    "READ",
                                    "WRITE",
                                    "MKNOD"
                                ]
                            },
                            {
                                "hostPath": "/dev/infiniband/uverbs1",
                                "containerPath": "/dev/infiniband/uverbs1",
                                "permissions": [
                                    "READ",
                                    "WRITE",
                                    "MKNOD"
                                ]
                            },
                            {
                                "hostPath": "/dev/infiniband/uverbs2",
                                "containerPath": "/dev/infiniband/uverbs2",
                                "permissions": [
                                    "READ",
                                    "WRITE",
                                    "MKNOD"
                                ]
                            },
                            {
                                "hostPath": "/dev/infiniband/uverbs3",
                                "containerPath": "/dev/infiniband/uverbs3",
                                "permissions": [
                                    "READ",
                                    "WRITE",
                                    "MKNOD"
                                ]
                            }
                        ]
                    },
                    "privileged": true
                }
            }
        ]
    }
}

Placeholder	Replace with
IMAGE_NAME	xxxxxxxxx.dkr.ecr.us-west-2.amazonaws.com/resnet50_batch_training:v1
TASK_EXEC_ROLE	arn:aws:iam::xxxxxxx:role/ecsTaskExecutionRole
EFS_ID	fs-xxxxxxxx
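
The substitution can also be scripted. The sketch below is one possible approach, not part of the original procedure: it assumes the placeholder tokens from the table appear verbatim in resnet50_jd.json, and the helper name fill_job_definition and the output file name are hypothetical. It also checks that the result parses as JSON before it is registered.

```shell
# Sketch: substitute the placeholder tokens from the table and validate the result.
# '|' is used as the sed delimiter because ARNs and image URIs contain '/' and ':'.
fill_job_definition() {
    template=$1; output=$2; image=$3; role_arn=$4; efs_id=$5
    sed -e "s|IMAGE_NAME|$image|g" \
        -e "s|TASK_EXEC_ROLE|$role_arn|g" \
        -e "s|EFS_ID|$efs_id|g" \
        "$template" > "$output" || return 1
    # json.tool exits non-zero on invalid JSON, for example a trailing comma
    python3 -m json.tool "$output" > /dev/null
}

# Usage:
#   fill_job_definition resnet50_jd.json resnet50_jd.filled.json \
#       <image-uri> <role-arn> <file-system-id>
```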

Run the following AWS CLI command to register the job definition:

aws batch register-job-definition --cli-input-json file://resnet50_jd.json

# Expected Results
{
    "jobDefinitionName": "resnet50_jd",
    "jobDefinitionArn": "arn:aws:batch:us-west-2:xxxxxxxxxxxx:job-definition/resnet50_jd:1",
    "revision": 1
}

Submit AWS Batch Job

Run the following AWS CLI command to submit the job:

aws batch submit-job --job-name resnet50_batch --job-definition resnet50_jd --job-queue dl1_mnp_jq --node-overrides numNodes=2

# Expected Results
{
    "jobArn": "arn:aws:batch:us-west-2:xxxxxxxxxxxx:job/a434b6e9-5fda-415d-befb-079b04c95a97",
    "jobName": "resnet50_batch",
    "jobId": "a434b6e9-5fda-415d-befb-079b04c95a97"
}

Note

Jobs can also be submitted, and their status viewed, through the AWS Batch console.
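
Job status can be polled from the command line as well. The following is a minimal sketch built on aws batch describe-jobs; the helper name wait_for_job and the POLL_SECONDS variable are hypothetical and exist only in this example.

```shell
# Sketch: poll a job until it reaches a terminal state.
# POLL_SECONDS is a hypothetical knob for this example, not an AWS setting.
wait_for_job() {
    job_id=$1
    while :; do
        status=$(aws batch describe-jobs --jobs "$job_id" \
            --query 'jobs[0].status' --output text) || return 2
        echo "job $job_id: $status"
        case $status in
            SUCCEEDED) return 0 ;;
            FAILED)    return 1 ;;
        esac
        sleep "${POLL_SECONDS:-30}"
    done
}

# Usage: wait_for_job a434b6e9-5fda-415d-befb-079b04c95a97
```

The function returns 0 when the job succeeds and 1 when it fails, so it can gate later steps in a script.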

Observe Submitted AWS Batch Job Logs

AWS Batch writes job logs to Amazon CloudWatch Logs. Refer to View Log Data Sent to CloudWatch Logs for specific instructions.
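
Logs can also be fetched with the AWS CLI. The sketch below rests on two assumptions that should be checked against your setup: that the job writes to the default /aws/batch/job log group, and that each node of a multi-node job can be addressed as the job ID followed by # and the node index. The helper name show_node_logs is hypothetical.

```shell
# Sketch: print the log messages of one node of a multi-node job.
# Assumes the default log group /aws/batch/job and the "<jobId>#<nodeIndex>" form.
show_node_logs() {
    job_id=$1
    node_index=${2:-0}
    # Look up the CloudWatch log stream recorded for that node
    stream=$(aws batch describe-jobs --jobs "${job_id}#${node_index}" \
        --query 'jobs[0].container.logStreamName' --output text) || return 1
    aws logs get-log-events --log-group-name /aws/batch/job \
        --log-stream-name "$stream" --query 'events[].message' --output text
}

# Usage: show_node_logs a434b6e9-5fda-415d-befb-079b04c95a97 0
```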

Gaudi Documentation 1.15.1 documentation
