Advanced Model Training Example: Run ResNet Multi-node Cluster

The instructions below describe how to set up a ResNet dataset and run distributed training on a multi-node cluster. The ResNet model from Model References requires additional dependencies to be installed.

Build and Store Custom Docker Image

Create a Dockerfile with the following content:

```
FROM vault.habana.ai/gaudi-docker/1.21.2/ubuntu22.04/habanalabs/pytorch-installer-2.6.0:latest

# Clone Model References and install the ResNet dependencies
RUN git clone -b 1.21.0 https://github.com/HabanaAI/Model-References.git /Model-References && \
    python3 -m pip install -r /Model-References/PyTorch/computer_vision/classification/torchvision/requirements.txt
```

Build the image and push it to Amazon Elastic Container Registry (ECR) so that the EC2 instances in the cluster can pull it. For further information on how to build and push an image to ECR, refer to Amazon ECR Getting Started Guide.
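The build-and-push step could look roughly like the sketch below. The account ID, region, repository name, and tag are hypothetical placeholders, and the sketch assumes the ECR repository already exists and that the AWS CLI and Docker are installed and configured. `build_and_push` is only defined here, not executed.

```shell
# Sketch: build the custom image and push it to Amazon ECR.
# Account ID, region, repository name, and tag are hypothetical placeholders.

# Compose the full ECR image URI from its parts.
ecr_image_uri() {
  echo "$1.dkr.ecr.$2.amazonaws.com/$3:$4"
}

# Log in to ECR, build from the Dockerfile in the current directory, and push.
# Assumes the ECR repository already exists.
build_and_push() {
  account_id="$1"; region="$2"; repo="$3"; tag="$4"
  registry="${account_id}.dkr.ecr.${region}.amazonaws.com"
  image="$(ecr_image_uri "$account_id" "$region" "$repo" "$tag")"

  aws ecr get-login-password --region "$region" \
    | docker login --username AWS --password-stdin "$registry" &&
  docker build -t "$image" . &&
  docker push "$image"
}

# Example (not run here):
#   build_and_push 123456789012 us-west-2 resnet-gaudi 1.21.2
```

The URI printed by `ecr_image_uri` is the value to substitute for <Custom Docker Image> in the training job definition.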

Upload Data to Elastic File System (EFS)

Follow the instructions for ResNet Data Generation and upload the resulting dataset to Amazon EFS. For more details on how to use EFS, refer to Amazon EFS Getting Started Guide. Storing the dataset on EFS makes it accessible to every node in the cluster.
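One possible way to place the dataset on EFS is to mount the file system over NFS on a host inside the same VPC and copy the files across. The sketch below assumes that approach. The file system ID, region, source directory, and mount point are hypothetical placeholders, and the target subdirectory is chosen to match the DATA_DIR used by the training job, assuming the EFS root ends up mounted at /data inside the pods. `upload_dataset` is only defined here, not executed.

```shell
# Sketch: copy the generated dataset onto EFS from a host that can reach it.
# File system ID, region, and local paths are hypothetical placeholders.

# Regional DNS name of an EFS file system.
efs_dns_name() {
  echo "$1.efs.$2.amazonaws.com"
}

# Directory on the EFS mount that corresponds to DATA_DIR in the training job,
# assuming the EFS root is mounted at /data inside the pods.
efs_dataset_dir() {
  echo "${1%/}/pytorch/imagenet/ILSVRC2012"
}

# Mount the file system over NFS and copy the dataset into place.
upload_dataset() {
  fs_id="$1"; region="$2"; src_dir="$3"; mount_point="$4"
  sudo mkdir -p "$mount_point" &&
  sudo mount -t nfs4 \
    -o nfsvers=4.1,rsize=1048576,wsize=1048576,hard,timeo=600,retrans=2,noresvport \
    "$(efs_dns_name "$fs_id" "$region"):/" "$mount_point" &&
  sudo mkdir -p "$(efs_dataset_dir "$mount_point")" &&
  sudo rsync -a "${src_dir%/}/" "$(efs_dataset_dir "$mount_point")/"
}

# Example (not run here):
#   upload_dataset fs-0123456789abcdef0 us-west-2 ./ILSVRC2012 /mnt/efs
```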

Enable EFS CSI Driver on EKS

To set up EFS on EKS, refer to EKS EFS CSI Driver Installation Guide.
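After the driver is installed, a quick check that it is registered in the cluster can save debugging time later. The sketch below is one way to do that. It only inspects the names printed by `kubectl get csidriver -o name` and looks for efs.csi.aws.com, the driver name used by the storage definitions in the next step. The helper name is made up for illustration.

```shell
# Sketch: confirm the EFS CSI driver is registered before creating volumes.

# Reads `kubectl get csidriver -o name` output on stdin and succeeds
# if the EFS driver is listed.
has_efs_csi_driver() {
  grep -q 'efs\.csi\.aws\.com'
}

# Example (not run here):
#   kubectl get csidriver -o name | has_efs_csi_driver && echo "EFS CSI driver found"
```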

Launch EFS on EKS

Create a storage.yaml file with the content below. It defines a StorageClass, a PersistentVolume, and a PersistentVolumeClaim that expose the EFS file system to multiple pods at once (ReadWriteMany). For more information on how Persistent Volume functions, refer to Kubernetes Persistent Volume Guide.

```
---
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: efs-sc
provisioner: efs.csi.aws.com
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: efs-pv
spec:
  capacity:
    storage: 150Gi
  volumeMode: Filesystem
  accessModes:
    - ReadWriteMany
  persistentVolumeReclaimPolicy: Retain
  storageClassName: efs-sc
  csi:
    driver: efs.csi.aws.com
    volumeHandle: fs-05af1ea276164472d
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: efs-claim
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: efs-sc
  resources:
    requests:
      storage: 150Gi
---
```

Set the volumeHandle parameter to the ID of your Elastic File System.
Create the StorageClass, Persistent Volume, and Persistent Volume Claim:
```
kubectl apply -f storage.yaml
```
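The volumeHandle update can also be scripted. The sketch below is one way to do it, using a hypothetical helper, `set_volume_handle`, that rewrites only the volumeHandle line and keeps its indentation. The file system ID in the example is a placeholder. The commented commands show how the result might be applied and checked.

```shell
# Sketch: point storage.yaml at your own file system, then check the result.
# The file system ID passed in is a placeholder.

# Rewrite the volumeHandle line, keeping its indentation.
set_volume_handle() {
  file="$1"; fs_id="$2"
  tmp="$(mktemp)" &&
  sed "s/^\([[:space:]]*volumeHandle:\).*/\1 ${fs_id}/" "$file" > "$tmp" &&
  cat "$tmp" > "$file" &&
  rm -f "$tmp"
}

# Example (not run here):
#   set_volume_handle storage.yaml fs-0123456789abcdef0
#   kubectl apply -f storage.yaml
#   kubectl get pv efs-pv        # STATUS is expected to become Bound
#   kubectl get pvc efs-claim    # STATUS is expected to become Bound
```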

Launch ResNet Training

For details on how to run ResNet multi-server training, refer to the README of the model code. Create a resnet.yaml file with the content below:

```
apiVersion: kubeflow.org/v2beta1
kind: MPIJob
metadata:
  name: resnet50-2wkr
spec:
  slotsPerWorker: 8
  runPolicy:
    cleanPodPolicy: Running
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        spec:
          imagePullSecrets:
            - name: private-registry
          terminationGracePeriodSeconds: 0
          containers:
            - image: <Custom Docker Image>
              imagePullPolicy: Always
              name: resnet-launcher
              command: ["/bin/bash", "-c"]
              args:
                - >-
                  /usr/bin/ssh-keygen -A;
                  /usr/sbin/sshd;

                  HOSTSFILE=$OMPI_MCA_orte_default_hostfile;
                  MASTER_ADDR="$(head -n 1 $HOSTSFILE | sed -n s/[[:space:]]slots.*//p)";

                  NUM_NODES=$(wc -l < $HOSTSFILE);
                  CARDS_PER_NODE=8;
                  N_CARDS=$((NUM_NODES*CARDS_PER_NODE));

                  MODEL_PATH=/Model-References/PyTorch/computer_vision/classification/torchvision/;
                  DATA_DIR=/data/pytorch/imagenet/ILSVRC2012;

                  CMD="python $MODEL_PATH/train.py \
                    --data-path=${DATA_DIR} \
                    --model=resnet50 \
                    --device=hpu \
                    --batch-size=256 \
                    --epochs=90 \
                    --print-freq=1 \
                    --output-dir=. \
                    --seed=123 \
                    --autocast \
                    --custom-lr-values 0.275 0.45 0.625 0.8 0.08 0.008 0.0008 \
                    --custom-lr-milestones 1 2 3 4 30 60 80 \
                    --deterministic \
                    --dl-time-exclude=False";

                    mpirun -np ${N_CARDS} \
                        --allow-run-as-root \
                        --prefix $MPI_ROOT \
                        --map-by ppr:4:socket:PE=6 \
                        --bind-to core \
                        -x PYTHONPATH="/usr/lib/habanalabs:/Model-References" \
                        -x RDMAV_FORK_SAFE=1 \
                        -x FI_EFA_USE_DEVICE_RDMA=1 \
                        -x MASTER_ADDR=$MASTER_ADDR \
                        $CMD;
              resources:
                requests:
                  cpu: "100m"
              volumeMounts:
                - mountPath: /data
                  name: persistent-storage
          volumes:
          - name: persistent-storage
            persistentVolumeClaim:
              claimName: efs-claim
    Worker:
      replicas: 2
      template:
        spec:
          imagePullSecrets:
            - name: private-registry
          hostIPC: true
          containers:
            - image: <Custom Docker Image>
              name: resnet-worker
              resources:
                requests:
                  habana.ai/gaudi: 8
                  hugepages-2Mi: 42000Mi
                  cpu: 90
                  vpc.amazonaws.com/efa: 4
                limits:
                  habana.ai/gaudi: 8
                  hugepages-2Mi: 42000Mi
                  cpu: 90
                  vpc.amazonaws.com/efa: 4
              volumeMounts:
                - mountPath: /data
                  name: persistent-storage
              command: ["/bin/bash", "-c"]
              args:
                - >-
                  /usr/bin/ssh-keygen -A;
                  /usr/sbin/sshd;
                  sleep 365d;
          volumes:
          - name: persistent-storage
            persistentVolumeClaim:
              claimName: efs-claim
```
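The launcher script derives the master address and the total number of cards from the MPI hostfile. The sketch below reproduces that logic outside the cluster so it can be checked in isolation. The hostnames in the sample hostfile are made-up examples of the "<host> slots=<n>" entries the script expects, and `derive_topology` is a hypothetical helper name.

```shell
# Sketch: how the launcher derives the job topology from the MPI hostfile.
# The hostnames below are made-up examples of "<host> slots=<n>" entries.

derive_topology() {
  hostsfile="$1"; cards_per_node="$2"
  # The first host, with the trailing " slots=..." removed, acts as the master.
  master_addr="$(head -n 1 "$hostsfile" | sed -n 's/[[:space:]]slots.*//p')"
  # One hostfile line per worker node.
  num_nodes=$(( $(wc -l < "$hostsfile") ))
  n_cards=$(( num_nodes * cards_per_node ))
  echo "MASTER_ADDR=${master_addr} NUM_NODES=${num_nodes} N_CARDS=${n_cards}"
}

sample="$(mktemp)"
printf '%s\n' \
  'resnet50-2wkr-worker-0 slots=8' \
  'resnet50-2wkr-worker-1 slots=8' > "$sample"
derive_topology "$sample" 8
# prints: MASTER_ADDR=resnet50-2wkr-worker-0 NUM_NODES=2 N_CARDS=16
rm -f "$sample"
```

With two workers and eight cards per worker, mpirun is therefore started with -np 16. Changing the Worker replica count changes the number of hostfile lines and, with it, the rank count.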

Update the parameters listed below to run the desired configuration:

| Parameter | Description |
|---|---|
| <Custom Docker Image> | Image with ResNet dependencies installed |
| replicas: 2 (under Worker) | Number of DL1 instances used for training |

Run the job:
```
kubectl apply -f resnet.yaml
```
Check the job status:
```
kubectl get pods -A
```
Retrieve the name of the launcher pod created for the job and run the following command to see the training output:
```
kubectl logs <pod-name>
```
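Picking the right pod name out of the listing can be scripted. The sketch below assumes the launcher pod's name contains "<job-name>-launcher", which should be checked against the actual pod listing. `launcher_pod` is a hypothetical helper that filters the output of `kubectl get pods -o name`.

```shell
# Sketch: pick the launcher pod out of `kubectl get pods -o name` output.
# Assumes the launcher pod name contains "<job-name>-launcher".
launcher_pod() {
  grep -- "$1-launcher" | head -n 1
}

# Example (not run here), using the job name from resnet.yaml:
#   pod="$(kubectl get pods -o name | launcher_pod resnet50-2wkr)"
#   kubectl logs -f "$pod"
```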

Gaudi Documentation 1.21.1 documentation
