Advanced Model Training Example: Run ResNet Keras on a Multi-node Cluster

The ResNet Keras model from Model-References requires its dependencies to be installed. The instructions below describe how to set up the ResNet Keras dataset and perform distributed training on a multi-node cluster.

Build and Store Custom Docker Image

  1. To train through EFA, install hccl_ofi_wrapper (the Dockerfile in the next step installs it). This package interacts with libfabric and utilizes the underlying hardware and networking mode. For further information, refer to Scale out Host NIC OFI.

  2. Create a Dockerfile with the following content:

FROM vault.habana.ai/gaudi-docker/1.7.1/ubuntu20.04/habanalabs/tensorflow-installer-tf-cpu-2.10.1:latest

# Clones Model-References and installs Resnet Keras dependencies
RUN git clone -b 1.7.1 https://github.com/HabanaAI/Model-References.git /Model-References && \
  cd /Model-References/TensorFlow/computer_vision/Resnets/resnet_keras && \
  python3 -m pip install -r requirements.txt

# Installs hccl_ofi_wrapper to interact with libfabric to utilize HW and networking mode (EFA)
ARG OFI_WRAPPER_WS="/root/hccl_ofi_wrapper"
RUN git clone "https://github.com/HabanaAI/hccl_ofi_wrapper.git" "${OFI_WRAPPER_WS}" && \
  cd "${OFI_WRAPPER_WS}" && \
  ln -s /opt/amazon/efa/lib64 /opt/amazon/efa/lib && \
  LIBFABRIC_ROOT=/opt/amazon/efa make

ENV LD_LIBRARY_PATH=$LD_LIBRARY_PATH:"${OFI_WRAPPER_WS}"

  3. Build and push the image to AWS’s Elastic Container Registry (ECR) for ease of access on EC2 instances. For further information on how to build and push an image to ECR, refer to the Amazon ECR Getting Started Guide.
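
The following is a minimal sketch of building and pushing the image; the account ID, region, and repository name are placeholders and must be replaced with your own values:

# Placeholder values -- replace with your own account ID, region, and repository name
ACCOUNT=123456789012
REGION=us-west-2
REPO=resnet-keras-gaudi

# Create the repository once (skip if it already exists)
aws ecr create-repository --repository-name ${REPO} --region ${REGION}

# Authenticate Docker with your private ECR registry
aws ecr get-login-password --region ${REGION} | \
  docker login --username AWS --password-stdin ${ACCOUNT}.dkr.ecr.${REGION}.amazonaws.com

# Build, tag, and push the custom image
docker build -t ${REPO}:latest .
docker tag ${REPO}:latest ${ACCOUNT}.dkr.ecr.${REGION}.amazonaws.com/${REPO}:latest
docker push ${ACCOUNT}.dkr.ecr.${REGION}.amazonaws.com/${REPO}:latest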

Upload Data to Elastic File System (EFS)

Follow the instructions for ResNet Keras Data Generation and upload the generated dataset to Amazon EFS. For more details on how to use EFS, refer to the Amazon EFS Getting Started Guide. This gives the cluster easy access to the training dataset.
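
One way to copy the dataset is to mount the file system on the instance where the TFRecords were generated and place them under the path expected by the training job (/data is mapped to the EFS root in the worker spec below). This is a minimal sketch; the file system ID, region, and source path are placeholders:

# Mount the EFS file system over NFS (requires the NFS client, e.g. nfs-common/nfs-utils)
sudo mkdir -p /mnt/efs
sudo mount -t nfs4 -o nfsvers=4.1 fs-05af1ea276164472d.efs.us-west-2.amazonaws.com:/ /mnt/efs

# Copy the generated TFRecords to the path used by --data_dir in the training job
sudo mkdir -p /mnt/efs/tensorflow/imagenet2012
sudo cp -r /path/to/tf_records /mnt/efs/tensorflow/imagenet2012/tf_records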

Enable EFS CSI Driver on EKS

To set up EFS on EKS, refer to the EKS EFS CSI Driver Installation Guide.
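
One common installation path uses the driver’s Helm chart; this is a minimal sketch and assumes the IAM permissions required by the driver are configured as described in the installation guide:

helm repo add aws-efs-csi-driver https://kubernetes-sigs.github.io/aws-efs-csi-driver/
helm repo update
helm upgrade --install aws-efs-csi-driver aws-efs-csi-driver/aws-efs-csi-driver \
  --namespace kube-system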

Launch EFS on EKS

  1. Create a storage.yaml file. This file defines the storage class, persistent volume, and persistent volume claim that let the file system be accessed from multiple pods at once. For more information on how Persistent Volumes function, refer to the Kubernetes Persistent Volume Guide.

---
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: efs-sc
provisioner: efs.csi.aws.com
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: efs-pv
spec:
  capacity:
    storage: 150Gi
  volumeMode: Filesystem
  accessModes:
    - ReadWriteMany
  persistentVolumeReclaimPolicy: Retain
  storageClassName: efs-sc
  csi:
    driver: efs.csi.aws.com
    volumeHandle: fs-05af1ea276164472d
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: efs-claim
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: efs-sc
  resources:
    requests:
      storage: 150Gi
---
  2. Update the parameters listed below to run the desired configuration:

     Parameter      Description
     volumeHandle   Elastic File System ID

  3. Create the Persistent Volume by running the following command:

kubectl apply -f storage.yaml
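
Before launching training, you can confirm that the claim bound to the volume; the resource names below match the manifests above:

kubectl get pv efs-pv
kubectl get pvc efs-claim

Both efs-pv and efs-claim should report a STATUS of Bound.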

Launch ResNet Keras Training

  1. Create a resnet-keras.yaml file with the content below. Check the model code README for details on how to run ResNet Multi-Server Training.

apiVersion: kubeflow.org/v2beta1
kind: MPIJob
metadata:
  name: resnet50-keras-5e-2wkr
spec:
  slotsPerWorker: 8
  runPolicy:
    cleanPodPolicy: Running
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      restartPolicy: Never
      template:
        spec:
          imagePullSecrets:
            - name: private-registry
          terminationGracePeriodSeconds: 0
          containers:
            - image: <Custom Docker Image>
              imagePullPolicy: Always
              name: tensorflow-launcher
              command:
                - bash
                - -c
                - mpirun --allow-run-as-root --bind-to core -np 16 --map-by ppr:4:socket:PE=6 --merge-stderr-to-stdout
                  --prefix /opt/amazon/openmpi
                  -x PYTHONPATH=/Model-References:/usr/lib/habanalabs
                  -x HCL_CONFIG_PATH=/etc/hcl/worker_config.json
                  -x LD_LIBRARY_PATH=/opt/amazon/openmpi/lib:/opt/amazon/efa/lib/:/root/hccl_ofi_wrapper:${LD_LIBRARY_PATH}
                  -x HABANA_USE_PREALLOC_BUFFER_FOR_ALLREDUCE=false
                  -x TF_ENABLE_BF16_CONVERSION=1
                  -x TF_ALLOW_CONTROL_EDGES_IN_HABANA_OPS=1
                  -x HABANA_USE_STREAMS_FOR_HCL=true
                  -x TF_PRELIMINARY_CLUSTER_SIZE=200
                  -x RESNET_SIZE=50
                  -x USE_LARS_OPTIMIZER=1
                  -x DISPLAY_STEPS=100
                  -x HOROVOD_HIERARCHICAL_ALLREDUCE=0
                  -x HCCL_OVER_TCP=0
                  -x HCCL_OVER_OFI=1
                  -x FI_PROVIDER=efa
                  python3 /Model-References/TensorFlow/computer_vision/Resnets/resnet_keras/resnet_ctl_imagenet_main.py
                  --optimizer LARS
                  --dtype bf16
                  --data_dir "/data/tensorflow/imagenet2012/tf_records"
                  --steps_per_loop 1000
                  --train_steps 16000
                  --log_steps 200
                  --model_dir "/tmp/resnet/"
                  --data_loader_image_type bf16
                  --base_learning_rate 2.5
                  --warmup_epochs 3
                  --lr_schedule "polynomial"
                  --label_smoothing 0.1
                  --weight_decay 0.0001
                  --enable_tensorboard
                  --experimental_preloading
                  --single_l2_loss_op
                  --use_horovod
                  --batch_size 256
                  --train_epochs 40
                  --epochs_between_evals 40
              resources:
                requests:
                  cpu: "100m"
    Worker:
      replicas: 2
      template:
        spec:
          imagePullSecrets:
            - name: private-registry
          terminationGracePeriodSeconds: 0
          containers:
            - image: <Custom Docker Image>
              name: tensorflow-worker
              securityContext:
                capabilities:
                  add:
                    - SYS_RAWIO
                    - SYS_PTRACE
              resources:
                requests:
                  habana.ai/gaudi: 8
                  hugepages-2Mi: "21000Mi"
                  cpu: "90"
                  vpc.amazonaws.com/efa: 4
                limits:
                  habana.ai/gaudi: 8
                  hugepages-2Mi: "21000Mi"
                  cpu: "90"
                  vpc.amazonaws.com/efa: 4
              volumeMounts:
                - mountPath: /data
                  name: persistent-storage
          volumes:
          - name: persistent-storage
            persistentVolumeClaim:
              claimName: efs-claim
  2. Update the parameters listed below to run the desired configuration:

     Parameter               Description
     <Custom Docker Image>   Image with ResNet and hccl_ofi_wrapper installed
     -np 16                  Number of HPU cards for training; must equal replicas x slotsPerWorker (2 x 8 = 16)
     replicas: 2             Number of DL1 instances for training; update together with -np
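
For example, a hypothetical scale-out to four DL1 instances keeps slotsPerWorker: 8 and changes the two values together:

replicas: 4    # 4 DL1 instances
-np 32         # 4 instances x 8 Gaudi cards per instance = 32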

  3. Launch the job by running the following command:

kubectl apply -f resnet-keras.yaml

  4. Check the job status by running the following command:

kubectl get pods -A

  5. Retrieve the name of the created launcher pod and run the following command to see the results:

kubectl logs <pod-name>
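
With the MPI Operator, the training output is written by the launcher pod, whose name is derived from the MPIJob name. Assuming the default naming, following the logs looks like the sketch below (the trailing suffix is a placeholder generated by Kubernetes):

kubectl logs -f resnet50-keras-5e-2wkr-launcher-xxxxx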