MNIST Model Training Example: Run MPIJob on Multi-node Cluster

The following is an MNIST example for model training on Amazon EKS.

Build and Store Custom Docker Image

  1. To train over EFA, install hccl_ofi_wrapper. This package interacts with libfabric and utilizes the underlying hardware and networking mode (EFA). For further information, refer to Scale out Host NIC OFI.
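To confirm that EFA is actually available on an instance before building the image, you can query libfabric's provider list. This is only a quick sanity check, assuming the EFA software stack (which bundles libfabric) is already installed on the host:

# Should list EFA provider entries if the stack is installed correctly
fi_info -p efa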

  2. Create a Dockerfile with the following content:

FROM vault.habana.ai/gaudi-docker/1.7.1/ubuntu20.04/habanalabs/tensorflow-installer-tf-cpu-2.10.1

RUN git clone -b 1.7.1 https://github.com/HabanaAI/Model-References.git

# Installs hccl_ofi_wrapper to interact with libfabric to utilize HW and networking mode (EFA)
ARG OFI_WRAPPER_WS="/root/hccl_ofi_wrapper"
RUN git clone "https://github.com/HabanaAI/hccl_ofi_wrapper.git" "${OFI_WRAPPER_WS}" && \
  cd "${OFI_WRAPPER_WS}" && \
  ln -s /opt/amazon/efa/lib64 /opt/amazon/efa/lib && \
  LIBFABRIC_ROOT=/opt/amazon/efa make

ENV LD_LIBRARY_PATH=$LD_LIBRARY_PATH:"${OFI_WRAPPER_WS}"

  3. Build and push the image to Amazon Elastic Container Registry (ECR) for ease of access on EC2 instances. For further information on how to build and push an image to ECR, refer to Create Elastic Container Registry (ECR) and Upload Images or to the Amazon ECR Getting Started Guide.
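As a rough sketch, the build-and-push flow typically looks like the following; <aws_account_id> and <region> are placeholders for your account and region, and the repository name gaudi-mnist is only an example:

# Authenticate Docker to your ECR registry
aws ecr get-login-password --region <region> | docker login --username AWS --password-stdin <aws_account_id>.dkr.ecr.<region>.amazonaws.com

# Build, tag, and push the custom image
docker build -t gaudi-mnist:1.7.1 .
docker tag gaudi-mnist:1.7.1 <aws_account_id>.dkr.ecr.<region>.amazonaws.com/gaudi-mnist:1.7.1
docker push <aws_account_id>.dkr.ecr.<region>.amazonaws.com/gaudi-mnist:1.7.1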

Run MPIJob on Multi-node Cluster

  1. Create an mpijob-mnist.yaml file. The config file pulls the Docker image and sets up containers with resource requests such as habana.ai/gaudi, hugepages-2Mi, and cpu. Adapt these parameters to your task and model.

The following is an example of mpijob-mnist.yaml. Refer to the model code's README for details on how to run multi-node training:

apiVersion: kubeflow.org/v2beta1
kind: MPIJob
metadata:
  name: mnist-distributed
spec:
  slotsPerWorker: 8
  runPolicy:
    cleanPodPolicy: Running
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      restartPolicy: Never
      template:
        spec:
          imagePullSecrets:
            - name: private-registry
          terminationGracePeriodSeconds: 0
          containers:
            - image: <Custom Docker Image>
              imagePullPolicy: Always
              name: mnist-launcher
              command:
                - bash
                - -c
                - mpirun --allow-run-as-root --bind-to core -np 16 --map-by ppr:4:socket:PE=6 --merge-stderr-to-stdout
                  --prefix /opt/amazon/openmpi
                  -x PYTHONPATH=/Model-References:/usr/lib/habanalabs
                  -x HCL_CONFIG_PATH=/etc/hcl/worker_config.json
                  -x LD_LIBRARY_PATH=/opt/amazon/openmpi/lib:/opt/amazon/efa/lib/:/root/hccl_ofi_wrapper:${LD_LIBRARY_PATH}
                  -x HCCL_OVER_TCP=0
                  -x HCCL_OVER_OFI=1
                  -x FI_PROVIDER=efa
                  python3 /Model-References/TensorFlow/examples/hello_world/example_hvd.py
              resources:
                requests:
                  cpu: "100m"
    Worker:
      replicas: 2
      template:
        spec:
          imagePullSecrets:
            - name: private-registry
          terminationGracePeriodSeconds: 0
          containers:
            - image: <Custom Docker Image>
              name: mnist-worker
              securityContext:
                capabilities:
                  add:
                    - SYS_RAWIO
                    - SYS_PTRACE
              resources:
                requests:
                  habana.ai/gaudi: 8
                  hugepages-2Mi: "21000Mi"
                  vpc.amazonaws.com/efa: 4
                  cpu: "90"
                limits:
                  habana.ai/gaudi: 8
                  hugepages-2Mi: "21000Mi"
                  vpc.amazonaws.com/efa: 4
                  cpu: "90"
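
Note that the spec above references an imagePullSecrets entry named private-registry. If your registry requires credentials, one way to create such a secret for ECR is sketched below; the server URL placeholders are assumptions for your account and region:

# Create the pull secret referenced by the MPIJob spec
kubectl create secret docker-registry private-registry \
  --docker-server=<aws_account_id>.dkr.ecr.<region>.amazonaws.com \
  --docker-username=AWS \
  --docker-password="$(aws ecr get-login-password --region <region>)"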

  2. Update the parameters listed below to run the desired configuration:

Parameter              Description
---------              -----------
<Custom Docker Image>  Image with Model-References and hccl_ofi_wrapper installed
-np 16                 Number of HPU cards for training; must equal replicas x slotsPerWorker (2 x 8 = 16), so update it together with replicas
replicas: 2            Number of DL1 instances for training; update it together with -np 16
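
For example, a run on four DL1 instances (8 Gaudi cards each) would, by the rule above, change both values together; this is only an illustrative sketch:

replicas: 4    # Worker spec: four DL1 instances
-np 32         # Launcher command: 4 instances x 8 cards = 32 ranks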

  3. Run the job with the following command:

kubectl apply -f mpijob-mnist.yaml
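
You can also inspect the MPIJob resource itself; a minimal status check, assuming the MPI Operator CRD is installed in the cluster:

kubectl get mpijob mnist-distributed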

  4. Check the job status by running the following command:

kubectl get pods -A

  5. Retrieve the name of the created pod and run the following command to see the results:

kubectl logs <pod-name>
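
The training output appears in the launcher pod's logs. As a convenience sketch, assuming the usual <job-name>-launcher pod naming of the MPI Operator (which may vary by operator version):

# Find the launcher pod for this job and follow its logs
POD=$(kubectl get pods --no-headers | grep mnist-distributed-launcher | awk '{print $1}')
kubectl logs -f "$POD"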