MNIST Model Training Example: Run MPIJob on Multi-node Cluster
Below is an MNIST model training example on Amazon EKS.
Running Multi-Gaudi Workloads
Under the Kubernetes backoff policy, a job is automatically restarted when a failure occurs, for example when the worker pods are not running. This is useful for resuming long-running training from a checkpoint after an error crashes the job. For more information, refer to the Kubernetes backoff failure policy.
Below is an example of an MPIJob that trains an MNIST model on 16 Gaudi devices (2 worker nodes with 8 devices each).
Create the mpijob-mnist.yaml file. Make sure to set the number of Gaudi nodes in Worker -> replicas:

apiVersion:
kind: MPIJob
metadata:
  name: mnist-run
spec:
  slotsPerWorker: 8
  runPolicy:
    cleanPodPolicy: Running
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        spec:
          containers:
            - image:
              name: mnist-launcher
              command: ["/bin/bash", "-c"]
              args:
                - >-
                  /usr/bin/ssh-keygen -A;
                  /usr/sbin/sshd;

                  HOSTSFILE=$OMPI_MCA_orte_default_hostfile;
                  MASTER_ADDR="$(head -n 1 $HOSTSFILE | sed -n s/[[:space:]]slots.*//p)";
                  NUM_NODES=$(wc -l < $HOSTSFILE);
                  CARDS_PER_NODE=8;
                  N_CARDS=$((NUM_NODES*CARDS_PER_NODE));

                  SETUP_CMD="git clone --branch 1.20.0 /Model-References";
                  $SETUP_CMD;
                  mpirun --npernode 1 \
                    --tag-output \
                    --allow-run-as-root \
                    --prefix $MPI_ROOT \
                    $SETUP_CMD;

                  MODEL_PATH=/Model-References/PyTorch/examples/computer_vision/hello_world;
                  MNIST_CMD="python $MODEL_PATH/ \
                    --batch-size=64 \
                    --epochs=1 \
                    --lr=1.0 \
                    --gamma=0.7 \
                    --hpu";

                  cd $MODEL_PATH;
                  mpirun -np ${N_CARDS} \
                    --allow-run-as-root \
                    --bind-to core \
                    --map-by ppr:4:socket:PE=6 \
                    -rank-by core --report-bindings \
                    --tag-output \
                    --merge-stderr-to-stdout --prefix $MPI_ROOT \
                    -x MASTER_ADDR=$MASTER_ADDR \
                    $MNIST_CMD;
    Worker:
      replicas: 2
      template:
        spec:
          hostIPC: true
          containers:
            - image:
              name: mnist-worker
              resources:
                limits:
                  : 8
                  memory: 409Gi
                  hugepages-2Mi: 95000Mi
                  : 4
                requests:
                  : 8
                  memory: 409Gi
                  hugepages-2Mi: 95000Mi
                  : 4
              command: ["/bin/bash", "-c"]
              args:
                - >-
                  /usr/bin/ssh-keygen -A;
                  /usr/sbin/sshd;
                  sleep 365d;
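The launcher script derives the total process count from the hostfile: one line per worker node, multiplied by the cards per node. The sketch below reproduces that arithmetic locally against a hand-written hostfile. The host names are hypothetical, and the "<host> slots=<n>" line layout is an assumption inferred from the sed expression in the manifest, not a confirmed description of what the MPI Operator writes.

```shell
set -eu

# Hypothetical hostfile: one line per worker, in an assumed "<host> slots=<n>" layout.
HOSTSFILE="$(mktemp)"
printf '%s\n' \
  'worker-0.example slots=8' \
  'worker-1.example slots=8' > "$HOSTSFILE"

# The first host serves as the master address; strip the trailing " slots=..." part.
MASTER_ADDR="$(head -n 1 "$HOSTSFILE" | sed -n 's/[[:space:]]slots.*//p')"

# One line per node. The arithmetic expansion drops any padding wc may emit.
NUM_NODES=$(( $(wc -l < "$HOSTSFILE") ))
CARDS_PER_NODE=8   # matches slotsPerWorker: 8 in the manifest
N_CARDS=$((NUM_NODES * CARDS_PER_NODE))

echo "MASTER_ADDR=$MASTER_ADDR NUM_NODES=$NUM_NODES N_CARDS=$N_CARDS"
rm -f "$HOSTSFILE"
```

With two workers and eight cards each, N_CARDS comes to 16, which is the value passed to mpirun -np in the manifest. Changing Worker -> replicas changes the number of hostfile lines and therefore the process count.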
PyTorch uses shared memory buffers to communicate between processes. By default, Docker containers are allocated 64 MB of shared memory, which can be insufficient when using more than one HPU. Setting hostIPC: true allows the container to reuse the host’s shared memory space.
Run the job:
kubectl apply -f mpijob-mnist.yaml
Check the job status:
kubectl get pods -A
Retrieve the name of the created launcher pod and run the following command to see the results:
kubectl logs <pod-name>
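Looking up the launcher pod by hand can be scripted. The sketch below filters pod listings for a pod whose name starts with mnist-run-launcher. It assumes the launcher pod is named after the job with a -launcher suffix, and find_launcher is a hypothetical helper name, not part of kubectl. The sample listing uses made-up pod names in place of real cluster output.

```shell
# Hypothetical helper: reads `kubectl get pods -A --no-headers` output on stdin
# and prints "<namespace> <pod-name>" for the first pod whose name starts with
# "<job>-launcher" (this naming convention is an assumption).
find_launcher() {
  awk -v prefix="$1-launcher" 'index($2, prefix) == 1 { print $1, $2; exit }'
}

# Sample listing with made-up pod names, standing in for real kubectl output.
sample='default   mnist-run-launcher-abcde   1/1   Running   0   2m
default   mnist-run-worker-0         1/1   Running   0   2m
default   mnist-run-worker-1         1/1   Running   0   2m'

result="$(printf '%s\n' "$sample" | find_launcher mnist-run)"
echo "$result"
```

Against a live cluster, the helper could be fed by kubectl get pods -A --no-headers, and the namespace and pod name it prints could then be passed to kubectl logs -n <namespace> <pod-name>.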