mnist Model Training Example: Run MPIJob on Multi-node Cluster¶
The following is an MNIST model training example on Amazon EKS.
Build and Store Custom Docker Image¶
1. Create a Dockerfile with the below content, using the TensorFlow base image for the TensorFlow example or the PyTorch base image for the PyTorch example:

TensorFlow:

```
FROM vault.habana.ai/gaudi-docker/1.11.0/ubuntu20.04/habanalabs/tensorflow-installer-tf-cpu-2.12.1
RUN git clone -b 1.11.0 https://github.com/HabanaAI/Model-References.git
```

PyTorch:

```
FROM vault.habana.ai/gaudi-docker/1.11.0/ubuntu20.04/habanalabs/pytorch-installer-2.0.1:latest
RUN git clone -b 1.11.0 https://github.com/HabanaAI/Model-References.git
```
2. Build the image and push it to AWS's Elastic Container Registry (ECR) for ease of access on EC2 instances. For further information on how to build and push an image to ECR, refer to Create Elastic Container Registry (ECR) and Upload Images or to the Amazon ECR Getting Started Guide.
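For reference, a minimal build-and-push sequence might look like the following sketch. The region, account ID, and repository name (`us-west-2`, `<aws-account-id>`, `gaudi-mnist`) are placeholders for illustration, not values from this guide; substitute your own:

```
# Placeholder values -- replace the account ID, region, and repository name with your own
aws ecr create-repository --repository-name gaudi-mnist --region us-west-2
aws ecr get-login-password --region us-west-2 | \
  docker login --username AWS --password-stdin <aws-account-id>.dkr.ecr.us-west-2.amazonaws.com
docker build -t gaudi-mnist:latest .
docker tag gaudi-mnist:latest <aws-account-id>.dkr.ecr.us-west-2.amazonaws.com/gaudi-mnist:latest
docker push <aws-account-id>.dkr.ecr.us-west-2.amazonaws.com/gaudi-mnist:latest
```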
Run MPIJob on Multi-node Cluster¶
Create the `mpijob-mnist.yaml` file. The config file pulls a docker image and sets up the container according to `habana.ai/gaudi`, `hugepages-2Mi`, `memory`, etc. These three parameters can be adapted to your task and model.
The following is an example of `mpijob-mnist.yaml` for the TensorFlow version of the model. Check the model code README for details on how to run multi-node training:
```yaml
apiVersion: kubeflow.org/v2beta1
kind: MPIJob
metadata:
  name: mnist-distributed
spec:
  slotsPerWorker: 8
  runPolicy:
    cleanPodPolicy: Running
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      restartPolicy: Never
      template:
        spec:
          imagePullSecrets:
            - name: private-registry
          terminationGracePeriodSeconds: 0
          containers:
            - image: <Custom Docker Image>
              imagePullPolicy: Always
              name: mnist-launcher
              command:
                - bash
                - -c
                - >-
                  mpirun --allow-run-as-root --bind-to core -np 16 --map-by ppr:4:socket:PE=6 --merge-stderr-to-stdout
                  --prefix /opt/amazon/openmpi
                  -x PYTHONPATH=/Model-References:/usr/lib/habanalabs
                  -x LD_LIBRARY_PATH=/opt/amazon/openmpi/lib:/opt/amazon/efa/lib/:${LD_LIBRARY_PATH}
                  -x RDMAV_FORK_SAFE=1
                  -x FI_EFA_USE_DEVICE_RDMA=1
                  python3 /Model-References/TensorFlow/examples/hello_world/example_hvd.py
              resources:
                requests:
                  cpu: "100m"
    Worker:
      replicas: 2
      template:
        spec:
          imagePullSecrets:
            - name: private-registry
          terminationGracePeriodSeconds: 0
          containers:
            - image: <Custom Docker Image>
              name: mnist-worker
              securityContext:
                capabilities:
                  add: ["SYS_NICE"]
              resources:
                requests:
                  habana.ai/gaudi: 8
                  hugepages-2Mi: "42000Mi"
                  vpc.amazonaws.com/efa: 4
                  cpu: "90"
                limits:
                  habana.ai/gaudi: 8
                  hugepages-2Mi: "42000Mi"
                  vpc.amazonaws.com/efa: 4
                  cpu: "90"
```
The following is an example of `mpijob-mnist.yaml` for the PyTorch version of the model. Check the model code README for details on how to run multi-node training:
Note
PyTorch uses shared memory buffers to communicate between processes. By default, Docker containers are allocated 64MB of shared memory, which can be insufficient when using more than one HPU. To work around this, set `hostIPC: true` so the container reuses the host's shared memory space.
```yaml
apiVersion: kubeflow.org/v2beta1
kind: MPIJob
metadata:
  name: mnist-distributed
spec:
  slotsPerWorker: 8
  runPolicy:
    cleanPodPolicy: Running
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      restartPolicy: Never
      template:
        spec:
          imagePullSecrets:
            - name: private-registry
          terminationGracePeriodSeconds: 0
          containers:
            - image: <Custom Docker Image>
              imagePullPolicy: Always
              name: mnist-launcher
              command:
                - bash
                - -c
                - >-
                  mpirun --allow-run-as-root --bind-to core -np 16 --map-by ppr:4:socket:PE=6 --merge-stderr-to-stdout
                  --prefix /opt/amazon/openmpi
                  -x PYTHONPATH=/Model-References:/usr/lib/habanalabs
                  -x LD_LIBRARY_PATH=/opt/amazon/openmpi/lib:/opt/amazon/efa/lib/:${LD_LIBRARY_PATH}
                  -x RDMAV_FORK_SAFE=1
                  -x FI_EFA_USE_DEVICE_RDMA=1
                  python3 /Model-References/PyTorch/examples/computer_vision/hello_world/mnist.py
                  --batch-size=64 --epochs=1 --lr=1.0 --gamma=0.7 --hpu --use_lazy_mode
              resources:
                requests:
                  cpu: "100m"
    Worker:
      replicas: 2
      template:
        spec:
          imagePullSecrets:
            - name: private-registry
          terminationGracePeriodSeconds: 0
          hostIPC: true
          containers:
            - image: <Custom Docker Image>
              name: mnist-worker
              securityContext:
                capabilities:
                  add: ["SYS_NICE"]
              resources:
                requests:
                  habana.ai/gaudi: 8
                  hugepages-2Mi: "42000Mi"
                  vpc.amazonaws.com/efa: 4
                  cpu: "90"
                limits:
                  habana.ai/gaudi: 8
                  hugepages-2Mi: "42000Mi"
                  vpc.amazonaws.com/efa: 4
                  cpu: "90"
```
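The `-np` value passed to `mpirun` must equal the number of worker replicas times `slotsPerWorker` (2 × 8 = 16 in the manifests above). A quick sanity check:

```shell
# Values taken from the example manifest: 2 worker replicas, 8 slots (Gaudi cards) each
worker_replicas=2
slots_per_worker=8
echo "mpirun -np $(( worker_replicas * slots_per_worker ))"   # prints: mpirun -np 16
```

If you scale the worker `replicas` count up or down, recompute and update `-np` to match.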
Update the parameters listed below to run the desired configuration:

| Parameter | Description |
| --- | --- |
| `<Custom Docker Image>` | Image with the MNIST model dependencies installed |
| `-np 16` | Number of HPU cards for training. Should be updated to match `replicas: 2`, the number of DL1 instances for training |
| `replicas: 2` | Number of DL1 instances for training. Should be updated to match `-np 16`, the number of HPUs for training |

Run the job by running the following command:
```
kubectl apply -f mpijob-mnist.yaml
```
Check the job status by running the following command:
```
kubectl get pods -A
```
Retrieve the name of the created pod and run the following command to see the results:
```
kubectl logs <pod-name>
```
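If you prefer to look up the launcher pod by label rather than by name, recent releases of the Kubeflow MPI operator label the pods it creates with the job name and role; a sketch, assuming the `training.kubeflow.org/job-name` and `training.kubeflow.org/job-role` labels are present on your operator version:

```
# Assumes mpi-operator applies training.kubeflow.org labels to the pods it creates
kubectl logs -l training.kubeflow.org/job-name=mnist-distributed,training.kubeflow.org/job-role=launcher
```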