MPI Operator for Kubernetes
Intel® Gaudi® uses the standard MPI Operator from Kubeflow, which enables running MPI allreduce-style workloads on Kubernetes using Gaudi accelerators. Combined with Intel Gaudi AI hardware and software, it enables large-scale distributed training with a simple Kubernetes job distribution model.
Prerequisites
The following prerequisites are required to run the MPI Operator on Intel Gaudi cards:
A Kubernetes version listed in the Support Matrix.
Intel Gaudi software drivers loaded on the system.
The habana-container-runtime package installed and set up.
Installation instructions for these components are provided in the Installation Guide and On-Premise System Update.
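Before installing the operator, each Gaudi host can be spot-checked with a small script. This is a hypothetical helper, not an Intel tool, and it does not check the Kubernetes version. It assumes the driver's kernel module is named habanalabs, that the driver tools provide hl-smi, and that the runtime package is named habana-container-runtime and can be queried with dpkg (Ubuntu) or rpm (RHEL-family hosts).

```bash
#!/usr/bin/env bash
# Hypothetical prerequisite spot-check for a Gaudi host.
# Assumptions: kernel module "habanalabs", tool "hl-smi",
# package "habana-container-runtime" (queried via dpkg or rpm).
check_gaudi_prereqs() {
  local modules_file="${1:-/proc/modules}"  # overridable so the check can be tested
  local missing=0

  # /proc/modules lists one loaded module per line, name first.
  if grep -q '^habanalabs ' "$modules_file" 2>/dev/null; then
    echo "OK      habanalabs kernel module loaded"
  else
    echo "MISSING habanalabs kernel module"
    missing=1
  fi

  if command -v hl-smi >/dev/null 2>&1; then
    echo "OK      hl-smi found"
  else
    echo "MISSING hl-smi"
    missing=1
  fi

  # Try the Debian package database first, then RPM.
  if dpkg -s habana-container-runtime >/dev/null 2>&1 \
     || rpm -q habana-container-runtime >/dev/null 2>&1; then
    echo "OK      habana-container-runtime package installed"
  else
    echo "MISSING habana-container-runtime package"
    missing=1
  fi

  return "$missing"
}

# Usage: check_gaudi_prereqs || echo "fix the MISSING items first"
```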
Installation
Follow the MPI Operator documentation to set up mpi-operator on your Kubernetes cluster.
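For reference, installing the operator usually comes down to applying its published manifest. The sketch below assumes release v0.5.0 and the deploy/v2beta1/mpi-operator.yaml path in the kubeflow/mpi-operator repository; both are assumptions, so take the exact version and command from the MPI Operator documentation.

```bash
# Sketch only: the release tag and manifest path are assumptions; confirm them
# in the MPI Operator documentation before use.
install_mpi_operator() {
  local version="${1:-v0.5.0}"
  local manifest="https://raw.githubusercontent.com/kubeflow/mpi-operator/${version}/deploy/v2beta1/mpi-operator.yaml"
  # Server-side apply sidesteps the size limit on the client-side
  # last-applied-configuration annotation, which large CRD manifests can exceed.
  kubectl apply --server-side -f "$manifest" || return
  # The MPIJob custom resource definition should now be registered.
  kubectl get crd mpijobs.kubeflow.org
}

# Usage: install_mpi_operator v0.5.0
```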
Running Multi-Gaudi Workloads
The following example runs an MPIJob that trains an MNIST model on 16 Gaudi devices (two worker nodes with eight Gaudi devices each).
Create the mpijob-mnist.yaml file. Make sure to set the number of Gaudi nodes in Worker -> replicas:
apiVersion: kubeflow.org/v2beta1
kind: MPIJob
metadata:
  name: mnist-run
spec:
  slotsPerWorker: 8
  runPolicy:
    cleanPodPolicy: Running
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        spec:
          containers:
            - image: vault.habana.ai/gaudi-docker/1.15.1/ubuntu22.04/habanalabs/pytorch-installer-2.2.0:latest
              name: mnist-launcher
              command: ["/bin/bash", "-c"]
              args:
                - >-
                  HOSTSFILE=$OMPI_MCA_orte_default_hostfile;
                  MASTER_ADDR="$(head -n 1 $HOSTSFILE | sed -n 's/[[:space:]]slots.*//p')";
                  NUM_NODES=$(wc -l < $HOSTSFILE);
                  CARDS_PER_NODE=8;
                  N_CARDS=$((NUM_NODES*CARDS_PER_NODE));
                  SETUP_CMD="git clone --branch 1.15.1 https://github.com/HabanaAI/Model-References.git /Model-References";
                  $SETUP_CMD;
                  mpirun --npernode 1 \
                    --tag-output \
                    --allow-run-as-root \
                    --prefix $MPI_ROOT \
                    $SETUP_CMD;
                  MODEL_PATH=/Model-References/PyTorch/examples/computer_vision/hello_world;
                  MNIST_CMD="python $MODEL_PATH/mnist.py \
                    --batch-size=64 \
                    --epochs=1 \
                    --lr=1.0 \
                    --gamma=0.7 \
                    --hpu";
                  cd $MODEL_PATH;
                  mpirun -np ${N_CARDS} \
                    --allow-run-as-root \
                    --bind-to core \
                    --map-by ppr:4:socket:PE=6 \
                    --rank-by core --report-bindings \
                    --tag-output \
                    --merge-stderr-to-stdout --prefix $MPI_ROOT \
                    -x MASTER_ADDR=$MASTER_ADDR \
                    $MNIST_CMD;
    Worker:
      replicas: 2
      template:
        spec:
          hostIPC: true
          containers:
            - image: vault.habana.ai/gaudi-docker/1.15.1/ubuntu22.04/habanalabs/pytorch-installer-2.2.0:latest
              name: mnist-worker
              resources:
                limits:
                  habana.ai/gaudi: 8
                  memory: 409Gi
                  hugepages-2Mi: 95000Mi
                requests:
                  habana.ai/gaudi: 8
                  memory: 409Gi
                  hugepages-2Mi: 95000Mi
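The launcher script derives its master address and rank count from the MPI hostfile that the operator provides to the launcher. The function below is a minimal sketch that repeats that logic outside the cluster so it can be checked locally; it assumes each hostfile line has the form "<worker-hostname> slots=<N>" (the format the sed expression expects), and the worker hostnames in the usage example are hypothetical.

```bash
# Minimal sketch of the launcher's hostfile logic (not the operator's own code).
# Assumption: every hostfile line looks like "<worker-hostname> slots=<N>".
parse_hostfile() {
  local hostsfile="$1"
  local cards_per_node="${2:-8}"   # matches CARDS_PER_NODE in the manifest
  local master_addr num_nodes

  # First worker host with the trailing " slots=..." field stripped.
  master_addr="$(head -n 1 "$hostsfile" | sed -n 's/[[:space:]]slots.*//p')"
  # One line per worker pod; $(( )) also trims the padding some wc builds add.
  num_nodes=$(( $(wc -l < "$hostsfile") ))

  echo "MASTER_ADDR=${master_addr}"
  echo "NUM_NODES=${num_nodes}"
  echo "N_CARDS=$(( num_nodes * cards_per_node ))"
}

# Usage with a hypothetical two-worker hostfile (Worker replicas: 2).
tmp_hostfile="$(mktemp)"
printf '%s\n' \
  "mnist-run-worker-0.mnist-run-worker.default.svc slots=8" \
  "mnist-run-worker-1.mnist-run-worker.default.svc slots=8" > "$tmp_hostfile"
parse_hostfile "$tmp_hostfile"
# Prints:
# MASTER_ADDR=mnist-run-worker-0.mnist-run-worker.default.svc
# NUM_NODES=2
# N_CARDS=16
rm -f "$tmp_hostfile"
```

With two worker replicas and eight cards per node, the launcher therefore starts 16 ranks. The --map-by ppr:4:socket:PE=6 option places four ranks on each CPU socket, so it yields eight ranks per node only on a two-socket host; that socket count is an assumption about the hardware.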
Note
PyTorch uses shared memory buffers to communicate between processes. By default, Docker containers are allocated 64 MB of shared memory, which can be insufficient when using more than one HPU. Setting hostIPC: true allows the container to reuse the host's shared memory space.
According to the Kubernetes backoff failure policy, if a failure occurs (for example, the worker pods are not running), the job is automatically restarted. This is useful for resuming long-running training from a checkpoint if an error causes the job to crash. For more information, refer to Kubernetes backoff failure policy.
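To see how much shared memory a worker container actually has, one hedged check is to run df inside a worker pod. This sketch assumes the MPI Operator names worker pods <mpijob-name>-worker-<index> (for example, mnist-run-worker-0) in the current namespace; confirm the names with kubectl get pods first.

```bash
# Hypothetical helper; the pod naming "<job>-worker-<index>" is an assumption.
show_worker_shm() {
  local job="$1"
  local index="${2:-0}"
  # df reports the size and usage of /dev/shm as seen inside the container.
  kubectl exec "${job}-worker-${index}" -- df -h /dev/shm
}

# Usage (requires a running MPIJob): show_worker_shm mnist-run 0
```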
Run the job:
kubectl apply -f mpijob-mnist.yaml
Check the job status:
kubectl get pods -A
Retrieve the name of the created launcher pod and run the following command to see the results:
kubectl logs <pod-name>
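The steps above can be wrapped into one helper. This is a sketch, not part of the MPI Operator: it assumes the operator (v2beta1 API) runs the launcher as a Kubernetes Job named <mpijob-name>-launcher, so kubectl logs job/... can find the launcher pod without looking up its generated name, and that the MPIJob sets a Succeeded condition when training completes. If the job fails, kubectl wait returns only when the timeout expires.

```bash
# Hypothetical helper: submit an MPIJob, wait for it, then print launcher logs.
# Assumptions: the launcher Job is "<mpijob-name>-launcher" and the MPIJob
# reports a "Succeeded" condition on completion.
run_mpijob() {
  local manifest="$1"
  local job="$2"
  local timeout="${3:-30m}"
  local status=0

  kubectl apply -f "$manifest" || return
  # Block until the MPIJob succeeds or the timeout expires.
  kubectl wait --for=condition=Succeeded "mpijob/${job}" --timeout="$timeout" || status=$?
  # Print launcher output whether or not the wait succeeded, to help debugging.
  kubectl logs "job/${job}-launcher"
  return "$status"
}

# Usage: run_mpijob mpijob-mnist.yaml mnist-run 30m
```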