Running Kubernetes Workloads with Gaudi¶
Kubernetes provides an efficient and manageable way to orchestrate deep learning workloads at scale.
Prerequisites¶
A Kubernetes cluster running a version listed in the Support Matrix.
Make sure to install the Intel Gaudi Base Operator for Kubernetes or the Intel Gaudi Device Plugin for Kubernetes. For more details, refer to Kubernetes Installation.
Running Gaudi Jobs Example¶
You can create a Kubernetes job that acquires a Gaudi device by setting the resources.limits field of the container spec. Below is an example using Intel Gaudi’s PyTorch container image.
Run the job:
cat <<EOF | kubectl apply -f -
apiVersion: batch/v1
kind: Job
metadata:
  name: habanalabs-gaudi-demo
spec:
  template:
    spec:
      hostIPC: true
      restartPolicy: OnFailure
      containers:
        - name: habana-ai-base-container
          image: vault.habana.ai/gaudi-docker/1.18.0/ubuntu22.04/habanalabs/pytorch-installer-2.4.0:latest
          workingDir: /root
          command: ["hl-smi"]
          securityContext:
            capabilities:
              add: ["SYS_NICE"]
          resources:
            limits:
              habana.ai/gaudi: 1
              memory: 409Gi
              hugepages-2Mi: 95000Mi
EOF
Check the pod status:
kubectl get pods
Retrieve the name of the pod and see the results:
kubectl logs <pod-name>
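The check-and-retrieve steps above can also be scripted, because the Job controller labels each pod it creates with job-name=<job>. The following is a minimal sketch rather than part of the Intel Gaudi tooling: the function name gaudi_job_logs and the 300-second default timeout are assumptions chosen for illustration.

```shell
# Hypothetical helper: wait for a Job to finish, then print its logs.
# Usage: gaudi_job_logs <job-name> [timeout-seconds]
gaudi_job_logs() {
  local job="$1"
  local timeout="${2:-300}"   # assumed default; tune for your cluster

  # Block until the Job reports the "complete" condition or the timeout expires.
  kubectl wait --for=condition=complete "job/${job}" --timeout="${timeout}s" || return 1

  # List the pods that belong to the Job via the job-name label.
  kubectl get pods -l "job-name=${job}"

  # "kubectl logs job/<name>" picks a pod of the Job automatically.
  kubectl logs "job/${job}"
}
```

For the job defined above, the call would be gaudi_job_logs habanalabs-gaudi-demo.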
Note
After setting up your Kubernetes cluster, use Prometheus Metric Exporter to collect the Gaudi device metrics.
Running Gaudi MNIST Training Job Example¶
Below is an example of training an MNIST PyTorch model on eight Gaudi devices using Intel Gaudi’s PyTorch container image.
Create a mnist.yaml file:
apiVersion: batch/v1
kind: Job
metadata:
  name: mnist-demo
spec:
  template:
    spec:
      hostIPC: true
      restartPolicy: OnFailure
      containers:
        - name: mnist
          image: vault.habana.ai/gaudi-docker/1.18.0/ubuntu22.04/habanalabs/pytorch-installer-2.4.0:latest
          command: ["/bin/bash", "-c"]
          args:
            - >-
              git clone --branch 1.18.0 https://github.com/HabanaAI/Model-References.git /Model-References;
              MODEL_PATH=/Model-References/PyTorch/examples/computer_vision/hello_world;
              cd $MODEL_PATH;
              MNIST_CMD="python mnist.py \
                --batch-size=64 \
                --epochs=1 \
                --lr=1.0 \
                --gamma=0.7 \
                --hpu";
              mpirun -np 8 \
                --allow-run-as-root \
                --bind-to core \
                --map-by ppr:4:socket:PE=6 \
                -rank-by core --report-bindings \
                --tag-output \
                --merge-stderr-to-stdout --prefix $MPI_ROOT \
                $MNIST_CMD;
          securityContext:
            capabilities:
              add: ["SYS_NICE"]
          resources:
            limits:
              habana.ai/gaudi: 8
              memory: 409Gi
              hugepages-2Mi: 95000Mi
            requests:
              habana.ai/gaudi: 8
              memory: 409Gi
              hugepages-2Mi: 95000Mi
Run the job:
kubectl apply -f mnist.yaml
Check the pod status:
kubectl get pods
Retrieve the name of the pod and see the results:
kubectl logs <pod-name>
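The mpirun line in mnist.yaml maps ranks with --map-by ppr:4:socket:PE=6. Assuming the usual Open MPI reading of that syntax (four processes per socket, six cores bound to each process), the eight ranks need two sockets with at least 24 cores each. The sketch below only works through that arithmetic; the variable names are illustrative and do not appear in the manifest.

```shell
# Values taken from the mpirun command in mnist.yaml.
RANKS=8              # mpirun -np 8
RANKS_PER_SOCKET=4   # ppr:4:socket
CORES_PER_RANK=6     # PE=6

# Round up so a partially filled socket still counts as a full one.
SOCKETS_NEEDED=$(( (RANKS + RANKS_PER_SOCKET - 1) / RANKS_PER_SOCKET ))
CORES_PER_SOCKET=$(( RANKS_PER_SOCKET * CORES_PER_RANK ))
CORES_TOTAL=$(( RANKS * CORES_PER_RANK ))

echo "sockets needed:   ${SOCKETS_NEEDED}"
echo "cores per socket: ${CORES_PER_SOCKET}"
echo "cores in total:   ${CORES_TOTAL}"
```

If the node has fewer cores per socket than this, mpirun may refuse the binding, and the ppr and PE values would need to be adjusted to the host topology.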
MPI Operator for Multi-Gaudi Nodes¶
Intel® Gaudi® uses the standard MPI Operator from Kubeflow, which allows running MPI allreduce-style workloads in Kubernetes on Gaudi accelerators. Combined with Intel Gaudi hardware and software, it enables large-scale distributed training with a simple Kubernetes job distribution model.
Installing MPI Operator¶
Follow the MPI Operator documentation for instructions on setting up MPI Operator on your Kubernetes cluster.
Running Multi-Gaudi Workloads Example¶
Below is an example of an MPIJob that trains an MNIST model on 16 Gaudi devices (two worker nodes with eight devices each).
Create an mpijob-mnist.yaml file. Make sure to set the number of Gaudi nodes in Worker -> replicas:
apiVersion: kubeflow.org/v2beta1
kind: MPIJob
metadata:
  name: mnist-run
spec:
  slotsPerWorker: 8
  runPolicy:
    cleanPodPolicy: Running
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        spec:
          containers:
            - image: vault.habana.ai/gaudi-docker/1.18.0/ubuntu22.04/habanalabs/pytorch-installer-2.4.0:latest
              name: mnist-launcher
              command: ["/bin/bash", "-c"]
              args:
                - >-
                  /usr/bin/ssh-keygen -A;
                  /usr/sbin/sshd;
                  HOSTSFILE=$OMPI_MCA_orte_default_hostfile;
                  MASTER_ADDR="$(head -n 1 $HOSTSFILE | sed -n s/[[:space:]]slots.*//p)";
                  NUM_NODES=$(wc -l < $HOSTSFILE);
                  CARDS_PER_NODE=8;
                  N_CARDS=$((NUM_NODES*CARDS_PER_NODE));
                  SETUP_CMD="git clone --branch 1.18.0 https://github.com/HabanaAI/Model-References.git /Model-References";
                  $SETUP_CMD;
                  mpirun --npernode 1 \
                    --tag-output \
                    --allow-run-as-root \
                    --prefix $MPI_ROOT \
                    $SETUP_CMD;
                  MODEL_PATH=/Model-References/PyTorch/examples/computer_vision/hello_world;
                  MNIST_CMD="python $MODEL_PATH/mnist.py \
                    --batch-size=64 \
                    --epochs=1 \
                    --lr=1.0 \
                    --gamma=0.7 \
                    --hpu";
                  cd $MODEL_PATH;
                  mpirun -np ${N_CARDS} \
                    --allow-run-as-root \
                    --bind-to core \
                    --map-by ppr:4:socket:PE=6 \
                    -rank-by core --report-bindings \
                    --tag-output \
                    --merge-stderr-to-stdout --prefix $MPI_ROOT \
                    -x MASTER_ADDR=$MASTER_ADDR \
                    $MNIST_CMD;
    Worker:
      replicas: 2
      template:
        spec:
          hostIPC: true
          containers:
            - image: vault.habana.ai/gaudi-docker/1.18.0/ubuntu22.04/habanalabs/pytorch-installer-2.4.0:latest
              name: mnist-worker
              resources:
                limits:
                  habana.ai/gaudi: 8
                  memory: 409Gi
                  hugepages-2Mi: 95000Mi
                requests:
                  habana.ai/gaudi: 8
                  memory: 409Gi
                  hugepages-2Mi: 95000Mi
              command: ["/bin/bash", "-c"]
              args:
                - >-
                  /usr/bin/ssh-keygen -A;
                  /usr/sbin/sshd;
                  sleep 365d;
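The launcher script derives MASTER_ADDR and the total card count from the hostfile that MPI Operator provides to the launcher pod. The sketch below reproduces that parsing against a sample hostfile so the result can be checked locally. The hostnames are made-up placeholders; the real file is generated by the operator and its exact hostnames depend on the operator version and namespace.

```shell
# Build a sample hostfile (hypothetical hostnames) in a temporary location.
HOSTSFILE="$(mktemp)"
cat > "$HOSTSFILE" <<'EOF'
mnist-run-worker-0.example.svc slots=8
mnist-run-worker-1.example.svc slots=8
EOF

# Same parsing as the launcher: take the first line and strip " slots=...".
# The sed expression is quoted here so the shell cannot glob-expand it.
MASTER_ADDR="$(head -n 1 "$HOSTSFILE" | sed -n 's/[[:space:]]slots.*//p')"

# One hostfile line per worker node, eight Gaudi cards per node.
NUM_NODES=$(wc -l < "$HOSTSFILE")
CARDS_PER_NODE=8
N_CARDS=$((NUM_NODES * CARDS_PER_NODE))

echo "MASTER_ADDR=${MASTER_ADDR}"
echo "N_CARDS=${N_CARDS}"
rm -f "$HOSTSFILE"
```

With two worker replicas this yields 16 cards, which is the value passed to mpirun -np in the manifest.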
Note
PyTorch uses shared memory buffers to communicate between processes. By default, Docker containers are allocated 64MB of shared memory. When using more than one HPU, this allocation can be insufficient. Setting hostIPC: true allows re-using the host’s shared memory space inside the container.

According to Kubernetes’ backoff policy, if a failure occurs, such as worker pods failing to run, the job is automatically restarted. This is useful for resuming long-running training from a checkpoint if an error causes the job to crash. For more information, refer to Kubernetes backoff failure policy.
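To see how much shared memory a container actually has, which relates to the hostIPC note above, you can inspect /dev/shm from inside the pod. This is a rough sketch assuming a Linux container that ships df and awk; the function name shm_size_mib is hypothetical.

```shell
# Hypothetical helper: print the size of a mount point in MiB.
# Defaults to /dev/shm, the shared memory mount.
shm_size_mib() {
  local path="${1:-/dev/shm}"
  # -P forces single-line POSIX output; column 2 is the size in 1024-byte blocks.
  df -Pk "$path" | awk 'NR==2 {print int($2 / 1024)}'
}
```

Run shm_size_mib inside the container, or call kubectl exec <pod-name> -- df -Pk /dev/shm from outside. A result of about 64 suggests the container is still using the default allocation.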
Run the job:
kubectl apply -f mpijob-mnist.yaml
Check the pod status:
kubectl get pods -A
Retrieve the name of the pod and see the results:
kubectl logs <pod-name>
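For an MPIJob, the training output is collected by mpirun in the launcher pod, so that is usually the pod whose logs you want. The sketch below looks the pod up by name. It assumes launcher pods are named <mpijob-name>-launcher-<suffix>, which may differ between MPI Operator versions, and the function name mpijob_launcher_logs is hypothetical.

```shell
# Hypothetical helper: print the logs of an MPIJob's launcher pod.
# Usage: mpijob_launcher_logs <mpijob-name>
mpijob_launcher_logs() {
  local job="$1"
  local pod

  # Assumption: launcher pod names start with "<mpijob-name>-launcher".
  # "kubectl get pods -o name" prints entries of the form "pod/<name>".
  pod="$(kubectl get pods -o name | grep "^pod/${job}-launcher" | head -n 1)"

  if [ -z "$pod" ]; then
    echo "no launcher pod found for ${job}" >&2
    return 1
  fi

  kubectl logs "$pod"
}
```

For the job above, the call would be mpijob_launcher_logs mnist-run. Add a namespace flag to both kubectl calls if the job does not run in the current namespace.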