Intel Gaudi Device Plugin for Kubernetes
This is a Kubernetes device plugin implementation that enables the registration of the Intel® Gaudi® AI accelerator in a container cluster for compute workloads. With the appropriate hardware and this plugin deployed in your Kubernetes cluster, you can run jobs on the Gaudi devices.
The Intel Gaudi device plugin for Kubernetes is a DaemonSet that automatically:
Registers Gaudi devices in your Kubernetes cluster.
Keeps track of device health.
Prerequisites
Intel Gaudi software drivers loaded on the system (see the quick check after this list). For more details, refer to the Installation Guide.
A Kubernetes version listed in the Support Matrix.
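As a quick sanity check of both prerequisites, you can run the following on a Gaudi node. hl-smi ships with the Intel Gaudi software stack and lists the visible accelerators; kubectl version reports the client and cluster versions to compare against the Support Matrix:

hl-smi
kubectl version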
Deploying Intel Gaudi Device Plugin for Kubernetes
Run the device plugin on all Gaudi nodes by deploying the following DaemonSet with the kubectl create command, using the associated .yaml file to set up the environment:

kubectl create -f https://vault.habana.ai/artifactory/docker-k8s-device-plugin/habana-k8s-device-plugin.yaml
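Optionally, you can confirm that the DaemonSet object itself was created before checking its pods. This assumes the manifest installs into the habana-system namespace, which is where the pod check below looks:

kubectl get daemonset -n habana-system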
Note
kubectl requires access to a Kubernetes cluster to run its commands. To verify that kubectl can access the cluster, run:

kubectl get pod -A

Check the device plugin deployment status by running the following command:
kubectl get pods -n habana-system
Expected result:
NAME                                       READY   STATUS    RESTARTS   AGE
habanalabs-device-plugin-daemonset-qtpnh   1/1     Running   0          2d11h
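Once the plugin pod is running, each Gaudi node should advertise its devices as an extended resource. You can verify this with standard kubectl commands (the node name below is a placeholder):

kubectl describe node <node-name> | grep habana.ai/gaudi

The habana.ai/gaudi resource should appear under both Capacity and Allocatable with the device count for that node.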
Running Gaudi Jobs Example
You can create a Kubernetes job that acquires a Gaudi device by using the resources.limits field. Below is an example using Intel Gaudi’s PyTorch container image.
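The stanza that actually requests the accelerator is small; a minimal sketch of a container spec asking for one device looks like this:

resources:
  limits:
    habana.ai/gaudi: 1

Because habana.ai/gaudi is an extended resource, the scheduler only places the pod on a node with an unallocated device, and the device plugin wires that device into the container.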
Run the job:
cat <<EOF | kubectl apply -f -
apiVersion: batch/v1
kind: Job
metadata:
  name: habanalabs-gaudi-demo
spec:
  template:
    spec:
      hostIPC: true
      restartPolicy: OnFailure
      containers:
        - name: habana-ai-base-container
          image: vault.habana.ai/gaudi-docker/1.17.1/ubuntu22.04/habanalabs/pytorch-installer-2.3.1:latest
          workingDir: /root
          command: ["hl-smi"]
          securityContext:
            capabilities:
              add: ["SYS_NICE"]
          resources:
            limits:
              habana.ai/gaudi: 1
              memory: 409Gi
              hugepages-2Mi: 95000Mi
EOF
Check the pod status:
kubectl get pods
Retrieve the name of the pod and see the results:
kubectl logs <pod-name>
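If you prefer not to copy the pod name by hand, Kubernetes labels a Job's pods with job-name, so the lookup and the log fetch can be combined into one command:

kubectl logs $(kubectl get pods --selector=job-name=habanalabs-gaudi-demo -o jsonpath='{.items[0].metadata.name}')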
Note
After setting up your Kubernetes cluster, use the Prometheus Metric Exporter to collect the Gaudi device metrics.
Running Gaudi MNIST Training Job Example
Below is an example of training an MNIST PyTorch model using Intel Gaudi’s PyTorch container image.
Create a mnist.yaml file:

apiVersion: batch/v1
kind: Job
metadata:
  name: mnist-demo
spec:
  template:
    spec:
      hostIPC: true
      restartPolicy: OnFailure
      containers:
        - name: mnist
          image: vault.habana.ai/gaudi-docker/1.17.1/ubuntu22.04/habanalabs/pytorch-installer-2.3.1:latest
          command: ["/bin/bash", "-c"]
          args:
            - >-
              git clone --branch 1.17.1 https://github.com/HabanaAI/Model-References.git /Model-References;
              MODEL_PATH=/Model-References/PyTorch/examples/computer_vision/hello_world;
              cd $MODEL_PATH;
              MNIST_CMD="python mnist.py \
                --batch-size=64 \
                --epochs=1 \
                --lr=1.0 \
                --gamma=0.7 \
                --hpu";
              mpirun -np 8 \
                --allow-run-as-root \
                --bind-to core \
                --map-by ppr:4:socket:PE=6 \
                --rank-by core --report-bindings \
                --tag-output \
                --merge-stderr-to-stdout --prefix $MPI_ROOT \
                $MNIST_CMD;
          securityContext:
            capabilities:
              add: ["SYS_NICE"]
          resources:
            limits:
              habana.ai/gaudi: 8
              memory: 409Gi
              hugepages-2Mi: 95000Mi
            requests:
              habana.ai/gaudi: 8
              memory: 409Gi
              hugepages-2Mi: 95000Mi
Run the job:
kubectl apply -f mnist.yaml
Check the pod status:
kubectl get pods
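Instead of polling the pod status, you can block until the job finishes; the 600-second timeout here is an arbitrary choice for this short training run:

kubectl wait --for=condition=complete --timeout=600s job/mnist-demo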
Retrieve the name of the pod and see the results:
kubectl logs <pod-name>
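When you are done, deleting the demo jobs also cleans up their pods:

kubectl delete job habanalabs-gaudi-demo mnist-demo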