Intel Gaudi Device Plugin for Kubernetes
This is a Kubernetes device plugin implementation that enables the registration of the Intel® Gaudi® AI accelerator in a container cluster for compute workloads. With the appropriate hardware and this plugin deployed in your Kubernetes cluster, you can run jobs that require a Gaudi device.
The Intel Gaudi device plugin for Kubernetes is a DaemonSet that allows you to automatically:
Enable the registration of Gaudi cards in your Kubernetes cluster.
Keep track of the health of your devices.
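Once the plugin is running, each Gaudi node advertises its cards as a Kubernetes extended resource named habana.ai/gaudi. The fragment below sketches how this appears in the node status reported by kubectl describe node; the device count of 8 is an illustrative assumption, not a guaranteed value:

```
Capacity:
  habana.ai/gaudi:  8
Allocatable:
  habana.ai/gaudi:  8
```

Pods then request these devices through the resources.limits field, as shown in the Job example later on this page.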
Prerequisites¶
The following prerequisites are required for running the Intel Gaudi device plugin:
Intel Gaudi software drivers loaded on the system
Kubernetes version listed in the Support Matrix
Deployment¶
To enable Gaudi resource support in Kubernetes, run the device plugin on all nodes equipped with Gaudi cards by deploying the following DaemonSet using the kubectl create command shown below.
Note
kubectl needs access to a Kubernetes cluster to implement these commands.
For deployment of the device plugin, the associated .yaml file can be used to set up the environment:
$ kubectl create -f https://vault.habana.ai/artifactory/docker-k8s-device-plugin/habana-k8s-device-plugin.yaml
Check the device plugin deployment status by running the following command:
$ kubectl get pods -n habana-system
Running Gaudi Jobs Example¶
You can create a Kubernetes Job that acquires a Gaudi device by using the resources.limits field. The following example uses Intel Gaudi's PyTorch container image:
$ cat <<EOF | kubectl apply -f -
apiVersion: batch/v1
kind: Job
metadata:
  name: habanalabs-gaudi-demo
spec:
  template:
    spec:
      hostIPC: true
      restartPolicy: OnFailure
      containers:
        - name: habana-ai-base-container
          image: vault.habana.ai/gaudi-docker/1.15.1/ubuntu22.04/habanalabs/pytorch-installer-2.2.0:latest
          workingDir: /root
          command: ["hl-smi"]
          securityContext:
            capabilities:
              add: ["SYS_NICE"]
          resources:
            limits:
              habana.ai/gaudi: 1
              memory: 409Gi
              hugepages-2Mi: 95000Mi
EOF
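A Job can also request more than one card by raising the habana.ai/gaudi limit. The fragment below is a sketch only; the value 8 and the accompanying memory and hugepages limits are illustrative assumptions that must be sized to the actual node:

```
          resources:
            limits:
              habana.ai/gaudi: 8      # number of Gaudi cards requested; illustrative
              memory: 409Gi           # illustrative; size to the node's capacity
              hugepages-2Mi: 95000Mi  # node must have 2Mi hugepages pre-allocated
```

Note that hugepages are a node-level resource in Kubernetes: a pod can only be scheduled onto a node where 2Mi hugepages have already been reserved at the operating-system level.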
Check the pod status:
$ kubectl get pods
Retrieve the name of the pod and see the results:
$ kubectl logs <pod-name>