Intel Gaudi Device Plugin for Kubernetes

This Kubernetes device plugin registers the Intel® Gaudi® AI accelerator in a container cluster so that it can be used for compute workloads. With the appropriate hardware and this plugin deployed in your Kubernetes cluster, you can run jobs on Gaudi devices.

The Intel Gaudi device plugin for Kubernetes is a DaemonSet that automatically registers the Gaudi devices in your Kubernetes cluster and keeps track of device health.

Prerequisites

Intel Gaudi software drivers loaded on the system. For more details, refer to the Installation Guide.
A Kubernetes version listed in the Support Matrix.

Deploying Intel Gaudi Device Plugin for Kubernetes

Run the device plugin on all Gaudi nodes by deploying its DaemonSet with the kubectl create command and the associated .yaml file:
```
kubectl create -f https://vault.habana.ai/artifactory/docker-k8s-device-plugin/habana-k8s-device-plugin.yaml
```
Note

kubectl requires access to a Kubernetes cluster to run its commands. To verify that access, run kubectl get pod -A.

Check the device plugin deployment status by running the following command:

```
kubectl get pods -n habana-system
```

Expected result:

```
NAME                                       READY   STATUS    RESTARTS   AGE
habanalabs-device-plugin-daemonset-qtpnh   1/1     Running   0          2d11h
```
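
A running plugin pod does not by itself show that devices were registered. As a minimal sketch, the functions below list the allocatable habana.ai/gaudi count per node and filter for nodes with enough devices. The function names and the KUBECTL override variable are hypothetical helpers introduced here for illustration; the custom-columns query assumes the plugin advertises the habana.ai/gaudi resource name used in the job examples on this page.

```shell
# KUBECTL may point at a different binary; it defaults to kubectl on PATH.
# (Hypothetical override variable, not part of the plugin.)
KUBECTL="${KUBECTL:-kubectl}"

# Print one line per node: node name, then the allocatable habana.ai/gaudi count.
# Nodes that do not advertise the resource show <none>.
gaudi_per_node() {
  "$KUBECTL" get nodes --no-headers \
    -o 'custom-columns=NODE:.metadata.name,GAUDI:.status.allocatable.habana\.ai/gaudi'
}

# Read "NODE COUNT" lines on stdin and print the nodes with at least $1 devices.
# Lines whose count is not a number (for example <none>) are skipped.
nodes_with_at_least() {
  awk -v min="$1" '$2 ~ /^[0-9]+$/ && $2 + 0 >= min + 0 { print $1 }'
}
```

For example, `gaudi_per_node | nodes_with_at_least 8` would print only the nodes that can host an 8-device job.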

Running Gaudi Jobs Example

You can create a Kubernetes job that acquires a Gaudi device by setting the habana.ai/gaudi resource in the container’s resources.limits field. Below is an example that runs hl-smi in Intel Gaudi’s PyTorch container image.

Run the job:

```
cat <<EOF | kubectl apply -f -
apiVersion: batch/v1
kind: Job
metadata:
  name: habanalabs-gaudi-demo
spec:
  template:
    spec:
      hostIPC: true
      restartPolicy: OnFailure
      containers:
        - name: habana-ai-base-container
          image: vault.habana.ai/gaudi-docker/1.18.0/ubuntu22.04/habanalabs/pytorch-installer-2.4.0:latest
          workingDir: /root
          command: ["hl-smi"]
          securityContext:
            capabilities:
              add: ["SYS_NICE"]
          resources:
            limits:
              habana.ai/gaudi: 1
              memory: 409Gi
              hugepages-2Mi: 95000Mi
EOF
```

Check the pod status:
```
kubectl get pods
```
Retrieve the name of the pod and see the results:
```
kubectl logs <pod-name>
```
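
Instead of polling kubectl get pods and copying the pod name by hand, the two steps above can be combined. The following is a sketch only: wait_and_show_logs and the KUBECTL override are hypothetical names introduced here, and the 600s default timeout is an arbitrary assumption. It relies on the standard kubectl wait and kubectl logs subcommands, addressing the Job directly as job/<name>.

```shell
# KUBECTL may point at a different binary; it defaults to kubectl on PATH.
# (Hypothetical override variable.)
KUBECTL="${KUBECTL:-kubectl}"

# Block until the named Job reports the Complete condition, then print its logs.
# $1 = Job name, $2 = optional timeout (default 600s, an arbitrary choice).
# Returns non-zero without printing logs if the wait fails or times out.
wait_and_show_logs() {
  job_name="$1"
  job_timeout="${2:-600s}"
  "$KUBECTL" wait --for=condition=complete "job/${job_name}" \
    --timeout="${job_timeout}" || return 1
  "$KUBECTL" logs "job/${job_name}"
}
```

Calling `wait_and_show_logs habanalabs-gaudi-demo` would print the hl-smi output once the Job finishes. Because the sketch waits only for the Complete condition, a Job that fails keeps it waiting until the timeout expires.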

Note

After setting up your Kubernetes cluster, use the Prometheus Metric Exporter to collect the Gaudi device metrics.

Running Gaudi MNIST Training Job Example

Below is an example of training an MNIST PyTorch model using Intel Gaudi’s PyTorch container image.

Create a mnist.yaml file:

```
apiVersion: batch/v1
kind: Job
metadata:
  name: mnist-demo
spec:
  template:
    spec:
      hostIPC: true
      restartPolicy: OnFailure
      containers:
        - name: mnist
          image: vault.habana.ai/gaudi-docker/1.18.0/ubuntu22.04/habanalabs/pytorch-installer-2.4.0:latest
          command: ["/bin/bash", "-c"]
          args:
            - >-
              git clone --branch 1.18.0 https://github.com/HabanaAI/Model-References.git /Model-References;
              MODEL_PATH=/Model-References/PyTorch/examples/computer_vision/hello_world;
              cd $MODEL_PATH;

              MNIST_CMD="python mnist.py \
                --batch-size=64 \
                --epochs=1 \
                --lr=1.0 \
                --gamma=0.7 \
                --hpu";

              mpirun -np 8 \
                --allow-run-as-root \
                --bind-to core \
                --map-by ppr:4:socket:PE=6 \
                --rank-by core --report-bindings \
                --tag-output \
                --merge-stderr-to-stdout --prefix $MPI_ROOT \
                $MNIST_CMD;
          securityContext:
            capabilities:
              add: ["SYS_NICE"]
          resources:
            limits:
              habana.ai/gaudi: 8
              memory: 409Gi
              hugepages-2Mi: 95000Mi
            requests:
              habana.ai/gaudi: 8
              memory: 409Gi
              hugepages-2Mi: 95000Mi
```

Run the job:
```
kubectl apply -f mnist.yaml
```
Check the pod status:
```
kubectl get pods
```
Retrieve the name of the pod and see the results:
```
kubectl logs <pod-name>
```
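
The mpirun options in mnist.yaml imply a particular CPU layout on the node. As a rough worked example, the arithmetic below derives it from -np 8 and --map-by ppr:4:socket:PE=6, reading ppr:4:socket as four ranks per socket and PE=6 as six cores bound to each rank. The variable names are illustrative, and the result is a sizing estimate rather than a documented requirement.

```shell
# Values taken from the mpirun command in mnist.yaml.
ranks=8             # mpirun -np 8 (one rank per requested Gaudi device)
ranks_per_socket=4  # --map-by ppr:4:socket
cores_per_rank=6    # PE=6

# Sockets needed to place all ranks, rounding up.
sockets_needed=$(( (ranks + ranks_per_socket - 1) / ranks_per_socket ))
# Cores bound on each fully populated socket, and across all ranks.
cores_per_socket=$(( ranks_per_socket * cores_per_rank ))
total_cores=$(( ranks * cores_per_rank ))

echo "sockets=${sockets_needed} cores_per_socket=${cores_per_socket} total_cores=${total_cores}"
```

Under this reading, the job expects a node with two CPU sockets and at least 24 cores available on each. If the node is smaller, the mapping values in mpirun would need to be adjusted accordingly.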
