Intel Gaudi Device Plugin for Kubernetes

This Kubernetes device plugin registers Intel® Gaudi® AI accelerators with a Kubernetes cluster so that they can be used for compute workloads. With the appropriate hardware and this plugin deployed in your Kubernetes cluster, you can run jobs on Gaudi devices.

The Intel Gaudi device plugin for Kubernetes is deployed as a DaemonSet that automatically:

  • Registers Gaudi devices in your Kubernetes cluster.

  • Keeps track of device health.

Prerequisites

Deploying Intel Gaudi Device Plugin for Kubernetes

  1. Run the device plugin on all Gaudi nodes by deploying the following DaemonSet with the kubectl create command, using the published .yaml manifest:

    $ kubectl create -f https://vault.habana.ai/artifactory/docker-k8s-device-plugin/habana-k8s-device-plugin.yaml
    

    Note

    kubectl requires access to a Kubernetes cluster to run its commands. To verify that kubectl can reach the cluster, run $ kubectl get pod -A.

  2. Check the device plugin deployment status by running the following command:

    $ kubectl get pods -n habana-system
    

    Expected result:

    NAME                                       READY   STATUS    RESTARTS   AGE
    habanalabs-device-plugin-daemonset-qtpnh   1/1     Running   0          2d11h
    
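Once the plugin pod is running, each Gaudi node should advertise the habana.ai/gaudi resource. The sketch below is one way to check this. It assumes the resource name used in the job example in this document, and the function names gaudi_allocatable and only_gaudi_nodes are hypothetical helpers, not part of the plugin.

```shell
# Print every node with the number of allocatable Gaudi devices it reports.
# Nodes that do not advertise the resource show <none> in the GAUDI column.
gaudi_allocatable() {
  kubectl get nodes \
    "-o=custom-columns=NAME:.metadata.name,GAUDI:.status.allocatable.habana\.ai/gaudi"
}

# Read that table on stdin and print only the names of nodes whose
# GAUDI column is a positive integer (skips the header row).
only_gaudi_nodes() {
  awk 'NR > 1 && $2 ~ /^[0-9]+$/ && $2 > 0 { print $1 }'
}

# Usage (requires cluster access):
#   gaudi_allocatable | only_gaudi_nodes
```

If a node you expect to see is missing from the output, the plugin pod on that node is a reasonable first place to look.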

Running Gaudi Jobs Example

You can create a Kubernetes job that acquires a Gaudi device by using the resources.limits field. Below is an example using Intel Gaudi’s PyTorch container image.

  1. Run the job:

    $ cat <<EOF | kubectl apply -f -
    apiVersion: batch/v1
    kind: Job
    metadata:
      name: habanalabs-gaudi-demo
    spec:
      template:
        spec:
          hostIPC: true
          restartPolicy: OnFailure
          containers:
            - name: habana-ai-base-container
              image: vault.habana.ai/gaudi-docker/1.16.2/ubuntu22.04/habanalabs/pytorch-installer-2.2.2:latest
              workingDir: /root
              # Print the Gaudi device status and exit
              command: ["hl-smi"]
              securityContext:
                capabilities:
                  add: ["SYS_NICE"]
              resources:
                limits:
                  # Number of Gaudi devices requested from the device plugin
                  habana.ai/gaudi: 1
                  memory: 409Gi
                  hugepages-2Mi: 95000Mi
    EOF
    
  2. Check the pod status:

    $ kubectl get pods
    
  3. Retrieve the name of the pod and see the results:

    $ kubectl logs <pod-name>
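
As an alternative to looking up the pod name by hand, the last two steps can be combined. This is a minimal sketch: the helper name job_logs and the 300s default timeout are assumptions chosen for illustration, and the default job name is the one from the manifest above.

```shell
# Wait until the Job reports the Complete condition, then print its logs.
#   $1: job name (defaults to the demo job from this page)
#   $2: timeout passed to kubectl wait (arbitrary default of 300s)
# The body runs in a subshell so the variables do not leak into the caller.
job_logs() (
  job="${1:-habanalabs-gaudi-demo}"
  timeout="${2:-300s}"
  kubectl wait --for=condition=complete "job/${job}" --timeout="${timeout}" &&
    kubectl logs "job/${job}"
)

# Usage (requires cluster access):
#   job_logs
#   job_logs habanalabs-gaudi-demo 600s
```

If the job does not complete within the timeout, kubectl wait exits with a non-zero status and no logs are printed; in that case kubectl get pods shows the current pod state.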