Intel Gaudi Device Plugin for Kubernetes

This is a Kubernetes device plugin implementation that enables the registration of the Intel® Gaudi® AI accelerator in a container cluster for compute workloads. With the appropriate hardware and this plugin deployed in your Kubernetes cluster, you can run jobs on Gaudi devices.

The Intel Gaudi device plugin for Kubernetes is a DaemonSet that automatically:

  • Registers Gaudi devices in your Kubernetes cluster.

  • Keeps track of device health.
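
Once the plugin is deployed, workloads acquire a device through the habana.ai/gaudi extended resource name, as the job examples later in this document show. A minimal pod-spec fragment (the container name and image here are placeholders, not part of the plugin):

```yaml
# Illustrative fragment: request one Gaudi device via the extended
# resource name that the device plugin advertises.
spec:
  containers:
    - name: gaudi-workload                # placeholder name
      image: example.com/my-image:latest  # placeholder image
      resources:
        limits:
          habana.ai/gaudi: 1              # number of Gaudi devices to acquire
```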

Prerequisites

Deploying Intel Gaudi Device Plugin for Kubernetes

  1. Deploy the device plugin on all Gaudi nodes by creating the following DaemonSet with the kubectl create command and the associated .yaml file:

    kubectl create -f https://vault.habana.ai/artifactory/docker-k8s-device-plugin/habana-k8s-device-plugin.yaml
    

    Note

    kubectl requires access to a Kubernetes cluster to run its commands. To verify access, run $ kubectl get pod -A.

  2. Check the device plugin deployment status by running the following command:

    kubectl get pods -n habana-system
    

    Expected result:

    NAME                                       READY   STATUS    RESTARTS   AGE
    habanalabs-device-plugin-daemonset-qtpnh   1/1     Running   0          2d11h
    

Running Gaudi Jobs Example

You can create a Kubernetes job that acquires a Gaudi device by using the resources.limits field. Below is an example using Intel Gaudi’s PyTorch container image.

  1. Run the job:

    cat <<EOF | kubectl apply -f -
    
    apiVersion: batch/v1
    kind: Job
    metadata:
       name: habanalabs-gaudi-demo
    spec:
       template:
          spec:
             hostIPC: true
             restartPolicy: OnFailure
             containers:
              - name: habana-ai-base-container
                image: vault.habana.ai/gaudi-docker/1.17.1/ubuntu22.04/habanalabs/pytorch-installer-2.3.1:latest
                workingDir: /root
                command: ["hl-smi"]
                securityContext:
                   capabilities:
                      add: ["SYS_NICE"]
                resources:
                   limits:
                      habana.ai/gaudi: 1
                      memory: 409Gi
                      hugepages-2Mi: 95000Mi
    EOF
    
  2. Check the pod status:

    kubectl get pods
    
  3. Retrieve the pod name and view the job output:

    kubectl logs <pod-name>
    

Note

After setting up your Kubernetes cluster, use the Prometheus Metric Exporter to collect Gaudi device metrics.

Running Gaudi MNIST Training Job Example

Below is an example of training an MNIST PyTorch model using Intel Gaudi’s PyTorch container image.

  1. Create a mnist.yaml file:

    apiVersion: batch/v1
    kind: Job
    metadata:
      name: mnist-demo
    spec:
      template:
        spec:
          hostIPC: true
          restartPolicy: OnFailure
          containers:
            - name: mnist
              image: vault.habana.ai/gaudi-docker/1.17.1/ubuntu22.04/habanalabs/pytorch-installer-2.3.1:latest
              command: ["/bin/bash", "-c"]
              args:
                - >-
                  git clone --branch 1.17.1 https://github.com/HabanaAI/Model-References.git /Model-References;
                  MODEL_PATH=/Model-References/PyTorch/examples/computer_vision/hello_world;
                  cd $MODEL_PATH;
    
                  MNIST_CMD="python mnist.py \
                    --batch-size=64 \
                    --epochs=1 \
                    --lr=1.0 \
                    --gamma=0.7 \
                    --hpu";
    
                  mpirun -np 8 \
                    --allow-run-as-root \
                    --bind-to core \
                    --map-by ppr:4:socket:PE=6 \
                    --rank-by core --report-bindings \
                    --tag-output \
                    --merge-stderr-to-stdout --prefix $MPI_ROOT \
                    $MNIST_CMD;
              securityContext:
                capabilities:
                  add: ["SYS_NICE"]
              resources:
                limits:
                  habana.ai/gaudi: 8
                  memory: 409Gi
                  hugepages-2Mi: 95000Mi
                requests:
                  habana.ai/gaudi: 8
                  memory: 409Gi
                  hugepages-2Mi: 95000Mi
    
  2. Run the job:

    kubectl apply -f mnist.yaml
    
  3. Check the pod status:

    kubectl get pods
    
  4. Retrieve the pod name and view the training output:

    kubectl logs <pod-name>
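
A note on the resources sections in the manifests above: Kubernetes requires hugepages requests to equal their limits, which is why hugepages-2Mi appears identically under both limits and requests in the MNIST job (in the first job example, which sets only limits, the requests default to the limits). The pattern in isolation:

```yaml
# Hugepages resources must have requests equal to limits;
# unequal values are rejected by the API server.
resources:
  limits:
    hugepages-2Mi: 95000Mi
  requests:
    hugepages-2Mi: 95000Mi
```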