Habana Device Plugin for Kubernetes

This is a Kubernetes device plugin implementation that registers Habana devices in a container cluster for compute workloads. With the appropriate hardware and this plugin deployed in your Kubernetes cluster, you can run jobs that require a Habana device.

The Habana device plugin for Kubernetes is a DaemonSet that allows you to automatically:

  • Enable the registration of Habana devices in your Kubernetes cluster.

  • Keep track of the health of your device.

Prerequisites

The following prerequisites are required to run the Habana device plugin:

  • SynapseAI® Software drivers loaded on the system

  • 1.10 <= Kubernetes version <= 1.24
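
As a minimal sketch of the version requirement above, the following shell function checks whether a version string falls within the supported range. The helper name k8s_version_supported is hypothetical and not part of the plugin or kubectl.

```shell
# Hypothetical helper: succeeds if a version such as "v1.23.4" lies in 1.10..1.24.
k8s_version_supported() {
  ver="${1#v}"                 # drop a leading "v" if present
  major="${ver%%.*}"           # text before the first dot
  rest="${ver#*.}"             # text after the first dot
  minor="${rest%%.*}"          # text before the next dot
  minor="${minor%%[!0-9]*}"    # strip suffixes such as the "+" in "24+"
  [ "$major" -eq 1 ] && [ "$minor" -ge 10 ] && [ "$minor" -le 24 ]
}
```

The version string would typically come from the server gitVersion field reported by kubectl version -o json.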

Deployment

Enable Habana® Gaudi® device resource support in Kubernetes.

Run the device plugin on every node equipped with a Habana device by deploying the following DaemonSet with the kubectl create command.

Note

kubectl needs access to a Kubernetes cluster to run these commands.

To deploy the device plugin, use the associated .yaml file to set up the environment:

$ kubectl create -f https://vault.habana.ai/artifactory/docker-k8s-device-plugin/habana-k8s-device-plugin.yaml

Check the device plugin deployment status by running the following command:

$ kubectl get pods -n habana-system
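
Once the plugin pods are running, it can also help to confirm that the nodes advertise the habana.ai/gaudi resource. The sketch below wraps a kubectl custom-columns query in a function. The name gaudi_allocatable is hypothetical, and the command assumes kubectl is configured for the target cluster.

```shell
# Sketch: list each node with its allocatable Gaudi device count.
# gaudi_allocatable is a hypothetical helper name. The backslash escapes the
# dot inside the "habana.ai/gaudi" resource key so it is not read as a path separator.
gaudi_allocatable() {
  kubectl get nodes \
    -o 'custom-columns=NODE:.metadata.name,GAUDI:.status.allocatable.habana\.ai/gaudi'
}
```

Nodes where the plugin has not registered any device are expected to show no value in the GAUDI column.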

Running Gaudi Jobs Example

You can create a Kubernetes Pod that acquires a Gaudi device by using the resources.limits field. The following examples use Habana’s TensorFlow and PyTorch container images:

$ cat <<EOF | kubectl apply -f -

apiVersion: v1
kind: Pod
metadata:
   name: habanalabs-gaudi-demo
spec:
   containers:
      - name: habana-ai-base-container
        image: vault.habana.ai/gaudi-docker/1.11.0/ubuntu20.04/habanalabs/tensorflow-installer-tf-cpu-2.12.1:latest
        workingDir: /root
        command: ["hl-smi"]
        securityContext:
            capabilities:
               add: ["SYS_NICE"]
        resources:
            limits:
               habana.ai/gaudi: 1
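EOF

The equivalent Pod using the PyTorch container image (delete the TensorFlow Pod first, because both examples use the same Pod name):

$ cat <<EOF | kubectl apply -f -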

apiVersion: v1
kind: Pod
metadata:
   name: habanalabs-gaudi-demo
spec:
   hostIPC: true
   containers:
      - name: habana-ai-base-container
        image: vault.habana.ai/gaudi-docker/1.11.0/ubuntu20.04/habanalabs/pytorch-installer-2.0.1
        workingDir: /root
        command: ["hl-smi"]
        securityContext:
            capabilities:
               add: ["SYS_NICE"]
        resources:
            limits:
               habana.ai/gaudi: 1
EOF

Check the pod status by running the following command:

$ kubectl get pods
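
Because the example Pod runs hl-smi as its command, the device listing ends up in the Pod's logs. The following sketch prints those logs and then removes the Pod. The helper name show_gaudi_demo is hypothetical, and the default Pod name is taken from the manifests above.

```shell
# Sketch: print the hl-smi output of the demo Pod, then delete the Pod.
# show_gaudi_demo is a hypothetical helper; pass another Pod name as $1 if needed.
show_gaudi_demo() {
  pod="${1:-habanalabs-gaudi-demo}"
  kubectl logs "$pod" && kubectl delete pod "$pod"
}
```

Run it only after the Pod's container has started, since kubectl logs fails for a Pod that is still pending.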