Intel Gaudi Base Operator for Kubernetes

Intel® Gaudi® Base Operator for Kubernetes automates the management of all necessary Intel Gaudi software components on a Kubernetes cluster. These include drivers, Kubernetes device plugin, container runtime, feature discovery, and monitoring tools. This document provides instructions for deploying the Operator.

Prerequisites

  • kubectl and helm CLIs installed.

  • Kubernetes version listed in the Support Matrix.
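
As a quick sanity check before starting, the prerequisites above can be confirmed from the shell. The following is a minimal sketch; the helper name check_clis is hypothetical and not part of any Intel Gaudi tooling.

```shell
# Report whether each required CLI is available on PATH.
# Returns 0 only if every command passed as an argument is found.
check_clis() {
  local cli missing=0
  for cli in "$@"; do
    if command -v "$cli" >/dev/null 2>&1; then
      echo "found: $cli"
    else
      echo "missing: $cli" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

For example, running the following checks both tools and then prints their versions, so that the Kubernetes version can be compared against the Support Matrix:

    check_clis kubectl helm && kubectl version && helm version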

Deploying the Operator

Install the Operator on a cluster by deploying a Helm chart:

  1. Create the Operator namespace:

    kubectl create namespace habana-ai-operator
    kubectl label namespace habana-ai-operator pod-security.kubernetes.io/enforce=privileged --overwrite
    kubectl label namespace habana-ai-operator pod-security.kubernetes.io/audit=privileged --overwrite
    kubectl label namespace habana-ai-operator pod-security.kubernetes.io/warn=privileged --overwrite
    
  2. Install the Helm chart:

    helm repo add gaudi-helm https://vault.habana.ai/artifactory/api/helm/gaudi-helm
    helm repo update
    helm install habana-ai-operator gaudi-helm/habana-ai-operator --version 1.17.1-40 -n habana-ai-operator
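
Once the chart is installed, it can be useful to confirm that the release exists and that the Operator pods have started. The sketch below assumes the namespace and release name used in the commands above; the function name verify_operator_install is hypothetical.

```shell
# Assumed names: these match the namespace and release used in the install commands.
NAMESPACE="habana-ai-operator"
RELEASE="habana-ai-operator"

verify_operator_install() {
  # Show the release status recorded by Helm.
  helm status "$RELEASE" -n "$NAMESPACE" || return 1
  # Confirm the Pod Security labels were applied to the namespace.
  kubectl get namespace "$NAMESPACE" --show-labels || return 1
  # List the Operator pods so their state can be inspected.
  kubectl get pods -n "$NAMESPACE" -o wide
}
```

Call verify_operator_install after the install completes. If the pods are not yet running, rerunning it after a short wait is usually enough to see them settle.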
    

DaemonSets Installation

The Operator installs the following six DaemonSets on a Kubernetes cluster:

  • Intel Gaudi Feature Discovery - Labels each Kubernetes node with information about Gaudi device availability and the driver version on that node. These labels help deploy the other DaemonSets to the appropriate nodes. The DaemonSet relies on a node_selector field specified in the cluster policy, without additional label selectors or affinities.

  • Driver - Loads the driver on the cluster node by running the habanalabs-installer.sh installation script. The DaemonSet has an affinity for a label provided by the Intel Gaudi Feature Discovery DaemonSet, ensuring that the driver is installed only on nodes with Gaudi devices.

  • Intel Gaudi Container Runtime - Exposes the Intel Gaudi network and uverbs interfaces to the pod. The DaemonSet copies runtime binaries to the host, configures the engine’s default runtime, and restarts the container engine to load the new configuration. This DaemonSet has an affinity for labels indicating that a Gaudi device is available and the driver has been loaded on the node.

  • Intel Gaudi Device Plugin for Kubernetes - Lists the available Gaudi devices and advertises them to the kubelet as the habana.ai/gaudi resource, which workloads can then request. This DaemonSet has an affinity for labels indicating that a Gaudi device is available and the driver has been loaded on the node. For more details, see Intel Gaudi Device Plugin for Kubernetes.

  • Prometheus Metric Exporter - Exports metrics about the node and the Gaudi devices on it. This DaemonSet has an affinity for labels indicating that a Gaudi device is available and the driver has been loaded on the node. For more details, see Prometheus Metric Exporter.

  • BMC Exporter - Exports metrics by using the Redfish protocol to scrape the node’s BMC. This DaemonSet has an affinity for labels indicating that a Gaudi device is available and the driver has been loaded on the node. For more details, see BMC Exporter User Guide.
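
To illustrate how a workload consumes the habana.ai/gaudi resource exposed by the device plugin, the sketch below prints a minimal pod manifest that requests Gaudi devices. The function name gaudi_pod_manifest, the pod name, the image, and the sleep command are all placeholders chosen for illustration; substitute an image appropriate for your environment.

```shell
# Print a minimal pod manifest that requests Gaudi devices.
# Arguments: pod name, container image, number of Gaudi devices (default 1).
# The pod name, image, and command are hypothetical placeholders.
gaudi_pod_manifest() {
  local name="$1" image="$2" count="${3:-1}"
  cat <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: ${name}
spec:
  restartPolicy: Never
  containers:
    - name: ${name}
      image: ${image}
      command: ["sleep", "3600"]
      resources:
        limits:
          habana.ai/gaudi: ${count}
EOF
}
```

The output can be piped straight to kubectl, for example:

    gaudi_pod_manifest gaudi-test <your-image> 1 | kubectl apply -f -

Because habana.ai/gaudi is an extended resource, it is requested under limits; Kubernetes sets the matching request automatically.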

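To deploy these DaemonSets manually, create a clusterpolicy.yaml file with the following content. Values such as cpu_str_or_int_optional and container_engine_name are placeholders: replace them with values for your cluster. Placeholders ending in _optional mark optional fields.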

apiVersion: habanalabs.habana.ai/v1
kind: ClusterPolicy
metadata:
  name: habana-ai
spec:
  image_registry: vault.habana.ai
  driver:
    driver_loader:
      images:
        - os: ubuntu_22.04
          repository: vault.habana.ai/habana-ai-operator/driver/ubuntu22.04/driver-installer
          tag: 1.17.1-40
        - os: rhel_8.6
          repository: vault.habana.ai/habana-ai-operator/driver/rhel8.6/driver-installer
          tag: 1.17.1-40
        - os: rhel_9.2
          repository: vault.habana.ai/habana-ai-operator/driver/rhel9.2/driver-installer
          tag: 1.17.1-40
        - os: tencentos_3.1
          repository: vault.habana.ai/habana-ai-operator/driver/tencentos3.1/driver-installer
          tag: 1.17.1-40
        - os: amzn_2
          repository: vault.habana.ai/habana-ai-operator/driver/amzn2/driver-installer
          tag: 1.17.1-40
      resources:
        limits:
          cpu: cpu_str_or_int_optional
          memory: memory_str_optional
        requests:
          cpu: cpu_str_or_int_optional
          memory: memory_str_optional
      hugepages: hugepages_number_int_optional
      external_ports: turn_on_external_port_bool_optional
    driver_runner:
      image:
        repository: vault.habana.ai/habana-ai-operator/driver/ubuntu22.04/driver-installer
        tag: 1.17.1-40
      resources:
        limits:
          cpu: cpu_str_or_int_optional
          memory: memory_str_optional
        requests:
          cpu: cpu_str_or_int_optional
          memory: memory_str_optional
  device_plugin:
    image:
      repository: vault.habana.ai/docker-k8s-device-plugin/docker-k8s-device-plugin
      tag: 1.17.1
    resources:
      limits:
        cpu: cpu_str_or_int_optional
        memory: memory_str_optional
      requests:
        cpu: cpu_str_or_int_optional
        memory: memory_str_optional
  runtime:
    runner:
      image:
        repository: vault.habana.ai/habana-ai-operator/habana-container-runtime
        tag: 1.17.1-40
      resources:
        limits:
          cpu: cpu_str_or_int_optional
          memory: memory_str_optional
        requests:
          cpu: cpu_str_or_int_optional
          memory: memory_str_optional
    configuration:
      container_engine: container_engine_name
      engine_container_runtime_configuration: container_engine_configuration_optional
      habana_container_runtime_configuration: container_runtime_configuration_optional
  metric_exporter:
    runner:
      image:
        repository: vault.habana.ai/gaudi-metric-exporter/metric-exporter
        tag: 1.17.1-40
      resources:
        limits:
          cpu: cpu_str_or_int_optional
          memory: memory_str_optional
        requests:
          cpu: cpu_str_or_int_optional
          memory: memory_str_optional
    port: 41611
    interval: 20
  feature_discovery:
    runner:
      image:
        repository: vault.habana.ai/habana-ai-operator/habanalabs-feature-discovery
        tag: 1.17.1-40
      resources:
        limits:
          cpu: cpu_str_or_int_optional
          memory: memory_str_optional
        requests:
          cpu: cpu_str_or_int_optional
          memory: memory_str_optional
    nfd_plugin: boolean_nfd_installed
  bmc_monitoring:
    image:
      repository: vault.habana.ai/habana-bmc-exporter/bmc-exporter
      tag: 1.17.1-40
    resources:
      limits:
        cpu: cpu_str_or_int_optional
        memory: memory_str_optional
      requests:
        cpu: cpu_str_or_int_optional
        memory: memory_str_optional

Apply the YAML file by running the following command:

kubectl apply -f clusterpolicy.yaml
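
After the policy is applied, a few read-only commands can help confirm that the DaemonSets were created and that nodes advertise Gaudi devices. This is a minimal sketch: the function name verify_cluster_policy is hypothetical, and the DaemonSets are listed across all namespaces because the namespace they land in is not stated here.

```shell
# File name taken from the step above.
POLICY_FILE="clusterpolicy.yaml"

verify_cluster_policy() {
  # Read back the object defined in the file; this avoids guessing the CRD's resource name.
  kubectl get -f "$POLICY_FILE" || return 1
  # List DaemonSets in all namespaces; the DaemonSets described earlier should appear.
  kubectl get daemonsets --all-namespaces || return 1
  # Show how many Gaudi devices each node advertises to the scheduler.
  kubectl get nodes -o custom-columns='NODE:.metadata.name,GAUDI:.status.allocatable.habana\.ai/gaudi'
}
```

Run verify_cluster_policy from the directory that contains clusterpolicy.yaml. A node that shows <none> in the GAUDI column is not advertising any Gaudi devices, which may mean that the driver or device plugin DaemonSet has not been scheduled on it.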