Intel Gaudi Base Operator for Kubernetes

Intel Gaudi Base Operator for Kubernetes automates the management of the Intel Gaudi software components required on a Kubernetes cluster: the drivers, the Kubernetes device plugin, the container runtime, feature discovery, and the monitoring tools. For a full list of the DaemonSets that Intel Gaudi Base Operator installs on a Kubernetes cluster, see Intel Gaudi Base Operator DaemonSets.

Deploying Intel Gaudi Base Operator

This section describes how to install Intel Gaudi Base Operator and deploy the ClusterPolicy. You can install Intel Gaudi Base Operator with a Helm chart, or on Red Hat OpenShift Container Platform through the console or the CLI, as described below.

Using Helm Chart

  1. Create the Operator namespace:

    kubectl create namespace habana-ai-operator
    kubectl label namespace habana-ai-operator pod-security.kubernetes.io/enforce=privileged --overwrite
    kubectl label namespace habana-ai-operator pod-security.kubernetes.io/audit=privileged --overwrite
    kubectl label namespace habana-ai-operator pod-security.kubernetes.io/warn=privileged --overwrite
    
  2. Install the Helm chart:

    helm repo add gaudi-helm https://vault.habana.ai/artifactory/api/helm/gaudi-helm
    helm repo update
    helm install habana-ai-operator gaudi-helm/habana-ai-operator --version 1.19.1-26 -n habana-ai-operator
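
After the two steps above, a quick check can confirm that the namespace labels, the Helm release, and the Operator pods are in place. The following is a minimal sketch: the function name check_helm_install is hypothetical, and it assumes kubectl and helm point at the cluster used in the steps above.

```shell
# Hypothetical helper: inspect the result of the Helm-based installation.
# Takes the Operator namespace as an optional argument.
check_helm_install() {
  gaudi_ns="${1:-habana-ai-operator}"
  # The three pod-security labels from step 1 should read "privileged".
  kubectl get namespace "$gaudi_ns" --show-labels || return 1
  # The release name matches the one passed to "helm install" in step 2.
  helm status habana-ai-operator -n "$gaudi_ns" || return 1
  # List the Operator pods so their status can be inspected.
  kubectl get pods -n "$gaudi_ns"
}
```

Run check_helm_install with no arguments to inspect the default habana-ai-operator namespace. The function stops at the first command that fails.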
    

Using Red Hat OpenShift Container Platform Console

  1. Go to Operators.

  2. Click “OperatorHub”.

  3. In the All Items field, search for Intel Gaudi AI accelerator.

  4. Click “Install”.

    ../../../_images/Intel_Gaudi_Base_Operator_Installation1.png

Using CLI

  1. Create a habana-ai-operator-install.yaml file containing the following:

    ---
    apiVersion: v1
    kind: Namespace
    metadata:
      name: habana-ai-operator
    ---
    apiVersion: operators.coreos.com/v1
    kind: OperatorGroup
    metadata:
      name: habana-ai-operator
      namespace: habana-ai-operator
    spec:
      targetNamespaces:
        - habana-ai-operator
    ---
    apiVersion: operators.coreos.com/v1alpha1
    kind: Subscription
    metadata:
      name: habana-ai-operator
      namespace: habana-ai-operator
    spec:
      channel: stable
      installPlanApproval: Automatic
      name: habana-ai-operator
      source: certified-operators
      sourceNamespace: openshift-marketplace
    
  2. Apply the YAML file:

    oc apply -f habana-ai-operator-install.yaml
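
Once the manifest is applied, the Subscription and the resulting ClusterServiceVersion (CSV) can be inspected from the CLI. The following is a minimal sketch: the function name check_olm_install is hypothetical, and it assumes oc is logged in to the cluster where the manifest was applied.

```shell
# Hypothetical helper: inspect the result of the CLI-based installation.
# Takes the Operator namespace as an optional argument.
check_olm_install() {
  gaudi_ns="${1:-habana-ai-operator}"
  # The Subscription created by habana-ai-operator-install.yaml.
  # The fully qualified resource name avoids clashes with other
  # "subscription" resource types that may exist on the cluster.
  oc get subscriptions.operators.coreos.com habana-ai-operator -n "$gaudi_ns" || return 1
  # The CSV is created once the Subscription is resolved.
  oc get csv -n "$gaudi_ns"
}
```

Run check_olm_install after step 2. The CSV may take some time to appear while the install plan is processed.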
    

Creating the ClusterPolicy Instance

After you deploy Intel Gaudi Base Operator, create the ClusterPolicy instance using either the Red Hat OpenShift Container Platform console or the CLI, as described below. The ClusterPolicy is the main Custom Resource Definition (CRD) of Intel Gaudi Base Operator. It sets up and manages Gaudi hardware across the nodes of a cluster, for example by installing drivers, enabling metrics, and defining default runtime behavior. The table below describes the fields of the ClusterPolicy instance and indicates which of them are required:

| Component | Field | Description | Scheme | Required |
|---|---|---|---|---|
| image_registry | | Intel Gaudi registry URL. | String | Yes |
| driver.driver_loader | images.osname_osversion.repository | Driver installer repository path. | String | Yes |
| driver.driver_loader | images.osname_osversion.tag | Driver version. | String | Yes |
| driver.driver_loader | resources.limits.cpu | CPU resource limit. | String/Integer | No |
| driver.driver_loader | resources.limits.memory | Memory resource limit. | String | No |
| driver.driver_loader | resources.requests.cpu | CPU resource request. | String/Integer | No |
| driver.driver_loader | resources.requests.memory | Memory resource request. | String | No |
| driver.driver_loader | repo_server | Driver packages repository server. | String | No |
| driver.driver_loader | repo_path | Path to the repository containing driver packages. | String | No |
| driver.driver_loader | mlnx_ofed_repo_path | Path to MLNX_OFED packages. | String | No |
| driver.driver_loader | mlnx_ofed_version | Name of the MLNX_OFED archive inside the repository. | String | No |
| driver.driver_loader | hugepages | Number of hugepages. | Integer | No |
| driver.driver_loader | external_ports | Enable/disable external ports. | Boolean | No |
| driver.driver_loader | firmware_flush | Flush firmware on Gaudi cards. | Boolean | No |
| driver.driver_runner | image.repository | Driver runner repository path. | String | Yes |
| driver.driver_runner | image.tag | Driver runner version. | String | Yes |
| driver.driver_runner | resources.limits.cpu | CPU resource limit. | String/Integer | No |
| driver.driver_runner | resources.limits.memory | Memory resource limit. | String | No |
| driver.driver_runner | resources.requests.cpu | CPU resource request. | String/Integer | No |
| driver.driver_runner | resources.requests.memory | Memory resource request. | String | No |
| device_plugin | image.repository | Device plugin repository path. | String | Yes |
| device_plugin | image.tag | Device plugin version. | String | Yes |
| device_plugin | resources.limits.cpu | CPU resource limit. | String/Integer | No |
| device_plugin | resources.limits.memory | Memory resource limit. | String | No |
| device_plugin | resources.requests.cpu | CPU resource request. | String/Integer | No |
| device_plugin | resources.requests.memory | Memory resource request. | String | No |
| runtime.runner | image.repository | habana-container-runtime repository path. | String | Yes |
| runtime.runner | image.tag | habana-container-runtime version. | String | Yes |
| runtime.runner | resources.limits.cpu | CPU resource limit. | String/Integer | No |
| runtime.runner | resources.limits.memory | Memory resource limit. | String | No |
| runtime.runner | resources.requests.cpu | CPU resource request. | String/Integer | No |
| runtime.runner | resources.requests.memory | Memory resource request. | String | No |
| runtime.configuration | container_engine | Container engine to use. | String | No |
| runtime.configuration | engine_container_runtime_configuration | Container engine configuration. | String | No |
| runtime.configuration | habana_container_runtime_configuration | habana-container-runtime configuration. | String | No |
| metric_exporter | runner.image.repository | Metric exporter repository path. | String | Yes |
| metric_exporter | runner.image.tag | Metric exporter version. | String | Yes |
| metric_exporter | runner.resources.limits.cpu | CPU resource limit. | String/Integer | No |
| metric_exporter | runner.resources.limits.memory | Memory resource limit. | String | No |
| metric_exporter | runner.resources.requests.cpu | CPU resource request. | String/Integer | No |
| metric_exporter | runner.resources.requests.memory | Memory resource request. | String | No |
| metric_exporter | port | Metric exporter port. | Integer | No |
| metric_exporter | interval | Metric collection interval. | Integer | No |
| feature_discovery | runner.image.repository | Feature discovery repository path. | String | Yes |
| feature_discovery | runner.image.tag | Feature discovery version. | String | Yes |
| feature_discovery | runner.resources.limits.cpu | CPU resource limit. | String/Integer | No |
| feature_discovery | runner.resources.limits.memory | Memory resource limit. | String | No |
| feature_discovery | runner.resources.requests.cpu | CPU resource request. | String/Integer | No |
| feature_discovery | runner.resources.requests.memory | Memory resource request. | String | No |
| feature_discovery | nfd_plugin | Enable/disable feature discovery as a local NFD plugin. | Boolean | No |
| bmc_monitoring | image.repository | BMC exporter repository path. | String | Yes |
| bmc_monitoring | image.tag | BMC exporter version. | String | Yes |
| bmc_monitoring | resources.limits.cpu | CPU resource limit. | String/Integer | No |
| bmc_monitoring | resources.limits.memory | Memory resource limit. | String | No |
| bmc_monitoring | resources.requests.cpu | CPU resource request. | String/Integer | No |
| bmc_monitoring | resources.requests.memory | Memory resource request. | String | No |
| node_selector | Key-Value list | Newline-separated key-value pairs to set as the Operator’s node selector. | String | No |

Note

  • firmware_flush is supported only on PCI-based nodes with disabled eROM write protection. For more details, refer to eROM Upgrade.

  • For more information about CPU and memory requests and limits, see Kubernetes documentation.
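
As a starting point, a manifest can be limited to the fields that the table marks as required. The sketch below writes such a file for Ubuntu 22.04 nodes. The file name clusterpolicy-minimal.yaml is hypothetical, and the image paths and tags are copied from the full clusterpolicy.yaml example in the Using CLI section. Whether the Operator accepts a policy with every optional field omitted is an assumption to verify on your cluster.

```shell
# Write a ClusterPolicy that sets only the fields marked "Yes" in the
# Required column (hypothetical file name; Ubuntu 22.04 driver image only).
cat > clusterpolicy-minimal.yaml <<'EOF'
apiVersion: habanalabs.habana.ai/v1
kind: ClusterPolicy
metadata:
  name: habana-ai
spec:
  image_registry: vault.habana.ai
  driver:
    driver_loader:
      images:
        ubuntu_22.04:
          repository: vault.habana.ai/habana-ai-operator/driver/ubuntu22.04/driver-installer
          tag: 1.19.1-26
    driver_runner:
      image:
        repository: vault.habana.ai/habana-ai-operator/driver/ubuntu22.04/driver-installer
        tag: 1.19.1-26
  device_plugin:
    image:
      repository: vault.habana.ai/docker-k8s-device-plugin/docker-k8s-device-plugin
      tag: 1.19.1
  runtime:
    runner:
      image:
        repository: vault.habana.ai/habana-ai-operator/habana-container-runtime
        tag: 1.19.1-26
  metric_exporter:
    runner:
      image:
        repository: vault.habana.ai/gaudi-metric-exporter/metric-exporter
        tag: 1.19.1-26
  feature_discovery:
    runner:
      image:
        repository: vault.habana.ai/habana-ai-operator/habanalabs-feature-discovery
        tag: 1.19.1-26
  bmc_monitoring:
    image:
      repository: vault.habana.ai/habana-bmc-exporter/bmc-exporter
      tag: 1.19.1-26
EOF
```

Optional fields such as resources, hugepages, or node_selector can then be added one at a time as needed.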

Using Red Hat OpenShift Container Platform Console

  1. Go to Operators.

  2. Click “Installed Operators”.

  3. In the Name field, define the instance as habana-ai.

    ../../../_images/Create_Clusterpolicy_Instance_1.png

  4. Fill in the “bmc_monitoring” fields:

    ../../../_images/Create_Clusterpolicy_Instance_2.png

  5. Fill in the “device_plugin” fields:

    ../../../_images/Create_Clusterpolicy_Instance_3.png

  6. Fill in the “driver.driver_runner” fields:

    ../../../_images/Create_Clusterpolicy_Instance_4.png

  7. Fill in the “driver.driver_loader” fields:

    ../../../_images/Create_Clusterpolicy_Instance_5.png

  8. Fill in the “feature_discovery” fields:

    ../../../_images/Create_Clusterpolicy_Instance_6.png

  9. Fill in the “metric_exporter” fields:

    ../../../_images/Create_Clusterpolicy_Instance_7.png

  10. Fill in the “runtime.configuration” fields:

    ../../../_images/Create_Clusterpolicy_Instance_8.png

  11. Fill in the “runtime.runner” fields:

    ../../../_images/Create_Clusterpolicy_Instance_9.png

  12. Set “image_registry” to vault.habana.ai.

  13. Click “Create”:

    ../../../_images/Create_Clusterpolicy_Instance_10.png

Note

Some fields are not available in the console “Form view” tab. Configure them in clusterpolicy.yaml or by switching to the “YAML view” tab.

Using CLI

  1. Create a clusterpolicy.yaml file containing the following:

    apiVersion: habanalabs.habana.ai/v1
    kind: ClusterPolicy
    metadata:
      name: habana-ai
    spec:
      image_registry: vault.habana.ai
      driver:
        driver_loader:
          images:
            ubuntu_22.04:
              repository: vault.habana.ai/habana-ai-operator/driver/ubuntu22.04/driver-installer
              tag: 1.19.1-26
            rhel_8.6:
              repository: vault.habana.ai/habana-ai-operator/driver/rhel8.6/driver-installer
              tag: 1.19.1-26
            rhel_9.2:
              repository: vault.habana.ai/habana-ai-operator/driver/rhel9.2/driver-installer
              tag: 1.19.1-26
            rhel_9.4:
              repository: vault.habana.ai/habana-ai-operator/driver/rhel9.4/driver-installer
              tag: 1.19.1-26
            tencentos_3.1:
              repository: vault.habana.ai/habana-ai-operator/driver/tencentos3.1/driver-installer
              tag: 1.19.1-26
          resources:
            limits:
              cpu: cpu_str_or_int_optional
              memory: memory_str_optional
            requests:
              cpu: cpu_str_or_int_optional
              memory: memory_str_optional
          repo_server: vault.habana.ai
          repo_path: artifactory/gaudi-installer/repos
          mlnx_ofed_repo_path: artifactory/gaudi-installer/deps
          mlnx_ofed_version: mlnx-ofed-5.8-2.0.3.0-rhel8.4-x86_64.tar.gz
          hugepages: hugepages_number_int_optional
          external_ports: turn_on_external_port_bool_optional
          firmware_flush: flush_firmware_on_the_gaudi_cards_bool_optional
        driver_runner:
          image:
            repository: vault.habana.ai/habana-ai-operator/driver/ubuntu22.04/driver-installer
            tag: 1.19.1-26
          resources:
            limits:
              cpu: cpu_str_or_int_optional
              memory: memory_str_optional
            requests:
              cpu: cpu_str_or_int_optional
              memory: memory_str_optional
      device_plugin:
        image:
          repository: vault.habana.ai/docker-k8s-device-plugin/docker-k8s-device-plugin
          tag: 1.19.1
        resources:
          limits:
            cpu: cpu_str_or_int_optional
            memory: memory_str_optional
          requests:
            cpu: cpu_str_or_int_optional
            memory: memory_str_optional
      runtime:
        runner:
          image:
            repository: vault.habana.ai/habana-ai-operator/habana-container-runtime
            tag: 1.19.1-26
          resources:
            limits:
              cpu: cpu_str_or_int_optional
              memory: memory_str_optional
            requests:
              cpu: cpu_str_or_int_optional
              memory: memory_str_optional
        configuration:
          container_engine: one_of_containerd_docker_crio
          engine_container_runtime_configuration: container_engine_configuration_optional
          habana_container_runtime_configuration: container_runtime_configuration_optional
      metric_exporter:
        runner:
          image:
            repository: vault.habana.ai/gaudi-metric-exporter/metric-exporter
            tag: 1.19.1-26
          resources:
            limits:
              cpu: cpu_str_or_int_optional
              memory: memory_str_optional
            requests:
              cpu: cpu_str_or_int_optional
              memory: memory_str_optional
        port: 41611
        interval: 20
      feature_discovery:
        runner:
          image:
            repository: vault.habana.ai/habana-ai-operator/habanalabs-feature-discovery
            tag: 1.19.1-26
          resources:
            limits:
              cpu: cpu_str_or_int_optional
              memory: memory_str_optional
            requests:
              cpu: cpu_str_or_int_optional
              memory: memory_str_optional
        nfd_plugin: boolean_nfd_installed
      bmc_monitoring:
        image:
          repository: vault.habana.ai/habana-bmc-exporter/bmc-exporter
          tag: 1.19.1-26
        resources:
          limits:
            cpu: cpu_str_or_int_optional
            memory: memory_str_optional
          requests:
            cpu: cpu_str_or_int_optional
            memory: memory_str_optional
      node_selector:
        key_optional: value_optional
    
  2. Apply the YAML file:

    kubectl apply -f clusterpolicy.yaml
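
The example manifest above contains descriptive placeholders (for example cpu_str_or_int_optional) where a real value is expected. The following is a minimal sketch of two checks around the apply step: one that lists placeholder values still present in the file, and one that inspects the result after applying. The function names are hypothetical. The fully qualified resource name and the assumption that the DaemonSets are created in the habana-ai-operator namespace are not confirmed by this page.

```shell
# Hypothetical helper: print every line of the given manifest that still
# holds a placeholder from the example. Returns non-zero when none are found.
find_placeholders() {
  grep -nE '_optional|boolean_nfd_installed|one_of_containerd_docker_crio' "$1"
}

# Hypothetical helper: inspect the cluster after "kubectl apply".
# Takes the Operator namespace as an optional argument.
check_cluster_policy() {
  # The instance name "habana-ai" comes from metadata.name in the manifest.
  kubectl get clusterpolicy.habanalabs.habana.ai habana-ai || return 1
  # Assumption: the Operator creates its DaemonSets in its own namespace.
  kubectl get daemonsets -n "${1:-habana-ai-operator}"
}
```

Run find_placeholders clusterpolicy.yaml before step 2 and replace or delete each reported line. Run check_cluster_policy after step 2.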