Intel Gaudi Base Operator for Kubernetes

The Intel Gaudi Base Operator automates the management of all Intel Gaudi software components required on a Kubernetes cluster: drivers, the Kubernetes device plugin, the container runtime, feature discovery, and monitoring tools. For the full list of DaemonSets that the Operator installs on a cluster, see Intel Gaudi Base Operator DaemonSets.

Deploying Intel Gaudi Base Operator

Install the Intel Gaudi Base Operator using its Helm chart as follows:

  1. Create the Operator namespace and label it to allow privileged pods under Pod Security Admission:

    kubectl create namespace habana-ai-operator
    kubectl label namespace habana-ai-operator pod-security.kubernetes.io/enforce=privileged --overwrite
    kubectl label namespace habana-ai-operator pod-security.kubernetes.io/audit=privileged --overwrite
    kubectl label namespace habana-ai-operator pod-security.kubernetes.io/warn=privileged --overwrite
    
  2. Add the Helm repository and install the chart:

    helm repo add gaudi-helm https://vault.habana.ai/artifactory/api/helm/gaudi-helm
    helm repo update
    helm install habana-ai-operator gaudi-helm/habana-ai-operator --version 1.21.1-16 -n habana-ai-operator
    

Once the Helm chart is deployed, it creates a basic ClusterPolicy instance that launches the Intel Gaudi Base Operator. You can use the instance as is or customize it as explained in the section below.
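To confirm that the deployment succeeded, checks along the following lines may help. This is a minimal sketch, not an official procedure: the namespace and release names are the ones used in the steps above, the helper name check_gaudi_operator is hypothetical, and the habanalabs.habana.ai API group is taken from the apiVersion of the ClusterPolicy example later in this document. The last command lists the resource names registered under that group rather than assuming them.

```shell
#!/usr/bin/env bash
# Sketch: post-install checks for the Intel Gaudi Base Operator.
# NS and RELEASE default to the names used in the install steps.
NS="${NS:-habana-ai-operator}"
RELEASE="${RELEASE:-habana-ai-operator}"

check_gaudi_operator() {
  # The namespace should carry the three pod-security.kubernetes.io labels.
  kubectl get namespace "$NS" --show-labels || return 1
  # Show the state of the Helm release.
  helm status "$RELEASE" -n "$NS" || return 1
  # List the pods and DaemonSets present in the Operator namespace.
  kubectl get pods,daemonsets -n "$NS" || return 1
  # Discover the resource names served by the Operator's API group.
  kubectl api-resources --api-group=habanalabs.habana.ai || return 1
}
```

Source the script and run check_gaudi_operator. The function stops at the first failing command and returns a non-zero status.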

Modifying ClusterPolicy Instance

Once you deploy the Intel Gaudi Base Operator, you can modify the ClusterPolicy instance. ClusterPolicy is the main Custom Resource Definition (CRD) of the Intel Gaudi Base Operator. It is used to set up and manage hardware across multiple nodes in a cluster, for example by installing drivers, enabling metrics, or defining default runtime behaviors. The table below describes the fields of the ClusterPolicy instance and indicates which of them are required:

| Component | Field | Description | Type | Required |
|---|---|---|---|---|
| image_registry | - | Intel Gaudi registry URL. | String | Yes |
| driver.driver_loader | images.osname_osversion.repository | Driver installer repository path. | String | Yes |
| driver.driver_loader | images.osname_osversion.tag | Driver version. | String | Yes |
| driver.driver_loader | resources.limits.cpu | CPU resource limit. | String/Integer | No |
| driver.driver_loader | resources.limits.memory | Memory resource limit. | String | No |
| driver.driver_loader | resources.requests.cpu | CPU resource request. | String/Integer | No |
| driver.driver_loader | resources.requests.memory | Memory resource request. | String | No |
| driver.driver_loader | repo_server | Driver packages repository server. | String | No |
| driver.driver_loader | repo_path | Path to the repository containing driver packages. | String | No |
| driver.driver_loader | mlnx_ofed_repo_path | Path to MLNX_OFED packages. | String | No |
| driver.driver_loader | mlnx_ofed_version | Name of the MLNX_OFED archive inside the repository. | String | No |
| driver.driver_loader | hugepage | Number of hugepages. | Integer | No |
| driver.driver_loader | external_ports | Enable/disable external ports. | Boolean | No |
| driver.driver_loader | firmware_flush | Flush firmware on Gaudi cards. | Boolean | No |
| driver.driver_runner | image.repository | Driver runner repository path. | String | Yes |
| driver.driver_runner | image.tag | Driver runner version. | String | Yes |
| driver.driver_runner | resources.limits.cpu | CPU resource limit. | String/Integer | No |
| driver.driver_runner | resources.limits.memory | Memory resource limit. | String | No |
| driver.driver_runner | resources.requests.cpu | CPU resource request. | String/Integer | No |
| driver.driver_runner | resources.requests.memory | Memory resource request. | String | No |
| device_plugin | image.repository | Device plugin repository path. | String | Yes |
| device_plugin | image.tag | Device plugin version. | String | Yes |
| device_plugin | resources.limits.cpu | CPU resource limit. | String/Integer | No |
| device_plugin | resources.limits.memory | Memory resource limit. | String | No |
| device_plugin | resources.requests.cpu | CPU resource request. | String/Integer | No |
| device_plugin | resources.requests.memory | Memory resource request. | String | No |
| runtime.runner | image.repository | habana-container-runtime repository path. | String | Yes |
| runtime.runner | image.tag | habana-container-runtime version. | String | Yes |
| runtime.runner | resources.limits.cpu | CPU resource limit. | String/Integer | No |
| runtime.runner | resources.limits.memory | Memory resource limit. | String | No |
| runtime.runner | resources.requests.cpu | CPU resource request. | String/Integer | No |
| runtime.runner | resources.requests.memory | Memory resource request. | String | No |
| runtime.configuration | container_engine | Container engine to use. | String | No |
| runtime.configuration | engine_container_runtime_configuration | Container engine configuration. | String | No |
| runtime.configuration | habana_container_runtime_configuration | habana-container-runtime configuration. | String | No |
| metric_exporter | runner.image.repository | Metric exporter repository path. | String | Yes |
| metric_exporter | runner.image.tag | Metric exporter version. | String | Yes |
| metric_exporter | runner.resources.limits.cpu | CPU resource limit. | String/Integer | No |
| metric_exporter | runner.resources.limits.memory | Memory resource limit. | String | No |
| metric_exporter | runner.resources.requests.cpu | CPU resource request. | String/Integer | No |
| metric_exporter | runner.resources.requests.memory | Memory resource request. | String | No |
| metric_exporter | port | Metric exporter port. | Integer | No |
| metric_exporter | interval | Metric collection interval. | Integer | No |
| feature_discovery | runner.image.repository | Feature discovery repository path. | String | Yes |
| feature_discovery | runner.image.tag | Feature discovery version. | String | Yes |
| feature_discovery | runner.resources.limits.cpu | CPU resource limit. | String/Integer | No |
| feature_discovery | runner.resources.limits.memory | Memory resource limit. | String | No |
| feature_discovery | runner.resources.requests.cpu | CPU resource request. | String/Integer | No |
| feature_discovery | runner.resources.requests.memory | Memory resource request. | String | No |
| feature_discovery | nfd_plugin | Enable/disable feature discovery as a local NFD plugin. | Boolean | No |
| bmc_monitoring | image.repository | BMC exporter repository path. | String | Yes |
| bmc_monitoring | image.tag | BMC exporter version. | String | Yes |
| bmc_monitoring | resources.limits.cpu | CPU resource limit. | String/Integer | No |
| bmc_monitoring | resources.limits.memory | Memory resource limit. | String | No |
| bmc_monitoring | resources.requests.cpu | CPU resource request. | String/Integer | No |
| bmc_monitoring | resources.requests.memory | Memory resource request. | String | No |
| node_selector | Key-Value list | Newline-separated key-value pairs to set as the Operator’s node selector. | String | No |

Note

  • firmware_flush is supported only on PCI-based nodes with disabled eROM write protection. For more details, refer to eROM Upgrade.

  • For more information about CPU and memory requests and limits, see the Kubernetes documentation.
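The CPU fields accept either a string with the m suffix, such as 500m (millicores, so half a core), or a plain number of cores, such as 1. As an illustration only, the hypothetical helper cpu_to_millicores below normalizes both forms to millicores so that values can be compared. It is not part of the Operator and does not validate its input.

```shell
#!/usr/bin/env bash
# Sketch: convert a Kubernetes CPU quantity to millicores.
# Accepts "500m" (millicores) or "1" / "0.5" (cores).
cpu_to_millicores() {
  q=$1
  case "$q" in
    *m)
      # Already expressed in millicores: strip the suffix.
      printf '%s\n' "${q%m}"
      ;;
    *)
      # Expressed in cores: multiply by 1000 and round to an integer.
      awk -v c="$q" 'BEGIN { printf "%d\n", c * 1000 + 0.5 }'
      ;;
  esac
}
```

For example, cpu_to_millicores 500m prints 500, and cpu_to_millicores 1 prints 1000.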

To customize the ClusterPolicy, follow the steps below:

  1. Modify the clusterpolicy.yaml file using the example below as a template. Set the container_engine field to one of containerd, docker, or crio:

    apiVersion: habanalabs.habana.ai/v1
    kind: ClusterPolicy
    metadata:
      name: habana-ai
    spec:
      image_registry: vault.habana.ai
      driver:
        driver_loader:
          images:
            ubuntu_22.04:
              repository: vault.habana.ai/habana-ai-operator/driver/ubuntu22.04/driver-installer
              tag: 1.21.1-16
            rhel_8.6:
              repository: vault.habana.ai/habana-ai-operator/driver/rhel8.6/driver-installer
              tag: 1.21.1-16
            rhel_9.2:
              repository: vault.habana.ai/habana-ai-operator/driver/rhel9.2/driver-installer
              tag: 1.21.1-16
            rhel_9.4:
              repository: vault.habana.ai/habana-ai-operator/driver/rhel9.4/driver-installer
              tag: 1.21.1-16
            tencentos_3.1:
              repository: vault.habana.ai/habana-ai-operator/driver/tencentos3.1/driver-installer
              tag: 1.21.1-16
          resources:
            limits:
              cpu: 500m
              memory: 2Gi
            requests:
              cpu: 100m
              memory: 512Mi
          repo_server: vault.habana.ai
          repo_path: artifactory/gaudi-installer/repos
          mlnx_ofed_repo_path: artifactory/gaudi-installer/deps
          mlnx_ofed_version: mlnx-ofed-5.8-2.0.3.0-rhel8.4-x86_64.tar.gz
          # hugepage: hugepages_number_int_optional
          external_ports: false
          firmware_flush: false
        driver_runner:
          image:
            repository: vault.habana.ai/habana-ai-operator/driver/ubuntu22.04/driver-installer
            tag: 1.21.1-16
          resources:
            limits:
              cpu: 20m
              memory: 64Mi
            requests:
              cpu: 10m
              memory: 32Mi
      device_plugin:
        image:
          repository: vault.habana.ai/docker-k8s-device-plugin/docker-k8s-device-plugin
          tag: 1.21.1
        resources:
          limits:
            cpu: 20m
            memory: 64Mi
          requests:
            cpu: 10m
            memory: 32Mi
      runtime:
        runner:
          image:
            repository: vault.habana.ai/habana-ai-operator/habana-container-runtime
            tag: 1.21.1-16
          resources:
            limits:
              cpu: 20m
              memory: 64Mi
            requests:
              cpu: 10m
              memory: 32Mi
        configuration:
          container_engine: containerd  # choose one of: containerd, docker, crio
          engine_container_runtime_configuration: ""
          habana_container_runtime_configuration: ""
      metric_exporter:
        runner:
          image:
            repository: vault.habana.ai/gaudi-metric-exporter/metric-exporter
            tag: 1.21.1-16
          resources:
            limits:
              cpu: 150m
              memory: 120Mi
            requests:
              cpu: 100m
              memory: 100Mi
        port: 41611
        interval: 20
      feature_discovery:
        runner:
          image:
            repository: vault.habana.ai/habana-ai-operator/habanalabs-feature-discovery
            tag: 1.21.1-16
          resources:
            limits:
              cpu: 20m
              memory: 64Mi
            requests:
              cpu: 10m
              memory: 32Mi
        nfd_plugin: false
      bmc_monitoring:
        image:
          repository: vault.habana.ai/habana-bmc-exporter/bmc-exporter
          tag: 1.21.1-16
        resources:
          limits:
            cpu: 250m
            memory: 250Mi
          requests:
            cpu: 150m
            memory: 100Mi
      # node_selector:
      #   key_optional: value_optional
    
  2. Apply the YAML file:

    kubectl apply -f clusterpolicy.yaml
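The two steps above can be sketched as a small script that also checks the inputs before applying them. This is a hedged example under several assumptions: the function names show_node_info and apply_cluster_policy are hypothetical, the file is assumed to be in the current directory, and the DaemonSets are assumed to be created in the habana-ai-operator namespace. The node listing uses the standard osImage and containerRuntimeVersion fields of the node status, which can inform the images.osname_osversion key and the container_engine value. A runtime reported as cri-o://... would presumably correspond to crio.

```shell
#!/usr/bin/env bash
# Sketch: inspect the nodes, validate clusterpolicy.yaml, then apply it.
POLICY_FILE="${POLICY_FILE:-clusterpolicy.yaml}"
NS="${NS:-habana-ai-operator}"

# Print each node's OS image and container runtime.
show_node_info() {
  kubectl get nodes -o custom-columns='NAME:.metadata.name,OS:.status.nodeInfo.osImage,RUNTIME:.status.nodeInfo.containerRuntimeVersion'
}

# Validate against the API server without persisting anything, then apply.
apply_cluster_policy() {
  kubectl apply --dry-run=server -f "$POLICY_FILE" || return 1
  kubectl apply -f "$POLICY_FILE" || return 1
  # List the DaemonSets in the Operator namespace to follow the rollout.
  kubectl get daemonsets -n "$NS"
}
```

Run show_node_info first, edit the file accordingly, and then run apply_cluster_policy. If the server-side dry run rejects the file, nothing is applied.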