Intel Gaudi Base Operator for Kubernetes
Intel® Gaudi® Base Operator for Kubernetes automates the management of all necessary Intel Gaudi software components on a Kubernetes cluster. These include drivers, Kubernetes device plugin, container runtime, feature discovery, and monitoring tools. This document provides instructions for deploying the Operator.
Prerequisites
kubectl and helm CLIs installed.
Kubernetes version listed in the Support Matrix.
Deploying the Operator
Install the Operator on a cluster by deploying a Helm chart:
Create the Operator namespace:
kubectl create namespace habana-ai-operator
kubectl label namespace habana-ai-operator pod-security.kubernetes.io/enforce=privileged --overwrite
kubectl label namespace habana-ai-operator pod-security.kubernetes.io/audit=privileged --overwrite
kubectl label namespace habana-ai-operator pod-security.kubernetes.io/warn=privileged --overwrite
Install the Helm chart:
helm repo add gaudi-helm https://vault.habana.ai/artifactory/api/helm/gaudi-helm
helm repo update
helm install habana-ai-operator gaudi-helm/habana-ai-operator --version 1.17.1-40 -n habana-ai-operator
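After the chart installs, the rollout can be checked before moving on. One way to do this with standard helm and kubectl commands (the exact pod names vary by release):

```shell
# List the Helm release to confirm it deployed
helm list -n habana-ai-operator

# Check that the Operator pods reach Running/Completed state
kubectl get pods -n habana-ai-operator
```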
DaemonSets Installation
The Operator installs the following six DaemonSets on a Kubernetes cluster:
Intel Gaudi Feature Discovery - Labels each Kubernetes node with information about Gaudi device availability and the driver version. These labels help deploy the other DaemonSets to the appropriate nodes. The DaemonSet relies on a node_selector field specified in the cluster policy, without additional label selectors or affinities.
Driver - Loads the driver on the cluster node by running the habanalabs-installer.sh installation script. The DaemonSet has an affinity for a label provided by the Intel Gaudi Feature Discovery DaemonSet, ensuring that the driver is installed only on servers with Gaudi devices.
Intel Gaudi Container Runtime - Exposes the Intel Gaudi network and uverbs interfaces to the pod. The DaemonSet copies runtime binaries to the host, configures the engine's default runtime, and restarts the container engine to load the new configuration. This DaemonSet has an affinity for labels indicating that a Gaudi device is available and the driver has been loaded on the node.
Intel Gaudi Device Plugin for Kubernetes - Lists the available Gaudi devices and exposes them to kubelet as a resource available for workloads as habana.ai/gaudi. This DaemonSet has an affinity for labels indicating that a Gaudi device is available and the driver has been loaded on the node. For more details, see Intel Gaudi Device Plugin for Kubernetes.
Prometheus Metric Exporter - Exports metrics about the node and the Gaudi devices on it. This DaemonSet has an affinity for labels indicating that a Gaudi device is available and the driver has been loaded on the node. For more details, see Prometheus Metric Exporter.
BMC Exporter - Exports metrics by utilizing the Redfish protocol to scrape the node's BMC. This DaemonSet has an affinity for labels indicating that a Gaudi device is available and the driver has been loaded on the node. For more details, see BMC Exporter User Guide.
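Once the device plugin is running, workloads request Gaudi devices through the habana.ai/gaudi resource it exposes to kubelet. A minimal illustrative pod spec follows; the pod name and container image are placeholders, not part of the Operator install:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gaudi-test                 # hypothetical pod name
spec:
  containers:
    - name: workload
      image: my-registry/my-gaudi-workload:latest   # placeholder image
      resources:
        limits:
          habana.ai/gaudi: 1       # request one Gaudi device from the device plugin
```

The scheduler places such a pod only on nodes where the device plugin has advertised at least one free habana.ai/gaudi resource.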
To deploy these DaemonSets manually, create the clusterpolicy.yaml file containing the following:
apiVersion: habanalabs.habana.ai/v1
kind: ClusterPolicy
metadata:
  name: habana-ai
spec:
  image_registry: vault.habana.ai
  driver:
    driver_loader:
      images:
        - os: ubuntu_22.04
          repository: vault.habana.ai/habana-ai-operator/driver/ubuntu22.04/driver-installer
          tag: 1.17.1-40
        - os: rhel_8.6
          repository: vault.habana.ai/habana-ai-operator/driver/rhel8.6/driver-installer
          tag: 1.17.1-40
        - os: rhel_9.2
          repository: vault.habana.ai/habana-ai-operator/driver/rhel9.2/driver-installer
          tag: 1.17.1-40
        - os: tencentos_3.1
          repository: vault.habana.ai/habana-ai-operator/driver/tencentos3.1/driver-installer
          tag: 1.17.1-40
        - os: amzn_2
          repository: vault.habana.ai/habana-ai-operator/driver/amzn2/driver-installer
          tag: 1.17.1-40
      resources:
        limits:
          cpu: cpu_str_or_int_optional
          memory: memory_str_optional
        requests:
          cpu: cpu_str_or_int_optional
          memory: memory_str_optional
      hugepages: hugepages_number_int_optional
      external_ports: turn_on_external_port_bool_optional
    driver_runner:
      image:
        repository: vault.habana.ai/habana-ai-operator/driver/ubuntu22.04/driver-installer
        tag: 1.17.1-40
      resources:
        limits:
          cpu: cpu_str_or_int_optional
          memory: memory_str_optional
        requests:
          cpu: cpu_str_or_int_optional
          memory: memory_str_optional
  device_plugin:
    image:
      repository: vault.habana.ai/docker-k8s-device-plugin/docker-k8s-device-plugin
      tag: 1.17.1
    resources:
      limits:
        cpu: cpu_str_or_int_optional
        memory: memory_str_optional
      requests:
        cpu: cpu_str_or_int_optional
        memory: memory_str_optional
  runtime:
    runner:
      image:
        repository: vault.habana.ai/habana-ai-operator/habana-container-runtime
        tag: 1.17.1-40
      resources:
        limits:
          cpu: cpu_str_or_int_optional
          memory: memory_str_optional
        requests:
          cpu: cpu_str_or_int_optional
          memory: memory_str_optional
    configuration:
      container_engine: container_engine_name
      engine_container_runtime_configuration: container_engine_configuration_optional
      habana_container_runtime_configuration: container_runtime_configuration_optional
  metric_exporter:
    runner:
      image:
        repository: vault.habana.ai/gaudi-metric-exporter/metric-exporter/
        tag: 1.17.1-40
      resources:
        limits:
          cpu: cpu_str_or_int_optional
          memory: memory_str_optional
        requests:
          cpu: cpu_str_or_int_optional
          memory: memory_str_optional
    port: 41611
    interval: 20
  feature_discovery:
    runner:
      image:
        repository: vault.habana.ai/habana-ai-operator/habanalabs-feature-discovery
        tag: 1.17.1-40
      resources:
        limits:
          cpu: cpu_str_or_int_optional
          memory: memory_str_optional
        requests:
          cpu: cpu_str_or_int_optional
          memory: memory_str_optional
    nfd_plugin: boolean_nfd_installed
  bmc_monitoring:
    image:
      repository: vault.habana.ai/habana-bmc-exporter/bmc-exporter
      tag: 1.17.1-40
    resources:
      limits:
        cpu: cpu_str_or_int_optional
        memory: memory_str_optional
      requests:
        cpu: cpu_str_or_int_optional
        memory: memory_str_optional
Apply the YAML file by running the following command:
kubectl apply -f clusterpolicy.yaml
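Once the ClusterPolicy is applied, the Operator reconciles it into the DaemonSets described above. Its progress can be observed with standard kubectl queries; a hedged sketch (resource short names and label formats may vary by Operator version):

```shell
# Check the ClusterPolicy object the Operator is reconciling
kubectl get clusterpolicy habana-ai

# Confirm the DaemonSets were created and are scheduling pods
kubectl get daemonsets -n habana-ai-operator

# Verify that nodes now advertise the habana.ai/gaudi resource
kubectl describe nodes | grep -i habana.ai/gaudi
```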