Intel Gaudi Base Operator for Kubernetes¶
Intel Gaudi Base Operator for Kubernetes automates the management of all necessary Intel Gaudi software components on a Kubernetes cluster. These include drivers, Kubernetes device plugin, container runtime, feature discovery, and monitoring tools. For a full list of the DaemonSets that are installed on a Kubernetes cluster using Intel Gaudi Base Operator, see Intel Gaudi Base Operator DaemonSets.
Note
Make sure to review the supported Kubernetes versions listed in the Support Matrix.
Make sure the kubectl, helm, and oc CLIs are installed.
For Intel Gaudi Base Operator, Driver and Software Installation is not required.
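The CLI prerequisite in the note above can be checked from a shell before starting. The following is a minimal sketch; the helper name `missing_clis` is hypothetical and not part of any Intel Gaudi tooling.

```shell
# Hypothetical helper: print the names of any given CLIs not found on PATH.
missing_clis() {
  missing=""
  for tool in "$@"; do
    # command -v succeeds only if the shell can resolve the tool.
    if ! command -v "$tool" >/dev/null 2>&1; then
      missing="$missing $tool"
    fi
  done
  # Strip the leading space; prints an empty line when nothing is missing.
  printf '%s\n' "${missing# }"
}

not_found=$(missing_clis kubectl helm oc)
if [ -n "$not_found" ]; then
  echo "Missing CLIs: $not_found"
fi
```

The check only confirms that each tool is on PATH; it does not verify versions or cluster access.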
Deploying Intel Gaudi Base Operator¶
This section describes how to install Intel Gaudi Base Operator and deploy a ClusterPolicy. You can install the Operator using the Helm chart, or on Red Hat OpenShift Container Platform using either the console or the CLI, as described below.
Using Helm Chart¶
Create the Operator namespace:
kubectl create namespace habana-ai-operator
kubectl label namespace habana-ai-operator pod-security.kubernetes.io/enforce=privileged --overwrite
kubectl label namespace habana-ai-operator pod-security.kubernetes.io/audit=privileged --overwrite
kubectl label namespace habana-ai-operator pod-security.kubernetes.io/warn=privileged --overwrite
Install the Helm chart:
helm repo add gaudi-helm https://vault.habana.ai/artifactory/api/helm/gaudi-helm
helm repo update
helm install habana-ai-operator gaudi-helm/habana-ai-operator --version 1.19.1-26 -n habana-ai-operator
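After the chart is installed, it can help to confirm that the release deployed and that the Operator pods are running. A minimal sketch, assuming the release name and namespace used above; the wrapper function `check_operator_install` is hypothetical, and the 300-second timeout is an arbitrary choice:

```shell
# Hypothetical wrapper around standard helm and kubectl commands.
check_operator_install() {
  ns="habana-ai-operator"
  # Show the state of the Helm release installed above.
  helm status habana-ai-operator -n "$ns" || return 1
  # Wait until every pod in the namespace reports Ready (timeout is arbitrary).
  kubectl wait --for=condition=Ready pods --all -n "$ns" --timeout=300s || return 1
  # List the pods for a final visual check.
  kubectl get pods -n "$ns"
}

# Run against a live cluster with:
#   check_operator_install
```

The function stops at the first failing command, so a non-zero exit status indicates which stage needs attention.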
Using Red Hat OpenShift Container Platform Console¶
Go to Operators.
Click “OperatorHub”.
In the All Items field, search for Intel Gaudi AI accelerator.
Click “Install”.
Using CLI¶
Create a habana-ai-operator-install.yaml file containing the following:
---
apiVersion: v1
kind: Namespace
metadata:
  name: habana-ai-operator
---
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  name: habana-ai-operator
  namespace: habana-ai-operator
spec:
  targetNamespaces:
    - habana-ai-operator
---
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: habana-ai-operator
  namespace: habana-ai-operator
spec:
  channel: stable
  installPlanApproval: Automatic
  name: habana-ai-operator
  source: certified-operators
  sourceNamespace: openshift-marketplace
Apply the YAML file:
oc apply -f habana-ai-operator-install.yaml
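Once the manifest is applied, Operator Lifecycle Manager (OLM) resolves the Subscription and installs the Operator. One way to confirm this is sketched below with standard oc commands; the function name `check_olm_install` is hypothetical:

```shell
# Hypothetical wrapper around standard oc commands.
check_olm_install() {
  ns="habana-ai-operator"
  # The Subscription created by the manifest above.
  oc get subscriptions.operators.coreos.com habana-ai-operator -n "$ns" || return 1
  # The ClusterServiceVersion (CSV) that OLM creates for the installed Operator.
  oc get csv -n "$ns" || return 1
  # Pods started for the Operator.
  oc get pods -n "$ns"
}

# Run against a live cluster with:
#   check_olm_install
```

On a healthy install, the PHASE column of the CSV listing is expected to reach Succeeded after a short while.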
Creating the ClusterPolicy Instance¶
Once Intel Gaudi Base Operator is deployed, create the ClusterPolicy instance using either the Red Hat OpenShift Container Platform console or the CLI, as described below.
The ClusterPolicy is the main Custom Resource Definition (CRD) of the Intel Gaudi Base Operator. It controls how the Operator sets up and manages Gaudi hardware across the nodes in a cluster, for example installing drivers, enabling metrics, or defining default runtime behavior. The table below describes the fields of the ClusterPolicy instance and indicates which of them are required:
| Component | Field | Description | Type | Required |
|---|---|---|---|---|
| image_registry | | Intel Gaudi registry URL. | String | Yes |
| driver.driver_loader | images.osname_osversion | Driver installer repository path. | String | Yes |
| driver.driver_loader | images.osname_osversion.tag | Driver version. | String | Yes |
| driver.driver_loader | resources.limits.cpu | CPU resource limit. | String/Integer | No |
| driver.driver_loader | resources.limits.memory | Memory resource limit. | String | No |
| driver.driver_loader | resources.requests.cpu | CPU resource request. | String/Integer | No |
| driver.driver_loader | resources.requests.memory | Memory resource request. | String | No |
| driver.driver_loader | repo_server | Driver packages repository server. | String | No |
| driver.driver_loader | repo_path | Path to the repository containing driver packages. | String | No |
| driver.driver_loader | mlnx_ofed_repo_path | Path to MLNX_OFED packages. | String | No |
| driver.driver_loader | mlnx_ofed_version | Name of MLNX_OFED archive inside the repository. | String | No |
| driver.driver_loader | hugepages | Number of hugepages. | Integer | No |
| driver.driver_loader | external_ports | Enable/disable external ports. | Boolean | No |
| driver.driver_loader | firmware_flush | Flush firmware on Gaudi cards. | Boolean | No |
| driver.driver_runner | image.repository | Driver runner repository path. | String | Yes |
| driver.driver_runner | image.tag | Driver runner version. | String | Yes |
| driver.driver_runner | resources.limits.cpu | CPU resource limit. | String/Integer | No |
| driver.driver_runner | resources.limits.memory | Memory resource limit. | String | No |
| driver.driver_runner | resources.requests.cpu | CPU resource request. | String/Integer | No |
| driver.driver_runner | resources.requests.memory | Memory resource request. | String | No |
| device_plugin | image.repository | Device plugin repository path. | String | Yes |
| device_plugin | image.tag | Device plugin version. | String | Yes |
| device_plugin | resources.limits.cpu | CPU resource limit. | String/Integer | No |
| device_plugin | resources.limits.memory | Memory resource limit. | String | No |
| device_plugin | resources.requests.cpu | CPU resource request. | String/Integer | No |
| device_plugin | resources.requests.memory | Memory resource request. | String | No |
| runtime.runner | image.repository | habana-container-runtime repository path. | String | Yes |
| runtime.runner | image.tag | habana-container-runtime version. | String | Yes |
| runtime.runner | resources.limits.cpu | CPU resource limit. | String/Integer | No |
| runtime.runner | resources.limits.memory | Memory resource limit. | String | No |
| runtime.runner | resources.requests.cpu | CPU resource request. | String/Integer | No |
| runtime.runner | resources.requests.memory | Memory resource request. | String | No |
| runtime.configuration | container_engine | Container engine to use. | String | No |
| runtime.configuration | engine_container_runtime_configuration | Container engine configuration. | String | No |
| runtime.configuration | habana_container_runtime_configuration | habana-container-runtime configuration. | String | No |
| metric_exporter | runner.image.repository | Metric exporter repository path. | String | Yes |
| metric_exporter | runner.image.tag | Metric exporter version. | String | Yes |
| metric_exporter | runner.resources.limits.cpu | CPU resource limit. | String/Integer | No |
| metric_exporter | runner.resources.limits.memory | Memory resource limit. | String | No |
| metric_exporter | runner.resources.requests.cpu | CPU resource request. | String/Integer | No |
| metric_exporter | runner.resources.requests.memory | Memory resource request. | String | No |
| metric_exporter | port | Metric exporter port. | Integer | No |
| metric_exporter | interval | Metric collection interval. | Integer | No |
| feature_discovery | runner.image.repository | Feature discovery repository path. | String | Yes |
| feature_discovery | runner.image.tag | Feature discovery version. | String | Yes |
| feature_discovery | runner.resources.limits.cpu | CPU resource limit. | String/Integer | No |
| feature_discovery | runner.resources.limits.memory | Memory resource limit. | String | No |
| feature_discovery | runner.resources.requests.cpu | CPU resource request. | String/Integer | No |
| feature_discovery | runner.resources.requests.memory | Memory resource request. | String | No |
| feature_discovery | nfd_plugin | Enable/disable feature discovery as local NFD plugin. | Boolean | No |
| bmc_monitoring | image.repository | BMC exporter repository path. | String | Yes |
| bmc_monitoring | image.tag | BMC exporter version. | String | Yes |
| bmc_monitoring | resources.limits.cpu | CPU resource limit. | String/Integer | No |
| bmc_monitoring | resources.limits.memory | Memory resource limit. | String | No |
| bmc_monitoring | resources.requests.cpu | CPU resource request. | String/Integer | No |
| bmc_monitoring | resources.requests.memory | Memory resource request. | String | No |
| node_selector | Key-Value list | Newline-separated key-value pairs to set as the Operator's node selector. | String | No |
Note
firmware_flush is supported only on PCI-based nodes with disabled eROM write protection. For more details, refer to eROM Upgrade.
For more information about CPU/memory requests and limits, see Kubernetes documentation.
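Going by the Required column above, a ClusterPolicy needs only image_registry plus the repository and tag of each component. The sketch below writes such a reduced manifest for an Ubuntu 22.04 node, reusing the values from the full clusterpolicy.yaml example in the CLI section. The file name `clusterpolicy-minimal.yaml` and the function `write_minimal_clusterpolicy` are hypothetical, and whether the Operator accepts a manifest this small has not been verified here.

```shell
# Hypothetical helper: write a ClusterPolicy containing only the required fields.
write_minimal_clusterpolicy() {
  # $1: output path for the manifest
  cat > "$1" <<'EOF'
apiVersion: habanalabs.habana.ai/v1
kind: ClusterPolicy
metadata:
  name: habana-ai
spec:
  image_registry: vault.habana.ai
  driver:
    driver_loader:
      images:
        ubuntu_22.04:
          repository: vault.habana.ai/habana-ai-operator/driver/ubuntu22.04/driver-installer
          tag: 1.19.1-26
    driver_runner:
      image:
        repository: vault.habana.ai/habana-ai-operator/driver/ubuntu22.04/driver-installer
        tag: 1.19.1-26
  device_plugin:
    image:
      repository: vault.habana.ai/docker-k8s-device-plugin/docker-k8s-device-plugin
      tag: 1.19.1
  runtime:
    runner:
      image:
        repository: vault.habana.ai/habana-ai-operator/habana-container-runtime
        tag: 1.19.1-26
  metric_exporter:
    runner:
      image:
        repository: vault.habana.ai/gaudi-metric-exporter/metric-exporter
        tag: 1.19.1-26
  feature_discovery:
    runner:
      image:
        repository: vault.habana.ai/habana-ai-operator/habanalabs-feature-discovery
        tag: 1.19.1-26
  bmc_monitoring:
    image:
      repository: vault.habana.ai/habana-bmc-exporter/bmc-exporter
      tag: 1.19.1-26
EOF
}

write_minimal_clusterpolicy clusterpolicy-minimal.yaml
```

Optional fields such as resources, hugepages, or node_selector can then be added one at a time as needed.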
Using Red Hat OpenShift Container Platform Console¶
Go to Operators.
Click “Installed Operators”.
In the Name field, define the instance as habana-ai.
Fill in the “bmc_monitoring” fields.
Fill in the “device_plugin” fields.
Fill in the “driver.driver_runner” fields.
Fill in the “driver.driver_loader” fields.
Fill in the “feature_discovery” fields.
Fill in the “metric_exporter” fields.
Fill in the “runtime.configuration” fields.
Fill in the “runtime.runner” fields.
Set “image_registry” to vault.habana.ai.
Click “Create”.
Note
Some fields are not available through the console “Form view” tab and can be configured only in clusterpolicy.yaml or by switching to the “YAML view” tab.
Using CLI¶
Create a clusterpolicy.yaml file containing the following:
apiVersion: habanalabs.habana.ai/v1
kind: ClusterPolicy
metadata:
  name: habana-ai
spec:
  image_registry: vault.habana.ai
  driver:
    driver_loader:
      images:
        ubuntu_22.04:
          repository: vault.habana.ai/habana-ai-operator/driver/ubuntu22.04/driver-installer
          tag: 1.19.1-26
        rhel_8.6:
          repository: vault.habana.ai/habana-ai-operator/driver/rhel8.6/driver-installer
          tag: 1.19.1-26
        rhel_9.2:
          repository: vault.habana.ai/habana-ai-operator/driver/rhel9.2/driver-installer
          tag: 1.19.1-26
        rhel_9.4:
          repository: vault.habana.ai/habana-ai-operator/driver/rhel9.4/driver-installer
          tag: 1.19.1-26
        tencentos_3.1:
          repository: vault.habana.ai/habana-ai-operator/driver/tencentos3.1/driver-installer
          tag: 1.19.1-26
      resources:
        limits:
          cpu: cpu_str_or_int_optional
          memory: memory_str_optional
        requests:
          cpu: cpu_str_or_int_optional
          memory: memory_str_optional
      repo_server: vault.habana.ai
      repo_path: artifactory/gaudi-installer/repos
      mlnx_ofed_repo_path: artifactory/gaudi-installer/deps
      mlnx_ofed_version: mlnx-ofed-5.8-2.0.3.0-rhel8.4-x86_64.tar.gz
      hugepages: hugepages_number_int_optional
      external_ports: turn_on_external_port_bool_optional
      firmware_flush: flush_firmware_on_the_gaudi_cards_bool_optional
    driver_runner:
      image:
        repository: vault.habana.ai/habana-ai-operator/driver/ubuntu22.04/driver-installer
        tag: 1.19.1-26
      resources:
        limits:
          cpu: cpu_str_or_int_optional
          memory: memory_str_optional
        requests:
          cpu: cpu_str_or_int_optional
          memory: memory_str_optional
  device_plugin:
    image:
      repository: vault.habana.ai/docker-k8s-device-plugin/docker-k8s-device-plugin
      tag: 1.19.1
    resources:
      limits:
        cpu: cpu_str_or_int_optional
        memory: memory_str_optional
      requests:
        cpu: cpu_str_or_int_optional
        memory: memory_str_optional
  runtime:
    runner:
      image:
        repository: vault.habana.ai/habana-ai-operator/habana-container-runtime
        tag: 1.19.1-26
      resources:
        limits:
          cpu: cpu_str_or_int_optional
          memory: memory_str_optional
        requests:
          cpu: cpu_str_or_int_optional
          memory: memory_str_optional
    configuration:
      container_engine: one_of_containerd_docker_crio
      engine_container_runtime_configuration: container_engine_configuration_optional
      habana_container_runtime_configuration: container_runtime_configuration_optional
  metric_exporter:
    runner:
      image:
        repository: vault.habana.ai/gaudi-metric-exporter/metric-exporter
        tag: 1.19.1-26
      resources:
        limits:
          cpu: cpu_str_or_int_optional
          memory: memory_str_optional
        requests:
          cpu: cpu_str_or_int_optional
          memory: memory_str_optional
    port: 41611
    interval: 20
  feature_discovery:
    runner:
      image:
        repository: vault.habana.ai/habana-ai-operator/habanalabs-feature-discovery
        tag: 1.19.1-26
      resources:
        limits:
          cpu: cpu_str_or_int_optional
          memory: memory_str_optional
        requests:
          cpu: cpu_str_or_int_optional
          memory: memory_str_optional
    nfd_plugin: boolean_nfd_installed
  bmc_monitoring:
    image:
      repository: vault.habana.ai/habana-bmc-exporter/bmc-exporter
      tag: 1.19.1-26
    resources:
      limits:
        cpu: cpu_str_or_int_optional
        memory: memory_str_optional
      requests:
        cpu: cpu_str_or_int_optional
        memory: memory_str_optional
  node_selector:
    key_optional: value_optional
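The template above contains descriptive placeholders (for example cpu_str_or_int_optional or one_of_containerd_docker_crio) that presumably need to be replaced with real values, or removed where optional, before the file is applied. A quick way to catch leftovers is sketched below; the function name `find_placeholders` is hypothetical, and the pattern covers only the placeholder spellings that appear in this template.

```shell
# Hypothetical helper: list lines of a manifest that still hold template placeholders.
find_placeholders() {
  # $1: path to the ClusterPolicy manifest
  # Prints matching lines with their line numbers; exit status 0 means
  # at least one placeholder was found, 1 means none.
  grep -nE '_optional|one_of_|boolean_nfd_installed' "$1"
}

# Example:
#   if find_placeholders clusterpolicy.yaml; then
#     echo "Replace or remove the lines above before applying."
#   fi
```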
Apply the YAML file:
kubectl apply -f clusterpolicy.yaml
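After the ClusterPolicy is applied, the Operator should roll out the component DaemonSets. A sketch for checking this with standard kubectl commands follows. The function name `check_clusterpolicy` is hypothetical, the fully qualified resource name `clusterpolicy.habanalabs.habana.ai` is inferred from the kind and apiVersion in the manifest, and listing DaemonSets in the habana-ai-operator namespace assumes they are created alongside the Operator.

```shell
# Hypothetical wrapper around standard kubectl commands.
check_clusterpolicy() {
  # The ClusterPolicy instance named habana-ai (resource name inferred from the manifest).
  kubectl get clusterpolicy.habanalabs.habana.ai habana-ai || return 1
  # DaemonSets managed by the Operator (namespace is an assumption).
  kubectl get daemonsets -n habana-ai-operator
}

# Run against a live cluster with:
#   check_clusterpolicy
```

If the DaemonSet listing is empty, `kubectl get daemonsets --all-namespaces` shows whether they were created in a different namespace.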