Intel Gaudi Base Operator for Kubernetes
Intel Gaudi Base Operator automates the management of all necessary Intel Gaudi software components on a Kubernetes cluster. These include drivers, Kubernetes device plugin, container runtime, feature discovery, and monitoring tools. For a full list of the DaemonSets that are installed on a Kubernetes cluster using Intel Gaudi Base Operator, see Intel Gaudi Base Operator DaemonSets.
Note
Make sure to review the supported Kubernetes versions listed in the Support Matrix.
Make sure the kubectl and helm CLIs are installed.
For Intel Gaudi Base Operator, Driver and Software Installation is not required.
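The CLI prerequisite in the note above can be checked before going further. The following is a minimal sketch; the function name check_clis is hypothetical and not part of any Intel Gaudi tooling.

```shell
# Report which of the given CLIs are missing from PATH.
# Returns 0 when all are present, 1 otherwise.
# check_clis is a hypothetical helper name used only for this sketch.
check_clis() {
  missing=0
  for cli in "$@"; do
    if ! command -v "$cli" >/dev/null 2>&1; then
      echo "missing: $cli" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

For example, running check_clis kubectl helm prints the name of any missing CLI to standard error and returns a non-zero status if either one is absent.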
Deploying Intel Gaudi Base Operator
You can install Intel Gaudi Base Operator using the Helm chart as described below:
Create the Operator namespace:

    kubectl create namespace habana-ai-operator
    kubectl label namespace habana-ai-operator pod-security.kubernetes.io/enforce=privileged --overwrite
    kubectl label namespace habana-ai-operator pod-security.kubernetes.io/audit=privileged --overwrite
    kubectl label namespace habana-ai-operator pod-security.kubernetes.io/warn=privileged --overwrite
Install the Helm chart:

    helm repo add gaudi-helm https://vault.habana.ai/artifactory/api/helm/gaudi-helm
    helm repo update
    helm install habana-ai-operator gaudi-helm/habana-ai-operator --version 1.20.0-543 -n habana-ai-operator
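After the two steps above, it can be useful to confirm that the namespace labels, the Helm release, and the operator pods are in place. The sketch below wraps standard kubectl and helm commands in a hypothetical function, verify_operator_install. The API group habanalabs.habana.ai is taken from the apiVersion used in the ClusterPolicy example on this page; the exact pods you should expect to see are not specified here.

```shell
# Namespace and release name used in the deployment steps.
NS=habana-ai-operator

# verify_operator_install is a hypothetical helper name for this sketch.
verify_operator_install() {
  # Show the Pod Security Admission labels set on the namespace.
  kubectl get namespace "$NS" --show-labels
  # Show the status of the Helm release.
  helm status habana-ai-operator -n "$NS"
  # List the pods running in the operator namespace.
  kubectl get pods -n "$NS"
  # List resource kinds registered under the operator's API group
  # (group name assumed from the ClusterPolicy apiVersion).
  kubectl api-resources --api-group=habanalabs.habana.ai
}
```

Calling verify_operator_install runs the four checks in order against the cluster your current kubectl context points to.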
Creating the ClusterPolicy Instance
Once you deploy the Intel Gaudi Base Operator, you can create the ClusterPolicy instance. The ClusterPolicy is the main Custom Resource Definition (CRD) of the Intel Gaudi Base Operator. It is used for setting up or managing hardware across multiple nodes in a cluster, such as installing drivers, enabling metrics, or defining default runtime behaviors. The table below describes the fields available when creating the ClusterPolicy instance and indicates which of them are required:
| Component | Field | Description | Type | Required |
|---|---|---|---|---|
| image_registry | | Intel Gaudi registry URL. | String | Yes |
| driver.driver_loader | images.osname_osversion | Driver installer repository path. | String | Yes |
| driver.driver_loader | images.osname_osversion.tag | Driver version. | String | Yes |
| driver.driver_loader | resources.limits.cpu | CPU resource limit. | String/Integer | No |
| driver.driver_loader | resources.limits.memory | Memory resource limit. | String | No |
| driver.driver_loader | resources.requests.cpu | CPU resource request. | String/Integer | No |
| driver.driver_loader | resources.requests.memory | Memory resource request. | String | No |
| driver.driver_loader | repo_server | Driver packages repository server. | String | No |
| driver.driver_loader | repo_path | Path to the repository containing driver packages. | String | No |
| driver.driver_loader | mlnx_ofed_repo_path | Path to MLNX_OFED packages. | String | No |
| driver.driver_loader | mlnx_ofed_version | Name of MLNX_OFED archive inside the repository. | String | No |
| driver.driver_loader | hugepages | Number of hugepages. | Integer | No |
| driver.driver_loader | external_ports | Enable/disable external ports. | Boolean | No |
| driver.driver_loader | firmware_flush | Flush firmware on Gaudi cards. | Boolean | No |
| driver.driver_runner | image.repository | Driver runner repository path. | String | Yes |
| driver.driver_runner | image.tag | Driver runner version. | String | Yes |
| driver.driver_runner | resources.limits.cpu | CPU resource limit. | String/Integer | No |
| driver.driver_runner | resources.limits.memory | Memory resource limit. | String | No |
| driver.driver_runner | resources.requests.cpu | CPU resource request. | String/Integer | No |
| driver.driver_runner | resources.requests.memory | Memory resource request. | String | No |
| device_plugin | image.repository | Device plugin repository path. | String | Yes |
| device_plugin | image.tag | Device plugin version. | String | Yes |
| device_plugin | resources.limits.cpu | CPU resource limit. | String/Integer | No |
| device_plugin | resources.limits.memory | Memory resource limit. | String | No |
| device_plugin | resources.requests.cpu | CPU resource request. | String/Integer | No |
| device_plugin | resources.requests.memory | Memory resource request. | String | No |
| runtime.runner | image.repository | habana-container-runtime repository path. | String | Yes |
| runtime.runner | image.tag | habana-container-runtime version. | String | Yes |
| runtime.runner | resources.limits.cpu | CPU resource limit. | String/Integer | No |
| runtime.runner | resources.limits.memory | Memory resource limit. | String | No |
| runtime.runner | resources.requests.cpu | CPU resource request. | String/Integer | No |
| runtime.runner | resources.requests.memory | Memory resource request. | String | No |
| runtime.configuration | container_engine | Container engine to use. | String | No |
| runtime.configuration | engine_container_runtime_configuration | Container engine configuration. | String | No |
| runtime.configuration | habana_container_runtime_configuration | habana-container-runtime configuration. | String | No |
| metric_exporter | runner.image.repository | Metric exporter repository path. | String | Yes |
| metric_exporter | runner.image.tag | Metric exporter version. | String | Yes |
| metric_exporter | runner.resources.limits.cpu | CPU resource limit. | String/Integer | No |
| metric_exporter | runner.resources.limits.memory | Memory resource limit. | String | No |
| metric_exporter | runner.resources.requests.cpu | CPU resource request. | String/Integer | No |
| metric_exporter | runner.resources.requests.memory | Memory resource request. | String | No |
| metric_exporter | port | Metric exporter port. | Integer | No |
| metric_exporter | interval | Metric collection interval. | Integer | No |
| feature_discovery | runner.image.repository | Feature discovery repository path. | String | Yes |
| feature_discovery | runner.image.tag | Feature discovery version. | String | Yes |
| feature_discovery | runner.resources.limits.cpu | CPU resource limit. | String/Integer | No |
| feature_discovery | runner.resources.limits.memory | Memory resource limit. | String | No |
| feature_discovery | runner.resources.requests.cpu | CPU resource request. | String/Integer | No |
| feature_discovery | runner.resources.requests.memory | Memory resource request. | String | No |
| feature_discovery | nfd_plugin | Enable/disable feature discovery as local NFD plugin. | Boolean | No |
| bmc_monitoring | image.repository | BMC exporter repository path. | String | Yes |
| bmc_monitoring | image.tag | BMC exporter version. | String | Yes |
| bmc_monitoring | resources.limits.cpu | CPU resource limit. | String/Integer | No |
| bmc_monitoring | resources.limits.memory | Memory resource limit. | String | No |
| bmc_monitoring | resources.requests.cpu | CPU resource request. | String/Integer | No |
| bmc_monitoring | resources.requests.memory | Memory resource request. | String | No |
| node_selector | Key-Value list | Newline-separated key-value pairs to set as the Operator's node selector. | String | No |
Note
firmware_flush is supported only on PCI-based nodes with disabled eROM write protection. For more details, refer to eROM Upgrade.
For more information about CPU/memory requests and limits, see the Kubernetes documentation.
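Going by the Required column of the table, only the registry URL and the image references are mandatory. The sketch below writes a ClusterPolicy containing just those fields for a single OS (ubuntu_22.04). The image paths and tags are copied from the full example file on this page; the output file name clusterpolicy-minimal.yaml is hypothetical, and this page does not confirm that the Operator accepts a policy this sparse, so treat it as an illustration of the required fields rather than a tested configuration.

```shell
# Hypothetical file name, chosen so it does not overwrite clusterpolicy.yaml.
OUT=clusterpolicy-minimal.yaml

# Write only the fields marked "Yes" in the Required column.
cat > "$OUT" <<'EOF'
apiVersion: habanalabs.habana.ai/v1
kind: ClusterPolicy
metadata:
  name: habana-ai
spec:
  image_registry: vault.habana.ai
  driver:
    driver_loader:
      images:
        ubuntu_22.04:
          repository: vault.habana.ai/habana-ai-operator/driver/ubuntu22.04/driver-installer
          tag: 1.20.0-543
    driver_runner:
      image:
        repository: vault.habana.ai/habana-ai-operator/driver/ubuntu22.04/driver-installer
        tag: 1.20.0-543
  device_plugin:
    image:
      repository: vault.habana.ai/docker-k8s-device-plugin/docker-k8s-device-plugin
      tag: 1.20.0
  runtime:
    runner:
      image:
        repository: vault.habana.ai/habana-ai-operator/habana-container-runtime
        tag: 1.20.0-543
  metric_exporter:
    runner:
      image:
        repository: vault.habana.ai/gaudi-metric-exporter/metric-exporter
        tag: 1.20.0-543
  feature_discovery:
    runner:
      image:
        repository: vault.habana.ai/habana-ai-operator/habanalabs-feature-discovery
        tag: 1.20.0-543
  bmc_monitoring:
    image:
      repository: vault.habana.ai/habana-bmc-exporter/bmc-exporter
      tag: 1.20.0-543
EOF
```

Optional fields such as resources, hugepages, or node_selector can then be added one at a time as needed.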
To create the ClusterPolicy, follow the steps below:
Create a clusterpolicy.yaml file containing the following:

    apiVersion: habanalabs.habana.ai/v1
    kind: ClusterPolicy
    metadata:
      name: habana-ai
    spec:
      image_registry: vault.habana.ai
      driver:
        driver_loader:
          images:
            ubuntu_22.04:
              repository: vault.habana.ai/habana-ai-operator/driver/ubuntu22.04/driver-installer
              tag: 1.20.0-543
            rhel_8.6:
              repository: vault.habana.ai/habana-ai-operator/driver/rhel8.6/driver-installer
              tag: 1.20.0-543
            rhel_9.2:
              repository: vault.habana.ai/habana-ai-operator/driver/rhel9.2/driver-installer
              tag: 1.20.0-543
            rhel_9.4:
              repository: vault.habana.ai/habana-ai-operator/driver/rhel9.4/driver-installer
              tag: 1.20.0-543
            tencentos_3.1:
              repository: vault.habana.ai/habana-ai-operator/driver/tencentos3.1/driver-installer
              tag: 1.20.0-543
          resources:
            limits:
              cpu: cpu_str_or_int_optional
              memory: memory_str_optional
            requests:
              cpu: cpu_str_or_int_optional
              memory: memory_str_optional
          repo_server: vault.habana.ai
          repo_path: artifactory/gaudi-installer/repos
          mlnx_ofed_repo_path: artifactory/gaudi-installer/deps
          mlnx_ofed_version: mlnx-ofed-5.8-2.0.3.0-rhel8.4-x86_64.tar.gz
          hugepages: hugepages_number_int_optional
          external_ports: turn_on_external_port_bool_optional
          firmware_flush: flush_firmware_on_the_gaudi_cards_bool_optional
        driver_runner:
          image:
            repository: vault.habana.ai/habana-ai-operator/driver/ubuntu22.04/driver-installer
            tag: 1.20.0-543
          resources:
            limits:
              cpu: cpu_str_or_int_optional
              memory: memory_str_optional
            requests:
              cpu: cpu_str_or_int_optional
              memory: memory_str_optional
      device_plugin:
        image:
          repository: vault.habana.ai/docker-k8s-device-plugin/docker-k8s-device-plugin
          tag: 1.20.0
        resources:
          limits:
            cpu: cpu_str_or_int_optional
            memory: memory_str_optional
          requests:
            cpu: cpu_str_or_int_optional
            memory: memory_str_optional
      runtime:
        runner:
          image:
            repository: vault.habana.ai/habana-ai-operator/habana-container-runtime
            tag: 1.20.0-543
          resources:
            limits:
              cpu: cpu_str_or_int_optional
              memory: memory_str_optional
            requests:
              cpu: cpu_str_or_int_optional
              memory: memory_str_optional
        configuration:
          container_engine: one_of_containerd_docker_crio
          engine_container_runtime_configuration: container_engine_configuration_optional
          habana_container_runtime_configuration: container_runtime_configuration_optional
      metric_exporter:
        runner:
          image:
            repository: vault.habana.ai/gaudi-metric-exporter/metric-exporter
            tag: 1.20.0-543
          resources:
            limits:
              cpu: cpu_str_or_int_optional
              memory: memory_str_optional
            requests:
              cpu: cpu_str_or_int_optional
              memory: memory_str_optional
        port: 41611
        interval: 20
      feature_discovery:
        runner:
          image:
            repository: vault.habana.ai/habana-ai-operator/habanalabs-feature-discovery
            tag: 1.20.0-543
          resources:
            limits:
              cpu: cpu_str_or_int_optional
              memory: memory_str_optional
            requests:
              cpu: cpu_str_or_int_optional
              memory: memory_str_optional
        nfd_plugin: boolean_nfd_installed
      bmc_monitoring:
        image:
          repository: vault.habana.ai/habana-bmc-exporter/bmc-exporter
          tag: 1.20.0-543
        resources:
          limits:
            cpu: cpu_str_or_int_optional
            memory: memory_str_optional
          requests:
            cpu: cpu_str_or_int_optional
            memory: memory_str_optional
      node_selector:
        key_optional: value_optional
Apply the YAML file:

    kubectl apply -f clusterpolicy.yaml
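The apply step can be bracketed with a validation pass and a follow-up check. The sketch below uses standard kubectl options only; the function names are hypothetical, and it assumes the Operator's DaemonSets are created in the habana-ai-operator namespace, which this page does not state explicitly.

```shell
POLICY_FILE=clusterpolicy.yaml
# Assumption: the Operator creates its DaemonSets in this namespace.
NS=habana-ai-operator

# Ask the API server to validate the file without persisting anything.
# Placeholder values such as cpu_str_or_int_optional must be replaced or
# removed first, otherwise validation is likely to reject them.
validate_policy() {
  kubectl apply --dry-run=server -f "$POLICY_FILE"
}

# After the real apply: confirm the resource exists, then list DaemonSets.
check_policy() {
  kubectl get -f "$POLICY_FILE"
  kubectl get daemonsets -n "$NS"
}
```

Run validate_policy before kubectl apply, and check_policy afterwards to see whether the components described in the ClusterPolicy are being rolled out.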