Intel Gaudi Base Operator for Kubernetes
The Intel Gaudi Base Operator automates the management of all necessary Intel Gaudi software components on a Kubernetes cluster. These include drivers, the Kubernetes device plugin, the container runtime, feature discovery, and monitoring tools. For a full list of the DaemonSets that the Intel Gaudi Base Operator installs on a Kubernetes cluster, see Intel Gaudi Base Operator DaemonSets.
Note
Make sure to review the supported Kubernetes versions listed in the Support Matrix.
Make sure the kubectl and helm CLIs are installed.
For Intel Gaudi Base Operator, Driver and Software Installation is not required.
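The CLI prerequisite in the note above can be checked from a shell before starting. The following is a minimal sketch; the helper name check_clis is hypothetical and is not part of kubectl, helm, or the Operator.

```shell
# Sketch: confirm that the required CLIs are on PATH.
# check_clis is a hypothetical helper name used only for illustration.
check_clis() {
  local cli missing=0
  for cli in "$@"; do
    # command -v succeeds only if the shell can resolve the name
    if ! command -v "$cli" >/dev/null 2>&1; then
      echo "missing CLI: $cli" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

For example, running check_clis kubectl helm returns a non-zero status and names each missing tool on stderr if either CLI cannot be found.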
Deploying Intel Gaudi Base Operator
You can install the Intel Gaudi Base Operator using its Helm chart as follows:
Create the Operator namespace:
kubectl create namespace habana-ai-operator
kubectl label namespace habana-ai-operator pod-security.kubernetes.io/enforce=privileged --overwrite
kubectl label namespace habana-ai-operator pod-security.kubernetes.io/audit=privileged --overwrite
kubectl label namespace habana-ai-operator pod-security.kubernetes.io/warn=privileged --overwrite
Install the Helm chart:
helm repo add gaudi-helm https://vault.habana.ai/artifactory/api/helm/gaudi-helm
helm repo update
helm install habana-ai-operator gaudi-helm/habana-ai-operator --version 1.21.1-16 -n habana-ai-operator
Once the Helm chart is deployed, it creates a basic ClusterPolicy instance that launches the Intel Gaudi Base Operator. The instance can be used as is or customized as explained in the following section.
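To confirm that the install produced the release and the ClusterPolicy instance, the state can be inspected roughly as sketched below. The namespace and release name come from the install commands; the helper name show_operator_status is hypothetical, and the fully qualified resource name clusterpolicy.habanalabs.habana.ai is an assumption derived from the apiVersion and kind used by the ClusterPolicy manifest.

```shell
# Sketch: inspect what the Helm install created.
# show_operator_status is a hypothetical helper name.
show_operator_status() {
  local ns="${1:-habana-ai-operator}"
  local release="${2:-habana-ai-operator}"

  # Helm's view of the release (deployed, failed, pending, ...).
  helm status "$release" -n "$ns" || return 1

  # Pods running in the Operator namespace.
  kubectl get pods -n "$ns" || return 1

  # ClusterPolicy instances. The resource name is an assumption; -A is used
  # because this sketch does not know whether the resource is namespaced.
  kubectl get clusterpolicy.habanalabs.habana.ai -A
}
```

Calling show_operator_status with no arguments uses the namespace and release name from the commands above.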
Modifying ClusterPolicy Instance
After you deploy the Intel Gaudi Base Operator, you can modify the ClusterPolicy instance. The ClusterPolicy is the main Custom Resource Definition (CRD) of the Intel Gaudi Base Operator. It is used for setting up or managing hardware across multiple nodes in a cluster, such as installing drivers, enabling metrics, or defining default runtime behaviors. The table below describes the fields in the ClusterPolicy instance and indicates which of them are required:
| Component | Field | Description | Type | Required |
|---|---|---|---|---|
| | image_registry | Intel Gaudi registry URL. | String | Yes |
| driver.driver_loader | images.osname_osversion.repository | Driver installer repository path. | String | Yes |
| driver.driver_loader | images.osname_osversion.tag | Driver version. | String | Yes |
| driver.driver_loader | resources.limits.cpu | CPU resource limit. | String/Integer | No |
| driver.driver_loader | resources.limits.memory | Memory resource limit. | String | No |
| driver.driver_loader | resources.requests.cpu | CPU resource request. | String/Integer | No |
| driver.driver_loader | resources.requests.memory | Memory resource request. | String | No |
| driver.driver_loader | repo_server | Driver packages repository server. | String | No |
| driver.driver_loader | repo_path | Path to the repository containing driver packages. | String | No |
| driver.driver_loader | mlnx_ofed_repo_path | Path to MLNX_OFED packages. | String | No |
| driver.driver_loader | mlnx_ofed_version | Name of MLNX_OFED archive inside the repository. | String | No |
| driver.driver_loader | hugepage | Number of hugepages. | Integer | No |
| driver.driver_loader | external_ports | Enable/disable external ports. | Boolean | No |
| driver.driver_loader | firmware_flush | Flush firmware on Gaudi cards. | Boolean | No |
| driver.driver_runner | image.repository | Driver runner repository path. | String | Yes |
| driver.driver_runner | image.tag | Driver runner version. | String | Yes |
| driver.driver_runner | resources.limits.cpu | CPU resource limit. | String/Integer | No |
| driver.driver_runner | resources.limits.memory | Memory resource limit. | String | No |
| driver.driver_runner | resources.requests.cpu | CPU resource request. | String/Integer | No |
| driver.driver_runner | resources.requests.memory | Memory resource request. | String | No |
| device_plugin | image.repository | Device plugin repository path. | String | Yes |
| device_plugin | image.tag | Device plugin version. | String | Yes |
| device_plugin | resources.limits.cpu | CPU resource limit. | String/Integer | No |
| device_plugin | resources.limits.memory | Memory resource limit. | String | No |
| device_plugin | resources.requests.cpu | CPU resource request. | String/Integer | No |
| device_plugin | resources.requests.memory | Memory resource request. | String | No |
| runtime.runner | image.repository | habana-container-runtime repository path. | String | Yes |
| runtime.runner | image.tag | habana-container-runtime version. | String | Yes |
| runtime.runner | resources.limits.cpu | CPU resource limit. | String/Integer | No |
| runtime.runner | resources.limits.memory | Memory resource limit. | String | No |
| runtime.runner | resources.requests.cpu | CPU resource request. | String/Integer | No |
| runtime.runner | resources.requests.memory | Memory resource request. | String | No |
| runtime.configuration | container_engine | Container engine to use. | String | No |
| runtime.configuration | engine_container_runtime_configuration | Container engine configuration. | String | No |
| runtime.configuration | habana_container_runtime_configuration | habana-container-runtime configuration. | String | No |
| metric_exporter | runner.image.repository | Metric exporter repository path. | String | Yes |
| metric_exporter | runner.image.tag | Metric exporter version. | String | Yes |
| metric_exporter | runner.resources.limits.cpu | CPU resource limit. | String/Integer | No |
| metric_exporter | runner.resources.limits.memory | Memory resource limit. | String | No |
| metric_exporter | runner.resources.requests.cpu | CPU resource request. | String/Integer | No |
| metric_exporter | runner.resources.requests.memory | Memory resource request. | String | No |
| metric_exporter | port | Metric exporter port. | Integer | No |
| metric_exporter | interval | Metric collection interval. | Integer | No |
| feature_discovery | runner.image.repository | Feature discovery repository path. | String | Yes |
| feature_discovery | runner.image.tag | Feature discovery version. | String | Yes |
| feature_discovery | runner.resources.limits.cpu | CPU resource limit. | String/Integer | No |
| feature_discovery | runner.resources.limits.memory | Memory resource limit. | String | No |
| feature_discovery | runner.resources.requests.cpu | CPU resource request. | String/Integer | No |
| feature_discovery | runner.resources.requests.memory | Memory resource request. | String | No |
| feature_discovery | nfd_plugin | Enable/disable feature discovery as a local NFD plugin. | Boolean | No |
| bmc_monitoring | image.repository | BMC exporter repository path. | String | Yes |
| bmc_monitoring | image.tag | BMC exporter version. | String | Yes |
| bmc_monitoring | resources.limits.cpu | CPU resource limit. | String/Integer | No |
| bmc_monitoring | resources.limits.memory | Memory resource limit. | String | No |
| bmc_monitoring | resources.requests.cpu | CPU resource request. | String/Integer | No |
| bmc_monitoring | resources.requests.memory | Memory resource request. | String | No |
| node_selector | Key-Value list | Newline-separated key-value pairs to set as the Operator’s node selector. | String | No |
Note
firmware_flush is supported only on PCI-based nodes with disabled eROM write protection. For more details, refer to eROM Upgrade.
For more information about CPU and memory requests and limits, see the Kubernetes documentation.
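Before editing, it can help to save the instance that the Helm chart created, so the deployed values can be compared against the fields in the table. The sketch below assumes the instance is named habana-ai (the name used in the example manifest in this section) and that the resource can be addressed as clusterpolicy.habanalabs.habana.ai; both are assumptions, and export_cluster_policy is a hypothetical helper name.

```shell
# Sketch: save the live ClusterPolicy to a file before editing it.
# export_cluster_policy is a hypothetical helper name.
export_cluster_policy() {
  local name="${1:-habana-ai}"                 # assumed instance name
  local out="${2:-clusterpolicy-current.yaml}" # hypothetical output file
  # -o yaml prints the full object, including server-populated fields
  # such as status and metadata.managedFields.
  kubectl get clusterpolicy.habanalabs.habana.ai "$name" -o yaml > "$out"
}
```

The exported file contains server-populated metadata in addition to the spec, so it is better suited as a reference for comparison than as a file to re-apply unchanged.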
To customize the ClusterPolicy, follow these steps:
Modify the clusterpolicy.yaml file by using the example below. Choose one option under the container_engine field:
apiVersion: habanalabs.habana.ai/v1
kind: ClusterPolicy
metadata:
  name: habana-ai
spec:
  image_registry: vault.habana.ai
  driver:
    driver_loader:
      images:
        ubuntu_22.04:
          repository: vault.habana.ai/habana-ai-operator/driver/ubuntu22.04/driver-installer
          tag: 1.21.1-16
        rhel_8.6:
          repository: vault.habana.ai/habana-ai-operator/driver/rhel8.6/driver-installer
          tag: 1.21.1-16
        rhel_9.2:
          repository: vault.habana.ai/habana-ai-operator/driver/rhel9.2/driver-installer
          tag: 1.21.1-16
        rhel_9.4:
          repository: vault.habana.ai/habana-ai-operator/driver/rhel9.4/driver-installer
          tag: 1.21.1-16
        tencentos_3.1:
          repository: vault.habana.ai/habana-ai-operator/driver/tencentos3.1/driver-installer
          tag: 1.21.1-16
      resources:
        limits:
          cpu: 500m
          memory: 2Gi
        requests:
          cpu: 100m
          memory: 512Mi
      repo_server: vault.habana.ai
      repo_path: artifactory/gaudi-installer/repos
      mlnx_ofed_repo_path: artifactory/gaudi-installer/deps
      mlnx_ofed_version: mlnx-ofed-5.8-2.0.3.0-rhel8.4-x86_64.tar.gz
      # hugepage: hugepages_number_int_optional
      external_ports: false
      firmware_flush: false
    driver_runner:
      image:
        repository: vault.habana.ai/habana-ai-operator/driver/ubuntu22.04/driver-installer
        tag: 1.21.1-16
      resources:
        limits:
          cpu: 20m
          memory: 64Mi
        requests:
          cpu: 10m
          memory: 32Mi
  device_plugin:
    image:
      repository: vault.habana.ai/docker-k8s-device-plugin/docker-k8s-device-plugin
      tag: 1.21.1
    resources:
      limits:
        cpu: 20m
        memory: 64Mi
      requests:
        cpu: 10m
        memory: 32Mi
  runtime:
    runner:
      image:
        repository: vault.habana.ai/habana-ai-operator/habana-container-runtime
        tag: 1.21.1-16
      resources:
        limits:
          cpu: 20m
          memory: 64Mi
        requests:
          cpu: 10m
          memory: 32Mi
    configuration:
      container_engine: {containerd / docker / crio}
      engine_container_runtime_configuration: ""
      habana_container_runtime_configuration: ""
  metric_exporter:
    runner:
      image:
        repository: vault.habana.ai/gaudi-metric-exporter/metric-exporter
        tag: 1.21.1-16
      resources:
        limits:
          cpu: 150m
          memory: 120Mi
        requests:
          cpu: 100m
          memory: 100Mi
    port: 41611
    interval: 20
  feature_discovery:
    runner:
      image:
        repository: vault.habana.ai/habana-ai-operator/habanalabs-feature-discovery
        tag: 1.21.1-16
      resources:
        limits:
          cpu: 20m
          memory: 64Mi
        requests:
          cpu: 10m
          memory: 32Mi
    nfd_plugin: false
  bmc_monitoring:
    image:
      repository: vault.habana.ai/habana-bmc-exporter/bmc-exporter
      tag: 1.21.1-16
    resources:
      limits:
        cpu: 250m
        memory: 250Mi
      requests:
        cpu: 150m
        memory: 100Mi
  # node_selector:
  #   key_optional: value_optional
Apply the YAML file:
kubectl apply -f clusterpolicy.yaml
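The two steps above can be wrapped with a couple of guards, sketched below under stated assumptions. The first check confirms that the container_engine placeholder from the example was replaced with exactly one of containerd, docker, or crio; the second asks the API server to validate the manifest with a server-side dry run before the real apply. The helper names check_container_engine and apply_cluster_policy are hypothetical, and the text-based check assumes the value sits on the same line as the key.

```shell
# Sketch: guard `kubectl apply` with two checks.
# check_container_engine and apply_cluster_policy are hypothetical helper names.

# Prints the configured engine and succeeds if it is one of the three
# options from the example. An absent or empty field also succeeds,
# because the table lists container_engine as not required.
check_container_engine() {
  local file="$1" engine
  engine=$(sed -n 's/^[[:space:]]*container_engine:[[:space:]]*\([^[:space:]#]*\).*$/\1/p' "$file" \
    | head -n 1 | tr -d "\"'")
  case "$engine" in
    containerd|docker|crio) echo "$engine" ;;
    "") return 0 ;;
    *) echo "container_engine must be containerd, docker, or crio (found: '$engine')" >&2
       return 1 ;;
  esac
}

apply_cluster_policy() {
  local file="${1:-clusterpolicy.yaml}"
  check_container_engine "$file" >/dev/null || return 1
  # Server-side dry run: the API server validates the object without persisting it.
  kubectl apply --dry-run=server -f "$file" || return 1
  kubectl apply -f "$file"
}
```

Running apply_cluster_policy clusterpolicy.yaml stops before contacting the cluster if the placeholder {containerd / docker / crio} is still present in the file.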