Kubernetes Installation¶
Kubernetes provides an efficient and manageable way to orchestrate deep learning workloads at scale. To deploy a generic Kubernetes solution for an on-premise platform or as a baseline in a larger cloud configuration, Intel® Gaudi® provides the following components that can be downloaded from the Intel Gaudi vault.
Intel Gaudi Base Operator for Kubernetes - Recommended method of installation. Installs all the necessary Intel Gaudi software components within your Kubernetes cluster and allows you to automate their management.
Intel Gaudi Device Plugin for Kubernetes - The minimum required installation allowing you to access the Intel Gaudi devices in Kubernetes.
Once installed, refer to Running Kubernetes Workloads with Gaudi.
Intel Gaudi Base Operator for Kubernetes¶
Intel® Gaudi® Base Operator for Kubernetes automates the management of all necessary Intel Gaudi software components on a Kubernetes cluster. These include drivers, Kubernetes device plugin, container runtime, feature discovery, and monitoring tools.
Prerequisites¶
Make sure to review the supported Kubernetes versions listed in the Support Matrix.
Make sure the kubectl and helm CLIs are installed.
For the Intel Gaudi Base Operator, Driver and Software Installation is not required.
Deploying Intel Gaudi Base Operator¶
Install the Operator on a cluster by deploying a Helm chart:
Create the Operator namespace:
kubectl create namespace habana-ai-operator
kubectl label namespace habana-ai-operator pod-security.kubernetes.io/enforce=privileged --overwrite
kubectl label namespace habana-ai-operator pod-security.kubernetes.io/audit=privileged --overwrite
kubectl label namespace habana-ai-operator pod-security.kubernetes.io/warn=privileged --overwrite
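If you want to confirm that the namespace was created with the expected Pod Security labels, a generic kubectl query (not part of the official procedure) can list it together with its labels:
kubectl get namespace habana-ai-operator --show-labels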
Install the Helm chart:
helm repo add gaudi-helm https://vault.habana.ai/artifactory/api/helm/gaudi-helm
helm repo update
helm install habana-ai-operator gaudi-helm/habana-ai-operator --version 1.18.0-524 -n habana-ai-operator
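Once the chart is installed, you can check that the Operator pods come up in the namespace used above. The exact pod names vary by release, so the following is only a generic status check:
kubectl get pods -n habana-ai-operator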
Intel Gaudi Base Operator DaemonSets¶
The Operator installs the following six DaemonSets on a Kubernetes cluster:
Intel Gaudi Feature Discovery - Labels each Kubernetes pod with information about the node it is running on. These labels contain information about the Gaudi device availability and the driver version, and they help deploy other DaemonSets to the appropriate nodes. The DaemonSet relies on a node_selector field specified in the cluster policy, without additional label selectors or affinities.
Driver - Loads the driver on the cluster node by running the habanalabs-installer.sh installation script. The DaemonSet has an affinity for a label provided by the Intel Gaudi Feature Discovery DaemonSet, ensuring that the driver is installed only on servers with Gaudi devices.
habana-container-runtime - Exposes the Intel Gaudi network and uverbs interfaces for the pod. The DaemonSet copies runtime binaries to the host, configures the engine’s default runtime, and restarts the container engine to load the new configuration. This DaemonSet has an affinity for labels indicating that a Gaudi device is available and the driver has been loaded on the node.
Intel Gaudi Device Plugin for Kubernetes - Lists the available Gaudi devices and exposes them to kubelet as a resource available for workloads as habana.ai/gaudi (see the example pod spec after this list). This DaemonSet has an affinity for labels indicating that a Gaudi device is available and the driver has been loaded on the node. For more details, see Intel Gaudi Device Plugin for Kubernetes.
Prometheus Metric Exporter - Exports metrics about the node and the Gaudi devices on it. This DaemonSet has an affinity for labels indicating that a Gaudi device is available and the driver has been loaded on the node. For more details, see Prometheus Metric Exporter.
BMC Exporter - Exports metrics by utilizing the Redfish protocol to scrape the node’s BMC. This DaemonSet has an affinity for labels indicating that a Gaudi device is available and the driver has been loaded on the node. For more details, see BMC Exporter User Guide.
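As a reference for how the habana.ai/gaudi resource exposed by the device plugin is consumed, the sketch below shows a minimal pod spec requesting a single Gaudi device. The pod name and container image are placeholders for illustration only, not part of the Operator installation:
apiVersion: v1
kind: Pod
metadata:
  name: gaudi-test                        # hypothetical name for illustration
spec:
  containers:
  - name: workload
    image: your-training-image:latest     # placeholder image
    resources:
      limits:
        habana.ai/gaudi: 1                # request one Gaudi device exposed by the device plugin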
Deploying ClusterPolicy Manually¶
To deploy ClusterPolicy manually, create the clusterpolicy.yaml file containing the following:
apiVersion: habanalabs.habana.ai/v1
kind: ClusterPolicy
metadata:
  name: habana-ai
spec:
  image_registry: vault.habana.ai
  driver:
    driver_loader:
      images:
        ubuntu_22.04:
          repository: vault.habana.ai/habana-ai-operator/driver/ubuntu22.04/driver-installer
          tag: 1.18.0-524
        rhel_8.6:
          repository: vault.habana.ai/habana-ai-operator/driver/rhel8.6/driver-installer
          tag: 1.18.0-524
        rhel_9.2:
          repository: vault.habana.ai/habana-ai-operator/driver/rhel9.2/driver-installer
          tag: 1.18.0-524
        tencentos_3.1:
          repository: vault.habana.ai/habana-ai-operator/driver/tencentos3.1/driver-installer
          tag: 1.18.0-524
        amzn_2:
          repository: vault.habana.ai/habana-ai-operator/driver/amzn2/driver-installer
          tag: 1.18.0-524
      resources:
        limits:
          cpu: cpu_str_or_int_optional
          memory: memory_str_optional
        requests:
          cpu: cpu_str_or_int_optional
          memory: memory_str_optional
      repo_server: vault.habana.ai
      repo_path: artifactory/gaudi-installer/repos
      mlnx_ofed_repo_path: artifactory/gaudi-installer/deps
      mlnx_ofed_version: mlnx-ofed-5.8-2.0.3.0-rhel8.4-x86_64.tar.gz
      hugepages: hugepages_number_int_optional
      external_ports: turn_on_external_port_bool_optional
    driver_runner:
      image:
        repository: vault.habana.ai/habana-ai-operator/driver/ubuntu22.04/driver-installer
        tag: 1.18.0-524
      resources:
        limits:
          cpu: cpu_str_or_int_optional
          memory: memory_str_optional
        requests:
          cpu: cpu_str_or_int_optional
          memory: memory_str_optional
  device_plugin:
    image:
      repository: vault.habana.ai/docker-k8s-device-plugin/docker-k8s-device-plugin
      tag: 1.18.0
    resources:
      limits:
        cpu: cpu_str_or_int_optional
        memory: memory_str_optional
      requests:
        cpu: cpu_str_or_int_optional
        memory: memory_str_optional
  runtime:
    runner:
      image:
        repository: vault.habana.ai/habana-ai-operator/habana-container-runtime
        tag: 1.18.0-524
      resources:
        limits:
          cpu: cpu_str_or_int_optional
          memory: memory_str_optional
        requests:
          cpu: cpu_str_or_int_optional
          memory: memory_str_optional
    configuration:
      container_engine: one_of_containerd_docker_crio
      engine_container_runtime_configuration: container_engine_configuration_optional
      habana_container_runtime_configuration: container_runtime_configuration_optional
  metric_exporter:
    runner:
      image:
        repository: vault.habana.ai/gaudi-metric-exporter/metric-exporter
        tag: 1.18.0-524
      resources:
        limits:
          cpu: cpu_str_or_int_optional
          memory: memory_str_optional
        requests:
          cpu: cpu_str_or_int_optional
          memory: memory_str_optional
    port: 41611
    interval: 20
  feature_discovery:
    runner:
      image:
        repository: vault.habana.ai/habana-ai-operator/habanalabs-feature-discovery
        tag: 1.18.0-524
      resources:
        limits:
          cpu: cpu_str_or_int_optional
          memory: memory_str_optional
        requests:
          cpu: cpu_str_or_int_optional
          memory: memory_str_optional
    nfd_plugin: boolean_nfd_installed
  bmc_monitoring:
    image:
      repository: vault.habana.ai/habana-bmc-exporter/bmc-exporter
      tag: 1.18.0-524
    resources:
      limits:
        cpu: cpu_str_or_int_optional
        memory: memory_str_optional
      requests:
        cpu: cpu_str_or_int_optional
        memory: memory_str_optional
  node_selector:
    key_optional: value_optional
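The values ending in _optional, as well as the one_of_ and boolean_ placeholders, are meant to be replaced with concrete settings or removed. As an illustration only (the values below are examples, not recommendations), a containerd-based cluster might fill the runtime configuration and one of the resources blocks like this:
configuration:
  container_engine: containerd
resources:
  limits:
    cpu: "2"
    memory: 2Gi
  requests:
    cpu: "1"
    memory: 1Gi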
Apply the YAML file by running the following command:
kubectl apply -f clusterpolicy.yaml
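After the policy is applied, the Operator reconciles it and rolls out the DaemonSets described above. A simple way to follow progress (a generic kubectl query, assuming the Habana component pods carry "habana" in their names) is:
kubectl get pods -A | grep habana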
Intel Gaudi Device Plugin for Kubernetes¶
This is a Kubernetes device plugin implementation that enables the registration of the Intel® Gaudi® AI accelerator in a container cluster for compute workloads. With the appropriate hardware and this plugin deployed in your Kubernetes cluster, you can run jobs on the Gaudi device.
The Intel Gaudi device plugin for Kubernetes is a DaemonSet that allows you to automatically:
Enable the registration of Gaudi devices in your Kubernetes cluster.
Keep track of device health.
Prerequisites¶
Make sure to review the supported Kubernetes versions listed in the Support Matrix.
Make sure Intel Gaudi software drivers are loaded on the system. Refer to Driver and Software Installation.
Deploying Intel Gaudi Device Plugin for Kubernetes¶
Run the device plugin on all the Gaudi nodes by deploying the following DaemonSet using the kubectl create command. Use the associated .yaml file to set up the environment:
kubectl create -f https://vault.habana.ai/artifactory/docker-k8s-device-plugin/habana-k8s-device-plugin.yaml
Note
kubectl requires access to a Kubernetes cluster to implement its commands. To check the access to the kubectl command, run kubectl get pod -A.
Check the device plugin deployment status by running the following command:
kubectl get pods -n habana-system
Expected result:
NAME                                       READY   STATUS    RESTARTS   AGE
habanalabs-device-plugin-daemonset-qtpnh   1/1     Running   0          2d11h
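Once the plugin pod is Running, the Gaudi devices should appear in the node's allocatable resources. A quick way to confirm this with standard kubectl usage (replace <node-name> with one of your Gaudi nodes) is:
kubectl describe node <node-name> | grep habana.ai/gaudi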