Kubernetes Installation

Kubernetes provides an efficient and manageable way to orchestrate deep learning workloads at scale. To deploy a generic Kubernetes solution on an on-premises platform, or as a baseline in a larger cloud configuration, Intel® Gaudi® provides the following components, which can be downloaded from the Intel Gaudi vault.

Once installation is complete, refer to Running Workloads on Kubernetes.

Intel Gaudi Base Operator for Kubernetes

Intel® Gaudi® Base Operator for Kubernetes automates the management of all necessary Intel Gaudi software components on a Kubernetes cluster. These include drivers, Kubernetes device plugin, container runtime, feature discovery, and monitoring tools.

Prerequisites

Deploying Intel Gaudi Base Operator

Install the Operator on a cluster by deploying a Helm chart:

  1. Create the Operator namespace:

    kubectl create namespace habana-ai-operator
    kubectl label namespace habana-ai-operator pod-security.kubernetes.io/enforce=privileged --overwrite
    kubectl label namespace habana-ai-operator pod-security.kubernetes.io/audit=privileged --overwrite
    kubectl label namespace habana-ai-operator pod-security.kubernetes.io/warn=privileged --overwrite
    
  2. Install Helm chart:

    helm repo add gaudi-helm https://vault.habana.ai/artifactory/api/helm/gaudi-helm
    helm repo update
    helm install habana-ai-operator gaudi-helm/habana-ai-operator --version 1.18.0-524 -n habana-ai-operator
    

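To confirm the deployment, you can list the Helm release and the Operator pods. These are standard Helm and kubectl commands; the namespace and release name match the ones used in the steps above:

    helm list -n habana-ai-operator
    kubectl get pods -n habana-ai-operator
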
Intel Gaudi Base Operator DaemonSets

The Operator installs the following six DaemonSets on a Kubernetes cluster:

  • Intel Gaudi Feature Discovery - Labels each Kubernetes node with information about the Gaudi device availability and the driver version. These labels help deploy the other DaemonSets to the appropriate nodes. The DaemonSet relies on a node_selector field specified in the cluster policy, without additional label selectors or affinities. An example of inspecting these labels follows this list.

  • Driver - Loads the driver on the cluster node by running the habanalabs-installer.sh installation script. The DaemonSet has an affinity for a label provided by the Intel Gaudi Feature Discovery DaemonSet, ensuring that the driver is installed only on servers with the Gaudi devices.

  • habana-container-runtime - Exposes the Intel Gaudi network and uverbs interfaces to the pod. The DaemonSet copies runtime binaries to the host, configures the engine’s default runtime, and restarts the container engine to load the new configuration. This DaemonSet has an affinity for labels indicating that a Gaudi device is available and the driver has been loaded on the node.

  • Intel Gaudi Device Plugin for Kubernetes - Lists the available Gaudi devices and exposes them to the kubelet as the habana.ai/gaudi resource that workloads can request. This DaemonSet has an affinity for labels indicating that a Gaudi device is available and the driver has been loaded on the node. For more details, see Intel Gaudi Device Plugin for Kubernetes.

  • Prometheus Metric Exporter - Exports metrics about the node and the Gaudi devices on it. This DaemonSet has an affinity for labels indicating that a Gaudi device is available and the driver has been loaded on the node. For more details, see Prometheus Metric Exporter.

  • BMC Exporter - Exports metrics by utilizing the Redfish protocol to scrape the node’s BMC. This DaemonSet has an affinity for labels indicating that a Gaudi device is available and the driver has been loaded on the node. For more details, see BMC Exporter User Guide.

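For example, after the Operator and a cluster policy are deployed, you can check which DaemonSets were created and inspect the labels that Feature Discovery applied to a node. The commands below assume the DaemonSets run in the habana-ai-operator namespace used when installing the Operator; <node-name> is a placeholder, and the habana filter is used only as an illustrative way to find the relevant labels:

    # DaemonSets created by the Operator
    kubectl get daemonsets -n habana-ai-operator

    # Node labels applied by Feature Discovery (exact label keys may vary by release)
    kubectl describe node <node-name> | grep -i habana
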
Deploying ClusterPolicy Manually

  1. To deploy ClusterPolicy manually, create the clusterpolicy.yaml file containing the following:

    apiVersion: habanalabs.habana.ai/v1
    kind: ClusterPolicy
    metadata:
      name: habana-ai
    spec:
      image_registry: vault.habana.ai
      driver:
        driver_loader:
          images:
            ubuntu_22.04:
              repository: vault.habana.ai/habana-ai-operator/driver/ubuntu22.04/driver-installer
              tag: 1.18.0-524
            rhel_8.6:
              repository: vault.habana.ai/habana-ai-operator/driver/rhel8.6/driver-installer
              tag: 1.18.0-524
            rhel_9.2:
              repository: vault.habana.ai/habana-ai-operator/driver/rhel9.2/driver-installer
              tag: 1.18.0-524
            tencentos_3.1:
              repository: vault.habana.ai/habana-ai-operator/driver/tencentos3.1/driver-installer
              tag: 1.18.0-524
            amzn_2:
              repository: vault.habana.ai/habana-ai-operator/driver/amzn2/driver-installer
              tag: 1.18.0-524
          resources:
            limits:
              cpu: cpu_str_or_int_optional
              memory: memory_str_optional
            requests:
              cpu: cpu_str_or_int_optional
              memory: memory_str_optional
          repo_server: vault.habana.ai
          repo_path: artifactory/gaudi-installer/repos
          mlnx_ofed_repo_path: artifactory/gaudi-installer/deps
          mlnx_ofed_version: mlnx-ofed-5.8-2.0.3.0-rhel8.4-x86_64.tar.gz
          hugepages: hugepages_number_int_optional
          external_ports: turn_on_external_port_bool_optional
          firmware_flush: flush_firmware_on_the_gaudi_cards_bool_optional
        driver_runner:
          image:
            repository: vault.habana.ai/habana-ai-operator/driver/ubuntu22.04/driver-installer
            tag: 1.18.0-524
          resources:
            limits:
              cpu: cpu_str_or_int_optional
              memory: memory_str_optional
            requests:
              cpu: cpu_str_or_int_optional
              memory: memory_str_optional
      device_plugin:
        image:
          repository: vault.habana.ai/docker-k8s-device-plugin/docker-k8s-device-plugin
          tag: 1.18.0
        resources:
          limits:
            cpu: cpu_str_or_int_optional
            memory: memory_str_optional
          requests:
            cpu: cpu_str_or_int_optional
            memory: memory_str_optional
      runtime:
        runner:
          image:
            repository: vault.habana.ai/habana-ai-operator/habana-container-runtime
            tag: 1.18.0-524
          resources:
            limits:
              cpu: cpu_str_or_int_optional
              memory: memory_str_optional
            requests:
              cpu: cpu_str_or_int_optional
              memory: memory_str_optional
        configuration:
          container_engine: one_of_containerd_docker_crio
          engine_container_runtime_configuration: container_engine_configuration_optional
          habana_container_runtime_configuration: container_runtime_configuration_optional
      metric_exporter:
        runner:
          image:
            repository: vault.habana.ai/gaudi-metric-exporter/metric-exporter
            tag: 1.18.0-524
          resources:
            limits:
              cpu: cpu_str_or_int_optional
              memory: memory_str_optional
            requests:
              cpu: cpu_str_or_int_optional
              memory: memory_str_optional
        port: 41611
        interval: 20
      feature_discovery:
        runner:
          image:
            repository: vault.habana.ai/habana-ai-operator/habanalabs-feature-discovery
            tag: 1.18.0-524
          resources:
            limits:
              cpu: cpu_str_or_int_optional
              memory: memory_str_optional
            requests:
              cpu: cpu_str_or_int_optional
              memory: memory_str_optional
        nfd_plugin: boolean_nfd_installed
      bmc_monitoring:
        image:
          repository: vault.habana.ai/habana-bmc-exporter/bmc-exporter
          tag: 1.18.0-524
        resources:
          limits:
            cpu: cpu_str_or_int_optional
            memory: memory_str_optional
          requests:
            cpu: cpu_str_or_int_optional
            memory: memory_str_optional
      node_selector:
        key_optional: value_optional
    

    Note

    firmware_flush is supported only on PCI-based nodes with disabled eROM write protection. For more details, refer to Firmware Upgrade.

  2. Apply the YAML file by running the following command:

    kubectl apply -f clusterpolicy.yaml
    

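For reference, the placeholder values in the ClusterPolicy above follow standard Kubernetes conventions: the optional resources blocks take CPU and memory quantities, container_engine takes one of containerd, docker, or crio, and node_selector takes an arbitrary node label. The fragments below are a minimal sketch of filled-in values; the numbers and the label key are arbitrary examples, not recommendations:

    # Fragments only -- each key appears under the path shown in the full
    # ClusterPolicy above. Values are illustrative.
    resources:
      limits:
        cpu: 2
        memory: 2Gi
      requests:
        cpu: 1
        memory: 1Gi

    configuration:
      container_engine: containerd

    node_selector:
      kubernetes.io/hostname: gaudi-node-1    # any node label can be used
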
Intel Gaudi Device Plugin for Kubernetes

This is a Kubernetes device plugin implementation that enables the registration of the Intel® Gaudi® AI accelerator in a container cluster for compute workloads. With the appropriate hardware and this plugin deployed in your Kubernetes cluster, you can run jobs on Gaudi devices.

The Intel Gaudi device plugin for Kubernetes is a DaemonSet that allows you to automatically:

  • Enable the registration of Gaudi devices in your Kubernetes cluster.

  • Keep track of device health.

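For illustration, once the device plugin is deployed (see the steps below), workloads request Gaudi devices through the habana.ai/gaudi resource name. A minimal pod sketch, with a placeholder container image:

    apiVersion: v1
    kind: Pod
    metadata:
      name: gaudi-example
    spec:
      containers:
        - name: gaudi-example
          image: <your-gaudi-enabled-image>    # placeholder image
          resources:
            limits:
              habana.ai/gaudi: 1    # number of Gaudi devices to allocate
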
Prerequisites

Deploying Intel Gaudi Device Plugin for Kubernetes

  1. Run the device plugin on all Gaudi nodes by deploying the following DaemonSet with the kubectl create command, using the associated .yaml file to set up the environment:

    kubectl create -f https://vault.habana.ai/artifactory/docker-k8s-device-plugin/habana-k8s-device-plugin.yaml
    

    Note

    kubectl requires access to a Kubernetes cluster to run its commands. To verify access, run kubectl get pod -A.

  2. Check the device plugin deployment status by running the following command:

    kubectl get pods -n habana-system
    

    Expected result:

    NAME                                       READY   STATUS    RESTARTS   AGE
    habanalabs-device-plugin-daemonset-qtpnh   1/1     Running   0          2d11h
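
To additionally confirm that the Gaudi devices are registered with the kubelet, you can inspect a node's resources; the habana.ai/gaudi resource should appear under Capacity and Allocatable. The <node-name> value is a placeholder:

    kubectl describe node <node-name> | grep habana.ai/gaudi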