Kubernetes Installation

Kubernetes provides an efficient and manageable way to orchestrate deep learning workloads at scale. To deploy a generic Kubernetes solution on an on-premises platform, or as a baseline in a larger cloud configuration, Intel® Gaudi® provides the following components, which can be downloaded from the Intel Gaudi vault.

Once installed, refer to Running Kubernetes Workloads with Gaudi.

Intel Gaudi Base Operator for Kubernetes

Intel® Gaudi® Base Operator for Kubernetes automates the management of all necessary Intel Gaudi software components on a Kubernetes cluster. These include drivers, Kubernetes device plugin, container runtime, feature discovery, and monitoring tools.

Prerequisites

Deploying Intel Gaudi Base Operator

Install the Operator on a cluster by deploying a Helm chart:

  1. Create the Operator namespace:

    kubectl create namespace habana-ai-operator
    kubectl label namespace habana-ai-operator pod-security.kubernetes.io/enforce=privileged --overwrite
    kubectl label namespace habana-ai-operator pod-security.kubernetes.io/audit=privileged --overwrite
    kubectl label namespace habana-ai-operator pod-security.kubernetes.io/warn=privileged --overwrite
    
  2. Install Helm chart:

    helm repo add gaudi-helm https://vault.habana.ai/artifactory/api/helm/gaudi-helm
    helm repo update
    helm install habana-ai-operator gaudi-helm/habana-ai-operator --version 1.18.0-524 -n habana-ai-operator
    

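After the Helm chart is installed, you can optionally verify that the Operator started before continuing. This is a minimal sketch only; the exact pod names and counts depend on your cluster and the Operator version.

# List the Helm release and the Operator pods in its namespace.
helm list -n habana-ai-operator
kubectl get pods -n habana-ai-operator
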
Intel Gaudi Base Operator DaemonSets

The Operator installs the following six DaemonSets on a Kubernetes cluster:

  • Intel Gaudi Feature Discovery - Labels each Kubernetes node with information about Gaudi device availability and the installed driver version. These labels help deploy the other DaemonSets to the appropriate nodes (a quick way to inspect them is shown after this list). The DaemonSet relies on the node_selector field specified in the cluster policy, without additional label selectors or affinities.

  • Driver - Loads the driver on each cluster node by running the habanalabs-installer.sh installation script. The DaemonSet has an affinity for a label provided by the Intel Gaudi Feature Discovery DaemonSet, ensuring that the driver is installed only on servers with Gaudi devices.

  • habana-container-runtime - Exposes the Intel Gaudi network and uverbs interfaces to the pod. The DaemonSet copies the runtime binaries to the host, configures the engine’s default runtime, and restarts the container engine to load the new configuration. This DaemonSet has an affinity for labels indicating that a Gaudi device is available and the driver has been loaded on the node.

  • Intel Gaudi Device Plugin for Kubernetes - Lists the available Gaudi devices and exposes them to kubelet as a resource available to workloads as habana.ai/gaudi. This DaemonSet has an affinity for labels indicating that a Gaudi device is available and the driver has been loaded on the node. For more details, see Intel Gaudi Device Plugin for Kubernetes.

  • Prometheus Metric Exporter - Exports metrics about the node and the Gaudi devices on it. This DaemonSet has an affinity for labels indicating that a Gaudi device is available and the driver has been loaded on the node. For more details, see Prometheus Metric Exporter.

  • BMC Exporter - Exports metrics by using the Redfish protocol to scrape the node’s BMC. This DaemonSet has an affinity for labels indicating that a Gaudi device is available and the driver has been loaded on the node. For more details, see BMC Exporter User Guide.
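
As a quick sanity check, the sketch below confirms that the DaemonSets were created and inspects the labels that feature discovery applied to a node. The grep pattern is an assumption for illustration; the exact label keys vary by release.

# Confirm that the Operator's DaemonSets were created.
kubectl get daemonsets -n habana-ai-operator

# Inspect the labels applied to a Gaudi node by feature discovery.
# Replace <node-name>; the "habana" match is an assumed label prefix for illustration.
kubectl describe node <node-name> | grep -i habana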

Deploying ClusterPolicy Manually

To deploy the ClusterPolicy manually, create a clusterpolicy.yaml file containing the following:

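# Placeholder values (for example, cpu_str_or_int_optional, memory_str_optional) must be
# replaced with concrete values; fields marked _optional may instead be removed entirely.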
apiVersion: habanalabs.habana.ai/v1
kind: ClusterPolicy
metadata:
  name: habana-ai
spec:
  image_registry: vault.habana.ai
  driver:
    driver_loader:
      images:
        ubuntu_22.04:
          repository: vault.habana.ai/habana-ai-operator/driver/ubuntu22.04/driver-installer
          tag: 1.18.0-524
        rhel_8.6:
          repository: vault.habana.ai/habana-ai-operator/driver/rhel8.6/driver-installer
          tag: 1.18.0-524
        rhel_9.2:
          repository: vault.habana.ai/habana-ai-operator/driver/rhel9.2/driver-installer
          tag: 1.18.0-524
        tencentos_3.1:
          repository: vault.habana.ai/habana-ai-operator/driver/tencentos3.1/driver-installer
          tag: 1.18.0-524
        amzn_2:
          repository: vault.habana.ai/habana-ai-operator/driver/amzn2/driver-installer
          tag: 1.18.0-524
      resources:
        limits:
          cpu: cpu_str_or_int_optional
          memory: memory_str_optional
        requests:
          cpu: cpu_str_or_int_optional
          memory: memory_str_optional
      repo_server: vault.habana.ai
      repo_path: artifactory/gaudi-installer/repos
      mlnx_ofed_repo_path: artifactory/gaudi-installer/deps
      mlnx_ofed_version: mlnx-ofed-5.8-2.0.3.0-rhel8.4-x86_64.tar.gz
      hugepages: hugepages_number_int_optional
      external_ports: turn_on_external_port_bool_optional
    driver_runner:
      image:
        repository: vault.habana.ai/habana-ai-operator/driver/ubuntu22.04/driver-installer
        tag: 1.18.0-524
      resources:
        limits:
          cpu: cpu_str_or_int_optional
          memory: memory_str_optional
        requests:
          cpu: cpu_str_or_int_optional
          memory: memory_str_optional
  device_plugin:
    image:
      repository: vault.habana.ai/docker-k8s-device-plugin/docker-k8s-device-plugin
      tag: 1.18.0
    resources:
      limits:
        cpu: cpu_str_or_int_optional
        memory: memory_str_optional
      requests:
        cpu: cpu_str_or_int_optional
        memory: memory_str_optional
  runtime:
    runner:
      image:
        repository: vault.habana.ai/habana-ai-operator/habana-container-runtime
        tag: 1.18.0-524
      resources:
        limits:
          cpu: cpu_str_or_int_optional
          memory: memory_str_optional
        requests:
          cpu: cpu_str_or_int_optional
          memory: memory_str_optional
    configuration:
      container_engine: one_of_containerd_docker_crio
      engine_container_runtime_configuration: container_engine_configuration_optional
      habana_container_runtime_configuration: container_runtime_configuration_optional
  metric_exporter:
    runner:
      image:
        repository: vault.habana.ai/gaudi-metric-exporter/metric-exporter
        tag: 1.18.0-524
      resources:
        limits:
          cpu: cpu_str_or_int_optional
          memory: memory_str_optional
        requests:
          cpu: cpu_str_or_int_optional
          memory: memory_str_optional
    port: 41611
    interval: 20
  feature_discovery:
    runner:
      image:
        repository: vault.habana.ai/habana-ai-operator/habanalabs-feature-discovery
        tag: 1.18.0-524
      resources:
        limits:
          cpu: cpu_str_or_int_optional
          memory: memory_str_optional
        requests:
          cpu: cpu_str_or_int_optional
          memory: memory_str_optional
    nfd_plugin: boolean_nfd_installed
  bmc_monitoring:
    image:
      repository: vault.habana.ai/habana-bmc-exporter/bmc-exporter
      tag: 1.18.0-524
    resources:
      limits:
        cpu: cpu_str_or_int_optional
        memory: memory_str_optional
      requests:
        cpu: cpu_str_or_int_optional
        memory: memory_str_optional
  node_selector:
    key_optional: value_optional

Apply the YAML file by running the following command:

kubectl apply -f clusterpolicy.yaml
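
After the ClusterPolicy is applied, the Operator rolls out the DaemonSets described above. A minimal way to follow the rollout is sketched below; it can take some time while the driver is installed on each node, and the exact pod names vary.

# Watch the Operator-managed pods come up.
kubectl get pods -n habana-ai-operator -w

# Once the device plugin is running, Gaudi devices appear as allocatable resources on the node.
kubectl describe node <node-name> | grep habana.ai/gaudi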

Intel Gaudi Device Plugin for Kubernetes

This is a Kubernetes device plugin implementation that enables the registration of the Intel® Gaudi® AI accelerator in a container cluster for compute workloads. With the appropriate hardware and this plugin deployed in your Kubernetes cluster, you can run jobs on the Gaudi device.

The Intel Gaudi device plugin for Kubernetes is a DaemonSet that allows you to automatically:

  • Enable the registration of Gaudi devices in your Kubernetes cluster.

  • Keep track of device health.

Prerequisites

Deploying Intel Gaudi Device Plugin for Kubernetes

  1. Run the device plugin on all Gaudi nodes by deploying the DaemonSet with the kubectl create command, using the associated .yaml file to set up the environment:

    kubectl create -f https://vault.habana.ai/artifactory/docker-k8s-device-plugin/habana-k8s-device-plugin.yaml
    

    Note

    kubectl requires access to a Kubernetes cluster to run its commands. To verify that kubectl can access the cluster, run kubectl get pod -A.

  2. Check the device plugin deployment status by running the following command:

    kubectl get pods -n habana-system
    

    Expected result:

    NAME                                       READY   STATUS    RESTARTS   AGE
    habanalabs-device-plugin-daemonset-qtpnh   1/1     Running   0          2d11h
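
Once the device plugin pod reports Running, workloads can request Gaudi devices through the habana.ai/gaudi resource exposed to kubelet. The sketch below schedules a trivial pod with one Gaudi device; the pod name and the busybox image are illustrative placeholders only, and a real workload would use a Gaudi-enabled image.

# Create a minimal pod that requests one Gaudi device via the habana.ai/gaudi resource.
# The pod name and busybox image are placeholders for illustration only.
kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: gaudi-resource-test
spec:
  restartPolicy: Never
  containers:
  - name: test
    image: busybox
    command: ["sh", "-c", "echo scheduled on a Gaudi node && sleep 5"]
    resources:
      limits:
        habana.ai/gaudi: 1
EOF

# The pod should be scheduled onto a node that exposes the habana.ai/gaudi resource.
kubectl get pod gaudi-resource-test -o wide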