VMware Tanzu User Guide

VMware Tanzu provides an efficient and manageable way to orchestrate deep learning workloads at scale. This document provides instructions for setting up a generic VMware Tanzu-based solution on an on-premises platform.

Prerequisites

  • The Intel® Gaudi® software packages required to install and load the firmware (FW) and driver. For more details, refer to the Installation Guide.

  • TensorFlow on Intel Gaudi software version 1.14.0 or earlier.

  • The Intel Gaudi device plugin for Kubernetes. For more details, refer to the Kubernetes User Guide.

  • A Kubernetes version listed in the Support Matrix.

  • A VMware Tanzu cluster that is up and running. For more details, refer to the documentation here and this tutorial from a third party.

Validating Intel Gaudi Driver

To verify that the Intel Gaudi driver is loaded, run the following command:

lsmod | grep habanalabs

Expected result (module sizes and reference counts may differ on your system):

habanalabs           2146304  8
habanalabs_ib          73728  10
habanalabs_cn         712704  8
habanalabs_en          69632  8
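
The check above can be scripted so that a node setup script fails early when a module is missing. The following is a minimal sketch, not part of the official tooling: the function name check_habana_modules is hypothetical, and the module list is assumed from the expected result shown above, so adjust it if your release loads a different set.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads lsmod-style text on stdin and reports any
# module from the expected list that is not loaded.
# Assumption: the module list mirrors the expected result in this guide.
check_habana_modules() {
    local loaded missing=0 mod
    # The first column of lsmod output is the module name.
    loaded=$(awk '{print $1}')
    for mod in habanalabs habanalabs_ib habanalabs_cn habanalabs_en; do
        # -x matches the whole line, -F treats the name as a fixed string.
        if ! grep -qxF "$mod" <<<"$loaded"; then
            echo "missing module: $mod"
            missing=1
        fi
    done
    return "$missing"
}
```

A possible invocation on a Gaudi node is `lsmod | check_habana_modules && echo "all modules loaded"`.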

To confirm that the Intel Gaudi driver is working properly on the node and to check the driver and firmware versions, use one of the following options:

  • Option 1

    Command:

    hl-smi -d PRODUCT CLOCK -L
    

    Output:

    ================ HL-SMI LOG ================
    
    Timestamp                               : Sat Mar  2 17:50:36 IST 2024
    Driver Version                          : 1.14.0-9e8ecf8
    HL-SMI Version                          : hl-1.14.0-fw-48.0.1.0 (Jan 18 2024 - 20:20:49)
    
    Attached AIPs                           : 2
    
    [0] AIP (accel0) 0000:06:00.0
          Product Name                    : HL-225
          Model Number                    : F08GL0AIG029A
          Serial Number                   : AM30032551
          Module status                   : Operational
          Module ID                       : 3
          PCB Assembly Version            : V1A
          PCB Version                     : R0E
          HL Revision                     : 1
          AIP UUID                        : 01P0-HL2080A0-15-TF8A78-03-05-07
          AIP Status                      : Engineering Sample
          Firmware [FIT] Version          : Linux gaudi2 5.10.18-hl-gaudi2-1.14.0-fw-48.0.1-sec-7 #1 SMP PREEMPT Sun Jan 7 20:12:35 IST 2024 aarch64 GNU/Linux
          Firmware [SPI] Version          : Preboot version hl-gaudi2-1.14.0-fw-48.0.1-sec-7 (Jan 07 2024 - 20:03:16)
          Firmware [UBOOT] Version        : U-Boot 2021.04-hl-gaudi2-1.14.0-fw-48.0.1-sec-7 (Jan 07 2024 - 20:11:55 +0200) build#: 10967
          Firmware [OS] Version           : Zephyr 2.7.2-hl-gaudi2-1.14.0-fw-48.0.1-sec-7 (Jan  7 2024 - 20:03:29)
          CPLD Version                    : 0x00000010
    
  • Option 2

    Command:

    grep -E "$" /sys/class/habanalabs/hl?/*ver | cut -d / -f5-
    

    Output:

    hl0/armcp_kernel_ver:Linux gaudi2 5.10.18-hl-gaudi2-1.14.0-fw-48.0.1-sec-7 #1 SMP PREEMPT Sun Jan 7 20:12:35 IST 2024 aarch64 GNU/Linux
    hl0/armcp_ver:arcmgmt version hl-gaudi2-1.14.0-fw-48.0.1-sec-7 (Jan  7 2024 - 20:03:29)
    hl0/cpld_ver:0x00000010
    hl0/cpucp_kernel_ver:Linux gaudi2 5.10.18-hl-gaudi2-1.14.0-fw-48.0.1-sec-7 #1 SMP PREEMPT Sun Jan 7 20:12:35 IST 2024 aarch64 GNU/Linux
    hl0/cpucp_ver:arcmgmt version hl-gaudi2-1.14.0-fw-48.0.1-sec-7 (Jan  7 2024 - 20:03:29)
    hl0/driver_ver:1.14.0-9e8ecf8
    hl0/fuse_ver:01P0-HL2080A0-15-TF8A78-03-05-07
    hl0/fw_os_ver:Zephyr 2.7.2-hl-gaudi2-1.14.0-fw-48.0.1-sec-7 (Jan  7 2024 - 20:03:29)
    hl0/preboot_btl_ver:Preboot version hl-gaudi2-1.14.0-fw-48.0.1-sec-7 (Jan 07 2024 - 20:03:16)
    hl0/uboot_ver:U-Boot 2021.04-hl-gaudi2-1.14.0-fw-48.0.1-sec-7 (Jan 07 2024 - 20:11:55 +0200) build#: 10967
    hl0/vrm_ver:0x04 0x04:0x00:0x00
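
When several nodes are checked, it can help to reduce the output to the one value that matters. The sketch below extracts the driver version from hl-smi log text and compares it with the version you expect. Both function names are hypothetical, and the parsing assumes the "Driver Version : <version>-<hash>" line format shown in the Option 1 output.

```shell
#!/usr/bin/env bash
# Hypothetical helper: prints the driver version (without the build hash)
# found in hl-smi log text read from stdin.
# Assumption: the line looks like "Driver Version   : 1.14.0-9e8ecf8".
hlsmi_driver_version() {
    sed -n 's/^[[:space:]]*Driver Version[[:space:]]*:[[:space:]]*\([0-9.]*\).*$/\1/p' | head -n 1
}

# Hypothetical helper: succeeds only if the parsed version equals $1.
expect_driver_version() {
    local want=$1 got
    got=$(hlsmi_driver_version)
    if [ "$got" != "$want" ]; then
        echo "driver version mismatch: got '${got}', want '${want}'"
        return 1
    fi
}
```

A possible invocation is `hl-smi -d PRODUCT CLOCK -L | expect_driver_version 1.14.0`, which returns a non-zero status when the node reports a different driver release.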
    

Deploying Device Plugin

  1. Run the device plugin on all the Gaudi nodes by deploying the following DaemonSet using the kubectl create command. Use the associated .yaml file to set up the environment:

    $ kubectl create -f https://vault.habana.ai/artifactory/docker-k8s-device-plugin/habana-k8s-device-plugin.yaml
    

    Note

    kubectl needs access to a Kubernetes cluster to run its commands. To check that kubectl can reach the cluster, run kubectl get pod -A.

  2. Check the device plugin deployment status by running the following command:

    $ kubectl get pods -n habana-system
    

    Expected result:

    NAME                                       READY   STATUS    RESTARTS   AGE
    habanalabs-device-plugin-daemonset-qtpnh   1/1     Running   0          2d11h
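
The status check in step 2 can be automated when the deployment is part of a larger script. The following is a minimal sketch under stated assumptions: the function name all_pods_ready is hypothetical, and the parsing assumes the default kubectl get pods column layout shown in the expected result.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads "kubectl get pods" table text on stdin and
# fails if any pod is not Running with all of its containers ready.
# Assumption: default column layout NAME READY STATUS RESTARTS AGE.
all_pods_ready() {
    awk '
        NR == 1 { next }                      # skip the header row
        {
            split($2, ready, "/")             # READY is "<ready>/<total>"
            if ($3 != "Running" || ready[1] != ready[2]) {
                print "not ready: " $1 " (" $2 ", " $3 ")"
                bad = 1
            }
            seen++
        }
        END {
            if (seen == 0) { print "no pods found"; bad = 1 }
            exit bad + 0
        }
    '
}
```

A possible invocation is `kubectl get pods -n habana-system | all_pods_ready`. An empty table is treated as a failure because the DaemonSet is expected to schedule one pod per Gaudi node.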
    

Running Gaudi Jobs Example

You can create a Kubernetes pod that acquires a Gaudi device by using the resources.limits field. The following example uses Intel Gaudi’s TensorFlow container image and runs hl-smi inside the container.

  1. Run the job:

    $ cat <<EOF | kubectl apply -f -
    apiVersion: batch/v1
    kind: Job
    metadata:
       name: habanalabs-gaudi-demo2
    spec:
       template:
          spec:
             hostIPC: true
             restartPolicy: OnFailure
             containers:
             - name: habana-ai-base-container2
               image: vault.habana.ai/gaudi-docker/1.14.0/ubuntu22.04/habanalabs/tensorflow-installer-tf-cpu-2.15.0:latest
               workingDir: /root
               # Print the device status, then exit.
               command: ["hl-smi"]
               securityContext:
                  capabilities:
                     add: ["SYS_NICE"]
               resources:
                  limits:
                     # Request one Gaudi device from the device plugin.
                     habana.ai/gaudi: 1
    EOF
    
  2. Check the pod status:

    $ kubectl get pods
    
  3. Retrieve the name of the pod from the previous output and view the results:

    $ kubectl logs <pod-name>
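
Steps 2 and 3 can be combined so that the logs are fetched only after the Job has finished. The following is a minimal sketch: the function name gaudi_job_logs is hypothetical, the default Job name is taken from the manifest in step 1, and the 300-second timeout is an assumption that may be too short if the image still has to be pulled.

```shell
#!/usr/bin/env bash
# Hypothetical helper: waits for a Job to complete, then prints its logs.
# Assumptions: the default Job name matches the manifest in step 1, and a
# 300-second timeout covers the image pull and the hl-smi run.
gaudi_job_logs() {
    local job=${1:-habanalabs-gaudi-demo2} timeout=${2:-300s}
    # kubectl wait blocks until the Job reports the Complete condition.
    # Its status message goes to stderr so stdout carries only the logs.
    if ! kubectl wait --for=condition=complete --timeout="$timeout" "job/$job" >&2; then
        echo "job $job did not complete within $timeout" >&2
        return 1
    fi
    # Passing job/<name> lets kubectl pick a pod that belongs to the Job.
    kubectl logs "job/$job"
}
```

Calling `gaudi_job_logs` with no arguments uses the defaults. Pass a Job name and a timeout, for example `gaudi_job_logs habanalabs-gaudi-demo2 600s`, to override them.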