VMware Tanzu Guide

VMware Tanzu provides an efficient and manageable way to orchestrate deep learning workloads at scale. This document describes the steps required to set up a generic VMware Tanzu based solution for an on-premises deployment.

Prerequisites

  • Intel® Gaudi® software packages, required to install and load the firmware (FW) and driver. For more details, refer to the Installation Guide.

  • Device Plugin for Kubernetes from the Intel Gaudi vault. For more details, refer to the Kubernetes User Guide.

  • VMware Tanzu cluster up and running. For more details, refer to the documentation here and this tutorial from a third party.

Deployment

Note

For the following examples, make sure to use TensorFlow on Intel Gaudi software version 1.14.0 or below.

Validating Intel Gaudi Driver

  • To verify that the Intel Gaudi driver is loaded, run the following command:

lsmod | grep habanalabs
habanalabs           2146304  8
habanalabs_ib          73728  10
habanalabs_cn         712704  8
habanalabs_en          69632  8
  • After loading the Intel Gaudi driver, confirm that it loaded properly on the node. Either of the following two commands can be used to validate the driver:

hl-smi -d PRODUCT CLOCK -L
================ HL-SMI LOG ================

Timestamp                               : Sat Mar  2 17:50:36 IST 2024
Driver Version                          : 1.14.0-9e8ecf8
HL-SMI Version                          : hl-1.14.0-fw-48.0.1.0 (Jan 18 2024 - 20:20:49)

Attached AIPs                           : 2

[0] AIP (accel0) 0000:06:00.0
        Product Name                    : HL-225
        Model Number                    : F08GL0AIG029A
        Serial Number                   : AM30032551
        Module status                   : Operational
        Module ID                       : 3
        PCB Assembly Version            : V1A
        PCB Version                     : R0E
        HL Revision                     : 1
        AIP UUID                        : 01P0-HL2080A0-15-TF8A78-03-05-07
        AIP Status                      : Engineering Sample
        Firmware [FIT] Version          : Linux gaudi2 5.10.18-hl-gaudi2-1.14.0-fw-48.0.1-sec-7 #1 SMP PREEMPT Sun Jan 7 20:12:35 IST 2024 aarch64 GNU/Linux
        Firmware [SPI] Version          : Preboot version hl-gaudi2-1.14.0-fw-48.0.1-sec-7 (Jan 07 2024 - 20:03:16)
        Firmware [UBOOT] Version        : U-Boot 2021.04-hl-gaudi2-1.14.0-fw-48.0.1-sec-7 (Jan 07 2024 - 20:11:55 +0200) build#: 10967
        Firmware [OS] Version           : Zephyr 2.7.2-hl-gaudi2-1.14.0-fw-48.0.1-sec-7 (Jan  7 2024 - 20:03:29)
        CPLD Version                    : 0x00000010

grep -E "$" /sys/class/habanalabs/hl?/*ver | cut -d / -f5-
hl0/armcp_kernel_ver:Linux gaudi2 5.10.18-hl-gaudi2-1.14.0-fw-48.0.1-sec-7 #1 SMP PREEMPT Sun Jan 7 20:12:35 IST 2024 aarch64 GNU/Linux
hl0/armcp_ver:arcmgmt version hl-gaudi2-1.14.0-fw-48.0.1-sec-7 (Jan  7 2024 - 20:03:29)
hl0/cpld_ver:0x00000010
hl0/cpucp_kernel_ver:Linux gaudi2 5.10.18-hl-gaudi2-1.14.0-fw-48.0.1-sec-7 #1 SMP PREEMPT Sun Jan 7 20:12:35 IST 2024 aarch64 GNU/Linux
hl0/cpucp_ver:arcmgmt version hl-gaudi2-1.14.0-fw-48.0.1-sec-7 (Jan  7 2024 - 20:03:29)
hl0/driver_ver:1.14.0-9e8ecf8
hl0/fuse_ver:01P0-HL2080A0-15-TF8A78-03-05-07
hl0/fw_os_ver:Zephyr 2.7.2-hl-gaudi2-1.14.0-fw-48.0.1-sec-7 (Jan  7 2024 - 20:03:29)
hl0/preboot_btl_ver:Preboot version hl-gaudi2-1.14.0-fw-48.0.1-sec-7 (Jan 07 2024 - 20:03:16)
hl0/uboot_ver:U-Boot 2021.04-hl-gaudi2-1.14.0-fw-48.0.1-sec-7 (Jan 07 2024 - 20:11:55 +0200) build#: 10967
hl0/vrm_ver:0x04 0x04:0x00:0x00
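
The checks above can also be scripted. The following is a minimal sketch, not part of the Intel Gaudi tooling: habanalabs_loaded, release_of, and version_at_most are hypothetical helper names, GNU sort is assumed for its -V option, and the sketch assumes that the driver version tracks the Intel Gaudi software release referred to in the note above.

```shell
#!/bin/sh
# Hypothetical helpers; the names are illustrative, not part of the Gaudi tools.

# Succeed if the lsmod output on stdin lists the habanalabs module itself
# (not only habanalabs_ib, habanalabs_cn, or habanalabs_en).
habanalabs_loaded() {
    awk '$1 == "habanalabs" { found = 1 } END { exit !found }'
}

# Strip the build suffix from a driver version: 1.14.0-9e8ecf8 -> 1.14.0
release_of() {
    printf '%s\n' "${1%%-*}"
}

# Succeed if release $1 is less than or equal to release $2.
# Assumes GNU sort, whose -V option orders version strings.
version_at_most() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$1" ]
}

# On a Gaudi node, the helpers could be combined like this:
#   lsmod | habanalabs_loaded || echo "habanalabs driver is not loaded"
#   ver=$(release_of "$(cat /sys/class/habanalabs/hl0/driver_ver)")
#   version_at_most "$ver" 1.14.0 || echo "driver $ver is newer than 1.14.0"
```

The helpers read text rather than calling lsmod or sysfs directly, so they can be tried on saved output from any node before being wired into a provisioning script.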

Device Plugin

The Gaudi device resource must be enabled to support VMware Tanzu. To do so, run the device plugin on all nodes equipped with Gaudi by deploying the following DaemonSet with the kubectl create command.

Note

kubectl needs access to a Kubernetes cluster to run these commands. To check that kubectl can reach the cluster, run $ kubectl get pod -A.

  • To deploy the device plugin, create it from the associated .yaml file:

$ kubectl create -f https://vault.habana.ai/artifactory/docker-k8s-device-plugin/habana-k8s-device-plugin.yaml
  • To check the device plugin deployment status, run the following command:

$ kubectl get pods -n habana-system
NAME                                       READY   STATUS    RESTARTS   AGE
habanalabs-device-plugin-daemonset-qtpnh   1/1     Running   0          2d11h
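
On a cluster with many Gaudi nodes, scanning the status column by eye becomes tedious. The sketch below filters the listing down to pods that are not yet running; pods_not_running is a hypothetical helper name, and it assumes the default column layout of kubectl get pods shown above (NAME, READY, STATUS, ...).

```shell
#!/bin/sh
# Hypothetical helper; not part of kubectl or the Gaudi device plugin.

# Print the names of pods whose STATUS column is not "Running".
# Reads `kubectl get pods` output (including its header line) on stdin.
pods_not_running() {
    awk 'NR > 1 && $3 != "Running" { print $1 }'
}

# Usage against a live cluster (empty output means every pod is running):
#   kubectl get pods -n habana-system | pods_not_running
#
# Once the plugin is running, each Gaudi node should advertise the
# habana.ai/gaudi resource. One way to list it per node:
#   kubectl get nodes -o custom-columns='NODE:.metadata.name,GAUDI:.status.allocatable.habana\.ai/gaudi'
```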

Running Gaudi Jobs Example

You can create a Kubernetes Job whose Pod acquires a Gaudi device by using the resources.limits field. The following example uses Intel Gaudi’s TensorFlow container image and runs hl-smi inside the container:

$ cat <<EOF | kubectl apply -f -
apiVersion: batch/v1
kind: Job
metadata:
   name: habanalabs-gaudi-demo2
spec:
   template:
      spec:
         hostIPC: true
         restartPolicy: OnFailure
         containers:
         - name: habana-ai-base-container2
           image: vault.habana.ai/gaudi-docker/1.14.0/ubuntu22.04/habanalabs/tensorflow-installer-tf-cpu-2.15.0:latest
           workingDir: /root
           command: ["hl-smi"]
           securityContext:
              capabilities:
                 add: ["SYS_NICE"]
           resources:
              limits:
                 habana.ai/gaudi: 1
EOF
  • To check the pod status, run the following command:

$ kubectl get pods

Find your pod, then use kubectl logs to check its log.
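
The two steps above can be combined into one. The following is a minimal sketch under stated assumptions: show_job_log is a hypothetical helper name, the job name is the one from the example manifest, and the 300s default timeout is an arbitrary choice. It relies only on the standard kubectl wait and kubectl logs subcommands.

```shell
#!/bin/sh
# Hypothetical helper; the name and default timeout are illustrative.

# Wait for a Job to complete, then print the log of its pod.
#   $1  job name
#   $2  timeout passed to kubectl wait (default: 300s)
show_job_log() {
    job="$1"
    timeout="${2:-300s}"
    kubectl wait --for=condition=complete "job/$job" --timeout="$timeout" &&
        kubectl logs "job/$job"
}

# Usage with the job created above:
#   show_job_log habanalabs-gaudi-demo2
```

If the job succeeds, the log should contain the hl-smi output produced inside the container, since hl-smi is the command the example job runs.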