VMware Tanzu User Guide

VMware Tanzu provides an efficient, manageable way to orchestrate deep learning workloads at scale. This document describes how to set up a generic VMware Tanzu-based solution on an on-premises platform.

Prerequisites

The Intel® Gaudi® software packages required to install and load the firmware (FW) and driver. For more details, refer to the Installation Guide.
TensorFlow on Intel Gaudi software version 1.14.0 or earlier.
The Intel Gaudi device plugin for Kubernetes. For more details, refer to the Kubernetes User Guide.
A Kubernetes version listed in the Support Matrix.
A VMware Tanzu cluster that is up and running. For more details, refer to the documentation here and this tutorial from a third party.

Validating Intel Gaudi Driver

To verify that the Intel Gaudi driver is loaded, run the following command:

lsmod | grep habanalabs

Expected result:

habanalabs           2146304  8
habanalabs_ib          73728  10
habanalabs_cn         712704  8
habanalabs_en          69632  8
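
The module check above can also be scripted, for example as part of a node health check. The following is a minimal sketch, not an official tool: `check_gaudi_modules` is a hypothetical helper name, and the list of module names is an assumption taken from the expected result shown above, so it may differ on other Gaudi generations or software releases.

```shell
# Hypothetical helper: reads `lsmod` output on stdin and reports any
# expected Gaudi kernel module that is not listed.
# Assumption: the module names below come from the expected result above.
check_gaudi_modules() {
    # Keep only the first column (the module name) of each line.
    loaded=$(awk '{print $1}')
    missing=""
    for mod in habanalabs habanalabs_ib habanalabs_cn habanalabs_en; do
        # -x matches the whole line, so "habanalabs" does not match "habanalabs_ib".
        if ! printf '%s\n' "$loaded" | grep -qx "$mod"; then
            missing="$missing $mod"
        fi
    done
    if [ -n "$missing" ]; then
        echo "missing modules:$missing"
        return 1
    fi
    echo "all Gaudi modules loaded"
}

# Usage on a node:
#   lsmod | check_gaudi_modules
```

The helper returns a non-zero status when a module is missing, so it can gate later steps in a script.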

To verify that the Intel Gaudi driver is loaded properly on the node, use one of the following options:

Option 1

Command:

hl-smi -d PRODUCT CLOCK -L

Output:

================ HL-SMI LOG ================

Timestamp                               : Sat Mar  2 17:50:36 IST 2024
Driver Version                          : 1.14.0-9e8ecf8
HL-SMI Version                          : hl-1.14.0-fw-48.0.1.0 (Jan 18 2024 - 20:20:49)

Attached AIPs                           : 2

[0] AIP (accel0) 0000:06:00.0
      Product Name                    : HL-225
      Model Number                    : F08GL0AIG029A
      Serial Number                   : AM30032551
      Module status                   : Operational
      Module ID                       : 3
      PCB Assembly Version            : V1A
      PCB Version                     : R0E
      HL Revision                     : 1
      AIP UUID                        : 01P0-HL2080A0-15-TF8A78-03-05-07
      AIP Status                      : Engineering Sample
      Firmware [FIT] Version          : Linux gaudi2 5.10.18-hl-gaudi2-1.14.0-fw-48.0.1-sec-7 #1 SMP PREEMPT Sun Jan 7 20:12:35 IST 2024 aarch64 GNU/Linux
      Firmware [SPI] Version          : Preboot version hl-gaudi2-1.14.0-fw-48.0.1-sec-7 (Jan 07 2024 - 20:03:16)
      Firmware [UBOOT] Version        : U-Boot 2021.04-hl-gaudi2-1.14.0-fw-48.0.1-sec-7 (Jan 07 2024 - 20:11:55 +0200) build#: 10967
      Firmware [OS] Version           : Zephyr 2.7.2-hl-gaudi2-1.14.0-fw-48.0.1-sec-7 (Jan  7 2024 - 20:03:29)
      CPLD Version                    : 0x00000010

Option 2

Command:

grep -E "$" /sys/class/habanalabs/hl?/*ver | cut -d / -f5-

Output:

hl0/armcp_kernel_ver:Linux gaudi2 5.10.18-hl-gaudi2-1.14.0-fw-48.0.1-sec-7 #1 SMP PREEMPT Sun Jan 7 20:12:35 IST 2024 aarch64 GNU/Linux
hl0/armcp_ver:arcmgmt version hl-gaudi2-1.14.0-fw-48.0.1-sec-7 (Jan  7 2024 - 20:03:29)
hl0/cpld_ver:0x00000010
hl0/cpucp_kernel_ver:Linux gaudi2 5.10.18-hl-gaudi2-1.14.0-fw-48.0.1-sec-7 #1 SMP PREEMPT Sun Jan 7 20:12:35 IST 2024 aarch64 GNU/Linux
hl0/cpucp_ver:arcmgmt version hl-gaudi2-1.14.0-fw-48.0.1-sec-7 (Jan  7 2024 - 20:03:29)
hl0/driver_ver:1.14.0-9e8ecf8
hl0/fuse_ver:01P0-HL2080A0-15-TF8A78-03-05-07
hl0/fw_os_ver:Zephyr 2.7.2-hl-gaudi2-1.14.0-fw-48.0.1-sec-7 (Jan  7 2024 - 20:03:29)
hl0/preboot_btl_ver:Preboot version hl-gaudi2-1.14.0-fw-48.0.1-sec-7 (Jan 07 2024 - 20:03:16)
hl0/uboot_ver:U-Boot 2021.04-hl-gaudi2-1.14.0-fw-48.0.1-sec-7 (Jan 07 2024 - 20:11:55 +0200) build#: 10967
hl0/vrm_ver:0x04 0x04:0x00:0x00
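
When several devices are present, it can help to confirm that each one reports the expected driver release. The sketch below filters the Option 2 output for `driver_ver` entries and compares them against a release string passed as an argument. `check_driver_release` is a hypothetical helper name, and the assumption that `driver_ver` has the form `<release>-<hash>` is based only on the sample output above.

```shell
# Hypothetical helper: reads the Option 2 output on stdin and checks that
# every device's driver_ver starts with the release given as $1.
# Assumption: driver_ver looks like "<release>-<hash>", as in the sample above.
check_driver_release() {
    awk -F: -v want="$1" '
        $1 ~ /\/driver_ver$/ {
            seen++
            split($1, path, "/")          # path[1] is the device, e.g. hl0
            if (index($2, want "-") == 1 || $2 == want) {
                print path[1] ": ok (" $2 ")"
            } else {
                print path[1] ": mismatch (" $2 ")"
                bad++
            }
        }
        END {
            if (seen == 0) { print "no driver_ver entries found"; exit 1 }
            exit (bad > 0)
        }'
}

# Usage on a node:
#   grep -E "$" /sys/class/habanalabs/hl?/*ver | cut -d / -f5- | check_driver_release 1.14.0
```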

Deploying Device Plugin

Run the device plugin on all the Gaudi nodes by deploying the following DaemonSet using the kubectl create command. Use the associated .yaml file to set up the environment:
```
$ kubectl create -f https://vault.habana.ai/artifactory/docker-k8s-device-plugin/habana-k8s-device-plugin.yaml
```
Note

kubectl requires access to a Kubernetes cluster to run its commands. To check that kubectl can reach the cluster, run $ kubectl get pod -A.

Check the device plugin deployment status by running the following command:

$ kubectl get pods -n habana-system

Expected result:

NAME                                       READY   STATUS    RESTARTS   AGE
habanalabs-device-plugin-daemonset-qtpnh   1/1     Running   0          2d11h
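
On a cluster with many Gaudi nodes, reading the pod table by eye becomes tedious. The following sketch summarizes the output of the status command above. `check_plugin_pods` is a hypothetical helper name, and the column layout (NAME, READY, STATUS, with a header row) is assumed to match the expected result shown above.

```shell
# Hypothetical helper: reads `kubectl get pods -n habana-system` output on
# stdin and fails if any pod is not fully ready and Running.
# Assumption: the first line is the header row (do not pass --no-headers).
check_plugin_pods() {
    awk '
        NR == 1 { next }                  # skip the header row
        {
            total++
            split($2, ready, "/")         # READY column, e.g. 1/1
            if ($3 != "Running" || ready[1] != ready[2]) {
                print "not ready: " $1 " (" $2 ", " $3 ")"
                bad++
            }
        }
        END {
            if (total == 0) { print "no device plugin pods found"; exit 1 }
            if (bad > 0) exit 1
            print total " device plugin pod(s) running"
        }'
}

# Usage:
#   kubectl get pods -n habana-system | check_plugin_pods
```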

Running Gaudi Jobs Example

You can create a Kubernetes pod that acquires a Gaudi device by using the resources.limits field. The following example uses the Intel Gaudi TensorFlow container image.

Run the job:

$ cat <<EOF | kubectl apply -f -
apiVersion: batch/v1
kind: Job
metadata:
   name: habanalabs-gaudi-demo2
spec:
   template:
      spec:
         hostIPC: true
         restartPolicy: OnFailure
         containers:
         - name: habana-ai-base-container2
           image: vault.habana.ai/gaudi-docker/1.14.0/ubuntu22.04/habanalabs/tensorflow-installer-tf-cpu-2.15.0:latest
           workingDir: /root
           command: ["hl-smi"]
           securityContext:
              capabilities:
                 add: ["SYS_NICE"]
           resources:
              limits:
                 habana.ai/gaudi: 1
EOF

Check the pod status:
```
$ kubectl get pods
```
Retrieve the pod name from the output, then view the job results:
```
kubectl logs <pod-name>
```
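
The two manual steps above can be combined so that the logs are fetched only once the Job has finished. This is a minimal sketch under stated assumptions: `wait_and_show_job_logs` is a hypothetical helper name, the 300s default timeout is an arbitrary choice, and the Job name in the usage line is the one from the example manifest. It relies on `kubectl wait` and on addressing the Job as `job/<name>`, which avoids looking up the pod name by hand.

```shell
# Hypothetical helper: waits for a Job to complete, then prints its logs.
# $1 is the Job name; $2 is an optional timeout (default 300s, an arbitrary choice).
wait_and_show_job_logs() {
    job="$1"
    timeout="${2:-300s}"
    if kubectl wait --for=condition=complete "job/$job" --timeout="$timeout"; then
        # kubectl resolves job/<name> to a pod created by that Job.
        kubectl logs "job/$job"
    else
        echo "job $job did not complete within $timeout" >&2
        # Show the pods created by the Job to help with troubleshooting.
        kubectl get pods -l "job-name=$job"
        return 1
    fi
}

# Usage:
#   wait_and_show_job_logs habanalabs-gaudi-demo2
```

If the Job succeeds, the output should be the `hl-smi` report produced inside the container, since `hl-smi` is the container command in the example manifest.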

Gaudi Documentation 1.21.1 documentation
