VMware Tanzu Guide
VMware Tanzu provides an efficient and manageable way to orchestrate deep learning workloads at scale. This document describes the steps required to set up a generic VMware Tanzu-based solution in an on-premises environment.
Prerequisites¶
Intel® Gaudi® software packages, which are required to install and load the firmware (FW) and driver. For more details, refer to the Installation Guide.
Device Plugin for Kubernetes from the Intel Gaudi vault. For more details, refer to the Kubernetes User Guide.
A VMware Tanzu cluster that is up and running. For more details, refer to the documentation here and this tutorial from a third party.
Deployment¶
Note
For the following examples, make sure to use TensorFlow on Intel Gaudi software version 1.14.0 or below.
Validating Intel Gaudi Driver¶
To verify that the Intel Gaudi driver is loaded, run the following command:
lsmod | grep habanalabs
habanalabs 2146304 8
habanalabs_ib 73728 10
habanalabs_cn 712704 8
habanalabs_en 69632 8
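The check above can also be scripted. The following is a minimal sketch, not an official tool: the function name `missing_gaudi_modules` is hypothetical, and the expected module set is an assumption taken from the sample output above, so it may differ between software releases.

```shell
# Report which expected Gaudi kernel modules are missing.
# Reads `lsmod`-style text on stdin and prints each missing module name.
# Returns 0 when every expected module is present, 1 otherwise.
missing_gaudi_modules() (
    # Assumption: the expected set is the four modules in the sample output.
    expected="habanalabs habanalabs_ib habanalabs_cn habanalabs_en"
    loaded=$(awk '{print $1}')   # first column holds the module name
    status=0
    for mod in $expected; do
        if ! printf '%s\n' "$loaded" | grep -qx "$mod"; then
            printf '%s\n' "$mod"
            status=1
        fi
    done
    exit "$status"
)

# Usage on a Gaudi node:
#   lsmod | missing_gaudi_modules && echo "all Gaudi modules loaded"
```

The function only inspects module names, so it confirms that the modules are present but says nothing about whether the devices themselves are operational.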
After loading the Intel Gaudi driver, check that it was loaded properly on the node. The following are two options for validating the driver:
hl-smi -d PRODUCT CLOCK -L
================ HL-SMI LOG ================
Timestamp : Sat Mar 2 17:50:36 IST 2024
Driver Version : 1.14.0-9e8ecf8
HL-SMI Version : hl-1.14.0-fw-48.0.1.0 (Jan 18 2024 - 20:20:49)
Attached AIPs : 2
[0] AIP (accel0) 0000:06:00.0
Product Name : HL-225
Model Number : F08GL0AIG029A
Serial Number : AM30032551
Module status : Operational
Module ID : 3
PCB Assembly Version : V1A
PCB Version : R0E
HL Revision : 1
AIP UUID : 01P0-HL2080A0-15-TF8A78-03-05-07
AIP Status : Engineering Sample
Firmware [FIT] Version : Linux gaudi2 5.10.18-hl-gaudi2-1.14.0-fw-48.0.1-sec-7 #1 SMP PREEMPT Sun Jan 7 20:12:35 IST 2024 aarch64 GNU/Linux
Firmware [SPI] Version : Preboot version hl-gaudi2-1.14.0-fw-48.0.1-sec-7 (Jan 07 2024 - 20:03:16)
Firmware [UBOOT] Version : U-Boot 2021.04-hl-gaudi2-1.14.0-fw-48.0.1-sec-7 (Jan 07 2024 - 20:11:55 +0200) build#: 10967
Firmware [OS] Version : Zephyr 2.7.2-hl-gaudi2-1.14.0-fw-48.0.1-sec-7 (Jan 7 2024 - 20:03:29)
CPLD Version : 0x00000010
grep -E "$" /sys/class/habanalabs/hl?/*ver | cut -d / -f5-
hl0/armcp_kernel_ver:Linux gaudi2 5.10.18-hl-gaudi2-1.14.0-fw-48.0.1-sec-7 #1 SMP PREEMPT Sun Jan 7 20:12:35 IST 2024 aarch64 GNU/Linux
hl0/armcp_ver:arcmgmt version hl-gaudi2-1.14.0-fw-48.0.1-sec-7 (Jan 7 2024 - 20:03:29)
hl0/cpld_ver:0x00000010
hl0/cpucp_kernel_ver:Linux gaudi2 5.10.18-hl-gaudi2-1.14.0-fw-48.0.1-sec-7 #1 SMP PREEMPT Sun Jan 7 20:12:35 IST 2024 aarch64 GNU/Linux
hl0/cpucp_ver:arcmgmt version hl-gaudi2-1.14.0-fw-48.0.1-sec-7 (Jan 7 2024 - 20:03:29)
hl0/driver_ver:1.14.0-9e8ecf8
hl0/fuse_ver:01P0-HL2080A0-15-TF8A78-03-05-07
hl0/fw_os_ver:Zephyr 2.7.2-hl-gaudi2-1.14.0-fw-48.0.1-sec-7 (Jan 7 2024 - 20:03:29)
hl0/preboot_btl_ver:Preboot version hl-gaudi2-1.14.0-fw-48.0.1-sec-7 (Jan 07 2024 - 20:03:16)
hl0/uboot_ver:U-Boot 2021.04-hl-gaudi2-1.14.0-fw-48.0.1-sec-7 (Jan 07 2024 - 20:11:55 +0200) build#: 10967
hl0/vrm_ver:0x04 0x04:0x00:0x00
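Because the note at the start of the Deployment section limits these examples to software version 1.14.0 or below, it can help to compare the reported driver version against that limit. The following is a minimal sketch under stated assumptions: the function name `driver_release_at_most` is hypothetical, and it assumes the version string has the form `<release>-<build hash>` as in the sample output.

```shell
# Succeeds when the release part of a driver version string
# (for example "1.14.0-9e8ecf8") is less than or equal to a maximum release.
# $1: driver version string; $2: maximum release, for example "1.14.0".
driver_release_at_most() (
    release=${1%%-*}   # drop the "-<build hash>" suffix (assumed format)
    max=$2
    awk -v a="$release" -v b="$max" 'BEGIN {
        n = split(a, x, "."); m = split(b, y, ".")
        len = (n > m) ? n : m
        # Compare component by component, numerically; missing parts count as 0.
        for (i = 1; i <= len; i++) {
            xi = x[i] + 0; yi = y[i] + 0
            if (xi < yi) exit 0
            if (xi > yi) exit 1
        }
        exit 0
    }'
)

# Usage on a Gaudi node (sysfs path as listed by the command above):
#   driver_release_at_most "$(cat /sys/class/habanalabs/hl0/driver_ver)" 1.14.0
```

The comparison is numeric per component, so 1.9.0 is treated as older than 1.14.0, which a plain string comparison would get wrong.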
Device Plugin¶
The Gaudi device resource must be enabled to support VMware Tanzu. Run the device plugin on every node that is equipped with Gaudi by deploying the following DaemonSet with the kubectl create command.
Note
kubectl requires access to a Kubernetes cluster to run these commands. To check that kubectl can reach the cluster, run $ kubectl get pod -A.
To deploy the device plugin, create it from the associated .yaml file:
$ kubectl create -f https://vault.habana.ai/artifactory/docker-k8s-device-plugin/habana-k8s-device-plugin.yaml
To check the device plugin deployment status, run the following command:
$ kubectl get pods -n habana-system
NAME READY STATUS RESTARTS AGE
habanalabs-device-plugin-daemonset-qtpnh 1/1 Running 0 2d11h
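The status table above can be checked automatically, for example in a provisioning script. The following is a minimal sketch: the function name `all_pods_ready` is hypothetical, and it assumes the default `kubectl get pods` column layout shown above (NAME, READY, STATUS, ...).

```shell
# Reads `kubectl get pods` table output on stdin.
# Succeeds only when at least one pod is listed and every pod is Running
# with all of its containers ready; prints each pod that fails the check.
all_pods_ready() {
    awk 'NR > 1 {
        seen = 1
        split($2, r, "/")   # READY column, for example "1/1"
        if ($3 != "Running" || r[1] != r[2]) {
            bad = 1
            print $1 ": " $3 " (" $2 ")"
        }
    }
    END { if (bad || !seen) exit 1 }'
}

# Usage:
#   kubectl get pods -n habana-system | all_pods_ready
```

An empty listing is treated as a failure, since it usually means the DaemonSet has not scheduled a pod on any node yet.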
Running Gaudi Jobs Example¶
You can create a Kubernetes Pod that acquires a Gaudi device by using the resources.limits field. The following example defines a Job whose Pod uses Intel Gaudi's TensorFlow container image and requests one Gaudi device:
$ cat <<EOF | kubectl apply -f -
apiVersion: batch/v1
kind: Job
metadata:
  name: habanalabs-gaudi-demo2
spec:
  template:
    spec:
      hostIPC: true
      restartPolicy: OnFailure
      containers:
        - name: habana-ai-base-container2
          image: vault.habana.ai/gaudi-docker/1.14.0/ubuntu22.04/habanalabs/tensorflow-installer-tf-cpu-2.15.0:latest
          workingDir: /root
          command: ["hl-smi"]
          securityContext:
            capabilities:
              add: ["SYS_NICE"]
          resources:
            limits:
              habana.ai/gaudi: 1
EOF
To check the pod status, run the following command:
$ kubectl get pods
Find your pod, then use kubectl logs to check its log.
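Waiting for the Job and reading its log can be combined into one step. The following is a minimal sketch: the function name `show_job_logs` and the default timeout of 300s are assumptions for illustration, while `kubectl wait` and `kubectl logs` are standard kubectl subcommands.

```shell
# Wait for a Job to complete, then print the log of its pod.
# $1: Job name; $2: timeout for `kubectl wait` (the 300s default is an assumption).
show_job_logs() (
    job=$1
    timeout=${2:-300s}
    # Only fetch the log if the Job reached the "complete" condition in time.
    kubectl wait --for=condition=complete "job/$job" --timeout="$timeout" &&
        kubectl logs "job/$job"
)

# Usage with the Job defined above:
#   show_job_logs habanalabs-gaudi-demo2
```

For the example Job, the log should contain the hl-smi output from inside the container, which shows whether the Gaudi device was made available to the Pod.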