VMware Tanzu User Guide¶
VMware Tanzu provides an efficient and manageable way to orchestrate deep learning workloads at scale. This document describes how to set up a generic VMware Tanzu-based solution on an on-premises platform.
Prerequisites¶
The Intel® Gaudi® software packages required to install and load the firmware (FW) and driver. For more details, refer to the Installation Guide.
TensorFlow on the Intel Gaudi software version 1.14.0 or below.
The Intel Gaudi device plugin for Kubernetes. For more details, refer to the Kubernetes User Guide.
A Kubernetes version listed in the Support Matrix.
A VMware Tanzu cluster that is up and running. For more details, refer to the documentation here and this tutorial from a third party.
Validating Intel Gaudi Driver¶
To verify that the Intel Gaudi driver is loaded, run the following command:
lsmod | grep habanalabs
Expected result:
habanalabs 2146304 8
habanalabs_ib 73728 10
habanalabs_cn 712704 8
habanalabs_en 69632 8
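The module check above can be scripted so that a missing module is reported by name. The following is a minimal sketch, assuming that the four module names shown in the expected result are the ones required on your node (the exact set may differ between software versions). The function name `check_habana_modules` is hypothetical.

```shell
# Sketch: confirm that each expected Gaudi kernel module appears in lsmod output.
# The module list mirrors the expected result above and is an assumption;
# adjust it to match your software version.

check_habana_modules() {
    # Reads lsmod-style text on stdin and prints each expected module that is absent.
    loaded=$(awk '{ print $1 }')
    missing=0
    for mod in habanalabs habanalabs_cn habanalabs_en habanalabs_ib; do
        if ! printf '%s\n' "$loaded" | grep -qxF "$mod"; then
            echo "missing: $mod"
            missing=1
        fi
    done
    return "$missing"
}

# Usage on a Gaudi node:
#   lsmod | check_habana_modules && echo "all Gaudi modules loaded"
```

The function matches whole module names only, so `habanalabs_ib` being present does not count as `habanalabs` being present.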
To verify that the Intel Gaudi driver is loaded properly on the node, use one of the following options:
Option 1
Command:
hl-smi -d PRODUCT CLOCK -L
Output:
================ HL-SMI LOG ================
Timestamp : Sat Mar 2 17:50:36 IST 2024
Driver Version : 1.14.0-9e8ecf8
HL-SMI Version : hl-1.14.0-fw-48.0.1.0 (Jan 18 2024 - 20:20:49)
Attached AIPs : 2
[0] AIP (accel0) 0000:06:00.0
    Product Name : HL-225
    Model Number : F08GL0AIG029A
    Serial Number : AM30032551
    Module status : Operational
    Module ID : 3
    PCB Assembly Version : V1A
    PCB Version : R0E
    HL Revision : 1
    AIP UUID : 01P0-HL2080A0-15-TF8A78-03-05-07
    AIP Status : Engineering Sample
    Firmware [FIT] Version : Linux gaudi2 5.10.18-hl-gaudi2-1.14.0-fw-48.0.1-sec-7 #1 SMP PREEMPT Sun Jan 7 20:12:35 IST 2024 aarch64 GNU/Linux
    Firmware [SPI] Version : Preboot version hl-gaudi2-1.14.0-fw-48.0.1-sec-7 (Jan 07 2024 - 20:03:16)
    Firmware [UBOOT] Version : U-Boot 2021.04-hl-gaudi2-1.14.0-fw-48.0.1-sec-7 (Jan 07 2024 - 20:11:55 +0200) build#: 10967
    Firmware [OS] Version : Zephyr 2.7.2-hl-gaudi2-1.14.0-fw-48.0.1-sec-7 (Jan 7 2024 - 20:03:29)
    CPLD Version : 0x00000010
Option 2
Command:
grep -E "$" /sys/class/habanalabs/hl?/*ver | cut -d / -f5-
Output:
hl0/armcp_kernel_ver:Linux gaudi2 5.10.18-hl-gaudi2-1.14.0-fw-48.0.1-sec-7 #1 SMP PREEMPT Sun Jan 7 20:12:35 IST 2024 aarch64 GNU/Linux
hl0/armcp_ver:arcmgmt version hl-gaudi2-1.14.0-fw-48.0.1-sec-7 (Jan 7 2024 - 20:03:29)
hl0/cpld_ver:0x00000010
hl0/cpucp_kernel_ver:Linux gaudi2 5.10.18-hl-gaudi2-1.14.0-fw-48.0.1-sec-7 #1 SMP PREEMPT Sun Jan 7 20:12:35 IST 2024 aarch64 GNU/Linux
hl0/cpucp_ver:arcmgmt version hl-gaudi2-1.14.0-fw-48.0.1-sec-7 (Jan 7 2024 - 20:03:29)
hl0/driver_ver:1.14.0-9e8ecf8
hl0/fuse_ver:01P0-HL2080A0-15-TF8A78-03-05-07
hl0/fw_os_ver:Zephyr 2.7.2-hl-gaudi2-1.14.0-fw-48.0.1-sec-7 (Jan 7 2024 - 20:03:29)
hl0/preboot_btl_ver:Preboot version hl-gaudi2-1.14.0-fw-48.0.1-sec-7 (Jan 07 2024 - 20:03:16)
hl0/uboot_ver:U-Boot 2021.04-hl-gaudi2-1.14.0-fw-48.0.1-sec-7 (Jan 07 2024 - 20:11:55 +0200) build#: 10967
hl0/vrm_ver:0x04 0x04:0x00:0x00
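Either option reports the driver version, which should line up with the software release you installed. As a rough sketch, the release prefix can be extracted from the `hl-smi` output and compared with an expected value. The helper names `driver_release_from_hlsmi` and `release_matches` are hypothetical, and the script assumes the `Driver Version : <release>-<hash>` layout shown in Option 1.

```shell
# Sketch: pull the driver release out of hl-smi output and compare it
# with the release you expect (1.14.0 in this guide).
# Assumes the "Driver Version : <release>-<hash>" line format shown above.

driver_release_from_hlsmi() {
    # Reads hl-smi output on stdin; prints e.g. "1.14.0" for "1.14.0-9e8ecf8".
    awk -F': *' '/Driver Version/ { split($2, parts, "-"); print parts[1]; exit }'
}

release_matches() {
    # $1 = expected release, $2 = detected release
    [ -n "$2" ] && [ "$1" = "$2" ]
}

# Usage on a Gaudi node:
#   detected=$(hl-smi -d PRODUCT CLOCK -L | driver_release_from_hlsmi)
#   release_matches "1.14.0" "$detected" || echo "unexpected driver release: $detected"
```

The same comparison could be made against the `driver_ver` entry listed in Option 2, which carries the identical version string.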
Deploying Device Plugin¶
Run the device plugin on all the Gaudi nodes by deploying the following DaemonSet using the kubectl create command. Use the associated .yaml file to set up the environment:
$ kubectl create -f https://vault.habana.ai/artifactory/docker-k8s-device-plugin/habana-k8s-device-plugin.yaml
Note
kubectl requires access to a Kubernetes cluster to run its commands. To check that kubectl can reach the cluster, run $ kubectl get pod -A.
Check the device plugin deployment status by running the following command:
$ kubectl get pods -n habana-system
Expected result:
NAME                                       READY   STATUS    RESTARTS   AGE
habanalabs-device-plugin-daemonset-qtpnh   1/1     Running   0          2d11h
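On a cluster with several Gaudi nodes, the DaemonSet creates one pod per node, so it can help to check every row rather than reading the table by eye. The sketch below is one way to do that, assuming the default five-column layout shown in the expected result. The function name `all_pods_ready` is hypothetical.

```shell
# Sketch: decide whether every pod listed by `kubectl get pods` is Running
# with all of its containers ready.
# Assumes the default column order: NAME READY STATUS RESTARTS AGE.

all_pods_ready() {
    # Reads `kubectl get pods --no-headers` output on stdin.
    # Prints each pod that is not ready; fails if any is not ready or none are listed.
    awk '
        { seen = 1; split($2, ready, "/") }
        $3 != "Running" || ready[1] != ready[2] { print "not ready: " $1; bad = 1 }
        END { if (bad || !seen) exit 1 }
    '
}

# Usage:
#   kubectl get pods -n habana-system --no-headers | all_pods_ready \
#       && echo "device plugin is running on all listed nodes"
```

Once the plugin pods are ready, `kubectl describe nodes | grep habana.ai/gaudi` is a quick way to see whether the nodes advertise the Gaudi resource; the exact output depends on your cluster.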
Running Gaudi Jobs Example¶
You can create a Kubernetes pod that acquires a Gaudi device by requesting it in the resources.limits field of the container specification.
The following example uses the Intel Gaudi TensorFlow container image.
Run the job:
$ cat <<EOF | kubectl apply -f -
apiVersion: batch/v1
kind: Job
metadata:
  name: habanalabs-gaudi-demo2
spec:
  template:
    spec:
      hostIPC: true
      restartPolicy: OnFailure
      containers:
        - name: habana-ai-base-container2
          image: vault.habana.ai/gaudi-docker/1.14.0/ubuntu22.04/habanalabs/tensorflow-installer-tf-cpu-2.15.0:latest
          workingDir: /root
          command: ["hl-smi"]
          securityContext:
            capabilities:
              add: ["SYS_NICE"]
          resources:
            limits:
              habana.ai/gaudi: 1
EOF
Check the pod status:
$ kubectl get pods
Retrieve the name of the pod from the output of the previous command and view its logs:
$ kubectl logs <pod-name>
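The last two steps can also be combined so that you do not need to copy the pod name by hand. The following is a minimal sketch, assuming kubectl is configured for the cluster and the Job runs in the current namespace. The function name `job_logs_when_done` and the 300-second default timeout are illustrative choices, not part of the original procedure.

```shell
# Sketch: wait for a Job to complete, then print its logs.
# Assumes kubectl is configured for the cluster and the Job is in the
# current namespace; the 300s default timeout is an arbitrary choice.

job_logs_when_done() {
    # $1 = Job name, $2 = timeout (optional, default 300s)
    job_name=$1
    wait_timeout=${2:-300s}
    kubectl wait --for=condition=complete "job/$job_name" --timeout="$wait_timeout" || return 1
    kubectl logs "job/$job_name"
}

# Usage:
#   job_logs_when_done habanalabs-gaudi-demo2
```

If you prefer to look up the pod yourself, `kubectl get pods -l job-name=habanalabs-gaudi-demo2` lists only the pods created by that Job.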