VMware Tanzu Guide

VMware Tanzu provides an efficient and manageable way to orchestrate deep learning workloads at scale. This document describes the steps required to set up a generic VMware Tanzu based solution for an on-premises deployment.

Prerequisites

  • SynapseAI software packages, which are required to install and load the firmware (FW) and the habanalabs driver. For more details, refer to the Installation Guide.

  • The Device Plugin for Kubernetes from the Habana Vault. For more details, refer to the Kubernetes User Guide.

  • A VMware Tanzu cluster that is up and running. For more details, refer to the VMware Tanzu documentation.

Deployment

Validating Habana Driver

  • To verify that the Habana driver is loaded, run the following command:

$ lsmod | grep habanalabs
habanalabs 966656 0
habanalabs_en 32768 1 habanalabs
  • After the driver is loaded, confirm that it loaded properly on the node. Either of the following two commands can be used to validate the Habana driver:

$ /opt/habana/habanalabs/gaudi/hl-smi -d PRODUCT CLOCK -L
================ HL-SMI LOG ================

Timestamp                               : Tue Jan 25 17:57:58 UTC 2022

Driver Version                          : 1.3.0-124dd38
HL-SMI Version                          : hl-1.3.0-fw-32.5.0.0 (Dec 15 2021 - 05:14:10)

Attached AIPs                           : 1

[0] AIP (hl0) 0000:03:00.0
        Product Name                    : HL-200
        Model Number                    : F08GL0AI2000A
        Serial Number                   : AK30011368
        Module ID                       : N/A
        PCB Assembly Version            : V0A
        PCB Version                     : R0E
        HL Revision                     : 2
        AIP UUID                        : 00P3-HL2000B0-14-P63M75-01-01-09
        Firmware [FIT] Version          : Linux gaudi 5.10.18-hl-gaudi-1.3.0-fw-32.5.0-sec-4 #1 SMP PREEMPT Mon Nov 8 09:54:45 IST 2021 aarch64 GNU/Linux
        Firmware [SPI] Version          : BTL version 2f4e4ab7,Preboot version hl-gaudi-1.3.0-fw-32.5.0-sec-4 (Nov 08 2021 - 09:57:21)
        Firmware [UBOOT] Version        : U-Boot 2021.04-hl-gaudi-1.3.0-fw-32.5.0-sec-4 (Nov 08 2021 - 09:54:16 +0200) build#: 1564

$ grep -E "$" /sys/class/habanalabs/hl?/*ver | cut -d / -f5-
hl0/armcp_kernel_ver:Linux gaudi 5.10.18-hl-gaudi-1.3.0-fw-32.5.0-sec-4 #1 SMP PREEMPT Mon Nov 8 09:54:45 IST 2021 aarch64 GNU/Linux
hl0/armcp_ver:armcpd version hl-gaudi-1.3.0-fw-32.5.0-sec-4 (Nov 08 2021 - 09:56:46)
hl0/cpld_ver:0x0000000f
hl0/cpucp_kernel_ver:Linux gaudi 5.10.18-hl-gaudi-1.3.0-fw-32.5.0-sec-4 #1 SMP PREEMPT Mon Nov 8 09:54:45 IST 2021 aarch64 GNU/Linux
hl0/cpucp_ver:armcpd version hl-gaudi-1.3.0-fw-32.5.0-sec-4 (Nov 08 2021 - 09:56:46)
hl0/driver_ver:1.3.0-124dd38
hl0/fuse_ver:00P3-HL2000B0-14-P63M75-01-01-09
hl0/infineon_ver:0x0002
hl0/preboot_btl_ver:BTL version 2f4e4ab7
hl0/preboot_btl_ver:Preboot version hl-gaudi-1.3.0-fw-32.5.0-sec-4 (Nov 08 2021 - 09:57:21)
hl0/thermal_ver:thermald version hl-gaudi-1.3.0-fw-32.5.0-sec-4 (Nov 08 2021 - 09:57:31)
hl0/uboot_ver:U-Boot 2021.04-hl-gaudi-1.3.0-fw-32.5.0-sec-4 (Nov 08 2021 - 09:54:16 +0200) build#: 1564
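The two manual checks above could be scripted for use on many nodes. The following is a minimal sketch, not a Habana-provided tool: the function names habana_module_loaded and habana_driver_versions are hypothetical, and the optional argument for the sysfs root is an assumption added so the helper can be pointed at a different directory. It relies only on the lsmod output format and the /sys/class/habanalabs/hl?/driver_ver files shown above.

```shell
# Sketch only: helper names are illustrative, not part of any Habana package.

# Succeeds if lsmod-style text on stdin lists the habanalabs module itself
# (an exact match, so habanalabs_en alone does not count).
habana_module_loaded() {
    awk '$1 == "habanalabs" { found = 1 } END { exit (found ? 0 : 1) }'
}

# Prints "<device> <driver version>" for each device found under the sysfs
# root (assumed default: /sys/class/habanalabs). Fails if no device is found.
habana_driver_versions() (
    root="${1:-/sys/class/habanalabs}"
    count=0
    for f in "$root"/hl*/driver_ver; do
        [ -r "$f" ] || continue          # skips the unexpanded glob as well
        dev=$(basename "$(dirname "$f")")
        printf '%s %s\n' "$dev" "$(cat "$f")"
        count=$((count + 1))
    done
    [ "$count" -gt 0 ]
)
```

On a node, a combined check might look like `lsmod | habana_module_loaded && habana_driver_versions`, which exits non-zero if the module is missing or no device reports a driver version.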

Device Plugin

The Habana® Gaudi® device resource must be enabled to support VMware Tanzu. Run the device plugin on every node equipped with a Habana device by deploying the following DaemonSet with the kubectl create command.

Note

kubectl requires access to a Kubernetes cluster to run these commands. To check that kubectl can reach the cluster, run $ kubectl get pod -A.

  • To deploy the device plugin, create it from the associated .yaml file:

$ kubectl create -f https://vault.habana.ai/artifactory/docker-k8s-device-plugin/habana-k8s-device-plugin.yaml
  • To check the device plugin deployment status, run the following command:

$ kubectl get daemonset -n habana-system
NAME DESIRED CURRENT READY UP-TO-DATE AVAILABLE NODE SELECTOR AGE
habanalabs-device-plugin-daemonset 1 1 1 1 1 <none> 155m
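The status listing above can also be checked in a script rather than by eye. The sketch below assumes the DaemonSet column layout shown above (NAME, DESIRED, CURRENT, READY, ...); the function name daemonset_ready is hypothetical. It treats the plugin as ready only when every listed DaemonSet has a non-zero DESIRED count that equals its READY count.

```shell
# Sketch only: reads `kubectl get daemonset` style output on stdin.
# Column positions are assumed from the listing above:
#   $1 = NAME, $2 = DESIRED, $3 = CURRENT, $4 = READY
daemonset_ready() {
    awk 'NR > 1 { rows++; if ($2 == 0 || $2 != $4) bad = 1 }
         END { exit ((rows > 0 && !bad) ? 0 : 1) }'
}
```

For example, `kubectl get daemonset -n habana-system | daemonset_ready && echo "device plugin ready"`. As an alternative that needs no parsing, `kubectl rollout status daemonset/habanalabs-device-plugin-daemonset -n habana-system` blocks until the rollout finishes.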

Running Gaudi Jobs Example

You can create a Kubernetes Pod that acquires a Gaudi device by using the resources.limits field. The following example uses Habana’s TensorFlow container image:

$ cat <<EOF | kubectl apply -f -

apiVersion: v1
kind: Pod
metadata:
   name: habanalabs-gaudi-demo
spec:
   containers:
   - name: habana-ai-base-container
     image: vault.habana.ai/gaudi-docker/1.3.0/ubuntu20.04/habanalabs/tensorflow-installer-tf-cpu-2.8.0:1.3.0-499
     workingDir: /root
     command: ["echo"]
     args: ["'Hello, world!'"]
     securityContext:
       capabilities:
         add: ["SYS_RAWIO"]
     resources:
       limits:
         habana.ai/gaudi: 1
EOF
  • To check the pod status, run the following command:

$ kubectl get pods
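To act on the pod status from a script, the STATUS column can be extracted from the listing. This is a minimal sketch that assumes the default `kubectl get pods` column order (NAME, READY, STATUS, RESTARTS, AGE); the function name pod_status is hypothetical.

```shell
# Sketch only: prints the STATUS column for the pod named in $1, reading
# `kubectl get pods` style output on stdin. Fails if the pod is not listed.
# Assumes the default column order: NAME READY STATUS RESTARTS AGE.
pod_status() {
    awk -v name="$1" 'NR > 1 && $1 == name { print $3; found = 1 }
                      END { exit (found ? 0 : 1) }'
}
```

For example, `kubectl get pods | pod_status habanalabs-gaudi-demo` prints the current status of the demo pod. Once the container has run, `kubectl logs habanalabs-gaudi-demo` should show the output of the echo command.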