OpenShift (OCP) User Guide

OpenShift provides an efficient and manageable way to orchestrate deep learning workloads at scale. This document describes the steps required to set up a generic OpenShift-based solution for an on-premise deployment that uses a CoreOS host configuration.

The following sections detail the steps required:

For details on OpenShift, refer to OpenShift documentation.

Prerequisites

Habana provides the following components needed for deploying a generic Kubernetes solution:

  • Docker Image needed to build and load the habanalabs driver.

  • SynapseAI software packages required to install and load FW and habanalabs driver. See Installation Guide.

  • Device Plugin for Kubernetes. See Kubernetes User Guide.

The above components, except the Docker image, are available for download from the Habana Vault.

Preparation For Running Docker Image on OCP-based Host

  1. Since CoreOS is a restricted system, an overlay mount is required as /lib/firmware and /usr are read-only partitions:

modules=/opt/habana
sudo mkdir -p "$modules" "$modules.wd"
sudo mkdir -p /opt/habana/habanalabs/gaudi/
sudo mount -o "lowerdir=/lib/firmware,upperdir=$modules,workdir=$modules.wd" -t overlay overlay /lib/firmware

Note

To make this mount persistent, it is recommended to add the following commands to the rc.local file:

sudo vi /etc/rc.local
modules=/opt/habana
sudo mkdir -p "$modules" "$modules.wd"
sudo mount -o "lowerdir=/lib/firmware,upperdir=$modules,workdir=$modules.wd" -t overlay overlay /lib/firmware
  1. Assign proper permissions for rc.local system file:

sudo chmod +x /etc/rc.d/rc.local
  1. Make sure basic operations on the OCP host are working properly. Run the following command to list the nodes of the running cluster:

oc get nodes
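
The preparation steps above can be checked with a small preflight sketch. The helper names is_overlay_mounted and not_ready_nodes are hypothetical and not part of any Habana or OpenShift tool. The sketch assumes the standard /proc/mounts layout and the default column order of the oc get nodes table (NAME, STATUS, ...).

```shell
# Preflight sketch; both helper names are hypothetical.

# Succeeds if an overlay filesystem is mounted on the given mount point.
# $1: mount point to look for
# $2: mount table to read (optional, defaults to /proc/mounts)
is_overlay_mounted() {
    awk -v target="$1" '
        $2 == target && $3 == "overlay" { found = 1 }
        END { exit !found }
    ' "${2:-/proc/mounts}"
}

# Reads `oc get nodes` table output on stdin and prints the name of every
# node whose STATUS column is not exactly "Ready". Nodes reported as
# "Ready,SchedulingDisabled" are therefore printed as well.
not_ready_nodes() {
    awk 'NR > 1 && $2 != "Ready" { print $1 }'
}
```

Possible usage on the OCP host: run is_overlay_mounted /lib/firmware and check its exit status, and pipe oc get nodes into not_ready_nodes. An empty result from the second command means every listed node reports Ready.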

Build & Run Docker Container

  1. Create a Dockerfile on a Linux host that has internet access.

vi Dockerfile
  1. Insert the below content into the newly created Dockerfile and save.

Note

The timezone in this configuration should match your location.

Note

To load the habanalabs driver properly, make sure to use centos:8.3.2011 as the base image in the Dockerfile.

FROM centos:8.3.2011

ARG KERNEL_VER=4.18.0-305.0.1.el8.x86_64
ARG KERNEL_SERVER_URL="http://168.63.67.169/centos/8-stream/BaseOS/x86_64/os/"
LABEL description="centos8.3 docker image for driver loading inside docker container"

LABEL kernel="${KERNEL_VER}"
ENV LC_CTYPE=en_US.UTF-8
ENV LANG=en_US.UTF-8
ENV KVERSION="${KERNEL_VER}"

RUN dnf install -y epel-release && \
    dnf groupinstall -y "Development Tools" && \
    dnf install -y --enablerepo=powertools \
    curl \
    wget \
    sudo \
    libarchive \
    lapack-devel \
    blas-devel \
    rsync \
    expect \
    rpmdevtools \
    ncurses-devel \
    pinentry \
    chrpath \
    boost \
    cmake \
    clang \
    yum-utils \
    redhat-lsb-core \
    mlocate \
    nfs-utils \
    boost-program-options \
    boost-devel \
    boost-filesystem \
    boost-static \
    cpp \
    patch \
    ncurses* \
    curl-devel \
    libarchive \
    perl-ExtUtils-MakeMaker \
    zlib-devel \
    libjpeg-devel \
    libxcrypt-static \
    glibc-static \
    lsof \
    vim-common \
    pciutils

RUN dnf clean all && \
rm -rf /var/cache/dnf

RUN cd /tmp/ && \
    mkdir -p /lib/firmware/habanalabs/gaudi && \
    wget "${KERNEL_SERVER_URL}/kernel-${KERNEL_VER}.rpm" && \
    wget "${KERNEL_SERVER_URL}/kernel-core-${KERNEL_VER}.rpm" && \
    wget "${KERNEL_SERVER_URL}/kernel-devel-${KERNEL_VER}.rpm" && \
    wget "${KERNEL_SERVER_URL}/kernel-headers-${KERNEL_VER}.rpm" && \
    wget "${KERNEL_SERVER_URL}/kernel-modules-${KERNEL_VER}.rpm" && \
    dnf install -y /tmp/kernel-* && \
    rm -rf /tmp/kernel-*

RUN ln -s "/lib/modules/${KERNEL_VER}" /lib/modules/4.18.0-305.0.1.el8 && \
    ln -s /usr/bin/lsof /usr/sbin/lsof

# Set the right timezone; /etc/localtime still needs to be mounted over from the host
RUN echo "Asia/Jerusalem" > /etc/timezone && \
    sed -i s/Defaults/#Defaults/g /etc/sudoers && \
    echo 'coreos ALL=(ALL) NOPASSWD:ALL' >> /etc/sudoers && \
    mkdir -p /etc/udev/rules.d/ && \
    echo 'KERNEL=="hl[0-9sv]*", MODE="0666"' >> /etc/udev/rules.d/habana.rules

CMD [ "/bin/bash" ]
  1. Build the Docker image:

docker build -f Dockerfile -t hl-core-os:centos8.3 .
  1. Once build is finished, upload the image to your local docker registry. Example:

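docker login <your docker registry URL>
docker tag hl-core-os:centos8.3 <your docker registry URL>/<path where your image is stored>/hl-core-os:centos8.3
docker push <your docker registry URL>/<path where your image is stored>/hl-core-os:centos8.3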

Note

Make sure your OCP host has access to the above Docker registry.

  1. Run docker container with the below command:

sudo podman run --name hl-rhel-coreos --entrypoint=bash --privileged=true -it <your docker registry URL>/<path where your image is stored>/hl-core-os:centos8.3

After running the above command, a shell prompt opens. You are now inside the running Docker container.
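
The image reference used in the tag, push, and run steps is built from three parts: the registry URL, the repository path, and the tag. A minimal sketch of composing it once and reusing it is shown here. The helper names image_ref and print_publish_cmds are hypothetical, the registry and path values are placeholders you supply, and the sketch assumes a non-empty repository path. The second helper only prints the commands and does not run them.

```shell
# Hypothetical helper: joins registry, repository path, and tag into one
# image reference for the hl-core-os image built above.
# $1: registry URL, $2: repository path, $3: image tag
image_ref() {
    registry=${1%/}          # drop a trailing slash, if any
    path=${2#/}              # drop a leading slash, if any
    path=${path%/}           # drop a trailing slash, if any
    printf '%s/%s/hl-core-os:%s\n' "$registry" "$path" "$3"
}

# Prints (does not execute) the tag, push, and run commands so they can be
# reviewed before being pasted into a terminal.
print_publish_cmds() {
    ref=$(image_ref "$1" "$2" "$3")
    printf 'docker tag hl-core-os:%s %s\n' "$3" "$ref"
    printf 'docker push %s\n' "$ref"
    printf 'sudo podman run --name hl-rhel-coreos --entrypoint=bash --privileged=true -it %s\n' "$ref"
}
```

For example, print_publish_cmds registry.example.com images centos8.3 prints three commands that all refer to registry.example.com/images/hl-core-os:centos8.3, where registry.example.com and images stand in for your own registry and path.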

Load habanalabs Driver Inside Running Docker Container

To prepare for loading the driver, make sure you have two terminal windows open, one for OCP host and one inside the running container. Follow the below steps:

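  1. Install the dkms utility and create a /lib/modules symlink for kernel version 4.18.0-305.28.1.el8_4.x86_64, which is the path used later to load the driver: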

cd /tmp
yum install -y dkms
cd /lib/modules
ln -s /lib/modules/4.18.0-305.0.1.el8 4.18.0-305.28.1.el8_4.x86_64
cd -
  1. Download and install packages from the vault:

wget https://vault.habana.ai/artifactory/centos/8/8.3/habanalabs-1.4.1-11.el8.noarch.rpm
wget https://vault.habana.ai/artifactory/centos/8/8.3/habanalabs-firmware-1.4.1-11.el8.x86_64.rpm
wget https://vault.habana.ai/artifactory/centos/8/8.3/habanalabs-firmware-tools-1.4.1-11.el8.x86_64.rpm
wget https://vault.habana.ai/artifactory/centos/8/8.3/habanalabs-thunk-1.4.1-11.el8.x86_64.rpm
wget https://vault.habana.ai/artifactory/centos/8/8.3/habanatools-1.4.1-11.el8.x86_64.rpm

rpm -ivh ./habanalabs-firmware-1.4.1-11.el8.x86_64.rpm
rpm -ivh ./habanalabs-thunk-1.4.1-11.el8.x86_64.rpm
rpm -ivh ./habanalabs-firmware-tools-1.4.1-11.el8.x86_64.rpm
rpm -ivh ./habanatools-1.4.1-11.el8.x86_64.rpm
rpm -ivh ./habanalabs-1.4.1-11.el8.noarch.rpm
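
The five download URLs above differ only in package name and architecture suffix, so they can be generated from a single version string. The following is a sketch only: the helper name habana_rpm_urls is hypothetical, the base URL and package names are copied from the commands above, and whether other version strings exist under the same path is an assumption to verify in the vault.

```shell
# Hypothetical helper: prints one download URL per package.
# $1: version-release string, for example 1.4.1-11
habana_rpm_urls() {
    base=https://vault.habana.ai/artifactory/centos/8/8.3
    ver=$1
    # The driver package is noarch; the remaining packages are x86_64.
    printf '%s/habanalabs-%s.el8.noarch.rpm\n' "$base" "$ver"
    for pkg in habanalabs-firmware habanalabs-firmware-tools habanalabs-thunk habanatools; do
        printf '%s/%s-%s.el8.x86_64.rpm\n' "$base" "$pkg" "$ver"
    done
}
```

One way to use it is to pipe the list into wget, for example habana_rpm_urls 1.4.1-11 | xargs -n 1 wget. The rpm -ivh installation order shown above still applies.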

Switch to the OCP host terminal and copy the FW files installed above from the running container to the overlay directory on the OCP host.

  1. Obtain the container ID of the running hl-rhel-coreos container:

sudo podman ps -ap
  1. Copy the FW files (3e297ed24b36 is an example container ID; use the ID obtained in the previous step):

sudo podman cp 3e297ed24b36:/lib/firmware/habanalabs/gaudi/gaudi-boot-fit.itb /opt/habana/habanalabs/gaudi/
sudo podman cp 3e297ed24b36:/lib/firmware/habanalabs/gaudi/gaudi-fit.itb /opt/habana/habanalabs/gaudi/
sudo podman cp 3e297ed24b36:/lib/firmware/habanalabs/gaudi/gaudi_tpc.bin /opt/habana/habanalabs/gaudi/
sudo podman cp 3e297ed24b36:/usr/bin/hl-smi /opt/habana/habanalabs/gaudi/
  1. Make sure the overlay mount set up in Preparation For Running Docker Image on OCP-based Host is working properly:

ls -lh /lib/firmware/habanalabs/gaudi/
-rwxr-xr-x. 1 root root 726K Dec 15 04:04 gaudi-boot-fit.itb
-rwxr-xr-x. 1 root root 9.9M Dec 15 04:04 gaudi-fit.itb
-rwxr-xr-x. 1 root root 1.5K Dec 15 04:02 gaudi_tpc.bin
-rwxr-xr-x. 1 root root 2.3M Dec 15 04:06 hl-smi
  1. Return to the running docker container:

podman exec -it 3e297ed24b36 /bin/bash
  1. Load kernel object:

insmod /lib/modules/4.18.0-305.28.1.el8_4.x86_64/extra/habanalabs.ko.xz
  1. Check that the driver is loaded properly inside the Docker container:

lsmod | grep habanalabs
habanalabs 966656 0
habanalabs_en 32768 1 habanalabs
  1. Return to the OCP host and check that the driver is loaded properly:

/opt/habana/habanalabs/gaudi/hl-smi -d PRODUCT CLOCK -L
================ HL-SMI LOG ================

Timestamp                               : Tue Jan 25 17:57:58 UTC 2022

Driver Version                          : 1.4.1-124dd38
HL-SMI Version                          : hl-1.4.1-fw-32.5.0.0 (Dec 15 2021 - 05:14:10)

Attached AIPs                           : 1

[0] AIP (hl0) 0000:03:00.0
        Product Name                    : HL-200
        Model Number                    : F08GL0AI2000A
        Serial Number                   : AK30011368
        Module ID                       : N/A
        PCB Assembly Version            : V0A
        PCB Version                     : R0E
        HL Revision                     : 2
        AIP UUID                        : 00P3-HL2000B0-14-P63M75-01-01-09
        Firmware [FIT] Version          : Linux gaudi 5.10.18-hl-gaudi-1.4.1-fw-32.5.0-sec-4 #1 SMP PREEMPT Mon Nov 8 09:54:45 IST 2021 aarch64 GNU/Linux
        Firmware [SPI] Version          : BTL version 2f4e4ab7,Preboot version hl-gaudi-1.4.1-fw-32.5.0-sec-4 (Nov 08 2021 - 09:57:21)
        Firmware [UBOOT] Version        : U-Boot 2021.04-hl-gaudi-1.4.1-fw-32.5.0-sec-4 (Nov 08 2021 - 09:54:16 +0200) build#: 1564

The firmware and driver versions can also be read from sysfs:

grep -E "$" /sys/class/habanalabs/hl?/*ver | cut -d / -f5-
hl0/armcp_kernel_ver:Linux gaudi 5.10.18-hl-gaudi-1.4.1-fw-32.5.0-sec-4 #1 SMP PREEMPT Mon Nov 8 09:54:45 IST 2021 aarch64 GNU/Linux
hl0/armcp_ver:armcpd version hl-gaudi-1.4.1-fw-32.5.0-sec-4 (Nov 08 2021 - 09:56:46)
hl0/cpld_ver:0x0000000f
hl0/cpucp_kernel_ver:Linux gaudi 5.10.18-hl-gaudi-1.4.1-fw-32.5.0-sec-4 #1 SMP PREEMPT Mon Nov 8 09:54:45 IST 2021 aarch64 GNU/Linux
hl0/cpucp_ver:armcpd version hl-gaudi-1.4.1-fw-32.5.0-sec-4 (Nov 08 2021 - 09:56:46)
hl0/driver_ver:1.4.1-124dd38
hl0/fuse_ver:00P3-HL2000B0-14-P63M75-01-01-09
hl0/infineon_ver:0x0002
hl0/preboot_btl_ver:BTL version 2f4e4ab7
hl0/preboot_btl_ver:Preboot version hl-gaudi-1.4.1-fw-32.5.0-sec-4 (Nov 08 2021 - 09:57:21)
hl0/thermal_ver:thermald version hl-gaudi-1.4.1-fw-32.5.0-sec-4 (Nov 08 2021 - 09:57:31)
hl0/uboot_ver:U-Boot 2021.04-hl-gaudi-1.4.1-fw-32.5.0-sec-4 (Nov 08 2021 - 09:54:16 +0200) build#: 1564
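
The manual checks in this section (firmware files present in the overlay, habanalabs module listed by lsmod) can be scripted. The sketch below uses hypothetical helper names, missing_fw_files and driver_loaded. The file list is taken from the podman cp commands above, and the module check assumes lsmod-style output with the module name in the first column.

```shell
# Hypothetical helper: prints each expected file that is missing from the
# given directory. File names match the `podman cp` commands above.
# $1: directory to inspect, for example /lib/firmware/habanalabs/gaudi
missing_fw_files() {
    for f in gaudi-boot-fit.itb gaudi-fit.itb gaudi_tpc.bin hl-smi; do
        [ -e "$1/$f" ] || printf '%s\n' "$f"
    done
}

# Hypothetical helper: reads lsmod-style output on stdin and succeeds only
# if a module named exactly "habanalabs" is listed (habanalabs_en alone
# does not count).
driver_loaded() {
    awk '$1 == "habanalabs" { found = 1 } END { exit !found }'
}
```

On the OCP host, missing_fw_files /lib/firmware/habanalabs/gaudi should print nothing if the overlay and the copy steps worked. Inside the container, lsmod | driver_loaded exits with status 0 once the module is loaded. To avoid copying the container ID by hand, sudo podman ps -a --filter name=hl-rhel-coreos --format '{{.ID}}' should print it, assuming the container was started with that name.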

Habana Device Plugin for Kubernetes

For details on the Habana Device Plugin for Kubernetes, refer to Habana Device Plugin for Kubernetes.

To deploy the device plugin, use the associated .yaml file to set up the environment:

oc create -f https://vault.habana.ai/artifactory/docker-k8s-device-plugin/habana-k8s-device-plugin.yaml

The device plugin pod runs in the habana-system namespace. Grant the privileged security context constraint (SCC) to the default service account in that namespace so that the pod can run:

oc adm policy add-scc-to-user privileged -z default -n habana-system

Expected output when listing the daemonsets in the habana-system namespace:

NAME                                       DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
habanalabs-device-plugin-daemonset-gaudi   1         1         1       1            1           <none>          155m
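
Whether the device plugin has finished rolling out can be checked by comparing the DESIRED and READY columns of the daemonset table. The helper name daemonsets_ready below is hypothetical. The sketch assumes the default column order shown in the expected output and that the table comes from a command such as oc get daemonset -n habana-system.

```shell
# Hypothetical helper: reads daemonset table output on stdin.
# Succeeds only if at least one daemonset row is present and every row has
# DESIRED (column 2) equal to READY (column 4). The header row is skipped.
daemonsets_ready() {
    awk '
        NR > 1 { rows++; if ($2 != $4) bad = 1 }
        END    { exit (rows == 0 || bad) }
    '
}
```

A possible invocation is oc get daemonset -n habana-system | daemonsets_ready && echo "device plugin ready". If the namespace contains no daemonsets, the table is empty and the helper fails, which is the intended result for a missing deployment.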

Run Demo Tests

To run demo tests, refer to the ResNet50 Keras Model Reference GitHub page.