OpenShift (OCP) User Guide¶
OpenShift provides an efficient and manageable way to orchestrate deep learning workloads at scale. This document describes the steps required to set up a generic OpenShift-based solution on premises, using a CoreOS host configuration.
The following sections detail the required steps.
For details on OpenShift, refer to the OpenShift documentation.
Prerequisites¶
Habana provides the following components needed for deploying a generic Kubernetes solution:
Docker Image needed to build and load the habanalabs driver.
SynapseAI software packages required to install and load FW and habanalabs driver. See Installation Guide.
Device Plugin for Kubernetes. See Kubernetes User Guide.
All of the above components except the Docker Image are available for download from the Habana Vault.
Preparation For Running Docker Image on OCP-based Host¶
Since CoreOS is a restricted system in which /lib/firmware and /usr are read-only partitions, an overlay mount is required:
modules=/opt/habana
sudo mkdir -p "$modules" "$modules.wd"
sudo mkdir -p /opt/habana/habanalabs/gaudi/
sudo mount -o "lowerdir=/lib/firmware,upperdir=$modules,workdir=$modules.wd" -t overlay overlay /lib/firmware
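Whether the overlay is active can be checked from the mount table. The following is a minimal sketch, assuming the mount table is readable at /proc/mounts. The function name is_overlay_on and its optional second parameter are illustrative and not part of any Habana or CoreOS tooling.

```shell
# is_overlay_on MOUNTPOINT [MOUNTS_FILE]
# Succeeds if MOUNTPOINT is listed with filesystem type "overlay".
# MOUNTS_FILE defaults to /proc/mounts; the parameter only exists so the
# function can be tried against a sample file.
is_overlay_on() (
    mountpoint=$1
    mounts_file=${2:-/proc/mounts}
    # Mount table fields: device, mount point, filesystem type, options, ...
    awk -v mp="$mountpoint" '
        $2 == mp && $3 == "overlay" { found = 1 }
        END { exit (found ? 0 : 1) }
    ' "$mounts_file"
)
```

For example, is_overlay_on /lib/firmware && echo "overlay active" prints the message only when the mount above is in place.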
Note
To make this mount persistent, it is recommended to add the following configuration to rc.local:
sudo vi /etc/rc.local
modules=/opt/habana
sudo mkdir -p "$modules" "$modules.wd"
sudo mount -o "lowerdir=/lib/firmware,upperdir=$modules,workdir=$modules.wd" -t overlay overlay /lib/firmware
Make the rc.local system file executable:
sudo chmod +x /etc/rc.d/rc.local
Make sure basic operations on the OCP host are working properly. Run the command below to list the nodes of the running cluster:
oc get nodes
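The node list can also be checked mechanically. The sketch below assumes the default column layout of oc get nodes (NAME in the first column, STATUS in the second). The helper name not_ready_nodes is hypothetical.

```shell
# not_ready_nodes: reads `oc get nodes --no-headers` output on stdin and
# prints the name of every node whose STATUS column is not exactly "Ready".
# A cordoned node (STATUS "Ready,SchedulingDisabled") is therefore reported too.
not_ready_nodes() {
    awk '$2 != "Ready" { print $1 }'
}

# Usage on the OCP host:
#   oc get nodes --no-headers | not_ready_nodes
```

Empty output means every node reports Ready.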
Build & Run Docker Container¶
Create a Dockerfile on a Linux host. Internet access is required.
vi Dockerfile
Insert the below content into the newly created Dockerfile and save.
Note
The timezone set in this Dockerfile (Asia/Jerusalem) should be changed to match your location.
Note
To load the habanalabs driver properly, make sure to use centos:8.3.2011 as the base image in the Dockerfile.
FROM centos:8.3.2011
ARG KERNEL_VER=4.18.0-305.0.1.el8.x86_64
ARG KERNEL_SERVER_URL="http://168.63.67.169/centos/8-stream/BaseOS/x86_64/os/"
LABEL description="centos8.3 docker image for driver loading inside docker container"
LABEL kernel="${KERNEL_VER}"
ENV LC_CTYPE=en_US.UTF-8
ENV LANG=en_US.UTF-8
ENV KVERSION="${KERNEL_VER}"
RUN dnf install -y epel-release && \
dnf groupinstall -y "Development Tools" && \
dnf install -y --enablerepo=powertools \
curl \
wget \
sudo \
libarchive \
lapack-devel \
blas-devel \
rsync \
expect \
rpmdevtools \
ncurses-devel \
pinentry \
chrpath \
boost \
cmake \
clang \
yum-utils \
redhat-lsb-core \
mlocate \
nfs-utils \
boost-program-options \
boost-devel \
boost-filesystem \
boost-static \
cpp \
patch \
ncurses* \
curl-devel \
libarchive \
perl-ExtUtils-MakeMaker \
zlib-devel \
libjpeg-devel \
libxcrypt-static \
glibc-static \
lsof \
vim-common \
pciutils
RUN dnf clean all && \
rm -rf /var/cache/dnf
RUN cd /tmp/ && \
mkdir -p /lib/firmware/habanalabs/gaudi && \
wget "${KERNEL_SERVER_URL}/kernel-${KERNEL_VER}.rpm" && \
wget "${KERNEL_SERVER_URL}/kernel-core-${KERNEL_VER}.rpm" && \
wget "${KERNEL_SERVER_URL}/kernel-devel-${KERNEL_VER}.rpm" && \
wget "${KERNEL_SERVER_URL}/kernel-headers-${KERNEL_VER}.rpm" && \
wget "${KERNEL_SERVER_URL}/kernel-modules-${KERNEL_VER}.rpm" && \
dnf install -y /tmp/kernel-* && \
rm -rf /tmp/kernel-*
RUN ln -s "/lib/modules/${KERNEL_VER}" /lib/modules/4.18.0-305.0.1.el8 && \
ln -s /usr/bin/lsof /usr/sbin/lsof
# Set the right timezone; /etc/localtime still needs to be mounted over from the host
RUN echo "Asia/Jerusalem" > /etc/timezone && \
sed -i s/Defaults/#Defaults/g /etc/sudoers && \
echo 'coreos ALL=(ALL) NOPASSWD:ALL' >> /etc/sudoers && \
mkdir -p /etc/udev/rules.d/ && \
echo 'KERNEL=="hl[0-9sv]*", MODE="0666"' >> /etc/udev/rules.d/habana.rules
CMD [ "/bin/bash" ]
Build the Docker image:
docker build -f Dockerfile -t hl-core-os:centos8.3 .
Once the build is finished, upload the image to your local Docker registry. Example:
docker login <your docker registry URL>
docker push <your docker registry URL>/<path where your images are stored>/hl-core-os:centos8.3
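docker push expects the local image to carry its registry-qualified name, so a docker tag step is normally needed between the build and the push. A minimal sketch of that sequence follows. The function name push_image and the DOCKER override variable are hypothetical conveniences, not Docker features.

```shell
# push_image LOCAL_IMAGE REMOTE_IMAGE
# Tags LOCAL_IMAGE with its registry-qualified name and pushes it.
# DOCKER is an illustrative override: set DOCKER=echo to preview the
# commands without contacting any registry.
push_image() (
    local_image=$1
    remote_image=$2
    docker_cmd=${DOCKER:-docker}
    $docker_cmd tag "$local_image" "$remote_image" || exit 1
    $docker_cmd push "$remote_image"
)

# Usage, with the placeholders replaced by your registry details:
#   push_image hl-core-os:centos8.3 <your docker registry URL>/<path>/hl-core-os:centos8.3
```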
Note
Make sure your OCP host has access to the above Docker registry.
Run the container with the command below:
sudo podman run --name hl-rhel-coreos --entrypoint=bash --privileged=true -it <your docker registry URL>/<path where your images are stored>/hl-core-os:centos8.3
After running the above command, a shell prompt opens. You are now inside the running container.
Load habanalabs Driver Inside Running Docker Container¶
To prepare for loading the driver, make sure you have two terminal windows open, one on the OCP host and one inside the running container. Follow the steps below:
Install the dkms utility and create the kernel modules symlink:
cd /tmp
yum install -y dkms
cd /lib/modules
ln -s /lib/modules/4.18.0-305.0.1.el8 4.18.0-305.28.1.el8_4.x86_64
cd -
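The link name above appears to be the kernel release of the OCP host, which a container reports through uname -r because it shares the host kernel. Under that assumption, the link can be created without hard-coding the release. This is a sketch only, and the function name link_modules_dir is hypothetical.

```shell
# link_modules_dir TARGET KERNEL_RELEASE [MODULES_ROOT]
# Creates MODULES_ROOT/KERNEL_RELEASE as a symlink to TARGET, unless an
# entry with that name already exists. MODULES_ROOT defaults to /lib/modules.
link_modules_dir() (
    target=$1
    release=$2
    root=${3:-/lib/modules}
    link=$root/$release
    if [ -e "$link" ] || [ -L "$link" ]; then
        # Never overwrite an existing directory or link
        echo "already present: $link"
    else
        ln -s "$target" "$link" && echo "linked: $link -> $target"
    fi
)

# Usage inside the container (assumes uname -r matches the intended link name):
#   link_modules_dir /lib/modules/4.18.0-305.0.1.el8 "$(uname -r)"
```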
Download and install packages from the vault:
wget https://vault.habana.ai/artifactory/centos/8/8.3/habanalabs-1.4.1-11.el8.noarch.rpm
wget https://vault.habana.ai/artifactory/centos/8/8.3/habanalabs-firmware-1.4.1-11.el8.x86_64.rpm
wget https://vault.habana.ai/artifactory/centos/8/8.3/habanalabs-firmware-tools-1.4.1-11.el8.x86_64.rpm
wget https://vault.habana.ai/artifactory/centos/8/8.3/habanalabs-thunk-1.4.1-11.el8.x86_64.rpm
wget https://vault.habana.ai/artifactory/centos/8/8.3/habanatools-1.4.1-11.el8.x86_64.rpm
rpm -ivh ./habanalabs-firmware-1.4.1-11.el8.x86_64.rpm
rpm -ivh ./habanalabs-thunk-1.4.1-11.el8.x86_64.rpm
rpm -ivh ./habanalabs-firmware-tools-1.4.1-11.el8.x86_64.rpm
rpm -ivh ./habanatools-1.4.1-11.el8.x86_64.rpm
rpm -ivh ./habanalabs-1.4.1-11.el8.noarch.rpm
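The download and install commands above follow one URL pattern, so they can be generated from a package list. The sketch below reuses the vault path and the 1.4.1-11.el8 version from the commands above. The variable and function names (VAULT_BASE, HL_VERSION, vault_rpm_url, hl_packages) are illustrative.

```shell
VAULT_BASE=https://vault.habana.ai/artifactory/centos/8/8.3
HL_VERSION=1.4.1-11.el8

# vault_rpm_url NAME ARCH: prints the download URL of one package
vault_rpm_url() {
    printf '%s/%s-%s.%s.rpm\n' "$VAULT_BASE" "$1" "$HL_VERSION" "$2"
}

# hl_packages: prints "NAME ARCH" pairs in the installation order used above
hl_packages() {
    printf '%s\n' \
        'habanalabs-firmware x86_64' \
        'habanalabs-thunk x86_64' \
        'habanalabs-firmware-tools x86_64' \
        'habanatools x86_64' \
        'habanalabs noarch'
}

# Usage inside the container:
#   hl_packages | while read -r name arch; do
#       wget "$(vault_rpm_url "$name" "$arch")" &&
#       rpm -ivh "./$name-$HL_VERSION.$arch.rpm" || break
#   done
```

Changing HL_VERSION in one place then updates every URL and file name.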
On the OCP host, copy the firmware files installed above, together with hl-smi, from the running container to the overlay directory on the host.
Obtain the container ID of the running hl-rhel-coreos container:
sudo podman ps -ap
Copy the FW files (3e297ed24b36 is an example container ID; use the ID obtained above):
sudo podman cp 3e297ed24b36:/lib/firmware/habanalabs/gaudi/gaudi-boot-fit.itb /opt/habana/habanalabs/gaudi/
sudo podman cp 3e297ed24b36:/lib/firmware/habanalabs/gaudi/gaudi-fit.itb /opt/habana/habanalabs/gaudi/
sudo podman cp 3e297ed24b36:/lib/firmware/habanalabs/gaudi/gaudi_tpc.bin /opt/habana/habanalabs/gaudi/
sudo podman cp 3e297ed24b36:/usr/bin/hl-smi /opt/habana/habanalabs/gaudi/
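The four copy commands differ only in the source path, so they can be expressed as a loop. This is a minimal sketch. The function name copy_artifacts and the PODMAN override variable are hypothetical, while the file list and destination are taken from the commands above.

```shell
# copy_artifacts CONTAINER_ID [DEST_DIR]
# Copies the firmware files and hl-smi out of the container.
# PODMAN is an illustrative override: set PODMAN=echo to preview the commands.
copy_artifacts() (
    cid=$1
    dest=${2:-/opt/habana/habanalabs/gaudi/}
    podman_cmd=${PODMAN:-sudo podman}
    for src in \
        /lib/firmware/habanalabs/gaudi/gaudi-boot-fit.itb \
        /lib/firmware/habanalabs/gaudi/gaudi-fit.itb \
        /lib/firmware/habanalabs/gaudi/gaudi_tpc.bin \
        /usr/bin/hl-smi
    do
        # $podman_cmd is left unquoted on purpose so "sudo podman" splits into two words
        $podman_cmd cp "$cid:$src" "$dest" || exit 1
    done
)

# Usage on the OCP host; the container ID can be looked up by name:
#   cid=$(sudo podman ps -a --filter name=hl-rhel-coreos --format '{{.ID}}')
#   copy_artifacts "$cid"
```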
Make sure the overlay mount set up in Preparation For Running Docker Image on OCP-based Host is working properly:
ls -lh /lib/firmware/habanalabs/gaudi/
-rwxr-xr-x. 1 root root 726K Dec 15 04:04 gaudi-boot-fit.itb
-rwxr-xr-x. 1 root root 9.9M Dec 15 04:04 gaudi-fit.itb
-rwxr-xr-x. 1 root root 1.5K Dec 15 04:02 gaudi_tpc.bin
-rwxr-xr-x. 1 root root 2.3M Dec 15 04:06 hl-smi
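The same check can be scripted so that a missing or empty file is reported by name. The sketch below assumes the four file names shown in the listing above. The function name missing_artifacts is hypothetical.

```shell
# missing_artifacts [DIR]
# Prints each expected file that is missing or empty in DIR and returns
# non-zero if any is. DIR defaults to the overlay-mounted firmware directory.
missing_artifacts() (
    dir=${1:-/lib/firmware/habanalabs/gaudi}
    status=0
    for f in gaudi-boot-fit.itb gaudi-fit.itb gaudi_tpc.bin hl-smi; do
        if [ ! -s "$dir/$f" ]; then
            echo "$f"
            status=1
        fi
    done
    exit "$status"
)

# Usage on the OCP host:
#   missing_artifacts && echo "all files visible through the overlay"
```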
Return to the running docker container:
podman exec -it 3e297ed24b36 /bin/bash
Load the kernel module:
insmod /lib/modules/4.18.0-305.28.1.el8_4.x86_64/extra/habanalabs.ko.xz
Check that the driver is loaded properly inside the container:
lsmod | grep habanalabs
habanalabs 966656 0
habanalabs_en 32768 1 habanalabs
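Note that grep habanalabs also matches habanalabs_en, so a match alone does not prove the main module is present. A stricter check compares the module name exactly. This is a sketch, and the helper name module_listed is hypothetical.

```shell
# module_listed NAME: reads `lsmod` output on stdin and succeeds only if a
# module whose name is exactly NAME appears in the first column.
module_listed() {
    awk -v m="$1" '$1 == m { found = 1 } END { exit (found ? 0 : 1) }'
}

# Usage inside the container:
#   lsmod | module_listed habanalabs && echo "habanalabs is loaded"
```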
Return to the OCP host and check that the driver is loaded properly:
/opt/habana/habanalabs/gaudi/hl-smi -d PRODUCT CLOCK -L
================ HL-SMI LOG ================
Timestamp : Tue Jan 25 17:57:58 UTC 2022
Driver Version : 1.4.1-124dd38
HL-SMI Version : hl-1.4.1-fw-32.5.0.0 (Dec 15 2021 - 05:14:10)
Attached AIPs : 1
[0] AIP (hl0) 0000:03:00.0
Product Name : HL-200
Model Number : F08GL0AI2000A
Serial Number : AK30011368
Module ID : N/A
PCB Assembly Version : V0A
PCB Version : R0E
HL Revision : 2
AIP UUID : 00P3-HL2000B0-14-P63M75-01-01-09
Firmware [FIT] Version : Linux gaudi 5.10.18-hl-gaudi-1.4.1-fw-32.5.0-sec-4 #1 SMP PREEMPT Mon Nov 8 09:54:45 IST 2021 aarch64 GNU/Linux
Firmware [SPI] Version : BTL version 2f4e4ab7,Preboot version hl-gaudi-1.4.1-fw-32.5.0-sec-4 (Nov 08 2021 - 09:57:21)
Firmware [UBOOT] Version : U-Boot 2021.04-hl-gaudi-1.4.1-fw-32.5.0-sec-4 (Nov 08 2021 - 09:54:16 +0200) build#: 1564
grep -E "$" /sys/class/habanalabs/hl?/*ver | cut -d / -f5-
hl0/armcp_kernel_ver:Linux gaudi 5.10.18-hl-gaudi-1.4.1-fw-32.5.0-sec-4 #1 SMP PREEMPT Mon Nov 8 09:54:45 IST 2021 aarch64 GNU/Linux
hl0/armcp_ver:armcpd version hl-gaudi-1.4.1-fw-32.5.0-sec-4 (Nov 08 2021 - 09:56:46)
hl0/cpld_ver:0x0000000f
hl0/cpucp_kernel_ver:Linux gaudi 5.10.18-hl-gaudi-1.4.1-fw-32.5.0-sec-4 #1 SMP PREEMPT Mon Nov 8 09:54:45 IST 2021 aarch64 GNU/Linux
hl0/cpucp_ver:armcpd version hl-gaudi-1.4.1-fw-32.5.0-sec-4 (Nov 08 2021 - 09:56:46)
hl0/driver_ver:1.4.1-124dd38
hl0/fuse_ver:00P3-HL2000B0-14-P63M75-01-01-09
hl0/infineon_ver:0x0002
hl0/preboot_btl_ver:BTL version 2f4e4ab7
hl0/preboot_btl_ver:Preboot version hl-gaudi-1.4.1-fw-32.5.0-sec-4 (Nov 08 2021 - 09:57:21)
hl0/thermal_ver:thermald version hl-gaudi-1.4.1-fw-32.5.0-sec-4 (Nov 08 2021 - 09:57:31)
hl0/uboot_ver:U-Boot 2021.04-hl-gaudi-1.4.1-fw-32.5.0-sec-4 (Nov 08 2021 - 09:54:16 +0200) build#: 1564
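When only the driver version per device is of interest, the driver_ver entries shown above can be read directly. The sketch below assumes the /sys/class/habanalabs/hl?/driver_ver layout from the output above. The function name driver_versions is hypothetical.

```shell
# driver_versions [SYSFS_ROOT]
# Prints "DEVICE VERSION" for every hl? device that exposes driver_ver.
# SYSFS_ROOT defaults to /sys/class/habanalabs.
driver_versions() (
    root=${1:-/sys/class/habanalabs}
    for f in "$root"/hl?/driver_ver; do
        # When no device exists the glob stays literal and is skipped here
        [ -r "$f" ] || continue
        dev=$(basename "$(dirname "$f")")
        printf '%s %s\n' "$dev" "$(cat "$f")"
    done
)
```

Like the glob in the command above, hl? only matches single-character device suffixes, such as hl0 through hl9.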
Habana Device Plugin for Kubernetes¶
For details on the Habana device plugin for Kubernetes, refer to Habana Device Plugin for Kubernetes.
To deploy the device plugin, the associated .yaml file can be used to set up the environment:
oc create -f https://vault.habana.ai/artifactory/docker-k8s-device-plugin/habana-k8s-device-plugin.yaml
The device plugin pod runs in the habana-system namespace, so the default service account in that namespace must be granted the privileged security context constraint (SCC):
oc adm policy add-scc-to-user privileged -z default -n habana-system
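Expected output when listing the daemonsets in the habana-system namespace (for example, with oc get daemonset -n habana-system):
NAME                                       DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
habanalabs-device-plugin-daemonset-gaudi   1         1         1       1            1           <none>          155m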
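The daemonset is healthy when its READY count equals its DESIRED count. The sketch below checks this from the listing, assuming the default column order shown above (NAME, DESIRED, CURRENT, READY). The helper name daemonsets_not_ready is hypothetical.

```shell
# daemonsets_not_ready: reads `oc get daemonset --no-headers` output on
# stdin and prints every daemonset whose READY count differs from DESIRED.
daemonsets_not_ready() {
    awk '$2 != $4 { print $1 }'
}

# Usage on the OCP host:
#   oc get daemonset -n habana-system --no-headers | daemonsets_not_ready
```

Empty output means every listed daemonset has all of its desired pods ready.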
Run Demo Tests¶
To run demo tests, refer to the ResNet50 Keras Model Reference GitHub page.