Bare Metal Fresh OS Installation

The following table outlines the supported installation options and the steps required.

Objective: Run Framework on Bare Metal Fresh OS (TensorFlow/PyTorch)

Steps:

  1. Install SynapseAI SW stack

  2. Install framework

  3. Set up Python for Models

  4. Run models using Habana Model-References

Objective: Run Using Containers on Bare Metal Fresh OS

Steps:

  1. Install SynapseAI SW stack

  2. Set up Container Usage

  3. Pull Prebuilt Containers or Build Docker Images from Habana Dockerfiles

  4. Set up Python for Models

  5. Run models using Habana Model-References

Set Up SynapseAI SW Stack

When an internet connection is available, the installer downloads and installs the dependencies required by the SynapseAI package (using apt-get, yum install, pip install, etc.). The installation contains the following installers:

  • habanalabs-graph – installs the Graph Compiler and the run-time.

  • habanalabs-thunk – installs the thunk library.

  • habanalabs-dkms – installs the PCIe driver.

  • habanalabs-firmware – installs the Gaudi Firmware.

  • habanalabs-firmware-tools – installs various Firmware tools (hlml, hl-smi, etc).

  • habanalabs-qual – installs the qualification application package. See Qualification Library.

  • habanalabs-container-runtime – installs the container runtime library.

To install SynapseAI SW stack, run the following command:

wget -nv https://vault.habana.ai/artifactory/gaudi-installer/latest/habanalabs-installer.sh
chmod +x habanalabs-installer.sh
./habanalabs-installer.sh install --type base

For further instructions on how to control the script attributes, refer to the help guide by running the following command:

./habanalabs-installer.sh --help

Note

  • Running the package installs the latest version.

  • This script works only for currently supported Operating Systems specified in Support Matrix.

  • The installation sets the number of huge pages automatically.

  • To install each installer separately, refer to the detailed instructions in Installing SynapseAI SW Packages Individually.
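After the installer finishes, a quick sanity check is to confirm that the tools it ships are on the PATH. The sketch below assumes only that hl-smi (listed above under habanalabs-firmware-tools) is installed as a command. The helper name missing_tools is illustrative and is not part of the installer.

```shell
# missing_tools CMD...: print each named command that is not found on PATH.
# Returns 0 when every command is present, 1 otherwise.
missing_tools() {
    status=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool"
            status=1
        fi
    done
    return "$status"
}

# hl-smi is one of the tools installed by habanalabs-firmware-tools.
missing_tools hl-smi || echo "hl-smi not found; review the installer output"
```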

Bring up Network Interfaces

If you train using Gaudi network interfaces for multi-node scale-out (external Gaudi network interfaces between servers), make sure the network interfaces are brought up. They must be brought up again every time the kernel module is loaded, or unloaded and reloaded.

Note

This section is not relevant for AWS users.

A reference on how to bring up the interfaces is provided in the manage_network_ifs.sh script, as detailed in manage_network_ifs.sh.

Use the following commands:

# manage_network_ifs.sh requires ethtool
sudo apt-get install ethtool
./manage_network_ifs.sh --up

Habana Driver Unattended Upgrade

Unattended upgrade automatically installs the latest Habana drivers (habanalabs and habanalabs_en).

Note

Unattended upgrade is supported from v1.3.0 onward.

  1. Install unattended upgrade:

sudo apt install --only-upgrade habanalabs-dkms

After running unattended upgrade, you must unload and reload the drivers or restart your machine. The habanalabs_en driver must be loaded before the habanalabs driver and unloaded after the habanalabs driver.

  2. Unload the habanalabs driver first and the habanalabs_en driver after:

sudo modprobe -r <driver name>

  3. Load the habanalabs_en driver first and the habanalabs driver after:

sudo modprobe <driver name>
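The unload and load order can be written out with the two driver names substituted for <driver name>. This is a minimal sketch, not an official script. DRY_RUN is a hypothetical switch added here so that the sequence can be inspected without root access, and it defaults to printing the commands instead of running them.

```shell
# DRY_RUN=1 (default in this sketch) prints each command; DRY_RUN=0 runs it.
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

reload_habana_drivers() {
    # Unload: habanalabs first, habanalabs_en after.
    run sudo modprobe -r habanalabs || return 1
    run sudo modprobe -r habanalabs_en || return 1
    # Load: habanalabs_en first, habanalabs after.
    run sudo modprobe habanalabs_en || return 1
    run sudo modprobe habanalabs || return 1
}

reload_habana_drivers
```

Set DRY_RUN=0 before calling reload_habana_drivers to execute the commands.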

On yum-based systems, install unattended upgrade with the following command instead:

sudo yum update habanalabs

Then unload and load the drivers in the same order as described above, or restart your machine.

Install Native Frameworks

Installing frameworks with docker is the recommended installation method and does not require additional steps.

TensorFlow Installation

This section describes how to obtain and install the TensorFlow software package. Follow these instructions if you want to install the TensorFlow packages on a Bare Metal platform without a Docker image. The package consists of two main components to guarantee the same functionality delivered with TensorFlow Docker:

  • Base habana-tensorflow Python package - Libraries and modules needed to execute TensorFlow on a single Gaudi device.

  • Scale-out habana-horovod Python package - Libraries and modules needed to execute TensorFlow on a single-node machine.

To install Habana TensorFlow, run the following command.

./habanalabs-installer.sh install --type tensorflow --venv

Note

  • Running the above command installs the latest version.

  • This script works only for currently supported Operating Systems specified in Support Matrix.

The --venv flag installs the relevant framework inside the virtual environment. The default virtual environment folder is $HOME/habanalabs-venv. To override the default, run the following command:

export HABANALABS_VIRTUAL_DIR=xxxx
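As a sketch of how the override and the default interact, the helper below resolves the folder the same way the text describes and then activates the environment if it exists. The helper name habana_venv_dir is illustrative, and the bin/activate path assumes the standard Python venv layout, which this guide does not state explicitly.

```shell
# Resolve the virtual environment folder: HABANALABS_VIRTUAL_DIR if set,
# otherwise the default $HOME/habanalabs-venv named above.
habana_venv_dir() {
    echo "${HABANALABS_VIRTUAL_DIR:-$HOME/habanalabs-venv}"
}

venv_dir=$(habana_venv_dir)
# bin/activate is the usual venv layout (an assumption in this sketch).
if [ -f "$venv_dir/bin/activate" ]; then
    . "$venv_dir/bin/activate"
else
    echo "no virtual environment found in $venv_dir"
fi
```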

Model References Requirements

Habana provides a number of model references optimized to run on Gaudi. These models are available on the Model-References page.

Many of the references require additional Python packages (installed with pip tools), not provided by Habana. The packages required to run topologies from Model References repository are defined in per-topology requirements.txt files in each folder containing the topologies’ scripts.
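To see which per-topology requirements files a local checkout contains, a small helper such as the one below can list them. This is only a sketch: the function name list_requirements is illustrative, and the location of the checkout is whatever directory you pass in.

```shell
# list_requirements DIR: print every requirements.txt found below DIR,
# one path per line, in sorted order.
list_requirements() {
    find "$1" -type f -name requirements.txt | sort
}
```

Each path printed by list_requirements can then be passed to pip, for example with python3 -m pip install -r followed by the path, inside the environment used to run the model.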

PyTorch Installation

This section describes how to obtain and install the PyTorch software package. Follow the instructions outlined below to install PyTorch packages on a bare metal platform or virtual machine without a Docker image.

Habana PyTorch packages consist of:

  • torch - PyTorch framework package with Habana support

  • habana-torch-plugin - Libraries and modules needed to execute PyTorch on single card, single node and multi node setup.

  • habana-torch-dataloader - Habana multi-threaded dataloader package.

  • torchvision - Torchvision package compiled in torch environment. No Habana specific changes in this package.

To install Habana PyTorch environment, run the following command.

./habanalabs-installer.sh install --type pytorch --venv

Note

  • Running the above command installs the latest version.

  • This script works only for currently supported Operating Systems specified in Support Matrix.

The --venv flag installs the relevant framework inside the virtual environment. The default virtual environment folder is $HOME/habanalabs-venv. To override the default, run the following command:

export HABANALABS_VIRTUAL_DIR=xxxx

Model References Requirements

Some PyTorch models need additional python packages. They can be installed using python requirements files provided in Model References repository. Refer to Model References repository for detailed instructions on running PyTorch models.

Run Using Containers

Set up SynapseAI SW Stack

Package Retrieval (Ubuntu):

  1. Download and install the public key:

curl -X GET https://vault.habana.ai/artifactory/api/gpg/key/public | sudo apt-key add --

  2. Get the name of the operating system:

lsb_release -c | awk '{print $2}'

  3. Create an apt source file /etc/apt/sources.list.d/artifactory.list with the following content:

deb https://vault.habana.ai/artifactory/debian <OS name from previous step> main

  4. Update Debian cache:

sudo dpkg --configure -a

sudo apt-get update
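The source line from the steps above can be generated from the detected codename instead of being typed by hand. The following is a minimal sketch. The helper name habana_apt_source_line is illustrative, and the script only prints the line, so nothing under /etc is modified.

```shell
# habana_apt_source_line CODENAME: print the apt source line for CODENAME.
habana_apt_source_line() {
    echo "deb https://vault.habana.ai/artifactory/debian $1 main"
}

# Use the detected codename when lsb_release is available.
if command -v lsb_release >/dev/null 2>&1; then
    habana_apt_source_line "$(lsb_release -c | awk '{print $2}')"
fi
```

To write the file, pipe the printed line through sudo tee /etc/apt/sources.list.d/artifactory.list.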

Firmware Installation:

Install the Firmware:

sudo apt install -y habanalabs-firmware

Driver Installation:

The habanalabs-dkms_all package installs both the habanalabs and habanalabs_en (Ethernet) drivers. If automation scripts are used, the scripts must be modified to load/unload both drivers.

On kernels 5.12 and later, you can load/unload the two drivers in no specific order. On kernels below 5.12, the habanalabs_en driver must be loaded before the habanalabs driver and unloaded after the habanalabs driver.

  1. Run the following command to install both the habanalabs and habanalabs_en drivers:

sudo apt install -y habanalabs-dkms

  2. Load the habanalabs_en driver first and the habanalabs driver after:

sudo modprobe <driver name>
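Whether the load order matters depends on the running kernel, as noted above. The sketch below compares the kernel release against 5.12. The function name order_matters is illustrative, and the parsing assumes a release string of the usual major.minor form, such as the output of uname -r.

```shell
# order_matters RELEASE: succeed (return 0) when RELEASE is older than 5.12,
# that is, when habanalabs_en must be loaded first and unloaded last.
order_matters() {
    release=$1
    major=${release%%.*}
    rest=${release#*.}
    minor=${rest%%[!0-9]*}
    [ "$major" -lt 5 ] || { [ "$major" -eq 5 ] && [ "$minor" -lt 12 ]; }
}

if order_matters "$(uname -r)"; then
    echo "kernel $(uname -r): load habanalabs_en before habanalabs"
else
    echo "kernel $(uname -r): load order is not significant"
fi
```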

You can enable unattended upgrade to automatically install the latest Habana drivers. See Habana Driver Unattended Upgrade.

Package Retrieval (Amazon Linux 2):

  1. Create /etc/yum.repos.d/Habana-Vault.repo with the following content:

[vault]
name=Habana Vault
baseurl=https://vault.habana.ai/artifactory/AmazonLinux2
enabled=1
gpgcheck=0
gpgkey=https://vault.habana.ai/artifactory/AmazonLinux2/repodata/repomod.xml.key
repo_gpgcheck=0

  2. Update YUM cache by running the following command:

sudo yum makecache

  3. Verify correct binding by running the following command:

yum search habana

This will search for and list all packages with the word Habana.
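The repository file can also be written from a script. The sketch below reproduces the Amazon Linux 2 definition shown above and writes it to a scratch file first, so the result can be reviewed before it is copied into place. The function name write_habana_repo is illustrative.

```shell
# write_habana_repo FILE: write the Amazon Linux 2 repository definition
# shown above to FILE.
write_habana_repo() {
    cat > "$1" <<'EOF'
[vault]
name=Habana Vault
baseurl=https://vault.habana.ai/artifactory/AmazonLinux2
enabled=1
gpgcheck=0
gpgkey=https://vault.habana.ai/artifactory/AmazonLinux2/repodata/repomod.xml.key
repo_gpgcheck=0
EOF
}

# Write to a scratch file and show it for review.
scratch=$(mktemp)
write_habana_repo "$scratch"
cat "$scratch"
```

After reviewing the output, copy the scratch file to /etc/yum.repos.d/Habana-Vault.repo with sudo cp.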

Firmware Installation:

Install the Firmware:

sudo yum install -y habanalabs-firmware

Driver Installation:

The habanalabs-dkms_all package installs both the habanalabs and habanalabs_en (Ethernet) drivers. If automation scripts are used, the scripts must be modified to load/unload both drivers.

On kernels 5.12 and later, you can load/unload the two drivers in no specific order. On kernels below 5.12, the habanalabs_en driver must be loaded before the habanalabs driver and unloaded after the habanalabs driver.

The commands below install/uninstall both the habanalabs and habanalabs_en drivers.

  1. (Recommended) Remove the previous driver package:

sudo yum remove habanalabs*

  2. Install the driver:

sudo yum install -y habanalabs

  3. Load the habanalabs_en driver first and the habanalabs driver after:

sudo modprobe <driver name>

Package Retrieval (RHEL 8.6):

  1. Create /etc/yum.repos.d/Habana-Vault.repo with the following content:

[vault]
name=Habana Vault
baseurl=https://vault.habana.ai/artifactory/rhel/8/8.6
enabled=1
repo_gpgcheck=0

  2. Update YUM cache by running the following command:

sudo yum makecache

  3. Verify correct binding by running the following command:

yum search habana

This will search for and list all packages with the word Habana.

  4. Reinstall the libarchive package by running the following command:

sudo dnf install -y libarchive*

Firmware Installation:

Install the Firmware:

sudo yum install -y habanalabs-firmware

Driver Installation:

The habanalabs-dkms_all package installs both the habanalabs and habanalabs_en (Ethernet) drivers. If automation scripts are used, the scripts must be modified to load/unload both drivers.

On kernels 5.12 and later, you can load/unload the two drivers in no specific order. On kernels below 5.12, the habanalabs_en driver must be loaded before the habanalabs driver and unloaded after the habanalabs driver.

The commands below install/uninstall both the habanalabs and habanalabs_en drivers.

  1. (Recommended) Remove the previous driver package:

sudo yum remove habanalabs*

  2. Install the driver:

sudo yum install -y habanalabs

  3. Load the habanalabs_en driver first and the habanalabs driver after:

sudo modprobe <driver name>

Set up Container Usage

To run containers, make sure to install and set up container runtime as detailed in the below sections.

Install Container Runtime

The container runtime is a modified runc that installs the container runtime library. This provides you the ability to select the devices to be mounted in the container. You only need to specify the indices of the devices for the container, and the container runtime will handle the rest. The container runtime can support both docker and Kubernetes.

Package Retrieval (Ubuntu):

  1. Download and install the public key:

curl -X GET https://vault.habana.ai/artifactory/api/gpg/key/public | sudo apt-key add --

  2. Get the name of the operating system:

lsb_release -c | awk '{print $2}'

  3. Create an apt source file /etc/apt/sources.list.d/artifactory.list with the following content:

deb https://vault.habana.ai/artifactory/debian <OS name from previous step> main

  4. Update Debian cache:

sudo dpkg --configure -a

sudo apt-get update

Install habanalabs-container-runtime:

Install the habanalabs-container-runtime package:

sudo apt install -y habanalabs-container-runtime

Package Retrieval (Amazon Linux 2):

  1. Create /etc/yum.repos.d/Habana-Vault.repo with the following content:

[vault]
name=Habana Vault
baseurl=https://vault.habana.ai/artifactory/AmazonLinux2
enabled=1
gpgcheck=0
gpgkey=https://vault.habana.ai/artifactory/AmazonLinux2/repodata/repomod.xml.key
repo_gpgcheck=0

  2. Update YUM cache by running the following command:

sudo yum makecache

  3. Verify correct binding by running the following command:

yum search habana

This will search for and list all packages with the word Habana.

Install habanalabs-container-runtime:

Install the habanalabs-container-runtime package:

sudo yum install -y habanalabs-container-runtime

Package Retrieval (RHEL 8.6):

  1. Create /etc/yum.repos.d/Habana-Vault.repo with the following content:

[vault]
name=Habana Vault
baseurl=https://vault.habana.ai/artifactory/rhel/8/8.6
enabled=1
repo_gpgcheck=0

  2. Update YUM cache by running the following command:

sudo yum makecache

  3. Verify correct binding by running the following command:

yum search habana

This will search for and list all packages with the word Habana.

  4. Reinstall the libarchive package by running the following command:

sudo dnf install -y libarchive*

Install habanalabs-container-runtime:

Install the habanalabs-container-runtime package:

sudo yum install -y habanalabs-container-runtime

Set up Container Runtime

To register the habana runtime, use the method below that is best suited to your environment. You might need to merge the new argument with your existing configuration.

Note

As of Kubernetes 1.20, support for Docker has been deprecated.

  1. Register Habana runtime by adding the following to /etc/docker/daemon.json:

sudo tee /etc/docker/daemon.json <<EOF
{
   "runtimes": {
      "habana": {
            "path": "/usr/bin/habana-container-runtime",
            "runtimeArgs": []
      }
   }
}
EOF

  2. (Optional) For Kubernetes, reconfigure the default runtime by adding the following to /etc/docker/daemon.json:

"default-runtime": "habana"

It will look similar to this:

{
   "default-runtime": "habana",
   "runtimes": {
      "habana": {
         "path": "/usr/bin/habana-container-runtime",
         "runtimeArgs": []
      }
   }
}

  3. Restart Docker:

sudo systemctl restart docker

If a host machine has eight Habana devices, you can mount all of them by setting the environment variable HABANA_VISIBLE_DEVICES=all, as in the following example:

docker run --rm --runtime=habana -e HABANA_VISIBLE_DEVICES=all {docker image} /bin/bash -c "ls /dev/hl*"
/dev/hl0
/dev/hl1
/dev/hl2
/dev/hl3
/dev/hl4
/dev/hl5
/dev/hl6
/dev/hl7
/dev/hl_controlD0
/dev/hl_controlD1
/dev/hl_controlD2
/dev/hl_controlD3
/dev/hl_controlD4
/dev/hl_controlD5
/dev/hl_controlD6
/dev/hl_controlD7

This variable controls which Habana devices will be made accessible inside the container. Possible values:

  • 0,1,2 … - A comma-separated list of index(es).

  • all - All Habana devices will be accessible. This is the default value.
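To expose only some of the devices, pass a comma-separated list of indices instead of all. The sketch below builds that list from individual indices and prints, rather than runs, the resulting docker command. The helper name habana_devices is illustrative, and {docker image} is the same placeholder used in the example above.

```shell
# habana_devices INDEX...: join device indices into the comma-separated
# form accepted by HABANA_VISIBLE_DEVICES; with no arguments, print "all".
habana_devices() {
    if [ "$#" -eq 0 ]; then
        echo all
        return 0
    fi
    list=""
    for idx in "$@"; do
        case $idx in
            ''|*[!0-9]*) echo "invalid device index: $idx" >&2; return 1 ;;
        esac
        list="${list:+$list,}$idx"
    done
    echo "$list"
}

# Print a docker command that exposes devices 0 and 1 only.
echo "docker run --rm --runtime=habana -e HABANA_VISIBLE_DEVICES=$(habana_devices 0 1) {docker image} /bin/bash -c \"ls /dev/hl*\""
```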

  1. For containerd, register the Habana runtime in /etc/containerd/config.toml:

sudo tee /etc/containerd/config.toml <<EOF
disabled_plugins = []
version = 2

   [plugins]
   [plugins."io.containerd.grpc.v1.cri"]
      [plugins."io.containerd.grpc.v1.cri".containerd]
         default_runtime_name = "habana"
         [plugins."io.containerd.grpc.v1.cri".containerd.runtimes]
         [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.habana]
            runtime_type = "io.containerd.runc.v2"
            [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.habana.options]
               BinaryName = "/usr/bin/habana-container-runtime"
   [plugins."io.containerd.runtime.v1.linux"]
      runtime = "habana-container-runtime"
EOF

  2. Restart containerd:

sudo systemctl restart containerd

Pull Prebuilt Containers

Prebuilt containers are provided in:

  • Habana Vault

  • Amazon ECR Public Library

  • AWS Deep Learning Containers (DLC)

Pull and Launch Docker Image - Habana Vault

Note

Before running docker, make sure to map the dataset as detailed in Map Dataset to Docker.

To pull and run the Habana Docker images use the below code examples. Update the parameters listed in the following table to run the desired configuration.

Parameter     Description                  Values
$OS           Operating System of Image    [ubuntu18.04, ubuntu20.04, amzn2, rhel8.6]
$TF_VERSION   Desired TensorFlow Version   [2.10.1, 2.8.4]
$PT_VERSION   PyTorch Version              [1.13.0]

docker pull vault.habana.ai/gaudi-docker/1.7.1/${OS}/habanalabs/tensorflow-installer-tf-cpu-${TF_VERSION}:latest
docker run -it --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --net=host vault.habana.ai/gaudi-docker/1.7.1/${OS}/habanalabs/tensorflow-installer-tf-cpu-${TF_VERSION}:latest
docker pull vault.habana.ai/gaudi-docker/1.7.1/${OS}/habanalabs/pytorch-installer-1.13.0:latest
docker run -it --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --net=host --ipc=host vault.habana.ai/gaudi-docker/1.7.1/${OS}/habanalabs/pytorch-installer-1.13.0:latest
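The image names follow one pattern per framework, so they can be composed from the parameters in the table. This is a sketch only: the helper names are illustrative, and RELEASE defaults to 1.7.1 because that is the release that appears in the commands above.

```shell
# Release component of the image path; 1.7.1 is taken from the commands above.
RELEASE=${RELEASE:-1.7.1}

# habana_tf_image OS TF_VERSION: print the TensorFlow image name.
habana_tf_image() {
    echo "vault.habana.ai/gaudi-docker/$RELEASE/$1/habanalabs/tensorflow-installer-tf-cpu-$2:latest"
}

# habana_pt_image OS PT_VERSION: print the PyTorch image name.
habana_pt_image() {
    echo "vault.habana.ai/gaudi-docker/$RELEASE/$1/habanalabs/pytorch-installer-$2:latest"
}

habana_tf_image ubuntu20.04 2.10.1
habana_pt_image ubuntu20.04 1.13.0
```

The printed name can be passed directly to docker pull or used as the final argument of docker run.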

AWS Deep Learning Containers

To set up and use AWS Deep Learning Containers, follow the instructions detailed in AWS Available Deep Learning Containers Images.

Build Docker Images from Habana Dockerfiles

  1. Download Docker files and build script from the Setup and Install Repo to a local directory.

  2. Run the build script to generate a Docker image:

./docker_build.sh mode [tensorflow,pytorch] os [ubuntu18.04,ubuntu20.04,amzn2,rhel8.6] tf_version [{Habana TF Version 1}, {Habana TF Version 2}]

For example:

./docker_build.sh tensorflow ubuntu20.04 2.8.4

Launch Docker Image that was Built

Note

Before running docker, make sure to map the dataset as detailed in Map Dataset to Docker.

Launch the docker image using the below code examples. Update the parameters listed in the following table to run the desired configuration.

Parameter     Description                  Values
$OS           Operating System of Image    [ubuntu18.04, ubuntu20.04, amzn2, rhel8.6]
$TF_VERSION   Desired TensorFlow Version   [2.10.1, 2.8.4]

docker run -it --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --net=host vault.habana.ai/gaudi-docker/1.7.1/${OS}/habanalabs/tensorflow-installer-tf-cpu-${TF_VERSION}:latest
docker run -it --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --net=host --ipc=host vault.habana.ai/gaudi-docker/1.7.1/${OS}/habanalabs/pytorch-installer-1.13.0:latest

Map Dataset to Docker

Make sure to download the dataset prior to running docker and mount the location of your dataset to the docker by adding the below flag. For example, host dataset location /opt/datasets/imagenet will mount to /datasets/imagenet inside the docker:

-v /opt/datasets/imagenet:/datasets/imagenet

Note

Optional: add the following flag to mount a shared folder from the host into the container so that you can transfer files out of it:

-v $HOME/shared:/root/shared
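Because Docker creates a missing host directory when it is named in a -v flag, a typo in the dataset path can go unnoticed and leave the container with an empty folder. The sketch below checks the host directory before producing the flag. The helper name mount_flag is illustrative, and the paths are the ones from the example above.

```shell
# mount_flag HOST_DIR CONTAINER_DIR: print a -v flag for docker run,
# refusing host directories that do not exist.
mount_flag() {
    if [ ! -d "$1" ]; then
        echo "host directory not found: $1" >&2
        return 1
    fi
    echo "-v $1:$2"
}

mount_flag /opt/datasets/imagenet /datasets/imagenet \
    || echo "download the dataset before starting the container"
```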

Set up Python for Models

Using your own models requires setting Python 3.8 as the default Python version. If Python 3.8 is not the default version, replace any call to the python command in your model with $PYTHON and define the environment variable as below:

export PYTHON=/usr/bin/python3.8

Running models from Habana Model-References requires the PYTHON environment variable to match the supported Python release:

export PYTHON=/usr/bin/python3.8

Note

Python 3.8 is the supported python release for all Operating Systems listed in the Support Matrix.
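A short check can confirm that the interpreter behind $PYTHON is the supported release before a model is launched. The following is a sketch under the assumption that the interpreter reports its version in the usual "Python X.Y.Z" form. The helper name python_series is illustrative.

```shell
# python_series "Python X.Y.Z": print the X.Y part of a version string.
python_series() {
    version=${1#Python }
    major=${version%%.*}
    rest=${version#*.}
    echo "$major.${rest%%.*}"
}

PYTHON=${PYTHON:-/usr/bin/python3.8}
if [ -x "$PYTHON" ]; then
    series=$(python_series "$("$PYTHON" --version 2>&1)")
    [ "$series" = "3.8" ] || echo "warning: $PYTHON is Python $series, not 3.8"
else
    echo "warning: $PYTHON not found"
fi
```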