Intel Gaudi Software Stack and Driver Installation

The following table outlines the supported installation options and the steps required.

Objective

Steps

Run PyTorch on Bare Metal Fresh OS

  1. Install Intel Gaudi SW Stack

  2. Install PyTorch

  3. Set up Python for Models

  4. Run models using Intel Gaudi Model References GitHub Repository

Run Using Containers on Bare Metal Fresh OS

  1. Install Intel Gaudi SW stack

  2. Set up Container Usage

  3. Pull Prebuilt Containers or Build Docker Images from Intel Gaudi Dockerfiles

  4. Set up Python for Models

  5. Run models using Intel Gaudi Model References GitHub repository

Note

Before installing the below packages and Dockers, make sure to review the currently supported versions and operating systems listed in the Support Matrix.

Run PyTorch on Bare Metal Fresh OS

Set Up Intel Gaudi Software Stack

Installing the package with an internet connection available lets the package manager download and install the dependencies required by the Intel® Gaudi® software package (apt-get, yum install, pip install, etc.). The installation contains the following installers:

  • habanalabs-graph – installs the graph compiler and the runtime.

  • habanalabs-thunk – installs the thunk library.

  • habanalabs-dkms – installs the habanalabs, habanalabs_cn, habanalabs_en and habanalabs_ib drivers. The habanalabs_ib driver is supported on Gaudi 2 only.

  • habanalabs-rdma-core – installs IBVerbs libraries which provide Intel Gaudi’s libhlib along with libibverbs. The habanalabs-rdma-core package is supported on Gaudi 2 only.

  • habanalabs-firmware – installs the Gaudi firmware.

  • habanalabs-firmware-tools – installs various firmware tools (hlml, hl-smi, etc).

  • habanalabs-qual – installs the qualification application package.

  • habanalabs-container-runtime – installs the habanalabs-container-runtime library.

To install the Intel Gaudi SW stack, perform the following:

  1. Verify the current Intel Gaudi SW version by running the hl-smi tool. Running the installer or Docker image requires the correct SW version. Use the HL-SMI Version at the top of the output. For example, if the installed version is 1.15.1, the output should be as follows:

     HL-SMI Version:       hl-1.15.1-XXXXXXX
     Driver Version:       1.15.1-XXXXXX
    
  2. Install the Intel Gaudi SW stack by running the following command:

    wget -nv https://vault.habana.ai/artifactory/gaudi-installer/1.16.0/habanalabs-installer.sh
    chmod +x habanalabs-installer.sh
    ./habanalabs-installer.sh install --type base
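
The version check in step 1 can be scripted. The following is a minimal sketch and not part of the installer: the helper names hl_smi_release and installer_url are hypothetical, and the URL layout is assumed from the 1.16.0 link above, so confirm the release against the Support Matrix before downloading.

```shell
# Hypothetical helper: read hl-smi output on stdin and print the release
# number (for example 1.15.1) taken from the "HL-SMI Version" line.
hl_smi_release() {
    sed -n 's/.*HL-SMI Version:[[:space:]]*hl-\([0-9][0-9.]*\)-.*/\1/p' | head -n 1
}

# Hypothetical helper: build the installer URL for a release, assuming the
# same URL layout as the 1.16.0 link shown above.
installer_url() {
    printf 'https://vault.habana.ai/artifactory/gaudi-installer/%s/habanalabs-installer.sh\n' "$1"
}
```

For example, piping hl-smi into hl_smi_release prints the installed release, and installer_url 1.16.0 prints the download link used in step 2.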
    

Note

  • The installation sets the number of huge pages automatically.

  • To install each installer separately, refer to the detailed instructions in Installing Intel Gaudi SW Packages Individually.

  • This script supports fresh installations only. SW upgrades are not supported.

For further instructions on how to control the script attributes, refer to the help guide by running the following command:

./habanalabs-installer.sh --help

Bring up Network Interfaces

Ensure the network interfaces are brought up when training using external Gaudi network interfaces between servers for multi-server scale-out. These interfaces need to be brought up every time the kernel module is loaded or unloaded and reloaded.

A reference on how to bring up the interfaces is provided in the manage_network_ifs.sh script.

Run the following commands to bring up the interfaces:

# manage_network_ifs.sh requires ethtool
sudo apt-get install ethtool
./manage_network_ifs.sh --up

Install PyTorch

This section describes how to obtain and install the PyTorch software package. Follow the instructions below to install PyTorch packages on a bare metal platform or virtual machine.

Note

Installing PyTorch with Docker is the recommended installation method and does not require additional steps. For further details, refer to the Pull and Launch Docker Image - Intel Gaudi Vault section.

The Intel Gaudi PyTorch package consists of:

  • torch - PyTorch framework package with Gaudi support.

  • habana-torch-plugin - Libraries and modules needed to execute PyTorch on single card, single-server and multi-server setup.

  • habana-torch-dataloader - Gaudi multi-threaded dataloader package.

  • torchvision and torchaudio - Torchvision and Torchaudio packages compiled in the torch environment. These packages contain no Gaudi-specific changes.

  • habana-gpu-migration - The library for the GPU Migration toolkit.

  • torch-tb-profiler - The Tensorboard plugin used to display Gaudi-specific information on TensorBoard.

  • habana_quantization_toolkit - Provides model measurement and quantization capabilities in PyTorch models with Gaudi 2.

To install the Intel Gaudi PyTorch environment, run the following commands:

./habanalabs-installer.sh install -t dependencies
./habanalabs-installer.sh install --type pytorch --venv

Note

  • Installing dependencies requires sudo permission.

  • Verify that PyTorch is not already installed in a path listed in the PYTHONPATH environment variable. If it is, uninstall it before proceeding or remove the path from PYTHONPATH.
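
The PYTHONPATH check in the note above can be done with a short shell function. This is a sketch under the assumption that a conflicting installation shows up as a torch directory directly inside a PYTHONPATH entry; the function name is hypothetical.

```shell
# Hypothetical helper: print every PYTHONPATH entry that contains a torch
# package directory. The body runs in a subshell so that the IFS and
# globbing changes do not leak into the calling shell.
find_torch_in_pythonpath() (
    IFS=:
    set -f
    for entry in ${PYTHONPATH:-}; do
        if [ -n "$entry" ] && [ -d "$entry/torch" ]; then
            printf '%s\n' "$entry"
        fi
    done
)
```

If the function prints nothing, no PYTHONPATH entry contains a torch directory. Any printed path should be removed from PYTHONPATH, or the package in it uninstalled, before running the installer.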

The --venv flag installs PyTorch inside a virtual environment. The default virtual environment folder is $HOME/habanalabs-venv. To override the default, set the following environment variable before running the installer:

export HABANALABS_VIRTUAL_DIR=xxxx
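
The folder selection described above can be expressed as a small helper. This is a sketch only: both function names are hypothetical, and the activation step assumes the installer creates a standard Python virtual environment with a bin/activate script, which this guide does not confirm.

```shell
# Hypothetical helper: print the virtual environment folder that the
# --venv flag uses, honoring HABANALABS_VIRTUAL_DIR when it is set.
gaudi_venv_dir() {
    printf '%s\n' "${HABANALABS_VIRTUAL_DIR:-$HOME/habanalabs-venv}"
}

# Assumption: the folder is a standard Python virtual environment.
# Activate it only if the activate script is present.
activate_gaudi_venv() {
    activate="$(gaudi_venv_dir)/bin/activate"
    if [ -f "$activate" ]; then
        . "$activate"
    else
        echo "No virtual environment found at $(gaudi_venv_dir)" >&2
        return 1
    fi
}
```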

Model References Requirements

Some PyTorch models need additional Python packages. These can be installed using the Python requirements files provided in the Model References repository. Refer to the Model References repository for detailed instructions on running PyTorch models.

If you want to resume the system level installation, refer to Environment Variables and Configurations Update.

Run Using Containers on Bare Metal Fresh OS

Set up Intel Gaudi Software Stack

Package Retrieval:

  1. Download and install the public key:

    curl -X GET https://vault.habana.ai/artifactory/api/gpg/key/public | sudo apt-key add --
    
  2. Get the name of the operating system:

    lsb_release -c | awk '{print $2}'
    
  3. Create an apt source file /etc/apt/sources.list.d/artifactory.list with the following content:

    deb https://vault.habana.ai/artifactory/debian <OS name from previous step> main

  4. Update Debian cache:

    sudo dpkg --configure -a
    
    sudo apt-get update
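
Steps 2 and 3 can be combined so that the codename is not typed by hand. The following is a minimal sketch; the function name is hypothetical, and writing the file still requires root permission.

```shell
# Hypothetical helper: print the apt source line for a distribution codename.
habana_apt_source_line() {
    printf 'deb https://vault.habana.ai/artifactory/debian %s main\n' "$1"
}

# Usage sketch (needs lsb_release and root permission to write the file):
#   habana_apt_source_line "$(lsb_release -c | awk '{print $2}')" |
#       sudo tee /etc/apt/sources.list.d/artifactory.list
```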
    

Firmware Installation:

To install the FW, run the following:

sudo apt install -y habanalabs-firmware

Driver Installation:

The habanalabs-dkms_all package installs the habanalabs, habanalabs_cn, habanalabs_en (Ethernet) and habanalabs_ib drivers. If automation scripts are used, the scripts must be modified to load/unload the drivers.

Note

habanalabs_ib driver is available on Gaudi 2 only.

  1. Run the below command to install all drivers:

    sudo apt install -y habanalabs-dkms
    
  2. Unload the drivers in this order - habanalabs, habanalabs_cn, habanalabs_en and habanalabs_ib:

    sudo modprobe -r <driver name>
    
  3. Load the drivers in this order - habanalabs_en and habanalabs_ib, habanalabs_cn, habanalabs:

    sudo modprobe <driver name>
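
The unload and load orders above can be written down once and replayed. The sketch below only prints the modprobe commands so that the order can be reviewed before anything is run. The function and variable names are hypothetical, and the lists should be trimmed on systems where habanalabs_ib is not available.

```shell
# Driver orders as listed in steps 2 and 3 above.
GAUDI_UNLOAD_ORDER="habanalabs habanalabs_cn habanalabs_en habanalabs_ib"
GAUDI_LOAD_ORDER="habanalabs_en habanalabs_ib habanalabs_cn habanalabs"

# Print one "modprobe -r" command per driver, in unload order.
print_unload_commands() {
    for drv in $GAUDI_UNLOAD_ORDER; do
        printf 'sudo modprobe -r %s\n' "$drv"
    done
}

# Print one "modprobe" command per driver, in load order.
print_load_commands() {
    for drv in $GAUDI_LOAD_ORDER; do
        printf 'sudo modprobe %s\n' "$drv"
    done
}
```

Once the printed commands look right, they can be executed by piping the output to sh.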
    

Note

Amazon Linux 2 installation is available on first-gen Gaudi only.

Package Retrieval:

  1. Create /etc/yum.repos.d/Habana-Vault.repo with the following content:

    [vault]
    
    name=Habana Vault
    
    baseurl=https://vault.habana.ai/artifactory/AmazonLinux2
    
    enabled=1
    
    gpgcheck=0
    
    gpgkey=https://vault.habana.ai/artifactory/AmazonLinux2/repodata/repomod.xml.key
    
    repo_gpgcheck=0
    
  2. Update YUM cache by running the following command:

    sudo yum makecache
    
  3. Verify correct binding by running the following command:

    yum search habana
    

    This command searches for and lists all packages with the word Habana.

Firmware Installation:

To install the FW, run the following:

sudo yum install -y habanalabs-firmware

Driver Installation:

The habanalabs-dkms_all package installs the habanalabs, habanalabs_cn and habanalabs_en (Ethernet) drivers. If automation scripts are used, the scripts must be modified to load/unload the drivers.

  1. Run the below command to install all drivers:

    sudo yum install -y habanalabs
    
  2. Unload the drivers in this order - habanalabs, habanalabs_cn, habanalabs_en:

    sudo modprobe -r <driver name>
    
  3. Load the drivers in this order - habanalabs_en, habanalabs_cn, habanalabs:

    sudo modprobe <driver name>
    

Note

RHEL8.6 installation is available on Gaudi 2 only.

Package Retrieval:

  1. Create /etc/yum.repos.d/Habana-Vault.repo with the following content:

    [vault]
    
    name=Habana Vault
    
    baseurl=https://vault.habana.ai/artifactory/rhel/8/8.6
    
    enabled=1
    
    repo_gpgcheck=0
    
  2. Update YUM cache by running the following command:

    sudo yum makecache
    
  3. Verify correct binding by running the following command:

    yum search habana
    

    This command searches for and lists all packages with the word Habana.

  4. Reinstall the libarchive package by running the following command:

    sudo dnf install -y libarchive*
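
The RHEL and TencentOS repository files in this guide differ only in the baseurl path, so the file can be generated from that path. This is a sketch with a hypothetical function name; it does not cover the Amazon Linux 2 file, which carries extra gpgcheck and gpgkey lines.

```shell
# Hypothetical helper: print a Habana-Vault.repo file for a repository path
# such as rhel/8/8.6, rhel/9/9.2 or tencentos/3/3.1.
habana_repo_file() {
    cat <<EOF
[vault]
name=Habana Vault
baseurl=https://vault.habana.ai/artifactory/$1
enabled=1
repo_gpgcheck=0
EOF
}

# Usage sketch (needs root permission to write the file):
#   habana_repo_file rhel/8/8.6 | sudo tee /etc/yum.repos.d/Habana-Vault.repo
```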
    

Firmware Installation:

To install the FW, run the following:

sudo yum install -y habanalabs-firmware

Driver Installation:

The habanalabs-dkms_all package installs the habanalabs, habanalabs_cn, habanalabs_en (Ethernet) and habanalabs_ib drivers. If automation scripts are used, the scripts must be modified to load/unload the drivers.

  1. Run the below command to install all drivers:

    sudo yum install -y habanalabs
    
  2. Unload the drivers in this order - habanalabs, habanalabs_cn, habanalabs_en and habanalabs_ib:

    sudo modprobe -r <driver name>
    
  3. Load the drivers in this order - habanalabs_en and habanalabs_ib, habanalabs_cn, habanalabs:

    sudo modprobe <driver name>
    

Note

RHEL9.2 installation is available on Gaudi 2 only.

Package Retrieval:

  1. Create /etc/yum.repos.d/Habana-Vault.repo with the following content:

    [vault]
    
    name=Habana Vault
    
    baseurl=https://vault.habana.ai/artifactory/rhel/9/9.2
    
    enabled=1
    
    repo_gpgcheck=0
    
  2. Update YUM cache by running the following command:

    sudo yum makecache
    
  3. Verify correct binding by running the following command:

    yum search habana
    

    This command searches for and lists all packages with the word Habana.

  4. Reinstall the libarchive package by running the following command:

    sudo dnf install -y libarchive*
    

Firmware Installation:

To install the FW, run the following:

sudo yum install -y habanalabs-firmware

Driver Installation:

The habanalabs-dkms_all package installs the habanalabs, habanalabs_cn, habanalabs_en (Ethernet) and habanalabs_ib drivers. If automation scripts are used, the scripts must be modified to load/unload the drivers.

  1. Run the below command to install all drivers:

    sudo yum install -y habanalabs
    
  2. Unload the drivers in this order - habanalabs, habanalabs_cn, habanalabs_en and habanalabs_ib:

    sudo modprobe -r <driver name>
    
  3. Load the drivers in this order - habanalabs_en and habanalabs_ib, habanalabs_cn, habanalabs:

    sudo modprobe <driver name>
    

Note

Debian 10.10 installation is available on Gaudi 2 only.

Package Retrieval:

  1. Download and install the public key:

    curl -X GET https://vault.habana.ai/artifactory/api/gpg/key/public | sudo apt-key add --
    
  2. Get the name of the operating system:

    lsb_release -c | awk '{print $2}'
    
  3. Create an apt source file /etc/apt/sources.list.d/artifactory.list with the following content:

    deb https://vault.habana.ai/artifactory/debian <OS name from previous step> main

  4. Update Debian cache:

    sudo dpkg --configure -a
    
    sudo apt-get update
    

Firmware Installation:

To install the FW, run the following:

sudo apt install -y habanalabs-firmware

Driver Installation:

The habanalabs-dkms_all package installs the habanalabs, habanalabs_cn, habanalabs_en (Ethernet) and habanalabs_ib drivers. If automation scripts are used, the scripts must be modified to load/unload the drivers.

Note

habanalabs_ib driver is available on Gaudi 2 only.

  1. Run the below command to install all drivers:

    sudo apt install -y habanalabs-dkms
    
  2. Unload the drivers in this order - habanalabs, habanalabs_cn, habanalabs_en and habanalabs_ib:

    sudo modprobe -r <driver name>
    
  3. Load the drivers in this order - habanalabs_en and habanalabs_ib, habanalabs_cn, habanalabs:

    sudo modprobe <driver name>
    

Note

TencentOS 3.1 installation is available on Gaudi 2 only.

Package Retrieval:

  1. Create /etc/yum.repos.d/Habana-Vault.repo with the following content:

    [vault]
    
    name=Habana Vault
    
    baseurl=https://vault.habana.ai/artifactory/tencentos/3/3.1
    
    enabled=1
    
    repo_gpgcheck=0
    
  2. Update YUM cache by running the following command:

    sudo yum makecache
    
  3. Verify correct binding by running the following command:

    yum search habana
    

    This command searches for and lists all packages with the word Habana.

  4. Reinstall the libarchive package by running the following command:

    sudo dnf install -y libarchive*
    

Firmware Installation:

To install the FW, run the following:

sudo yum install -y habanalabs-firmware

Driver Installation:

The habanalabs-dkms_all package installs the habanalabs, habanalabs_cn, habanalabs_en (Ethernet) and habanalabs_ib drivers. If automation scripts are used, the scripts must be modified to load/unload the drivers.

  1. Run the below command to install all drivers:

    sudo yum install -y habanalabs
    
  2. Unload the drivers in this order - habanalabs, habanalabs_cn, habanalabs_en and habanalabs_ib:

    sudo modprobe -r <driver name>
    
  3. Load the drivers in this order - habanalabs_en and habanalabs_ib, habanalabs_cn, habanalabs:

    sudo modprobe <driver name>
    

Set up Container Usage

To run containers, make sure to install and set up habanalabs-container-runtime as detailed in the below sections.

Install Container Runtime

The habanalabs-container-runtime is a modified runc that installs the container runtime library. This provides you the ability to select the devices to be mounted in the container. You only need to specify the indices of the devices for the container, and the container runtime will handle the rest. The habanalabs-container-runtime can support both Docker and Kubernetes.

Note

Important: If you run container runtime in Kubernetes with habana-k8s-device-plugin, it is required to uncomment the following lines in config.toml to avoid failure:

  • #visible_devices_all_as_default = false

  • #mount_accelerators = false
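
Uncommenting the two settings can be done with sed. The following is a sketch: the function name is hypothetical, and the path to config.toml must be supplied by the caller because this guide does not state where the file is located. The function prints the edited content and leaves the file itself unchanged, so the result can be reviewed first.

```shell
# Hypothetical helper: print the given config.toml with the two settings
# uncommented. Other commented lines are left as they are.
uncomment_runtime_settings() {
    sed -e 's/^\([[:space:]]*\)#[[:space:]]*\(visible_devices_all_as_default[[:space:]]*=\)/\1\2/' \
        -e 's/^\([[:space:]]*\)#[[:space:]]*\(mount_accelerators[[:space:]]*=\)/\1\2/' \
        "$1"
}
```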

Package Retrieval:

  1. Download and install the public key:

    curl -X GET https://vault.habana.ai/artifactory/api/gpg/key/public | sudo apt-key add --
    
  2. Get the name of the operating system:

    lsb_release -c | awk '{print $2}'
    
  3. Create an apt source file /etc/apt/sources.list.d/artifactory.list with the following content:

    deb https://vault.habana.ai/artifactory/debian <OS name from previous step> main

  4. Update Debian cache:

    sudo dpkg --configure -a
    
    sudo apt-get update
    

Install habanalabs-container-runtime:

Install the habanalabs-container-runtime package:

sudo apt install -y habanalabs-container-runtime

Note

Amazon Linux 2 installation is available on first-gen Gaudi only.

Package Retrieval:

  1. Create /etc/yum.repos.d/Habana-Vault.repo with the following content:

    [vault]
    
    name=Habana Vault
    
    baseurl=https://vault.habana.ai/artifactory/AmazonLinux2
    
    enabled=1
    
    gpgcheck=0
    
    gpgkey=https://vault.habana.ai/artifactory/AmazonLinux2/repodata/repomod.xml.key
    
    repo_gpgcheck=0
    
  2. Update YUM cache by running the following command:

    sudo yum makecache
    
  3. Verify correct binding by running the following command:

    yum search habana
    

    This command searches for and lists all packages with the word Habana.

Install habanalabs-container-runtime:

Install the habanalabs-container-runtime package:

sudo yum install -y habanalabs-container-runtime

Note

RHEL8.6 installation is available on Gaudi 2 only.

Package Retrieval:

  1. Create /etc/yum.repos.d/Habana-Vault.repo with the following content:

    [vault]
    
    name=Habana Vault
    
    baseurl=https://vault.habana.ai/artifactory/rhel/8/8.6
    
    enabled=1
    
    repo_gpgcheck=0
    
  2. Update YUM cache by running the following command:

    sudo yum makecache
    
  3. Verify correct binding by running the following command:

    yum search habana
    

    This command searches for and lists all packages with the word Habana.

  4. Reinstall the libarchive package by running the following command:

    sudo dnf install -y libarchive*
    

Install habanalabs-container-runtime:

Install the habanalabs-container-runtime package:

sudo yum install -y habanalabs-container-runtime

Note

RHEL9.2 installation is available on Gaudi 2 only.

Package Retrieval:

  1. Create /etc/yum.repos.d/Habana-Vault.repo with the following content:

    [vault]
    
    name=Habana Vault
    
    baseurl=https://vault.habana.ai/artifactory/rhel/9/9.2
    
    enabled=1
    
    repo_gpgcheck=0
    
  2. Update YUM cache by running the following command:

    sudo yum makecache
    
  3. Verify correct binding by running the following command:

    yum search habana
    

    This command searches for and lists all packages with the word Habana.

  4. Reinstall the libarchive package by running the following command:

    sudo dnf install -y libarchive*
    

Install habanalabs-container-runtime:

Install the habanalabs-container-runtime package:

sudo yum install -y habanalabs-container-runtime

Note

Debian 10.10 installation is available on Gaudi 2 only.

Package Retrieval:

  1. Download and install the public key:

    curl -X GET https://vault.habana.ai/artifactory/api/gpg/key/public | sudo apt-key add --
    
  2. Get the name of the operating system:

    lsb_release -c | awk '{print $2}'
    
  3. Create an apt source file /etc/apt/sources.list.d/artifactory.list with the following content:

    deb https://vault.habana.ai/artifactory/debian <OS name from previous step> main

  4. Update Debian cache:

    sudo dpkg --configure -a
    
    sudo apt-get update
    

Install habanalabs-container-runtime:

Install the habanalabs-container-runtime package:

sudo apt install -y habanalabs-container-runtime

Note

TencentOS 3.1 installation is available on Gaudi 2 only.

Package Retrieval:

  1. Create /etc/yum.repos.d/Habana-Vault.repo with the following content:

    [vault]
    
    name=Habana Vault
    
    baseurl=https://vault.habana.ai/artifactory/tencentos/3/3.1
    
    enabled=1
    
    repo_gpgcheck=0
    
  2. Update YUM cache by running the following command:

    sudo yum makecache
    
  3. Verify correct binding by running the following command:

    yum search habana
    

    This command searches for and lists all packages with the word Habana.

  4. Reinstall the libarchive package by running the following command:

    sudo dnf install -y libarchive*
    

Install habanalabs-container-runtime:

Install the habanalabs-container-runtime package:

sudo yum install -y habanalabs-container-runtime

Set up Container Runtime

To register the habana runtime, use the method below that is best suited to your environment. You might need to merge the new argument with your existing configuration.

Note

As of Kubernetes 1.20, support for Docker has been deprecated.

  1. Register habana runtime by adding the following to /etc/docker/daemon.json:

    sudo tee /etc/docker/daemon.json <<EOF
    {
       "runtimes": {
          "habana": {
                "path": "/usr/bin/habana-container-runtime",
                "runtimeArgs": []
          }
       }
    }
    EOF
    
  2. (Optional) Reconfigure the default runtime by adding the following to /etc/docker/daemon.json:

    "default-runtime": "habana"
    

    Your code should look similar to this:

    {
       "default-runtime": "habana",
       "runtimes": {
          "habana": {
             "path": "/usr/bin/habana-container-runtime",
             "runtimeArgs": []
          }
       }
    }
    
  3. Restart Docker:

    sudo systemctl restart docker
    

If a host machine has eight Gaudi devices, you can mount all of them using the environment variable HABANA_VISIBLE_DEVICES=all. The following example shows the usage:

docker run --rm --runtime=habana -e HABANA_VISIBLE_DEVICES=all {docker image} /bin/bash -c "ls /dev/ac*"
accel0
accel1
accel2
accel3
accel4
accel5
accel6
accel7
accel_controlD0
accel_controlD1
accel_controlD2
accel_controlD3
accel_controlD4
accel_controlD5
accel_controlD6
accel_controlD7

This variable controls which Gaudi devices will be made accessible inside the container. Possible values:

  • 0,1,2 … - A comma-separated list of index(es).

  • all - All Gaudi devices are accessible. This is the default value.
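
A launch script can validate the value before passing it to docker run. The sketch below accepts only the two forms listed above, either all or a comma-separated list of indices; the function name is hypothetical, and the check does not verify that the indices exist on the host.

```shell
# Hypothetical helper: succeed if the argument is "all" or a
# comma-separated list of non-negative integers such as 0,1,2.
is_valid_visible_devices() {
    case "$1" in
        all) return 0 ;;
    esac
    printf '%s\n' "$1" | grep -Eq '^[0-9]+(,[0-9]+)*$'
}
```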

  1. Register habana runtime by writing the following to /etc/containerd/config.toml:

    sudo tee /etc/containerd/config.toml <<EOF
    disabled_plugins = []
    version = 2
    
    [plugins]
      [plugins."io.containerd.grpc.v1.cri"]
        [plugins."io.containerd.grpc.v1.cri".containerd]
          default_runtime_name = "habana"
          [plugins."io.containerd.grpc.v1.cri".containerd.runtimes]
            [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.habana]
              runtime_type = "io.containerd.runc.v2"
              [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.habana.options]
                BinaryName = "/usr/bin/habana-container-runtime"
      [plugins."io.containerd.runtime.v1.linux"]
        runtime = "habana-container-runtime"
    EOF
    
  2. Restart containerd:

    sudo systemctl restart containerd
    
  1. Create a new configuration file at /etc/crio/crio.conf.d/99-habana-ai.conf:

    [crio.runtime]
    default_runtime = "habana-ai"
    
    [crio.runtime.runtimes.habana-ai]
    runtime_path = "/usr/local/habana/bin/habana-container-runtime"
    monitor_env = [
            "PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin",
    ]
    
  2. Restart the CRI-O service:

    systemctl restart crio.service
    

Pull Prebuilt Containers

Prebuilt containers are provided in:

  • Intel Gaudi vault

  • Amazon ECR Public Library

  • AWS Deep Learning Containers (DLC)

Pull and Launch Docker Image - Intel Gaudi Vault

Note

Before running Docker, make sure to map the dataset as detailed in Map Dataset to Docker.

Use the commands below to pull and run the Docker image. Update the commands with the required operating system. See the Support Matrix for a list of supported operating systems:

docker pull vault.habana.ai/gaudi-docker/1.16.0/${OS}/habanalabs/pytorch-installer-2.2.2:latest
docker run -it --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --net=host --ipc=host vault.habana.ai/gaudi-docker/1.16.0/${OS}/habanalabs/pytorch-installer-2.2.2:latest

Note

  • Include --ipc=host in the docker run command for the Docker images. This is required for distributed training using the Habana Collective Communication Library (HCCL), because it allows re-use of host shared memory for best performance.

  • To run the Docker image with a partial number of the supplied Gaudi devices, make sure to set the device to module mapping correctly. See Multiple Dockers Each with a Single Workload for further details.
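
The image name in the commands above is built from three parts: the Intel Gaudi software release, the operating system, and the PyTorch version. The following sketch composes it from arguments; the function name is hypothetical, and valid combinations must still be taken from the Support Matrix.

```shell
# Hypothetical helper: print the vault image name for a software release,
# an operating system and a PyTorch version.
gaudi_pytorch_image() {
    printf 'vault.habana.ai/gaudi-docker/%s/%s/habanalabs/pytorch-installer-%s:latest\n' "$1" "$2" "$3"
}

# Usage sketch:
#   docker pull "$(gaudi_pytorch_image 1.16.0 ubuntu22.04 2.2.2)"
```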

AWS Deep Learning Containers

To set up and use AWS Deep Learning containers, follow the instructions detailed in AWS Available Deep Learning Containers Images.

Build Docker Images from Intel Gaudi Dockerfiles

  1. Download Dockerfiles and build script from the Setup and Install Repo to a local directory.

  2. Run the build script to generate a Docker image:

    ./docker_build.sh mode [pytorch] os [ubuntu22.04,amzn2,rhel8.6,rhel9.2,tencentos3.1] framework_version
    

    For example:

    ./docker_build.sh pytorch ubuntu22.04 2.2.2

Launch Docker Image

Note

Before running Docker, make sure to map the dataset as detailed in Map Dataset to Docker.

Use the command below to launch the Docker image. Update the command with the required operating system. See the Support Matrix for a list of supported operating systems.

docker run -it --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --net=host --ipc=host vault.habana.ai/gaudi-docker/1.16.0/${OS}/habanalabs/pytorch-installer-2.2.2:latest

Map Dataset to Docker

Make sure to download the dataset prior to running Docker and mount the location of your dataset to the Docker by adding the below flag. For example, host dataset location /opt/datasets/imagenet mounts to /datasets/imagenet inside the Docker:

-v /opt/datasets/imagenet:/datasets/imagenet

Note

OPTIONAL: Add the following flag to mount a local host share folder to the Docker in order to transfer files from Docker:

-v $HOME/shared:/root/shared
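
Putting the pieces together, the dataset mount is added to the docker run command shown in the launch sections. The sketch below prints the full command instead of running it, so the flags can be checked first; the function name is hypothetical.

```shell
# Hypothetical helper: print a docker run command that mounts a host
# dataset folder into the container.
#   $1 = image name, $2 = dataset folder on the host, $3 = path in the container
print_docker_run() {
    image=$1
    dataset_dir=$2
    mount_point=$3
    printf 'docker run -it --runtime=habana -e HABANA_VISIBLE_DEVICES=all'
    printf ' -e OMPI_MCA_btl_vader_single_copy_mechanism=none'
    printf ' --cap-add=sys_nice --net=host --ipc=host'
    printf ' -v %s:%s' "$dataset_dir" "$mount_point"
    printf ' %s\n' "$image"
}
```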

Set up Python for Models

Using your own models requires setting Python 3.8 as the default Python version. If Python 3.8 is not the default version, replace any call to the python command in your model with $PYTHON and define the environment variable as follows:

export PYTHON=/usr/bin/python3.8

Running models from the Intel Gaudi Model References GitHub repository requires the PYTHON environment variable to match the supported Python release:

export PYTHON=/usr/bin/<python version>
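
In scripts, a small wrapper makes the substitution explicit and fails early when PYTHON is not set. This is a sketch: the function name run_python and the script name train.py in the usage comment are hypothetical.

```shell
# Hypothetical helper: run the interpreter named by PYTHON with the given
# arguments, and stop with an error message if PYTHON is unset or empty.
run_python() {
    "${PYTHON:?set PYTHON to the supported interpreter first}" "$@"
}

# Usage sketch:
#   export PYTHON=/usr/bin/python3.8
#   run_python train.py
```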

Note

The Python version depends on the operating system. Refer to the Support Matrix for a full list of supported operating systems and Python versions.

If you want to resume the System level installation, see Environment Variables and Configurations Update.