AWS Deep Learning AMI (DLAMI) Installation

When using the AWS DLAMI, the environment comes pre-installed: the image contains the SynapseAI software stack and the TensorFlow/PyTorch framework. Loading additional software or container images is optional.

Objective: Use TensorFlow or PyTorch on DLAMI

Steps:

  1. Set up Python for Models

  2. Run models using Habana Model-References

Objective: Run Using Containers on DLAMI (Optional)

Steps:

  1. Set up Container Usage

  2. Pull Prebuilt Containers or Build Docker Images from Habana Dockerfiles

  3. Run models using Habana Model-References

Note

Before installing the packages and Docker images below, review the currently supported versions and operating systems listed in the Support Matrix.

Set up Python for Models

Using your own models requires setting Python 3.8 as the default Python version. If Python 3.8 is not the default version, replace any call to the python command in your model with $PYTHON and define the environment variable as follows:

export PYTHON=/usr/bin/python3.8

Running models from Habana Model-References requires the PYTHON environment variable to match the supported Python release:

export PYTHON=/usr/bin/<python version>

Note

Python 3.8 is the supported Python release for all operating systems except Ubuntu 22.04. See the versions listed in the Support Matrix.
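If you set PYTHON by hand, a mistyped path only surfaces once a model script fails to launch. The sketch below, which assumes a POSIX-compatible shell such as Bash, checks the path before exporting it. set_python is a hypothetical helper name, not part of SynapseAI or Model-References.

```shell
# Hypothetical helper: export PYTHON only if the given path is an executable file,
# so a typo in the interpreter path fails here rather than midway through a model run.
set_python() {
  local candidate="$1"
  if [ ! -f "$candidate" ] || [ ! -x "$candidate" ]; then
    echo "error: '$candidate' is not an executable file" >&2
    return 1
  fi
  export PYTHON="$candidate"
}

# Example (assumes Python 3.8 is installed at this path):
#   set_python /usr/bin/python3.8 && "$PYTHON" --version
```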

Run Using Containers

To run using containers, first complete the steps in Set up Container Usage.

Set up Container Usage

To run containers, install and set up the container runtime as detailed in the following sections.

Install Container Runtime

The container runtime is a modified runc that installs the container runtime library. It lets you select which devices are mounted in the container: you specify only the device indices, and the container runtime handles the rest. The container runtime supports both Docker and Kubernetes.

Ubuntu 20.04 and Ubuntu 22.04:

Package Retrieval:

  1. Download and install the public key:

curl -X GET https://vault.habana.ai/artifactory/api/gpg/key/public | sudo apt-key add -

  2. Get the name of the operating system:

lsb_release -c | awk '{print $2}'

  3. Create an apt source file /etc/apt/sources.list.d/artifactory.list with the following content:

deb https://vault.habana.ai/artifactory/debian <OS name from previous step> main

  4. Update the Debian package cache:

sudo dpkg --configure -a

sudo apt-get update
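Steps 2 and 3 can be combined so the codename is never copied by hand. This is a minimal sketch assuming a POSIX shell; habana_apt_source_line is a hypothetical helper. On Ubuntu 20.04, step 2 prints focal, so the generated line would be deb https://vault.habana.ai/artifactory/debian focal main.

```shell
# Hypothetical helper: print the apt source line for a given OS codename
# (the value printed in step 2).
habana_apt_source_line() {
  local codename="$1"
  if [ -z "$codename" ]; then
    echo "error: empty OS codename" >&2
    return 1
  fi
  printf 'deb https://vault.habana.ai/artifactory/debian %s main\n' "$codename"
}

# Example on an Ubuntu host (requires lsb_release and sudo); the file is only
# written if a codename was found:
#   line=$(habana_apt_source_line "$(lsb_release -c | awk '{print $2}')") &&
#     echo "$line" | sudo tee /etc/apt/sources.list.d/artifactory.list
```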

Install habanalabs-container-runtime:

Install the habanalabs-container-runtime package:

sudo apt install -y habanalabs-container-runtime


Amazon Linux 2:

Package Retrieval:

  1. Create /etc/yum.repos.d/Habana-Vault.repo with the following content:

[vault]
name=Habana Vault
baseurl=https://vault.habana.ai/artifactory/AmazonLinux2
enabled=1
gpgcheck=0
gpgkey=https://vault.habana.ai/artifactory/AmazonLinux2/repodata/repomod.xml.key
repo_gpgcheck=0

  2. Update the YUM cache by running the following command:

sudo yum makecache

  3. Verify that the repository is configured correctly by running the following command:

yum search habana

This searches for and lists all packages matching the word habana.
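Rather than typing the repository file in an editor, you can generate it from the shell. The sketch below writes exactly the content from step 1 to a path of your choice. write_habana_vault_repo_amzn2 is a hypothetical helper name, and installing the file through a temporary copy is only one possible approach.

```shell
# Hypothetical helper: write the Amazon Linux 2 Habana Vault repo definition
# from step 1 to the given path. Writing to a temporary path first lets you
# inspect the file before installing it with sudo.
write_habana_vault_repo_amzn2() {
  local dest="$1"
  cat > "$dest" <<'EOF'
[vault]
name=Habana Vault
baseurl=https://vault.habana.ai/artifactory/AmazonLinux2
enabled=1
gpgcheck=0
gpgkey=https://vault.habana.ai/artifactory/AmazonLinux2/repodata/repomod.xml.key
repo_gpgcheck=0
EOF
}

# Example:
#   tmp=$(mktemp)
#   write_habana_vault_repo_amzn2 "$tmp"
#   sudo install -m 0644 "$tmp" /etc/yum.repos.d/Habana-Vault.repo
#   sudo yum makecache
```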

Install habanalabs-container-runtime:

Install the habanalabs-container-runtime package:

sudo yum install -y habanalabs-container-runtime

RHEL 8.6:

Package Retrieval:

  1. Create /etc/yum.repos.d/Habana-Vault.repo with the following content:

[vault]
name=Habana Vault
baseurl=https://vault.habana.ai/artifactory/rhel/8/8.6
enabled=1
repo_gpgcheck=0

  2. Update the YUM cache by running the following command:

sudo yum makecache

  3. Verify that the repository is configured correctly by running the following command:

yum search habana

This searches for and lists all packages matching the word habana.

  4. Reinstall the libarchive package by running the following command:

sudo dnf install -y 'libarchive*'

Install habanalabs-container-runtime:

Install the habanalabs-container-runtime package:

sudo yum install -y habanalabs-container-runtime

Set up Container Runtime

To register the habana runtime, use the method below that is best suited to your environment. You might need to merge the new argument with your existing configuration.

Note

As of Kubernetes 1.20, support for Docker as a container runtime has been deprecated.

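Docker Engine:

  1. Register the Habana runtime by adding the following to /etc/docker/daemon.json. Note that this command overwrites the file; if it already contains other settings, merge the "runtimes" entry into it instead: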

    sudo tee /etc/docker/daemon.json <<EOF
    {
       "runtimes": {
          "habana": {
                "path": "/usr/bin/habana-container-runtime",
                "runtimeArgs": []
          }
       }
    }
    EOF
    
  2. (Optional) For Kubernetes, reconfigure the default runtime by adding the following to /etc/docker/daemon.json:

    "default-runtime": "habana"
    

    It will look similar to this:

    {
       "default-runtime": "habana",
       "runtimes": {
          "habana": {
             "path": "/usr/bin/habana-container-runtime",
             "runtimeArgs": []
          }
       }
    }
    
  3. Restart Docker:

    sudo systemctl restart docker
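The tee command in step 1 replaces /etc/docker/daemon.json entirely. If the file already holds other settings, one way to merge in the Habana runtime is with jq, assuming jq is installed on the host. merge_habana_runtime is a hypothetical helper; for the Kubernetes case in step 2 you would also add "default-runtime": "habana" to the merged object.

```shell
# Hypothetical helper: merge the Habana runtime entry into an existing Docker
# daemon config. Assumes jq is installed; jq's "*" operator merges objects
# recursively, so keys already present in the file are kept.
# src and dest must be different files, because the redirect truncates dest.
merge_habana_runtime() {
  local src="$1" dest="$2"
  jq '. * {"runtimes": {"habana": {"path": "/usr/bin/habana-container-runtime", "runtimeArgs": []}}}' \
    "$src" > "$dest"
}

# Example (back up the original, install the merged file, restart Docker):
#   sudo cp /etc/docker/daemon.json /etc/docker/daemon.json.bak
#   merge_habana_runtime /etc/docker/daemon.json /tmp/daemon.json
#   sudo cp /tmp/daemon.json /etc/docker/daemon.json
#   sudo systemctl restart docker
```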
    

If a host machine has eight Habana devices, you can mount all of them by setting the environment variable HABANA_VISIBLE_DEVICES=all, as in the following example:

docker run --rm --runtime=habana -e HABANA_VISIBLE_DEVICES=all {docker image} /bin/bash -c "ls /dev/hl*"
/dev/hl0
/dev/hl1
/dev/hl2
/dev/hl3
/dev/hl4
/dev/hl5
/dev/hl6
/dev/hl7
/dev/hl_controlD0
/dev/hl_controlD1
/dev/hl_controlD2
/dev/hl_controlD3
/dev/hl_controlD4
/dev/hl_controlD5
/dev/hl_controlD6
/dev/hl_controlD7

This variable controls which Habana devices are accessible inside the container. Possible values:

  • 0,1,2 … - A comma-separated list of device indices.

  • all - All Habana devices are accessible. This is the default value.
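A typo in HABANA_VISIBLE_DEVICES is easy to miss. The POSIX shell sketch below rejects values that are neither all nor a comma-separated list of indices. is_valid_visible_devices is a hypothetical name, and the check covers format only, not whether those devices exist on the host.

```shell
# Hypothetical helper: accept "all" or a comma-separated list of device
# indices such as 0,1,2. Format check only; it does not query the hardware.
is_valid_visible_devices() {
  case "$1" in
    all) return 0 ;;
    ''|,*|*,|*,,*) return 1 ;;   # empty, leading/trailing, or doubled commas
    *[!0-9,]*) return 1 ;;       # any character other than digits and commas
    *) return 0 ;;
  esac
}

# Example: expose only the first two devices to the container.
#   devs=0,1
#   is_valid_visible_devices "$devs" &&
#     docker run --rm --runtime=habana -e HABANA_VISIBLE_DEVICES="$devs" {docker image} /bin/bash -c "ls /dev/hl*"
```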

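containerd:

  1. Register the Habana runtime by writing the following to /etc/containerd/config.toml. Note that this command overwrites the file: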

    sudo tee /etc/containerd/config.toml <<EOF
    disabled_plugins = []
    version = 2
    
       [plugins]
       [plugins."io.containerd.grpc.v1.cri"]
          [plugins."io.containerd.grpc.v1.cri".containerd]
             default_runtime_name = "habana"
             [plugins."io.containerd.grpc.v1.cri".containerd.runtimes]
             [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.habana]
                runtime_type = "io.containerd.runc.v2"
                [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.habana.options]
                   BinaryName = "/usr/bin/habana-container-runtime"
       [plugins."io.containerd.runtime.v1.linux"]
          runtime = "habana-container-runtime"
    EOF
    
  2. Restart containerd:

    sudo systemctl restart containerd
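After restarting containerd, you can check that the runtime was registered by inspecting the configuration containerd actually loaded. This sketch assumes containerd config dump prints the merged TOML with key = "value" lines like the file above. containerd_uses_habana is a hypothetical helper that pattern-matches two keys rather than parsing TOML.

```shell
# Hypothetical helper: read a containerd config on stdin and succeed only if it
# sets "habana" as the default CRI runtime and points BinaryName at the
# Habana container runtime binary. Simple line matching, not a TOML parser.
containerd_uses_habana() {
  local cfg
  cfg=$(cat)
  printf '%s\n' "$cfg" | grep -Eq '^[[:space:]]*default_runtime_name[[:space:]]*=[[:space:]]*"habana"' &&
    printf '%s\n' "$cfg" | grep -Eq '^[[:space:]]*BinaryName[[:space:]]*=[[:space:]]*"/usr/bin/habana-container-runtime"'
}

# Example on a host:
#   sudo containerd config dump | containerd_uses_habana && echo "habana runtime registered"
```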
    

Pull Prebuilt Containers

Prebuilt containers are provided in:

  • Habana Vault

  • Amazon ECR Public Library

  • AWS Deep Learning Containers (DLC)

Pull and Launch Docker Image - Habana Vault

Note

Before running Docker, map the dataset as detailed in Map Dataset to Docker.

To pull and run the Habana Docker images, use the following commands. Set the parameters listed in the following table to select the desired configuration.

Parameter      Description                  Values
$OS            Operating System of Image    [ubuntu20.04, ubuntu22.04, amzn2, rhel8.6]
$TF_VERSION    Desired TensorFlow Version   [2.12.0]
$PT_VERSION    Desired PyTorch Version      [2.0.1]

Note

Include --ipc=host in the docker run command for PyTorch Docker images. This is required for distributed training using the Habana Collective Communication Library (HCCL), as it allows reuse of host shared memory for best performance.

TensorFlow:

    docker pull vault.habana.ai/gaudi-docker/1.10.0/${OS}/habanalabs/tensorflow-installer-tf-cpu-${TF_VERSION}:latest
    docker run -it --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --net=host vault.habana.ai/gaudi-docker/1.10.0/${OS}/habanalabs/tensorflow-installer-tf-cpu-${TF_VERSION}:latest

PyTorch:

    docker pull vault.habana.ai/gaudi-docker/1.10.0/${OS}/habanalabs/pytorch-installer-${PT_VERSION}:latest
    docker run -it --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --net=host --ipc=host vault.habana.ai/gaudi-docker/1.10.0/${OS}/habanalabs/pytorch-installer-${PT_VERSION}:latest
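Since the image paths differ only in framework, OS, and version, they can be generated instead of edited by hand. The sketch below follows the 1.10.0 naming used in the commands above. habana_image is a hypothetical helper, and other releases may use a different path layout.

```shell
# Hypothetical helper: print the Habana Vault image reference for a framework,
# OS, and framework version, following the 1.10.0 naming shown above.
habana_image() {
  local framework="$1" os="$2" version="$3"
  case "$os" in
    ubuntu20.04|ubuntu22.04|amzn2|rhel8.6) ;;
    *) echo "unsupported OS: $os" >&2; return 1 ;;
  esac
  case "$framework" in
    tensorflow) echo "vault.habana.ai/gaudi-docker/1.10.0/${os}/habanalabs/tensorflow-installer-tf-cpu-${version}:latest" ;;
    pytorch)    echo "vault.habana.ai/gaudi-docker/1.10.0/${os}/habanalabs/pytorch-installer-${version}:latest" ;;
    *) echo "unknown framework: $framework" >&2; return 1 ;;
  esac
}

# Example:
#   IMAGE=$(habana_image pytorch ubuntu20.04 2.0.1) && docker pull "$IMAGE"
#   docker run -it --runtime=habana -e HABANA_VISIBLE_DEVICES=all \
#     -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice \
#     --net=host --ipc=host "$IMAGE"
```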

AWS Deep Learning Containers

To set up and use AWS Deep Learning Containers, follow the instructions detailed in AWS Available Deep Learning Containers Images.

Build Docker Images from Habana Dockerfiles

  1. Download the Dockerfiles and build script from the Setup and Install Repo to a local directory.

  2. Run the build script to generate a Docker image:

./docker_build.sh <mode> <os> <tf_version>

where mode is tensorflow or pytorch, os is one of ubuntu20.04, ubuntu22.04, amzn2 or rhel8.6, and tf_version is the TensorFlow version.

For example:

./docker_build.sh tensorflow ubuntu20.04 2.12.0
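An image build takes a while, so it can help to reject an unsupported mode or OS before the script starts. The wrapper below is a hypothetical sketch: build_habana_image and its DRY_RUN switch are illustrative names, the accepted values are copied from the usage line above, and all arguments are passed to docker_build.sh unchanged.

```shell
# Hypothetical wrapper: validate mode and OS, then call ./docker_build.sh with
# the original arguments. Set DRY_RUN=1 to print the command instead of running it.
build_habana_image() {
  local mode="$1" os="$2"
  case "$mode" in
    tensorflow|pytorch) ;;
    *) echo "mode must be tensorflow or pytorch, got: $mode" >&2; return 1 ;;
  esac
  case "$os" in
    ubuntu20.04|ubuntu22.04|amzn2|rhel8.6) ;;
    *) echo "unsupported OS: $os" >&2; return 1 ;;
  esac
  if [ "${DRY_RUN:-0}" = 1 ]; then
    echo ./docker_build.sh "$@"
  else
    ./docker_build.sh "$@"
  fi
}

# Example:
#   DRY_RUN=1 build_habana_image tensorflow ubuntu20.04 2.12.0   # preview
#   build_habana_image tensorflow ubuntu20.04 2.12.0             # build
```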

Launch Docker Image that was Built

Note

Before running Docker, map the dataset as detailed in Map Dataset to Docker.

Launch the Docker image using the following commands. Set the parameters listed in the following table to select the desired configuration.

Parameter      Description                  Values
$OS            Operating System of Image    [ubuntu20.04, ubuntu22.04, amzn2, rhel8.6]
$TF_VERSION    Desired TensorFlow Version   [2.12.0]
$PT_VERSION    Desired PyTorch Version      [2.0.1]

TensorFlow:

docker run -it --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --net=host vault.habana.ai/gaudi-docker/1.10.0/${OS}/habanalabs/tensorflow-installer-tf-cpu-${TF_VERSION}:latest

PyTorch:

docker run -it --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --net=host --ipc=host vault.habana.ai/gaudi-docker/1.10.0/${OS}/habanalabs/pytorch-installer-${PT_VERSION}:latest

Map Dataset to Docker

Download the dataset before running Docker, and mount its location into the container by adding the following flag to the docker run command. For example, this flag mounts the host dataset location /opt/datasets/imagenet to /datasets/imagenet inside the container:

-v /opt/datasets/imagenet:/datasets/imagenet

Note

Optional: to transfer files out of the container, add the following flag to mount a shared folder from the host:

-v $HOME/shared:/root/shared
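Docker creates a missing host path for a -v bind mount as an empty, root-owned directory, so a mistyped dataset path does not stop the container from starting. The sketch below checks the host directory first. dataset_volume is a hypothetical helper, and $IMAGE stands for one of the image references listed earlier in this section.

```shell
# Hypothetical helper: print a HOST:CONTAINER volume spec only if the host
# directory exists, so a typo fails before docker run silently creates an
# empty directory in its place.
dataset_volume() {
  local host_dir="$1" container_dir="$2"
  if [ ! -d "$host_dir" ]; then
    echo "dataset directory not found: $host_dir" >&2
    return 1
  fi
  printf '%s:%s\n' "$host_dir" "$container_dir"
}

# Example in a launch script:
#   vol=$(dataset_volume /opt/datasets/imagenet /datasets/imagenet) || exit 1
#   mkdir -p "$HOME/shared"
#   docker run -it --runtime=habana -e HABANA_VISIBLE_DEVICES=all \
#     -v "$vol" -v "$HOME/shared:/root/shared" "$IMAGE"
```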