AWS Deep Learning AMI (DLAMI) Installation

When using the AWS DLAMI, the environment comes pre-installed: the image contains the Intel Gaudi software stack and the PyTorch framework. Installing additional software or container images is optional.

| Objective | Steps |
| --- | --- |
| Use PyTorch on DLAMI | 1. Set up Python for Models; 2. Run models using the Intel Gaudi Model References GitHub repository |
| Run Using Containers on DLAMI (Optional) | 1. Set up Container Usage; 2. Pull Prebuilt Containers or Build Docker Images from Intel Gaudi Dockerfiles; 3. Run models using the Intel Gaudi Model References GitHub repository |

Note

Before installing the packages and Docker images below, review the currently supported versions and operating systems listed in the Support Matrix.

Set up Python for Models

Using your own models requires setting Python 3.8 as the default Python version. If Python 3.8 is not the default version, replace any call to the Python command in your model with $PYTHON and define the environment variable as follows:

```
export PYTHON=/usr/bin/python3.8
```

Running models from the Intel Gaudi Model References GitHub repository requires the PYTHON environment variable to match the supported Python release:

```
export PYTHON=/usr/bin/<python version>
```
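Because a PYTHON variable that points at the wrong release is easy to miss, it can help to check it before launching a model. The function below is a minimal sketch of such a check; its name, python_release_matches, is hypothetical and not part of the Intel Gaudi software. It only compares the major.minor release reported by `--version`.

```shell
# Hypothetical helper (not part of the Intel Gaudi software stack):
# succeeds only if the interpreter in $1 reports the major.minor release in $2.
python_release_matches() {
  local interpreter="$1" expected="$2" actual
  # "python3 --version" prints e.g. "Python 3.8.10"; keep only "3.8".
  actual="$("$interpreter" --version 2>&1 | awk '{print $2}' | cut -d. -f1,2)"
  [ "$actual" = "$expected" ]
}
```

For example: `export PYTHON=/usr/bin/python3.8` followed by `python_release_matches "$PYTHON" 3.8 || echo "PYTHON does not point to Python 3.8" >&2`.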

Note

Python 3.8 is the supported Python release for all operating systems except Ubuntu 22.04. Refer to the Support Matrix for a full list of supported operating systems and Python versions.

Run Using Containers

To run using containers, first complete the steps in Set up Container Usage.

Set up Container Usage

To run containers, install and set up habanalabs-container-runtime as detailed in the sections below.

Install Container Runtime

The habanalabs-container-runtime is a modified runc that installs the container runtime library. It lets you select the devices to be mounted in the container: you only need to specify the device indices, and the container runtime handles the rest. The habanalabs-container-runtime supports both Docker and Kubernetes.

Note

Important: If you run the container runtime in Kubernetes with the habana-k8s-device-plugin, you must uncomment the following lines in config.toml to avoid failures:

```
#visible_devices_all_as_default = false
#mount_accelerators = false
```
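The two lines can also be uncommented non-interactively. The sketch below takes the file path as an argument because its location is not stated above; /etc/habana-container-runtime/config.toml is an assumption to confirm on your system. The function name is hypothetical, and `sed -i` as used here is the GNU sed form.

```shell
# Hypothetical helper: uncomment the two Kubernetes-related settings in the
# config.toml passed as $1, editing it in place with GNU sed.
# All other lines are left untouched.
habana_enable_k8s_settings() {
  sed -i \
    -e 's/^#[[:space:]]*\(visible_devices_all_as_default = false\)/\1/' \
    -e 's/^#[[:space:]]*\(mount_accelerators = false\)/\1/' \
    "$1"
}
```

Editing the real file requires root, so call it from a root shell, for example `habana_enable_k8s_settings /etc/habana-container-runtime/config.toml` (path assumed).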

Ubuntu 22.04

Package Retrieval:

Download and install the public key:

```
curl -X GET https://vault.habana.ai/artifactory/api/gpg/key/public | sudo apt-key add --
```

Get the name of the operating system:

```
lsb_release -c | awk '{print $2}'
```

Create the apt source file /etc/apt/sources.list.d/artifactory.list containing the line `deb https://vault.habana.ai/artifactory/debian <OS name from previous step> main`.
Update Debian cache:

```
sudo dpkg --configure -a
sudo apt-get update
```
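The source entry from the steps above can be generated directly from the codename instead of being typed by hand. A minimal sketch; habana_apt_source_line is a hypothetical helper name, and on Ubuntu 22.04 the codename it receives is jammy.

```shell
# Hypothetical helper: print the Habana Vault apt source entry for the
# Ubuntu codename given in $1 (the value printed by lsb_release -c).
habana_apt_source_line() {
  printf 'deb https://vault.habana.ai/artifactory/debian %s main\n' "$1"
}
```

For example, `habana_apt_source_line "$(lsb_release -c | awk '{print $2}')" | sudo tee /etc/apt/sources.list.d/artifactory.list` writes the file, after which the cache update commands above apply unchanged.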

Install habanalabs-container-runtime:

Install the habanalabs-container-runtime package:

```
sudo apt install -y habanalabs-container-runtime
```

Amazon Linux 2

Package Retrieval:

Create /etc/yum.repos.d/Habana-Vault.repo with the following content:

```
[vault]
name=Habana Vault
baseurl=https://vault.habana.ai/artifactory/AmazonLinux2
enabled=1
gpgcheck=0
gpgkey=https://vault.habana.ai/artifactory/AmazonLinux2/repodata/repomod.xml.key
repo_gpgcheck=0
```

Update YUM cache by running the following command:

```
sudo yum makecache
```

Verify correct binding by running the following command:

```
yum search habana
```

This will search for and list all packages with the word Habana.
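As with the Docker and containerd configuration in Set up Container Runtime, the repository file can be written with tee and a here-document. The sketch below wraps the repository content listed above in a hypothetical habana_yum_repo function so that the text can be inspected before it is written.

```shell
# Hypothetical helper: print the Habana Vault repository definition
# for Amazon Linux 2, exactly as listed in the first step.
habana_yum_repo() {
  cat <<'EOF'
[vault]
name=Habana Vault
baseurl=https://vault.habana.ai/artifactory/AmazonLinux2
enabled=1
gpgcheck=0
gpgkey=https://vault.habana.ai/artifactory/AmazonLinux2/repodata/repomod.xml.key
repo_gpgcheck=0
EOF
}
```

For example, `habana_yum_repo | sudo tee /etc/yum.repos.d/Habana-Vault.repo` creates the file; then run `sudo yum makecache` as above.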

Install habanalabs-container-runtime:

Install the habanalabs-container-runtime package:

```
sudo yum install -y habanalabs-container-runtime
```

Set up Container Runtime

To register the habana runtime, use the method below that best suits your environment. You might need to merge the new settings with your existing configuration.

Docker Engine setup

Note

As of Kubernetes 1.20, support for Docker as a container runtime has been deprecated.

```
sudo tee /etc/docker/daemon.json <<EOF
{
   "runtimes": {
      "habana": {
         "path": "/usr/bin/habana-container-runtime",
         "runtimeArgs": []
      }
   }
}
EOF
```

(Optional) Reconfigure the default runtime by adding the following to /etc/docker/daemon.json:

```
"default-runtime": "habana"
```

It will look similar to this:

```
{
   "default-runtime": "habana",
   "runtimes": {
      "habana": {
         "path": "/usr/bin/habana-container-runtime",
         "runtimeArgs": []
      }
   }
}
```

Restart Docker:
```
sudo systemctl restart docker
```

If a host machine has eight Gaudi devices, you can mount all of them by setting the environment variable HABANA_VISIBLE_DEVICES=all, as in the following example:

```
docker run --rm --runtime=habana -e HABANA_VISIBLE_DEVICES=all {docker image} /bin/bash -c "ls /dev/ac*"
accel0
accel1
accel2
accel3
accel4
accel5
accel6
accel7
accel_controlD0
accel_controlD1
accel_controlD2
accel_controlD3
accel_controlD4
accel_controlD5
accel_controlD6
accel_controlD7
```

This variable controls which Intel Gaudi devices are accessible inside the container. Possible values:

0,1,2 … - A comma-separated list of device indices.
all - All Gaudi devices are accessible. This is the default value.
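The two accepted forms of HABANA_VISIBLE_DEVICES can be checked before a container is started. The wrapper below is a minimal sketch; run_with_gaudi_devices is a hypothetical name, and it only rejects values that are neither `all` nor a comma-separated list of indices, without checking whether those indices exist on the host.

```shell
# Hypothetical wrapper: run a throwaway container that sees only the Gaudi
# devices in $1 ("all" or a comma-separated index list such as 0,1,2).
# The remaining arguments are the image and the command to run.
run_with_gaudi_devices() {
  local devices="$1"
  case "$devices" in
    all) ;;
    # Reject empty values, non-digit characters, and stray commas.
    ''|*[!0-9,]*|,*|*,|*,,*)
      echo "invalid HABANA_VISIBLE_DEVICES value: '$devices'" >&2
      return 1
      ;;
  esac
  shift
  docker run --rm --runtime=habana -e "HABANA_VISIBLE_DEVICES=${devices}" "$@"
}
```

For example, `run_with_gaudi_devices 0,1 {docker image} /bin/bash -c "ls /dev/ac*"` requests only the first two devices.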

containerd setup

```
sudo tee /etc/containerd/config.toml <<EOF
disabled_plugins = []
version = 2

[plugins]
  [plugins."io.containerd.grpc.v1.cri"]
    [plugins."io.containerd.grpc.v1.cri".containerd]
      default_runtime_name = "habana"
      [plugins."io.containerd.grpc.v1.cri".containerd.runtimes]
        [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.habana]
          runtime_type = "io.containerd.runc.v2"
          [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.habana.options]
            BinaryName = "/usr/bin/habana-container-runtime"
  [plugins."io.containerd.runtime.v1.linux"]
    runtime = "habana-container-runtime"
EOF
```

Restart containerd:
```
sudo systemctl restart containerd
```

CRI-O setup

Create a new configuration file at /etc/crio/crio.conf.d/99-habana-ai.conf:

```
[crio.runtime]
default_runtime = "habana-ai"

[crio.runtime.runtimes.habana-ai]
runtime_path = "/usr/local/habana/bin/habana-container-runtime"
monitor_env = [
        "PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin",
]
```

Restart the CRI-O service: `sudo systemctl restart crio.service`.
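Following the tee and here-document pattern used for Docker and containerd, the CRI-O drop-in can be created from the command line. A minimal sketch; habana_crio_conf is a hypothetical helper that prints the drop-in content listed above unchanged.

```shell
# Hypothetical helper: print the CRI-O drop-in that registers the
# habana-ai runtime and makes it the default.
habana_crio_conf() {
  cat <<'EOF'
[crio.runtime]
default_runtime = "habana-ai"

[crio.runtime.runtimes.habana-ai]
runtime_path = "/usr/local/habana/bin/habana-container-runtime"
monitor_env = [
        "PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin",
]
EOF
}
```

For example, `habana_crio_conf | sudo tee /etc/crio/crio.conf.d/99-habana-ai.conf`, followed by the CRI-O restart.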

Pull Prebuilt Containers

Prebuilt containers are provided in:

Intel Gaudi Vault
Amazon ECR Public Gallery
AWS Deep Learning Containers (DLC)

Pull and Launch Docker Image - Intel Gaudi Vault

Note

Before running Docker, make sure to map the dataset as detailed in Map Dataset to Docker.

Use the commands below to pull and run the Docker image. Update the commands with the required operating system; see the Support Matrix for a list of supported operating systems:

```
docker pull vault.habana.ai/gaudi-docker/1.15.1/${OS}/habanalabs/pytorch-installer-2.2.0:latest

docker run -it --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --net=host --ipc=host vault.habana.ai/gaudi-docker/1.15.1/${OS}/habanalabs/pytorch-installer-2.2.0:latest
```

Note

Include --ipc=host in the Docker run command for PyTorch Docker images. This is required for distributed training using the Habana Collective Communication Library (HCCL), as it allows reuse of host shared memory for best performance.
To run the Docker image with only a subset of the available Gaudi devices, make sure to set the Device to Module mapping correctly. See Multiple Dockers Each with a Single Workload for further details.
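Typing the operating system and versions into the image path by hand is error-prone. The function below is a minimal sketch that assembles the Intel Gaudi Vault image reference used in the pull and run commands; gaudi_image_ref is a hypothetical name, the defaults are simply the 1.15.1 and PyTorch 2.2.0 values shown above, and OS names such as ubuntu22.04 are assumed to match the Support Matrix entries.

```shell
# Hypothetical helper: print the Intel Gaudi Vault image reference.
# $1 = operating system, $2 = Intel Gaudi software version (default 1.15.1),
# $3 = PyTorch version (default 2.2.0).
gaudi_image_ref() {
  local os="$1" gaudi_version="${2:-1.15.1}" pytorch_version="${3:-2.2.0}"
  printf 'vault.habana.ai/gaudi-docker/%s/%s/habanalabs/pytorch-installer-%s:latest\n' \
    "$gaudi_version" "$os" "$pytorch_version"
}
```

For example, `IMAGE="$(gaudi_image_ref ubuntu22.04)"` and then `docker pull "$IMAGE"`; the same variable can replace the image path at the end of the docker run command.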

Amazon ECR Public Gallery

To pull and run Docker images from the Amazon ECR Public Gallery, follow the steps detailed in Pulling a public image.

AWS Deep Learning Containers

To set up and use AWS Deep Learning Containers, follow the instructions detailed in AWS Available Deep Learning Containers Images.

Build Docker Images from Intel Gaudi Dockerfiles

Download the Dockerfiles and build script from the Setup and Install Repo to a local directory.
Run the build script to generate a Docker image:

```
./docker_build.sh mode [pytorch] os [ubuntu22.04,amzn2] framework_version
```

For example:

```
./docker_build.sh pytorch ubuntu22.04 2.2.0
```

Launch Docker Image that was Built

Note

Before running Docker, make sure to map the dataset as detailed in Map Dataset to Docker.

Use the command below to launch the Docker image. Update the command with the required operating system; see the Support Matrix for a list of supported operating systems:

```
docker run -it --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --net=host --ipc=host vault.habana.ai/gaudi-docker/1.15.1/${OS}/habanalabs/pytorch-installer-2.2.0:latest
```

Map Dataset to Docker

Download the dataset before running Docker, and mount the dataset location into the container by adding the flag below. For example, the following flag mounts the host dataset location /opt/datasets/imagenet to /datasets/imagenet inside the container:

```
-v /opt/datasets/imagenet:/datasets/imagenet
```

Note

OPTIONAL: Add the following flag to mount a shared host folder into the container so that you can transfer files out of it:

```
-v $HOME/shared:/root/shared
```
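Because the dataset has to be downloaded before the container starts, a launcher can refuse to run when the host directory is missing. The wrapper below is a minimal sketch around the docker run command shown in Launch Docker Image that was Built; run_gaudi_with_dataset is a hypothetical name, and it does not add the optional shared-folder flag.

```shell
# Hypothetical wrapper: refuse to start if the dataset has not been
# downloaded yet, then run the documented container command with the
# dataset bind-mounted.
# $1 = image, $2 = host dataset directory, $3 = path inside the container.
run_gaudi_with_dataset() {
  local image="$1" host_dir="$2" container_dir="$3"
  if [ ! -d "$host_dir" ]; then
    echo "dataset directory not found: $host_dir (download it first)" >&2
    return 1
  fi
  docker run -it --runtime=habana -e HABANA_VISIBLE_DEVICES=all \
    -e OMPI_MCA_btl_vader_single_copy_mechanism=none \
    --cap-add=sys_nice --net=host --ipc=host \
    -v "${host_dir}:${container_dir}" \
    "$image"
}
```

For example, `run_gaudi_with_dataset "vault.habana.ai/gaudi-docker/1.15.1/${OS}/habanalabs/pytorch-installer-2.2.0:latest" /opt/datasets/imagenet /datasets/imagenet`.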

Gaudi Documentation 1.15.1 documentation
