AWS Base OS AMI Installation
The following table outlines the steps required when using a standard (non-DL) AMI image to set up the EC2 instance.
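Objective | Steps |
---|---|
Run PyTorch on AWS Base OS AMI | Set Up Intel Gaudi SW Stack, Install PyTorch |
Run Using Containers on AWS Base OS AMI | Set up Intel Gaudi SW Stack, Set up Container Usage, Pull Prebuilt Containers |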
Note
Before installing the packages and Docker images below, review the currently supported versions and operating systems listed in the Support Matrix.
Run PyTorch on AWS Base OS AMI¶
Set Up Intel Gaudi SW Stack¶
Installing the package on a host with an internet connection allows the installer to download and install the dependencies required by the Intel® Gaudi® software package (using apt-get, yum install, pip install, etc.). The installation contains the following installers:
habanalabs-graph - installs the graph compiler and the run-time.
habanalabs-thunk - installs the Thunk library.
habanalabs-dkms - installs the habanalabs, habanalabs_cn, habanalabs_en and habanalabs_ib drivers. The habanalabs_ib driver is supported on Gaudi 2 only.
habanalabs-rdma-core - installs IBVerbs libraries which provide Intel Gaudi's libhlib along with libibverbs. The habanalabs-rdma-core package is supported on Gaudi 2 only.
habanalabs-firmware - installs the Gaudi firmware.
habanalabs-firmware-tools - installs various firmware tools (hlml, hl-smi, etc).
habanalabs-qual - installs the qualification application package. See Qualification Library.
habanalabs-container-runtime - installs the container runtime library.
Run the hl-smi tool to confirm the Intel Gaudi software version installed. You will need to use the correct version of the installer based on the version you are running. For example, if the installed version is 1.19.0, you should see the below:
HL-SMI Version: hl-1.19.0-XXXXXXX
Driver Version: 1.19.0-XXXXXX
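As a rough sketch of how a script could read the release number from that output, the helper below parses the HL-SMI Version line. The function name gaudi_version_from_hlsmi is hypothetical, and the parsing assumes the line format shown in the example above.

```shell
# Hypothetical helper (not part of the Intel Gaudi tools): reads hl-smi output
# on stdin and prints the release number, assuming a line of the form
# "HL-SMI Version: hl-<version>-<build>".
gaudi_version_from_hlsmi() {
    sed -n 's/^.*HL-SMI Version:[[:space:]]*hl-\([0-9][0-9.]*\)-.*$/\1/p' | head -n 1
}

# Example usage on a host where hl-smi is installed:
#   hl-smi | gaudi_version_from_hlsmi
```

The printed value can then be compared against the installer release before downloading it.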
Install the Intel Gaudi SW stack by running the following command:
wget -nv https://vault.habana.ai/artifactory/gaudi-installer/1.19.1/habanalabs-installer.sh
chmod +x habanalabs-installer.sh
./habanalabs-installer.sh install --type base
Note
The installation sets the number of huge pages automatically.
To install each installer separately, refer to the detailed instructions in Custom Driver and Software Installation.
This script supports fresh installations only. SW upgrades are not supported.
For further instructions on how to control the script attributes, refer to the help guide by running the following command:
./habanalabs-installer.sh --help
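The download URL embeds the release number, so a script can build it from a single variable. The sketch below assumes that other releases follow the same URL pattern as the 1.19.1 link used in this guide. The function name gaudi_installer_url is hypothetical.

```shell
# Hypothetical helper: builds the installer URL for a given release, following
# the URL pattern of the wget command in this guide. Whether every release is
# published under this pattern is an assumption.
gaudi_installer_url() {
    printf 'https://vault.habana.ai/artifactory/gaudi-installer/%s/habanalabs-installer.sh\n' "$1"
}

# Example usage (1.19.1 is the release used in this guide):
#   wget -nv "$(gaudi_installer_url 1.19.1)"
#   chmod +x habanalabs-installer.sh
#   ./habanalabs-installer.sh install --type base
```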
Install PyTorch¶
This section describes how to obtain and install the PyTorch software package. Follow the instructions below to install PyTorch packages on a bare metal platform or virtual machine.
Note
Installing PyTorch with Docker is the recommended installation method and does not require additional steps. For further details, refer to the Pull and Launch Docker Image - Intel Gaudi Vault section.
Intel Gaudi PyTorch packages consist of:
torch - PyTorch framework package with Intel Gaudi support.
habana-torch-plugin - Libraries and modules needed to execute PyTorch on single card, single-server and multi-server setup.
habana-torch-dataloader - Intel Gaudi multi-threaded dataloader package.
torchvision and torchaudio - Torchvision and Torchaudio packages compiled in the torch environment. No Gaudi-specific changes in these packages.
torch-tb-profiler - The TensorBoard plugin used to display Gaudi-specific information on TensorBoard.
Run the hl-smi tool to confirm the Intel Gaudi software version installed. You will need to use the correct version of the installer based on the version you are running. For example, if the installed version is 1.19.0, you should see the below:
HL-SMI Version: hl-1.19.0-XXXXXXX
Driver Version: 1.19.0-XXXXXX
Install the Intel Gaudi PyTorch environment by running the following command:
wget -nv https://vault.habana.ai/artifactory/gaudi-installer/1.19.1/habanalabs-installer.sh
chmod +x habanalabs-installer.sh
./habanalabs-installer.sh install -t dependencies
./habanalabs-installer.sh install --type pytorch --venv
Note
Installing dependencies requires sudo permission.
Check whether PyTorch is already installed in a path listed in the PYTHONPATH environment variable. If it is, uninstall it before proceeding or remove the path from PYTHONPATH.
This script supports fresh installations only. SW upgrades are not supported.
The --venv flag installs PyTorch inside a virtual environment. The default virtual environment folder is $HOME/habanalabs-venv.
To override the default, run the following command:
export HABANALABS_VIRTUAL_DIR=xxxx
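The two checks in the note above can be sketched as small shell helpers. Both function names are hypothetical. The fallback logic assumes that the installer uses HABANALABS_VIRTUAL_DIR when it is set and $HOME/habanalabs-venv otherwise, and looking for a torch directory is only a heuristic for finding an existing PyTorch installation.

```shell
# Hypothetical helper: prints the virtual environment folder, assuming the
# installer honours HABANALABS_VIRTUAL_DIR and otherwise uses the default.
gaudi_venv_dir() {
    printf '%s\n' "${HABANALABS_VIRTUAL_DIR:-$HOME/habanalabs-venv}"
}

# Hypothetical helper: prints every PYTHONPATH entry that contains a "torch"
# directory. The body runs in a subshell so the IFS change stays local.
find_torch_on_pythonpath() (
    IFS=:
    set -f
    for dir in ${PYTHONPATH:-}; do
        if [ -n "$dir" ] && [ -d "$dir/torch" ]; then
            printf '%s\n' "$dir"
        fi
    done
)

# Example usage:
#   find_torch_on_pythonpath            # run before installing; expect no output
#   . "$(gaudi_venv_dir)/bin/activate"  # run after installing with --venv
```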
Model References Requirements¶
Some PyTorch models need additional Python packages. These can be installed using the Python requirements files provided in the Model References repository. Refer to the Model References repository for detailed instructions on running PyTorch models.
Run Using Containers on AWS Base OS AMI¶
Set up Intel Gaudi SW Stack¶
Follow the steps below while running on Ubuntu 22.04:
Package Retrieval:
Download and install the public key:
curl -X GET https://vault.habana.ai/artifactory/api/gpg/key/public | sudo apt-key add --
Get the name of the operating system:
lsb_release -c | awk '{print $2}'
Create an apt source file /etc/apt/sources.list.d/artifactory.list with the following content:
deb https://vault.habana.ai/artifactory/debian <OS name from previous step> main
Update Debian cache:
sudo dpkg --configure -a
sudo apt-get update
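The package retrieval steps above can be sketched as a short script. The helper name gaudi_apt_source_line is hypothetical. It only formats the source line, and the commented usage shows one way to write it to the file named in the steps.

```shell
# Hypothetical helper: prints the apt source line for a given OS codename
# (for example "jammy" on Ubuntu 22.04).
gaudi_apt_source_line() {
    printf 'deb https://vault.habana.ai/artifactory/debian %s main\n' "$1"
}

# Example usage on the target host:
#   codename=$(lsb_release -c | awk '{print $2}')
#   gaudi_apt_source_line "$codename" | sudo tee /etc/apt/sources.list.d/artifactory.list
#   sudo dpkg --configure -a
#   sudo apt-get update
```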
Firmware Installation:
To install the FW, run the following:
sudo apt install -y habanalabs-firmware
Driver Installation:
The habanalabs-dkms_all package installs the habanalabs, habanalabs_cn, habanalabs_en (Ethernet) and habanalabs_ib drivers.
If automation scripts are used, the scripts must be modified to load/unload the drivers.
Note
The habanalabs_ib driver is available on Gaudi 2 only.
Run the below command to install all drivers:
sudo apt install -y habanalabs-dkms
Unload the drivers in this order - habanalabs, habanalabs_cn, habanalabs_en and habanalabs_ib:
sudo modprobe -r <driver name>
Load the drivers in this order - habanalabs_en and habanalabs_ib, habanalabs_cn, habanalabs:
sudo modprobe <driver name>
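For automation scripts, the two orderings above could be captured in one place. The sketch below only prints the modprobe commands in order so they can be reviewed before being run. The variable and function names are hypothetical, and the lists assume a system where the habanalabs_ib driver is present.

```shell
# Driver orders taken from the steps above. habanalabs_ib is available on
# Gaudi 2 only, so remove it from both lists on other systems.
GAUDI_UNLOAD_ORDER="habanalabs habanalabs_cn habanalabs_en habanalabs_ib"
GAUDI_LOAD_ORDER="habanalabs_en habanalabs_ib habanalabs_cn habanalabs"

# Hypothetical helper: $1 is "load" or "unload". Prints one modprobe command
# per driver, in the required order, without executing anything.
gaudi_modprobe_plan() {
    case "$1" in
        load)
            for drv in $GAUDI_LOAD_ORDER; do
                printf 'sudo modprobe %s\n' "$drv"
            done
            ;;
        unload)
            for drv in $GAUDI_UNLOAD_ORDER; do
                printf 'sudo modprobe -r %s\n' "$drv"
            done
            ;;
        *)
            echo "usage: gaudi_modprobe_plan load|unload" >&2
            return 1
            ;;
    esac
}

# Example usage: review the plan, then pipe it to sh to execute it.
#   gaudi_modprobe_plan unload
#   gaudi_modprobe_plan unload | sh
```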
Set up Container Usage¶
To run containers, make sure to install and set up habanalabs-container-runtime as detailed in the below sections.
Install Container Runtime¶
The habanalabs-container-runtime is a modified runc that installs the container runtime library. It lets you select the devices to be mounted in the container: you only need to specify the indices of the devices for the container, and the container runtime handles the rest. The habanalabs-container-runtime supports both Docker and Kubernetes.
Note
Important: If you run the container runtime in Kubernetes with habana-k8s-device-plugin, you must uncomment the following lines in config.toml to avoid failure:
#visible_devices_all_as_default = false
#mount_accelerators = false
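One possible way to uncomment these two lines from a script is a sed filter, sketched below. The function name is hypothetical, and the path to config.toml is passed as an argument because this guide does not state its location. The helper prints the edited content instead of changing the file in place.

```shell
# Hypothetical helper: prints the content of the given config.toml with the two
# settings above uncommented. All other lines are passed through unchanged.
uncomment_gaudi_k8s_settings() {
    sed -e 's/^#[[:space:]]*\(visible_devices_all_as_default[[:space:]]*=\)/\1/' \
        -e 's/^#[[:space:]]*\(mount_accelerators[[:space:]]*=\)/\1/' \
        "$1"
}

# Example usage (replace the path with the location of config.toml on your host):
#   uncomment_gaudi_k8s_settings /path/to/config.toml
```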
Follow the steps below while running on Ubuntu 22.04.
Package Retrieval:
Download and install the public key:
curl -X GET https://vault.habana.ai/artifactory/api/gpg/key/public | sudo apt-key add --
Get the name of the operating system:
lsb_release -c | awk '{print $2}'
Create an apt source file /etc/apt/sources.list.d/artifactory.list with the following content:
deb https://vault.habana.ai/artifactory/debian <OS name from previous step> main
Update Debian cache:
sudo dpkg --configure -a
sudo apt-get update
Install habanalabs-container-runtime:
Install the habanalabs-container-runtime package:
sudo apt install -y habanalabs-container-runtime
Set up Container Runtime¶
To register the habana runtime, use the method below that is best suited to your environment. You might need to merge the new argument with your existing configuration.
Note
As of Kubernetes 1.20, support for Docker has been deprecated.
Register the habana runtime by adding the following to /etc/docker/daemon.json:
sudo tee /etc/docker/daemon.json <<EOF
{
    "runtimes": {
        "habana": {
            "path": "/usr/bin/habana-container-runtime",
            "runtimeArgs": []
        }
    }
}
EOF
(Optional) Reconfigure the default runtime by adding the following to /etc/docker/daemon.json. Setting the default runtime as habana will route all your workloads through this runtime. However, any generic workloads will automatically be forwarded to a generic runtime. If you prefer not to set the default runtime, you can skip this step and override the runtime setting for the running container by using the --runtime flag in the docker run command:
"default-runtime": "habana"
Your configuration file should look similar to this:
{
    "default-runtime": "habana",
    "runtimes": {
        "habana": {
            "path": "/usr/bin/habana-container-runtime",
            "runtimeArgs": []
        }
    }
}
Restart Docker:
sudo systemctl restart docker
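Before restarting Docker, a script could check that the runtime entry is actually present in the configuration. The sketch below is a plain-text check, not a JSON parser, and the function name is hypothetical. It assumes the entry is written with the same key order as in the example above.

```shell
# Hypothetical helper: reads daemon.json content on stdin and succeeds when it
# contains a "habana" runtime entry pointing at /usr/bin/habana-container-runtime.
# Whitespace is stripped first so the layout of the file does not matter.
daemon_json_has_habana() {
    tr -d ' \n\t' | grep -qF '"habana":{"path":"/usr/bin/habana-container-runtime"'
}

# Example usage:
#   daemon_json_has_habana < /etc/docker/daemon.json && echo "habana runtime registered"
```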
If a host machine has eight Gaudi devices, you can mount all of them using the environment variable HABANA_VISIBLE_DEVICES=all. The below shows the usage example:
docker run --rm --runtime=habana -e HABANA_VISIBLE_DEVICES=all {docker image} /bin/bash -c "ls /dev/ac*"
accel0
accel1
accel2
accel3
accel4
accel5
accel6
accel7
accel_controlD0
accel_controlD1
accel_controlD2
accel_controlD3
accel_controlD4
accel_controlD5
accel_controlD6
accel_controlD7
This variable controls which Intel Gaudi cards will be made accessible inside the container. Possible values:
0,1,2 … - A comma-separated list of index(es).
all - All Gaudi devices are accessible. This is the default value.
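A launch script could validate the value before passing it to docker run. The sketch below accepts only the two forms listed above. The function name is hypothetical, and the check covers syntax only, not whether the indices exist on the host.

```shell
# Hypothetical helper: succeeds when $1 is "all" or a comma-separated list of
# non-negative integer indices such as "0,1,2".
is_valid_visible_devices() {
    case "$1" in
        all) return 0 ;;
        ''|*[!0-9,]*|,*|*,|*,,*) return 1 ;;
        *) return 0 ;;
    esac
}

# Example usage ({docker image} is the placeholder used in this guide):
#   devices="0,1"
#   is_valid_visible_devices "$devices" &&
#       docker run --rm --runtime=habana -e HABANA_VISIBLE_DEVICES="$devices" {docker image} /bin/bash -c "ls /dev/ac*"
```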
Register the habana runtime with containerd:
sudo tee /etc/containerd/config.toml <<EOF
disabled_plugins = []
version = 2

[plugins]
  [plugins."io.containerd.grpc.v1.cri"]
    [plugins."io.containerd.grpc.v1.cri".containerd]
      default_runtime_name = "habana"
      [plugins."io.containerd.grpc.v1.cri".containerd.runtimes]
        [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.habana]
          runtime_type = "io.containerd.runc.v2"
          [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.habana.options]
            BinaryName = "/usr/bin/habana-container-runtime"
  [plugins."io.containerd.runtime.v1.linux"]
    runtime = "habana-container-runtime"
EOF
Restart containerd:
sudo systemctl restart containerd
Create a new configuration file at /etc/crio/crio.conf.d/99-habana-ai.conf:
[crio.runtime]
default_runtime = "habana-ai"

[crio.runtime.runtimes.habana-ai]
runtime_path = "/usr/local/habana/bin/habana-container-runtime"
monitor_env = [
    "PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin",
]
Restart CRI-O service:
systemctl restart crio.service
Pull Prebuilt Containers¶
Prebuilt containers are provided in:
Intel Gaudi vault
Amazon ECR Public Gallery
AWS Deep Learning Containers (DLC)
Pull and Launch Docker Image - Intel Gaudi Vault¶
Use the below command to pull the Docker image:
docker pull vault.habana.ai/gaudi-docker/1.19.1/ubuntu22.04/habanalabs/pytorch-installer-2.5.1:latest
Use the below command to run the Docker image:
docker run -it --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none -v /opt/datasets:/datasets --cap-add=sys_nice --net=host --ipc=host vault.habana.ai/gaudi-docker/1.19.1/ubuntu22.04/habanalabs/pytorch-installer-2.5.1:latest
Note
Include --ipc=host in the docker run command for the Docker images. This is required for distributed training using the Habana Collective Communication Library (HCCL), as it allows re-use of host shared memory for best performance.
To run the Docker image with a partial number of the supplied Gaudi devices, make sure to set the device-to-module mapping correctly. See Multiple Dockers Each with a Single Workload for further details.
Amazon ECR Public Gallery¶
To pull and run Docker images from the Amazon ECR Public Gallery, make sure to follow the steps detailed in Pulling a public image.
AWS Deep Learning Containers¶
To set up and use AWS Deep Learning containers, follow the instructions detailed in AWS Available Deep Learning Containers Images.
Build Docker Images from Intel Gaudi Dockerfiles¶
To build custom Docker images, follow the steps as described in the Setup and Install Repo.
Launch Docker Image¶
Use the below command to launch the Docker image. Make sure to update the below command with the required operating system. See the Support Matrix for a list of supported operating systems:
docker run -it --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none -v /opt/datasets:/datasets --cap-add=sys_nice --net=host --ipc=host vault.habana.ai/gaudi-docker/1.19.1/${OS}/habanalabs/pytorch-installer-2.5.1:latest
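Since the image name differs only in the release, operating system and PyTorch version, it can be assembled from three values. The sketch below follows the naming pattern of the commands in this guide. The function name is hypothetical, and whether a given combination is actually published should be checked against the Support Matrix.

```shell
# Hypothetical helper: prints the image name for a release ($1), an operating
# system tag ($2) and a PyTorch version ($3), using the pattern shown above.
gaudi_pytorch_image() {
    printf 'vault.habana.ai/gaudi-docker/%s/%s/habanalabs/pytorch-installer-%s:latest\n' "$1" "$2" "$3"
}

# Example usage with the values from this guide:
#   image=$(gaudi_pytorch_image 1.19.1 ubuntu22.04 2.5.1)
#   docker pull "$image"
#   docker run -it --runtime=habana -e HABANA_VISIBLE_DEVICES=all \
#       -e OMPI_MCA_btl_vader_single_copy_mechanism=none -v /opt/datasets:/datasets \
#       --cap-add=sys_nice --net=host --ipc=host "$image"
```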
Set up Python for Models¶
Using your own models requires setting Python 3.10 as the default Python version. If Python 3.10 is not the default version, replace any call to the Python command in your model with $PYTHON and define the environment variable as below:
export PYTHON=/usr/bin/python3.10
Running models from the Intel Gaudi Model References GitHub repository requires the PYTHON environment variable to match the supported Python release:
export PYTHON=/usr/bin/<python version>
Note
The Python version depends on the operating system. Refer to the Support Matrix for a full list of supported operating systems and Python versions.
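To confirm that $PYTHON points at the expected release, a script could compare the major.minor part of the interpreter's version string. The sketch below is one way to do that. The function name is hypothetical, and it assumes the usual "Python X.Y.Z" format printed by the --version option.

```shell
# Hypothetical helper: $1 is the output of "$PYTHON --version", for example
# "Python 3.10.12". Prints the major.minor part, for example "3.10".
python_minor_version() {
    printf '%s\n' "$1" | sed -n 's/^Python \([0-9][0-9]*\.[0-9][0-9]*\).*$/\1/p'
}

# Example usage:
#   export PYTHON=/usr/bin/python3.10
#   [ "$(python_minor_version "$("$PYTHON" --version)")" = "3.10" ] || echo "unexpected Python version" >&2
```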