Habana Deep Learning Base AMI Installation¶
The following table outlines the supported installation options and the steps required for each. Habana’s Base AMI comes with the software needed to run containers preinstalled.
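Objective | Steps
---|---
Run Using Containers on Habana Base AMI (Recommended) | Pull Prebuilt Containers, Build Docker Images from Intel Gaudi Dockerfiles, Launch Docker Image
Run PyTorch on Habana Base AMI | Install PyTorch, Model References Requirements, Set up Python for Models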
Habana Deep Learning AMIs are also available for Amazon ECS and Amazon EKS. See the Amazon ECS with Gaudi User Guide and the Amazon EKS with Gaudi User Guide for more details.
Note
Before installing the packages and Docker images below, review the currently supported versions and operating systems listed in the Support Matrix.
Run Using Containers on Habana Base AMI¶
Pull Prebuilt Containers¶
Prebuilt containers are provided in:
Intel Gaudi vault
Amazon ECR Public Gallery
AWS Deep Learning Containers (DLC)
Pull and Launch Docker Image - Intel Gaudi Vault¶
Use one of the commands below to pull the Docker image for your operating system (Ubuntu 22.04 or Amazon Linux 2):
docker pull vault.habana.ai/gaudi-docker/1.18.0/ubuntu22.04/habanalabs/pytorch-installer-2.4.0:latest
docker pull vault.habana.ai/gaudi-docker/1.18.0/amzn2/habanalabs/pytorch-installer-2.4.0:latest
Use the matching command below to run the Docker image:
docker run -it --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none -v /opt/datasets:/datasets --cap-add=sys_nice --net=host --ipc=host vault.habana.ai/gaudi-docker/1.18.0/ubuntu22.04/habanalabs/pytorch-installer-2.4.0:latest
docker run -it --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none -v /opt/datasets:/datasets --cap-add=sys_nice --net=host --ipc=host vault.habana.ai/gaudi-docker/1.18.0/amzn2/habanalabs/pytorch-installer-2.4.0:latest
Note
Include --ipc=host in the docker run command for the Docker images. This is required for distributed training using the Habana Collective Communication Library (HCCL), as it allows re-use of host shared memory for best performance.
To run the Docker image with only some of the available Gaudi devices, make sure to set the device to module mapping correctly. See Multiple Dockers Each with a Single Workload for further details.
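The two pull commands above differ only in the operating system component of the image path. As a minimal sketch, the selection could be scripted as shown here. The helper names `gaudi_os_tag` and `gaudi_image` are hypothetical, and the mapping from `/etc/os-release` values to the path component is an assumption based on the two images listed above.

```shell
# Sketch: build the Intel Gaudi vault image name for a given OS.
# Helper names and the os-release mapping are illustrative assumptions.
GAUDI_RELEASE="1.18.0"      # release used in the commands above
PYTORCH_VERSION="2.4.0"     # PyTorch installer version used above

# gaudi_os_tag ID VERSION_ID -> prints the OS path component, or fails.
gaudi_os_tag() (
    case "$1:$2" in
        ubuntu:22.04) echo "ubuntu22.04" ;;
        amzn:2)       echo "amzn2" ;;
        *) echo "unsupported OS: $1 $2" >&2; exit 1 ;;
    esac
)

# gaudi_image ID VERSION_ID -> prints the full image reference.
gaudi_image() (
    tag=$(gaudi_os_tag "$1" "$2") || exit 1
    echo "vault.habana.ai/gaudi-docker/${GAUDI_RELEASE}/${tag}/habanalabs/pytorch-installer-${PYTORCH_VERSION}:latest"
)

# Usage on the instance (ID and VERSION_ID come from /etc/os-release):
#   . /etc/os-release
#   docker pull "$(gaudi_image "$ID" "$VERSION_ID")"
```

Failing on an unknown operating system, rather than guessing a tag, keeps the script from pulling an image that does not exist.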
Amazon ECR Public Gallery¶
To pull and run Docker images from the Amazon ECR Public Gallery, follow the steps detailed in Pulling a public image.
AWS Deep Learning Containers¶
To set up and use AWS Deep Learning Containers, follow the instructions in AWS Available Deep Learning Containers Images.
Build Docker Images from Intel Gaudi Dockerfiles¶
To build custom Docker images, follow the steps described in the Setup and Install Repo.
Launch Docker Image¶
Use the command below to launch the Docker image. Replace ${OS} with the required operating system. See the Support Matrix for a list of supported operating systems:
docker run -it --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none -v /opt/datasets:/datasets --cap-add=sys_nice --net=host --ipc=host vault.habana.ai/gaudi-docker/1.18.0/${OS}/habanalabs/pytorch-installer-2.4.0:latest
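Because the command above depends on `${OS}`, an unset variable would produce an invalid image path. The following sketch wraps the same command in a hypothetical helper, `launch_gaudi_container`, that stops early when `OS` is missing. The `DRY_RUN` switch is also an assumption added for illustration; it prints the command instead of running it.

```shell
# Sketch: fail fast when OS is unset before launching the container.
# launch_gaudi_container and DRY_RUN are hypothetical, not part of the Gaudi tooling.
launch_gaudi_container() (
    # Abort with a message if OS is unset or empty.
    : "${OS:?set OS to a supported value, e.g. ubuntu22.04 or amzn2}"
    image="vault.habana.ai/gaudi-docker/1.18.0/${OS}/habanalabs/pytorch-installer-2.4.0:latest"
    # Same flags as the documented docker run command.
    set -- docker run -it --runtime=habana \
        -e HABANA_VISIBLE_DEVICES=all \
        -e OMPI_MCA_btl_vader_single_copy_mechanism=none \
        -v /opt/datasets:/datasets \
        --cap-add=sys_nice --net=host --ipc=host \
        "$image"
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"      # print the command only
    else
        "$@"           # run the container
    fi
)

# Usage:
#   OS=ubuntu22.04 DRY_RUN=1 launch_gaudi_container
```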
Run PyTorch on Habana Base AMI¶
Install PyTorch¶
This section describes how to obtain and install the PyTorch software package. Follow the instructions below to install PyTorch packages on a bare metal platform or virtual machine.
Note
Installing PyTorch with Docker is the recommended installation method and does not require additional steps. For further details, refer to the Pull and Launch Docker Image - Intel Gaudi Vault section.
Intel Gaudi PyTorch packages consist of:
torch - PyTorch framework package with Intel Gaudi support.
habana-torch-plugin - Libraries and modules needed to execute PyTorch on single-card, single-server and multi-server setups.
habana-torch-dataloader - Intel Gaudi multi-threaded dataloader package.
torchvision and torchaudio - Torchvision and Torchaudio packages compiled in the torch environment. No Gaudi-specific changes in these packages.
torch-tb-profiler - The TensorBoard plugin used to display Gaudi-specific information on TensorBoard.
Run the hl-smi tool to confirm the installed Intel Gaudi software version. Use the installer version that matches the version you are running. For example, if the installed version is 1.17.1, you should see the following:
HL-SMI Version: hl-1.17.1-XXXXXXX
Driver Version: 1.17.1-XXXXXX
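The version check can be scripted. The sketch below assumes the `HL-SMI Version: hl-<version>-<build>` format shown in the example output and uses two hypothetical helpers, `parse_hl_smi_version` and `matches_installer`. It is an illustration, not part of the Intel Gaudi tooling.

```shell
# Sketch: extract the Intel Gaudi software version from hl-smi output.
# Assumes the "HL-SMI Version: hl-<version>-<build>" format from the example;
# both function names are hypothetical.

# parse_hl_smi_version: reads hl-smi output on stdin, prints the first version found.
parse_hl_smi_version() {
    sed -n 's/.*HL-SMI Version:[[:space:]]*hl-\([0-9][0-9.]*\).*/\1/p' | head -n 1
}

# matches_installer INSTALLED EXPECTED -> succeeds only when the versions are equal.
matches_installer() {
    [ "$1" = "$2" ]
}

# Usage on a Gaudi instance:
#   installed=$(hl-smi | parse_hl_smi_version)
#   matches_installer "$installed" "1.18.0" || echo "installer mismatch: $installed" >&2
```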
Install the Intel Gaudi PyTorch environment by running the following commands:
wget -nv https://vault.habana.ai/artifactory/gaudi-installer/1.18.0/habanalabs-installer.sh
chmod +x habanalabs-installer.sh
./habanalabs-installer.sh install -t dependencies
./habanalabs-installer.sh install --type pytorch --venv
Note
Installing dependencies requires sudo permission.
Check whether PyTorch is already installed in a path listed in the PYTHONPATH environment variable. If it is, uninstall it before proceeding or remove the path from PYTHONPATH.
This script supports fresh installations only. Software upgrades are not supported.
The --venv flag installs PyTorch inside a virtual environment. The default virtual environment folder is $HOME/habanalabs-venv.
To override the default, run the following command:
export HABANALABS_VIRTUAL_DIR=xxxx
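The notes above suggest two checks worth running before the installer: whether a PyTorch copy is reachable through `PYTHONPATH`, and which virtual environment folder will be used. The following is a rough sketch with hypothetical helper names. Looking for a `torch` directory is only a heuristic and may miss other install layouts, and treating an empty `HABANALABS_VIRTUAL_DIR` as unset is an assumption.

```shell
# Sketch: pre-install checks based on the notes above.
# Function names are hypothetical; the torch/ directory test is a heuristic.

# torch_dirs_in_pythonpath -> prints each PYTHONPATH entry that contains torch/.
torch_dirs_in_pythonpath() (
    set -f          # no globbing while splitting
    IFS=:           # PYTHONPATH entries are colon-separated
    for dir in ${PYTHONPATH:-}; do
        [ -n "$dir" ] && [ -d "$dir/torch" ] && echo "$dir"
    done
    exit 0
)

# gaudi_venv_dir -> prints the override if set, otherwise the documented default.
gaudi_venv_dir() {
    echo "${HABANALABS_VIRTUAL_DIR:-$HOME/habanalabs-venv}"
}

# Usage:
#   found=$(torch_dirs_in_pythonpath)
#   [ -z "$found" ] || echo "PyTorch found on PYTHONPATH in: $found" >&2
#   echo "virtual environment folder: $(gaudi_venv_dir)"
```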
Model References Requirements¶
Some PyTorch models need additional Python packages. They can be installed using the Python requirements files provided in the Model References repository. Refer to the Model References repository for detailed instructions on running PyTorch models.
Set up Python for Models¶
Using your own models requires setting Python 3.10 as the default Python version. If Python 3.10 is not the default version, replace any call to the python command in your model with $PYTHON and define the environment variable as shown below:
export PYTHON=/usr/bin/python3.10
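Before launching a model through `$PYTHON`, it can help to confirm that the variable points at the expected interpreter. The sketch below uses a hypothetical helper, `check_python_version`, and `train.py` is a placeholder script name rather than a file from the Model References repository.

```shell
# Sketch: confirm that $PYTHON points at the expected interpreter version.
# check_python_version is a hypothetical helper; train.py is a placeholder.

# check_python_version INTERPRETER EXPECTED -> succeeds if major.minor matches.
check_python_version() (
    actual=$("$1" -c 'import sys; print("%d.%d" % sys.version_info[:2])') || exit 1
    if [ "$actual" != "$2" ]; then
        echo "expected Python $2 but $1 reports $actual" >&2
        exit 1
    fi
)

# Usage:
#   export PYTHON=/usr/bin/python3.10
#   check_python_version "$PYTHON" 3.10 && "$PYTHON" train.py
```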
Running models from the Intel Gaudi Model References GitHub repository requires the PYTHON environment variable to match the supported Python release:
export PYTHON=/usr/bin/<python version>
Note
The Python version depends on the operating system. Refer to the Support Matrix for a full list of supported operating systems and Python versions.