Docker Installation¶
Configure Container Runtime¶
To register the habana runtime, use the method below that best suits your environment. If the configuration file already exists, merge the new entries into your existing configuration instead of overwriting it.
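If the file already holds other settings, the new runtimes entry can be merged in rather than rewritten by hand. A minimal sketch, assuming jq is installed; it works on a temporary copy with a stand-in config so it is safe to try anywhere, but on a real host you would point CONFIG at /etc/docker/daemon.json:

```shell
# Sketch: merge the habana runtime entry into an existing daemon.json
# without clobbering other settings. Assumes jq is available; operates
# on a temporary copy so this example is safe to run anywhere.
CONFIG=$(mktemp)
echo '{ "log-driver": "json-file" }' > "$CONFIG"   # stand-in for an existing config
if command -v jq >/dev/null 2>&1; then
  jq '.runtimes.habana = {"path": "/usr/bin/habana-container-runtime", "runtimeArgs": []}' \
    "$CONFIG" > "$CONFIG.new" && mv "$CONFIG.new" "$CONFIG"
fi
cat "$CONFIG"
```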
Note
As of Kubernetes 1.20, support for Docker as a container runtime (dockershim) has been deprecated.
Register the habana runtime by adding the following to /etc/docker/daemon.json:

sudo tee /etc/docker/daemon.json <<EOF
{
    "runtimes": {
        "habana": {
            "path": "/usr/bin/habana-container-runtime",
            "runtimeArgs": []
        }
    }
}
EOF
(Optional) Set the default runtime by adding the following to /etc/docker/daemon.json. Setting the default runtime to habana routes all your workloads through this runtime; generic workloads are automatically forwarded to a generic runtime. If you prefer not to set a default runtime, skip this step and override the runtime for a specific container with the --runtime flag of the docker run command:

"default-runtime": "habana"
Your /etc/docker/daemon.json should look similar to this:

{
    "default-runtime": "habana",
    "runtimes": {
        "habana": {
            "path": "/usr/bin/habana-container-runtime",
            "runtimeArgs": []
        }
    }
}
Restart Docker:
sudo systemctl restart docker
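After the restart, the registered runtimes can be queried to confirm that Docker picked up the new entry. A sketch that degrades gracefully on machines where no Docker daemon is reachable:

```shell
# Sketch: list the runtimes Docker knows about after the restart; the
# guard lets this run harmlessly where Docker is unavailable.
if docker info >/dev/null 2>&1; then
  RUNTIMES=$(docker info --format '{{range $name, $rt := .Runtimes}}{{$name}} {{end}}')
else
  RUNTIMES="docker-unavailable"
fi
echo "$RUNTIMES"
```

On a correctly configured host, habana should appear in the list.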
Register the habana runtime:

sudo tee /etc/containerd/config.toml <<EOF
disabled_plugins = []
version = 2

[plugins]
  [plugins."io.containerd.grpc.v1.cri"]
    [plugins."io.containerd.grpc.v1.cri".containerd]
      default_runtime_name = "habana"

      [plugins."io.containerd.grpc.v1.cri".containerd.runtimes]
        [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.habana]
          runtime_type = "io.containerd.runc.v2"

          [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.habana.options]
            BinaryName = "/usr/bin/habana-container-runtime"

        [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
          SystemdCgroup = true

  [plugins."io.containerd.runtime.v1.linux"]
    runtime = "habana-container-runtime"
EOF
Restart containerd:
sudo systemctl restart containerd
Create a new configuration file at /etc/crio/crio.conf.d/99-habana-ai.conf:

[crio.runtime]
default_runtime = "habana-ai"

[crio.runtime.runtimes.habana-ai]
runtime_path = "/usr/local/habana/bin/habana-container-runtime"
monitor_env = [
    "PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin",
]
Restart CRI-O service:
systemctl restart crio.service
Use Intel Gaudi Containers¶
You can either pull prebuilt containers as described below for Ubuntu 24.04.3 and Ubuntu 22.04.5, or build custom Docker images for any supported operating system as detailed in the Setup and Install Repo. For further details on the supported operating systems, refer to the Support Matrix.
Prebuilt containers are provided in the Intel Gaudi vault for Ubuntu 24.04.3 and Ubuntu 22.04.5 only. Use the commands below to pull and run Docker images from the Intel Gaudi vault.
Pull the Docker image using the command that matches your Ubuntu version:
docker pull vault.habana.ai/gaudi-docker/1.23.0/ubuntu24.04/habanalabs/pytorch-installer-2.9.0:latest
docker pull vault.habana.ai/gaudi-docker/1.23.0/ubuntu22.04/habanalabs/pytorch-installer-2.9.0:latest
Run Docker using the following command. Make sure to include --ipc=host, which is required for distributed training using the Habana Collective Communication Library (HCCL), as it allows reuse of host shared memory for best performance:

docker run -it --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none -v /opt/datasets:/datasets --cap-add=sys_nice --net=host --ipc=host vault.habana.ai/gaudi-docker/1.23.0/ubuntu24.04/habanalabs/pytorch-installer-2.9.0:latest
docker run -it --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none -v /opt/datasets:/datasets --cap-add=sys_nice --net=host --ipc=host vault.habana.ai/gaudi-docker/1.23.0/ubuntu22.04/habanalabs/pytorch-installer-2.9.0:latest
Note
Starting from the 1.18.0 release, SSH host keys have been removed from the Docker images. To add them, run /usr/bin/ssh-keygen -A inside the Docker container. If you are running on Kubernetes, make sure the SSH host keys are identical across all Docker containers. To achieve this, either build a new Docker image on top of the Intel Gaudi Docker image by adding a layer RUN /usr/bin/ssh-keygen -A, or mount the SSH host keys externally.
To run the Docker image with only a subset of the supplied Gaudi devices, make sure to set the device-to-module mapping correctly. See Multiple Dockers Each with a Single Workload for further details.
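For the Kubernetes case, one way to keep host keys identical across containers is to bake them into a derived image. A sketch that writes such a Dockerfile, using the ubuntu24.04 image tag from the pull commands above; the file name Dockerfile.ssh is illustrative:

```shell
# Sketch: generate a Dockerfile that adds SSH host keys as a new layer
# on top of the Intel Gaudi image, so every container built from it
# carries the same keys.
cat > Dockerfile.ssh <<'EOF'
FROM vault.habana.ai/gaudi-docker/1.23.0/ubuntu24.04/habanalabs/pytorch-installer-2.9.0:latest
RUN /usr/bin/ssh-keygen -A
EOF
# Build it with, e.g.: docker build -t gaudi-with-ssh-keys -f Dockerfile.ssh .
cat Dockerfile.ssh
```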
You can also use prebuilt containers provided in Amazon ECR Public Library and AWS Available Deep Learning Containers Images.
Docker Troubleshooting on Intel Gaudi 3¶
If Docker fails to start or run properly on your Intel Gaudi 3 instance, follow the troubleshooting steps below to identify and resolve common configuration and runtime issues.
Verify Docker service:
Check whether the Docker service is active:
systemctl status docker
Alternatively, list running containers:
docker ps
If you see a "Cannot connect to the Docker daemon" error message, the Docker service may not be running or your user might not belong to the docker group. In this case, check user permissions.
Verify that your user is in the docker group:

groups $USER
If not, add your user to the group and refresh your group membership:

sudo usermod -aG docker $USER
newgrp docker
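The check and the fix can be combined into a small guard so usermod is only suggested when actually needed. A sketch; the printed RESULT strings are illustrative:

```shell
# Sketch: report whether the current user is already in the docker
# group before re-running usermod/newgrp.
ME=${USER:-$(id -un)}
if id -nG "$ME" | tr ' ' '\n' | grep -qx docker; then
  RESULT="in-group"
else
  RESULT="not-in-group"
fi
echo "$RESULT"
```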
Verify Intel Gaudi runtime dependencies. Intel Gaudi Docker images require the Habana runtime and drivers, including the habana-docker plugin, hl-thunk, synapse, hccl, etc. Check whether the Habana runtime packages are installed:

dpkg -l | grep habana
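The presence check can also be scripted against specific packages. The package names below (habanalabs-container-runtime, habanalabs-dkms) are assumptions that may differ between releases, so treat the dpkg -l | grep habana output as authoritative:

```shell
# Sketch: probe for Habana runtime packages by exact name. The names
# here are assumptions -- verify against `dpkg -l | grep habana`.
MISSING=""
for pkg in habanalabs-container-runtime habanalabs-dkms; do
  dpkg -s "$pkg" >/dev/null 2>&1 || MISSING="$MISSING $pkg"
done
echo "missing:${MISSING:-none}"
```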
Confirm image and runtime compatibility:
Make sure you are using a Gaudi 3-compatible Docker image.
If you see an "unknown runtime hpu" error message, configure the Habana runtime in /etc/docker/daemon.json as described above in Configure Container Runtime.
Collect logs for further diagnosis:
View the recent Docker logs:
journalctl -u docker --no-pager | tail -n 50
Check the kernel messages related to Habana:
dmesg | grep -i habana
Note
If the issue persists, collect the logs above and share them for further analysis, as more details are needed to identify the root cause.
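The collection steps above can be bundled into one file for sharing. A sketch; the output file name is illustrative, and each command is allowed to fail quietly on hosts where the tool is unavailable:

```shell
# Sketch: gather service status, Docker journal, and Habana kernel
# messages into a single file to attach to a support request.
LOGFILE="gaudi-docker-logs.txt"
{
  echo "=== docker service status ==="
  systemctl status docker --no-pager 2>/dev/null || true
  echo "=== last 50 docker journal lines ==="
  journalctl -u docker --no-pager 2>/dev/null | tail -n 50
  echo "=== kernel messages mentioning habana ==="
  dmesg 2>/dev/null | grep -i habana || true
} > "$LOGFILE"
echo "wrote $LOGFILE"
```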