Docker Installation

Configure Container Runtime

To register the habana runtime, use the method below that matches your container engine (Docker, containerd, or CRI-O). If a configuration file already exists, merge the new entries into it rather than overwriting it (a merge sketch for Docker follows the Docker steps below).

Note

As of Kubernetes 1.20, support for Docker has been deprecated.

Docker

  1. Register the habana runtime by adding the following to /etc/docker/daemon.json:

    sudo tee /etc/docker/daemon.json <<EOF
    {
       "runtimes": {
          "habana": {
                "path": "/usr/bin/habana-container-runtime",
                "runtimeArgs": []
          }
       }
    }
    EOF
    
  2. (Optional) Set the default runtime by adding the following to /etc/docker/daemon.json. Setting habana as the default runtime routes all workloads through it; generic workloads are automatically forwarded to the generic runtime. If you prefer not to set a default, skip this step and select the runtime per container with the --runtime flag of the docker run command:

    "default-runtime": "habana"
    

    Your /etc/docker/daemon.json should look similar to this:

    {
       "default-runtime": "habana",
       "runtimes": {
          "habana": {
             "path": "/usr/bin/habana-container-runtime",
             "runtimeArgs": []
          }
       }
    }
    
  3. Restart Docker:

    sudo systemctl restart docker
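
If /etc/docker/daemon.json already contained settings before step 1, the tee command above overwrites them. As a minimal merge sketch, assuming jq is available on the host, the habana runtime can be added to an existing file instead:

    # Add the habana runtime entry to an existing daemon.json; write to a
    # temporary file first, then replace the original.
    sudo jq '.runtimes.habana = {"path": "/usr/bin/habana-container-runtime", "runtimeArgs": []}' \
        /etc/docker/daemon.json > /tmp/daemon.json && sudo mv /tmp/daemon.json /etc/docker/daemon.json

After the restart, a quick way to confirm the registration is to check the runtimes Docker reports, for example:

    # The habana runtime should be listed in the Runtimes line.
    docker info | grep -i runtime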
    
ContainerD

  1. Register the habana runtime by adding the following to /etc/containerd/config.toml:

    sudo tee /etc/containerd/config.toml <<EOF
    disabled_plugins = []
    version = 2
    
    [plugins]
      [plugins."io.containerd.grpc.v1.cri"]
        [plugins."io.containerd.grpc.v1.cri".containerd]
          default_runtime_name = "habana"
          [plugins."io.containerd.grpc.v1.cri".containerd.runtimes]
            [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.habana]
              runtime_type = "io.containerd.runc.v2"
              [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.habana.options]
                BinaryName = "/usr/bin/habana-container-runtime"
      [plugins."io.containerd.runtime.v1.linux"]
        runtime = "habana-container-runtime"
    EOF
    
  2. Restart containerd:

    sudo systemctl restart containerd
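
To confirm that containerd picked up the habana runtime, one way is to inspect the merged configuration, for example:

    # The habana runtime entry should appear in the dumped configuration.
    sudo containerd config dump | grep -A 2 habana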
    
CRI-O

  1. Create a new configuration file at /etc/crio/crio.conf.d/99-habana-ai.conf:

    [crio.runtime]
    default_runtime = "habana-ai"
    
    [crio.runtime.runtimes.habana-ai]
    runtime_path = "/usr/local/habana/bin/habana-container-runtime"
    monitor_env = [
            "PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin",
    ]
    
  2. Restart the CRI-O service:

    sudo systemctl restart crio.service
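
To confirm that the drop-in file was applied, you can print the configuration CRI-O resolves, for example:

    # The habana-ai runtime and the default_runtime setting should appear in the output.
    sudo crio config | grep -A 3 habana-ai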

Use Intel Gaudi Containers

You can either pull prebuilt containers as described below or build custom Docker images as detailed in the Setup and Install Repo.

Prebuilt containers are provided in the Intel Gaudi vault. Use the commands below to pull and run them.

  1. Pull the Docker image that matches your operating system using one of the following commands:

       docker pull vault.habana.ai/gaudi-docker/1.18.0/ubuntu24.04/habanalabs/pytorch-installer-2.4.0:latest
    
       docker pull vault.habana.ai/gaudi-docker/1.18.0/ubuntu22.04/habanalabs/pytorch-installer-2.4.0:latest
    
       docker pull vault.habana.ai/gaudi-docker/1.18.0/amzn2/habanalabs/pytorch-installer-2.4.0:latest
    
       docker pull vault.habana.ai/gaudi-docker/1.18.0/rhel8.6/habanalabs/pytorch-installer-2.4.0:latest
    
       docker pull vault.habana.ai/gaudi-docker/1.18.0/rhel9.2/habanalabs/pytorch-installer-2.4.0:latest
    
       docker pull vault.habana.ai/gaudi-docker/1.18.0/rhel9.4/habanalabs/pytorch-installer-2.4.0:latest
    
       docker pull vault.habana.ai/gaudi-docker/1.18.0/tencentos3.1/habanalabs/pytorch-installer-2.4.0:latest
    
       docker pull vault.habana.ai/gaudi-docker/1.18.0/suse15.5/habanalabs/pytorch-installer-2.4.0:latest
    
  2. Run the Docker image using the command that matches your operating system. Make sure to include --ipc=host; it is required for distributed training with the Habana Collective Communication Library (HCCL), as it allows reuse of host shared memory for best performance:

       docker run -it --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none -v /opt/datasets:/datasets --cap-add=sys_nice --net=host --ipc=host vault.habana.ai/gaudi-docker/1.18.0/ubuntu24.04/habanalabs/pytorch-installer-2.4.0:latest
    
       docker run -it --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none -v /opt/datasets:/datasets --cap-add=sys_nice --net=host --ipc=host vault.habana.ai/gaudi-docker/1.18.0/ubuntu22.04/habanalabs/pytorch-installer-2.4.0:latest
    
       docker run -it --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none -v /opt/datasets:/datasets --cap-add=sys_nice --net=host --ipc=host vault.habana.ai/gaudi-docker/1.18.0/amzn2/habanalabs/pytorch-installer-2.4.0:latest
    
       docker run -it --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none -v /opt/datasets:/datasets --cap-add=sys_nice --net=host --ipc=host vault.habana.ai/gaudi-docker/1.18.0/rhel8.6/habanalabs/pytorch-installer-2.4.0:latest
    
       docker run -it --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none -v /opt/datasets:/datasets --cap-add=sys_nice --net=host --ipc=host vault.habana.ai/gaudi-docker/1.18.0/rhel9.2/habanalabs/pytorch-installer-2.4.0:latest
    
       docker run -it --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none -v /opt/datasets:/datasets --cap-add=sys_nice --net=host --ipc=host vault.habana.ai/gaudi-docker/1.18.0/rhel9.4/habanalabs/pytorch-installer-2.4.0:latest
    
       docker run -it --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none -v /opt/datasets:/datasets --cap-add=sys_nice --net=host --ipc=host vault.habana.ai/gaudi-docker/1.18.0/tencentos3.1/habanalabs/pytorch-installer-2.4.0:latest
    
       docker run -it --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none -v /opt/datasets:/datasets --cap-add=sys_nice --net=host --ipc=host vault.habana.ai/gaudi-docker/1.18.0/suse15.5/habanalabs/pytorch-installer-2.4.0:latest
    

    Note

    • Starting from the 1.18.0 release, SSH host keys have been removed from the Docker images. To add them, run /usr/bin/ssh-keygen -A inside the Docker container. If you are running on Kubernetes, make sure the SSH host keys are identical across all Docker containers. To achieve this, you can either build a new Docker image on top of the Intel Gaudi Docker image by adding a RUN /usr/bin/ssh-keygen -A layer (see the sketch after these notes), or externally mount the SSH host keys.

    • To run the Docker image with only a subset of the available Gaudi devices, make sure to set the device-to-module mapping correctly. See Multiple Dockers Each with a Single Workload for further details.

    • You can also use prebuilt containers provided in Amazon ECR Public Library and AWS Available Deep Learning Containers Images.
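
    As a sketch of the first note above, the derived image below bakes the SSH host keys into a new layer so that every container started from it shares identical keys; the Dockerfile name and image tag are illustrative only:

       # Write a minimal Dockerfile on top of an Intel Gaudi image and build it.
       # The resulting image carries pre-generated SSH host keys.
       cat > Dockerfile.ssh <<'DOCKERFILE'
       FROM vault.habana.ai/gaudi-docker/1.18.0/ubuntu22.04/habanalabs/pytorch-installer-2.4.0:latest
       RUN /usr/bin/ssh-keygen -A
       DOCKERFILE
       docker build -f Dockerfile.ssh -t gaudi-pytorch-ssh:1.18.0 .

    Once a container is running, you can also confirm that the Gaudi devices are visible from inside it, for example with the hl-smi tool:

       # Lists the Gaudi devices and their utilization from inside the container.
       hl-smi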