Intel Gaudi Software Stack and Driver Installation

The following sections describe how to obtain and install Intel® Gaudi® software and drivers on a bare metal system, either on a fresh OS or an existing system. Make sure to review the currently supported versions and operating systems listed in the Support Matrix.

To run on a bare metal system, make sure to install the firmware, drivers and Intel Gaudi PyTorch environment as detailed in the following sections. You have the option to either install on a fresh OS or upgrade an existing system.

Install Intel Gaudi SW Stack on a Fresh OS

  1. Install the Intel Gaudi SW stack. For further details on the package installers included, see Intel Gaudi Software Installers table.

    wget -nv https://vault.habana.ai/artifactory/gaudi-installer/1.18.0/habanalabs-installer.sh
    chmod +x habanalabs-installer.sh
    ./habanalabs-installer.sh install --type base
    

    Note

    • Installing the package with an internet connection available allows the installer to download and install the required dependencies for the Intel Gaudi software package (via apt-get, yum, or pip, for example).

    • The installation sets the number of huge pages automatically.

    • To install each installer separately, refer to the detailed instructions in Installing Intel Gaudi SW Packages Individually.
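    The huge pages setting can be verified after installation with a standard Linux procfs check (not specific to the installer):

    ```shell
    # Inspect the huge pages configuration applied by the installer
    grep -i hugepages /proc/meminfo
    ```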

    For further instructions on how to control the script attributes, refer to the help guide by running the following command:

    ./habanalabs-installer.sh --help
    
  2. Bring up the network interfaces by running the command below. The interfaces must be brought up when training uses external Gaudi network interfaces between servers for multi-server scale-out, and again every time the kernel module is loaded, or unloaded and reloaded. A reference for bringing up the interfaces is provided in the manage_network_ifs.sh script. Before bringing up the network interfaces, make sure to install ethtool using your operating system's package manager.

    # manage_network_ifs.sh requires ethtool
    ./manage_network_ifs.sh --up
    
  3. Install the Intel Gaudi PyTorch environment. For the full list of the packages included, refer to Intel Gaudi PyTorch Packages. If you are using the RHEL 8.6, TencentOS 3.1, or Amazon Linux 2 operating system with PyTorch 2.3.1, GCC version 9 or later is required.

    ./habanalabs-installer.sh install -t dependencies
    ./habanalabs-installer.sh install --type pytorch
    

    The Intel Gaudi PyTorch package is installed in the $HOME/.local/lib/{PYTHON_VER}/site-packages folder. The Python version depends on the operating system. Refer to the Support Matrix for a full list of supported operating systems and Python versions.

    Adding the --venv flag to the above command installs PyTorch inside a virtual environment. The default virtual environment folder is $HOME/habanalabs-venv. You can override the default by setting the following environment variable: export HABANALABS_VIRTUAL_DIR=xxxx.

    To activate the virtual environment, go to the directory where your virtual environment is installed and run the source ./bin/activate command.
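    As an illustration of the activation flow (using a throwaway venv as a stand-in for the installer-created one):

    ```shell
    # Stand-in venv; the installer-created one at $HOME/habanalabs-venv
    # (or $HABANALABS_VIRTUAL_DIR) activates the same way.
    python3 -m venv --without-pip /tmp/demo-venv
    cd /tmp/demo-venv
    source ./bin/activate
    python -c "import sys; print(sys.prefix)"   # prints the venv path
    deactivate
    ```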

    Note

    • Installing dependencies requires sudo permission.

    • Check whether PyTorch is already installed in a path listed in the PYTHONPATH environment variable. If it is, uninstall it before proceeding, or remove that path from PYTHONPATH.
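    A quick way to check for a pre-existing install is a hedged sketch like the following (the pip uninstall is shown commented out):

    ```shell
    # Detect a PyTorch build already visible to the interpreter/PYTHONPATH
    if python3 -c "import torch" 2>/dev/null; then
        python3 -c "import torch; print('found:', torch.__file__)"
        # pip uninstall -y torch    # remove it, or drop its path from PYTHONPATH
    else
        echo "no conflicting PyTorch install found"
    fi
    ```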

  4. Set Python 3.10 as the default Python version when using your own models. If Python 3.10 is not the default version, replace any call to the Python command in your model with $PYTHON and define the environment variable as follows:

    export PYTHON=/usr/bin/python3.10
    

    Running models from the Intel Gaudi Model References GitHub repository requires the PYTHON environment variable to match the supported Python release:

    export PYTHON=/usr/bin/<python version>
    

    Note

    The Python version depends on the operating system. Refer to the Support Matrix for a full list of supported operating systems and Python versions.
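    The indirection looks like this in practice (the fallback to `command -v python3` is only for illustration; use the interpreter your OS's Support Matrix entry lists):

    ```shell
    # Route every interpreter call through $PYTHON instead of a hard-coded `python`
    export PYTHON="${PYTHON:-$(command -v python3)}"
    "$PYTHON" --version
    # "$PYTHON" train.py    # i.e., replace `python train.py` with this form
    ```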

If you want to resume the system-level installation, refer to Environment Variables and Configurations Update.

Upgrade Intel Gaudi SW Stack

  1. Upgrade the Intel Gaudi SW stack. For further details on the package installers included, see Intel Gaudi Software Installers table.

    wget -nv https://vault.habana.ai/artifactory/gaudi-installer/1.18.0/habanalabs-installer.sh
    chmod +x habanalabs-installer.sh
    ./habanalabs-installer.sh upgrade --type all
    

    Note

    • Installing the package with an internet connection available allows the installer to download and install the required dependencies for the Intel Gaudi software package (via apt-get, yum, or pip, for example).

    • The installation sets the number of huge pages automatically.

    For further instructions on how to control the script attributes, refer to the help guide by running the following command:

    ./habanalabs-installer.sh --help
    
  2. Bring up the network interfaces by running the command below. The interfaces must be brought up when training uses external Gaudi network interfaces between servers for multi-server scale-out, and again every time the kernel module is loaded, or unloaded and reloaded. A reference for bringing up the interfaces is provided in the manage_network_ifs.sh script. Before bringing up the network interfaces, make sure to install ethtool using your operating system's package manager.

    # manage_network_ifs.sh requires ethtool
    ./manage_network_ifs.sh --up
    
  3. Set Python 3.10 as the default Python version when using your own models. If Python 3.10 is not the default version, replace any call to the Python command in your model with $PYTHON and define the environment variable as follows:

    export PYTHON=/usr/bin/python3.10
    

    Running models from the Intel Gaudi Model References GitHub repository requires the PYTHON environment variable to match the supported Python release:

    export PYTHON=/usr/bin/<python version>
    

    Note

    The Python version depends on the operating system. Refer to the Support Matrix for a full list of supported operating systems and Python versions.

If you want to resume the system-level installation, refer to Environment Variables and Configurations Update.

Install Intel Gaudi SW Stack

To run using containers, make sure to install the firmware, drivers and Intel Gaudi container-runtime as detailed below for each operating system. You have the option to either install on a fresh OS or upgrade an existing system.

  1. Install the Intel Gaudi SW stack. For further details on the package installers included, see Intel Gaudi Software Installers table.

    wget -nv https://vault.habana.ai/artifactory/gaudi-installer/1.18.0/habanalabs-installer.sh
    chmod +x habanalabs-installer.sh
    ./habanalabs-installer.sh install --type base
    

    Note

    • Ubuntu 24.04 installation is available on Gaudi 3 only.

    • Installing the package with an internet connection available allows the installer to download and install the required dependencies for the Intel Gaudi software package (via apt-get, yum, or pip, for example).

    • The installation sets the number of huge pages automatically.

    • To install each installer separately, refer to the detailed instructions in Installing Intel Gaudi SW Packages Individually.

    To upgrade an existing system, run the following command:

    wget -nv https://vault.habana.ai/artifactory/gaudi-installer/1.18.0/habanalabs-installer.sh
    chmod +x habanalabs-installer.sh
    ./habanalabs-installer.sh upgrade --type base
    

    For further instructions on how to control the script attributes, refer to the help guide by running the following command:

    ./habanalabs-installer.sh --help
    
  2. Bring up the network interfaces by running the command below. The interfaces must be brought up when training uses external Gaudi network interfaces between servers for multi-server scale-out, and again every time the kernel module is loaded, or unloaded and reloaded. A reference for bringing up the interfaces is provided in the manage_network_ifs.sh script. Before bringing up the network interfaces, make sure to install ethtool using your operating system's package manager.

    # manage_network_ifs.sh requires ethtool
    ./manage_network_ifs.sh --up
    
  3. Install the habanalabs-container-runtime package. The habanalabs-container-runtime is a modified runc that installs the container runtime library. It lets you select which devices are mounted in the container: specify only the indices of the devices for the container, and the container runtime handles the rest. Both Docker and Kubernetes are supported.

    sudo apt install -y habanalabs-container-runtime
    

    Note

    Important: If you run the container runtime in Kubernetes with habana-k8s-device-plugin, you must uncomment the following lines in config.toml to avoid failures:

    • #visible_devices_all_as_default = false

    • #mount_accelerators = false
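    After uncommenting, the two settings should read as follows (the /etc/habana-container-runtime/config.toml path is an assumption; verify the location in your installation):

    ```toml
    # config.toml (path assumed; verify locally)
    visible_devices_all_as_default = false
    mount_accelerators = false
    ```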

  1. Install the Intel Gaudi SW stack. For further details on the package installers included, see Intel Gaudi Software Installers table.

    wget -nv https://vault.habana.ai/artifactory/gaudi-installer/1.18.0/habanalabs-installer.sh
    chmod +x habanalabs-installer.sh
    ./habanalabs-installer.sh install --type base
    

    Note

    • Installing the package with an internet connection available allows the installer to download and install the required dependencies for the Intel Gaudi software package (via apt-get, yum, or pip, for example).

    • The installation sets the number of huge pages automatically.

    • To install each installer separately, refer to the detailed instructions in Installing Intel Gaudi SW Packages Individually.

    To upgrade an existing system, run the following command:

    wget -nv https://vault.habana.ai/artifactory/gaudi-installer/1.18.0/habanalabs-installer.sh
    chmod +x habanalabs-installer.sh
    ./habanalabs-installer.sh upgrade --type base
    

    For further instructions on how to control the script attributes, refer to the help guide by running the following command:

    ./habanalabs-installer.sh --help
    
  2. Bring up the network interfaces by running the command below. The interfaces must be brought up when training uses external Gaudi network interfaces between servers for multi-server scale-out, and again every time the kernel module is loaded, or unloaded and reloaded. A reference for bringing up the interfaces is provided in the manage_network_ifs.sh script. Before bringing up the network interfaces, make sure to install ethtool using your operating system's package manager.

    # manage_network_ifs.sh requires ethtool
    ./manage_network_ifs.sh --up
    
  3. Install the habanalabs-container-runtime package. The habanalabs-container-runtime is a modified runc that installs the container runtime library. It lets you select which devices are mounted in the container: specify only the indices of the devices for the container, and the container runtime handles the rest. Both Docker and Kubernetes are supported.

    sudo apt install -y habanalabs-container-runtime
    

    Note

    Important: If you run the container runtime in Kubernetes with habana-k8s-device-plugin, you must uncomment the following lines in config.toml to avoid failures:

    • #visible_devices_all_as_default = false

    • #mount_accelerators = false

  1. Install the Intel Gaudi SW stack. For further details on the package installers included, see Intel Gaudi Software Installers table.

    wget -nv https://vault.habana.ai/artifactory/gaudi-installer/1.18.0/habanalabs-installer.sh
    chmod +x habanalabs-installer.sh
    ./habanalabs-installer.sh install --type base
    

    Note

    • Amazon Linux 2 installation is available on first-gen Gaudi only.

    • Installing the package with an internet connection available allows the installer to download and install the required dependencies for the Intel Gaudi software package (via apt-get, yum, or pip, for example).

    • The installation sets the number of huge pages automatically.

    • To install each installer separately, refer to the detailed instructions in Installing Intel Gaudi SW Packages Individually.

    To upgrade an existing system, run the following command:

    wget -nv https://vault.habana.ai/artifactory/gaudi-installer/1.18.0/habanalabs-installer.sh
    chmod +x habanalabs-installer.sh
    ./habanalabs-installer.sh upgrade --type base
    

    For further instructions on how to control the script attributes, refer to the help guide by running the following command:

    ./habanalabs-installer.sh --help
    
  2. Bring up the network interfaces by running the command below. The interfaces must be brought up when training uses external Gaudi network interfaces between servers for multi-server scale-out, and again every time the kernel module is loaded, or unloaded and reloaded. A reference for bringing up the interfaces is provided in the manage_network_ifs.sh script. Before bringing up the network interfaces, make sure to install ethtool using your operating system's package manager.

    # manage_network_ifs.sh requires ethtool
    ./manage_network_ifs.sh --up
    
  3. Install the habanalabs-container-runtime package. The habanalabs-container-runtime is a modified runc that installs the container runtime library. It lets you select which devices are mounted in the container: specify only the indices of the devices for the container, and the container runtime handles the rest. Both Docker and Kubernetes are supported.

    sudo yum install -y habanalabs-container-runtime
    

    Note

    Important: If you run the container runtime in Kubernetes with habana-k8s-device-plugin, you must uncomment the following lines in config.toml to avoid failures:

    • #visible_devices_all_as_default = false

    • #mount_accelerators = false

  1. Install the Intel Gaudi SW stack. For further details on the package installers included, see Intel Gaudi Software Installers table.

    wget -nv https://vault.habana.ai/artifactory/gaudi-installer/1.18.0/habanalabs-installer.sh
    chmod +x habanalabs-installer.sh
    ./habanalabs-installer.sh install --type base
    

    Note

    • RHEL 8.6 installation is available on Gaudi 2 only.

    • Installing the package with an internet connection available allows the installer to download and install the required dependencies for the Intel Gaudi software package (via apt-get, yum, or pip, for example).

    • The installation sets the number of huge pages automatically.

    • To install each installer separately, refer to the detailed instructions in Installing Intel Gaudi SW Packages Individually.

    To upgrade an existing system, run the following command:

    wget -nv https://vault.habana.ai/artifactory/gaudi-installer/1.18.0/habanalabs-installer.sh
    chmod +x habanalabs-installer.sh
    ./habanalabs-installer.sh upgrade --type base
    

    For further instructions on how to control the script attributes, refer to the help guide by running the following command:

    ./habanalabs-installer.sh --help
    
  2. Bring up the network interfaces by running the command below. The interfaces must be brought up when training uses external Gaudi network interfaces between servers for multi-server scale-out, and again every time the kernel module is loaded, or unloaded and reloaded. A reference for bringing up the interfaces is provided in the manage_network_ifs.sh script. Before bringing up the network interfaces, make sure to install ethtool using your operating system's package manager.

    # manage_network_ifs.sh requires ethtool
    ./manage_network_ifs.sh --up
    
  3. Install the habanalabs-container-runtime package. The habanalabs-container-runtime is a modified runc that installs the container runtime library. It lets you select which devices are mounted in the container: specify only the indices of the devices for the container, and the container runtime handles the rest. Both Docker and Kubernetes are supported.

    sudo yum install -y habanalabs-container-runtime
    

    Note

    Important: If you run the container runtime in Kubernetes with habana-k8s-device-plugin, you must uncomment the following lines in config.toml to avoid failures:

    • #visible_devices_all_as_default = false

    • #mount_accelerators = false

  1. Install the Intel Gaudi SW stack. For further details on the package installers included, see Intel Gaudi Software Installers table.

    wget -nv https://vault.habana.ai/artifactory/gaudi-installer/1.18.0/habanalabs-installer.sh
    chmod +x habanalabs-installer.sh
    ./habanalabs-installer.sh install --type base
    

    Note

    • RHEL 9.2 installation is available on Gaudi 2 only.

    • Installing the package with an internet connection available allows the installer to download and install the required dependencies for the Intel Gaudi software package (via apt-get, yum, or pip, for example).

    • The installation sets the number of huge pages automatically.

    • To install each installer separately, refer to the detailed instructions in Installing Intel Gaudi SW Packages Individually.

    To upgrade an existing system, run the following command:

    wget -nv https://vault.habana.ai/artifactory/gaudi-installer/1.18.0/habanalabs-installer.sh
    chmod +x habanalabs-installer.sh
    ./habanalabs-installer.sh upgrade --type base
    

    For further instructions on how to control the script attributes, refer to the help guide by running the following command:

    ./habanalabs-installer.sh --help
    
  2. Bring up the network interfaces by running the command below. The interfaces must be brought up when training uses external Gaudi network interfaces between servers for multi-server scale-out, and again every time the kernel module is loaded, or unloaded and reloaded. A reference for bringing up the interfaces is provided in the manage_network_ifs.sh script. Before bringing up the network interfaces, make sure to install ethtool using your operating system's package manager.

    # manage_network_ifs.sh requires ethtool
    ./manage_network_ifs.sh --up
    
  3. Install the habanalabs-container-runtime package. The habanalabs-container-runtime is a modified runc that installs the container runtime library. It lets you select which devices are mounted in the container: specify only the indices of the devices for the container, and the container runtime handles the rest. Both Docker and Kubernetes are supported.

    sudo yum install -y habanalabs-container-runtime
    

    Note

    Important: If you run the container runtime in Kubernetes with habana-k8s-device-plugin, you must uncomment the following lines in config.toml to avoid failures:

    • #visible_devices_all_as_default = false

    • #mount_accelerators = false

  1. Install the Intel Gaudi SW stack. For further details on the package installers included, see Intel Gaudi Software Installers table.

    wget -nv https://vault.habana.ai/artifactory/gaudi-installer/1.18.0/habanalabs-installer.sh
    chmod +x habanalabs-installer.sh
    ./habanalabs-installer.sh install --type base
    

    Note

    • RHEL 9.4 installation is available on Gaudi 2 and Gaudi 3 only.

    • Installing the package with an internet connection available allows the installer to download and install the required dependencies for the Intel Gaudi software package (via apt-get, yum, or pip, for example).

    • The installation sets the number of huge pages automatically.

    • To install each installer separately, refer to the detailed instructions in Installing Intel Gaudi SW Packages Individually.

    To upgrade an existing system, run the following command:

    wget -nv https://vault.habana.ai/artifactory/gaudi-installer/1.18.0/habanalabs-installer.sh
    chmod +x habanalabs-installer.sh
    ./habanalabs-installer.sh upgrade --type base
    

    For further instructions on how to control the script attributes, refer to the help guide by running the following command:

    ./habanalabs-installer.sh --help
    
  2. Bring up the network interfaces by running the command below. The interfaces must be brought up when training uses external Gaudi network interfaces between servers for multi-server scale-out, and again every time the kernel module is loaded, or unloaded and reloaded. A reference for bringing up the interfaces is provided in the manage_network_ifs.sh script. Before bringing up the network interfaces, make sure to install ethtool using your operating system's package manager.

    # manage_network_ifs.sh requires ethtool
    ./manage_network_ifs.sh --up
    
  3. Install the habanalabs-container-runtime package. The habanalabs-container-runtime is a modified runc that installs the container runtime library. It lets you select which devices are mounted in the container: specify only the indices of the devices for the container, and the container runtime handles the rest. Both Docker and Kubernetes are supported.

    sudo yum install -y habanalabs-container-runtime
    

    Note

    Important: If you run the container runtime in Kubernetes with habana-k8s-device-plugin, you must uncomment the following lines in config.toml to avoid failures:

    • #visible_devices_all_as_default = false

    • #mount_accelerators = false

  1. Install the Intel Gaudi SW stack. For further details on the package installers included, see Intel Gaudi Software Installers table.

    wget -nv https://vault.habana.ai/artifactory/gaudi-installer/1.18.0/habanalabs-installer.sh
    chmod +x habanalabs-installer.sh
    ./habanalabs-installer.sh install --type base
    

    Note

    • TencentOS 3.1 installation is available on Gaudi 2 only.

    • Installing the package with an internet connection available allows the installer to download and install the required dependencies for the Intel Gaudi software package (via apt-get, yum, or pip, for example).

    • The installation sets the number of huge pages automatically.

    • To install each installer separately, refer to the detailed instructions in Installing Intel Gaudi SW Packages Individually.

    To upgrade an existing system, run the following command:

    wget -nv https://vault.habana.ai/artifactory/gaudi-installer/1.18.0/habanalabs-installer.sh
    chmod +x habanalabs-installer.sh
    ./habanalabs-installer.sh upgrade --type base
    

    For further instructions on how to control the script attributes, refer to the help guide by running the following command:

    ./habanalabs-installer.sh --help
    
  2. Bring up the network interfaces by running the command below. The interfaces must be brought up when training uses external Gaudi network interfaces between servers for multi-server scale-out, and again every time the kernel module is loaded, or unloaded and reloaded. A reference for bringing up the interfaces is provided in the manage_network_ifs.sh script. Before bringing up the network interfaces, make sure to install ethtool using your operating system's package manager.

    # manage_network_ifs.sh requires ethtool
    ./manage_network_ifs.sh --up
    
  3. Install the habanalabs-container-runtime package. The habanalabs-container-runtime is a modified runc that installs the container runtime library. It lets you select which devices are mounted in the container: specify only the indices of the devices for the container, and the container runtime handles the rest. Both Docker and Kubernetes are supported.

    sudo yum install -y habanalabs-container-runtime
    

    Note

    Important: If you run the container runtime in Kubernetes with habana-k8s-device-plugin, you must uncomment the following lines in config.toml to avoid failures:

    • #visible_devices_all_as_default = false

    • #mount_accelerators = false

  1. Install the Intel Gaudi SW stack. For further details on the package installers included, see Intel Gaudi Software Installers table.

    wget -nv https://vault.habana.ai/artifactory/gaudi-installer/1.18.0/habanalabs-installer.sh
    chmod +x habanalabs-installer.sh
    ./habanalabs-installer.sh install --type base
    

    Note

    • SUSE 15.5 installation is available on Gaudi 3 only.

    • Installing the package with an internet connection available allows the installer to download and install the required dependencies for the Intel Gaudi software package (via apt-get, yum, or pip, for example).

    • The installation sets the number of huge pages automatically.

    • To install each installer separately, refer to the detailed instructions in Installing Intel Gaudi SW Packages Individually.

    To upgrade an existing system, run the following command:

    wget -nv https://vault.habana.ai/artifactory/gaudi-installer/1.18.0/habanalabs-installer.sh
    chmod +x habanalabs-installer.sh
    ./habanalabs-installer.sh upgrade --type base
    

    For further instructions on how to control the script attributes, refer to the help guide by running the following command:

    ./habanalabs-installer.sh --help
    
  2. Bring up the network interfaces by running the command below. The interfaces must be brought up when training uses external Gaudi network interfaces between servers for multi-server scale-out, and again every time the kernel module is loaded, or unloaded and reloaded. A reference for bringing up the interfaces is provided in the manage_network_ifs.sh script. Before bringing up the network interfaces, make sure to install ethtool using your operating system's package manager.

    # manage_network_ifs.sh requires ethtool
    ./manage_network_ifs.sh --up
    
  3. Install the habanalabs-container-runtime package. The habanalabs-container-runtime is a modified runc that installs the container runtime library. It lets you select which devices are mounted in the container: specify only the indices of the devices for the container, and the container runtime handles the rest. Both Docker and Kubernetes are supported.

    sudo zypper install -y habanalabs-container-runtime
    

    Note

    Important: If you run the container runtime in Kubernetes with habana-k8s-device-plugin, you must uncomment the following lines in config.toml to avoid failures:

    • #visible_devices_all_as_default = false

    • #mount_accelerators = false

Set up Container Runtime

To register the habana runtime, use the method below that is best suited to your environment. You might need to merge the new argument with your existing configuration.

Note

As of Kubernetes 1.20, support for Docker has been deprecated.

  1. Register habana runtime by adding the following to /etc/docker/daemon.json:

    sudo tee /etc/docker/daemon.json <<EOF
    {
       "runtimes": {
          "habana": {
                "path": "/usr/bin/habana-container-runtime",
                "runtimeArgs": []
          }
       }
    }
    EOF
    
  2. (Optional) Reconfigure the default runtime by adding the following to /etc/docker/daemon.json:

    "default-runtime": "habana"
    

    Your code should look similar to this:

    {
       "default-runtime": "habana",
       "runtimes": {
          "habana": {
             "path": "/usr/bin/habana-container-runtime",
             "runtimeArgs": []
          }
       }
    }
    
  3. Restart Docker:

    sudo systemctl restart docker
    
  1. Register habana runtime:

    sudo tee /etc/containerd/config.toml <<EOF
    disabled_plugins = []
    version = 2
    
    [plugins]
      [plugins."io.containerd.grpc.v1.cri"]
        [plugins."io.containerd.grpc.v1.cri".containerd]
          default_runtime_name = "habana"
          [plugins."io.containerd.grpc.v1.cri".containerd.runtimes]
            [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.habana]
              runtime_type = "io.containerd.runc.v2"
              [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.habana.options]
                BinaryName = "/usr/bin/habana-container-runtime"
      [plugins."io.containerd.runtime.v1.linux"]
        runtime = "habana-container-runtime"
    EOF
    
  2. Restart containerd:

    sudo systemctl restart containerd
    
  1. Create a new configuration file at /etc/crio/crio.conf.d/99-habana-ai.conf:

    [crio.runtime]
    default_runtime = "habana-ai"
    
    [crio.runtime.runtimes.habana-ai]
    runtime_path = "/usr/local/habana/bin/habana-container-runtime"
    monitor_env = [
            "PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin",
    ]
    
  2. Restart the CRI-O service: sudo systemctl restart crio.service.

Use Intel Gaudi Containers

You can either pull prebuilt containers as described below or build custom Docker images as detailed in the Setup and Install Repo.

Prebuilt containers are provided in the Intel Gaudi vault. Use the commands below to pull and run Docker images from the Intel Gaudi vault.

  1. Download the dataset prior to running the container, and mount the location of your dataset into the container by adding the flag below. For example, the host dataset location /opt/datasets/imagenet mounts to /datasets/imagenet inside the container:

    -v /opt/datasets/imagenet:/datasets/imagenet
    
  2. (Optional) Add the following flag to mount a local host shared folder into the container in order to transfer files out of it:

    -v $HOME/shared:/root/shared
    
  3. Pull the Docker image using the following command. Make sure to update the command below with the required operating system. See the Support Matrix for a list of supported operating systems:

       docker pull vault.habana.ai/gaudi-docker/1.18.0/{$OS}/habanalabs/pytorch-installer-2.4.0:latest
    
  4. Run the Docker image. Make sure to include --ipc=host. This is required for distributed training using the Habana Collective Communication Library (HCCL), as it allows reuse of host shared memory for best performance:

       docker run -it --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --net=host --ipc=host vault.habana.ai/gaudi-docker/1.18.0/{$OS}/habanalabs/pytorch-installer-2.4.0:latest
    

    Note

    • Starting from the 1.18.0 release, SSH host keys have been removed from the Docker images. To add them, run /usr/bin/ssh-keygen -A inside the Docker container. If you are running on Kubernetes, make sure the SSH host keys are identical across all containers. To achieve this, you can either build a new Docker image on top of the Intel Gaudi Docker image by adding a layer RUN /usr/bin/ssh-keygen -A, or mount the SSH host keys externally.

    • To run the Docker image with a partial number of the supplied Gaudi devices, make sure to set the device to module mapping correctly. See Multiple Dockers Each with a Single Workload for further details.

    • You can also use prebuilt containers provided in Amazon ECR Public Library and AWS Available Deep Learning Containers Images.
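    A minimal sketch of the "new layer" approach for restoring SSH host keys (the base image tag, including the ubuntu22.04 OS segment, is an assumption; match it to the image you pulled):

    ```dockerfile
    # Dockerfile sketch: add SSH host keys on top of the Intel Gaudi image
    # (base tag assumed; substitute your OS per the Support Matrix)
    FROM vault.habana.ai/gaudi-docker/1.18.0/ubuntu22.04/habanalabs/pytorch-installer-2.4.0:latest
    RUN /usr/bin/ssh-keygen -A
    ```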

Set up Python for Models

Using your own models requires setting Python 3.10 as the default Python version. If Python 3.10 is not the default version, replace any call to the Python command in your model with $PYTHON and define the environment variable as follows:

export PYTHON=/usr/bin/python3.10

Running models from the Intel Gaudi Model References GitHub repository requires the PYTHON environment variable to match the supported Python release:

export PYTHON=/usr/bin/<python version>

Note

The Python version depends on the operating system. Refer to the Support Matrix for a full list of supported operating systems and Python versions.

If you want to resume the system-level installation, see Environment Variables and Configurations Update.