Platform Upgrade and Full System Installation

The following steps guide an IT administrator through a complete system installation, which includes assembling the physical system, loading the operating system, and installing the Intel® Gaudi® software and driver.

BMC Access

Ensure access to the BMC via IP address, whether the BMC IP is set to static or dynamic:

  • The BMC enables access to the CPU subsystem and the Gaudi 2 subsystem (HIB/UBB).

  • HLS-2 has two physical RJ45 ports for BMC access and two IP addresses, one each for the CPU and Gaudi 2 subsystems.

  • The Supermicro server has one physical RJ45 port and one IP address for accessing both the CPU and Gaudi 2 subsystems.

Check Gaudi on the Platform

Check if all eight Gaudi cards are visible on the system by running the lspci command below:

$ lspci -d :1020: -nn
19:00.0 Processing accelerators [1200]: Habana Labs Ltd. Device [1da3:1020] (rev 01)
1a:00.0 Processing accelerators [1200]: Habana Labs Ltd. Device [1da3:1020] (rev 01)
43:00.0 Processing accelerators [1200]: Habana Labs Ltd. Device [1da3:1020] (rev 01)
44:00.0 Processing accelerators [1200]: Habana Labs Ltd. Device [1da3:1020] (rev 01)
b3:00.0 Processing accelerators [1200]: Habana Labs Ltd. Device [1da3:1020] (rev 01)
b4:00.0 Processing accelerators [1200]: Habana Labs Ltd. Device [1da3:1020] (rev 01)
cc:00.0 Processing accelerators [1200]: Habana Labs Ltd. Device [1da3:1020] (rev 01)
cd:00.0 Processing accelerators [1200]: Habana Labs Ltd. Device [1da3:1020] (rev 01)
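
The check above can be automated by counting the matching devices. The following is a minimal sketch, assuming the device ID [1da3:1020] shown in the sample output; the helper names count_gaudi_devices and check_gaudi_count are hypothetical and are not part of any Intel Gaudi tool.

```shell
# Count Gaudi 2 devices in `lspci -nn` output read from stdin.
# The [1da3:1020] vendor:device pair is taken from the sample output above.
count_gaudi_devices() {
    # grep -c prints 0 and exits non-zero when nothing matches; keep the 0.
    grep -c '\[1da3:1020\]' || true
}

# Compare the detected count against the expected number of cards (default 8).
check_gaudi_count() {
    expected="${1:-8}"
    found="$(count_gaudi_devices)"
    if [ "$found" -eq "$expected" ]; then
        echo "OK: found $found of $expected Gaudi devices"
    else
        echo "FAIL: found $found of $expected Gaudi devices"
        return 1
    fi
}

# Usage on a live system:
#   lspci -d :1020: -nn | check_gaudi_count 8
```

A non-zero exit status from check_gaudi_count can be used to stop an installation script before any firmware or driver step runs.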

Check and/or Upgrade the System Firmware Components

Before installing the firmware with habanalabs-firmware-odm, check that the following platform-level components are up to date:

  • CPLD - To upgrade the CPLD, refer to the Gaudi 2 HL225 Porting Guide.

  • Platform BIOS and BMC FW - Refer to your system vendor documentation for details.

  • (HLS-2 only) PCIe switch version - Refer to your system vendor documentation for details.

See the Support Matrix for the supported versions.

Note

Contact your local Intel Gaudi support representative if you do not have access to the Gaudi 2 HL225 Porting Guide.

Artifactory Access Token

Tokens for automated access to Artifactory can be generated and used as follows:

  1. Go to the Artifacts page.

  2. Click the Welcome button in the upper-right corner of the page.

  3. Select Edit Profile and generate an API Key.

  4. Copy the API Key for later use.

Operating System Packages Installation

Installing the packages with an internet connection available allows the required dependencies for the Intel Gaudi software package to be downloaded and installed automatically (for example, through apt-get and pip install).

Note

Running the commands below installs the latest version only. To install a version other than the latest, run the commands with a specific build number.

Package Retrieval:

  1. Download and install the public key:

    curl -X GET https://vault.habana.ai/artifactory/api/gpg/key/public | sudo apt-key add -
    
  2. Get the name of the operating system:

    lsb_release -c | awk '{print $2}'
    
  3. Create an apt source file /etc/apt/sources.list.d/artifactory.list with the following content, replacing the placeholder with the OS name from the previous step: deb https://vault.habana.ai/artifactory/debian <OS name from previous step> main

  4. Update Debian cache:

    sudo dpkg --configure -a
    
    sudo apt-get update
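
Steps 2 and 3 above can be combined so that the source line is generated from the detected OS name. The following is a minimal sketch; habana_apt_source_line is a hypothetical helper name used only for illustration.

```shell
# Print the apt source line for a given distribution codename.
# habana_apt_source_line is a hypothetical helper, not an Intel Gaudi tool.
habana_apt_source_line() {
    if [ -z "$1" ]; then
        echo "usage: habana_apt_source_line <codename>" >&2
        return 1
    fi
    printf 'deb https://vault.habana.ai/artifactory/debian %s main\n' "$1"
}

# Usage on a live system (writes the file named in step 3):
#   codename="$(lsb_release -c | awk '{print $2}')"
#   habana_apt_source_line "$codename" | sudo tee /etc/apt/sources.list.d/artifactory.list
```

Generating the line from the lsb_release output avoids typing the OS name by hand, which is a common source of "repository not found" errors during apt-get update.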
    

Note

Amazon Linux 2 installation is available on first-gen Gaudi only.

Package Retrieval:

  1. Create /etc/yum.repos.d/Habana-Vault.repo with the following content:

    [vault]
    name=Habana Vault
    baseurl=https://vault.habana.ai/artifactory/AmazonLinux2
    enabled=1
    gpgcheck=0
    gpgkey=https://vault.habana.ai/artifactory/AmazonLinux2/repodata/repomod.xml.key
    repo_gpgcheck=0
    
  2. Update YUM cache by running the following command:

    sudo yum makecache
    
  3. Verify correct binding by running the following command:

    yum search habana
    

    This command searches for and lists all packages with the word Habana.

Note

RHEL8.6 installation is available on Gaudi 2 only.

Package Retrieval:

  1. Create /etc/yum.repos.d/Habana-Vault.repo with the following content:

    [vault]
    name=Habana Vault
    baseurl=https://vault.habana.ai/artifactory/rhel/8/8.6
    enabled=1
    repo_gpgcheck=0
    
  2. Update YUM cache by running the following command:

    sudo yum makecache
    
  3. Verify correct binding by running the following command:

    yum search habana
    

    This command searches for and lists all packages with the word Habana.

  4. Reinstall the libarchive package by running the following command:

    sudo dnf install -y libarchive*
    

Note

RHEL9.2 installation is available on Gaudi 2 only.

Package Retrieval:

  1. Create /etc/yum.repos.d/Habana-Vault.repo with the following content:

    [vault]
    name=Habana Vault
    baseurl=https://vault.habana.ai/artifactory/rhel/9/9.2
    enabled=1
    repo_gpgcheck=0
    
  2. Update YUM cache by running the following command:

    sudo yum makecache
    
  3. Verify correct binding by running the following command:

    yum search habana
    

    This command searches for and lists all packages with the word Habana.

  4. Reinstall the libarchive package by running the following command:

    sudo dnf install -y libarchive*
    

Note

Debian 10.10 installation is available on Gaudi 2 only.

Package Retrieval:

  1. Download and install the public key:

    curl -X GET https://vault.habana.ai/artifactory/api/gpg/key/public | sudo apt-key add -
    
  2. Get the name of the operating system:

    lsb_release -c | awk '{print $2}'
    
  3. Create an apt source file /etc/apt/sources.list.d/artifactory.list with the following content, replacing the placeholder with the OS name from the previous step: deb https://vault.habana.ai/artifactory/debian <OS name from previous step> main

  4. Update Debian cache:

    sudo dpkg --configure -a
    
    sudo apt-get update
    

Note

TencentOS 3.1 installation is available on Gaudi 2 only.

Package Retrieval:

  1. Create /etc/yum.repos.d/Habana-Vault.repo with the following content:

    [vault]
    name=Habana Vault
    baseurl=https://vault.habana.ai/artifactory/tencentos/3/3.1
    enabled=1
    repo_gpgcheck=0
    
  2. Update YUM cache by running the following command:

    sudo yum makecache
    
  3. Verify correct binding by running the following command:

    yum search habana
    

    This command searches for and lists all packages with the word Habana.

  4. Reinstall the libarchive package by running the following command:

    sudo dnf install -y libarchive*
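
The RHEL 8.6, RHEL 9.2, and TencentOS 3.1 repository files above differ only in their baseurl path. The following is a minimal sketch that generates the file content from that path; vault_repo_stanza is a hypothetical helper name, and the Amazon Linux 2 file is not covered because it carries additional gpgcheck and gpgkey lines.

```shell
# Print the Habana-Vault.repo content for a path under the vault.
# $1 is the path suffix shown in the sections above, for example rhel/9/9.2.
# vault_repo_stanza is a hypothetical helper, not an Intel Gaudi tool.
vault_repo_stanza() {
    if [ -z "$1" ]; then
        echo "usage: vault_repo_stanza <path>" >&2
        return 1
    fi
    printf '%s\n' \
        '[vault]' \
        'name=Habana Vault' \
        "baseurl=https://vault.habana.ai/artifactory/$1" \
        'enabled=1' \
        'repo_gpgcheck=0'
}

# Usage on a live system:
#   vault_repo_stanza rhel/9/9.2 | sudo tee /etc/yum.repos.d/Habana-Vault.repo
#   sudo yum makecache
```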
    

Firmware Verification and Installation

  1. Verify the Gaudi SPI FW version by running the following command:

    hl-smi -L | grep SPI
    

    Only the SPI FW needs to be updated at the system level. Refer to the Support Matrix for the exact version. The following is the expected output for the 1.16.1-7 release:

     Firmware [SPI] Version          : Preboot version hl-gaudi2-1.16.0-fw-50.1.2-sec-8 (Jul 20 2023 - 17:57:23)
    
  2. Install the FW by running the following command. Point to the Intel Gaudi vault location where the FW package was downloaded:

     sudo apt install -y ./path_to_file/habanalabs-firmware-odm-1.16.2-2.amd64.deb
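
On a multi-card system it can help to confirm that every card reports the same SPI FW release before and after the update. The following is a minimal sketch, assuming the version string format shown in the sample output above (hl-gaudi2-<release>-fw-...); spi_release_versions is a hypothetical helper name.

```shell
# Extract the release number (for example 1.16.0) from each SPI line of
# `hl-smi -L` output read from stdin and print each distinct value once.
# Assumes the "hl-gaudi2-<release>-fw-" format seen in the sample output.
spi_release_versions() {
    sed -n 's/.*hl-gaudi2-\([0-9][0-9.]*\)-fw-.*/\1/p' | sort -u
}

# Usage on a live system:
#   hl-smi -L | grep SPI | spi_release_versions
# More than one line of output means the cards run different SPI FW releases.
```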
    

Firmware Update

To update the FW, run the following command:

sudo hl-fw-loader

eROM Upgrade

When upgrading the FW on Gaudi 2, the eROM should also be upgraded. Before running the procedure, make sure you have the following:

  • Root privileges

  • BMC access

  • “gaudi2-agent-fw_loader-fit_erom.itb” file

Note

Upgrading the eROM is required if you are not using the latest eROM version. Refer to the Support Matrix for the latest eROM version. To verify the installed eROM version, run sudo hl-smi --fw-version.

To upgrade the eROM, perform the following:

  1. Unload the drivers. If the habanalabs-dkms driver is already installed, the drivers must be unloaded before the eROM update:

    sudo modprobe -r habanalabs && sudo modprobe -r habanalabs_cn
    sudo modprobe -r habanalabs_ib && sudo modprobe -r habanalabs_en
    
  2. Disable the eROM write protection by writing the value of 0x2e to address 8 of each of the OAM CPLDs.

  3. Upgrade the eROM by running the following command. Point to the location of the .itb file:

    hl-fw-loader -f ./path_to_file/gaudi2-agent-fw_loader-fit_erom.itb
    
  4. Enable the eROM write protection by writing the value of 0x26 to address 8 of each of the OAM CPLDs.
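
Two of the prerequisites listed above, root privileges and the presence of the .itb file, can be checked before starting the procedure. The following is a minimal sketch; check_erom_prereqs is a hypothetical helper name, and BMC access and the CPLD write-protection steps are not covered because they depend on the platform.

```shell
# Check the locally verifiable eROM upgrade prerequisites.
# $1 is the path to the gaudi2-agent-fw_loader-fit_erom.itb file.
# check_erom_prereqs is a hypothetical helper, not an Intel Gaudi tool.
check_erom_prereqs() {
    itb_file="$1"
    status=0
    # The procedure requires root privileges.
    if [ "$(id -u)" -ne 0 ]; then
        echo "FAIL: root privileges are required"
        status=1
    fi
    # The .itb image must exist at the given path.
    if [ ! -f "$itb_file" ]; then
        echo "FAIL: file not found: $itb_file"
        status=1
    fi
    if [ "$status" -eq 0 ]; then
        echo "OK: prerequisites met"
    fi
    return "$status"
}

# Usage on a live system:
#   check_erom_prereqs ./path_to_file/gaudi2-agent-fw_loader-fit_erom.itb
```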

Software Stack and Driver Installation

Install the SW driver, Intel Gaudi SW components, and Gaudi-specific environment by following the procedure in Intel Gaudi Software Stack and Driver Installation.

Environment Variables and Configurations Update

When the installation is complete, close the shell and re-open it. Alternatively, run the following:

source /etc/profile.d/habanalabs.sh

source ~/.bashrc

EEPROM Update for HLS-2 Users

Updating the EEPROM is required for HLS-2 users only. EEPROM burning is not covered in this document.

System Verifications and Final Tests

  1. Run lsmod to verify the driver is loaded and running:

    $ lsmod | grep habana
    habanalabs           1572864  0
    habanalabs_cn         454656  8
    habanalabs_ib          73728  8
    habanalabs_en          61440  8
    
  2. Run hl-smi and review the report. Verify that the driver version in the hl-smi output matches the installed Intel Gaudi software version and that the temperature (“Temp” column in the output) shows a non-zero value. A temperature of “0C” indicates a problem during card initialization. In this case, reboot the system and/or verify that the driver installation steps were performed correctly.

    [Figure: hl-smi report]

  3. Run dmesg and ensure no errors are reported:

    $ dmesg | grep habana
    
  4. Re-check that the SPI FW version prefix matches the Intel Gaudi software driver version prefix by running the hl-smi command below:

    $ hl-smi -L | grep SPI
    Firmware [SPI] Version: Preboot version hl-gaudi2-1.16.0-fw-50.1.2-sec-8 (Jul 20 2023 - 17:57:23)
    Firmware [SPI] Version: Preboot version hl-gaudi2-1.16.0-fw-50.1.2-sec-8 (Jul 20 2023 - 17:57:23)
    Firmware [SPI] Version: Preboot version hl-gaudi2-1.16.0-fw-50.1.2-sec-8 (Jul 20 2023 - 17:57:23)
    Firmware [SPI] Version: Preboot version hl-gaudi2-1.16.0-fw-50.1.2-sec-8 (Jul 20 2023 - 17:57:23)
    Firmware [SPI] Version: Preboot version hl-gaudi2-1.16.0-fw-50.1.2-sec-8 (Jul 20 2023 - 17:57:23)
    Firmware [SPI] Version: Preboot version hl-gaudi2-1.16.0-fw-50.1.2-sec-8 (Jul 20 2023 - 17:57:23)
    Firmware [SPI] Version: Preboot version hl-gaudi2-1.16.0-fw-50.1.2-sec-8 (Jul 20 2023 - 17:57:23)
    Firmware [SPI] Version: Preboot version hl-gaudi2-1.16.0-fw-50.1.2-sec-8 (Jul 20 2023 - 17:57:23)
    
  5. Re-check all SW components by running the apt list command below:

    $ apt list --installed | grep habana
    habanalabs-container-runtime/focal,now 1.16.2-2 amd64 [installed]
    habanalabs-dkms/focal,focal,now 1.16.2-2 all [installed]
    habanalabs-firmware-tools/focal,now 1.16.2-2 amd64 [installed]
    habanalabs-firmware/focal,now 1.16.2-2 amd64 [installed]
    habanalabs-graph/focal,now 1.16.2-2 amd64 [installed]
    habanalabs-qual/focal,now 1.16.2-2 amd64 [installed]
    habanalabs-thunk/focal,focal,now 1.16.2-2 all [installed]
    habanatools/focal,now 1.16.2-2 amd64 [installed]
    
  6. Run the hl_qual test suite as a hardware sanity check. The suite includes several tests that should be run on the system. See the Qualification Tool Library Guide (hl_qual) for the exact procedures and the prerequisite steps. To confirm that all hardware components function and interact with each other, run the following test first:

    $ ./hl_qual -gaudi2 -c all -rmod parallel -f2 -l extreme -t 60
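
Two of the verification steps above, the lsmod check and the version prefix comparison, lend themselves to scripting. The following is a minimal sketch under stated assumptions: the module list is taken from the sample lsmod output, and "prefix" is interpreted as the first two dot-separated fields of a version (for example 1.16), which the text does not define explicitly. The helper names are hypothetical.

```shell
# Print the names of expected kernel modules that are absent from `lsmod`
# output read from stdin. The list mirrors the sample lsmod output above.
missing_habana_modules() {
    loaded="$(awk '{print $1}')"
    for mod in habanalabs habanalabs_cn habanalabs_ib habanalabs_en; do
        printf '%s\n' "$loaded" | grep -qx "$mod" || echo "$mod"
    done
}

# Reduce a version such as 1.16.2-2 or 1.16.0 to its first two fields (1.16).
# Treating major.minor as the "prefix" is an assumption of this sketch.
release_prefix() {
    printf '%s\n' "$1" | cut -d. -f1,2
}

# Succeed when two versions share the same major.minor prefix.
same_release_prefix() {
    [ "$(release_prefix "$1")" = "$(release_prefix "$2")" ]
}

# Usage on a live system:
#   lsmod | missing_habana_modules        # no output means all are loaded
#   same_release_prefix 1.16.0 1.16.2-2 && echo "prefixes match"
```

With the sample values from this section, the SPI FW release 1.16.0 and the package version 1.16.2-2 both reduce to 1.16, so the comparison succeeds.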