Platform Upgrade and Full System Installation

If you are an IT administrator performing a full system installation, follow the steps below. These steps assume that the physical system is fully assembled, the OS is loaded, and the software is ready to be installed.

BMC Access

Ensure access to the BMC via IP address, whether the BMC IP is set to static or dynamic:

  • The BMC enables access to the CPU subsystem and the Intel® Gaudi® 2 AI accelerator subsystem (HIB/UBB).

  • HLS-2 has two physical RJ45 ports for BMC access and two IP addresses, one each for the CPU and the Gaudi 2 system.

  • The Supermicro server has one physical RJ45 port and one IP address for accessing both the CPU and Gaudi 2 systems.

Check Gaudi on the Platform

Check if all eight Gaudi cards are visible on the system by running the lspci command below:

$ lspci -d :1020: -nn
19:00.0 Processing accelerators [1200]: Habana Labs Ltd. Device [1da3:1020] (rev 01)
1a:00.0 Processing accelerators [1200]: Habana Labs Ltd. Device [1da3:1020] (rev 01)
43:00.0 Processing accelerators [1200]: Habana Labs Ltd. Device [1da3:1020] (rev 01)
44:00.0 Processing accelerators [1200]: Habana Labs Ltd. Device [1da3:1020] (rev 01)
b3:00.0 Processing accelerators [1200]: Habana Labs Ltd. Device [1da3:1020] (rev 01)
b4:00.0 Processing accelerators [1200]: Habana Labs Ltd. Device [1da3:1020] (rev 01)
cc:00.0 Processing accelerators [1200]: Habana Labs Ltd. Device [1da3:1020] (rev 01)
cd:00.0 Processing accelerators [1200]: Habana Labs Ltd. Device [1da3:1020] (rev 01)
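The visibility check above can be scripted. The following is a minimal sketch, assuming the lspci output format shown above; the function name check_gaudi_count is hypothetical and not part of any Intel Gaudi tool. It reads lspci output from stdin and compares the number of Gaudi 2 devices against the expected count (eight by default).

```shell
# Hypothetical helper (not an Intel Gaudi tool): counts Gaudi 2 devices in
# lspci output read from stdin and compares against the expected count.
check_gaudi_count() {
    expected="${1:-8}"
    # Each Gaudi 2 device line carries the [1da3:1020] vendor:device pair.
    # grep -c exits non-zero on zero matches, so "|| true" keeps the count.
    found="$(grep -c '\[1da3:1020\]' || true)"
    if [ "$found" -eq "$expected" ]; then
        echo "OK: $found of $expected Gaudi devices visible"
    else
        echo "ERROR: $found of $expected Gaudi devices visible" >&2
        return 1
    fi
}
```

Example usage: lspci -d :1020: -nn | check_gaudi_count 8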

Check and/or Upgrade the System FW Components

First, check the platform level components: CPLD, Platform BIOS, and BMC firmware. To upgrade the CPLD, refer to the Gaudi 2 HL225 Porting Guide available here. For the platform BIOS and BMC firmware, refer to your system vendor documentation.

This first part of the installation uses the habanalabs-firmware-odm package, which installs the Gaudi 2 firmware. See the Support Matrix for the supported FW versions.

When running on HLS:

  1. Make sure the CPLD version is updated. To upgrade your CPLD, refer to the Gaudi 2 HL225 Porting Guide document available here.

  2. Make sure that your PCI switch version is updated. Refer to your system vendor documentation for details.

Note

Please contact your local Intel Gaudi support representative if you do not have access to the Gaudi 2 HL225 Porting Guide.

Artifactory Access Token

Tokens for automated access to Artifactory can be generated and used as follows:

  1. Go to the Artifacts page.

  2. Click the Welcome button in the upper right corner of the page.

  3. Select Edit Profile and generate an API Key.

  4. Copy the API Key for later use.

Ubuntu - Package Installation

Installing the package with an internet connection available allows the required dependencies of the Intel Gaudi software package to be downloaded and installed automatically (via apt-get, pip install, etc.).

Note

Running the commands below installs the latest version only. To install a version other than the latest, run the same commands with a specific build number.

Package Retrieval

  1. Download and install the public key:

curl -X GET https://vault.habana.ai/artifactory/api/gpg/key/public | sudo apt-key add --

  2. Get the name of the operating system:

lsb_release -c | awk '{print $2}'

  3. Create an apt source file /etc/apt/sources.list.d/artifactory.list with the following content, replacing the placeholder with the OS name from the previous step:

deb https://vault.habana.ai/artifactory/debian <OS name from previous step> main

  4. Update the Debian cache:

sudo dpkg --configure -a

sudo apt-get update
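The OS name lookup and the apt source file creation in the steps above can be combined. The following is a minimal sketch; the function name gaudi_apt_source_line is hypothetical, and the entry format is taken from the apt source step above.

```shell
# Hypothetical helper: prints the apt source entry for a given OS codename
# (for example "focal"), using the repository URL from the steps above.
gaudi_apt_source_line() {
    codename="$1"
    if [ -z "$codename" ]; then
        echo "usage: gaudi_apt_source_line <codename>" >&2
        return 1
    fi
    echo "deb https://vault.habana.ai/artifactory/debian ${codename} main"
}
```

Example usage: gaudi_apt_source_line "$(lsb_release -c | awk '{print $2}')" | sudo tee /etc/apt/sources.list.d/artifactory.list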

Firmware Verification and Installation

To check the current Gaudi SPI firmware version, run the following command:

hl-smi -L | grep SPI

At the system level, only the SPI firmware needs to be updated. Refer to the Support Matrix for the exact version. The following shows the expected output for the 1.15.0-479 release:

 Firmware [SPI] Version          : Preboot version hl-gaudi2-1.15.0-fw-48.2.1-sec-8 (Jul 20 2023 - 17:57:23)
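Comparing the reported version against the Support Matrix by eye is error-prone on an eight-card system. The following is a minimal sketch, assuming the output line format shown above; the function name check_spi_version is hypothetical, and the expected version string must be supplied by the user from the Support Matrix.

```shell
# Hypothetical helper: reads "hl-smi -L | grep SPI" style lines from stdin
# and checks that every line reports the expected firmware version string.
check_spi_version() {
    expected="$1"
    if [ -z "$expected" ]; then
        echo "usage: check_spi_version <expected-version>" >&2
        return 1
    fi
    # Extract the "hl-gaudi2-..." token that follows "Preboot version".
    versions="$(sed -n 's/.*Preboot version \(hl-gaudi2-[^ ]*\).*/\1/p')"
    if [ -z "$versions" ]; then
        echo "ERROR: no SPI firmware lines found" >&2
        return 1
    fi
    # Count extracted versions that are not exactly equal to the expected one.
    mismatched="$(printf '%s\n' "$versions" | grep -v -x -F -c -e "$expected" || true)"
    if [ "$mismatched" -eq 0 ]; then
        echo "OK: all cards report $expected"
    else
        echo "ERROR: $mismatched card(s) do not report $expected" >&2
        return 1
    fi
}
```

Example usage: hl-smi -L | grep SPI | check_spi_version hl-gaudi2-1.15.0-fw-48.2.1-sec-8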

You can install the firmware using the following command (be sure to point to the location where the FW package was downloaded from the Intel Gaudi vault):

 sudo apt install -y ./habanalabs-firmware-odm-1.15.1-15.amd64.deb

Update FW

To update the firmware, run the following command:

sudo hl-fw-loader

Note

To update firmware on your system, make sure write protection is removed so that the SPI components can be burned. Removing write protection is outside the scope of this document.

Upgrade eROM

When upgrading the firmware on Gaudi 2, the eROM should also be upgraded. Before running the procedure, make sure you have the following:

  • Root privileges

  • BMC access

  • “gaudi2-agent-fw_loader-fit_erom.itb” file

Note

Upgrading eROM is required if you are not using the latest eROM version. Refer to Support Matrix for the latest eROM version. To verify the installed eROM version, run sudo hl-smi --fw-version.
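Two of the prerequisites listed above, root privileges and the presence of the .itb file, can be checked before starting. The following is a minimal sketch; the function name erom_preflight is hypothetical, and BMC access still has to be confirmed separately.

```shell
# Hypothetical helper: checks root privileges and that the eROM image file
# exists. It does not verify BMC access or the eROM version.
erom_preflight() {
    itb_file="$1"
    status=0
    if [ "$(id -u)" -ne 0 ]; then
        echo "ERROR: root privileges are required" >&2
        status=1
    fi
    if [ ! -f "$itb_file" ]; then
        echo "ERROR: eROM image not found: $itb_file" >&2
        status=1
    fi
    return "$status"
}
```

Example usage: erom_preflight ./path_to_file/gaudi2-agent-fw_loader-fit_erom.itb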

To upgrade the eROM, perform the following:

  1. Unload the drivers. If the habanalabs-dkms driver is already installed, the drivers must be unloaded before eROM update:

    sudo modprobe -r habanalabs && sudo modprobe -r habanalabs_cn
    sudo modprobe -r habanalabs_ib && sudo modprobe -r habanalabs_en
    
  2. Disable eROM write protection by writing the value of 0x2e to address 8 of each one of the OAM CPLDs.

  3. Upgrade the eROM by running the following command (be sure to point to the file location of the .itb file):

    hl-fw-loader -f ./path_to_file/gaudi2-agent-fw_loader-fit_erom.itb
    
  4. Enable eROM write protection by writing the value of 0x26 to address 8 of each of the OAM CPLDs.

Software Install and Update

Update the SW driver, the Intel Gaudi software components, and the Gaudi-specific environment. See Intel Gaudi Software Stack and Driver Installation.

Update Environment Variables and More

When the installation is complete, close the shell and re-open it. Alternatively, run the following:

source /etc/profile.d/habanalabs.sh

source ~/.bashrc

Update EEPROM (for HLS Users)

Updating the EEPROM is required for HLS users only. EEPROM burning is outside the scope of this document.

Driver Verification

Run lsmod to verify the driver is loaded and running:

$ lsmod | grep habana
habanalabs           1572864  0
habanalabs_cn         454656  8
habanalabs_ib          73728  8
habanalabs_en          61440  8
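The lsmod check above can be automated. The following is a minimal sketch, assuming the four module names shown in the output above; the function name check_habana_modules is hypothetical. It reads lsmod output from stdin and reports any of the four modules that is missing.

```shell
# Hypothetical helper: reads lsmod output from stdin and verifies that the
# four driver modules listed above are all present.
check_habana_modules() {
    lsmod_output="$(cat)"
    missing=0
    for module in habanalabs habanalabs_cn habanalabs_ib habanalabs_en; do
        # The module name is the first column of each lsmod line; an exact
        # match avoids "habanalabs" matching "habanalabs_cn".
        if ! printf '%s\n' "$lsmod_output" |
            awk -v m="$module" '$1 == m { found = 1 } END { exit !found }'; then
            echo "ERROR: module $module is not loaded" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

Example usage: lsmod | check_habana_modules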

System Verifications and Final Tests

  1. Run hl-smi and verify the following: the driver version in the hl-smi output matches the installed Intel Gaudi software version, and the temperature (under the “Temp” column in the hl-smi output) shows a non-zero value. If the temperature output is “0C”, there was a problem during card initialization. In this case, reboot the system and/or verify that the driver installation steps were performed correctly.

[Figure: hl-smi report]

  2. Run dmesg and ensure no errors are reported:

    $ dmesg | grep habana

  3. Re-check that the SPI FW version matches the Intel Gaudi software driver version by running hl-smi as shown below:

    $ hl-smi -L | grep SPI
    Firmware [SPI] Version: Preboot version hl-gaudi2-1.15.1-fw-49.0.0-sec-8 (Jul 20 2023 - 17:57:23)
    Firmware [SPI] Version: Preboot version hl-gaudi2-1.15.1-fw-49.0.0-sec-8 (Jul 20 2023 - 17:57:23)
    Firmware [SPI] Version: Preboot version hl-gaudi2-1.15.1-fw-49.0.0-sec-8 (Jul 20 2023 - 17:57:23)
    Firmware [SPI] Version: Preboot version hl-gaudi2-1.15.1-fw-49.0.0-sec-8 (Jul 20 2023 - 17:57:23)
    Firmware [SPI] Version: Preboot version hl-gaudi2-1.15.1-fw-49.0.0-sec-8 (Jul 20 2023 - 17:57:23)
    Firmware [SPI] Version: Preboot version hl-gaudi2-1.15.1-fw-49.0.0-sec-8 (Jul 20 2023 - 17:57:23)
    Firmware [SPI] Version: Preboot version hl-gaudi2-1.15.1-fw-49.0.0-sec-8 (Jul 20 2023 - 17:57:23)
    Firmware [SPI] Version: Preboot version hl-gaudi2-1.15.1-fw-49.0.0-sec-8 (Jul 20 2023 - 17:57:23)

  4. Re-check all SW components by running the apt list command below:

    $ apt list --installed | grep habana
    habanalabs-container-runtime/focal,now 1.15.1-15 amd64 [installed]
    habanalabs-dkms/focal,focal,now 1.15.1-15 all [installed]
    habanalabs-firmware-tools/focal,now 1.15.1-15 amd64 [installed]
    habanalabs-firmware/focal,now 1.15.1-15 amd64 [installed]
    habanalabs-graph/focal,now 1.15.1-15 amd64 [installed]
    habanalabs-qual/focal,now 1.15.1-15 amd64 [installed]
    habanalabs-thunk/focal,focal,now 1.15.1-15 all [installed]
    habanatools/focal,now 1.15.1-15 amd64 [installed]

  5. Run the hl_qual test suite as a hardware sanity check. The hl_qual test suite includes several tests which should be run on the system; the following test should be run first to get a basic confirmation that the hardware is functional. See the Qualification Library Guide (hl_qual Tool) for the exact procedures and the prerequisite steps.

    $ ./hl_qual -gaudi2 -c all -rmod parallel -f2 -l extreme -t 60
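The SW component re-check in the list above relies on every habana package reporting the same version. The following is a minimal sketch, assuming the apt list output format shown above; the function names habana_package_versions and check_habana_packages_consistent are hypothetical. More than one distinct version may indicate a partially upgraded installation.

```shell
# Hypothetical helpers: read "apt list --installed | grep habana" lines from
# stdin and check that all listed packages share a single version.
habana_package_versions() {
    # Each line looks like "name/suite,now VERSION ARCH [installed]", so the
    # version is the second whitespace-separated field.
    awk '$1 ~ /^habana/ { print $2 }' | sort -u
}

check_habana_packages_consistent() {
    versions="$(habana_package_versions)"
    # Count the distinct, non-empty version strings.
    count="$(printf '%s\n' "$versions" | grep -c . || true)"
    if [ "$count" -eq 1 ]; then
        echo "OK: all habana packages are at version $versions"
    else
        echo "ERROR: expected one package version, found $count" >&2
        return 1
    fi
}
```

Example usage: apt list --installed | grep habana | check_habana_packages_consistent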