Platform Upgrade and Full System Installation¶
The following steps guide an IT administrator through a complete system installation, which includes assembling the physical system, loading the operating system, and installing the Intel® Gaudi® software and driver.
BMC Access¶
Ensure that the BMC is reachable by IP address, whether the BMC IP is static or dynamic:
The BMC enables access to the CPU subsystem and the Gaudi 2 subsystem (HIB/UBB).
HLS-2 has two physical RJ45 ports for BMC access and two IP addresses, one each for the CPU and Gaudi 2 subsystems.
The Supermicro server has one physical RJ45 port and one IP address for accessing both the CPU and Gaudi 2 subsystems.
Check Gaudi on the Platform¶
Check if all eight Gaudi cards are visible on the system by running the lspci command below:
$ lspci -d :1020: -nn
19:00.0 Processing accelerators [1200]: Habana Labs Ltd. Device [1da3:1020] (rev 01)
1a:00.0 Processing accelerators [1200]: Habana Labs Ltd. Device [1da3:1020] (rev 01)
43:00.0 Processing accelerators [1200]: Habana Labs Ltd. Device [1da3:1020] (rev 01)
44:00.0 Processing accelerators [1200]: Habana Labs Ltd. Device [1da3:1020] (rev 01)
b3:00.0 Processing accelerators [1200]: Habana Labs Ltd. Device [1da3:1020] (rev 01)
b4:00.0 Processing accelerators [1200]: Habana Labs Ltd. Device [1da3:1020] (rev 01)
cc:00.0 Processing accelerators [1200]: Habana Labs Ltd. Device [1da3:1020] (rev 01)
cd:00.0 Processing accelerators [1200]: Habana Labs Ltd. Device [1da3:1020] (rev 01)
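The visual check above can be automated. The following is a minimal sketch, assuming the device ID [1da3:1020] shown in the output identifies each Gaudi 2 card; the function names count_gaudi_devices and check_gaudi_count are hypothetical helpers, not part of the Intel Gaudi tooling.

```shell
# Hypothetical helpers (not part of the Intel Gaudi software).

# Count lines on stdin that carry the Gaudi 2 vendor:device ID [1da3:1020].
count_gaudi_devices() {
    # grep -c prints 0 and exits non-zero on no match; keep the exit status clean.
    grep -c '\[1da3:1020]' || true
}

# Succeed only when the expected number of devices (default: 8) is present.
check_gaudi_count() {
    expected="${1:-8}"
    found="$(count_gaudi_devices)"
    if [ "$found" -eq "$expected" ]; then
        echo "OK: $found Gaudi devices found"
    else
        echo "FAIL: expected $expected Gaudi devices, found $found" >&2
        return 1
    fi
}
```

With these functions defined, a typical invocation would be lspci -d :1020: -nn | check_gaudi_count 8, which returns a non-zero exit status if any card is missing.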
Check and/or Upgrade the System Firmware Components¶
Before installing the firmware with habanalabs-firmware-odm, check that the following platform-level components are updated:
CPLD - To upgrade the CPLD, refer to the Porting Guide available here.
Platform BIOS and BMC FW - Refer to your system vendor documentation for details.
(HLS-2 only) PCIe switch version - Refer to your system vendor documentation for details.
See Support Matrix for the supported versions.
Note
Contact your local Intel Gaudi support representative if you do not have access to the Gaudi 2 HL225 Porting Guide.
Artifactory Access Token¶
Tokens for automated access to Artifactory can be generated and used as follows:
Go to the Artifacts page.
Click the Welcome button in the upper-right corner of the page.
Select Edit Profile and generate an API Key.
Copy the API Key for later use.
Operating System Packages Installation¶
Installing the packages with an internet connection available allows the package managers to download and install the dependencies required by the Intel Gaudi software package (using apt-get, pip install, etc.).
Note
Running the commands below installs the latest version only. To install a different version, run the commands below with a specific build number.
Package Retrieval:
Download and install the public key:
curl -X GET https://vault.habana.ai/artifactory/api/gpg/key/public | sudo apt-key add --
Get the name of the operating system:
lsb_release -c | awk '{print $2}'
Create an apt source file /etc/apt/sources.list.d/artifactory.list with the following content:
deb https://vault.habana.ai/artifactory/debian <OS name from previous step> main
Update Debian cache:
sudo dpkg --configure -a
sudo apt-get update
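The source-file step can be scripted so that the OS codename is not typed by hand. This is a minimal sketch; make_apt_source_line is a hypothetical helper name, and the repository URL is the one given in the steps above.

```shell
# Hypothetical helper: print the apt source line for a given OS codename.
make_apt_source_line() {
    codename="$1"
    if [ -z "$codename" ]; then
        echo "usage: make_apt_source_line <codename>" >&2
        return 1
    fi
    printf 'deb https://vault.habana.ai/artifactory/debian %s main\n' "$codename"
}
```

One way to use it is make_apt_source_line "$(lsb_release -c | awk '{print $2}')" | sudo tee /etc/apt/sources.list.d/artifactory.list, followed by the cache update commands above.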
Note
Amazon Linux 2 installation is available on first-gen Gaudi only.
Package Retrieval:
Create /etc/yum.repos.d/Habana-Vault.repo with the following content:
[vault]
name=Habana Vault
baseurl=https://vault.habana.ai/artifactory/AmazonLinux2
enabled=1
gpgcheck=0
gpgkey=https://vault.habana.ai/artifactory/AmazonLinux2/repodata/repomod.xml.key
repo_gpgcheck=0
Update YUM cache by running the following command:
sudo yum makecache
Verify correct binding by running the following command:
yum search habana
This command searches for and lists all packages with the word Habana.
Note
RHEL8.6 installation is available on Gaudi 2 only.
Package Retrieval:
Create /etc/yum.repos.d/Habana-Vault.repo with the following content:
[vault]
name=Habana Vault
baseurl=https://vault.habana.ai/artifactory/rhel/8/8.6
enabled=1
repo_gpgcheck=0
Update YUM cache by running the following command:
sudo yum makecache
Verify correct binding by running the following command:
yum search habana
This command searches for and lists all packages with the word Habana.
Reinstall the libarchive package by running the following command:
sudo dnf install -y libarchive*
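The repository file differs between the RHEL-style distributions only in its baseurl, so it can be generated from a single template. This is a minimal sketch; make_vault_repo is a hypothetical helper name, and the stanza mirrors the five-line file content shown above.

```shell
# Hypothetical helper: print a yum repository stanza for a Habana Vault base URL.
make_vault_repo() {
    baseurl="$1"
    if [ -z "$baseurl" ]; then
        echo "usage: make_vault_repo <baseurl>" >&2
        return 1
    fi
    printf '[vault]\nname=Habana Vault\nbaseurl=%s\nenabled=1\nrepo_gpgcheck=0\n' "$baseurl"
}
```

For example, make_vault_repo https://vault.habana.ai/artifactory/rhel/8/8.6 | sudo tee /etc/yum.repos.d/Habana-Vault.repo writes the RHEL 8.6 file; sudo yum makecache then refreshes the cache as described above.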
Note
RHEL9.2 installation is available on Gaudi 2 only.
Package Retrieval:
Create /etc/yum.repos.d/Habana-Vault.repo with the following content:
[vault]
name=Habana Vault
baseurl=https://vault.habana.ai/artifactory/rhel/9/9.2
enabled=1
repo_gpgcheck=0
Update YUM cache by running the following command:
sudo yum makecache
Verify correct binding by running the following command:
yum search habana
This command searches for and lists all packages with the word Habana.
Reinstall the libarchive package by running the following command:
sudo dnf install -y libarchive*
Note
Debian 10.10 installation is available on Gaudi 2 only.
Package Retrieval:
Download and install the public key:
curl -X GET https://vault.habana.ai/artifactory/api/gpg/key/public | sudo apt-key add --
Get the name of the operating system:
lsb_release -c | awk '{print $2}'
Create an apt source file /etc/apt/sources.list.d/artifactory.list with the following content:
deb https://vault.habana.ai/artifactory/debian <OS name from previous step> main
Update Debian cache:
sudo dpkg --configure -a
sudo apt-get update
Note
TencentOS 3.1 installation is available on Gaudi 2 only.
Package Retrieval:
Create /etc/yum.repos.d/Habana-Vault.repo with the following content:
[vault]
name=Habana Vault
baseurl=https://vault.habana.ai/artifactory/tencentos/3/3.1
enabled=1
repo_gpgcheck=0
Update YUM cache by running the following command:
sudo yum makecache
Verify correct binding by running the following command:
yum search habana
This command searches for and lists all packages with the word Habana.
Reinstall the libarchive package by running the following command:
sudo dnf install -y libarchive*
Firmware Verification and Installation¶
Verify the Gaudi SPI FW version by running the following command:
hl-smi -L | grep SPI
Only the SPI FW needs to be updated at the system level. Refer to the Support Matrix for the exact version. The output below is what is expected for the 1.16.1-7 release:
Firmware [SPI] Version : Preboot version hl-gaudi2-1.16.0-fw-50.1.2-sec-8 (Jul 20 2023 - 17:57:23)
Install the FW by running the following command. Point to the Intel Gaudi vault location where the FW package was downloaded:
sudo apt install -y ./path_to_file/habanalabs-firmware-odm-1.16.2-2.amd64.deb
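Comparing the reported SPI version against the Support Matrix can be scripted. The following is a minimal sketch, assuming the output format shown above (a "Preboot version" token on each "Firmware [SPI]" line); spi_versions and check_spi_version are hypothetical helper names.

```shell
# Hypothetical helpers (not part of the Intel Gaudi software).

# Print each distinct SPI preboot version string found on stdin, one per line.
spi_versions() {
    sed -n 's/.*Firmware \[SPI].*Preboot version \([^ ]*\).*/\1/p' | sort -u
}

# Succeed only when every card reports exactly the expected version string.
check_spi_version() {
    expected="$1"
    found="$(spi_versions)"
    if [ "$found" = "$expected" ]; then
        echo "OK: all cards report $expected"
    else
        echo "FAIL: expected $expected, found: $found" >&2
        return 1
    fi
}
```

A possible invocation is hl-smi -L | check_spi_version hl-gaudi2-1.16.0-fw-50.1.2-sec-8, substituting the version string listed in the Support Matrix for your release. The check fails if the cards report different versions or none at all.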
eROM Upgrade¶
When upgrading the FW on Gaudi 2, the eROM should also be upgraded. Before running the procedure, make sure you have the following:
Root privileges
BMC access
“gaudi2-agent-fw_loader-fit_erom.itb” file
Note
Upgrading the eROM is required if you are not using the latest eROM version.
Refer to Support Matrix for the latest eROM version.
To verify the installed eROM version, run sudo hl-smi --fw-version.
To upgrade the eROM, perform the following:
Unload the drivers. If the habanalabs-dkms driver is already installed, the drivers must be unloaded before the eROM update:
sudo modprobe -r habanalabs && sudo modprobe -r habanalabs_cn
sudo modprobe -r habanalabs_ib && sudo modprobe -r habanalabs_en
Disable the eROM write protection by writing the value 0x2e to address 8 of each of the OAM CPLDs.
Upgrade the eROM by running the following command. Point to the location of the .itb file:
hl-fw-loader -f ./path_to_file/gaudi2-agent-fw_loader-fit_erom.itb
Enable the eROM write protection by writing the value 0x26 to address 8 of each of the OAM CPLDs.
Software Stack and Driver Installation¶
Install the SW driver, Intel Gaudi SW components, and Gaudi-specific environment by following the procedure in Intel Gaudi Software Stack and Driver Installation.
Environment Variables and Configurations Update¶
When the installation is complete, close and reopen the shell, or run the following:
source /etc/profile.d/habanalabs.sh
source ~/.bashrc
EEPROM Update for HLS-2 Users¶
Updating the EEPROM is required for HLS-2 users only. EEPROM burning is not covered in this document.
System Verifications and Final Tests¶
Run lsmod to verify the driver is loaded and running:
$ lsmod | grep habana
habanalabs 1572864 0
habanalabs_cn 454656 8
habanalabs_ib 73728 8
habanalabs_en 61440 8
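The module check can also be expressed as a script. This is a minimal sketch, assuming the four module names shown in the lsmod output above are the complete expected set; missing_habana_modules is a hypothetical helper name.

```shell
# Hypothetical helper: print each expected habanalabs module that is absent
# from the lsmod output supplied on stdin. Prints nothing when all are loaded.
missing_habana_modules() {
    # The first column of lsmod output is the module name.
    loaded="$(awk '{print $1}')"
    for mod in habanalabs habanalabs_cn habanalabs_ib habanalabs_en; do
        # -x requires a whole-line match, so habanalabs does not match habanalabs_cn.
        printf '%s\n' "$loaded" | grep -qx "$mod" || echo "$mod"
    done
}
```

Running lsmod | missing_habana_modules prints nothing when all four modules are loaded and otherwise lists the missing ones.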
Run hl-smi and check its output. Verify that the driver version in the hl-smi output matches the installed Intel Gaudi software version and that the temperature ("Temp" column in the output) shows a non-zero value. If the temperature reads "0C", there was a problem during card initialization. In this case, reboot the system and/or verify that the driver installation steps were performed correctly.
Run dmesg and ensure no errors are reported:
$ dmesg | grep habana
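Scanning the driver messages by eye is error-prone on a busy log. The following is a rough heuristic sketch, not an official check: it assumes that problem messages contain the words "error" or "fail", which may miss some issues or flag harmless lines. habana_dmesg_errors is a hypothetical helper name.

```shell
# Hypothetical helper: from dmesg output on stdin, print habana-related lines
# that mention an error or failure. The keyword list is an assumption.
habana_dmesg_errors() {
    grep -i 'habana' | grep -iE 'error|fail' || true
}
```

For example, dmesg | habana_dmesg_errors prints only the suspicious lines; empty output suggests no matching errors, but the full dmesg | grep habana output remains the authoritative source.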
Re-check if the SPI FW version prefix matches the Intel Gaudi software driver version prefix by running the hl-smi command below:
$ hl-smi -L | grep SPI
Firmware [SPI] Version: Preboot version hl-gaudi2-1.16.0-fw-50.1.2-sec-8 (Jul 20 2023 - 17:57:23)
Firmware [SPI] Version: Preboot version hl-gaudi2-1.16.0-fw-50.1.2-sec-8 (Jul 20 2023 - 17:57:23)
Firmware [SPI] Version: Preboot version hl-gaudi2-1.16.0-fw-50.1.2-sec-8 (Jul 20 2023 - 17:57:23)
Firmware [SPI] Version: Preboot version hl-gaudi2-1.16.0-fw-50.1.2-sec-8 (Jul 20 2023 - 17:57:23)
Firmware [SPI] Version: Preboot version hl-gaudi2-1.16.0-fw-50.1.2-sec-8 (Jul 20 2023 - 17:57:23)
Firmware [SPI] Version: Preboot version hl-gaudi2-1.16.0-fw-50.1.2-sec-8 (Jul 20 2023 - 17:57:23)
Firmware [SPI] Version: Preboot version hl-gaudi2-1.16.0-fw-50.1.2-sec-8 (Jul 20 2023 - 17:57:23)
Firmware [SPI] Version: Preboot version hl-gaudi2-1.16.0-fw-50.1.2-sec-8 (Jul 20 2023 - 17:57:23)
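The prefix comparison can be sketched as follows. This example assumes that "version prefix" means the major.minor part of the release number (1.16 in the output above); that interpretation is an assumption, not something the text states. version_prefix and same_release_prefix are hypothetical helper names.

```shell
# Hypothetical helpers (not part of the Intel Gaudi software).

# Reduce a version-bearing string to major.minor, taken from the first
# x.y.z number it contains.
version_prefix() {
    printf '%s\n' "$1" | grep -oE '[0-9]+\.[0-9]+\.[0-9]+' | head -n 1 | cut -d. -f1,2
}

# Succeed when both strings contain a version and share the same prefix.
same_release_prefix() {
    first="$(version_prefix "$1")"
    second="$(version_prefix "$2")"
    [ -n "$first" ] && [ "$first" = "$second" ]
}
```

For instance, same_release_prefix hl-gaudi2-1.16.0-fw-50.1.2-sec-8 1.16.2-2 succeeds because both strings reduce to 1.16, while a driver from a different major.minor release would make it fail.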
Re-check all SW components by running the apt list command below:
$ apt list --installed | grep habana
habanalabs-container-runtime/focal,now 1.16.2-2 amd64 [installed]
habanalabs-dkms/focal,focal,now 1.16.2-2 all [installed]
habanalabs-firmware-tools/focal,now 1.16.2-2 amd64 [installed]
habanalabs-firmware/focal,now 1.16.2-2 amd64 [installed]
habanalabs-graph/focal,now 1.16.2-2 amd64 [installed]
habanalabs-qual/focal,now 1.16.2-2 amd64 [installed]
habanalabs-thunk/focal,focal,now 1.16.2-2 all [installed]
habanatools/focal,now 1.16.2-2 amd64 [installed]
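A quick way to spot a partially upgraded system is to confirm that every installed habana package reports the same version. This is a minimal sketch, assuming the apt list line format shown above, where the version is the second whitespace-separated field; the function names are hypothetical.

```shell
# Hypothetical helpers (not part of the Intel Gaudi software).

# Print the distinct versions of habana* packages listed on stdin.
habana_package_versions() {
    awk '/^habana/ {print $2}' | sort -u
}

# Succeed only when exactly one version is installed across all packages.
check_single_version() {
    versions="$(habana_package_versions)"
    # Count non-empty lines; grep -c exits non-zero when the count is 0.
    count="$(printf '%s\n' "$versions" | grep -c . || true)"
    if [ "$count" -eq 1 ]; then
        echo "OK: all habana packages at $versions"
    else
        echo "FAIL: expected one version, found $count" >&2
        return 1
    fi
}
```

A possible invocation is apt list --installed 2>/dev/null | check_single_version, where the redirection hides the apt warning about its unstable command-line interface.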
Run the hl_qual test suite as a hardware sanity check. The suite includes several tests that should be run on the system. See the Qualification Tool Library Guide (hl_qual) for the exact procedures and prerequisite steps. To confirm that all hardware components function and interact with each other correctly, run the following test first:
$ ./hl_qual -gaudi2 -c all -rmod parallel -f2 -l extreme -t 60