Platform Upgrade and Full System Installation
If you are an IT Administrator performing a full system installation, follow the steps below. These steps assume that the physical system is fully assembled, the OS is loaded, and the software is ready to be installed:
BMC Access¶
Ensure access to the BMC via IP address, whether the BMC IP is set to static or dynamic:
BMC enables access to the CPU subsystem and Gaudi2 subsystem (HIB/UBB).
HLS-2 has two physical RJ45 ports for BMC access and two IP addresses, one each for the CPU and Gaudi2 subsystems.
The Supermicro server has one physical RJ45 port and one IP address for accessing both the CPU and Gaudi2 subsystems.
Check Gaudi on the Platform¶
Check if all eight Gaudi cards are visible on the system by running the lspci command below:
$ lspci -d :1020: -nn
19:00.0 Processing accelerators [1200]: Habana Labs Ltd. Device [1da3:1020] (rev 01)
1a:00.0 Processing accelerators [1200]: Habana Labs Ltd. Device [1da3:1020] (rev 01)
43:00.0 Processing accelerators [1200]: Habana Labs Ltd. Device [1da3:1020] (rev 01)
44:00.0 Processing accelerators [1200]: Habana Labs Ltd. Device [1da3:1020] (rev 01)
b3:00.0 Processing accelerators [1200]: Habana Labs Ltd. Device [1da3:1020] (rev 01)
b4:00.0 Processing accelerators [1200]: Habana Labs Ltd. Device [1da3:1020] (rev 01)
cc:00.0 Processing accelerators [1200]: Habana Labs Ltd. Device [1da3:1020] (rev 01)
cd:00.0 Processing accelerators [1200]: Habana Labs Ltd. Device [1da3:1020] (rev 01)
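The visibility check can be scripted so that a missing card fails loudly. The sketch below is one possible approach, assuming the lspci output format shown above. The function name check_gaudi_count and its default of eight devices are illustrative and not part of the Habana tooling.

```shell
# Hypothetical helper (not part of the Habana tooling): reads lspci output on
# stdin and succeeds only if the expected number of Gaudi devices is listed.
check_gaudi_count() {
    local expected="${1:-8}"
    local found
    # Count lines carrying the vendor:device ID shown in the output above.
    # grep -c exits non-zero when the count is 0, hence the "|| true".
    found=$(grep -c '\[1da3:1020\]' || true)
    echo "found ${found} of ${expected} Gaudi devices"
    [ "$found" -eq "$expected" ]
}

# Intended use on the target system:
#   lspci -d :1020: -nn | check_gaudi_count 8
```

The function returns a non-zero status when the count differs, so it can gate later steps in an installation script.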
Check and/or Upgrade the System FW Components¶
First, check the platform level components: CPLD, Platform BIOS, and BMC Firmware. To upgrade the CPLD, refer to the Gaudi2 HL225 Porting Guide available here. For the platform BIOS and BMC Firmware, refer to your system vendor documentation.
This first part covers the habanalabs-firmware-odm package, which installs the Gaudi2 firmware. See Support Matrix for the supported FW versions.
When running on HLS:
Make sure the CPLD version is updated. To upgrade your CPLD, refer to the Gaudi2 HL225 Porting Guide document available here.
Make sure that your PCI switch version is updated. Refer to your system vendor documentation for details.
Note
Please contact your local Habana or Intel support representative if you do not have access to the Gaudi2 HL225 Porting Guide.
Artifactory Access Token¶
Tokens for automated access to Artifactory can be generated and used as follows:
Go to the Artifacts page.
Click on the Welcome button on the upper right corner of the page.
Select Edit Profile and generate an API Key.
Copy the API Key for later use.
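Once generated, the key can be passed to Artifactory in an HTTP header for scripted downloads. The sketch below assumes the server accepts the standard JFrog X-JFrog-Art-Api header. The function name artifactory_get, the ARTIFACTORY_API_KEY variable, and the URL and output path arguments are illustrative placeholders.

```shell
# Hypothetical helper (not part of the Habana tooling): downloads one artifact
# using an API key taken from the ARTIFACTORY_API_KEY environment variable.
# Assumption: the server accepts the standard JFrog X-JFrog-Art-Api header.
artifactory_get() {
    local url="$1" out="$2"
    if [ -z "${ARTIFACTORY_API_KEY:-}" ]; then
        echo "ARTIFACTORY_API_KEY is not set" >&2
        return 1
    fi
    # -f: fail on HTTP errors, -sS: quiet but show errors, -L: follow redirects.
    curl -fsSL -H "X-JFrog-Art-Api: ${ARTIFACTORY_API_KEY}" -o "$out" "$url"
}

# Intended use (the URL is a placeholder for the artifact you need):
#   export ARTIFACTORY_API_KEY=<API Key copied above>
#   artifactory_get <artifact URL> ./package.deb
```

Reading the key from an environment variable keeps it out of the command line and shell history.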
Ubuntu - Package Installation¶
Installing the package with an internet connection available allows the required dependencies for the SynapseAI package to be downloaded and installed automatically (via apt-get, pip install, etc.).
Note
Running the commands below installs the latest version only. To install a different version, run the same commands with a specific build number.
Package Retrieval¶
Download and install the public key:
curl -X GET https://vault.habana.ai/artifactory/api/gpg/key/public | sudo apt-key add --
Get the name of the operating system:
lsb_release -c | awk '{print $2}'
Create an apt source file, /etc/apt/sources.list.d/artifactory.list, with the following content: deb https://vault.habana.ai/artifactory/debian <OS name from previous step> main
Update Debian cache:
sudo dpkg --configure -a
sudo apt-get update
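The source file step can be scripted as follows. This is a minimal sketch that builds the line from the OS codename. The function name habana_apt_source_line is hypothetical, and the repository URL and file path are the ones given in the steps above.

```shell
# Hypothetical helper: prints the apt source line for a given OS codename.
habana_apt_source_line() {
    local codename="$1"
    printf 'deb https://vault.habana.ai/artifactory/debian %s main\n' "$codename"
}

# Intended use on the target system:
#   codename=$(lsb_release -c | awk '{print $2}')
#   habana_apt_source_line "$codename" | sudo tee /etc/apt/sources.list.d/artifactory.list
```

Piping through sudo tee writes the root-owned file without running the whole shell as root.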
Firmware Verification and Installation¶
To check the version of the Gaudi SPI firmware currently installed, run the following command:
hl-smi -L | grep SPI
At the system level the requirement is to update the SPI firmware only. Refer to the Support Matrix for the exact version. The below shows the expected output for the 1.12.1-10 release:
Firmware [SPI] Version : Preboot version hl-gaudi2-1.12.1-fw-46.0.5-sec-5 (Jul 20 2023 - 17:57:23)
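Comparing the reported version against the Support Matrix by eye is error-prone across eight devices. The sketch below shows one way to automate it, assuming the output format shown above. Both function names are hypothetical, and the expected version string must be supplied by the caller.

```shell
# Hypothetical helpers, not part of the Habana tooling.

# Extracts the preboot version token from `hl-smi -L | grep SPI` output on stdin.
spi_preboot_versions() {
    sed -n 's/.*Preboot version \([^ ]*\).*/\1/p'
}

# Succeeds only if at least one device is reported and every device reports
# the expected version (to be taken from the Support Matrix).
spi_versions_match() {
    local expected="$1" versions mismatches
    versions=$(spi_preboot_versions)
    [ -n "$versions" ] || return 1
    # Count version lines that differ from the expected string.
    mismatches=$(printf '%s\n' "$versions" | grep -c -v -x -F -e "$expected" || true)
    [ "$mismatches" -eq 0 ]
}

# Intended use on the target system:
#   hl-smi -L | grep SPI | spi_versions_match hl-gaudi2-1.12.1-fw-46.0.5-sec-5
```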
You can install the firmware using the following command (be sure to point to the location of the FW package downloaded from the vault):
sudo apt install -y ./habanalabs-firmware-odm-1.13.0-463.amd64.deb
Update FW¶
To update the firmware, run the following command:
sudo hl-fw-loader
Note
To update the firmware on your system, make sure to remove write protection before burning SPI components. Removing write protection is outside the scope of this document.
Upgrade eROM¶
When upgrading the firmware on Gaudi2, the eROM should also be upgraded. Before running the procedure, make sure you have the following:
Root privileges
BMC access
“gaudi2-agent-fw_loader-fit_erom.itb” file
Note
Upgrading eROM is required if you are not using the latest eROM version.
Refer to Support Matrix for the latest eROM version.
To verify the installed eROM version, run sudo hl-smi --fw-version.
To upgrade the eROM, perform the following:
Unload the drivers. If the habanalabs-dkms driver is already installed, the drivers must be unloaded before the eROM update:
sudo modprobe -r habanalabs && sudo modprobe -r habanalabs_cn
sudo modprobe -r habanalabs_ib && sudo modprobe -r habanalabs_en
Disable eROM write protection by writing the value of 0x2e to address 8 of each one of the OAM CPLDs.
Upgrade the eROM by running the following command (be sure to point to the file location of the .itb file):
hl-fw-loader -f ./path_to_file/gaudi2-agent-fw_loader-fit_erom.itb
Enable eROM write protection by writing the value of 0x26 to address 8 of each of the OAM CPLDs.
Software Install and Update¶
Update the SW driver, SynapseAI components and Habana specific environment. See SynapseAI Software Stack and Driver Installation.
Update Environment Variables and More¶
When the installation is complete, close the shell and reopen it. Alternatively, run the following:
source /etc/profile.d/habanalabs.sh
source ~/.bashrc
Update EEPROM (for HLS Users)¶
Updating the EEPROM is required for HLS users only. EEPROM burning is outside the scope of this document.
Driver Verification¶
Run lsmod to verify the driver is loaded and running:
$ lsmod | grep habana
habanalabs 1572864 0
habanalabs_cn 454656 8
habanalabs_ib 73728 8
habanalabs_en 61440 8
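For unattended installs, the same check can be expressed as a function that names any module that failed to load. This is a sketch only: the function name missing_habana_modules is hypothetical, and the module list is taken from the lsmod output above.

```shell
# Hypothetical helper: reads lsmod output on stdin and prints the name of each
# of the four driver modules listed above that is NOT loaded.
missing_habana_modules() {
    awk '
        { seen[$1] = 1 }   # first column of lsmod output is the module name
        END {
            n = split("habanalabs habanalabs_cn habanalabs_ib habanalabs_en", want, " ")
            for (i = 1; i <= n; i++)
                if (!(want[i] in seen)) print want[i]
        }
    '
}

# Intended use on the target system:
#   lsmod | missing_habana_modules
```

Empty output means all four modules are loaded.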
System Verifications and Final Tests¶
Run hl-smi and verify the following:
Ensure the driver version in the hl-smi output matches the installed SynapseAI version.
Ensure the temperature (under the “Temp” column in the hl-smi output) shows a non-zero value. If the temperature output is “0C”, there was a problem during card initialization. In this case, reboot the system and/or verify that the driver installation steps were performed correctly.
Run dmesg and ensure no errors are reported:
$ dmesg | grep habana
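Because the driver also logs routine informational messages, it can help to narrow the output to lines that look like failures. The filter below is a rough sketch. The keyword list (error, fail, timeout) is an assumption rather than a documented set of driver messages, and the function name is hypothetical.

```shell
# Hypothetical helper: reads kernel log text on stdin and prints only the
# habana-related lines that contain a failure keyword.
# Assumption: the keyword list below is a heuristic, not an official list.
habana_dmesg_errors() {
    # "|| true" keeps the exit status zero when nothing matches.
    grep -i 'habana' | grep -i -E 'error|fail|timeout' || true
}

# Intended use on the target system:
#   dmesg | habana_dmesg_errors
```

Empty output suggests a clean log, but a full review of the dmesg output remains the authoritative check.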
Re-check if the SPI FW version matches the SynapseAI driver version by running hl-smi below:
$ hl-smi -L | grep SPI
Firmware [SPI] Version: Preboot version hl-gaudi2-1.13.0-fw-46.2.0-sec-6 (Jul 20 2023 - 17:57:23)
Firmware [SPI] Version: Preboot version hl-gaudi2-1.13.0-fw-46.2.0-sec-6 (Jul 20 2023 - 17:57:23)
Firmware [SPI] Version: Preboot version hl-gaudi2-1.13.0-fw-46.2.0-sec-6 (Jul 20 2023 - 17:57:23)
Firmware [SPI] Version: Preboot version hl-gaudi2-1.13.0-fw-46.2.0-sec-6 (Jul 20 2023 - 17:57:23)
Firmware [SPI] Version: Preboot version hl-gaudi2-1.13.0-fw-46.2.0-sec-6 (Jul 20 2023 - 17:57:23)
Firmware [SPI] Version: Preboot version hl-gaudi2-1.13.0-fw-46.2.0-sec-6 (Jul 20 2023 - 17:57:23)
Firmware [SPI] Version: Preboot version hl-gaudi2-1.13.0-fw-46.2.0-sec-6 (Jul 20 2023 - 17:57:23)
Firmware [SPI] Version: Preboot version hl-gaudi2-1.13.0-fw-46.2.0-sec-6 (Jul 20 2023 - 17:57:23)
Re-check all SW components by running the apt list command below:
$ apt list --installed | grep habana
habanalabs-container-runtime/focal,now 1.13.0-463 amd64 [installed]
habanalabs-dkms/focal,focal,now 1.13.0-463 all [installed]
habanalabs-firmware-tools/focal,now 1.13.0-463 amd64 [installed]
habanalabs-firmware/focal,now 1.13.0-463 amd64 [installed]
habanalabs-graph/focal,now 1.13.0-463 amd64 [installed]
habanalabs-qual/focal,now 1.13.0-463 amd64 [installed]
habanalabs-thunk/focal,focal,now 1.13.0-463 all [installed]
habanatools/focal,now 1.13.0-463 amd64 [installed]
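A mixed set of package versions is easy to miss in a long listing. The following sketch, assuming the apt list output format shown above, prints each distinct version found among the habana packages. The function name habana_package_versions is hypothetical.

```shell
# Hypothetical helper: reads `apt list --installed` output on stdin and prints
# each distinct version found among packages whose name starts with "habana".
habana_package_versions() {
    # Field 2 of each package line is the version, e.g. "1.13.0-463".
    awk '/^habana/ { print $2 }' | sort -u
}

# Intended use on the target system:
#   apt list --installed 2>/dev/null | habana_package_versions
```

A single line of output means all components are at the same release. More than one line indicates a partial upgrade.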
Run the hl_qual test suite as a hardware sanity check. The hl_qual test suite includes several tests that should be run on the system; run the following test first to get a basic confirmation of hardware sanity. See Qualification Library Guide (hl_qual Tool) for the exact procedures and the prerequisite steps.
$ ./hl_qual -gaudi2 -c all -rmod parallel -f2 -l extreme -t 60