Enabling InfiniBand NICs (Verbs) for Host NIC Scaling

Enabling Verbs

To use the verbs providers, complete the following steps:

  • Re-configuring libfabric with verbs enabled.

  • Installing UCX package to allow communication via InfiniBand.

  • Re-configuring MPI with the UCX package and verbs support.

Reconfiguring libfabric

For a clean installation, remove the local MPI and libfabric packages, unset the environment variables that point to them, and create a new directory for the rebuilt packages:

  1. Remove the local installations:

    %> rm -rf /opt/amazon/openmpi
    %> rm -rf /opt/amazon/efa
    
  2. Unset MPI environment variables:

    %> unset MPICC
    %> unset OPAL_PREFIX
    %> unset MPI_ROOT
    

If other environment variables point to /opt/amazon/openmpi/ or /opt/amazon/efa/, reset them so that they no longer reference those paths. Save their original values first, so that you can later re-export them with the new MPI location.
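
One way to find such variables is to search the current environment for the old install paths. The sketch below is illustrative only: the function name list_stale_vars and the output file /tmp/old_mpi_env.txt (overridable through the hypothetical STALE_VARS_FILE variable) are assumptions, not part of the original procedure.

```shell
# Print every exported variable whose value references the old install paths.
# list_stale_vars is a hypothetical helper name used for illustration.
list_stale_vars() {
  env | grep -E '^[A-Za-z_][A-Za-z0-9_]*=.*/opt/amazon/(openmpi|efa)' || true
}

# Save the matches so the original values can be consulted later.
# The file location is an assumption; choose any writable path.
list_stale_vars > "${STALE_VARS_FILE:-/tmp/old_mpi_env.txt}"
```

Each line of the saved file has the form NAME=value, which makes it straightforward to rewrite the paths once the new packages are installed.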

  1. Make a clean directory for the new installations or use /opt/amazon:

    %> export NEW_PKGS_DIR=<your new, clean directory or /opt/amazon>
    %> mkdir -p $NEW_PKGS_DIR
    
  2. Re-install and configure libfabric with verbs support. The libfabric version below is only an example; use the version recommended for your environment:

    %> wget https://github.com/ofiwg/libfabric/releases/download/v1.16.1/libfabric-1.16.1.tar.bz2 -P /tmp/lib
    %> cd /tmp/lib
    %> tar -xf ./libfabric-1.16.1.tar.bz2
    %> cd ./libfabric-1.16.1
    %> ./configure --prefix=$NEW_PKGS_DIR/efa/ --enable-psm3-verbs --enable-verbs=yes --enable-debug
    %> make
    %> make install
    

Verify that fi_info -l lists the verbs provider:

%> $NEW_PKGS_DIR/efa/bin/fi_info -l
usnic:
    version: 1.0
verbs:                     <-------------------------
    version: 116.10        <-------------------------
ofi_rxm:
    version: 116.10
ofi_rxd:
    version: 116.10
shm:
    version: 116.10
udp:
    version: 116.10
tcp:
    version: 116.10
sockets:
    version: 116.10
net:
    version: 116.10
ofi_hook_perf:
    version: 116.10
ofi_hook_debug:
    version: 116.10
ofi_hook_noop:
    version: 116.10
ofi_hook_hmem:
    version: 116.10
ofi_hook_dmabuf_peer_mem:
    version: 116.10
ofi_mrail:
    version: 116.10
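
The same check can be scripted rather than read by eye. This is a minimal sketch, assuming the output format shown above (a provider name at the start of a line, followed by a colon); the helper name has_provider is hypothetical.

```shell
# Succeeds when the provider list read from stdin contains the named provider.
# has_provider is a hypothetical helper name used for illustration.
has_provider() {
  grep -q "^$1:"
}

# Example usage (requires the libfabric build from the previous step):
#   "$NEW_PKGS_DIR/efa/bin/fi_info" -l | has_provider verbs \
#     && echo "verbs provider available" \
#     || echo "verbs provider missing: re-check the configure flags"
```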

Installing the UCX Package

  1. Install the required Linux packages, libtool and autoconf:

    %> sudo apt update
    %> sudo apt-get install libtool
    %> sudo apt-get install autoconf
    
  2. Install the UCX package:

    %> wget https://github.com/openucx/ucx/releases/download/v1.13.1/ucx-1.13.1.tar.gz -P /tmp/ucx
    %> cd /tmp/ucx
    %> tar -xf ./ucx-1.13.1.tar.gz
    %> cd ./ucx-1.13.1
    %> ./configure --prefix=$NEW_PKGS_DIR/ucx
    %> make
    %> make install
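
After the build, it can be useful to confirm that UCX detected InfiniBand transports. The sketch below filters the device listing printed by ucx_info -d. It assumes that transports appear on lines containing "Transport:" and that InfiniBand transports have names ending in _verbs or _mlx5 (for example rc_verbs). The helper name list_ib_transports is hypothetical.

```shell
# Filter `ucx_info -d` output (read from stdin) down to InfiniBand transports.
# list_ib_transports is a hypothetical helper name; the line format it
# matches is an assumption about ucx_info output.
list_ib_transports() {
  grep -E 'Transport:[[:space:]]*[a-z]+_(verbs|mlx5)' || true
}

# Example usage (requires the UCX build from this step):
#   "$NEW_PKGS_DIR/ucx/bin/ucx_info" -v
#   "$NEW_PKGS_DIR/ucx/bin/ucx_info" -d | list_ib_transports
```

An empty result suggests that UCX was built without InfiniBand support or that no InfiniBand device is visible on the host.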
    

Reconfiguring MPI

  1. Install the Open MPI package. The Open MPI version below is only an example; use the version recommended for your environment:

    %> wget https://download.open-mpi.org/release/open-mpi/v4.1/openmpi-4.1.4.tar.bz2 -P /tmp/openmpi
    %> cd /tmp/openmpi
    %> tar -xf ./openmpi-4.1.4.tar.bz2
    %> cd ./openmpi-4.1.4
    %> ./configure --prefix=$NEW_PKGS_DIR/mpi --with-sge --disable-builtin-atomics --enable-orterun-prefix-by-default --with-ucx=$NEW_PKGS_DIR/ucx --with-verbs
    %> make
    %> make install
    
  2. Export the MPI environment variables to point to the new MPI location:

    %> export MPICC=$NEW_PKGS_DIR/mpi/bin/mpicc
    %> export OPAL_PREFIX=$NEW_PKGS_DIR/mpi
    %> export MPI_ROOT=$NEW_PKGS_DIR/mpi
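
To confirm that the rebuilt Open MPI picked up UCX, the component list printed by ompi_info can be searched for a UCX entry. This is a sketch under the assumption that ompi_info reports components on lines of the form "MCA pml: ucx (...)"; the helper name has_ucx_component is hypothetical.

```shell
# Succeeds when `ompi_info` output (read from stdin) lists a UCX component.
# has_ucx_component is a hypothetical helper name used for illustration.
has_ucx_component() {
  grep -Eq 'MCA[[:space:]]+(pml|osc):[[:space:]]+ucx'
}

# Example usage (requires the Open MPI build from step 1):
#   "$NEW_PKGS_DIR/mpi/bin/ompi_info" | has_ucx_component \
#     && echo "Open MPI was built with UCX" \
#     || echo "UCX component not found: re-check --with-ucx"
```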
    

Note

If you reset any environment variables earlier so that they no longer include the local mpi/efa installation, set them again now so that they point to the new installation:

%> export LD_LIBRARY_PATH=$NEW_PKGS_DIR/mpi/lib:$NEW_PKGS_DIR/efa/lib:$LD_LIBRARY_PATH
%> export PATH=$NEW_PKGS_DIR/mpi/bin:$NEW_PKGS_DIR/efa/bin:$PATH

Installing Mellanox OFED (MLNX_OFED) Driver

The Mellanox driver requires the Mellanox OpenFabrics Enterprise Distribution (MLNX_OFED) to be installed on the system. MLNX_OFED is needed for proper functionality and compatibility with the Mellanox driver.

If MLNX_OFED is not installed on your system, follow the steps below. Depending on your system’s configuration and requirements, additional steps or dependencies may apply. See also https://enterprise-support.nvidia.com/s/article/howto-install-mlnx-ofed-driver.

  1. Check whether MLNX_OFED is installed by running ofed_info. If the command is not found, MLNX_OFED is not installed.

  2. To install MLNX_OFED, run the following command, replacing the x placeholders with the required version (or with the customized version provided to you as a customer):

    dpkg -i mlnx-ofed-kernel-dkms_5.X-OFED.5.xxxxxxx_all.deb mlnx-ofed-kernel-utils_5.x-OF.xxxxx_amd64.deb mlnx-tools_5.xxxxxx_amd64.deb
    

Note

If a customized version was provided to you as a customer, use that version and not the public ones released by Mellanox.

  1. Once the installation is complete, reboot the machine.

  2. Install the habanalabs-dkms package. See Installation Guide and On-Premise System Update.

Note

The supported MLNX_OFED versions are 5.4-1.1.1.1.8 and 5.0-2.1.8.0.
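
The check in step 1 can be scripted as follows. This is a minimal sketch: the helper name have_cmd is hypothetical, and ofed_info -s is used to print the short version string so that it can be compared by eye with the supported versions listed above.

```shell
# Report whether a command is available on PATH.
# have_cmd is a hypothetical helper name used for illustration.
have_cmd() {
  command -v "$1" >/dev/null 2>&1
}

if have_cmd ofed_info; then
  ofed_info -s   # prints the installed MLNX_OFED version string
else
  echo "MLNX_OFED not found: ofed_info is not on PATH"
fi
```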