Enabling InfiniBand NICs (Verbs) for Host NIC Scaling

Enabling Verbs

To use the verbs provider, complete the following steps:

  • Reconfigure libfabric with verbs enabled.

  • Install the UCX package to allow communication over InfiniBand.

  • Reconfigure MPI with the UCX package and verbs support.

Reconfiguring Libfabric

For a clean installation, remove the local MPI and libfabric packages, unset the environment variables that point to them, and create a new directory:

  1. Remove the local installations:

    %> rm -rf /opt/amazon/openmpi
    %> rm -rf /opt/amazon/efa
    
  2. Unset MPI environment variables:

    %> unset MPICC
    %> unset OPAL_PREFIX
    %> unset MPI_ROOT
    

    If other environment variables point to /opt/amazon/openmpi/ or /opt/amazon/efa/, reset those variables as well. Save their original contents first so that you can later export them with the new MPI location.

  3. Make a clean directory for the new installations, or use /opt/amazon:

    %> export NEW_PKGS_DIR=<your new, clean directory or /opt/amazon>
    %> mkdir -p $NEW_PKGS_DIR
    
  4. Reinstall and configure libfabric with verbs support. The libfabric version below is only an example; use the libfabric version listed in the Support Matrix:

    %> wget https://github.com/ofiwg/libfabric/releases/download/v1.16.1/libfabric-1.16.1.tar.bz2 -P /tmp/lib
    %> cd /tmp/lib
    %> tar -xf ./libfabric-1.16.1.tar.bz2
    %> cd ./libfabric-1.16.1
    %> ./configure --prefix=$NEW_PKGS_DIR/efa/ --enable-psm3-verbs --enable-verbs=yes --enable-debug
    %> make
    %> make install
    
  5. Verify that fi_info -l lists the verbs provider:

    %> $NEW_PKGS_DIR/efa/bin/fi_info -l
    usnic:
        version: 1.0
    verbs:                     <-------------------------
        version: 116.10        <-------------------------
    ofi_rxm:
        version: 116.10
    ofi_rxd:
        version: 116.10
    shm:
        version: 116.10
    udp:
        version: 116.10
    tcp:
        version: 116.10
    sockets:
        version: 116.10
    net:
        version: 116.10
    ofi_hook_perf:
        version: 116.10
    ofi_hook_debug:
        version: 116.10
    ofi_hook_noop:
        version: 116.10
    ofi_hook_hmem:
        version: 116.10
    ofi_hook_dmabuf_peer_mem:
        version: 116.10
    ofi_mrail:
        version: 116.10
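
The check in step 5 can also be scripted. The following is a minimal sketch, not part of the official procedure: has_provider is a hypothetical helper name, and it assumes that fi_info -l prints each provider name on its own line followed by a colon, as in the output above.

```shell
# Hypothetical helper (not part of libfabric): succeeds if the provider
# named in $1 appears as a "name:" line in `fi_info -l` output on stdin.
has_provider() {
    grep -Eq "^[[:space:]]*${1}:[[:space:]]*$"
}

# Example usage, assuming NEW_PKGS_DIR is set as in step 3:
#   "$NEW_PKGS_DIR/efa/bin/fi_info" -l | has_provider verbs \
#       && echo "verbs provider is available"
```

A non-zero exit status suggests that libfabric was configured without verbs support, in which case step 4 should be repeated.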
    

Installing UCX Package

  1. Install the required Linux packages, libtool and autoconf:

    %> sudo apt update
    %> sudo apt-get install libtool
    %> sudo apt-get install autoconf
    
  2. Install the UCX package:

    %> wget https://github.com/openucx/ucx/releases/download/v1.13.1/ucx-1.13.1.tar.gz -P /tmp/ucx
    %> cd /tmp/ucx
    %> tar -xf ./ucx-1.13.1.tar.gz
    %> cd ./ucx-1.13.1
    %> ./configure --prefix=$NEW_PKGS_DIR/ucx
    %> make
    %> make install
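
After the build, it can be useful to confirm which transports UCX detects. The sketch below is an illustration under stated assumptions: list_transports is a hypothetical helper name, and it assumes that ucx_info -d reports each transport on a line of the form "#      Transport: <name>". The exact names reported depend on the hardware and the UCX build.

```shell
# Hypothetical helper (not part of UCX): reads `ucx_info -d` output on stdin
# and prints the unique transport names, one per line, sorted.
list_transports() {
    sed -n 's/^#[[:space:]]*Transport:[[:space:]]*//p' | sort -u
}

# Example usage, assuming NEW_PKGS_DIR is set as in the libfabric section:
#   "$NEW_PKGS_DIR/ucx/bin/ucx_info" -d | list_transports
```

If no InfiniBand-related transports appear in the list, UCX was most likely built on a system where the verbs libraries were not found.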
    

Reconfiguring MPI

  1. Install the Open MPI package. The Open MPI version below is only an example; use the Open MPI version listed in the Support Matrix:

    %> wget https://download.open-mpi.org/release/open-mpi/v4.1/openmpi-4.1.4.tar.bz2 -P /tmp/openmpi
    %> cd /tmp/openmpi
    %> tar -xf ./openmpi-4.1.4.tar.bz2
    %> cd ./openmpi-4.1.4
    %> ./configure --prefix=$NEW_PKGS_DIR/mpi --with-sge --disable-builtin-atomics --enable-orterun-prefix-by-default --with-ucx=$NEW_PKGS_DIR/ucx --with-verbs
    %> make
    %> make install
    
  2. Export the MPI environment variables to point to the new MPI location:

    %> export MPICC=$NEW_PKGS_DIR/mpi/bin/mpicc
    %> export OPAL_PREFIX=$NEW_PKGS_DIR/mpi
    %> export MPI_ROOT=$NEW_PKGS_DIR/mpi
    

Note

If any environment variables were reset earlier so that they no longer include the local MPI/EFA installation paths, set them again with the new installation paths:

%> export LD_LIBRARY_PATH=$NEW_PKGS_DIR/mpi/lib:$NEW_PKGS_DIR/efa/lib:$LD_LIBRARY_PATH
%> export PATH=$NEW_PKGS_DIR/mpi/bin:$NEW_PKGS_DIR/efa/bin:$PATH
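
To confirm that the search paths now include the new installation, a small check such as the following can help. This is a sketch only; path_contains is a hypothetical helper name introduced here for illustration.

```shell
# Hypothetical helper: succeeds if directory $2 is one of the
# colon-separated entries of the path list $1.
path_contains() {
    case ":$1:" in
        *":$2:"*) return 0 ;;
        *)        return 1 ;;
    esac
}

# Example usage, assuming NEW_PKGS_DIR is set as above:
#   path_contains "$LD_LIBRARY_PATH" "$NEW_PKGS_DIR/mpi/lib" || echo "mpi/lib is missing"
#   path_contains "$PATH" "$NEW_PKGS_DIR/mpi/bin" || echo "mpi/bin is missing"
```

The helper matches whole entries only, so a parent directory such as $NEW_PKGS_DIR alone does not count as a match for $NEW_PKGS_DIR/mpi/lib.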

Installing Mellanox OFED (MLNX_OFED) Driver

For a successful installation of the Mellanox driver, Mellanox OpenFabrics Enterprise Distribution (MLNX_OFED) must be installed on the system. MLNX_OFED plays a vital role in ensuring proper functionality and compatibility with the Mellanox driver.

If MLNX_OFED is not installed on your system, follow the steps below. Depending on your system’s configuration and requirements, additional steps or dependencies may apply. You can also refer to https://enterprise-support.nvidia.com/s/article/howto-install-mlnx-ofed-driver.

  1. Check whether MLNX_OFED is installed by running ofed_info. If the command is not found, MLNX_OFED is not installed.

  2. To install MLNX_OFED, run the following command, replacing the x placeholders with the required version (or with the customized version provided to you as a customer):

    dpkg -i mlnx-ofed-kernel-dkms_5.X-OFED.5.xxxxxxx_all.deb mlnx-ofed-kernel-utils_5.x-OF.xxxxx_amd64.deb mlnx-tools_5.xxxxxx_amd64.deb
    

    Note

    If a customized version was provided to you as a customer, use that version rather than the public versions released by Mellanox.

  3. Once the installation is complete, reboot the machine.

  4. Install the habanalabs-dkms as shown in the Installation Guide and On-Premise System Update.
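
The check in step 1 can be scripted, for example before and after the reboot in step 3. The following is a minimal sketch: ofed_status is a hypothetical helper name, and it assumes that ofed_info -s prints a short version string when MLNX_OFED is installed.

```shell
# Hypothetical helper: reports whether the ofed_info command is available
# and, if so, prints the short version string it returns.
ofed_status() {
    if command -v ofed_info >/dev/null 2>&1; then
        echo "installed: $(ofed_info -s)"
    else
        echo "not installed"
    fi
}

# Example usage:
#   ofed_status
```

Compare the reported version against the supported MLNX_OFED versions before installing habanalabs-dkms.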

Note

The supported MLNX_OFED versions are 5.4-1.1.1.1.8 and 5.0-2.1.8.0.