Scale-Out via Host-NIC

Multi-node scale-out for Gaudi accelerators over host NIC interfaces is provided by the HCCL library. Data transfer between nodes is supported through the Scale-Out via Host-NIC over OFI mode.

Scale-Out via Host-NIC over OFI

HCCL interacts with libFabric to utilize any underlying HW and networking mode.

Configuration Knobs

HCCL exposes several environment variables that control the behavior of scale-out communication over libFabric. The table below lists the required environment variables.

Environment Variable

Description

HCCL_OVER_OFI

Enables scale-out communication over OFI (libFabric). Possible values are 0 (disable) or 1 (enable). Default value is 0.

HCCL_SOCKET_IFNAME

Identifies the network interface(s) that should be used for scale-out communication.

HCL_COMM_ID

Identifies the root process (rank 0) of the global communicator group. Typically set to <IPaddress:port>, where the IP address is that of the network interface used by the root. This must be set for all HCCL processes when there is no alternate network over which to broadcast this value.
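As an illustration of how the three variables fit together, the following minimal Python sketch builds an environment for an HCCL process. The helper name build_hccl_ofi_env, the interface name eth0, and the address 10.0.0.1:5555 are hypothetical placeholders chosen for this example, not values from the HCCL documentation; substitute the interface and root address of your own cluster.

```python
import os


def build_hccl_ofi_env(ifname, root_ip, root_port, base_env=None):
    """Return a copy of the environment with the scale-out variables set.

    Hypothetical helper for illustration only; it is not part of HCCL.
    """
    if not ifname:
        raise ValueError("ifname must be a non-empty interface name")
    if not 0 < int(root_port) < 65536:
        raise ValueError("root_port must be in the range 1..65535")

    # Start from the given environment (or the current one) and copy it,
    # so the caller's environment is left untouched.
    env = dict(os.environ if base_env is None else base_env)
    env["HCCL_OVER_OFI"] = "1"                     # enable scale-out over OFI
    env["HCCL_SOCKET_IFNAME"] = ifname             # interface(s) used for scale-out
    env["HCL_COMM_ID"] = f"{root_ip}:{root_port}"  # <IPaddress:port> of rank 0
    return env


# Placeholder values; every process in the job would use the same HCL_COMM_ID.
example_env = build_hccl_ofi_env("eth0", "10.0.0.1", 5555)
for name in ("HCCL_OVER_OFI", "HCCL_SOCKET_IFNAME", "HCL_COMM_ID"):
    print(f"{name}={example_env[name]}")
```

The returned dictionary can be passed as the env argument when launching each process, which keeps the settings explicit instead of relying on the state of the calling shell.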

Usage

  1. Download HCCL OFI Wrapper.

  2. Build and install libFabric.

  3. Build HCCL OFI Wrapper.

Note

If you are using Habana Containers, step 2 is not required because libFabric is already installed by default.

  4. To verify the HCCL OFI Wrapper build, run your test with the environment variable HCCL_OVER_OFI=1 set.
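The verification run in the last step could be scripted along these lines. This is a sketch under stated assumptions: the source does not name a specific test command, so a Python one-liner that echoes the variable stands in for your actual test, and run_with_ofi is a hypothetical helper name. A zero exit code here shows only that the command ran with the variable set; confirming that traffic actually went through libFabric depends on the output of your own test.

```python
import os
import subprocess
import sys


def run_with_ofi(cmd):
    """Run cmd with HCCL_OVER_OFI=1 added to a copy of the current environment."""
    env = dict(os.environ)
    env["HCCL_OVER_OFI"] = "1"
    # capture_output collects stdout/stderr so the caller can inspect them.
    return subprocess.run(cmd, env=env, capture_output=True, text=True)


# Stand-in for the real test command: it only prints the variable it received.
stand_in_cmd = [sys.executable, "-c", "import os; print(os.environ['HCCL_OVER_OFI'])"]
result = run_with_ofi(stand_in_cmd)
print("exit code:", result.returncode)
print("child saw HCCL_OVER_OFI =", result.stdout.strip())
```

Replace stand_in_cmd with the command line of your own test to reuse the helper unchanged.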