Scale-Out via Host-NIC

Gaudi accelerators support multi-node scale-out through host NIC interfaces. The HCCL library provides this functionality and activates it either through an internal auto-detection mechanism or through explicit selection by the user. Data transfer between nodes is supported in Scale-Out via Host-NIC over OFI mode.

Scale-Out Auto-detection

By default, HCCL runs internal auto-detection logic to select the most efficient method for scaling between nodes. Modes are selected in the following priority order:

Priority  Scale-Out Mode               Conditions

P0        Gaudi NICs                   Gaudi scale-out NICs are connected

P1        Host NIC peer-direct         • AWS EFA OFI provider is available
                                       • libfabric version 1.16.0 or later
                                       • Linux kernel version 5.12 or later

P2        Host NICs using host memory  libfabric is available with one of the following providers: “efa”, “verbs” or “tcp”

P3        No Scale-Out
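The priority order above can be read as a simple first-match decision. The following is a minimal sketch of that reading, not HCCL's actual implementation, which is internal to the library. The `NodeState` class, its field names, and `select_scale_out_mode` are hypothetical names introduced for illustration, and the "AWS EFA OFI provider is available" condition is modeled as the provider name "efa" being present.

```python
import re
from dataclasses import dataclass
from typing import FrozenSet, Optional, Tuple


def parse_version(text: str) -> Tuple[int, ...]:
    """Parse the leading dotted number of a version string, e.g. '5.15.0-91-generic' -> (5, 15, 0)."""
    match = re.match(r"\d+(?:\.\d+)*", text)
    return tuple(int(part) for part in match.group(0).split(".")) if match else ()


@dataclass(frozen=True)
class NodeState:
    """Hypothetical snapshot of the facts the priority table depends on."""
    gaudi_nics_connected: bool
    ofi_providers: FrozenSet[str]       # libfabric providers found on the node
    libfabric_version: Optional[str]    # None means libfabric is not installed
    kernel_version: str


def select_scale_out_mode(state: NodeState) -> str:
    """Return the first mode in priority order (P0 to P3) whose conditions hold."""
    # P0: Gaudi scale-out NICs take precedence whenever they are connected.
    if state.gaudi_nics_connected:
        return "Gaudi NICs"
    if state.libfabric_version is not None:
        # P1: EFA provider, libfabric 1.16.0 or later, Linux kernel 5.12 or later.
        if (
            "efa" in state.ofi_providers
            and parse_version(state.libfabric_version) >= (1, 16)
            and parse_version(state.kernel_version) >= (5, 12)
        ):
            return "Host NIC peer-direct"
        # P2: libfabric with any one of the listed providers.
        if state.ofi_providers & {"efa", "verbs", "tcp"}:
            return "Host NICs using host memory"
    # P3: nothing usable was found.
    return "No Scale-Out"
```

For example, a node with the "efa" provider but libfabric 1.15 would fall through P1 and land on P2 under this reading.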

Scale-Out via Host-NIC over OFI

HCCL interacts with libfabric to utilize the underlying hardware and networking mode.

Configuration Knobs

HCCL exposes several environment variables that control the behavior of scale-out communication over libfabric. The table below lists the available environment variables.

Environment Variable  Description

HCCL_SOCKET_IFNAME    Identifies the network interface(s) to use for scale-out communication.

HCCL_COMM_ID          Identifies the root process (rank 0) of the global communicator group. Typically set to <IP address>:<port>, where the IP address is that of the network interface used by the root. This must be set for all HCCL processes when no alternate network is available to broadcast it.
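As an illustration of how these variables might be set from a launcher script, the sketch below builds an environment dictionary for a worker process. The helper name `hccl_scale_out_env`, the example address, and the interface name are hypothetical. The communicator ID variable is written here as `HCCL_COMM_ID`, the spelling used elsewhere in the HCCL documentation; adjust it if your release differs.

```python
import os
from typing import Dict, Mapping, Optional


def hccl_scale_out_env(
    root_ip: str,
    root_port: int,
    ifname: str,
    base: Optional[Mapping[str, str]] = None,
) -> Dict[str, str]:
    """Return a copy of the environment with the two scale-out variables set.

    root_ip and root_port describe the rank 0 process; ifname is passed
    through unchanged as the value of HCCL_SOCKET_IFNAME.
    """
    if not 0 < root_port < 65536:
        raise ValueError(f"port out of range: {root_port}")
    env = dict(os.environ if base is None else base)
    env["HCCL_SOCKET_IFNAME"] = ifname
    # Every HCCL process must receive the same root address and port.
    env["HCCL_COMM_ID"] = f"{root_ip}:{root_port}"
    return env


# Example (hypothetical values): pass the result to subprocess.run(..., env=env)
# when starting each worker.
env = hccl_scale_out_env("10.0.0.1", 5555, "eth0", base={})
```

Building a fresh dictionary instead of mutating `os.environ` keeps the launcher's own environment unchanged, which can help when one script starts several workers with different settings.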

Usage

  1. Download the HCCL OFI Wrapper.

  2. Build and install libfabric.

  3. Build the HCCL OFI Wrapper.

Note

For additional instructions, see the HCCL OFI Wrapper page.

Note

If you are using Habana Containers, the above steps are not required, as the OFI Wrapper and libfabric are already installed by default.
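After installing libfabric, one way to confirm which providers it exposes is to query the `fi_info` utility that ships with libfabric. The sketch below assumes `fi_info` is on the PATH and that its default output contains lines of the form `provider: <name>`, with layered providers joined by semicolons. The function names are hypothetical and this check is not part of HCCL.

```python
import shutil
import subprocess
from typing import Set


def parse_fi_info_providers(output: str) -> Set[str]:
    """Collect provider names from 'provider: <name>' lines of fi_info output."""
    providers: Set[str] = set()
    for line in output.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "provider":
            # Layered providers appear as e.g. 'verbs;ofi_rxm'; record each part.
            for name in value.split(";"):
                if name.strip():
                    providers.add(name.strip())
    return providers


def installed_providers() -> Set[str]:
    """Run fi_info if it is installed; return an empty set otherwise."""
    exe = shutil.which("fi_info")
    if exe is None:
        return set()
    result = subprocess.run([exe], capture_output=True, text=True, check=False)
    return parse_fi_info_providers(result.stdout)


if __name__ == "__main__":
    found = installed_providers()
    usable = sorted(found & {"efa", "verbs", "tcp"})
    print("usable providers:", ", ".join(usable) if usable else "none")
```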