Scale-Out via Host-NIC
HCCL supports multi-node scale-out for Gaudi accelerators over host NIC interfaces. This mode is activated either by HCCL's internal auto-detection mechanism or by explicit user selection. Data transfer between nodes is supported in the Scale-Out via Host-NIC over OFI mode.
Scale-Out Auto-detection
By default, HCCL automatically detects and selects the most efficient available method for scaling between nodes. Modes are selected in the following priority order:
Priority | Scale-Out Mode | Conditions |
---|---|---|
P0 | Gaudi NICs | Gaudi scale-out NICs are connected. |
P1 | Host NIC Gaudi Direct (GDR) | |
P2 | Host NICs using host memory | libfabric is available with one of the following providers: “efa”, “verbs” or “tcp”. To use the “verbs” provider, refer to Enabling InfiniBand NICs (Verbs) for Host NIC Scaling. |
P3 | No Scale-Out | |
Note
Setting the HCCL_OVER_OFI environment variable disables the HCCL auto-detection mechanism and forces P2.
Host NIC GDR with the Verbs provider is supported on Gaudi2 only.
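The priority order above can be sketched as a simple decision function. This is an illustration only, not HCCL's actual implementation: the function name, its parameters, and the `ScaleOutMode` enum are hypothetical. Because the table does not list the P1 conditions, they are collapsed into a single caller-supplied flag, and the sketch assumes that the mere presence of HCCL_OVER_OFI in the environment counts as "set".

```python
from enum import Enum
from typing import Mapping, Optional

# Providers listed in the table for the host-memory mode (P2).
HOST_MEMORY_PROVIDERS = {"efa", "verbs", "tcp"}


class ScaleOutMode(Enum):
    GAUDI_NICS = "P0"
    HOST_NIC_GDR = "P1"
    HOST_NIC_HOST_MEMORY = "P2"
    NO_SCALE_OUT = "P3"


def select_scale_out_mode(
    gaudi_nics_connected: bool,
    gdr_usable: bool,
    libfabric_provider: Optional[str],
    env: Mapping[str, str],
) -> ScaleOutMode:
    """Walk the documented priorities; hypothetical helper, not an HCCL API."""
    # Per the note: setting HCCL_OVER_OFI disables auto-detection and forces P2.
    # Assumption: any value counts as "set".
    if "HCCL_OVER_OFI" in env:
        return ScaleOutMode.HOST_NIC_HOST_MEMORY
    # P0: Gaudi scale-out NICs are connected.
    if gaudi_nics_connected:
        return ScaleOutMode.GAUDI_NICS
    # P1: conditions are not listed in the table, so the caller decides.
    if gdr_usable:
        return ScaleOutMode.HOST_NIC_GDR
    # P2: libfabric is available with one of the listed providers.
    if libfabric_provider in HOST_MEMORY_PROVIDERS:
        return ScaleOutMode.HOST_NIC_HOST_MEMORY
    # P3: nothing usable was found.
    return ScaleOutMode.NO_SCALE_OUT
```

For example, `select_scale_out_mode(False, False, "tcp", {})` falls through to the P2 branch, while passing `{"HCCL_OVER_OFI": "1"}` as `env` returns P2 regardless of the other arguments.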
Host NIC Gaudi Direct Setup
To enable Host NIC Gaudi Direct on AWS EFA, set the RDMAV_FORK_SAFE=1 and FI_EFA_USE_DEVICE_RDMA=1 environment variables.

To enable Host NIC Gaudi Direct on Verbs:
- Set the RDMAV_FORK_SAFE=1 and MLX5_SCATTER_TO_CQE=0 environment variables.
- Disable PCIe Access Control Services (ACS).
- Use the Habana proprietary libfabric. Build libfabric using the --with-synapseai configuration option.
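The environment variables above can be applied when launching a job from Python. The following is a minimal sketch: the `GDR_ENV` mapping and the `gdr_environment` helper are hypothetical names introduced for illustration, and only the variable names and values come from this section. It does not cover the non-environment steps for Verbs (disabling ACS and building libfabric).

```python
import os
from typing import Dict, Optional

# Variables named in this section, keyed by libfabric provider.
GDR_ENV = {
    "efa": {"RDMAV_FORK_SAFE": "1", "FI_EFA_USE_DEVICE_RDMA": "1"},
    "verbs": {"RDMAV_FORK_SAFE": "1", "MLX5_SCATTER_TO_CQE": "0"},
}


def gdr_environment(
    provider: str, base: Optional[Dict[str, str]] = None
) -> Dict[str, str]:
    """Return a copy of `base` (default: os.environ) with the GDR variables set."""
    if provider not in GDR_ENV:
        raise ValueError(f"no Gaudi Direct settings listed for provider {provider!r}")
    env = dict(os.environ if base is None else base)
    env.update(GDR_ENV[provider])
    return env
```

The returned dictionary can be passed as the `env` argument of `subprocess.run` so that the settings reach the launched process without modifying the parent environment.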
Scale-Out via Host-NIC over OFI
HCCL uses libfabric to work with the underlying network hardware and networking mode.
Configuration Knobs
HCCL provides several environment variables that control the behavior of scale-out communication over libfabric. The table below lists the available environment variables.
Environment Variable | Description |
---|---|
HCCL_SOCKET_IFNAME | Identifies the network interface(s) to use for scale-out communication. |
HCL_COMM_ID | Identifies the root process (rank 0) of the global communicator group. Typically set to <IPaddress:port>, where IPaddress is the IP address of the network interface used by the root. This must be set for all HCCL processes when no alternate network is available to broadcast it. |
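As a sketch of how these two variables might be assembled, the helper below builds both values from an interface name and the root's address. The function name and its parameters are hypothetical, and the variable names are copied from the table as written. Values such as `eth0`, `10.0.0.1`, and `5555` are placeholders, not recommended settings.

```python
from typing import Dict


def scale_out_knobs(ifname: str, root_ip: str, root_port: int) -> Dict[str, str]:
    """Build the scale-out variables from the table; hypothetical helper."""
    if not 0 < root_port < 65536:
        raise ValueError(f"invalid TCP port: {root_port}")
    return {
        # Network interface to use for scale-out communication.
        "HCCL_SOCKET_IFNAME": ifname,
        # Root process (rank 0) address in <IPaddress:port> form.
        "HCL_COMM_ID": f"{root_ip}:{root_port}",
    }
```

Because the root identifier must be set for all HCCL processes when no alternate network can broadcast it, the same dictionary would be applied to the environment of every rank, for example by merging it into a copy of `os.environ` before launching each process.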
Usage
1. Download HCCL OFI Wrapper.
2. Build and install libfabric.
3. Build HCCL OFI Wrapper.
Note
See the HCCL OFI Wrapper page for additional instructions.
Note
If you are using Habana Containers, the above steps are not required because the OFI Wrapper and libfabric are installed by default.