Scale-out via Host NIC
Intel® Gaudi® AI accelerators support multi-server scale-out via Host NIC interfaces. The HCCL library provides this functionality, which is activated either by an internal auto-detection mechanism or by explicit user selection. Data transfer between nodes uses the Scale-out via Host NIC over OFI mode.
Scale-out Auto-detection
By default, HCCL runs internal auto-detection logic to select the most efficient method for scaling between nodes. Modes are selected in the following priority order:
Priority | Scale-Out Mode | Conditions |
---|---|---|
P0 | Gaudi NICs | Gaudi scale-out NICs are connected. |
P1 | Host NIC Gaudi Direct (GDR) | |
P2 | Host NICs using host memory | libfabric is available with one of the following providers: “efa”, “verbs” or “tcp”. To use the verbs provider, refer to Enabling InfiniBand NICs (Verbs) for Host NIC Scaling. |
P3 | No scale-out | |
Note
- Setting the `HCCL_OVER_OFI` environment variable disables the HCCL auto-detection mechanism and forces P2.
- Host NIC GDR with the verbs provider is supported on Gaudi 3 and Gaudi 2.
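The priority order can be pictured as a first-match walk down the table. The following is only an illustrative sketch: HCCL's detection logic is internal, and the function name, its parameters, and the boolean inputs (in particular `gdr_usable`, since the table gives no condition for P1) are hypothetical stand-ins supplied by the caller.

```python
from typing import Iterable

# Providers the priority table lists for the host-memory path (P2).
SUPPORTED_PROVIDERS = {"efa", "verbs", "tcp"}


def select_scale_out_mode(
    gaudi_nics_connected: bool,
    gdr_usable: bool,
    libfabric_providers: Iterable[str],
    hccl_over_ofi_set: bool = False,
) -> str:
    """Return the priority label of the first mode whose condition holds.

    All inputs are hypothetical; this does not query any hardware or HCCL API.
    """
    host_nic_ok = bool(SUPPORTED_PROVIDERS & set(libfabric_providers))

    # Setting HCCL_OVER_OFI disables auto-detection and forces P2.
    if hccl_over_ofi_set:
        return "P2"
    if gaudi_nics_connected:
        return "P0"
    if gdr_usable:
        return "P1"
    if host_nic_ok:
        return "P2"
    return "P3"
```

For example, `select_scale_out_mode(False, False, ["tcp"])` falls through P0 and P1 and lands on P2.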
Host NIC Gaudi Direct Setup
To enable Host NIC Gaudi Direct on AWS EFA, set the `RDMAV_FORK_SAFE=1` and `FI_EFA_USE_DEVICE_RDMA=1` environment variables.

To enable Host NIC Gaudi Direct with the verbs provider:
- Set the `RDMAV_FORK_SAFE=1` and `MLX5_SCATTER_TO_CQE=0` environment variables.
- Disable PCIe Access Control Services (ACS).
- Use libfabric version 1.20.0.
- Build libfabric using the `--with-synapseai` configuration option.
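When a job is started from a Python launcher, the variables above can be collected into the child process environment. This is a minimal sketch: the variable names and values come from the setup text, while the `gdr_environment` helper and the `GDR_ENV` table are hypothetical. It covers only the environment variables; disabling ACS and the libfabric version and build option still have to be handled on the host.

```python
import os

# Variable names and values are taken from the setup text above.
GDR_ENV = {
    "efa": {"RDMAV_FORK_SAFE": "1", "FI_EFA_USE_DEVICE_RDMA": "1"},
    "verbs": {"RDMAV_FORK_SAFE": "1", "MLX5_SCATTER_TO_CQE": "0"},
}


def gdr_environment(provider: str, base_env=None) -> dict:
    """Return a copy of base_env (default: os.environ) with the GDR variables added."""
    if provider not in GDR_ENV:
        raise ValueError(f"no Gaudi Direct settings listed for provider {provider!r}")
    env = dict(os.environ if base_env is None else base_env)
    env.update(GDR_ENV[provider])
    return env
```

The returned dictionary can be passed as the `env` argument of `subprocess.run` when launching the training command.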
Scale-out via Host NIC over OFI
HCCL uses libfabric to access whatever underlying hardware and networking mode is available.
Configuration Knobs
HCCL exports several environment variables that control the behavior of scale-out communication over libfabric. The table below lists the available environment variables.
Environment Variable | Description |
---|---|
 | Identifies the network interface(s) that should be used for scale-out communication. |
 | Identifies the root process (rank 0) of the global communicator group. Typically set to <IPaddress:port>, where the IP address is that of the network interface used by the root. This must be set for all HCCL processes when there is no alternate network over which to broadcast it. |
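Because the same root value must be set for every process, a malformed value is worth catching before launch. The sketch below checks the `<IPaddress:port>` shape described in the table. The `parse_root_endpoint` helper is hypothetical, and it is aimed at IPv4 addresses; it makes no claim about how HCCL itself parses the value.

```python
import ipaddress


def parse_root_endpoint(value: str):
    """Split an 'IPaddress:port' string into (ip, port); raise ValueError if malformed."""
    host, sep, port_text = value.rpartition(":")
    if not sep or not host:
        raise ValueError(f"expected <IPaddress:port>, got {value!r}")
    ip = ipaddress.ip_address(host)  # raises ValueError if host is not an IP address
    port = int(port_text)            # raises ValueError if the port is not numeric
    if not 0 < port < 65536:
        raise ValueError(f"port out of range: {port}")
    return str(ip), port
```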
Using Host NIC over OFI
1. Download the HCCL OFI Wrapper.
2. Build and install libfabric.
3. Build the HCCL OFI Wrapper.
Note
See additional instructions on the HCCL OFI Wrapper page.
The above steps are not required when running Intel Gaudi containers, because the OFI Wrapper and libfabric are installed in them by default.
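After libfabric is installed, one way to see which providers it exposes is to inspect the output of libfabric's `fi_info` utility. The sketch below assumes that `fi_info` is on `PATH`, that its output contains lines of the form `provider: <name>`, and that layered providers appear joined with `;`. The helper names are hypothetical.

```python
import shutil
import subprocess


def providers_from_fi_info(output: str) -> set:
    """Collect provider names from lines of the form 'provider: <name>'."""
    names = set()
    for line in output.splitlines():
        key, sep, value = line.strip().partition(":")
        if sep and key == "provider":
            # Assumption: layered providers are printed as 'a;b'.
            names.update(p for p in value.strip().split(";") if p)
    return names


def visible_providers() -> set:
    """Run fi_info if it is on PATH; return an empty set when it is missing."""
    exe = shutil.which("fi_info")
    if exe is None:
        return set()
    result = subprocess.run([exe], capture_output=True, text=True, check=False)
    return providers_from_fi_info(result.stdout)
```

Intersecting the result with `{"efa", "verbs", "tcp"}` shows whether any of the providers named in the priority table is visible.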