Optimizing Training Platform

The following optimization methods should be applied in your on-premises environment.

Ensure MPI Run Command Includes Core/Socket Binding/Mapping to the Right Number of Cores

Note

This optimization method also applies to AWS DL1.

--bind-to core --map-by socket:PE=6

This ensures that CPU affinity is set correctly for your processes and that core allocation is evenly distributed. In this example, 6 cores are allocated to one process, which controls one Gaudi device in the same socket as those 6 cores. Correct CPU affinity can improve performance not only for distributed training using multiple Gaudi devices in one server, but also for training using a single Gaudi device.

The PE value can be calculated from the number of CPUs and the threads per core (both reported by the `lscpu` command), together with the number of Gaudi devices (for example, 8 for the Intel® Gaudi® HLS-1 platform; other servers may have 4 devices). In the example below, there are 96 CPUs and 2 threads per core; therefore, 96 / 2 / 8 = 6 cores per Gaudi.

The sample code below calculates the number of physical CPU cores and the HPU count to derive the appropriate PE value, shown as MPI_PE. It can be incorporated into any model:

#Count physical cores as unique (CORE,SOCKET) pairs reported by lscpu:
export PHY_CPU_COUNT=$(lscpu --all --parse=CORE,SOCKET | grep -Ev "^#" | sort -u | wc -l)
#Count Gaudi devices exposed as /dev/hl0, /dev/hl1, and so on:
export PHY_HPU_COUNT=$(ls /dev/hl? | wc -l)
#Physical cores per Gaudi device, used as the PE value:
export MPI_PE=$(($PHY_CPU_COUNT/$PHY_HPU_COUNT))

The PE value in the Model-References examples may be set to a common number to ensure functionality, but for optimal system performance it should be derived for your host CPU as described above.
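
As an illustrative sketch, the computed value can then be fed to the binding flags. The process count and the training command (python3 train.py) below are placeholders, not part of any specific model recipe:

#One process per Gaudi device, each bound to $MPI_PE physical cores:
mpirun -np $PHY_HPU_COUNT --bind-to core --map-by socket:PE=$MPI_PE python3 train.py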

Set CPU Setting to Performance

The following example sets the CPU scaling governor to performance on Ubuntu:

#Get setting:
cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor

#Set setting:
echo performance | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
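
As a quick sanity check (an addition, not from the original instructions), the command below should print only performance once the setting has been applied to every core:

#Verify setting:
cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor | sort -u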

Use Low Latency Kernel

On Ubuntu 18.04, for example, BERT shows a 15-20% performance drop on the generic kernel compared with the low-latency kernel (5.4.0-72-lowlatency). You can use the command below to install the low-latency kernel:

sudo apt install linux-lowlatency-hwe-18.04
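
After rebooting into the new kernel, you can confirm that it is active. This verification step is an assumption, not part of the original guide:

#The reported kernel version should end in -lowlatency:
uname -r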

Ensure FW/SW Packages/BMC/CPLD are the Latest

Refer to the Installation Guide to check for the latest FW/SW package, BMC, and CPLD versions and make sure they are installed properly.
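
As a hedged example, the hl-smi utility included with the Intel Gaudi software stack can report the driver and firmware versions currently in use; the exact flags and output fields may vary by release:

#Show driver and firmware versions for comparison against the Installation Guide:
hl-smi -q | grep -i -E "driver|firmware"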

Ensure Docker Run Command Used is the Latest

Note

This optimization method also applies to AWS DL1.

Refer to the Installation Guide to make sure the latest docker run command is used.
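
For reference, a docker run invocation of the general form used in the Intel Gaudi documentation is sketched below; the image name is a placeholder, and the authoritative command is the one in the Installation Guide:

docker run -it --runtime=habana -e HABANA_VISIBLE_DEVICES=all \
       --cap-add=sys_nice --net=host --ipc=host <gaudi-docker-image>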

Ensure Dataset/Model Code/Output are Placed on a High-Performing Drive (NVMe/SSD)

For best performance, place all datasets, model code, and output locations on NVMe storage when running training.
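
One way to confirm that a drive is non-rotational (SSD/NVMe) is shown below; this check is illustrative and not from the original guide:

#ROTA 0 indicates a non-rotational device (SSD/NVMe):
lsblk -d -o NAME,ROTA,MODEL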

Ensure Network Interfaces for Gaudi Cards Have Static IPs

Assign static IPs to the Gaudi card network interfaces by running the command below:

./manage_network_ifs.sh --set-ip

For further details, refer to manage_network_ifs.sh.
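
To verify the assigned addresses, a generic check such as the following can be used (illustrative; interface names vary by system):

#List interfaces and their assigned IP addresses:
ip -brief addr show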