Optimizing Training Platform

The following optimization methods should be applied in your on-premise environment.

Ensure the MPI Command Binds the Correct Number of Cores per Socket

Use this method to ensure that CPU affinity and core allocation are set correctly for your processes. In the example below, six cores are allocated to each process, which controls one Gaudi located in the same socket as those six cores:

--bind-to core --map-by socket:PE=6

Note

This optimization method also applies to AWS DL1 instances.

The CPU affinity can improve performance not only for distributed training using multiple Gaudi devices in one server, but also for training using a single Gaudi device.
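As an illustration, the binding flags above might be combined into a full launch command as follows; the process count and the training script name (`train.py`) are placeholders, not part of the original instructions:

```shell
# Illustrative launch: 8 processes, each bound to 6 physical cores within
# its socket. Adjust -n and PE to match your system (see below).
mpirun -n 8 --bind-to core --map-by socket:PE=6 python3 train.py
```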

This number can be calculated from the number of CPUs and the number of threads per core (both reported by the `lscpu` command), together with the number of Gaudi devices. For example, the HLS-1 platform contains eight devices, while other servers may have four. In the example below, there are 96 CPUs and 2 threads per core; therefore, 96 / 2 / 8 = 6 cores per Gaudi.
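The arithmetic above can be sketched in shell. The values here are the illustrative ones from the example (96 CPUs, 2 threads per core, 8 devices), not read from a live system:

```shell
# Example values from the text above; on a real system these would come
# from `lscpu` and the Gaudi device count.
CPUS=96
THREADS_PER_CORE=2
GAUDI_COUNT=8

# Physical cores per Gaudi device: 96 / 2 / 8 = 6
echo $(( CPUS / THREADS_PER_CORE / GAUDI_COUNT ))
```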

The following sample code calculates the number of physical CPU cores and the HPU count to generate the appropriate PE value, shown as MPI_PE below. This can be incorporated into any model:

# Count unique physical cores (distinct CORE,SOCKET pairs)
export PHY_CPU_COUNT=$(lscpu --all --parse=CORE,SOCKET | grep -Ev "^#" | sort -u | wc -l)
# Count Gaudi accelerator device nodes
export PHY_HPU_COUNT=$(ls /dev/hl? | wc -l)
# Physical cores available per Gaudi device
export MPI_PE=$(($PHY_CPU_COUNT/$PHY_HPU_COUNT))
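A slightly more defensive variant is sketched below as a hypothetical helper; `compute_pe` is an illustrative name, not part of the original snippet. It guards against a zero device count so the division does not fail when no Gaudi devices are detected:

```shell
# Hypothetical helper: compute the PE value, failing cleanly if no
# Gaudi devices were found (which would make the divisor zero).
compute_pe() {
    local cores=$1 hpus=$2
    if [ "${hpus:-0}" -eq 0 ]; then
        echo "error: no Gaudi devices found" >&2
        return 1
    fi
    echo $(( cores / hpus ))
}

compute_pe 48 8    # a 48-core host with 8 Gaudi devices yields PE=6
```

On a live system it would be invoked as `compute_pe "$PHY_CPU_COUNT" "$PHY_HPU_COUNT"` after the exports above.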

The PE value in the Model-References examples may be set to a common number to ensure functionality, but for optimal system performance the PE value should be derived from the host CPU as described above.

Set CPU Setting to Performance

The following example sets the CPU frequency governor to performance on Ubuntu:

# Get the current setting:
cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor

# Set the governor to performance:
echo performance | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
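To confirm the change took effect on every CPU, the governor values can be piped through a small check; `count_nonperf` is an illustrative name, not an existing tool:

```shell
# Count governor entries that are not "performance"; a result of 0 means
# every CPU is set correctly. Input is one governor name per line, as
# produced by the `cat` command above.
count_nonperf() { grep -cv '^performance$' || true; }

# Example with synthetic input: one CPU is still on powersave
printf 'performance\npowersave\nperformance\n' | count_nonperf
```

On a live system: `cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor | count_nonperf`.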

Note

The CPU settings must be updated on bare metal before starting the container.

Update CPU Settings

This section describes how to update CPU settings to optimize performance on Gaudi 3 systems with Sapphire Rapids or Granite Rapids CPUs.

Sapphire Rapids - 4th Gen Xeon® Processor

  1. Set Energy Perf BIAS:

wrmsr -a 0x1b0 0x0

  2. Set the Core Frequency (HWP):

wrmsr -a 0x774 0x2708
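For reference, the value 0x2708 written to MSR 0x774 (IA32_HWP_REQUEST) can be decoded into a minimum and maximum performance level, assuming the standard field layout (bits 7:0 = minimum performance, bits 15:8 = maximum performance):

```shell
# Decode the IA32_HWP_REQUEST value used above. Bits 7:0 hold the
# minimum performance level and bits 15:8 the maximum.
VAL=$(( 0x2708 ))
printf 'min=%d max=%d\n' $(( VAL & 0xff )) $(( (VAL >> 8) & 0xff ))
```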

Granite Rapids - 6th Gen Xeon® Processor


  1. Disable SubNUMA in BIOS.

  2. Disable C6:

    echo 1 | tee /sys/devices/system/cpu/cpu*/cpuidle/state2/disable
    
  3. Set Latency Optimized Mode (LOM). This can be done either by cloning https://github.com/intel/pcm.git or in the BIOS:

    • Cloning pcm.git:

    1. Clone the Intel® PCM repository and build the binaries, following its Building PCM Tools section.

    2. Copy path_to_build/bin/pcm_tpmi to /usr/local/bin/pcm-tpmi.

    • In the BIOS (make sure you are on a relatively recent BIOS version):

      1. Click Socket Configuration.

      2. Navigate to Advanced Power Management Configuration.

      3. Locate and choose CPU - Advanced PM Tuning.

      4. Enable Latency Optimized Mode.

  4. Set Energy Perf BIAS:

    wrmsr -a 0x1b0 0x0
    
  5. Set the Core Frequency (HWP):

    wrmsr -a 0x774 0x2708
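After applying the steps above, the C6 disable from step 2 can be spot-checked by reading the same sysfs files back; the `all_disabled` helper below is illustrative, not an existing tool:

```shell
# Succeeds only if every input line is "1", i.e. C6 (cpuidle state2) is
# disabled on every CPU. Input is one value per line, as read from
# /sys/devices/system/cpu/cpu*/cpuidle/state2/disable.
all_disabled() { ! grep -qv '^1$'; }

# Example with synthetic input: all CPUs report state2 disabled
printf '1\n1\n1\n' | all_disabled && echo "C6 disabled on all CPUs"
```

On a live system: `cat /sys/devices/system/cpu/cpu*/cpuidle/state2/disable | all_disabled`.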
    

Ensure System Components are Updated

Refer to the Installation Guide to check for the latest Docker run commands, FW and SW packages, and BMC and CPLD versions, and make sure they are installed properly.

Ensure Dataset/Model Code/Output Utilize High-Performance NVMe/SSD Storage

For best performance, use high-performance NVMe or SSD storage for all datasets, model code, and output locations when running training.