Horovod-based Scaling of Gaudi on TensorFlow
mpirun Configuration
The mpirun map-by PE attribute value may vary depending on your setup and should be calculated as:
socket:PE = floor((number of physical cores) / (number of Gaudi devices per node))
The sample code below can also be used to calculate the number of physical CPU cores and the HPU count and generate the appropriate PE value, shown as MPI_PE. This can be incorporated into any model:
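A minimal sketch of this calculation in Python is shown below; it assumes lscpu is available and that the node has 8 Gaudi devices (adjust GAUDI_PER_NODE for your setup):

import math
import subprocess

GAUDI_PER_NODE = 8  # assumed device count per node; adjust for your system

# Count physical cores by listing unique (core, socket) pairs reported by lscpu.
lscpu_out = subprocess.check_output(
    ["lscpu", "--all", "--parse=CORE,SOCKET"], text=True)
physical_cores = len({line for line in lscpu_out.splitlines()
                      if not line.startswith("#")})

# PE value per the formula above, exposed here as MPI_PE.
MPI_PE = math.floor(physical_cores / GAUDI_PER_NODE)
print(f"MPI_PE={MPI_PE}")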
The PE value in the Model-References examples may be set to a common number to ensure functionality, but depending on the Host CPU, the directions above should be used for optimal system performance.
Scale-up Using Gaudi NICs Within a Server
The following is a simple example of distributed training, based on the single Gaudi training example detailed in Porting a Simple TensorFlow Model to Gaudi. The training model and the corresponding scripts are available in the TensorFlow Hello World Example on GitHub.
The highlighted lines of code are added for distributed training.
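A minimal sketch of what these additions look like for a Keras MNIST model is shown below; it assumes the horovod.tensorflow.keras package and the habana_frameworks.tensorflow module loader, and the exact lines in the Hello World example may differ. The distributed-training additions are marked with comments:

import tensorflow as tf
import horovod.tensorflow.keras as hvd                        # added for distributed training
from habana_frameworks.tensorflow import load_habana_module   # assumed module loader

load_habana_module()   # enable the Gaudi (HPU) device
hvd.init()             # added: initialize Horovod

(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

# added: shard the training data so each worker trains on a different subset
x_train = x_train[hvd.rank()::hvd.size()]
y_train = y_train[hvd.rank()::hvd.size()]

model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(10),
])
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

# added: wrap the optimizer so gradients are averaged across all workers
optimizer = hvd.DistributedOptimizer(tf.keras.optimizers.SGD(learning_rate=0.01))
model.compile(optimizer=optimizer, loss=loss, metrics=['accuracy'])

# added: broadcast initial variables from rank 0 so all workers start identically
callbacks = [hvd.callbacks.BroadcastGlobalVariablesCallback(0)]

model.fit(x_train, y_train, epochs=1, batch_size=128, callbacks=callbacks)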
The code above runs in multiple processes, one for each Gaudi.
To launch the distributed training on eight Gaudi devices within one host, run the following command.
Note
Open MPI is required for host communication and for launching processes. For the supported Open MPI version, refer to the Support Matrix.
The following is an example of the output:
7500/7500 [==============================] - 104s 14ms/sample - loss: 0.7289 - accuracy: 0.8361
7500/7500 [==============================] - 104s 14ms/sample - loss: 0.7916 - accuracy: 0.8051
7500/7500 [==============================] - 104s 14ms/sample - loss: 0.7939 - accuracy: 0.8053
7500/7500 [==============================] - 104s 14ms/sample - loss: 0.7928 - accuracy: 0.8093
Scale-out Across Servers
Scale-out Using AWS DL1/Host NICs
The training model and the corresponding scripts are available in the TensorFlow Hello World Example on GitHub.
A separate script for running a simple scale-out example using host NICs is provided here. You must append the IP addresses of the two servers at the end of the script, separated by whitespace, similar to the example below:
The script sets the environment variable HOROVOD_HIERARCHICAL_ALLREDUCE to 1 and invokes a command similar to the example below:
$ mpirun --allow-run-as-root \
--mca btl_tcp_if_include 192.168.0.1/24,192.168.0.2/24 \
--prefix /usr/local/openmpi/ \
--host 192.168.0.1,192.168.0.1,192.168.0.1,192.168.0.1,192.168.0.1,192.168.0.1,192.168.0.1,192.168.0.1,192.168.0.2,192.168.0.2,192.168.0.2,192.168.0.2,192.168.0.2,192.168.0.2,192.168.0.2,192.168.0.2 \
-x GC_KERNEL_PATH \
-x HABANA_LOGS \
-x TF_MODULES_RELEASE_BUILD \
-x OPAL_PREFIX \
-x PYTHONPATH \
-x LD_LIBRARY_PATH \
-x PATH \
...
-x HOROVOD_HIERARCHICAL_ALLREDUCE \
python3 example_hvd.py
The port the SSH server listens on might be different if the workload is not running inside the container. You can specify the port of the remote SSH server using the SSHD_PORT environment variable.
Scale-out Using Gaudi NICs
The training model and the corresponding scripts are available in the TensorFlow Hello World Example on GitHub.
The model does not need to change to run across multiple servers; however, the script requires some changes.
A new script, run_hvd_16gaudi.sh, is provided here as an example for two servers.
The scale-out ports of the Gaudi devices in one server are connected to
those in another server through a switch.
Run the script using the command below. You must append the IP addresses of the two servers at the end of the script, separated by whitespace, similar to the example below:
By default, the shell scripts connect to port 3022; however, the port the SSH server listens on may differ between environments. If your environment requires a different remote SSH server port, specify it with the SSHD_PORT environment variable. For example, to use port 22, set SSHD_PORT=22 when running the script.
Integrating Horovod with ResNet-50 Model Example
ResNet-50 model references can be found on the TensorFlow Model Reference GitHub page. The following steps provide an example of integrating Horovod into a Keras ResNet model.
General sharding of the ImageNet dataset can be found in imagenet_preprocessing.py:
try:
    import horovod.tensorflow as hvd
except ImportError:
    hvd = None

if hvd is not None and hvd.is_initialized() and (is_training or use_distributed_eval):
    logging.info(
        'HVD sharding the dataset: input_pipeline_id=%d num_input_pipelines=%d',
        hvd.rank(), hvd.size())
    dataset = dataset.shard(hvd.size(), hvd.rank())
Note
In the example code, the Horovod import is allowed to fail. This avoids enforcing an artificial dependency on Horovod in single Gaudi or TensorFlow Distributed runs. Sharding in this example is conditional and requires two things:
The Horovod import to succeed - hvd is not None.
Horovod to already be initialized within this process - hvd.is_initialized().
Define the use_horovod flag in common.py. The default value is False:
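A sketch of how such a boolean flag could be defined with absl.flags is shown below; the actual definition in common.py may differ:

from absl import flags

flags.DEFINE_boolean(
    name='use_horovod',
    default=False,
    help='Whether to use Horovod for distributed training on Gaudi.')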
Import the Horovod functions to be called in common.py:
Calculate the global batch size based on the per-card batch size and the total number of cards in common.py:
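A sketch of this calculation is shown below; it assumes the guarded hvd import from the previous step and an illustrative flags_obj object holding the parsed flags:

# Per-card batch size from the command line.
batch_size = flags_obj.batch_size

# Scale by the number of Horovod workers to get the global batch size.
if hvd is not None and hvd.is_initialized():
    batch_size = batch_size * hvd.size()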
Import the Horovod functions to be called in resnet_ctl_imagenet_main.py:
Note
The ImportError is stored to be raised later, and only if you set the use_horovod flag.
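A sketch of this deferred-error pattern is shown below; the variable name _hvd_import_error is illustrative:

try:
    import horovod.tensorflow as hvd
    _hvd_import_error = None
except ImportError as e:
    hvd = None
    # Keep the error; it is raised later only if the use_horovod flag is set.
    _hvd_import_error = e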
Calculate the global batch size based on the per-card batch size and the total number of cards, and name the model directory according to the rank ID, in resnet_ctl_imagenet_main.py:
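The batch size adjustment follows the same pattern as in common.py above. A sketch of the rank-based model directory naming is shown below; the worker_ directory prefix is illustrative:

import os

if hvd is not None and hvd.is_initialized():
    # Give each rank its own model directory so checkpoints do not collide.
    flags_obj.model_dir = os.path.join(flags_obj.model_dir,
                                       'worker_{}'.format(hvd.rank()))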
Initialize Horovod in resnet_ctl_imagenet_main.py:
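A sketch of the guarded initialization is shown below, assuming the stored import error and the use_horovod flag from the sketches above:

if flags_obj.use_horovod:
    if hvd is None:
        # Horovod was requested but could not be imported; surface that error now.
        raise _hvd_import_error
    hvd.init()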
Import the Horovod functions to be called in resnet_runnable.py: