Multiple Workloads on a Single Docker Container

Running a workload on a subset of the available Gaudi processors requires a few changes. The following sections describe these changes using the example provided on the TensorFlow Model References GitHub page. The example runs two ResNet workloads at the same time, each using four Gaudi processors.

Figure: Multiple workloads on a single Docker container.

Add Environment Variable HABANA_VISIBLE_MODULES

To run a workload on a subset of the available Gaudi processors, set the module IDs of the processors to use in the HABANA_VISIBLE_MODULES environment variable. A node typically has eight Gaudi processors, so the module IDs range from 0 to 7. To run a 4-Gaudi workload, set the following before you run the workload:

export HABANA_VISIBLE_MODULES="0,1,2,3"

To run another 4-Gaudi workload in parallel, set the following before running the second workload so that it uses the remaining four Gaudi processors:

export HABANA_VISIBLE_MODULES="4,5,6,7"

In the multi_tenants_resnet.sh example, HABANA_VISIBLE_MODULES="0,1,2,3" is set for one ResNet workload and HABANA_VISIBLE_MODULES="4,5,6,7" is set for the second ResNet workload. The workloads are invoked consecutively as background jobs and run in parallel.
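The launch pattern described above can be sketched as follows. This is a minimal illustration, not the contents of multi_tenants_resnet.sh: run_workload is a hypothetical stand-in for the real training command, and the log directory is an assumption made for the example. Each job runs in its own subshell, so the exported HABANA_VISIBLE_MODULES value stays private to that job.

```shell
#!/usr/bin/env bash

# Hypothetical stand-in for a real training command (for example, an mpirun call).
run_workload() {
    echo "workload $1 sees modules: ${HABANA_VISIBLE_MODULES}"
}

# Assumed location for per-workload logs, created only for this sketch.
log_dir=$(mktemp -d)

# Each subshell exports its own module list, then runs as a background job.
( export HABANA_VISIBLE_MODULES="0,1,2,3"; run_workload A ) > "$log_dir/workload_A.log" 2>&1 &
pid_a=$!

( export HABANA_VISIBLE_MODULES="4,5,6,7"; run_workload B ) > "$log_dir/workload_B.log" 2>&1 &
pid_b=$!

# Block until both background jobs have finished.
wait "$pid_a" "$pid_b"
```

Because the variable is exported inside a subshell, the parent shell's environment is left unchanged, which avoids one workload accidentally inheriting the other's module list.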

Number of Supported Gaudis for Multi-Tenancy Workload

Although a workload can use a subset of the Gaudi processors, only 2-Gaudi and 4-Gaudi scenarios are supported. It is highly recommended to set HABANA_VISIBLE_MODULES to one of the combinations listed below:

  • 2-Gaudi - “0,1”, “2,3”, “4,5” or “6,7”

  • 4-Gaudi - “0,1,2,3” or “4,5,6,7”
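A launch script could guard against unsupported values by checking the module list against the combinations listed above. The following is a minimal sketch; the function name is_recommended_module_set is hypothetical and is not part of any Habana tooling.

```shell
#!/usr/bin/env bash

# Hypothetical helper: returns 0 only for the combinations listed above.
is_recommended_module_set() {
    case "$1" in
        "0,1"|"2,3"|"4,5"|"6,7") return 0 ;;   # 2-Gaudi combinations
        "0,1,2,3"|"4,5,6,7")     return 0 ;;   # 4-Gaudi combinations
        *)                       return 1 ;;   # anything else is not recommended
    esac
}

# Example usage: report whether a few candidate values are recommended.
for candidate in "0,1,2,3" "2,3" "1,2"; do
    if is_recommended_module_set "$candidate"; then
        echo "${candidate}: recommended"
    else
        echo "${candidate}: not recommended"
    fi
done
```

The check is an exact string match, so it also rejects reordered or space-separated lists such as "1,0" or "0, 1".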

Set the Model Specific Arguments

Make sure to set the model-specific arguments properly. For example, if you need to reduce the number of processes from 8 to 2 or 4 to match the number of Gaudi processors used, update the -np argument of the mpirun command provided in the multi_tenants_resnet.sh example.

You may also need to update the model arguments that set the folder where temporary data or checkpoints are saved. For example, in the ResNet-50 script, the argument -md sets the path to the model directory. In the multi_tenants_resnet.sh example, you can find the --model_dir argument in the mpirun command.

Make sure different workloads use different folders; otherwise, the content might be overwritten unexpectedly.
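One way to keep these arguments consistent is to derive both the process count and the model directory from the module list. The sketch below assumes one process per Gaudi processor; the helper names, the /tmp path layout, and the `<training command>` placeholder are hypothetical. It only prints the resulting command lines and does not run mpirun.

```shell
#!/usr/bin/env bash

# Hypothetical helper: count the module IDs in a comma-separated list.
count_modules() {
    local ids
    IFS=',' read -r -a ids <<< "$1"
    echo "${#ids[@]}"
}

# Hypothetical helper: build a model directory unique to the module list.
model_dir_for() {
    echo "/tmp/resnet_modules_${1//,/_}"
}

# Print (not run) one command line per workload; <training command> is a placeholder.
for modules in "0,1,2,3" "4,5,6,7"; do
    echo "HABANA_VISIBLE_MODULES=${modules} mpirun -np $(count_modules "$modules") <training command> --model_dir $(model_dir_for "$modules")"
done
```

Tying the directory name to the module list means two workloads on different processors cannot share a model directory by accident.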

“Hello World” Example of Multiple Workloads on Single Docker Container

In addition to the multi_tenants_resnet.sh example, which runs two ResNet-50 workloads, Habana also provides a Hello World example for multiple tenants on a single Docker container.

The script launches two MNIST workloads on four Gaudi processors as background jobs so that they run in parallel. Because the MNIST workload downloads its data automatically, no data preparation is required. To try the script, run the following command:

<Path to Model-References>/TensorFlow/examples/hello_world/run_multi_hvd_4_4.sh