Multiple Workloads on a Single Docker¶
Running a workload on a subset of the available Gaudi processors requires a few changes. The following sections describe the required changes using the example provided on the TensorFlow Model References GitHub page. The example runs two ResNet workloads at the same time, each using four Gaudi processors.
Add Environment Variable HABANA_VISIBLE_MODULES¶
To run a workload on part of the available Gaudi processors, set the
module IDs of the Gaudi processors to be used in the
HABANA_VISIBLE_MODULES environment variable. In general, there are eight Gaudi processors on a node,
so the module IDs are in the range 0-7. To run a
4-Gaudi workload, set the variable as shown below before running the workload:
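For example, assuming the first workload should use module IDs 0-3 (the combination used in the multi_tenants_resnet.sh example):

```shell
# Make only module IDs 0-3 visible to this workload
export HABANA_VISIBLE_MODULES="0,1,2,3"
```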
To run another 4-Gaudi workload in parallel, set the variable as shown below before running the second workload so that it uses the remaining four Gaudi processors.
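Continuing the example, the second workload would be assigned the other four module IDs:

```shell
# Make only module IDs 4-7 visible to the second workload
export HABANA_VISIBLE_MODULES="4,5,6,7"
```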
In the multi_tenants_resnet.sh example,
HABANA_VISIBLE_MODULES="0,1,2,3" is set for one ResNet workload and
HABANA_VISIBLE_MODULES="4,5,6,7" is set for the second ResNet workload. The workloads are invoked
consecutively as background jobs and run in parallel.
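Following that pattern, launching two workloads in parallel can be sketched as below. This is a minimal illustration, not the actual multi_tenants_resnet.sh script; `sleep` stands in for the real training command.

```shell
# Each workload sees only its own Gaudi processors via HABANA_VISIBLE_MODULES.
# "sleep 1" is a placeholder for the real training command (e.g. the ResNet script).
HABANA_VISIBLE_MODULES="0,1,2,3" sleep 1 &   # first 4-Gaudi workload
HABANA_VISIBLE_MODULES="4,5,6,7" sleep 1 &   # second 4-Gaudi workload

wait   # block until both background jobs complete
```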
Number of Supported Gaudis for Multi-Tenancy Workload¶
Although running a workload on a subset of the Gaudi processors is possible, only the 2-Gaudi and
4-Gaudi scenarios are supported. It is highly recommended to set
HABANA_VISIBLE_MODULES to one of the combinations listed below:
2-Gaudi - “0,1”, “2,3”, “4,5” or “6,7”
4-Gaudi - “0,1,2,3” or “4,5,6,7”
Set the Model Specific Arguments¶
Make sure to set the model-specific arguments properly. For example, if you change the number of processes from 8 to 2 or 4, the
-np argument of the
mpirun command provided in the multi_tenants_resnet.sh example must be updated accordingly.
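One way to keep -np consistent with the visible modules is to derive it from the variable itself. This is a hypothetical helper, not a line from multi_tenants_resnet.sh, and `echo` stands in for the actual mpirun training command:

```shell
# Derive the process count from the number of module IDs in the list,
# so -np always matches HABANA_VISIBLE_MODULES.
export HABANA_VISIBLE_MODULES="0,1,2,3"
NP=$(echo "$HABANA_VISIBLE_MODULES" | awk -F, '{print NF}')
echo "mpirun -np $NP <training command>"   # prints: mpirun -np 4 <training command>
```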
Model arguments that set the folder where temporary data or checkpoints are saved may also need updating.
For example, in the ResNet-50 script the argument
-md sets the path to the model directory;
in the multi_tenants_resnet.sh example,
the corresponding
--model_dir argument is used.
Make sure different workloads use different folders; otherwise, their contents might be overwritten unexpectedly.
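For instance, each parallel workload can be given its own model directory. The paths below are illustrative, not the ones used in the example script:

```shell
# Give each parallel workload its own model directory so that
# checkpoints and summaries are not overwritten by the other workload.
# The paths are placeholders.
MODEL_DIR_1="/tmp/resnet_job1"
MODEL_DIR_2="/tmp/resnet_job2"
mkdir -p "$MODEL_DIR_1" "$MODEL_DIR_2"
# e.g. pass --model_dir "$MODEL_DIR_1" to the first workload
# and --model_dir "$MODEL_DIR_2" to the second.
```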
“Hello World” Example of Multiple Workloads on Single Docker Container¶
The script runs two MNIST workloads, each on four Gaudi processors, as background jobs in parallel. Since the MNIST workload downloads its data automatically, no data preparation is required. You can try the script by running the command below:
<Path to Model-References>/TensorFlow/examples/hello_world/run_multi_hvd_4_4.sh