Multiple Workloads on a Single Docker
Running a workload on a subset of the available Gaudi processors requires a few changes. The following sections describe these changes using the multi_tenants_resnet_pt.sh example, which runs two ResNet workloads at the same time, each using four Gaudi processors.
Add Environment Variable HABANA_VISIBLE_MODULES
To run a workload on a subset of the available Gaudi processors, set the
module IDs of the processors to use in the HABANA_VISIBLE_MODULES
environment variable. A node typically has eight Gaudi processors, so the
module IDs range from 0 to 7. For example, to run a 4-Gaudi workload, set
HABANA_VISIBLE_MODULES="0,1,2,3" before you run the workload.
To run another 4-Gaudi workload in parallel, set HABANA_VISIBLE_MODULES="4,5,6,7" before running the second workload so that it uses the remaining four Gaudi processors.
In the multi_tenants_resnet_pt.sh example,
HABANA_VISIBLE_MODULES="0,1,2,3" is set for one ResNet workload and
HABANA_VISIBLE_MODULES="4,5,6,7" is set for the second ResNet workload. The workloads are invoked
consecutively as background jobs and run in parallel.
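The launch pattern described above can be sketched as follows. This is a minimal illustration, not the contents of multi_tenants_resnet_pt.sh: run_workload is a hypothetical stand-in for the real training command, and the log file names are assumptions. Each workload runs in its own subshell so that its HABANA_VISIBLE_MODULES value does not leak into the other workload or into the calling shell.

```shell
#!/bin/sh
# Hypothetical stand-in for the real training command; it only reports
# which module IDs it was given through the environment.
run_workload() {
    echo "workload $1 sees modules: ${HABANA_VISIBLE_MODULES}"
}

# Assumed location for per-workload logs.
LOG_DIR="$(mktemp -d)"

# Each workload runs in its own subshell as a background job, so the
# exported variable applies to that workload only.
( export HABANA_VISIBLE_MODULES="0,1,2,3"; run_workload A ) > "$LOG_DIR/workload_A.log" 2>&1 &
( export HABANA_VISIBLE_MODULES="4,5,6,7"; run_workload B ) > "$LOG_DIR/workload_B.log" 2>&1 &

# Block until both background jobs have finished.
wait

cat "$LOG_DIR/workload_A.log" "$LOG_DIR/workload_B.log"
```

Redirecting each job to its own log file keeps the output of the two workloads from interleaving on the terminal.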
Number of Supported Gaudis for Multi-Tenancy Workload
Although a workload can run on a subset of the Gaudi processors, only
2-Gaudi and 4-Gaudi scenarios are supported. It is highly recommended to set
HABANA_VISIBLE_MODULES using one of the combinations listed below:
2-Gaudi - “0,1”, “2,3”, “4,5” or “6,7”
4-Gaudi - “0,1,2,3” or “4,5,6,7”
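A launch script could check a requested value against the combinations listed above before starting a workload. The sketch below assumes a hypothetical helper named is_recommended_modules; it is not part of any Habana tooling and only compares strings, so it does not query the hardware.

```shell
#!/bin/sh
# Hypothetical helper: succeeds only for the 2-Gaudi and 4-Gaudi
# combinations listed in this section.
is_recommended_modules() {
    case "$1" in
        "0,1"|"2,3"|"4,5"|"6,7") return 0 ;;   # 2-Gaudi combinations
        "0,1,2,3"|"4,5,6,7")     return 0 ;;   # 4-Gaudi combinations
        *)                       return 1 ;;
    esac
}

for candidate in "0,1,2,3" "6,7" "1,2" "0,1,2"; do
    if is_recommended_modules "$candidate"; then
        echo "$candidate: recommended combination"
    else
        echo "$candidate: not in the recommended list"
    fi
done
```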
Set the Model Specific Arguments
Make sure to set the model-specific arguments properly. For example, if you change the number of processes from 8 to 2 or 4, update the -np argument of the mpirun command provided in the multi_tenants_resnet_pt.sh example.
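One way to keep the -np value in step with HABANA_VISIBLE_MODULES is to derive it from the number of module IDs. The sketch below assumes one process per Gaudi processor; count_modules is a hypothetical helper, and the mpirun line is only printed, with a placeholder where the real training command would go.

```shell
#!/bin/sh
# Hypothetical helper: prints how many comma-separated module IDs
# the given string contains.
count_modules() {
    old_ifs=$IFS
    IFS=','
    # Unquoted on purpose: split the argument on commas into $1, $2, ...
    set -- $1
    IFS=$old_ifs
    echo "$#"
}

HABANA_VISIBLE_MODULES="0,1"
NUM_PROCESSES=$(count_modules "$HABANA_VISIBLE_MODULES")

# Printed rather than executed; replace the placeholder with the
# actual training command from your script.
echo "mpirun -np ${NUM_PROCESSES} <training command>"
```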
You may also need to update the model arguments that set the folder where temporary data or checkpoints are saved.
In the multi_tenants_resnet_pt.sh example, you can find the --model argument in the
Make sure that different workloads use different folders; otherwise, their content might be overwritten unexpectedly.
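A simple way to keep folders apart is to derive each workload's folder name from its module IDs. The sketch below is an illustration under assumed names: BASE_DIR, workload_dir, and the modules_ prefix are all hypothetical, and the resulting path would be passed to whichever argument your model uses for temporary data or checkpoints.

```shell
#!/bin/sh
# Assumed parent folder for all workloads; a temporary directory is
# used here so the sketch does not touch real data.
BASE_DIR="$(mktemp -d)"

# Hypothetical helper: maps a module ID list such as "0,1,2,3" to a
# folder name such as modules_0_1_2_3 under BASE_DIR.
workload_dir() {
    echo "${BASE_DIR}/modules_$(echo "$1" | tr ',' '_')"
}

DIR_A="$(workload_dir "0,1,2,3")"
DIR_B="$(workload_dir "4,5,6,7")"
mkdir -p "$DIR_A" "$DIR_B"

echo "first workload writes to:  $DIR_A"
echo "second workload writes to: $DIR_B"
```

Because the supported module ID combinations do not overlap, folder names built this way cannot collide between workloads running at the same time.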