Multiple Workloads on a Single Docker Container

Running a workload on a subset of the available Gaudi processors requires a few changes. The following sections describe those changes using the provided example, which runs two ResNet workloads at the same time, each using four Gaudi processors.


Add Environment Variable HABANA_VISIBLE_MODULES

To run a workload on only some of the available Gaudi processors, set the module IDs of the processors to use in the HABANA_VISIBLE_MODULES environment variable. A node typically has eight Gaudi processors, so the module IDs range from 0 to 7. For example, to run a 4-Gaudi workload, set HABANA_VISIBLE_MODULES="0,1,2,3" before you start the workload.


To run another 4-Gaudi workload in parallel, set HABANA_VISIBLE_MODULES="4,5,6,7" before starting the second workload so that it uses the remaining four Gaudi processors.


In the example, HABANA_VISIBLE_MODULES="0,1,2,3" is set for one ResNet workload and HABANA_VISIBLE_MODULES="4,5,6,7" is set for the second ResNet workload. The workloads are invoked consecutively as background jobs and run in parallel.
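The pattern described above can be sketched as follows. This is a minimal illustration, not the example's actual script: run_workload is a hypothetical placeholder for the real training command, and the log folder is an assumption made so the sketch can run anywhere.

```shell
# Hypothetical stand-in for the real training command (for instance, the
# mpirun invocation from the example). It only reports which modules it sees.
run_workload() {
    echo "workload $1 sees modules: ${HABANA_VISIBLE_MODULES}"
}

# Assumed scratch folder for the logs of this sketch.
log_dir="$(mktemp -d)"

# First workload: restricted to modules 0-3, started as a background job.
(
    export HABANA_VISIBLE_MODULES="0,1,2,3"
    run_workload A
) > "${log_dir}/workload_a.log" 2>&1 &
pid_a=$!

# Second workload: restricted to the remaining modules 4-7.
(
    export HABANA_VISIBLE_MODULES="4,5,6,7"
    run_workload B
) > "${log_dir}/workload_b.log" 2>&1 &
pid_b=$!

# Both jobs now run in parallel; wait for them to finish.
wait "$pid_a" "$pid_b"
```

Each workload sets the variable inside its own subshell, so the two settings cannot leak into each other or into the parent shell.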

Number of Supported Gaudis for Multi-Tenancy Workload

Although a workload can run on a subset of the Gaudi processors, only 2-Gaudi and 4-Gaudi scenarios are supported. It is highly recommended to set HABANA_VISIBLE_MODULES to one of the combinations listed below:

  • 2-Gaudi - “0,1”, “2,3”, “4,5” or “6,7”

  • 4-Gaudi - “0,1,2,3” or “4,5,6,7”
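A launch script could guard against unsupported values before starting a workload. The sketch below is one possible check, assuming the list above is complete; is_supported_modules is a hypothetical helper name and is not part of any Habana tool.

```shell
# Hypothetical helper: succeeds only for the recommended module combinations.
is_supported_modules() {
    case "$1" in
        "0,1"|"2,3"|"4,5"|"6,7") return 0 ;;   # 2-Gaudi combinations
        "0,1,2,3"|"4,5,6,7")     return 0 ;;   # 4-Gaudi combinations
        *)                       return 1 ;;   # anything else is rejected
    esac
}

# Usage: validate the value before exporting it.
modules="4,5,6,7"
if is_supported_modules "$modules"; then
    export HABANA_VISIBLE_MODULES="$modules"
else
    echo "unsupported module combination: $modules" >&2
fi
```

The check is a plain string comparison, so a value such as "1,0" or "0, 1" is rejected even though it names the same modules.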

Set the Model-Specific Arguments

Make sure to set the model-specific arguments properly. For example, if you reduce the number of processes from 8 to 2 or 4 to match the number of Gaudis used, update the -np argument of the mpirun command provided in the example.
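One way to keep the process count in step with the module list is to derive it from the list itself. The sketch below assumes one process per Gaudi processor, as the 8-to-2-or-4 change above implies; count_modules is a hypothetical helper name.

```shell
# Hypothetical helper: prints how many comma-separated module IDs are listed.
# The body runs in a subshell so the IFS change does not affect the caller.
count_modules() (
    IFS=','
    # Intentionally unquoted so the list is split on commas.
    set -- $1
    echo "$#"
)

modules="4,5,6,7"
num_procs="$(count_modules "$modules")"

# The value can then be passed to mpirun as its -np argument.
echo "module list ${modules} needs -np ${num_procs}"
```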

You may also need to update the model arguments that set the folder where temporary data or checkpoints are saved. In the example, see the --model argument in the mpirun command.

Make sure that different workloads use different folders; otherwise, their contents might be overwritten unexpectedly.
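A simple way to avoid collisions is to derive each workload's folder from its module list. The sketch below is illustrative only: the base folder and the naming scheme are assumptions, and the resulting path would still need to be passed to the workload through its own folder argument.

```shell
# Assumed base folder for this sketch; a real setup would use a persistent path.
base_dir="$(mktemp -d)"

for modules in "0,1,2,3" "4,5,6,7"; do
    # Turn "0,1,2,3" into "0_1_2_3" so it is safe to use in a folder name.
    suffix="$(printf '%s' "$modules" | tr ',' '_')"
    run_dir="${base_dir}/modules_${suffix}"
    mkdir -p "$run_dir"
    echo "workload on modules ${modules} writes to ${run_dir}"
done
```

Because the supported module combinations of a given size do not overlap, two workloads running at the same time never map to the same folder under this scheme.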