Multiple Workloads on a Single Docker

Running a workload on a subset of the Gaudi processors in a node requires a few changes. The following sections describe these changes using the multi_tenants_resnet_pt.sh example, which runs two ResNet workloads at the same time, each using four Gaudi processors.

[Figure: Multiple workloads on a single Docker]

Add Environment Variable HABANA_VISIBLE_MODULES

To run a workload on only some of the available Gaudi processors, set the module IDs of the processors to use in the HABANA_VISIBLE_MODULES environment variable. A node typically has eight Gaudi processors, so the module IDs range from 0 to 7. To run a 4-Gaudi workload, set the following before you run the workload:

export HABANA_VISIBLE_MODULES="0,1,2,3"

To run another 4-Gaudi workload in parallel, set the following before running the second workload so that it uses the remaining four Gaudi processors:

export HABANA_VISIBLE_MODULES="4,5,6,7"

In the multi_tenants_resnet_pt.sh example, HABANA_VISIBLE_MODULES="0,1,2,3" is set for one ResNet workload and HABANA_VISIBLE_MODULES="4,5,6,7" is set for the second ResNet workload. The workloads are invoked consecutively as background jobs and run in parallel.
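The background-job pattern described above can be sketched as follows. This is a minimal illustration in Bash, not the contents of multi_tenants_resnet_pt.sh: the helper name run_workload is hypothetical, and the sh -c commands are stand-ins for the real training commands.

```shell
# Hypothetical helper: start a command in the background with
# HABANA_VISIBLE_MODULES set only for that command.
run_workload() {
    local modules="$1"
    shift
    HABANA_VISIBLE_MODULES="$modules" "$@" &
}

# Stand-in commands; replace them with the actual training commands.
run_workload "0,1,2,3" sh -c 'echo "workload A uses modules $HABANA_VISIBLE_MODULES"'
run_workload "4,5,6,7" sh -c 'echo "workload B uses modules $HABANA_VISIBLE_MODULES"'

# Block until both background jobs have finished.
wait
```

Because the variable is set as a prefix to each command rather than exported, each workload sees only its own module list and the parent shell environment is left unchanged.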

Number of Supported Gaudis for Multi-Tenancy Workload

Although a workload can use a subset of the Gaudi processors, only 2-Gaudi and 4-Gaudi scenarios are supported. It is highly recommended to set HABANA_VISIBLE_MODULES to one of the combinations listed below:

  • 2-Gaudi - “0,1”, “2,3”, “4,5” or “6,7”

  • 4-Gaudi - “0,1,2,3” or “4,5,6,7”
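A launch script could guard against other values by checking the variable against the combinations listed above before starting a workload. The function name is_recommended_modules below is hypothetical; this is only a sketch of such a check.

```shell
# Hypothetical check: succeed only for the recommended combinations.
is_recommended_modules() {
    case "$1" in
        "0,1"|"2,3"|"4,5"|"6,7") return 0 ;;   # 2-Gaudi combinations
        "0,1,2,3"|"4,5,6,7")     return 0 ;;   # 4-Gaudi combinations
        *)                       return 1 ;;
    esac
}

# Example: report whether a candidate value is one of the listed combinations.
candidate="4,5,6,7"
if is_recommended_modules "$candidate"; then
    echo "recommended: $candidate"
else
    echo "not a recommended combination: $candidate" >&2
fi
```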

Set the Model Specific Arguments

Make sure to set the model-specific arguments properly. For example, if you need to change the number of processes from 8 to 2 or 4, update the -np argument of the mpirun command provided in the multi_tenants_resnet_pt.sh example.
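One way to keep the process count consistent with the module list is to derive it from HABANA_VISIBLE_MODULES instead of hard-coding it. The helper count_modules below is a hypothetical sketch and assumes one process per visible Gaudi processor, as in the example described here.

```shell
# Hypothetical helper: count the comma-separated module IDs.
# The function body runs in a subshell, so changing IFS does not
# affect the calling shell.
count_modules() (
    IFS=','
    set -- $1
    echo "$#"
)

count_modules "0,1,2,3"   # prints 4

# The result could then supply the -np value, for example:
#   mpirun -np "$(count_modules "$HABANA_VISIBLE_MODULES")" <other arguments>
```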

You may also need to update the model arguments that set the folder where temporary data or checkpoints are saved. In the multi_tenants_resnet_pt.sh example, you can find the --model argument in the mpirun command.

Make sure different workloads use different folders; otherwise, the content might be overwritten unexpectedly.
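A simple way to keep folders distinct is to derive each workload's folder name from its module list. The sketch below is an assumption for illustration only: the helper workload_dir, the default base path ./runs, and the naming scheme are hypothetical, and the resulting path still has to be passed to whichever argument your model uses for its output location.

```shell
# Hypothetical helper: build a per-workload folder path from the module list.
# $1 is the module list, $2 is an optional base folder (defaults to ./runs).
workload_dir() (
    base="${2:-./runs}"
    suffix=$(printf '%s' "$1" | tr ',' '_')
    printf '%s/modules_%s\n' "$base" "$suffix"
)

workload_dir "0,1,2,3"   # prints ./runs/modules_0_1_2_3
workload_dir "4,5,6,7"   # prints ./runs/modules_4_5_6_7
```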