Multiple Workloads on a Single Docker

Running a workload on a subset of the available Intel® Gaudi® AI accelerators requires a few changes. The following sections describe these changes using the multi_tenants_resnet_pt.sh example, which runs two ResNet workloads at the same time, each using four Gaudi processors.

[Figure: Multiple workloads on a single Docker]

Using HABANA_VISIBLE_MODULES

To run a workload on a subset of the available Gaudi processors, set HABANA_VISIBLE_MODULES to the module IDs of the processors to use. A node typically has 8 Gaudi processors, so the module IDs range from 0 to 7. To run a 4-Gaudi workload, set the following before launching it:

export HABANA_VISIBLE_MODULES="0,1,2,3"

To run another 4-Gaudi workload in parallel, set the following before launching the second workload so that it uses the remaining four Gaudi processors:

export HABANA_VISIBLE_MODULES="4,5,6,7"

Although a workload can use a subset of the Gaudi processors, only 2-Gaudi and 4-Gaudi scenarios are supported. It is highly recommended to set the module IDs in HABANA_VISIBLE_MODULES to one of the combinations listed below:

  • 2-Gaudi - “0,1”, “2,3”, “4,5” or “6,7”

  • 4-Gaudi - “0,1,2,3” or “4,5,6,7”
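The recommended combinations above can be checked before a workload is launched. The following is a minimal sketch, not part of the Gaudi software: is_recommended_modules is a hypothetical helper name, and it accepts only the exact strings listed above.

```shell
# Hypothetical helper (an assumption, not provided by the Gaudi software):
# succeeds only if the argument is one of the recommended module ID lists.
is_recommended_modules() {
    case "$1" in
        "0,1"|"2,3"|"4,5"|"6,7") return 0 ;;  # 2-Gaudi combinations
        "0,1,2,3"|"4,5,6,7")     return 0 ;;  # 4-Gaudi combinations
        *)                       return 1 ;;  # anything else is rejected
    esac
}

export HABANA_VISIBLE_MODULES="4,5,6,7"
if is_recommended_modules "$HABANA_VISIBLE_MODULES"; then
    echo "Using modules: $HABANA_VISIBLE_MODULES"
else
    echo "Not a recommended combination: $HABANA_VISIBLE_MODULES" >&2
fi
```

A check like this catches a mistyped value such as "1,2" before the workload starts.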

In the multi_tenants_resnet_pt.sh example, HABANA_VISIBLE_MODULES="0,1,2,3" is set for one ResNet workload and HABANA_VISIBLE_MODULES="4,5,6,7" is set for the second ResNet workload. The workloads are invoked consecutively as background jobs and run in parallel.
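The launch pattern described above can be sketched as follows. This is an illustration only and is not copied from multi_tenants_resnet_pt.sh: run_workload is a hypothetical stand-in for the real training command, and the log directory is an assumption. Each workload runs in its own subshell so that its HABANA_VISIBLE_MODULES value does not leak into the other.

```shell
# Hypothetical stand-in for the real training command (assumption).
run_workload() {
    echo "workload $1 uses modules $HABANA_VISIBLE_MODULES"
}

# Assumed location for per-workload logs.
LOG_DIR=$(mktemp -d)

# First workload: modules 0-3, started as a background job.
(
    export HABANA_VISIBLE_MODULES="0,1,2,3"
    run_workload first > "$LOG_DIR/first.log" 2>&1
) &

# Second workload: modules 4-7, started immediately after the first.
(
    export HABANA_VISIBLE_MODULES="4,5,6,7"
    run_workload second > "$LOG_DIR/second.log" 2>&1
) &

# Block until both background jobs have finished.
wait
cat "$LOG_DIR/first.log" "$LOG_DIR/second.log"
```

Without the final wait, the launching script could exit while the workloads are still running.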

Note

If HABANA_VISIBLE_DEVICES is not set, the PyTorch process acquires any available Gaudi processor within a single server. In a multi-server setup, however, it does not do so: a subset of Gaudi processors spread across different nodes must be explicitly assigned using well-defined module IDs and HABANA_VISIBLE_DEVICES to ensure optimal performance.

Setting the Model-specific Arguments

Make sure to set the model-specific arguments properly. For example, if you reduce the number of Gaudi processors from 8 to 2 or 4, update the -np argument of the mpirun command in the multi_tenants_resnet_pt.sh example so that the number of processes matches.
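One way to keep the process count consistent with the module list is to derive it from HABANA_VISIBLE_MODULES. The sketch below assumes one process per Gaudi processor; count_modules is a hypothetical helper, and the training command is deliberately left as a placeholder, so the mpirun line is only printed, not executed.

```shell
# Hypothetical helper (assumption): counts the comma-separated module IDs.
count_modules() {
    # Put one ID on each line, count the lines, and strip any padding
    # that some wc implementations add.
    echo "$1" | tr ',' '\n' | wc -l | tr -d ' '
}

export HABANA_VISIBLE_MODULES="0,1,2,3"
NUM_PROCS=$(count_modules "$HABANA_VISIBLE_MODULES")

# Print the launch line instead of running it; the command is a placeholder.
echo "mpirun -np $NUM_PROCS <training command>"
```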

You may also need to update the model arguments that set the folder where temporary data or checkpoints are saved. In the multi_tenants_resnet_pt.sh example, see the --model argument in the mpirun command.

Make sure that different workloads use different folders; otherwise, the content might be overwritten unexpectedly.
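A simple way to avoid folder collisions is to derive each workload's folder from its module IDs. The following is a sketch under stated assumptions: workload_dir is a hypothetical helper, and the base directory and naming scheme are illustrative rather than taken from the example script.

```shell
# Hypothetical helper (assumption): builds a folder name from the module
# ID list, so workloads on different modules get different folders.
workload_dir() {
    base="$1"
    modules="$2"
    echo "$base/modules_$(echo "$modules" | tr ',' '_')"
}

# Assumed base location; replace with the real output path.
BASE_DIR=$(mktemp -d)

DIR_A=$(workload_dir "$BASE_DIR" "0,1,2,3")
DIR_B=$(workload_dir "$BASE_DIR" "4,5,6,7")
mkdir -p "$DIR_A" "$DIR_B"

echo "first workload folder:  $DIR_A"
echo "second workload folder: $DIR_B"
```

Each resulting path can then be passed to the model argument that controls where checkpoints and temporary data are written.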