Multiple Workloads on a Single Docker
Running a workload on a subset of the Intel® Gaudi® AI accelerators in a node requires a few changes. The following sections describe those changes using the multi_tenants_resnet_pt.sh example, which runs two ResNet workloads at the same time, each using four Gaudi processors.
Using HABANA_VISIBLE_MODULES
To run a workload on a subset of the available Gaudi processors, set HABANA_VISIBLE_MODULES to the module IDs of the Gaudi processors to use. A node typically has 8 Gaudi processors, so the module IDs range from 0 to 7. To run a 4-Gaudi workload, set the following before launching it:
export HABANA_VISIBLE_MODULES="0,1,2,3"
To run another 4-Gaudi workload in parallel, set the following before launching the second workload so that it uses the remaining four Gaudi processors:
export HABANA_VISIBLE_MODULES="4,5,6,7"
Although a workload can use a subset of the Gaudi processors, only 2-Gaudi and 4-Gaudi scenarios are supported. It is highly recommended to set the module IDs in HABANA_VISIBLE_MODULES using one of the combinations listed below:
2-Gaudi - "0,1", "2,3", "4,5" or "6,7"
4-Gaudi - "0,1,2,3" or "4,5,6,7"
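The recommended combinations above can be checked before a workload starts. The following is a minimal sketch, not part of the multi_tenants_resnet_pt.sh example; the function name is_supported_modules is hypothetical and introduced here only for illustration.

```shell
# Hypothetical helper: succeeds only for the recommended module ID combinations.
is_supported_modules() {
    case "$1" in
        "0,1"|"2,3"|"4,5"|"6,7") return 0 ;;  # 2-Gaudi combinations
        "0,1,2,3"|"4,5,6,7")     return 0 ;;  # 4-Gaudi combinations
        *)                       return 1 ;;  # anything else is not recommended
    esac
}

requested="0,1,2,3"
if is_supported_modules "$requested"; then
    export HABANA_VISIBLE_MODULES="$requested"
else
    echo "Unsupported module combination: $requested" >&2
fi
```

A check like this catches typos such as "1,2" before any Gaudi processor is acquired.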
In the multi_tenants_resnet_pt.sh example, HABANA_VISIBLE_MODULES="0,1,2,3" is set for one ResNet workload and HABANA_VISIBLE_MODULES="4,5,6,7" is set for the second ResNet workload. The workloads are invoked consecutively as background jobs and run in parallel.
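The background-job pattern described above might look roughly like the sketch below. It is an illustration under stated assumptions, not the content of multi_tenants_resnet_pt.sh: run_workload is a hypothetical stand-in for the real training command, and the log file names are arbitrary.

```shell
# Hypothetical stand-in for the real training command; it only reports
# which module IDs are visible to it.
run_workload() {
    echo "workload $1 sees modules ${HABANA_VISIBLE_MODULES}"
}

log_dir="$(mktemp -d)"

# Each workload runs in its own subshell, so each gets its own value of
# HABANA_VISIBLE_MODULES. The trailing & starts it as a background job.
( export HABANA_VISIBLE_MODULES="0,1,2,3"; run_workload first )  > "$log_dir/first.log"  2>&1 &
( export HABANA_VISIBLE_MODULES="4,5,6,7"; run_workload second ) > "$log_dir/second.log" 2>&1 &

# Block until both background jobs have finished.
wait
```

Setting the variable inside a subshell keeps the two assignments from leaking into each other or into the parent shell.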
Note
If HABANA_VISIBLE_DEVICES is not set, the PyTorch process acquires any available Gaudi processor within a single server. In a multi-server setup, however, it does not acquire any available Gaudi processor, because a subset of Gaudi processors spread across different nodes must be explicitly assigned using well-defined module IDs and HABANA_VISIBLE_DEVICES to ensure optimal performance.
Setting the Model-specific Arguments
Make sure to set the model-specific arguments properly. For example, if you need to change the number of processes from 8 to 2 or 4, update the -np argument of the mpirun command provided in the multi_tenants_resnet_pt.sh example.
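One way to keep the -np value consistent with the module list is to derive it from HABANA_VISIBLE_MODULES. The sketch below assumes one process per Gaudi processor; count_modules is a hypothetical helper, and the mpirun line is only printed, with a placeholder where the real training command would go.

```shell
# Hypothetical helper: count the comma-separated module IDs in its argument.
count_modules() {
    old_ifs=$IFS
    IFS=','
    # Re-split the argument on commas into the function's positional parameters.
    set -- $1
    IFS=$old_ifs
    echo "$#"
}

export HABANA_VISIBLE_MODULES="4,5,6,7"
num_procs="$(count_modules "$HABANA_VISIBLE_MODULES")"

# Print the command instead of running it; replace the placeholder with
# the actual training command and its arguments.
echo "mpirun -np ${num_procs} <training command>"
```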
You may also need to update the model arguments that set the folder where temporary data or checkpoints are saved.
In the multi_tenants_resnet_pt.sh example, you can find the --model argument in the mpirun command.
Make sure different workloads use different folders; otherwise, their contents might be overwritten unexpectedly.
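A simple way to guarantee separate folders is to derive each folder name from the workload's module IDs. The sketch below is an assumption-laden illustration: the base directory, the workload_ prefix, and the naming scheme are all hypothetical, and the resulting path would be passed to whichever argument your model uses for its output folder.

```shell
# Hypothetical base directory; a temporary one is used here for illustration.
base_dir="$(mktemp -d)"

for modules in "0,1,2,3" "4,5,6,7"; do
    # Replace commas with underscores to get a directory-friendly tag.
    tag="$(echo "$modules" | tr ',' '_')"
    out_dir="$base_dir/workload_$tag"
    mkdir -p "$out_dir"
    echo "modules=$modules output=$out_dir"
done
```

Because the module combinations used by parallel workloads never overlap, folder names built this way do not collide.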