Multiple Dockers Each with a Single Workload¶
Multiple tenants can run on a single server using multiple Docker containers. Each of the created Docker containers owns part of the Intel® Gaudi® AI accelerators exclusively, and one distributed workload runs in each container.
Setting the Docker Container¶
With habanalabs-container-runtime, you can select which Gaudi devices to mount in a Docker container by setting HABANA_VISIBLE_DEVICES when the container is started. Below is an example of a docker run command which mounts the four Gaudi processors with indices 0, 1, 2, and 3:
docker run … --runtime=habana -e HABANA_VISIBLE_DEVICES=0,1,2,3 ...
Setting HABANA_VISIBLE_MODULES¶
There are some guidelines on setting HABANA_VISIBLE_DEVICES; before reading them, you need to know how to find the mapping between the index and module ID of the Gaudi processors. The command below queries this mapping, shown here with sample output:
$ hl-smi -Q index,module_id -f csv
index, module_id
3, 6
1, 4
2, 7
0, 5
4, 2
6, 0
7, 3
5, 1
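The reverse lookup, from a set of module IDs to the matching indices, can also be scripted. Below is a minimal sketch; the sample CSV above is hard-coded since hl-smi is only available on a Gaudi node, and the module-ID filter (0 through 3) is just an illustrative choice:

```shell
# Sample output of `hl-smi -Q index,module_id -f csv`, hard-coded here;
# on a Gaudi node, capture it with: mapping=$(hl-smi -Q index,module_id -f csv)
mapping='index, module_id
3, 6
1, 4
2, 7
0, 5
4, 2
6, 0
7, 3
5, 1'

# Pick the indices whose module IDs are 0, 1, 2 and 3, sorted and comma-joined.
indices=$(echo "$mapping" | awk -F', ' 'NR > 1 && $2 <= 3 { print $1 }' | sort -n | paste -sd, -)
echo "HABANA_VISIBLE_DEVICES=$indices"
```

With the sample mapping, this prints HABANA_VISIBLE_DEVICES=4,5,6,7.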
With the mapping between index and module ID, you can set HABANA_VISIBLE_DEVICES properly by following the guidelines below:
Mount two or four Gaudi processors in the Docker container. Though using a subset of the Gaudi processors in a distributed workload is possible, only the 2-Gaudi and 4-Gaudi scenarios are supported. It is highly recommended to set the module IDs in HABANA_VISIBLE_MODULES using one of the combinations listed below:
2-Gaudi: “0,1”, “2,3”, “4,5” or “6,7”
4-Gaudi: “0,1,2,3” or “4,5,6,7”
Since HABANA_VISIBLE_DEVICES accepts indices instead of module IDs, you need to use the command above to figure out the indices that correspond to a given set of module IDs.
Avoid mounting the same index in multiple containers. Since multiple workloads might run in parallel, not mounting the same Gaudi in multiple Docker containers prevents the same Gaudi from being reused by different workloads.
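The second guideline can be enforced in a launch script with a quick disjointness check. A minimal sketch, where the two device lists are hypothetical values an operator would choose per container:

```shell
# Hypothetical device assignments for two containers on the same node.
devices_a="0,1,2,3"
devices_b="4,5,6,7"

# Collect every index that appears in both lists; empty means disjoint.
overlap=""
for d in $(echo "$devices_a" | tr ',' ' '); do
  case ",$devices_b," in
    *,"$d",*) overlap="$overlap $d" ;;
  esac
done

if [ -z "$overlap" ]; then
  echo "ok: device sets are disjoint"
else
  echo "error: index$overlap mounted in both containers" >&2
fi
```

With disjoint lists as above, the check passes; reusing an index in both lists would report it on stderr.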
Note
If HABANA_VISIBLE_DEVICES is not set, the PyTorch process acquires any available Gaudi processor within a single server. In a multi-server setup, however, it does not acquire any available Gaudi processor, as partial Gaudi processors across different nodes must be explicitly assigned using well-defined module IDs and HABANA_VISIBLE_DEVICES to ensure optimal performance.
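Given the note above, multi-server launch scripts may prefer to fail fast rather than fall back to implicit device acquisition. A minimal sketch, where the exported value is a placeholder an operator would normally choose per node:

```shell
# Placeholder assignment; normally chosen per node by the operator or scheduler.
export HABANA_VISIBLE_DEVICES="0,1,2,3"

# Abort early if the variable was left unset in a multi-server launch.
: "${HABANA_VISIBLE_DEVICES:?must be set explicitly for multi-server runs}"
echo "using Gaudi indices: $HABANA_VISIBLE_DEVICES"
```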
Running Distributed Workload Inside the Docker Container¶
Though there is only one workload running in the container in this scenario, it is necessary to set the environment variable HABANA_VISIBLE_MODULES, as in the Multiple Workloads on a Single Docker scenario.
If you are the creator of the Docker container, you can get the corresponding module_id of the devices specified in HABANA_VISIBLE_DEVICES with the hl-smi command shown in the section above, and set HABANA_VISIBLE_MODULES accordingly.
If you use a Docker container created by someone else, you can run the command below inside the container to get the module_id of the Gaudi processors available in the container:
$ hl-smi -Q module_id -f csv
module_id
4
6
5
7
According to the output in the above example, you can set the environment variable HABANA_VISIBLE_MODULES as follows:
export HABANA_VISIBLE_MODULES="4,5,6,7"
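The two steps above, querying the module IDs and exporting HABANA_VISIBLE_MODULES, can be combined in a small script. A sketch, with the sample hl-smi output hard-coded since the command is only available inside a Gaudi container:

```shell
# Sample output of `hl-smi -Q module_id -f csv`, hard-coded here; inside the
# container, capture it with: modules_csv=$(hl-smi -Q module_id -f csv)
modules_csv='module_id
4
6
5
7'

# Drop the CSV header, sort the module IDs, and join them with commas.
export HABANA_VISIBLE_MODULES=$(echo "$modules_csv" | tail -n +2 | sort -n | paste -sd, -)
echo "HABANA_VISIBLE_MODULES=$HABANA_VISIBLE_MODULES"
```

With the sample output, this exports HABANA_VISIBLE_MODULES=4,5,6,7, the same value as the manual export above.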