Multiple Dockers Each with a Single Workload

Multiple tenants can share a single server by running multiple Docker containers. Each of the created Docker containers exclusively owns a subset of the Intel® Gaudi® AI accelerators, and a single distributed workload runs in each container.

[Figure: Multiple Dockers, each with a single workload]

Setting Up the Docker Container

With habanalabs-container-runtime, you can select which Gaudi devices to mount in a Docker container by setting HABANA_VISIBLE_DEVICES when the container is started. Below is an example of a docker run command that mounts the four Gaudi processors with indices 0, 1, 2, and 3:

docker run … --runtime=habana -e HABANA_VISIBLE_DEVICES=0,1,2,3 ...
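For reference, a more complete launch command might look like the sketch below. The additional Docker flags and the image placeholder are illustrative assumptions, not requirements of this scenario; substitute the options and image you actually use:

docker run -it --runtime=habana \
    -e HABANA_VISIBLE_DEVICES=0,1,2,3 \
    --cap-add=sys_nice --net=host --ipc=host \
    <your-gaudi-docker-image>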

Setting HABANA_VISIBLE_DEVICES

There are some guidelines on setting HABANA_VISIBLE_DEVICES; however, before reading them you need to know how to find the mapping between the index and module ID of the Gaudi processors. The following command queries this mapping; a sample output is shown below:

$ hl-smi -Q index,module_id -f csv
index, module_id
3, 6
1, 4
2, 7
0, 5
4, 2
6, 0
7, 3
5, 1
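If you script container launches, this mapping can be turned into an index list automatically. The snippet below is a minimal sketch using standard awk and paste; the module-ID combination in MODULES is an assumed input:

MODULES="4,5,6,7"
hl-smi -Q index,module_id -f csv | awk -F', ' -v mods="$MODULES" '
    BEGIN { split(mods, m, ","); for (i in m) want[m[i]] = 1 }
    NR > 1 && ($2 in want) { print $1 }' | sort -n | paste -sd, -

With the sample output above, this prints 0,1,2,3.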

With the mapping between index and module ID, you can set HABANA_VISIBLE_DEVICES properly by following the guidelines below:

  • Mount two or four Gaudi processors in the Docker container. Though using a subset of the Gaudi processors in a distributed workload is possible, only the 2-Gaudi and 4-Gaudi scenarios are supported. It is highly recommended to choose module IDs from the combinations listed below (these are also the values to set later in HABANA_VISIBLE_MODULES):

    • 2-Gaudi: “0,1”, “2,3”, “4,5” or “6,7”

    • 4-Gaudi: “0,1,2,3” or “4,5,6,7”

  • Since HABANA_VISIBLE_DEVICES accepts indices rather than module IDs, use the command above to find the corresponding indices for your chosen set of module IDs; see the worked example after this list.

  • Avoid mounting the same index in multiple containers. Since multiple workloads may run in parallel, mounting each Gaudi in only one Docker container prevents the same Gaudi from being reused by different workloads.
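As a worked example, take the 4-Gaudi combination “4,5,6,7”. In the sample mapping above, module IDs 4, 5, 6, and 7 map to indices 1, 0, 3, and 2, so the container would be started with:

docker run ... --runtime=habana -e HABANA_VISIBLE_DEVICES=0,1,2,3 ...

Since HABANA_VISIBLE_MODULES is an ordinary environment variable, one option is to pass it at creation time as well, for example with -e HABANA_VISIBLE_MODULES=4,5,6,7; otherwise, export it inside the container as described in the next section.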

Note

If HABANA_VISIBLE_DEVICES is not set, the PyTorch process acquires any available Gaudi processor within a single server. In a multi-server setup, however, this automatic acquisition does not happen: Gaudi processors across different nodes must be explicitly assigned using well-defined module IDs and HABANA_VISIBLE_DEVICES to ensure optimal performance.

Running Distributed Workload Inside the Docker Container

Though only one workload runs in the container in this scenario, it is still necessary to set the environment variable HABANA_VISIBLE_MODULES, as in the Multiple Workloads on a Single Docker scenario.

If you are the creator of the Docker container, you can get the corresponding module_id of the devices specified in HABANA_VISIBLE_DEVICES with the hl-smi command mentioned in the section above and set HABANA_VISIBLE_MODULES accordingly.
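For example, assuming the container was created with HABANA_VISIBLE_DEVICES=0,1,2,3, a minimal host-side sketch to recover the matching module IDs is:

hl-smi -Q index,module_id -f csv | awk -F', ' 'NR > 1 && $1 ~ /^[0-3]$/ { print $2 }' | sort -n | paste -sd, -

With the sample mapping shown earlier, this prints 4,5,6,7, which is the value to set in HABANA_VISIBLE_MODULES.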

If you use a Docker container created by someone else, you can run the command below inside the container to get the module_id of the Gaudi processors available in it:

$ hl-smi -Q module_id -f csv
module_id
4
6
5
7

According to the output in the example above, you can set the environment variable HABANA_VISIBLE_MODULES as follows:

export HABANA_VISIBLE_MODULES="4,5,6,7"
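Alternatively, the value can be derived automatically inside the container. The sketch below assumes hl-smi is available in the container image:

export HABANA_VISIBLE_MODULES=$(hl-smi -Q module_id -f csv | tail -n +2 | sort -n | paste -sd, -)
echo $HABANA_VISIBLE_MODULES    # prints 4,5,6,7 for the sample output above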