Gaudi-to-process Assignment
Automatic Assignment
In distributed training scenarios, Habana TensorFlow Bridge tries to ensure that the Gaudi assigned to each process has a Module ID equal to the MPI local rank of that process. The local rank is read from the OMPI_COMM_WORLD_LOCAL_RANK environment variable, which should be set in an MPI-compliant runtime environment. If OMPI_COMM_WORLD_LOCAL_RANK is not set, every process allocates the first device that is not busy.
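The rule above can be illustrated with a minimal Python sketch. It only mirrors the described behavior and is not the bridge's implementation; the function name `expected_module_id` is hypothetical and introduced here for illustration.

```python
import os


def expected_module_id(env=None):
    """Return the Module ID that automatic assignment would target.

    Hypothetical helper: it mirrors the rule described in the text and
    does not call into Habana TensorFlow Bridge.
    """
    env = os.environ if env is None else env
    raw = env.get("OMPI_COMM_WORLD_LOCAL_RANK")
    if raw is None:
        # No local rank available: per the text, the process would take
        # the first device that is not busy, so no specific ID is known.
        return None
    return int(raw)


if __name__ == "__main__":
    module_id = expected_module_id()
    if module_id is None:
        print("OMPI_COMM_WORLD_LOCAL_RANK not set: first non-busy device")
    else:
        print(f"Expected Module ID: {module_id}")
```

Running this script in each process started by an MPI launcher is one way to check which Module ID a process would be expected to receive before training starts.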
Manual Assignment
The user can manually assign a Gaudi with a given Module ID to a particular process by setting either the HLS_MODULE_ID or the HABANA_VISIBLE_MODULES environment variable.
Flags | Value Format | Description |
---|---|---|
HLS_MODULE_ID | Single integer from the range [0, N-1], where N is the number of Gaudi devices available in the server. | The process will always try to acquire the Gaudi whose Module ID equals the given value. This variable must have a different value for every process. |
HABANA_VISIBLE_MODULES | Comma-separated list of integers from the valid Module ID range. In scale-out scenarios, the number of elements in the list must equal the number of Gaudis in the server. In scale-up scenarios, the number of elements in the list must equal the number of devices used in training. Each value in the list should be unique. | Each process will try to acquire the Module ID at the list position corresponding to its local rank. This variable must have the same value for every process. |
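
The constraints in the table can be sketched in Python as follows. This is an illustration of the documented rules under stated assumptions, not the bridge's own logic: both function names are hypothetical, and the source does not say what happens when both variables are set, so the two cases are kept separate.

```python
def module_from_visible_modules(visible_modules, local_rank):
    """Pick the Module ID for a process from a HABANA_VISIBLE_MODULES value.

    Hypothetical helper: selects the list entry at the position given by
    the process's local rank, as described in the table.
    """
    ids = [int(item) for item in visible_modules.split(",")]
    if len(set(ids)) != len(ids):
        raise ValueError("each value in the list should be unique")
    if not 0 <= local_rank < len(ids):
        raise ValueError("local rank has no corresponding list position")
    return ids[local_rank]


def check_hls_module_ids(values, num_devices):
    """Validate the HLS_MODULE_ID values chosen for a set of processes.

    Hypothetical helper: every value must lie in [0, num_devices - 1]
    and no two processes may share a value.
    """
    ids = [int(value) for value in values]
    if any(not 0 <= module_id < num_devices for module_id in ids):
        raise ValueError("value outside the range [0, N-1]")
    if len(set(ids)) != len(ids):
        raise ValueError("every process needs a different value")
    return ids
```

For example, with HABANA_VISIBLE_MODULES set to `4,5,6,7`, `module_from_visible_modules("4,5,6,7", 2)` selects Module ID 6 for the process with local rank 2.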