Gaudi-to-process Assignment

Automatic Assignment

In distributed training scenarios, Habana TensorFlow Bridge tries to ensure that the Gaudi assigned to each process has a Module ID equal to the MPI local rank of that process. The local rank is read from the OMPI_COMM_WORLD_LOCAL_RANK environment variable, which should be set in an MPI-compliant runtime environment.

If OMPI_COMM_WORLD_LOCAL_RANK is not set, every process allocates the first device that is not busy.
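The rule above can be illustrated with a minimal Python sketch. It is not the bridge's actual implementation; the function name expected_module_id is hypothetical and only mirrors the described behavior: if the local rank is available, it is the Module ID the process is expected to get, otherwise no specific Module ID is implied.

```python
import os
from typing import Mapping, Optional


def expected_module_id(env: Mapping[str, str] = os.environ) -> Optional[int]:
    """Return the Module ID implied by the MPI local rank, or None if unset.

    Hypothetical helper for illustration only; it does not talk to any device.
    """
    raw = env.get("OMPI_COMM_WORLD_LOCAL_RANK")
    if raw is None:
        # No local rank available: per the text above, the process would
        # instead allocate the first device that is not busy.
        return None
    return int(raw)
```

Calling expected_module_id() inside a process started by an Open MPI launcher is one way to check which Module ID that process is expected to receive.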

Manual Assignment

The user can manually assign a Gaudi with a given Module ID to a particular process by setting either the HLS_MODULE_ID or the HABANA_VISIBLE_MODULES environment variable.

Flag: HLS_MODULE_ID

Value format: A single integer in the range [0, N-1], where N is the number of Gaudi devices available in the server.

Description: The process always tries to acquire the Gaudi whose Module ID equals the given value. This variable must have a different value for every process.

Flag: HABANA_VISIBLE_MODULES

Value format: A comma-separated list of integers from the valid HLS_MODULE_ID range.

Description: In scale-out scenarios, the number of elements in the list must be equal to the number of Gaudis in the server. In scale-up scenarios, the number of elements in the list must be equal to the number of devices used in training. Each value in the list should be unique. Each process tries to acquire the Module ID at the list position corresponding to its local rank. This variable must have the same value for every process.
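To make the HABANA_VISIBLE_MODULES lookup concrete, the following Python sketch picks the Module ID for a given local rank. The function module_id_for_rank is a hypothetical helper written for this illustration, not part of any Habana API, and it assumes local ranks are zero-based list positions (as MPI local ranks are).

```python
from typing import List


def module_id_for_rank(visible_modules: str, local_rank: int) -> int:
    """Pick the Module ID at the list position matching the local rank.

    Hypothetical helper: it only models the lookup rule described above.
    """
    # Parse the comma-separated list, e.g. "4,5,6,7" -> [4, 5, 6, 7].
    ids: List[int] = [int(item) for item in visible_modules.split(",")]
    if len(set(ids)) != len(ids):
        raise ValueError("each Module ID in the list should be unique")
    if not 0 <= local_rank < len(ids):
        raise ValueError("local rank has no corresponding list position")
    return ids[local_rank]
```

For example, with HABANA_VISIBLE_MODULES set to 4,5,6,7 for every process, the process with local rank 2 would try to acquire Module ID 6. With HLS_MODULE_ID there is no lookup: each process is given its own distinct value directly.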