Using DistributedTensor with Intel Gaudi¶
DistributedTensor (DTensor) is a PyTorch extension that provides a distributed tensor abstraction. It allows you to shard tensors across multiple devices and to perform operations on those tensors in a distributed manner. This can be useful for training large models on multiple machines, or for running inference on large datasets.
DTensor is supported with the Intel® Gaudi® 2 AI accelerator and standard PyTorch interfaces. For more details, refer to Distributed Tensor Theory of Operations.
Intel Gaudi with DTensor is supported only in Eager mode or with torch.compile (PT_HPU_LAZY_MODE=0).
Note
This feature is currently experimental.
CustomOp is currently not supported.
Lazy mode is not supported.
Some ops do not support DTensor. Support for additional ops will be added in future releases.
Reduction ops such as mean are not supported by Gaudi.
Overview¶
DTensor provides a number of features that make it easy to work with distributed tensors, such as automatic data sharding and replication and efficient communication between devices. It also provides tools for debugging and monitoring distributed training.
When parallelizing a workload, there are different ways to spread data and compute across devices. DTensor achieves this using different types of distribution mechanisms:
Shard - Splits the tensor on the specified dimension across devices.
Replicate - Replicates the tensor across devices.
Partial - Stores only partial values in the tensor while maintaining the global shape. This is used for intermediate results of computations such as allreduce.
The core building block for a DTensor is the logical device mesh and the placement strategy of tensor distributions:
Device Mesh - Describes the layout of devices, allowing the devices in a group to communicate during an operation. It is an n-dimensional array of devices onto which tensors are placed.
PlacementSpec - Captures the different tensor distribution types such as shard, replicate, and partial. It describes how the tensor data is distributed across a specific dimension of the devices in the device mesh.
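For illustration, the following is a minimal sketch (not taken from the official examples) of creating a device mesh over four Gaudi devices and applying the Shard and Replicate placements with distribute_tensor. It assumes the HCCL backend, the hpu device type, one process per device, and PT_HPU_LAZY_MODE=0.

```python
import torch
import torch.distributed as dist
import habana_frameworks.torch.distributed.hccl  # registers the HCCL backend for Gaudi
from torch.distributed._tensor import DeviceMesh, Shard, Replicate, distribute_tensor

# Assumes the script is launched with one process per Gaudi device.
dist.init_process_group(backend="hccl")
rank = dist.get_rank()

# 1D logical device mesh spanning four Gaudi devices.
mesh = DeviceMesh("hpu", list(range(4)))

big_tensor = torch.randn(8, 16)

# Shard placement: dimension 0 is split across the mesh, so each device holds a (2, 16) shard.
sharded = distribute_tensor(big_tensor, mesh, placements=[Shard(0)])

# Replicate placement: every device holds the full (8, 16) tensor.
replicated = distribute_tensor(big_tensor, mesh, placements=[Replicate()])

print(f"rank {rank}: local shard shape = {tuple(sharded.to_local().shape)}")
```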
Running a Simple Model Using DTensor on Gaudi¶
The toy_example.py below is a simple example that demonstrates the usage of distributed tensor on Gaudi. This example uses four nodes:
Line 20 - Import habana_frameworks.torch.distributed.hccl.
Line 52 - Create a device mesh of Gaudi devices based on the sharding plan.
Line 68 - Parallelize the module based on the given parallel style on the Gaudi device.
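As a rough sketch of how these steps might fit together (the ToyModel below and the column-/row-wise sharding plan are illustrative assumptions, not necessarily the contents of the actual toy_example.py):

```python
import torch
import torch.nn as nn
import torch.distributed as dist
import habana_frameworks.torch.distributed.hccl  # import the HCCL integration (Line 20 step)
from torch.distributed._tensor import DeviceMesh
from torch.distributed.tensor.parallel import parallelize_module, ColwiseParallel, RowwiseParallel


class ToyModel(nn.Module):
    """Illustrative two-layer MLP; the real toy_example.py may differ."""

    def __init__(self):
        super().__init__()
        self.net1 = nn.Linear(10, 32)
        self.relu = nn.ReLU()
        self.net2 = nn.Linear(32, 5)

    def forward(self, x):
        return self.net2(self.relu(self.net1(x)))


dist.init_process_group(backend="hccl")
world_size = dist.get_world_size()

# Create a device mesh of Gaudi devices based on the sharding plan (Line 52 step).
mesh = DeviceMesh("hpu", list(range(world_size)))

model = ToyModel().to("hpu")

# Parallelize the module based on the given parallel style (Line 68 step):
# column-wise sharding for net1, row-wise sharding for net2.
model = parallelize_module(model, mesh, {"net1": ColwiseParallel(), "net2": RowwiseParallel()})

out = model(torch.randn(8, 10, device="hpu"))
```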
Executing the Example¶
Execute toy_example.py by running the command below. Since lazy mode is the default, running the model with PT_HPU_LAZY_MODE=0 disables lazy mode:
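The exact launch command depends on the launcher in use; assuming torchrun with four Gaudi devices on a single node, a typical invocation might look like:

```bash
PT_HPU_LAZY_MODE=0 torchrun --nproc_per_node=4 toy_example.py
```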
Supported Features¶
The following table lists the DTensor features supported on Gaudi. For more details on DTensor, refer to the DTensor GitHub page.
Feature | Description
---|---
Sharding | Shards a tensor on the specified dimension across devices.
Replication | Replicates a tensor across devices.
Partial | Tensor with the same global shape but only partial values on each device.
Sharding+Replication | Combination of sharded and replicated placements.
DeviceMesh | Abstraction that describes the global view of devices within a cluster.
distribute_tensor | Distributes a tensor according to device_mesh placements.
distribute_module | Converts all module parameters to distributed tensor parameters.
parallelize_module | Parallelizes a module for tensor parallelism.
redistribute | Converts from one placement to another, for example from row-wise sharding to column-wise.
dtensor_from_local | Converts a torch.Tensor to a DTensor.
dtensor_to_local | Converts a DTensor to a torch.Tensor.
checkpoint | Saves/loads large sharded models.
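As an illustration of dtensor_from_local, redistribute, and dtensor_to_local from the table above, the following minimal sketch (assuming the HCCL backend, the hpu device type, and one process per Gaudi device) wraps a local shard into a DTensor, redistributes it from row-wise sharding to replication, and converts it back to a torch.Tensor:

```python
import torch
import torch.distributed as dist
import habana_frameworks.torch.distributed.hccl  # registers the HCCL backend for Gaudi
from torch.distributed._tensor import DeviceMesh, DTensor, Shard, Replicate

dist.init_process_group(backend="hccl")
mesh = DeviceMesh("hpu", list(range(dist.get_world_size())))

# Each rank wraps its local shard into a row-wise sharded DTensor.
local_shard = torch.randn(4, 16)
dt = DTensor.from_local(local_shard, mesh, placements=[Shard(0)])

# Redistribute from row-wise sharding to replication (performs an allgather).
replicated = dt.redistribute(mesh, placements=[Replicate()])

# to_local returns the tensor materialized on this rank: here, the full global tensor.
print(tuple(replicated.to_local().shape))  # (world_size * 4, 16)
```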