DeepSpeed User Guide for Training¶

This document guides data scientists through running PyTorch models on the Intel® Gaudi® AI accelerator using DeepSpeed.

DeepSpeed Validated Configurations¶

The following configurations have been validated to be fully functioning for DeepSpeed training on Gaudi:

Configuration	Description	Example
Distributed Data Parallel (multi-card)	Trains the same model across multiple ranks by splitting the dataset between the workers to achieve better performance compared to a single card.	N/A
ZeRO-1	Partitions the optimizer states across the ranks so that each process updates its own partition. For further details, refer to the Using ZeRO section.	Bert
ZeRO-2	On top of ZeRO-1, each process retains only the gradients corresponding to its portion of the optimizer states.	Bert
ZeRO-3	The full model state is partitioned across the processes (including 16-bit weights). ZeRO-3 automatically collects and partitions them during the forward and backward passes. Make sure to use only optimizers that have been tested with DeepSpeed ZeRO. For further details, refer to the Using ZeRO section.	Flan_T5_XXL
ZeRO++ hpZ	ZeRO++ is a set of optimization methods that extend ZeRO capabilities and improve large model training efficiency. It can only be used with ZeRO-3. Hierarchical partitioning ZeRO (hpZ) is one of the three ZeRO++ communication optimizations. Support for the other two methods will be added in future releases. Unlike ZeRO, hpZ keeps a complete model copy on each machine. Although this increases memory usage, it replaces the costly cross-machine all-gather/broadcast on weights with an intra-machine alternative, which is faster due to the high intra-machine communication bandwidth.	DeepSpeed ZeRO++ Tutorial
ZeRO-Offload	Offloads the optimizer’s memory and computation from HPU to the host CPU. The implementation of Adam on CPU is made more efficient by DeepSpeedCPUAdam.	offload_optimizer_to_cpu
ZeRO-Infinity	Extends ZeRO-3 functionality by allowing the offload of both the model and optimizer parameters to the CPU memory.	offload_optimizer_param_to_cpu
Model Pipeline Parallelism	Splits the model layers between several workers so that each worker executes the forward and backward passes of its own layers.	N/A
BF16 Precision	Reduces model memory consumption and improves performance by training with BF16 precision.	Bert
BF16 Optimizer	An optimizer that implements ZeRO-1 for BF16 with gradient accumulation in FP32. Allows BF16 precision training with pipeline parallelism.	Bert
Activation Checkpointing	Recomputes forward pass activations during the backward pass in order to save memory. For further details, refer to the Using Activation Checkpointing section.	Bert
`torch.compile`	Wraps parts of a model into a graph for improved performance. Model parts are compiled once at the start, allowing the compiled part to be called throughout execution. For further details, refer to the Using torch.compile section.	Bert

Note

DeepSpeed’s multi-server training uses pdsh to invoke the processes on remote hosts. Make sure pdsh is installed on your machine before launching multi-server training.
Upon initialization, Intel Gaudi DeepSpeed enforces deterministic behavior by setting habana_frameworks.torch.hpu.setDeterministic(True).
All further information on DeepSpeed configurations can be found in DeepSpeed documentation.

Installing DeepSpeed Library¶

Intel Gaudi provides a DeepSpeed fork that adds support for the Intel Gaudi software. To use DeepSpeed with Gaudi, you must install this fork, which is based on DeepSpeed v0.14.4:

pip install git+https://github.com/HabanaAI/DeepSpeed.git@1.22.1

Integrating DeepSpeed with Gaudi¶

To run DeepSpeed on Gaudi:

Prepare your PyTorch model to run on Gaudi by following the steps detailed in the PyTorch Model Porting section. If you have an existing training script that runs on Gaudi, migrating your model is not required.
Follow the instructions in https://www.deepspeed.ai/getting-started/ with the following modifications:
1. Replace loss.backward() and optimizer.step() with model_engine.backward(loss) and model_engine.step().
2. Replace all usages of the model object that was passed to the deepspeed.initialize() call with the returned model_engine object.
3. Remove the import "from torch.nn.parallel import DistributedDataParallel as DDP" and remove the DDP call for the model.
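
The modifications above can be sketched as a training loop. This is a minimal sketch rather than code from the DeepSpeed or Intel Gaudi documentation: train_one_epoch, data_loader and loss_fn are hypothetical names, and model_engine is assumed to be the engine object returned by deepspeed.initialize().

```python
def train_one_epoch(model_engine, data_loader, loss_fn):
    """Run one epoch through the engine returned by deepspeed.initialize().

    All three arguments are placeholders for objects from your own script.
    Returns the number of optimizer steps that were taken.
    """
    steps = 0
    for inputs, labels in data_loader:
        outputs = model_engine(inputs)   # call the engine, not the original model
        loss = loss_fn(outputs, labels)
        model_engine.backward(loss)      # replaces loss.backward()
        model_engine.step()              # replaces optimizer.step()
        steps += 1
    return steps
```

Note that the loop never touches the optimizer or a DDP wrapper directly; both are handled through the engine.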

In deepspeed.init_distributed(), make sure that dist_backend is set to HCCL:

deepspeed.init_distributed(dist_backend='hccl', init_method = <init_method>)

For the current release, the following steps are required, in this specific order, before calling deepspeed.initialize():
1. Move your model to HPU and cast it to BF16 if required:
  model.to(device="hpu", dtype=torch.bfloat16)
2. If your model uses weight sharing, make sure these weights are created inside the module. Refer to Weight Sharing.
3. Initialize the optimizer.

Configure the throughput timer to be unsynchronized by adding the following to your DeepSpeed JSON configuration file:

"timers": {
  "throughput": {
    "enabled": true,
    "synchronized": false
  }
}
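
The ordering requirements and the timer setting above can be combined into one setup routine. The following is a sketch under stated assumptions, not the validated Intel Gaudi flow: with_unsynchronized_throughput_timer and build_engine are hypothetical helpers, the choice of torch.optim.AdamW and the lr default are illustrative only, the configuration is passed as a Python dict instead of a JSON file, and the script is assumed to be started by the DeepSpeed launcher on a machine with the Intel Gaudi software installed.

```python
import copy


def with_unsynchronized_throughput_timer(ds_config):
    """Return a copy of a DeepSpeed config dict with the throughput timer unsynchronized."""
    config = copy.deepcopy(ds_config)
    timers = config.setdefault("timers", {})
    timers["throughput"] = {"enabled": True, "synchronized": False}
    return config


def build_engine(model, ds_config, cast_to_bf16=True, lr=1e-4):
    """Hypothetical helper: apply the required order, then call deepspeed.initialize()."""
    # Imported here so the config helper above can be used without these packages.
    import torch
    import habana_frameworks.torch.core  # noqa: F401  (loads the Gaudi PyTorch bridge)
    import deepspeed

    deepspeed.init_distributed(dist_backend="hccl")

    # 1. Move the model to HPU and cast it to BF16 if required.
    model = model.to("hpu")
    if cast_to_bf16:
        model = model.to(torch.bfloat16)

    # 2. Weight sharing (if any) is assumed to be set up inside the module itself.

    # 3. Initialize the optimizer only after the model is on the device.
    #    AdamW is an assumption; use an optimizer suited to your configuration.
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)

    model_engine, optimizer, _, _ = deepspeed.initialize(
        model=model,
        optimizer=optimizer,
        config=with_unsynchronized_throughput_timer(ds_config),
    )
    return model_engine, optimizer
```

The config helper copies its input, so the original dict can still be reused for runs on other hardware.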

Note

It is highly recommended to review our DeepSpeed-BERT pre-training example.

Using ZeRO¶

ZeRO-1 - For optimal ZeRO-1 performance, it is recommended to set contiguous_gradients to false in the DeepSpeed ZeRO settings. The following shows a usage example:
"zero_optimization": {
    "stage": 1,
    ...
    "contiguous_gradients": false
}
ZeRO-3 - For optimal ZeRO-3 performance, it is recommended to configure the following parameters in the DeepSpeed ZeRO settings:
- overlap_comm=false
- contiguous_gradients=true
- reduce_scatter=false
The following shows a usage example:
```
"zero_optimization": {
    "stage": 3,
    "overlap_comm": false,
    ...
    "contiguous_gradients": true,
    "reduce_scatter": false
}
```
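
When the DeepSpeed configuration is assembled in Python rather than written by hand, the recommendations above can be applied programmatically. The helpers below are a hypothetical sketch (recommended_zero_settings and apply_zero_settings are not part of DeepSpeed); they encode only the ZeRO-1 and ZeRO-3 values listed in this section and leave every other ZeRO key as it was.

```python
def recommended_zero_settings(stage):
    """Return the ZeRO settings recommended in this section for the given stage."""
    if stage == 1:
        return {"stage": 1, "contiguous_gradients": False}
    if stage == 3:
        return {
            "stage": 3,
            "overlap_comm": False,
            "contiguous_gradients": True,
            "reduce_scatter": False,
        }
    raise ValueError(f"No recommendation listed for ZeRO stage {stage}")


def apply_zero_settings(ds_config, stage):
    """Return a copy of ds_config with the recommended values merged in."""
    config = dict(ds_config)
    zero = dict(config.get("zero_optimization", {}))
    zero.update(recommended_zero_settings(stage))
    config["zero_optimization"] = zero
    return config
```

The resulting dict can be serialized with json.dump to produce the JSON file used by the DeepSpeed launcher.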

For further information on how to configure ZeRO, refer to the ZeRO Configuration section.

Using Activation Checkpointing¶

To use activation checkpointing with Gaudi, integrate the deepspeed.runtime.activation_checkpointing.checkpointing.checkpoint wrapper from Intel Gaudi’s DeepSpeed fork into your model according to the instructions in the TORCH.UTILS.CHECKPOINT guide. The following example is taken from the DeepSpeed-BERT script/modeling.py:

class BertEncoder(nn.Module):
    def __init__(self, config):
        super(BertEncoder, self).__init__()
        self.layer = nn.ModuleList([BertLayer(config) for _ in range(config.num_hidden_layers)])
        self.output_all_encoded_layers = config.output_all_encoded_layers
        self._checkpoint_activations = False
        self._checkpoint_activations_interval = 1

        ...

    def forward(self, hidden_states, attention_mask):
        all_encoder_layers = []

        layer_norm_input = 0
        if self._checkpoint_activations:
            hidden_states, layer_norm_input = self.checkpointed_forward(
                hidden_states, layer_norm_input, attention_mask, self._checkpoint_activations_interval)
        else:
            for i, layer_module in enumerate(self.layer):
                hidden_states, layer_norm_input = layer_module(hidden_states, layer_norm_input, attention_mask)
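
The excerpt calls self.checkpointed_forward without showing it. The following simplified sketch illustrates how such a helper could group layers into chunks of a given interval and run each chunk through a checkpoint wrapper. It is not the DeepSpeed-BERT implementation: it is written as a free function, it drops the layer_norm_input argument, and checkpoint_fn is a stand-in for the DeepSpeed checkpoint wrapper, assumed to be callable as checkpoint_fn(function, *args).

```python
def checkpointed_forward(layers, hidden_states, attention_mask, interval, checkpoint_fn):
    """Run `layers` in chunks of `interval`, checkpointing at each chunk boundary."""

    def make_chunk(start, end):
        # Build a function that runs layers[start:end]; only its inputs are
        # kept, and its activations are recomputed during the backward pass.
        def run_chunk(states, mask):
            for layer in layers[start:end]:
                states = layer(states, mask)
            return states
        return run_chunk

    for start in range(0, len(layers), interval):
        chunk = make_chunk(start, start + interval)
        hidden_states = checkpoint_fn(chunk, hidden_states, attention_mask)
    return hidden_states
```

A larger interval means fewer checkpoints are stored, at the cost of recomputing more layers per chunk in the backward pass.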

The values of the following parameters have been validated to be fully functioning on Gaudi:

"partition_activations": true/false
"cpu_checkpointing": true/false
"contiguous_memory_optimization": true/false - As per the DeepSpeed documentation, contiguous_memory_optimization can be set to true only when partition_activations is true.
"synchronize_checkpoint_boundary": true/false
"profile": false
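
The constraint between contiguous_memory_optimization and partition_activations is easy to violate when editing a configuration by hand. The helper below is a hypothetical sketch (activation_checkpointing_section is not a DeepSpeed function) that builds the corresponding configuration section as a Python dict and rejects the invalid combination; it assumes these keys live under an "activation_checkpointing" section, as in the DeepSpeed configuration format.

```python
def activation_checkpointing_section(
    partition_activations=False,
    cpu_checkpointing=False,
    contiguous_memory_optimization=False,
    synchronize_checkpoint_boundary=False,
):
    """Build an activation checkpointing config section using the validated values."""
    if contiguous_memory_optimization and not partition_activations:
        raise ValueError(
            "contiguous_memory_optimization requires partition_activations to be true"
        )
    return {
        "activation_checkpointing": {
            "partition_activations": partition_activations,
            "cpu_checkpointing": cpu_checkpointing,
            "contiguous_memory_optimization": contiguous_memory_optimization,
            "synchronize_checkpoint_boundary": synchronize_checkpoint_boundary,
            "profile": False,  # only false is listed as validated above
        }
    }
```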

Note

It is recommended to use the native activation checkpointing function of PyTorch if your model has torch.compile enabled. Running workloads with DeepSpeed activation checkpointing and torch.compile may result in accuracy issues.

For further details, refer to the Configuring Activation Checkpointing section.

Using `torch.compile`¶

When compiling your model with DeepSpeed, call the dedicated compile() API of the DeepSpeedEngine, which is the object returned by deepspeed.initialize():

def compile(self,
         backend=get_accelerator().get_compile_backend(),
         compile_kwargs={},
         compile_optimizer_step=False,
         compiled_autograd_enabled=False) -> None:

Note

Calling DeepSpeedEngine compile() with the default arguments is sufficient. However, specifying a non-default backend or passing other compile_kwargs to the torch.compile API is also supported.
The compile_optimizer_step and compiled_autograd_enabled features can be set, but they are still under development and may cause unexpected behavior.
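
As a minimal sketch of the call described above, the hypothetical wrapper below invokes compile() on the engine with the default arguments unless a backend or compile_kwargs is given. It only forwards the backend and compile_kwargs parameters from the signature shown earlier and deliberately leaves compile_optimizer_step and compiled_autograd_enabled at their defaults.

```python
def compile_engine(model_engine, backend=None, compile_kwargs=None):
    """Call DeepSpeedEngine.compile(), overriding the defaults only when asked to."""
    kwargs = {}
    if backend is not None:
        kwargs["backend"] = backend
    if compile_kwargs:
        kwargs["compile_kwargs"] = compile_kwargs
    # model_engine is the object returned by deepspeed.initialize().
    model_engine.compile(**kwargs)
    return model_engine
```

Calling compile_engine(model_engine) is therefore equivalent to model_engine.compile().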

A usage example is available in the DeepSpeed-BERT pre-training example.

Gaudi Documentation 1.22.1 documentation
