# DeepSpeed User Guide

The purpose of this document is to guide data scientists in running PyTorch models on the Habana® Gaudi® infrastructure using the DeepSpeed interface.

## DeepSpeed Gaudi Integration

Existing model training scripts can be migrated to the DeepSpeed library to integrate new optimizations into the training process, including the following. Refer to https://www.deepspeed.ai for further details.

• Zero Redundancy Optimizer (ZeRO)

• Training in lower-precision data types such as bfloat16

• Activation checkpointing

• Model pipeline parallelism

The HabanaAI GitHub hosts a fork of the DeepSpeed library that adds support for Gaudi. To use DeepSpeed with Gaudi, you must install Habana's DeepSpeed fork in one of the following ways:

• By installing directly from the DeepSpeed fork repository located in the HabanaAI GitHub:

pip install git+https://github.com/HabanaAI/DeepSpeed.git@v1.5.0

• Or, by cloning the DeepSpeed fork repository and installing from the local directory:

git clone -b v1.5.0 https://github.com/HabanaAI/DeepSpeed.git

cd DeepSpeed

pip install .
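After either installation method, you may want to confirm that the package is visible to your Python environment. The following sketch uses only the standard library, so it runs whether or not DeepSpeed is installed; the helper name `deepspeed_installed` is illustrative, not part of any DeepSpeed API.

```python
# Sanity-check that a 'deepspeed' package is importable in this environment.
# Standard library only; safe to run before or after installation.
import importlib.util


def deepspeed_installed() -> bool:
    """Return True if a 'deepspeed' package can be found on the import path."""
    return importlib.util.find_spec("deepspeed") is not None


if deepspeed_installed():
    print("deepspeed found")
else:
    print("deepspeed not found: run the pip install command above")
```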


## DeepSpeed Validated Configurations

The following DeepSpeed configurations have been validated as fully functional with Gaudi:

• Distributed Data Parallel (multi-card)

• ZeRO Stage 1

• BF16 precision

Note

DeepSpeed's multi-node training uses pdsh to invoke processes on remote hosts. Make sure pdsh is installed on all machines before launching multi-node training.
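For multi-node runs, DeepSpeed reads a hostfile listing each host and the number of devices (slots) available on it. The hostnames and slot counts below are placeholders for your own cluster:

```
# Illustrative DeepSpeed hostfile: one host per line, slots = devices per host
worker-1 slots=8
worker-2 slots=8
```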

## Working with DeepSpeed on Gaudi

• If you have an existing training script that runs on Gaudi, you can convert it to DeepSpeed by following the instructions at https://www.deepspeed.ai.

• To align both the training script and DeepSpeed to use Gaudi, it is highly recommended to use only the dedicated DeepSpeed flag use_hpu inside the training script.

• You can provide a DeepSpeed JSON configuration file to specify a collection of training settings. See https://www.deepspeed.ai/docs/config-json/.

It is highly recommended to review one of the examples located under DeepSpeed Examples.
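As an illustration, a minimal JSON configuration enabling the validated ZeRO Stage 1 and BF16 settings could look like the following. The key names follow the upstream DeepSpeed configuration documentation, and the batch size is an arbitrary example:

```json
{
  "train_batch_size": 8,
  "bf16": {
    "enabled": true
  },
  "zero_optimization": {
    "stage": 1
  }
}
```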