# DeepSpeed User Guide

The purpose of this document is to guide data scientists in running PyTorch models on the Habana® Gaudi® infrastructure using the DeepSpeed interface.

## DeepSpeed Gaudi Integration

Existing model training scripts can be migrated to the DeepSpeed library to integrate new optimizations into the training process, including the following. Refer to https://www.deepspeed.ai for further details.

• Zero Redundancy Optimizer (ZeRO)

• Training in lower-precision data types such as bfloat16

• Activation checkpointing

• Model pipeline parallelism

The HabanaAI GitHub hosts a fork of the DeepSpeed library that adds support for Gaudi. To use DeepSpeed with Gaudi, you must install Habana's DeepSpeed fork in one of the following ways:

• By installing directly from the DeepSpeed fork repository located in the HabanaAI GitHub:

pip install git+https://github.com/HabanaAI/DeepSpeed.git@v1.5.0

• Or, by cloning the DeepSpeed fork repository and installing from the local directory:

git clone -b v1.5.0 https://github.com/HabanaAI/DeepSpeed.git

cd DeepSpeed

pip install .
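After either installation method, you may want to confirm that the package is visible to your Python environment. The following sketch uses only the standard library, so it runs whether or not DeepSpeed is installed; the helper name `deepspeed_installed` is illustrative, not part of any DeepSpeed API.

```python
# Sanity-check that a 'deepspeed' package is importable in this environment.
# Standard library only; safe to run before or after installation.
import importlib.util


def deepspeed_installed() -> bool:
    """Return True if a 'deepspeed' package can be found on the import path."""
    return importlib.util.find_spec("deepspeed") is not None


if deepspeed_installed():
    print("deepspeed found")
else:
    print("deepspeed not found: run the pip install command above")
```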


## DeepSpeed Validated Configurations

The following DeepSpeed configurations have been validated as fully functional with Gaudi:

• Distributed Data Parallel (multi-card)

• ZeRO Stage 1

• BF16 precision

Note

DeepSpeed's multi-node training uses pdsh to invoke processes on remote hosts. Make sure pdsh is installed on all machines before launching multi-node training.
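For multi-node runs, DeepSpeed reads a hostfile listing each host and the number of devices (slots) available on it. The hostnames and slot counts below are placeholders for your own cluster:

```
# Illustrative DeepSpeed hostfile: one host per line, slots = devices per host
worker-1 slots=8
worker-2 slots=8
```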

## Working with DeepSpeed on Gaudi

• If you have an existing training script that runs on Gaudi, you can convert it to DeepSpeed by following the instructions at https://www.deepspeed.ai.

• To align both the training script and DeepSpeed to use Gaudi, it is highly recommended to use only the dedicated DeepSpeed flag use_hpu inside the training script.

• You can provide a DeepSpeed JSON configuration file to specify a collection of training settings. See https://www.deepspeed.ai/docs/config-json/.

It is highly recommended to review one of the examples located under DeepSpeed Examples.
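As an illustration, a minimal JSON configuration enabling the validated ZeRO Stage 1 and BF16 settings could look like the following. The key names follow the upstream DeepSpeed configuration documentation, and the batch size is an arbitrary example:

```json
{
  "train_batch_size": 8,
  "bf16": {
    "enabled": true
  },
  "zero_optimization": {
    "stage": 1
  }
}
```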