DeepSpeed User Guide

The purpose of this document is to guide Data Scientists in running PyTorch models on the Habana® Gaudi® infrastructure using the DeepSpeed interface.

DeepSpeed Gaudi Integration

Existing model training scripts can be migrated to use the DeepSpeed library and integrate new optimizations into the training process, including the following. Refer to https://www.deepspeed.ai for further details.

  • Zero Redundancy Optimizer (ZeRO)

  • Training in lower-precision data types such as BFloat16

  • Activation checkpointing

  • Model pipeline parallelism

The HabanaAI GitHub organization hosts a fork of the DeepSpeed library that includes changes adding support for Gaudi. To use DeepSpeed with Gaudi, you must install Habana’s fork of DeepSpeed in one of two ways:

  • By installing directly from the DeepSpeed fork repository located in the HabanaAI GitHub:

pip install git+https://github.com/HabanaAI/DeepSpeed.git@v1.5.0
  • Or, by cloning the DeepSpeed fork repository and installing from the local directory:

git clone -b v1.5.0 https://github.com/HabanaAI/DeepSpeed.git

cd DeepSpeed

pip install .
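
After either installation method, a quick sanity check (an optional step, not part of Habana’s documented flow) can confirm that the package is importable:

python -c "import deepspeed; print(deepspeed.__version__)"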

DeepSpeed Validated Configurations

The following DeepSpeed configurations have been validated as fully functional with Gaudi (a sample configuration file illustrating these settings follows the list):

  • Distributed Data Parallel (multi-card)

  • ZeRO stage 1 (ZeRO-1)

  • BF16 precision
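
The sketch below is a minimal DeepSpeed JSON configuration combining the validated ZeRO-1 and BF16 settings. The keys shown (train_micro_batch_size_per_gpu, optimizer, bf16, zero_optimization) are standard DeepSpeed configuration options, but the values are illustrative assumptions and the exact keys supported may vary between fork versions; consult https://www.deepspeed.ai/docs/config-json/ for the authoritative reference.

{
  "train_micro_batch_size_per_gpu": 8,
  "gradient_accumulation_steps": 1,
  "optimizer": {
    "type": "Adam",
    "params": { "lr": 0.0001 }
  },
  "bf16": { "enabled": true },
  "zero_optimization": { "stage": 1 }
}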

Note

DeepSpeed’s multi-node training uses pdsh to invoke the processes on remote hosts. Make sure pdsh is installed on your machines before launching multi-node training.
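
For example, on Debian- or Ubuntu-based hosts (an assumption about your distribution), pdsh can typically be installed with:

sudo apt-get install pdsh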

Working with DeepSpeed on Gaudi

  • If you have an existing training script that runs on Gaudi, it can be converted to use DeepSpeed by following the instructions at https://www.deepspeed.ai.

  • To align both the training script and DeepSpeed to use Gaudi, it is highly recommended to use only the dedicated DeepSpeed flag use_hpu inside the training script (see the sketch after this list).

  • You can provide a DeepSpeed JSON configuration file to specify a collection of training settings. See https://www.deepspeed.ai/docs/config-json/.
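
The following is a minimal sketch of how a training script might wire these pieces together. deepspeed.add_config_arguments and deepspeed.initialize are standard DeepSpeed APIs; the way the use_hpu flag is declared and consumed here is an assumption for illustration (the fork may handle it differently), and the model is a placeholder.

import argparse

import torch
import deepspeed

parser = argparse.ArgumentParser()
# Dedicated Gaudi flag mentioned above; how the fork consumes it is assumed here.
parser.add_argument("--use_hpu", action="store_true",
                    help="Run DeepSpeed on Habana Gaudi (HPU)")
# Adds the standard --deepspeed and --deepspeed_config arguments.
parser = deepspeed.add_config_arguments(parser)
args = parser.parse_args()

# Placeholder model purely for illustration.
model = torch.nn.Linear(128, 10)

# deepspeed.initialize builds the engine and optimizer from the JSON config
# passed on the command line via --deepspeed_config.
model_engine, optimizer, _, _ = deepspeed.initialize(
    args=args,
    model=model,
    model_parameters=model.parameters(),
)

With a configuration file such as the one above saved as ds_config.json (a hypothetical file name, as is train.py), the script could then be launched with the DeepSpeed runner, for example:

deepspeed train.py --deepspeed --deepspeed_config ds_config.json --use_hpu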

It is highly recommended to review one of the examples located under DeepSpeed Examples.