Getting Started with DeepSpeed

This guide provides simple steps for preparing a DeepSpeed model to run on Gaudi. Make sure to install the DeepSpeed package provided by Habana. Installing public DeepSpeed packages is not supported.

To set up the environment, refer to the Installation Guide. The supported DeepSpeed versions are listed in the Support Matrix.

Start Training a DeepSpeed Model on Gaudi

  1. Run the Habana Docker image:

docker run -it --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --net=host --ipc=host vault.habana.ai/gaudi-docker/1.11.0/ubuntu20.04/habanalabs/pytorch-installer-2.0.1:latest
  1. Install Habana’s DeepSpeed fork:

pip install git+https://github.com/HabanaAI/DeepSpeed.git@1.11.0
  1. Clone the Model References repository inside the container that you have just started:

git clone https://github.com/HabanaAI/Model-References.git
  1. Move to the subdirectory containing the cifar_example:

cd Model-References/PyTorch/examples/DeepSpeed/cifar_example/

Note

The model defined in the cifar10_deepspeed.py script is a simple CNN-based model that loads the CIFAR-10 dataset automatically.

  1. Install the associated requirements:

pip install -r requirements.txt
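  1. Update PYTHONPATH to include the Model-References repository and set PYTHON to the Python executable. If the relative path below does not resolve from your current working directory, use the absolute path to the cloned repository instead: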

export PYTHONPATH=$PYTHONPATH:Model-References
export PYTHON=/usr/bin/python3.8
  1. Execute the run_ds_habanax8.sh script. If you are running on a single Gaudi, modify the script to set --num_gpus=1.

deepspeed --num_nodes=1 --num_gpus=8 cifar10_deepspeed.py --deepspeed --deepspeed_config ds_config.json --use_hpu

The following should appear as part of the output:

[10,  2000] loss: 0.776
[10,  2000] loss: 0.760
[10,  2000] loss: 0.747
[10,  2000] loss: 0.753
[10,  2000] loss: 0.759
[10,  2000] loss: 0.776
[10,  2000] loss: 0.772
[10,  2000] loss: 0.776
Finished Training
GroundTruth:    cat  ship   ship plane
Predicted:      cat  ship   ship plane
Accuracy of the network on the 10000 test images: 59 %
Accuracy of  ship : 70 %
Accuracy of truck : 57 %
[2022-10-28 17:17:55,740] [INFO] [launch.py:212:main] Process 815 exits successfully.
[2022-10-28 17:17:55,741] [INFO] [launch.py:212:main] Process 818 exits successfully.
[2022-10-28 17:17:55,741] [INFO] [launch.py:212:main] Process 820 exits successfully.
[2022-10-28 17:17:55,741] [INFO] [launch.py:212:main] Process 814 exits successfully.
[2022-10-28 17:17:55,741] [INFO] [launch.py:212:main] Process 817 exits successfully.
[2022-10-28 17:17:55,741] [INFO] [launch.py:212:main] Process 819 exits successfully.
[2022-10-28 17:17:55,741] [INFO] [launch.py:212:main] Process 816 exits successfully.
[2022-10-28 17:17:56,742] [INFO] [launch.py:212:main] Process 813 exits successfully.
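
To check a run without reading the log by eye, the output can be captured to a file and scanned for the lines shown above. The following is a minimal sketch, assuming the output was saved to a file (for example by appending `2>&1 | tee train.log` to the launch command). The file name train.log and the helper names count_clean_exits and overall_accuracy are hypothetical and are not part of the example repository.

```shell
#!/usr/bin/env bash
# Sketch: scan a saved training log for the success markers shown above.
# Assumption: the log was captured to a file; "train.log" is a hypothetical name.

# Count launcher lines that report a clean exit (one per worker process).
count_clean_exits() {
    grep -c 'exits successfully' "$1"
}

# Print the overall test accuracy as a bare number, e.g. 59.
overall_accuracy() {
    sed -n 's/^Accuracy of the network on the [0-9]* test images: *\([0-9]*\) *%.*$/\1/p' "$1"
}

LOG="${1:-train.log}"
if [ -f "$LOG" ]; then
    echo "clean exits: $(count_clean_exits "$LOG")"
    echo "accuracy:    $(overall_accuracy "$LOG")"
fi
```

For the eight-device run shown above, the clean-exit count would be expected to equal 8, one per launched process. The accuracy value varies between runs.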

To start training your own DeepSpeed models on Gaudi, refer to the DeepSpeed User Guide for Training.
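
Returning to the launch step of the walkthrough: the eight-device and single-device runs differ only in the value passed to --num_gpus. The sketch below shows one way to make that count a parameter instead of editing the command by hand. The NUM_DEVICES and RUN variables and the build_launch_cmd function are hypothetical names used for illustration, and this is not the content of run_ds_habanax8.sh.

```shell
#!/usr/bin/env bash
# Sketch: build the launch command from the walkthrough with a configurable
# device count. NUM_DEVICES, RUN and build_launch_cmd are hypothetical names.

build_launch_cmd() {
    # Default to 8 devices, matching the command shown in the walkthrough.
    local num_devices="${1:-8}"
    echo "deepspeed --num_nodes=1 --num_gpus=${num_devices} cifar10_deepspeed.py --deepspeed --deepspeed_config ds_config.json --use_hpu"
}

# Print the command for inspection; it is executed only when RUN=1 is set.
cmd="$(build_launch_cmd "${NUM_DEVICES:-8}")"
echo "$cmd"
if [ "${RUN:-0}" = "1" ]; then
    eval "$cmd"
fi
```

Saved under a name of your choice inside the cifar_example directory, the sketch could be invoked as `NUM_DEVICES=1 RUN=1 bash <script>` for a single Gaudi. Without RUN=1 it only prints the command, which makes it easy to review before launching.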