Getting Started with DeepSpeed

This guide provides simple steps for preparing a DeepSpeed model to run on the Intel® Gaudi® AI accelerator. Make sure to install the DeepSpeed package provided by Intel Gaudi; installing public DeepSpeed packages is not supported.

To set up the environment, refer to the Installation Guide and On-Premise System Update. The supported DeepSpeed versions are listed in the Support Matrix.
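Once a DeepSpeed package is installed in the container, the version can be read back and compared by hand against the Support Matrix. The following is a minimal sketch, not an official check: the helper name `installed_deepspeed_version` is hypothetical, and it assumes the Intel Gaudi fork is published under the distribution name `deepspeed`.

```python
from importlib import metadata
from typing import Optional


def installed_deepspeed_version() -> Optional[str]:
    """Return the installed DeepSpeed version string, or None if it is absent.

    Assumption: the Intel Gaudi fork installs under the distribution
    name "deepspeed", the same name as the public package.
    """
    try:
        return metadata.version("deepspeed")
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_deepspeed_version()
    if version is None:
        print("DeepSpeed is not installed in this environment.")
    else:
        # Compare this string manually with the Support Matrix entry.
        print(f"Installed DeepSpeed version: {version}")
```

The sketch only reports what is installed; it cannot tell the fork apart from a public package, so the version string still needs to be checked against the Support Matrix.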

Start Training a DeepSpeed Model on Gaudi

  1. Run the Intel Gaudi Docker image:

docker run -it --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --net=host --ipc=host vault.habana.ai/gaudi-docker/1.15.1/ubuntu22.04/habanalabs/pytorch-installer-2.2.0:latest
  1. Install the Intel Gaudi DeepSpeed fork:

pip install git+https://github.com/HabanaAI/DeepSpeed.git@1.15.1
  1. Clone the Model References repository inside the container that you have just started:

git clone https://github.com/HabanaAI/Model-References.git
  1. Change to the subdirectory containing the cifar_example:

cd Model-References/PyTorch/examples/DeepSpeed/cifar_example/

Note

The model defined in the cifar10_deepspeed.py script is a simple CNN-based model. The script loads the CIFAR-10 dataset automatically.

  1. Install the associated requirements:

pip install -r requirements.txt
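  1. Update PYTHONPATH to include the Model-References repository and set PYTHON to the Python executable. The Python version depends on the operating system of the image, so adjust the interpreter path accordingly: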

export PYTHONPATH=$PYTHONPATH:Model-References
export PYTHON=/usr/bin/python3.10
  1. Execute the run_ds_habanax8.sh script. If you are running on a single Gaudi, modify the script to set --num_gpus=1.

deepspeed --num_nodes=1 --num_gpus=8 cifar10_deepspeed.py --deepspeed --deepspeed_config ds_config.json --use_hpu
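The device count is the only part of the launch command that changes between an 8-card and a single-card run. As a rough illustration, the command can be assembled programmatically; the helper `build_launch_command` and its `num_devices` parameter are hypothetical names introduced here, and the flags are copied from the command above.

```python
import shlex
from typing import List


def build_launch_command(num_devices: int = 8) -> List[str]:
    """Assemble the launcher invocation for `num_devices` Gaudi cards on one node."""
    if num_devices < 1:
        raise ValueError("num_devices must be at least 1")
    return [
        "deepspeed",
        "--num_nodes=1",
        f"--num_gpus={num_devices}",  # set to 1 when running on a single Gaudi
        "cifar10_deepspeed.py",
        "--deepspeed",
        "--deepspeed_config",
        "ds_config.json",
        "--use_hpu",
    ]


if __name__ == "__main__":
    # Print the single-card variant as a shell-ready string.
    print(shlex.join(build_launch_command(1)))
```

The returned list can be passed to `subprocess.run` from inside the cifar_example directory, or printed and pasted into the shell.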

The following should appear as part of the output:

[10,  2000] loss: 0.776
[10,  2000] loss: 0.760
[10,  2000] loss: 0.747
[10,  2000] loss: 0.753
[10,  2000] loss: 0.759
[10,  2000] loss: 0.776
[10,  2000] loss: 0.772
[10,  2000] loss: 0.776
Finished Training
GroundTruth:    cat  ship   ship plane
Predicted:      cat  ship   ship plane
Accuracy of the network on the 10000 test images: 59 %
Accuracy of  ship : 70 %
Accuracy of truck : 57 %
[2022-10-28 17:17:55,740] [INFO] [launch.py:212:main] Process 815 exits successfully.
[2022-10-28 17:17:55,741] [INFO] [launch.py:212:main] Process 818 exits successfully.
[2022-10-28 17:17:55,741] [INFO] [launch.py:212:main] Process 820 exits successfully.
[2022-10-28 17:17:55,741] [INFO] [launch.py:212:main] Process 814 exits successfully.
[2022-10-28 17:17:55,741] [INFO] [launch.py:212:main] Process 817 exits successfully.
[2022-10-28 17:17:55,741] [INFO] [launch.py:212:main] Process 819 exits successfully.
[2022-10-28 17:17:55,741] [INFO] [launch.py:212:main] Process 816 exits successfully.
[2022-10-28 17:17:56,742] [INFO] [launch.py:212:main] Process 813 exits successfully.
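To confirm a run programmatically rather than by eye, the captured output can be scanned for the loss lines and the "Finished Training" marker. This is a minimal sketch: the helper names are hypothetical, and reading the bracketed pair as (epoch, mini-batch index) is an assumption based on the layout of the lines, not something this guide states.

```python
import re
from typing import List, Tuple

# Matches lines such as "[10,  2000] loss: 0.776".
LOSS_LINE = re.compile(r"^\[(\d+),\s*(\d+)\]\s+loss:\s+([0-9.]+)$")


def parse_loss_lines(log_text: str) -> List[Tuple[int, int, float]]:
    """Extract (first index, second index, loss) from every loss line in the log."""
    records = []
    for line in log_text.splitlines():
        match = LOSS_LINE.match(line.strip())
        if match:
            first, second, loss = match.groups()
            records.append((int(first), int(second), float(loss)))
    return records


def training_finished(log_text: str) -> bool:
    """Report whether the log contains the 'Finished Training' marker line."""
    return any(line.strip() == "Finished Training" for line in log_text.splitlines())


if __name__ == "__main__":
    sample = "[10,  2000] loss: 0.776\n[10,  2000] loss: 0.760\nFinished Training\n"
    print(parse_loss_lines(sample))
    print(training_finished(sample))
```

In the 8-card output above, eight loss lines share the same bracketed pair, which is consistent with each launched process printing its own value.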

To start training your own DeepSpeed models on Gaudi, refer to the DeepSpeed User Guide for Training.
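When writing a configuration file for your own model, a common early failure is a batch-size mismatch: DeepSpeed expects train_batch_size to equal train_micro_batch_size_per_gpu multiplied by gradient_accumulation_steps and the number of devices. The sketch below checks that relation before launching. The helper name `check_batch_sizes` is hypothetical, and the example values are illustrative only; they are not taken from the ds_config.json shipped with the cifar_example.

```python
from typing import Any, Dict


def check_batch_sizes(config: Dict[str, Any], world_size: int) -> bool:
    """Return True if the batch-size fields agree for `world_size` devices."""
    expected = (
        config["train_micro_batch_size_per_gpu"]
        * config.get("gradient_accumulation_steps", 1)
        * world_size
    )
    return config["train_batch_size"] == expected


# Illustrative values only (assumption), not the contents of the example's config.
example_config = {
    "train_batch_size": 32,
    "train_micro_batch_size_per_gpu": 4,
    "gradient_accumulation_steps": 1,
    "steps_per_print": 2000,
    "bf16": {"enabled": True},
}

if __name__ == "__main__":
    for devices in (8, 1):
        ok = check_batch_sizes(example_config, world_size=devices)
        print(f"{devices} device(s): batch sizes consistent = {ok}")
```

With these values the configuration is consistent for 8 devices but not for 1, so switching to a single Gaudi would also require adjusting one of the batch-size fields.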