Getting Started with DeepSpeed
This guide provides simple steps for preparing a DeepSpeed model to run on the Intel® Gaudi® AI accelerator. Make sure to install the DeepSpeed package provided by Intel Gaudi. Installing public DeepSpeed packages is not supported.
To set up the environment, refer to the Installation Guide and On-Premise System Update. The supported DeepSpeed versions are listed in the Support Matrix.
Start Training a DeepSpeed Model on Gaudi
Run the Intel Gaudi Docker image:
docker run -it --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --net=host --ipc=host vault.habana.ai/gaudi-docker/1.15.1/ubuntu22.04/habanalabs/pytorch-installer-2.2.0:latest
Install the Intel Gaudi DeepSpeed fork:
pip install git+https://github.com/HabanaAI/DeepSpeed.git@1.15.1
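Because public DeepSpeed packages are not supported, it can be worth confirming where the installed package actually came from. The following is a minimal sketch, not part of the Intel Gaudi tooling: it assumes pip wrote the standard PEP 610 direct_url.json record, which pip does for direct URL and VCS installs such as the git install above, and reads it through Python's importlib.metadata. The helper names parse_direct_url and install_info are hypothetical.

```python
import json
from importlib import metadata


def parse_direct_url(text):
    """Return (url, requested_revision) from a PEP 610 direct_url.json payload.

    (None, None) means no record was found, which is typical for packages
    installed from an index such as PyPI.
    """
    if not text:
        return None, None
    data = json.loads(text)
    revision = data.get("vcs_info", {}).get("requested_revision")
    return data.get("url"), revision


def install_info(dist_name="deepspeed"):
    """Return (version, url, revision) for an installed distribution, or None."""
    try:
        dist = metadata.distribution(dist_name)
    except metadata.PackageNotFoundError:
        return None
    # read_text returns None when the distribution has no direct_url.json.
    url, revision = parse_direct_url(dist.read_text("direct_url.json"))
    return dist.version, url, revision


if __name__ == "__main__":
    info = install_info()
    if info is None:
        print("deepspeed is not installed")
    else:
        version, url, revision = info
        print(f"deepspeed {version} from {url or 'a package index'}, revision {revision}")
        if not url or "HabanaAI/DeepSpeed" not in url:
            print("warning: this does not look like the Intel Gaudi DeepSpeed fork")
```

Run it with python3 inside the container. A package that came from an index reports no URL, which is the case the warning is meant to catch.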
Clone the Model-References repository inside the container you have just started:
git clone https://github.com/HabanaAI/Model-References.git
Move to the cifar_example subdirectory:
cd Model-References/PyTorch/examples/DeepSpeed/cifar_example/
Note
The model defined in the cifar10_deepspeed.py script is a simple CNN-based model that loads the CIFAR-10 dataset automatically.
Install the associated requirements:
pip install -r requirements.txt
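Update PYTHONPATH to include the Model-References repository, and set PYTHON to the Python executable. Replace /path/to/Model-References with the absolute path of your clone, because a relative entry is resolved against the directory from which Python is started (here, cifar_example). The Ubuntu 22.04 image used above ships Python 3.10:
export PYTHONPATH=$PYTHONPATH:/path/to/Model-References
export PYTHON=/usr/bin/python3.10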
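PYTHONPATH entries that are not absolute are resolved against the directory from which Python starts, so an entry that looks correct from one directory may point nowhere once you have moved into cifar_example. The minimal sketch below uses a hypothetical check_pythonpath helper to resolve each entry the same way and report whether the directory exists.

```python
import os


def check_pythonpath(value, cwd):
    """Resolve each PYTHONPATH entry against cwd; return (entry, resolved, exists) tuples.

    Empty entries (for example from a leading separator) are skipped here.
    """
    report = []
    for entry in value.split(os.pathsep):
        if not entry:
            continue
        # os.path.join returns `entry` unchanged when it is already absolute.
        resolved = os.path.abspath(os.path.join(cwd, entry))
        report.append((entry, resolved, os.path.isdir(resolved)))
    return report


if __name__ == "__main__":
    pythonpath = os.environ.get("PYTHONPATH", "")
    for entry, resolved, exists in check_pythonpath(pythonpath, os.getcwd()):
        status = "ok" if exists else "MISSING"
        print(f"{status:8} {entry} -> {resolved}")
```

Running it from the cifar_example directory with $PYTHON shows at a glance whether the Model-References entry resolves to the clone.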
Execute the run_ds_habanax8.sh script. If you are running on a single Gaudi, modify the script to set --num_gpus=1.
deepspeed --num_nodes=1 --num_gpus=8 cifar10_deepspeed.py --deepspeed --deepspeed_config ds_config.json --use_hpu
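The ds_config.json file passed with --deepspeed_config holds the DeepSpeed settings. Its exact contents for this example are not reproduced here, so the sketch below builds a hypothetical config with illustrative values to show one rule that matters when changing --num_gpus: DeepSpeed requires train_batch_size to equal the product of train_micro_batch_size_per_gpu, gradient_accumulation_steps, and the number of devices. If the shipped ds_config.json sets all three values explicitly for eight devices, they may need adjusting for a single Gaudi. The helpers make_ds_config and batch_sizes_consistent are hypothetical.

```python
import json


def make_ds_config(micro_batch_per_device, grad_accum_steps, world_size, lr=1e-3):
    """Build a minimal, hypothetical DeepSpeed config with consistent batch sizes."""
    return {
        # DeepSpeed checks: train_batch_size ==
        #   train_micro_batch_size_per_gpu * gradient_accumulation_steps * world_size
        "train_batch_size": micro_batch_per_device * grad_accum_steps * world_size,
        "train_micro_batch_size_per_gpu": micro_batch_per_device,
        "gradient_accumulation_steps": grad_accum_steps,
        "steps_per_print": 2000,
        "optimizer": {"type": "Adam", "params": {"lr": lr}},
    }


def batch_sizes_consistent(config, world_size):
    """Return True if the three batch-size fields agree for world_size devices."""
    expected = (
        config["train_micro_batch_size_per_gpu"]
        * config["gradient_accumulation_steps"]
        * world_size
    )
    return config["train_batch_size"] == expected


if __name__ == "__main__":
    config = make_ds_config(micro_batch_per_device=4, grad_accum_steps=1, world_size=8)
    print(json.dumps(config, indent=2))
    print("consistent for 8 devices:", batch_sizes_consistent(config, 8))  # True
    print("consistent for 1 device:", batch_sizes_consistent(config, 1))   # False
```

The values above are placeholders rather than the ones shipped with the example; the point is that a config built for eight devices fails the batch-size check when the device count drops to one.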
Output similar to the following should appear (loss and accuracy values may vary from run to run):
[10, 2000] loss: 0.776
[10, 2000] loss: 0.760
[10, 2000] loss: 0.747
[10, 2000] loss: 0.753
[10, 2000] loss: 0.759
[10, 2000] loss: 0.776
[10, 2000] loss: 0.772
[10, 2000] loss: 0.776
Finished Training
GroundTruth: cat ship ship plane
Predicted: cat ship ship plane
Accuracy of the network on the 10000 test images: 59 %
Accuracy of ship : 70 %
Accuracy of truck : 57 %
[2022-10-28 17:17:55,740] [INFO] [launch.py:212:main] Process 815 exits successfully.
[2022-10-28 17:17:55,741] [INFO] [launch.py:212:main] Process 818 exits successfully.
[2022-10-28 17:17:55,741] [INFO] [launch.py:212:main] Process 820 exits successfully.
[2022-10-28 17:17:55,741] [INFO] [launch.py:212:main] Process 814 exits successfully.
[2022-10-28 17:17:55,741] [INFO] [launch.py:212:main] Process 817 exits successfully.
[2022-10-28 17:17:55,741] [INFO] [launch.py:212:main] Process 819 exits successfully.
[2022-10-28 17:17:55,741] [INFO] [launch.py:212:main] Process 816 exits successfully.
[2022-10-28 17:17:56,742] [INFO] [launch.py:212:main] Process 813 exits successfully.
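To check a run programmatically rather than by eye, the log can be summarized with a short script. This is a minimal sketch: the regular expressions are derived only from the sample output above, and it assumes each worker process prints its own loss line, as the eight identical [10, 2000] entries suggest. summarize_log is a hypothetical helper, not part of the example.

```python
import re

# Patterns derived from the sample output only.
LOSS_RE = re.compile(r"^\[(\d+),\s*(\d+)\]\s+loss:\s+([0-9.]+)")
ACC_RE = re.compile(r"Accuracy of the network on the \d+ test images:\s*(\d+)\s*%")
EXIT_RE = re.compile(r"Process (\d+) exits successfully")


def summarize_log(text):
    """Summarize the training log of the cifar_example run."""
    losses = []  # (epoch, step, loss) tuples
    exited = set()  # PIDs reported as exiting successfully
    accuracy = None
    for line in text.splitlines():
        line = line.strip()
        m = LOSS_RE.match(line)
        if m:
            losses.append((int(m.group(1)), int(m.group(2)), float(m.group(3))))
            continue
        m = ACC_RE.search(line)
        if m:
            accuracy = int(m.group(1))
            continue
        m = EXIT_RE.search(line)
        if m:
            exited.add(int(m.group(1)))
    # Average the loss lines reported for the last (epoch, step) pair.
    last = max((epoch, step) for epoch, step, _ in losses) if losses else None
    final = [loss for epoch, step, loss in losses if (epoch, step) == last]
    return {
        "final_step": last,
        "mean_final_loss": sum(final) / len(final) if final else None,
        "accuracy_pct": accuracy,
        "processes_exited_ok": len(exited),
        "finished": "Finished Training" in text,
    }


if __name__ == "__main__":
    import sys

    # Summarize each log file given on the command line.
    for path in sys.argv[1:]:
        with open(path) as f:
            print(path, summarize_log(f.read()))
```

For example, capture the training output with tee (for instance into train.log) and pass that file to the script; processes_exited_ok should match the value given to --num_gpus.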
To start training your own DeepSpeed models on Gaudi, refer to the DeepSpeed User Guide for Training.