Getting Started with DeepSpeed
This guide provides simple steps for preparing a DeepSpeed model to run on the Intel® Gaudi® AI accelerator. To set up the environment, refer to the Installation Guide and On-Premise System Update. Make sure to install the DeepSpeed package provided by Intel Gaudi as listed in the Support Matrix.
Note
Installing public DeepSpeed packages is not supported.
DeepSpeed is not compatible with lightning-habana 1.6.0.
Start Training a DeepSpeed Model on Gaudi
This example uses the model defined in the cifar10_deepspeed.py script, a simple CNN-based model that loads the CIFAR-10 dataset automatically.
Run the Intel Gaudi Docker image:
docker run -it --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --net=host --ipc=host vault.habana.ai/gaudi-docker/1.18.0/ubuntu22.04/habanalabs/pytorch-installer-2.4.0:latest
Install the Intel Gaudi DeepSpeed fork:
pip install git+https://github.com/HabanaAI/DeepSpeed.git@1.18.0
Clone the Model References repository inside the container that you have just started:
git clone https://github.com/HabanaAI/Model-References.git
Move to the subdirectory containing the cifar_example:
cd Model-References/PyTorch/examples/DeepSpeed/cifar_example/
Install the associated requirements:
pip install -r requirements.txt
Update PYTHONPATH to include the Model-References repository and set PYTHON to the Python executable:
export PYTHONPATH=$PYTHONPATH:Model-References
export PYTHON=/usr/bin/python3.10
Note
The Python version depends on the operating system. Refer to the Support Matrix for a full list of supported operating systems and Python versions.
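Because the interpreter version varies by operating system, one option is to resolve the executable at run time instead of hardcoding a path. The following is a minimal sketch, not part of the official procedure. It assumes that the first python3 found on PATH is the interpreter listed for your OS in the Support Matrix, which you should confirm before relying on it.

```shell
# Sketch: resolve the Python executable dynamically instead of hardcoding a version.
# Assumption: the first python3 on PATH is the interpreter supported on this OS.
if PYTHON=$(command -v python3); then
    export PYTHON
    echo "PYTHON=$PYTHON"
    "$PYTHON" --version
else
    # Nothing found: leave the choice to the user.
    echo "python3 not found on PATH; set PYTHON manually" >&2
fi
```

If the printed version does not match the Support Matrix, set PYTHON explicitly as shown in the step above.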
Execute the run_ds_habanax8.sh script. If you are running on a single Gaudi, modify the script to set --num_gpus=1.
deepspeed --num_nodes=1 --num_gpus=8 cifar10_deepspeed.py --deepspeed --deepspeed_config ds_config.json
[10, 2000] loss: 0.776
[10, 2000] loss: 0.760
[10, 2000] loss: 0.747
[10, 2000] loss: 0.753
[10, 2000] loss: 0.759
[10, 2000] loss: 0.776
[10, 2000] loss: 0.772
[10, 2000] loss: 0.776
Finished Training
GroundTruth: cat ship ship plane
Predicted: cat ship ship plane
Accuracy of the network on the 10000 test images: 59 %
Accuracy of ship : 70 %
Accuracy of truck : 57 %
[2022-10-28 17:17:55,740] [INFO] [launch.py:212:main] Process 815 exits successfully.
[2022-10-28 17:17:55,741] [INFO] [launch.py:212:main] Process 818 exits successfully.
[2022-10-28 17:17:55,741] [INFO] [launch.py:212:main] Process 820 exits successfully.
[2022-10-28 17:17:55,741] [INFO] [launch.py:212:main] Process 814 exits successfully.
[2022-10-28 17:17:55,741] [INFO] [launch.py:212:main] Process 817 exits successfully.
[2022-10-28 17:17:55,741] [INFO] [launch.py:212:main] Process 819 exits successfully.
[2022-10-28 17:17:55,741] [INFO] [launch.py:212:main] Process 816 exits successfully.
[2022-10-28 17:17:56,742] [INFO] [launch.py:212:main] Process 813 exits successfully.
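For the single-Gaudi case mentioned in the last step, the device count can be changed with a text substitution rather than by hand. The sketch below is illustrative only. It assumes the launch line contains a literal --num_gpus=8 flag, as in the command shown above, and it operates on a scratch stand-in file so that it runs anywhere. The output name run_ds_habanax1.sh is hypothetical and does not exist in the repository.

```shell
# Sketch: switch a DeepSpeed launch line from 8 devices to 1.
# Assumption: the launch line carries a literal --num_gpus=<N> flag.
set -eu

workdir=$(mktemp -d)
script="$workdir/run_ds_habanax8.sh"   # scratch stand-in for the real script

# Stand-in content: the launch command shown in this guide.
cat > "$script" <<'EOF'
deepspeed --num_nodes=1 --num_gpus=8 cifar10_deepspeed.py --deepspeed --deepspeed_config ds_config.json
EOF

# Rewrite the device count into a new file (hypothetical name), keeping the original.
sed 's/--num_gpus=[0-9][0-9]*/--num_gpus=1/' "$script" > "$workdir/run_ds_habanax1.sh"

# Show the modified launch line.
cat "$workdir/run_ds_habanax1.sh"
```

Writing to a separate file keeps the 8-device script intact, which makes it easy to switch back. To apply the same substitution to the real script, run the sed command from the cifar_example directory after checking the script's actual contents.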
To start training your own DeepSpeed models on Gaudi, refer to the DeepSpeed User Guide for Training.