Getting Started with DeepSpeed
This guide provides simple steps for preparing a DeepSpeed model to run on the Intel® Gaudi® AI accelerator. To set up the environment, refer to the Installation Guide and On-Premise System Update. Make sure to install the DeepSpeed package provided by Intel Gaudi as listed in the Support Matrix.
Note
Installing public DeepSpeed packages is not supported.
Start Training a DeepSpeed Model on Gaudi
This example uses the model defined in the cifar10_deepspeed.py script. It is a simple CNN-based model that loads the CIFAR-10 dataset automatically.
Run the Intel Gaudi Docker image:
docker run -it --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --net=host --ipc=host vault.habana.ai/gaudi-docker/1.18.0/ubuntu22.04/habanalabs/pytorch-installer-2.4.0:latest
Install the Intel Gaudi DeepSpeed fork:
pip install git+https://github.com/HabanaAI/DeepSpeed.git@1.18.0
Clone the Model References repository inside the container that you have just started:
git clone https://github.com/HabanaAI/Model-References.git
Move to the subdirectory containing the cifar_example:
cd Model-References/PyTorch/examples/DeepSpeed/cifar_example/
Install the associated requirements:
pip install -r requirements.txt
Update PYTHONPATH to include the Model-References repository and set PYTHON to the Python executable:
export PYTHONPATH=$PYTHONPATH:Model-References
export PYTHON=/usr/bin/python3.10
Note
The Python version depends on the operating system. Refer to the Support Matrix for a full list of supported operating systems and Python versions.
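Because the interpreter path differs between operating systems, it can help to ask the interpreter itself where it lives before setting PYTHON. The following is a minimal sketch, not part of the Intel Gaudi tooling; python_export_line is a hypothetical helper name introduced here for illustration.

```python
import sys


def python_export_line(executable: str = sys.executable) -> str:
    """Build the shell line that sets PYTHON to the given interpreter path."""
    return f"export PYTHON={executable}"


# Report the running interpreter's version and the matching export line.
print(f"Python {sys.version_info.major}.{sys.version_info.minor}")
print(python_export_line())
```

Run it with the interpreter you intend to use inside the container, then paste the printed export line into the shell.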
Execute the run_ds_habanax8.sh script. If you are running on a single Gaudi, modify the script to set --num_gpus=1.
deepspeed --num_nodes=1 --num_gpus=8 cifar10_deepspeed.py --deepspeed --deepspeed_config ds_config.json
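For a single Gaudi, the only change this step calls for is the --num_gpus value. The following is a minimal sketch of that edit as a text substitution, assuming the script contains the launcher line shown above. set_num_gpus is a hypothetical helper and is not part of DeepSpeed or Model-References.

```python
import re


def set_num_gpus(script_text: str, num_gpus: int) -> str:
    """Return script_text with every --num_gpus=<n> value replaced."""
    return re.sub(r"--num_gpus=\d+", f"--num_gpus={num_gpus}", script_text)


# The launcher line from this guide, used here as sample input.
launch_line = (
    "deepspeed --num_nodes=1 --num_gpus=8 cifar10_deepspeed.py "
    "--deepspeed --deepspeed_config ds_config.json"
)

single_card_line = set_num_gpus(launch_line, 1)
print(single_card_line)
```

The same function could be applied to the full text of run_ds_habanax8.sh and the result written to a copy, which leaves the original eight-device script untouched.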
The end of the training output looks similar to the following:
[10, 2000] loss: 0.776
[10, 2000] loss: 0.760
[10, 2000] loss: 0.747
[10, 2000] loss: 0.753
[10, 2000] loss: 0.759
[10, 2000] loss: 0.776
[10, 2000] loss: 0.772
[10, 2000] loss: 0.776
Finished Training
GroundTruth: cat ship ship plane
Predicted: cat ship ship plane
Accuracy of the network on the 10000 test images: 59 %
Accuracy of ship : 70 %
Accuracy of truck : 57 %
[2022-10-28 17:17:55,740] [INFO] [launch.py:212:main] Process 815 exits successfully.
[2022-10-28 17:17:55,741] [INFO] [launch.py:212:main] Process 818 exits successfully.
[2022-10-28 17:17:55,741] [INFO] [launch.py:212:main] Process 820 exits successfully.
[2022-10-28 17:17:55,741] [INFO] [launch.py:212:main] Process 814 exits successfully.
[2022-10-28 17:17:55,741] [INFO] [launch.py:212:main] Process 817 exits successfully.
[2022-10-28 17:17:55,741] [INFO] [launch.py:212:main] Process 819 exits successfully.
[2022-10-28 17:17:55,741] [INFO] [launch.py:212:main] Process 816 exits successfully.
[2022-10-28 17:17:56,742] [INFO] [launch.py:212:main] Process 813 exits successfully.
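To check a run without reading the log by eye, the lines above can be summarized with a few regular expressions. This is a rough sketch that assumes the log keeps the exact line formats shown in this guide. summarize_log and its patterns are hypothetical and are not part of DeepSpeed.

```python
import re

# Patterns match the line formats shown in the sample output of this guide.
LOSS_RE = re.compile(r"\[(\d+),\s*(\d+)\] loss: ([0-9.]+)")
ACCURACY_RE = re.compile(
    r"Accuracy of the network on the (\d+) test images: (\d+) %"
)
EXIT_RE = re.compile(r"Process \d+ exits successfully")


def summarize_log(log_text: str) -> dict:
    """Collect reported losses, overall accuracy, and clean process exits."""
    accuracy = ACCURACY_RE.search(log_text)
    return {
        "losses": [float(m.group(3)) for m in LOSS_RE.finditer(log_text)],
        "accuracy_percent": int(accuracy.group(2)) if accuracy else None,
        "successful_exits": len(EXIT_RE.findall(log_text)),
    }


# A shortened excerpt of the output shown in this guide.
sample_log = """[10, 2000] loss: 0.776
[10, 2000] loss: 0.760
Finished Training
Accuracy of the network on the 10000 test images: 59 %
[2022-10-28 17:17:55,740] [INFO] [launch.py:212:main] Process 815 exits successfully.
[2022-10-28 17:17:55,741] [INFO] [launch.py:212:main] Process 818 exits successfully.
"""

summary = summarize_log(sample_log)
print(summary)
```

On an eight-device run, a successful_exits count of 8 would indicate that every launched process finished cleanly.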
To start training your own DeepSpeed models on Gaudi, refer to the DeepSpeed User Guide for Training.