Getting Started with DeepSpeed

This guide provides simple steps for preparing a DeepSpeed model to run on the Intel® Gaudi® AI accelerator. To set up the environment, refer to the Installation Guide and On-Premise System Update. Make sure to install the DeepSpeed package provided by Intel Gaudi, as listed in the Support Matrix.

Note

  • Installing public DeepSpeed packages is not supported.

  • DeepSpeed is not compatible with lightning-habana 1.6.0.
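Because only the Intel Gaudi DeepSpeed package is supported, it can help to check what is installed before training. The following is a minimal sketch using only the Python standard library; the helper name `installed_version` is illustrative and not part of DeepSpeed. It reports the version recorded in the package metadata, which still has to be compared manually against the Support Matrix, since the version string alone may not identify the fork.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package="deepspeed"):
    """Return the installed version string of `package`, or None if it is absent.

    `installed_version` is a hypothetical helper written for this guide.
    """
    try:
        return version(package)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version()
    if found is None:
        print("deepspeed is not installed")
    else:
        # Compare this value with the release listed in the Support Matrix.
        print(f"deepspeed {found} is installed")
```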

Start Training a DeepSpeed Model on Gaudi

This example uses the model defined in the cifar10_deepspeed.py script. It is a simple CNN-based model that loads the CIFAR-10 dataset automatically.

  1. Run the Intel Gaudi Docker image:

    docker run -it --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --net=host --ipc=host vault.habana.ai/gaudi-docker/1.18.0/ubuntu22.04/habanalabs/pytorch-installer-2.4.0:latest
    
  2. Install the Intel Gaudi DeepSpeed fork:

    pip install git+https://github.com/HabanaAI/DeepSpeed.git@1.18.0
    
  3. Clone the Model References repository inside the container that you have just started:

    git clone https://github.com/HabanaAI/Model-References.git
    
  4. Change to the cifar_example subdirectory:

    cd Model-References/PyTorch/examples/DeepSpeed/cifar_example/
    
  5. Install the associated requirements:

    pip install -r requirements.txt
    
  6. Update PYTHONPATH to include the Model-References repository and set PYTHON to the Python executable:

    export PYTHONPATH=$PYTHONPATH:Model-References
    export PYTHON=/usr/bin/python3.10
    

    Note

    The Python version depends on the operating system. Refer to the Support Matrix for a full list of supported operating systems and Python versions.

  7. Execute the run_ds_habanax8.sh script. To run on a single Gaudi, first modify the script to set --num_gpus=1.

    deepspeed --num_nodes=1 --num_gpus=8 cifar10_deepspeed.py --deepspeed --deepspeed_config ds_config.json
    
    [10,  2000] loss: 0.776
    [10,  2000] loss: 0.760
    [10,  2000] loss: 0.747
    [10,  2000] loss: 0.753
    [10,  2000] loss: 0.759
    [10,  2000] loss: 0.776
    [10,  2000] loss: 0.772
    [10,  2000] loss: 0.776
    Finished Training
    GroundTruth:    cat  ship   ship plane
    Predicted:      cat  ship   ship plane
    Accuracy of the network on the 10000 test images: 59 %
    Accuracy of  ship : 70 %
    Accuracy of truck : 57 %
    [2022-10-28 17:17:55,740] [INFO] [launch.py:212:main] Process 815 exits successfully.
    [2022-10-28 17:17:55,741] [INFO] [launch.py:212:main] Process 818 exits successfully.
    [2022-10-28 17:17:55,741] [INFO] [launch.py:212:main] Process 820 exits successfully.
    [2022-10-28 17:17:55,741] [INFO] [launch.py:212:main] Process 814 exits successfully.
    [2022-10-28 17:17:55,741] [INFO] [launch.py:212:main] Process 817 exits successfully.
    [2022-10-28 17:17:55,741] [INFO] [launch.py:212:main] Process 819 exits successfully.
    [2022-10-28 17:17:55,741] [INFO] [launch.py:212:main] Process 816 exits successfully.
    [2022-10-28 17:17:56,742] [INFO] [launch.py:212:main] Process 813 exits successfully.
    
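The launch command in step 7 differs between an 8-device run and a single-device run only in the value of --num_gpus. As a rough sketch, the command line can be assembled programmatically from the flags shown above; `build_launch_command` is a hypothetical helper for illustration and is not part of DeepSpeed or the Model References repository.

```python
import shlex


def build_launch_command(num_devices: int = 8) -> list[str]:
    """Return the argument list for a single-node launch of the CIFAR-10 example.

    Only flags that appear in step 7 are used; `num_devices` fills in --num_gpus.
    """
    if num_devices < 1:
        raise ValueError("num_devices must be at least 1")
    return [
        "deepspeed",
        "--num_nodes=1",
        f"--num_gpus={num_devices}",
        "cifar10_deepspeed.py",
        "--deepspeed",
        "--deepspeed_config",
        "ds_config.json",
    ]


if __name__ == "__main__":
    # Print the single-Gaudi variant as a shell-ready string.
    print(shlex.join(build_launch_command(1)))
```

The returned list can be passed to `subprocess.run` from inside the cifar_example directory, assuming the environment from steps 1 to 6 is already in place.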

To start training your own DeepSpeed models on Gaudi, refer to the DeepSpeed User Guide for Training.