PyTorch Model on AWS DL1 Instance Quick Start

This document provides quick steps to set up an Amazon EC2 DL1 instance and train a PyTorch model.

Prerequisites

This document assumes the following:

  1. You have an AWS account with permissions to create and manage EC2 instances.

  2. You have an SSH client on your local machine for connecting to the instance.

Create an EC2 Instance

Follow the step-by-step instructions below to launch a Habana EC2 instance.

Initiate Instance Launch

  1. Open the Amazon EC2 Launch Console.

  2. In Name and tags, enter a name for the instance. In this example, we chose habana-quick-start for the name.

  3. In Application and OS Images, search for Habana to obtain a list of Habana AMIs.


A list of Habana AMIs will appear. The naming convention for AMI names is: Deep Learning AMI Habana [TensorFlow, PyTorch] SynapseAI [Version].

  4. Locate the AMI you would like to use and click the Select button. In this example, we selected Deep Learning AMI Habana PyTorch 1.11.0 SynapseAI 1.5.0.

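If you prefer the command line, you can also look up the AMI ID with the AWS CLI. The following is a minimal sketch, assuming the AWS CLI is installed and configured; the name filter is based on the naming convention above:

# List Habana PyTorch AMIs together with their IDs.
aws ec2 describe-images --owners amazon \
    --filters "Name=name,Values=Deep Learning AMI Habana*PyTorch*" \
    --query "Images[].[ImageId,Name]" --output table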

Choose Instance Type

Choose the dl1.24xlarge instance type to run on Gaudi. This instance type provides 8 Gaudi accelerators, 96 vCPUs, and 768 GiB of memory.


Select Key Pair

From the dropdown menu, select the key pair to be used for accessing the instance over SSH, or click the Create new key pair button:
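
Alternatively, a key pair can be created from the command line. This is a minimal sketch, assuming the AWS CLI is configured, and reuses the habana-quick-start name from this example:

# Create a key pair and save the private key locally with restricted permissions.
aws ec2 create-key-pair --key-name habana-quick-start \
    --query "KeyMaterial" --output text > ~/.ssh/key_pair.pem
chmod 400 ~/.ssh/key_pair.pem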


Configure Network

Make sure to configure your network settings according to your setup. To keep this example simple, we chose a network open to the public. However, it is recommended that you set security group rules that allow access from known IP addresses only.
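
For example, a rule that allows SSH only from a single known address can be added with the AWS CLI. This is a sketch; the security group ID and IP address below are placeholders:

# Allow inbound SSH (port 22) only from one known IP address.
aws ec2 authorize-security-group-ingress \
    --group-id sg-0123456789abcdef0 \
    --protocol tcp --port 22 \
    --cidr 203.0.113.25/32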


Configure Storage

Choose the desired storage size for your EC2 instance. In this example, we used 500 GB.


Review

Verify your configuration is correct and select the Launch instance button.

You have now launched an EC2 Instance.
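
The same launch can also be scripted with the AWS CLI. The following is a minimal sketch mirroring the console choices above; the AMI ID, key pair name, and security group ID are placeholders to substitute with your own values:

# Launch a dl1.24xlarge instance from a Habana AMI with a 500 GB root volume.
aws ec2 run-instances \
    --image-id ami-0123456789abcdef0 \
    --instance-type dl1.24xlarge \
    --key-name habana-quick-start \
    --security-group-ids sg-0123456789abcdef0 \
    --block-device-mappings '[{"DeviceName":"/dev/sda1","Ebs":{"VolumeSize":500,"VolumeType":"gp3"}}]' \
    --tag-specifications 'ResourceType=instance,Tags=[{Key=Name,Value=habana-quick-start}]'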

Connect to your Instance

You can connect to the instance you launched using SSH. An example command is shown below:

ssh -i ~/.ssh/key_pair.pem ubuntu@PUBLIC_DNS

key_pair.pem -- the key produced in the Select Key Pair step above
ubuntu -- the username; for the Ubuntu AMI it is "ubuntu", and if using AML2 (Amazon Linux 2) it is "ec2-user"
PUBLIC_DNS -- found under the Public IPv4 DNS section in the instance details

For more details, please refer to the AWS guide Connect to your Linux instance using an SSH client.
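
Once connected, you can confirm that the Gaudi devices are visible. The Habana AMIs ship with the hl-smi utility (analogous to nvidia-smi); on a dl1.24xlarge you should see 8 devices listed:

# Show the status of the Gaudi accelerators on the instance.
hl-smi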

Launch the Example Model on Gaudi

  1. Clone the Model-References repository onto the instance you just connected to. The release branch should match the SynapseAI version installed on your AMI; this example uses the 1.6.0 branch:

git clone https://github.com/HabanaAI/Model-References.git -b 1.6.0
  2. Move to the subdirectory containing the hello_world example:

cd Model-References/PyTorch/examples/computer_vision/hello_world/
  3. Update PYTHONPATH to include the Model-References repository and set PYTHON to the python executable. Assuming the repository was cloned into your home directory (a quick sanity check follows this list):

export PYTHONPATH=$PYTHONPATH:$HOME/Model-References
export PYTHON=/usr/bin/python3.8
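
As a quick sanity check of the environment, you can verify that PyTorch and the Habana bridge module import cleanly. This is a minimal sketch, assuming the habana_frameworks package preinstalled on the Habana PyTorch AMI:

# Should print the PyTorch version without raising an ImportError.
$PYTHON -c "import torch; import habana_frameworks.torch.core; print(torch.__version__)"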

Launch Model Training on a Single Gaudi (HPU)

Run training on a single HPU in BF16 with Habana mixed precision (hmp) enabled:

$PYTHON mnist.py --batch-size=64 --epochs=1 --lr=1.0 \
      --gamma=0.7 --hpu --hmp \
      --hmp-bf16=ops_bf16_mnist.txt \
      --hmp-fp32=ops_fp32_mnist.txt \
      --use_lazy_mode

The output will look similar to the following:

hmp:verbose_mode  False
hmp:opt_level O1

Not using distributed mode
Train Epoch: 1 [0/60000.0 (0%)] Loss: 2.307337
Train Epoch: 1 [640/60000.0 (1%)]       Loss: 1.365518
......
Train Epoch: 1 [58880/60000.0 (98%)]    Loss: 0.002533
Train Epoch: 1 [59520/60000.0 (99%)]    Loss: 0.023411
......

Total test set: 10000, number of workers: 1
* Average Acc 98.490 Average loss 0.044
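
If you later adapt your own PyTorch script for Gaudi, the basic device-placement pattern is small. The sketch below is illustrative, not an excerpt from mnist.py; it assumes the habana_frameworks package preinstalled on the AMI:

$PYTHON - <<'EOF'
import torch
import habana_frameworks.torch.core as htcore  # Habana PyTorch bridge

device = torch.device("hpu")       # Gaudi devices are exposed as "hpu"
x = torch.randn(2, 3).to(device)   # move a tensor onto the Gaudi device
y = (x * 2.0).sum()
htcore.mark_step()                 # in lazy mode, flushes the accumulated graph
print(y.cpu())
EOF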

Launch Distributed Model Training on 8 Gaudis (HPUs)

Run training on 8 HPUs in BF16 with Habana mixed precision (hmp) enabled. The --map-by slot:PE=6 option binds each of the 8 processes to 6 physical CPU cores; the dl1.24xlarge's 96 vCPUs correspond to 48 physical cores, or 6 per HPU:

mpirun -n 8 --bind-to core --map-by slot:PE=6 \
      --rank-by core --report-bindings \
      --allow-run-as-root \
      $PYTHON mnist.py \
      --batch-size=64 --epochs=1 \
      --lr=1.0 --gamma=0.7 \
      --hpu --hmp --hmp-bf16=ops_bf16_mnist.txt \
      --hmp-fp32=ops_fp32_mnist.txt \
      --use_lazy_mode

The output will look similar to the following:

hmp:verbose_mode  False
hmp:opt_level O1
......
hmp:opt_level O1
hmp:verbose_mode  False

| distributed init (rank 0): env://
| distributed init (rank 3): env://
| distributed init (rank 5): env://
| distributed init (rank 6): env://
| distributed init (rank 4): env://
| distributed init (rank 7): env://
| distributed init (rank 1): env://
| distributed init (rank 2): env://

Train Epoch: 1 [0/7500.0 (0%)]  Loss: 2.328997
Train Epoch: 1 [640/7500.0 (9%)]        Loss: 1.159214
Train Epoch: 1 [1280/7500.0 (17%)]      Loss: 0.587595
Train Epoch: 1 [1920/7500.0 (26%)]      Loss: 0.370976
Train Epoch: 1 [2560/7500.0 (34%)]      Loss: 0.295102
Train Epoch: 1 [3200/7500.0 (43%)]      Loss: 0.142277
Train Epoch: 1 [3840/7500.0 (51%)]      Loss: 0.130573
Train Epoch: 1 [4480/7500.0 (60%)]      Loss: 0.138563
Train Epoch: 1 [5120/7500.0 (68%)]      Loss: 0.101324
Train Epoch: 1 [5760/7500.0 (77%)]      Loss: 0.135026
Train Epoch: 1 [6400/7500.0 (85%)]      Loss: 0.055890
Train Epoch: 1 [7040/7500.0 (94%)]      Loss: 0.101984

Total test set: 10000, number of workers: 8
* Average Acc 97.862 Average loss 0.067

Now you can start your own model training on HPU.
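
For example, you can copy your own training script onto the instance with scp, reusing the key pair and public DNS name from the SSH step (my_train.py is a placeholder):

# Copy a local script into the instance's home directory.
scp -i ~/.ssh/key_pair.pem my_train.py ubuntu@PUBLIC_DNS:~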

What’s Next?

For an in-depth guide to getting started with Gaudi, please refer to our Developer site.