AWS DL1 Quick Start Guide

This document provides instructions to set up an Amazon EC2 DL1 instance and start training a PyTorch model.

Prerequisites

  • You have an AWS account - https://aws.amazon.com.

  • You are using the us-west-2 or us-east-1 region, where Amazon EC2 DL1 instances are available.
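The region prerequisite can be checked from a terminal before opening the console. The following is a minimal sketch, assuming the AWS CLI is installed and configured; the helper name `is_dl1_region` is hypothetical and only encodes the two regions listed above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds only for the two regions this guide lists
# as offering DL1 instances.
is_dl1_region() {
  case "$1" in
    us-west-2|us-east-1) return 0 ;;
    *) return 1 ;;
  esac
}

# Example use (assumes the AWS CLI is installed and configured):
#   region="$(aws configure get region)"
#   is_dl1_region "$region" || echo "Region '$region' is not listed for DL1" >&2
```

If the check fails, switch the region in the console or pass `--region` to the AWS CLI before continuing.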

Create an EC2 Instance

Follow the steps below to launch an EC2 DL1 instance.

Initiate Instance Launch

  1. Open the Amazon EC2 Launch Console.

  2. In Name and tags, enter a name for the instance. This example uses habana-quick-start.

  3. In Application and OS Images, search for Habana Deep Learning Base AMI and choose the desired operating system. The AMIs are listed under AWS Marketplace AMIs.

  4. Click the Select button.

Choose Instance Type

Choose the dl1.24xlarge instance type to run on Gaudi.

Select Key Pair

If you have an existing key pair, select it from the dropdown menu; it is used to access the instance over ssh. Otherwise, click the Create new key pair button to create one.

Configure Network

Configure the network settings according to your setup. To keep this example simple, we chose a network open to the public. We recommend setting security group rules that allow access from known IP addresses only.
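Restricting SSH access to a single known address can be sketched with the AWS CLI as follows. This is an illustration under stated assumptions: the helper name `to_host_cidr`, the `SG_ID` variable, and the documentation address 203.0.113.7 are hypothetical placeholders, not values from this guide.

```shell
#!/usr/bin/env bash
# Hypothetical helper: turn one IPv4 address into a /32 CIDR block,
# rejecting anything that is not four dot-separated octets in 0-255.
to_host_cidr() {
  local ip="$1" octet
  case "$ip" in
    ''|*[!0-9.]*|.*|*.|*..*) return 1 ;;  # digits and dots only, no empty groups
  esac
  local IFS=.
  set -- $ip                              # split the address on dots
  [ "$#" -eq 4 ] || return 1
  for octet in "$@"; do
    [ "${#octet}" -le 3 ] && [ "$octet" -le 255 ] || return 1
  done
  printf '%s/32\n' "$ip"
}

# Example use (SG_ID is a placeholder for your security group ID):
#   aws ec2 authorize-security-group-ingress \
#     --group-id "$SG_ID" --protocol tcp --port 22 \
#     --cidr "$(to_host_cidr 203.0.113.7)"
```

A /32 block admits exactly one address, which is the narrowest rule that still lets you connect from a fixed workstation.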

Configure Storage

Choose the desired storage size for your EC2 instance. This example uses 500 GB.

Review

Verify that your configuration is correct and click the Launch instance button.

Note

When launching a Base AMI for the first time, the subscription process may take a while.

You have now launched an EC2 Instance.
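The console choices above could also be expressed as a single AWS CLI call. The sketch below only prints the command so it can be reviewed before anything billable is started. It rests on several assumptions that are not from this guide: the helper name `print_launch_cmd`, the placeholder AMI ID and key pair name, a root device of /dev/sda1, and a gp3 volume type. It also leaves the security group at the account default.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print (not run) an AWS CLI command that mirrors the
# console choices in this guide.
# Arguments: AMI ID, key pair name, instance name.
# Assumptions: the AMI's root device is /dev/sda1 and a gp3 volume is wanted.
print_launch_cmd() {
  local ami_id="$1" key_name="$2" name="$3"
  cat <<EOF
aws ec2 run-instances \\
  --image-id ${ami_id} \\
  --instance-type dl1.24xlarge \\
  --key-name ${key_name} \\
  --block-device-mappings 'DeviceName=/dev/sda1,Ebs={VolumeSize=500,VolumeType=gp3}' \\
  --tag-specifications 'ResourceType=instance,Tags=[{Key=Name,Value=${name}}]'
EOF
}

# Placeholder values; replace them with the AMI ID and key pair you selected.
print_launch_cmd "ami-EXAMPLE" "my-key-pair" "habana-quick-start"
```

Printing first is a deliberate choice: the AMI ID differs per region and operating system, so the command should be checked against the console values before it is run.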

Connect to your Instance

You can connect to the instance you launched using ssh. Update the placeholder values in the command below before running it:

ssh -i ~/.ssh/"key_pair.pem" ubuntu@"PUBLIC_DNS"

key_pair.pem -- The key selected or created in Select Key Pair
PUBLIC_DNS -- The value shown under Public IPv4 DNS in the instance details

Alternatively, when selecting the instance in your AWS Console, you can click on the Connect button and use the commands tailored for your instance under the SSH client tab.

For more details, refer to the AWS guide Connect to your Linux instance using an SSH client.
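A common first-connection failure is a private key file whose permissions are too open, which OpenSSH refuses to use. The sketch below tightens the permissions and prints the resulting ssh command. The helper name `prepare_ssh` is hypothetical, and the `ubuntu` login user is taken from the command above, so it may differ for other operating system images.

```shell
#!/usr/bin/env bash
# Hypothetical helper: make a private key readable by its owner only, then
# print the ssh command for the instance.
# Arguments: path to the .pem file, the instance's Public IPv4 DNS name.
prepare_ssh() {
  local key_path="$1" public_dns="$2"
  [ -f "$key_path" ] || { echo "Key not found: $key_path" >&2; return 1; }
  chmod 400 "$key_path" || return 1
  # "ubuntu" is the login user from the command above; other AMIs may differ.
  printf 'ssh -i %s ubuntu@%s\n' "$key_path" "$public_dns"
}

# Example use:
#   prepare_ssh ~/.ssh/key_pair.pem "PUBLIC_DNS"
```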

Start Training a PyTorch Model on Gaudi

  1. Run the Habana Docker image:

    docker run -it --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --net=host --ipc=host vault.habana.ai/gaudi-docker/1.10.0/ubuntu20.04/habanalabs/pytorch-installer-2.0.1:latest
  2. Clone the Model References repository inside the container that you have just started:

git clone https://github.com/HabanaAI/Model-References.git
  3. Move to the subdirectory containing the hello_world example:

cd Model-References/PyTorch/examples/computer_vision/hello_world/
  4. Update PYTHONPATH to include the Model-References repository and set PYTHON to the Python executable:

export PYTHONPATH=$PYTHONPATH:Model-References
export PYTHON=/usr/bin/python3.8
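The interpreter path above is hard-coded and matches the Ubuntu 20.04 image used in this example; other images may ship a different Python version. A minimal sketch of a fallback, using the hypothetical helper name `resolve_python`:

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the preferred interpreter if it is executable,
# otherwise fall back to whatever python3 is found on PATH.
resolve_python() {
  local preferred="$1"
  if [ -x "$preferred" ]; then
    printf '%s\n' "$preferred"
  else
    command -v python3
  fi
}

# Example use inside the container:
#   export PYTHON="$(resolve_python /usr/bin/python3.8)"
#   "$PYTHON" --version
```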

Training on a Single Gaudi (HPU) Device

Run training on a single HPU in BF16 with hmp (Habana mixed precision) enabled:

$PYTHON mnist.py --batch-size=64 --epochs=1 --lr=1.0 \
      --gamma=0.7 --hpu --hmp \
      --hmp-bf16=ops_bf16_mnist.txt \
      --hmp-fp32=ops_fp32_mnist.txt \
      --use_lazy_mode

Example output:

hmp:verbose_mode  False
hmp:opt_level O1

Not using distributed mode

Train Epoch: 1 [0/60000.0 (0%)] Loss: 2.307337
Train Epoch: 1 [640/60000.0 (1%)]       Loss: 1.365518
......
Train Epoch: 1 [58880/60000.0 (98%)]    Loss: 0.002533
Train Epoch: 1 [59520/60000.0 (99%)]    Loss: 0.023411
......

Total test set: 10000, number of workers: 1
* Average Acc 98.490 Average loss 0.044
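To compare runs, the final accuracy can be pulled out of a saved log. The sketch below assumes the output was captured to a file (for example with `tee`) and that the summary line keeps the format shown above; the helper name `final_accuracy` and the file name `train.log` are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the accuracy from the last
# "* Average Acc <acc> Average loss <loss>" line in a saved training log.
final_accuracy() {
  awk '/^\* Average Acc/ { acc = $4 } END { if (acc != "") print acc }' "$1"
}

# Example use: append "2>&1 | tee train.log" to the training command above,
# then read the result back:
#   final_accuracy train.log
```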

Distributed Training on 8 Gaudis (HPUs)

Run training on 8 HPUs in BF16 with hmp (Habana mixed precision) enabled:

mpirun -n 8 --bind-to core --map-by slot:PE=6 \
      --rank-by core --report-bindings \
      --allow-run-as-root \
      $PYTHON mnist.py \
      --batch-size=64 --epochs=1 \
      --lr=1.0 --gamma=0.7 \
      --hpu --hmp --hmp-bf16=ops_bf16_mnist.txt \
      --hmp-fp32=ops_fp32_mnist.txt \
      --use_lazy_mode

Example output:

hmp:verbose_mode  False
hmp:opt_level O1
......
hmp:opt_level O1
hmp:verbose_mode  False

| distributed init (rank 0): env://
| distributed init (rank 3): env://
| distributed init (rank 5): env://
| distributed init (rank 6): env://
| distributed init (rank 4): env://
| distributed init (rank 7): env://
| distributed init (rank 1): env://
| distributed init (rank 2): env://

Train Epoch: 1 [0/7500.0 (0%)]  Loss: 2.328997
Train Epoch: 1 [640/7500.0 (9%)]        Loss: 1.159214
Train Epoch: 1 [1280/7500.0 (17%)]      Loss: 0.587595
Train Epoch: 1 [1920/7500.0 (26%)]      Loss: 0.370976
Train Epoch: 1 [2560/7500.0 (34%)]      Loss: 0.295102
Train Epoch: 1 [3200/7500.0 (43%)]      Loss: 0.142277
Train Epoch: 1 [3840/7500.0 (51%)]      Loss: 0.130573
Train Epoch: 1 [4480/7500.0 (60%)]      Loss: 0.138563
Train Epoch: 1 [5120/7500.0 (68%)]      Loss: 0.101324
Train Epoch: 1 [5760/7500.0 (77%)]      Loss: 0.135026
Train Epoch: 1 [6400/7500.0 (85%)]      Loss: 0.055890
Train Epoch: 1 [7040/7500.0 (94%)]      Loss: 0.101984

Total test set: 10000, number of workers: 8
* Average Acc 97.862 Average loss 0.067
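Two numbers in the distributed run follow from simple arithmetic. The per-worker sample count of 7500 in the log is consistent with the 60000 training images being split evenly across the 8 ranks, and `-n 8` combined with `--map-by slot:PE=6` asks mpirun to bind 6 cores to each of the 8 ranks. A short sketch, with hypothetical helper names:

```shell
#!/usr/bin/env bash
# Per-worker share of the training set when it is split evenly across ranks.
samples_per_worker() {
  echo $(( $1 / $2 ))
}

# Total CPU cores requested by mpirun: ranks (-n) times cores per rank (PE).
cores_requested() {
  echo $(( $1 * $2 ))
}

samples_per_worker 60000 8   # prints 7500, matching the log above
cores_requested 8 6          # prints 48
```

If you change the number of ranks or the PE value, compare the requested core count against the cores available on the instance (for example with `nproc` or `lscpu`) so that the binding request can be satisfied.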

You have now launched a Gaudi-based EC2 DL1 instance and trained a simple PyTorch model on Gaudi. From here, you can start training your own models on HPU.