Amazon ECS with Gaudi User Guide

This document provides guidelines on how to set up and run distributed DL1 training workloads through AWS Batch. AWS Batch automatically provisions compute resources and optimizes workload distribution based on the quantity and scale of the workloads. It configures an Elastic Container Service (ECS) cluster to manage the compute resources and the scheduling of submitted jobs. Batch provides multi-node parallel (MNP) jobs, in which a single job spans multiple Amazon EC2 instances. This is ideal for large-scale distributed model training without the hassle of launching, configuring, and managing resources directly.
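As a rough illustration of what a multi-node parallel job looks like, the sketch below builds a job definition document in Python and prints it as JSON. The field names (`type`, `nodeProperties`, `numNodes`, `mainNode`, `nodeRangeProperties`, `targetNodes`) follow the AWS Batch job definition format. The helper `build_mnp_job_definition`, the job name, the image URI, the command, and the resource values are all hypothetical placeholders, not values taken from this guide.

```python
import json


def build_mnp_job_definition(name, image, num_nodes, command):
    """Return a dict shaped like an AWS Batch multi-node parallel job definition.

    Hypothetical helper for illustration only; it makes no AWS calls.
    """
    if num_nodes < 1:
        raise ValueError("num_nodes must be at least 1")
    return {
        "jobDefinitionName": name,
        "type": "multinode",  # marks the job as multi-node parallel
        "nodeProperties": {
            "numNodes": num_nodes,  # number of EC2 instances the job spans
            "mainNode": 0,          # index of the node that coordinates the job
            "nodeRangeProperties": [
                {
                    "targetNodes": "0:",  # apply this container spec to every node
                    "container": {
                        "image": image,
                        "command": command,
                        "instanceType": "dl1.24xlarge",
                        # Placeholder resource values; size them for your workload.
                        "resourceRequirements": [
                            {"type": "VCPU", "value": "96"},
                            {"type": "MEMORY", "value": "65536"},
                        ],
                    },
                }
            ],
        },
    }


# Assumed example values: a two-node job running a hypothetical training script.
job_definition = build_mnp_job_definition(
    name="mnist-mnp-example",
    image="<account-id>.dkr.ecr.<region>.amazonaws.com/<repository>:<tag>",
    num_nodes=2,
    command=["python3", "train.py"],
)
print(json.dumps(job_definition, indent=2))
```

The printed JSON could be saved to a file and supplied to `aws batch register-job-definition` through its `--cli-input-json` option. A working definition for Gaudi typically needs further container settings, which the later sections of this guide cover.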

After setting up the compute environment, you can deploy images and run your deep learning applications on the Intel® Gaudi® AI accelerator for accelerated training and development.

Before you get started, make sure the following prerequisites for running distributed training with Amazon ECS are met:

Note

The steps outlined throughout this document are performed using the AWS CLI. The same steps can also be performed using the AWS Console.

The following sections provide details on using AWS Batch with a simple MNIST training example:

For advanced training usage, the Advanced Model Training Batch Example: ResNet50 section walks through ResNet50 training on a large dataset.