Create AWS Batch Compute Environment

AWS Batch consists of a compute environment, a job queue, a job definition, and jobs.

This section focuses on:

Creating an EC2 Launch Template
Creating a Compute Environment
Creating a Job Queue

Create an EC2 Launch Template

A launch template defines the base configuration for each node in the AWS Batch cluster. To create a launch template, follow these steps:

Find the latest Habana ECS AMI ID to set up the node:
```
aws ec2 describe-images --region us-west-2 --filters "Name=name,Values=habanalabs-ecs*" --query 'Images[].{Name: Name, ImageID: ImageId}'
```
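If several images match the filter, the newest one can be selected by creation date. The following is a minimal sketch, assuming the command is run without the `--query` option so that each image keeps its `CreationDate` field, and that the JSON output is available as a string. The helper name `latest_image` and the sample data are hypothetical.

```python
import json


def latest_image(describe_images_json: str) -> dict:
    """Return the most recently created image from describe-images output."""
    images = json.loads(describe_images_json)["Images"]
    if not images:
        raise ValueError("no images matched the filter")
    # CreationDate is an ISO 8601 UTC timestamp, so string order is chronological.
    return max(images, key=lambda image: image["CreationDate"])


# Made-up sample output, for illustration only.
sample = json.dumps({
    "Images": [
        {"ImageId": "ami-aaaa", "Name": "habanalabs-ecs-old",
         "CreationDate": "2022-05-01T10:00:00.000Z"},
        {"ImageId": "ami-bbbb", "Name": "habanalabs-ecs-new",
         "CreationDate": "2022-09-01T10:00:00.000Z"},
    ]
})
newest = latest_image(sample)
print(newest["ImageId"], newest["Name"])
```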
Note

An instance with more than one EFA network interface cannot be allocated a public IP address. To connect to such an instance, one option is to assign it an Elastic IP address.

Create an ecs_launch_template.json file with the following configuration, and update the placeholders as described in the table below:

{
  "DryRun": false,
  "LaunchTemplateName": "ECS_DL1_EFS",
  "VersionDescription": "Override Template",
  "LaunchTemplateData": {
    "IamInstanceProfile": {
      "Arn": "INSTANCE_PROFILE"
    },
    "BlockDeviceMappings": [
      {
        "DeviceName": "/dev/xvda",
        "Ebs": {
          "VolumeSize": 200,
          "DeleteOnTermination": true
        }
      }
    ],
    "NetworkInterfaces": [
      {
        "AssociatePublicIpAddress": false,
        "DeleteOnTermination": true,
        "DeviceIndex": 0,
        "Groups": [
          "SECURITY_GROUP"
        ],
        "InterfaceType": "efa",
        "Ipv6AddressCount": 0,
        "SubnetId": "SUBNET",
        "NetworkCardIndex": 0
      },
      {
        "AssociatePublicIpAddress": false,
        "DeleteOnTermination": true,
        "DeviceIndex": 1,
        "Groups": [
          "SECURITY_GROUP"
        ],
        "InterfaceType": "efa",
        "Ipv6AddressCount": 0,
        "SubnetId": "SUBNET",
        "NetworkCardIndex": 1
      },
      {
        "AssociatePublicIpAddress": false,
        "DeleteOnTermination": true,
        "DeviceIndex": 2,
        "Groups": [
          "SECURITY_GROUP"
        ],
        "InterfaceType": "efa",
        "Ipv6AddressCount": 0,
        "SubnetId": "SUBNET",
        "NetworkCardIndex": 2
      },
      {
        "AssociatePublicIpAddress": false,
        "DeleteOnTermination": true,
        "DeviceIndex": 3,
        "Groups": [
          "SECURITY_GROUP"
        ],
        "InterfaceType": "efa",
        "Ipv6AddressCount": 0,
        "SubnetId": "SUBNET",
        "NetworkCardIndex": 3
      }
    ],
    "ImageId": "AMI_ID",
    "KeyName": "PEM_KEY_NAME",
    "Monitoring": {
      "Enabled": true
    },
    "DisableApiTermination": false,
    "InstanceInitiatedShutdownBehavior": "stop",
    "UserData": "",
    "TagSpecifications": [
      {
        "ResourceType": "instance",
        "Tags": [
          {
            "Key": "purpose",
            "Value": "batch multinode training"
          }
        ]
      }
    ],
    "MetadataOptions": {
      "HttpTokens": "required",
      "HttpPutResponseHopLimit": 5,
      "HttpEndpoint": "enabled"
    }
  },
  "TagSpecifications": [
    {
      "ResourceType": "launch-template",
      "Tags": [
        {
          "Key": "purpose",
          "Value": "batch training"
        }
      ]
    }
  ]
}
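The four NetworkInterfaces entries in the configuration above differ only in DeviceIndex and NetworkCardIndex, which both run from 0 to 3. As a sketch of that pattern, the list can be generated rather than typed by hand. The helper name `efa_interfaces` is hypothetical, and the placeholder strings are left unfilled on purpose.

```python
import json


def efa_interfaces(count: int, security_group: str, subnet: str) -> list:
    """Build `count` EFA network interface entries, one per network card."""
    return [
        {
            "AssociatePublicIpAddress": False,
            "DeleteOnTermination": True,
            "DeviceIndex": index,
            "Groups": [security_group],
            "InterfaceType": "efa",
            "Ipv6AddressCount": 0,
            "SubnetId": subnet,
            # In the template above, the card index matches the device index.
            "NetworkCardIndex": index,
        }
        for index in range(count)
    ]


interfaces = efa_interfaces(4, "SECURITY_GROUP", "SUBNET")
print(json.dumps(interfaces, indent=2))
```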

Placeholder	Replace with
INSTANCE_PROFILE	arn:aws:iam::xxxxxxxxxxx:instance-profile/ecsInstanceRole
SECURITY_GROUP	sg-xxxxxxxxx
SUBNET	subnet-xxxxxxxxx
PEM_KEY_NAME	key_name_no_extension
AMI_ID	ami-xxxxxxxxxxxxxxxxx
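The substitution described in the table can also be scripted. The sketch below is one possible approach, not part of the original procedure: it replaces each quoted placeholder in the template text and then parses the result, so a typo that breaks the JSON is caught early. The helper name `fill_placeholders`, the shortened template, and the values are hypothetical.

```python
import json


def fill_placeholders(template_text: str, values: dict) -> dict:
    """Replace each quoted placeholder with its value and parse the result."""
    for placeholder, value in values.items():
        # Match the whole quoted token, e.g. "SUBNET", to avoid partial hits.
        quoted = json.dumps(placeholder)
        if quoted not in template_text:
            raise KeyError(f"placeholder {placeholder} not found in template")
        template_text = template_text.replace(quoted, json.dumps(value))
    # Parsing verifies that the filled template is still valid JSON.
    return json.loads(template_text)


# Shortened template and made-up values, for illustration only.
template = '{"LaunchTemplateData": {"ImageId": "AMI_ID", "KeyName": "PEM_KEY_NAME"}}'
filled = fill_placeholders(template, {
    "AMI_ID": "ami-0123456789abcdef0",
    "PEM_KEY_NAME": "my_key",
})
print(json.dumps(filled, indent=2))
```

In practice, the template text would be read from ecs_launch_template.json and the filled result written back before running the create command.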

Create a launch template:

aws ec2 create-launch-template --cli-input-json file://ecs_launch_template.json

The output should look similar to the following:

{
    "LaunchTemplate": {
        "LaunchTemplateId": "lt-xxxxxxxx",
        "LaunchTemplateName": "ECS_DL1_EFS",
        "CreateTime": "2022-09-06T20:58:49.000Z",
        "CreatedBy": "............",
        "DefaultVersionNumber": 1,
        "LatestVersionNumber": 1,
        "Tags": [
            {
                "Key": "purpose",
                "Value": "batch training"
            }
        ]
    }
}
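The LaunchTemplateId in this output is the value used for the LAUNCH_TEMPLATE placeholder in the compute environment configuration. As a small sketch, assuming the command output has been captured as a string, the ID can be read from the JSON instead of copied by hand. The helper name `launch_template_id` is hypothetical.

```python
import json


def launch_template_id(create_output: str) -> str:
    """Extract the launch template ID from create-launch-template output."""
    return json.loads(create_output)["LaunchTemplate"]["LaunchTemplateId"]


# Trimmed copy of the sample output shown above.
sample_output = """
{
    "LaunchTemplate": {
        "LaunchTemplateId": "lt-xxxxxxxx",
        "LaunchTemplateName": "ECS_DL1_EFS",
        "DefaultVersionNumber": 1,
        "LatestVersionNumber": 1
    }
}
"""
print(launch_template_id(sample_output))
```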

Create AWS Batch Compute Environment

A compute environment manages and scales the AWS EC2 compute resources. Follow these steps:

Create a dl1_batch_ce.json file with the following configuration, and update the placeholders as described in the table below:

{
    "computeEnvironmentName": "dl1_mnp_ce",
    "type": "MANAGED",
    "state": "ENABLED",
    "computeResources": {
        "type": "EC2",
        "minvCpus": 0,
        "maxvCpus": 192,
        "desiredvCpus": 0,
        "launchTemplate": {
            "launchTemplateId": "LAUNCH_TEMPLATE",
            "version": "1"
        },
        "instanceTypes": [
            "dl1.24xlarge"
        ],
        "subnets": [
            "SUBNET"
        ],
        "ec2KeyPair": "PEM_KEY_NAME",
        "instanceRole": "INSTANCE_ROLE",
        "tags": {
            "Name": "dl1_mnp_ce"
        }
    },
    "serviceRole": "BATCH_SERVICE_ROLE"
}

Placeholder	Replace with
INSTANCE_ROLE	arn:aws:iam::xxxxxxxx:instance-profile/ecsInstanceRole
LAUNCH_TEMPLATE	lt-xxxxxxxx
SUBNET	subnet-xxxxxxxxx
PEM_KEY_NAME	key_name_no_extension
BATCH_SERVICE_ROLE	arn:aws:iam::xxxxxxxxxxx:role/service-role/AWSBatchServiceRole

Note

Each DL1 instance requires 96 vCPUs. The maxvCpus field sets the ceiling for how many DL1 instances can run in this compute environment. This example runs 2 DL1 instances, which require 192 vCPUs. The maxvCpus value can be increased after creation to run more DL1 instances.
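The arithmetic in the note can be written out as a short sketch. The function names are hypothetical; the only figure taken from the note is 96 vCPUs per DL1 instance.

```python
VCPUS_PER_DL1 = 96  # vCPUs per dl1.24xlarge instance, as stated in the note


def max_vcpus_for(instances: int) -> int:
    """Return the maxvCpus value needed to run `instances` DL1 instances."""
    if instances < 1:
        raise ValueError("at least one instance is required")
    return instances * VCPUS_PER_DL1


def instances_allowed(max_vcpus: int) -> int:
    """Return how many whole DL1 instances fit under a maxvCpus ceiling."""
    return max_vcpus // VCPUS_PER_DL1


print(max_vcpus_for(2))        # 192, the value used in dl1_batch_ce.json
print(instances_allowed(192))  # 2
```

For example, a run on 4 DL1 instances would need maxvCpus set to at least 384.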

Create a compute environment:

aws batch create-compute-environment --cli-input-json file://dl1_batch_ce.json

The output should look similar to the following:

{
    "computeEnvironmentName": "dl1_mnp_ce",
    "computeEnvironmentArn": "arn:aws:batch:us-west-2:xxxxxxxxxxxx:compute-environment/dl1_mnp_ce"
}
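Before attaching a job queue, it can help to confirm that the new compute environment has finished creating. The sketch below assumes the JSON output of `aws batch describe-compute-environments --compute-environments dl1_mnp_ce` has been captured as a string, and checks the `state` and `status` fields of the matching entry. The helper name `is_ready` and the sample data are hypothetical.

```python
import json


def is_ready(describe_output: str, name: str) -> bool:
    """Return True when the named compute environment is ENABLED and VALID."""
    environments = json.loads(describe_output)["computeEnvironments"]
    for env in environments:
        if env["computeEnvironmentName"] == name:
            return env["state"] == "ENABLED" and env["status"] == "VALID"
    # The environment was not listed at all.
    return False


# Made-up, trimmed sample output, for illustration only.
sample = json.dumps({
    "computeEnvironments": [
        {"computeEnvironmentName": "dl1_mnp_ce",
         "state": "ENABLED", "status": "VALID"}
    ]
})
print(is_ready(sample, "dl1_mnp_ce"))
```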

Create AWS Job Queue

A job queue holds submitted jobs until they are scheduled to run in a compute environment. Follow these steps:

Create a dl1_batch_jq.json file with the following configuration. If you used a different compute environment name in the previous step, update the computeEnvironment value to match:

{
    "jobQueueName": "dl1_mnp_jq",
    "state": "ENABLED",
    "priority": 1,
    "computeEnvironmentOrder": [
        {
            "order": 1,
            "computeEnvironment": "dl1_mnp_ce"
        }
    ]
}
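The job queue configuration above can also be generated from the compute environment name, which keeps the two files consistent. This is a minimal sketch; the helper name `job_queue_config` is hypothetical, and the field names are the ones shown in dl1_batch_jq.json.

```python
import json


def job_queue_config(queue_name: str, compute_environments: list, priority: int = 1) -> dict:
    """Build a job queue configuration that tries environments in list order."""
    return {
        "jobQueueName": queue_name,
        "state": "ENABLED",
        "priority": priority,
        "computeEnvironmentOrder": [
            # order starts at 1, matching the example configuration.
            {"order": order, "computeEnvironment": name}
            for order, name in enumerate(compute_environments, start=1)
        ],
    }


config = job_queue_config("dl1_mnp_jq", ["dl1_mnp_ce"])
print(json.dumps(config, indent=4))
```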

Create a job queue:

aws batch create-job-queue --cli-input-json file://dl1_batch_jq.json

The output should look similar to the following:

{
    "jobQueueName": "dl1_mnp_jq",
    "jobQueueArn": "arn:aws:batch:us-west-2:xxxxxxxxxxxx:job-queue/dl1_mnp_jq"
}

Gaudi Documentation 1.21.1 documentation
