Create AWS Batch Compute Environment

AWS Batch consists of a compute environment, a job queue, job definitions, and jobs.

This section focuses on:

  • Creating an EC2 Launch Template

  • Creating a Compute Environment

  • Creating a Job Queue

Create an EC2 Launch Template

A launch template defines the base configuration for each node in the AWS Batch cluster. To create a launch template, follow the steps below:

  1. Find the latest Habana ECS AMI ID to set up the node:

aws ec2 describe-images  --region us-west-2 --filters "Name=name,Values=habanalabs-ecs*" --query 'Images[].{Name: Name, ImageID: ImageId}'
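The command above lists every matching image rather than a single one. As a minimal sketch, assuming that "latest" means the image with the most recent CreationDate, the query can sort the results and return only the newest ImageId. The function name latest_habana_ami is hypothetical and not part of the AWS CLI.

```shell
# Hypothetical helper: print the ImageId of the newest image whose name
# matches habanalabs-ecs* in the given region (default: us-west-2).
# Assumption: "latest" means the most recent CreationDate.
latest_habana_ami() {
  region="${1:-us-west-2}"
  aws ec2 describe-images \
    --region "$region" \
    --filters "Name=name,Values=habanalabs-ecs*" \
    --query 'sort_by(Images, &CreationDate)[-1].ImageId' \
    --output text
}

# Usage (requires configured AWS credentials):
#   AMI_ID=$(latest_habana_ami us-west-2)
#   echo "$AMI_ID"
```

The printed value is the one to use for the AMI_ID placeholder in the launch template.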

Note

Having more than one EFA network interface prevents allocating a public IP address. To connect to the instance, one option is to assign an Elastic IP address.

  2. Create an ecs_launch_template.json file with the following configuration and update the placeholders:

{
  "DryRun": false,
  "LaunchTemplateName": "ECS_DL1_EFS",
  "VersionDescription": "Override Template",
  "LaunchTemplateData": {
    "IamInstanceProfile": {
      "Arn": "INSTANCE_PROFILE"
    },
    "BlockDeviceMappings": [
      {
        "DeviceName": "/dev/xvda",
        "Ebs": {
          "VolumeSize": 200,
          "DeleteOnTermination": true
        }
      }
    ],
    "NetworkInterfaces": [
      {
        "AssociatePublicIpAddress": false,
        "DeleteOnTermination": true,
        "DeviceIndex": 0,
        "Groups": [
          "SECURITY_GROUP"
        ],
        "InterfaceType": "efa",
        "Ipv6AddressCount": 0,
        "SubnetId": "SUBNET",
        "NetworkCardIndex": 0
      },
      {
        "AssociatePublicIpAddress": false,
        "DeleteOnTermination": true,
        "DeviceIndex": 1,
        "Groups": [
          "SECURITY_GROUP"
        ],
        "InterfaceType": "efa",
        "Ipv6AddressCount": 0,
        "SubnetId": "SUBNET",
        "NetworkCardIndex": 1
      },
      {
        "AssociatePublicIpAddress": false,
        "DeleteOnTermination": true,
        "DeviceIndex": 2,
        "Groups": [
          "SECURITY_GROUP"
        ],
        "InterfaceType": "efa",
        "Ipv6AddressCount": 0,
        "SubnetId": "SUBNET",
        "NetworkCardIndex": 2
      },
      {
        "AssociatePublicIpAddress": false,
        "DeleteOnTermination": true,
        "DeviceIndex": 3,
        "Groups": [
          "SECURITY_GROUP"
        ],
        "InterfaceType": "efa",
        "Ipv6AddressCount": 0,
        "SubnetId": "SUBNET",
        "NetworkCardIndex": 3
      }
    ],
    "ImageId": "AMI_ID",
    "KeyName": "PEM_KEY_NAME",
    "Monitoring": {
      "Enabled": true
    },
    "DisableApiTermination": false,
    "InstanceInitiatedShutdownBehavior": "stop",
    "UserData": "",
    "TagSpecifications": [
      {
        "ResourceType": "instance",
        "Tags": [
          {
            "Key": "purpose",
            "Value": "batch multinode training"
          }
        ]
      }
    ],
    "MetadataOptions": {
      "HttpTokens": "required",
      "HttpPutResponseHopLimit": 5,
      "HttpEndpoint": "enabled"
    }
  },
  "TagSpecifications": [
    {
      "ResourceType": "launch-template",
      "Tags": [
        {
          "Key": "purpose",
          "Value": "batch training"
        }
      ]
    }
  ]
}

Placeholder        Replace with
INSTANCE_PROFILE   arn:aws:iam::xxxxxxxxxxx:instance-profile/ecsInstanceRole
SECURITY_GROUP     sg-xxxxxxxxx
SUBNET             subnet-xxxxxxxxx
PEM_KEY_NAME       key_name_no_extension
AMI_ID             ami-xxxxxxxxxxxxxxxxx
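The placeholder substitution can also be scripted instead of edited by hand. The following is a minimal sketch, assuming the real values are held in shell variables named after the placeholders; the function name fill_launch_template and the file names in the usage lines are hypothetical.

```shell
# Hypothetical helper: copy a template file to a new file, replacing each
# quoted placeholder with the value of the shell variable of the same name.
# "|" is used as the sed delimiter because ARNs contain "/".
# Assumption: the values themselves contain no "|" or "&" characters.
fill_launch_template() {
  src="$1"
  dst="$2"
  sed -e "s|\"INSTANCE_PROFILE\"|\"$INSTANCE_PROFILE\"|g" \
      -e "s|\"SECURITY_GROUP\"|\"$SECURITY_GROUP\"|g" \
      -e "s|\"SUBNET\"|\"$SUBNET\"|g" \
      -e "s|\"PEM_KEY_NAME\"|\"$PEM_KEY_NAME\"|g" \
      -e "s|\"AMI_ID\"|\"$AMI_ID\"|g" \
      "$src" > "$dst"
}

# Usage (set the five variables to your own values first):
#   SUBNET="subnet-xxxxxxxxx"
#   fill_launch_template ecs_launch_template.template.json ecs_launch_template.json
```

Matching the placeholders together with their surrounding double quotes keeps the substitution from touching JSON keys or other text that happens to contain the same letters.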

  3. Run the following aws command to create a launch template:

aws ec2 create-launch-template --cli-input-json file://ecs_launch_template.json

The output should look similar to the following:

{
    "LaunchTemplate": {
        "LaunchTemplateId": "lt-xxxxxxxx",
        "LaunchTemplateName": "ECS_DL1_EFS",
        "CreateTime": "2022-09-06T20:58:49.000Z",
        "CreatedBy": "............",
        "DefaultVersionNumber": 1,
        "LatestVersionNumber": 1,
        "Tags": [
            {
                "Key": "purpose",
                "Value": "batch training"
            }
        ]
    }
}
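The LaunchTemplateId in this output is the value that replaces the LAUNCH_TEMPLATE placeholder when the compute environment is configured. If the output is no longer at hand, the ID can be looked up again by name. This is a minimal sketch; the function name launch_template_id is hypothetical, and it assumes the AWS CLI default region is the one in which the template was created.

```shell
# Hypothetical helper: print the ID of a launch template, given its name.
# Assumption: the CLI's default region is where the template was created.
launch_template_id() {
  aws ec2 describe-launch-templates \
    --launch-template-names "$1" \
    --query 'LaunchTemplates[0].LaunchTemplateId' \
    --output text
}

# Usage (requires configured AWS credentials):
#   LAUNCH_TEMPLATE=$(launch_template_id ECS_DL1_EFS)
#   echo "$LAUNCH_TEMPLATE"
```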

Create AWS Batch Compute Environment

A compute environment manages and scales the Amazon EC2 compute resources that run the jobs. Follow the steps below:

  1. Create a dl1_batch_ce.json file with the following configuration and update the placeholders:

{
    "computeEnvironmentName": "dl1_mnp_ce",
    "type": "MANAGED",
    "state": "ENABLED",
    "computeResources": {
        "type": "EC2",
        "minvCpus": 0,
        "maxvCpus": 192,
        "desiredvCpus": 0,
        "launchTemplate": {
            "launchTemplateId": "LAUNCH_TEMPLATE",
            "version": "1"
        },
        "instanceTypes": [
            "dl1.24xlarge"
        ],
        "subnets": [
            "SUBNET"
        ],
        "ec2KeyPair": "PEM_KEY_NAME",
        "instanceRole": "INSTANCE_ROLE",
        "tags": {
            "Name": "dl1_mnp_ce"
        }
    },
    "serviceRole": "BATCH_SERVICE_ROLE"
}

Placeholder          Replace with
INSTANCE_ROLE        arn:aws:iam::xxxxxxxx:instance-profile/ecsInstanceRole
LAUNCH_TEMPLATE      lt-xxxxxxxx
SUBNET               subnet-xxxxxxxxx
PEM_KEY_NAME         key_name_no_extension
BATCH_SERVICE_ROLE   arn:aws:iam::xxxxxxxxxxx:role/service-role/AWSBatchServiceRole

Note

Each DL1 instance requires 96 vCPUs. The maxvCpus field sets the ceiling on how many DL1 instances can run in this compute environment. This example runs 2 DL1 instances, which require 192 vCPUs. The maxvCpus value can be raised after creation to run more DL1 instances.

  2. Run the aws command to create a compute environment:

aws batch create-compute-environment --cli-input-json file://dl1_batch_ce.json

The output should look similar to the following:

{
    "computeEnvironmentName": "dl1_mnp_ce",
    "computeEnvironmentArn": "arn:aws:batch:us-west-2:xxxxxxxxxxxx:compute-environment/dl1_mnp_ce"
}
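The maxvCpus sizing from the note above is simple arithmetic: the number of DL1 instances multiplied by 96. The following is a minimal sketch of that calculation, together with one way to raise the ceiling of an existing compute environment through aws batch update-compute-environment. The function names dl1_max_vcpus and set_dl1_capacity are hypothetical.

```shell
# vCPUs required per DL1 instance, as stated in the note above.
DL1_VCPUS=96

# Print the maxvCpus value needed for a given number of DL1 instances.
dl1_max_vcpus() {
  echo $(( $1 * DL1_VCPUS ))
}

# Hypothetical helper: raise the maxvCpus ceiling of an existing compute
# environment so that it can hold the given number of DL1 instances.
set_dl1_capacity() {
  ce_name="$1"
  nodes="$2"
  aws batch update-compute-environment \
    --compute-environment "$ce_name" \
    --compute-resources "maxvCpus=$(dl1_max_vcpus "$nodes")"
}

# Usage (requires configured AWS credentials):
#   dl1_max_vcpus 2                 # value used in dl1_batch_ce.json
#   set_dl1_capacity dl1_mnp_ce 4   # allow up to 4 DL1 instances
```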

Create AWS Batch Job Queue

A job queue holds submitted jobs until they are scheduled to run in a compute environment. Follow the steps below:

  1. Create a dl1_batch_jq.json file with the following configuration:

{
    "jobQueueName": "dl1_mnp_jq",
    "state": "ENABLED",
    "priority": 1,
    "computeEnvironmentOrder": [
        {
            "order": 1,
            "computeEnvironment": "dl1_mnp_ce"
        }
    ]
}

  2. Run the aws command to create a job queue:

aws batch create-job-queue --cli-input-json file://dl1_batch_jq.json

The output should look similar to the following:

{
    "jobQueueName": "dl1_mnp_jq",
    "jobQueueArn": "arn:aws:batch:us-west-2:xxxxxxxxxxxx:job-queue/dl1_mnp_jq"
}
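Creating a job queue is asynchronous, so it can be useful to confirm that the queue is usable before submitting jobs. The following is a minimal sketch, assuming the queue is ready once its state is ENABLED and its status is VALID; the function name job_queue_ready is hypothetical.

```shell
# Hypothetical helper: succeed (exit status 0) only when the named job queue
# reports state ENABLED and status VALID.
job_queue_ready() {
  # With --output text the two queried fields are printed on one line,
  # separated by a tab, and become $1 and $2 after "set --".
  set -- $(aws batch describe-job-queues \
    --job-queues "$1" \
    --query 'jobQueues[0].[state,status]' \
    --output text)
  [ "$1" = "ENABLED" ] && [ "$2" = "VALID" ]
}

# Usage (requires configured AWS credentials):
#   until job_queue_ready dl1_mnp_jq; do sleep 5; done
#   echo "dl1_mnp_jq is ready"
```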