Create AWS Batch Compute Environment

AWS Batch consists of a compute environment, a job queue, job definitions, and jobs.

This section focuses on:

  • Creating an EC2 Launch Template

  • Creating a Compute Environment

  • Creating a Job Queue

Create an EC2 Launch Template

A launch template defines the base configuration for each node in the AWS Batch cluster. To create a launch template, follow the steps below:

  1. Find the latest Habana ECS AMI ID to set up the node:

    aws ec2 describe-images  --region us-west-2 --filters "Name=name,Values=habanalabs-ecs*" --query 'Images[].{Name: Name, ImageID: ImageId}'
    

    Note

    An instance launched with more than one EFA network interface cannot be allocated a public IP address automatically. To connect to the instance, one option is to associate an Elastic IP address with it.

  2. Create an ecs_launch_template.json file with the following configuration, and update the placeholders as described in the table below:

    {
      "DryRun": false,
      "LaunchTemplateName": "ECS_DL1_EFS",
      "VersionDescription": "Override Template",
      "LaunchTemplateData": {
        "IamInstanceProfile": {
          "Arn": "INSTANCE_PROFILE"
        },
        "BlockDeviceMappings": [
          {
            "DeviceName": "/dev/xvda",
            "Ebs": {
              "VolumeSize": 200,
              "DeleteOnTermination": true
            }
          }
        ],
        "NetworkInterfaces": [
          {
            "AssociatePublicIpAddress": false,
            "DeleteOnTermination": true,
            "DeviceIndex": 0,
            "Groups": [
              "SECURITY_GROUP"
            ],
            "InterfaceType": "efa",
            "Ipv6AddressCount": 0,
            "SubnetId": "SUBNET",
            "NetworkCardIndex": 0
          },
          {
            "AssociatePublicIpAddress": false,
            "DeleteOnTermination": true,
            "DeviceIndex": 1,
            "Groups": [
              "SECURITY_GROUP"
            ],
            "InterfaceType": "efa",
            "Ipv6AddressCount": 0,
            "SubnetId": "SUBNET",
            "NetworkCardIndex": 1
          },
          {
            "AssociatePublicIpAddress": false,
            "DeleteOnTermination": true,
            "DeviceIndex": 2,
            "Groups": [
              "SECURITY_GROUP"
            ],
            "InterfaceType": "efa",
            "Ipv6AddressCount": 0,
            "SubnetId": "SUBNET",
            "NetworkCardIndex": 2
          },
          {
            "AssociatePublicIpAddress": false,
            "DeleteOnTermination": true,
            "DeviceIndex": 3,
            "Groups": [
              "SECURITY_GROUP"
            ],
            "InterfaceType": "efa",
            "Ipv6AddressCount": 0,
            "SubnetId": "SUBNET",
            "NetworkCardIndex": 3
          }
        ],
        "ImageId": "AMI_ID",
        "KeyName": "PEM_KEY_NAME",
        "Monitoring": {
          "Enabled": true
        },
        "DisableApiTermination": false,
        "InstanceInitiatedShutdownBehavior": "stop",
        "UserData": "",
        "TagSpecifications": [
          {
            "ResourceType": "instance",
            "Tags": [
              {
                "Key": "purpose",
                "Value": "batch multinode training"
              }
            ]
          }
        ],
        "MetadataOptions": {
          "HttpTokens": "required",
          "HttpPutResponseHopLimit": 5,
          "HttpEndpoint": "enabled"
        }
      },
      "TagSpecifications": [
        {
          "ResourceType": "launch-template",
          "Tags": [
            {
              "Key": "purpose",
              "Value": "batch training"
            }
          ]
        }
      ]
    }
    

    Placeholder         Replace with
    INSTANCE_PROFILE    arn:aws:iam::xxxxxxxxxxx:instance-profile/ecsInstanceRole
    SECURITY_GROUP      sg-xxxxxxxxx
    SUBNET              subnet-xxxxxxxxx
    PEM_KEY_NAME        key_name_no_extension
    AMI_ID              ami-xxxxxxxxxxxxxxxxx

  3. Create a launch template:

    aws ec2 create-launch-template --cli-input-json file://ecs_launch_template.json
    

    The output should look similar to the following:

    {
        "LaunchTemplate": {
            "LaunchTemplateId": "lt-xxxxxxxx",
            "LaunchTemplateName": "ECS_DL1_EFS",
            "CreateTime": "2022-09-06T20:58:49.000Z",
            "CreatedBy": "............",
            "DefaultVersionNumber": 1,
            "LatestVersionNumber": 1,
            "Tags": [
                {
                    "Key": "purpose",
                    "Value": "batch training"
                }
            ]
        }
    }
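
Steps 1 and 2 above can also be scripted. The following is a minimal sketch, not part of the original procedure. The function names latest_habana_ami and fill_launch_template and the output file name ecs_launch_template.filled.json are hypothetical. It assumes that the most recently created AMI matching the habanalabs-ecs* filter is the one you want, and that ecs_launch_template.json was saved with the placeholder tokens exactly as shown above.

```shell
#!/usr/bin/env bash
# Sketch only: helper names and the output file name are hypothetical.

# Print the ImageId of the most recently created AMI matching the name filter.
# Assumes "newest" is the right choice; check the list from step 1 if unsure.
latest_habana_ami() {
  local region="${1:-us-west-2}"
  aws ec2 describe-images --region "$region" \
    --filters "Name=name,Values=habanalabs-ecs*" \
    --query 'sort_by(Images, &CreationDate)[-1].ImageId' \
    --output text
}

# Replace the quoted placeholder tokens in a template and print the result.
# Arguments: template path, instance profile ARN, security group ID,
#            subnet ID, key pair name, AMI ID.
fill_launch_template() {
  sed -e "s|\"INSTANCE_PROFILE\"|\"$2\"|g" \
      -e "s|\"SECURITY_GROUP\"|\"$3\"|g" \
      -e "s|\"SUBNET\"|\"$4\"|g" \
      -e "s|\"PEM_KEY_NAME\"|\"$5\"|g" \
      -e "s|\"AMI_ID\"|\"$6\"|g" "$1"
}

# Example usage (values are the sample values from the table above):
#   fill_launch_template ecs_launch_template.json \
#     arn:aws:iam::xxxxxxxxxxx:instance-profile/ecsInstanceRole \
#     sg-xxxxxxxxx subnet-xxxxxxxxx key_name_no_extension \
#     "$(latest_habana_ami us-west-2)" > ecs_launch_template.filled.json
```

Matching the tokens together with their surrounding quotes keeps the substitution from touching JSON keys such as SubnetId. If you use the filled file, pass its name to create-launch-template in step 3 instead of the original file.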
    

Create AWS Batch Compute Environment

A compute environment manages and scales the Amazon EC2 compute resources that run the jobs. Follow the steps below:

  1. Create a dl1_batch_ce.json file with the following configuration, and update the placeholders as described in the table below:

    {
        "computeEnvironmentName": "dl1_mnp_ce",
        "type": "MANAGED",
        "state": "ENABLED",
        "computeResources": {
            "type": "EC2",
            "minvCpus": 0,
            "maxvCpus": 192,
            "desiredvCpus": 0,
            "launchTemplate": {
                "launchTemplateId": "LAUNCH_TEMPLATE",
                "version": "1"
            },
            "instanceTypes": [
                "dl1.24xlarge"
            ],
            "subnets": [
                "SUBNET"
            ],
            "ec2KeyPair": "PEM_KEY_NAME",
            "instanceRole": "INSTANCE_ROLE",
            "tags": {
                "Name": "dl1_mnp_ce"
            }
        },
        "serviceRole": "BATCH_SERVICE_ROLE"
    }
    

    Placeholder           Replace with
    INSTANCE_ROLE         arn:aws:iam::xxxxxxxx:instance-profile/ecsInstanceRole
    LAUNCH_TEMPLATE       lt-xxxxxxxx
    SUBNET                subnet-xxxxxxxxx
    PEM_KEY_NAME          key_name_no_extension
    BATCH_SERVICE_ROLE    arn:aws:iam::xxxxxxxxxxx:role/service-role/AWSBatchServiceRole

Note

Each DL1 instance requires 96 vCPUs, so the maxvCpus field sets the ceiling on how many DL1 instances can run in this compute environment. This example runs 2 DL1 instances, which require 192 vCPUs. The maxvCpus value can be raised after creation to run a larger number of DL1 instances.

  2. Create a compute environment:

    aws batch create-compute-environment --cli-input-json file://dl1_batch_ce.json
    

    The output should look similar to the following:

    {
        "computeEnvironmentName": "dl1_mnp_ce",
        "computeEnvironmentArn": "arn:aws:batch:us-west-2:xxxxxxxxxxxx:compute-environment/dl1_mnp_ce"
    }
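
The vCPU arithmetic from the note above, and the later change to maxvCpus, can be sketched in shell. This is an illustrative sketch under stated assumptions: the helper names max_vcpus_for, ce_status, and raise_max_vcpus are hypothetical, and the 96 vCPUs per DL1 figure is taken from the note. The describe-compute-environments and update-compute-environment calls are standard AWS CLI commands, but check the options against your CLI version.

```shell
#!/usr/bin/env bash
# Sketch only: helper names are hypothetical.

VCPUS_PER_DL1=96   # per-instance vCPU count, as stated in the note above

# Print the maxvCpus value needed to run the given number of DL1 instances.
max_vcpus_for() {
  echo $(( $1 * VCPUS_PER_DL1 ))
}

# Print the status of a compute environment (for example CREATING or VALID).
ce_status() {
  aws batch describe-compute-environments \
    --compute-environments "$1" \
    --query 'computeEnvironments[0].status' --output text
}

# Raise maxvCpus on an existing compute environment.
# Arguments: compute environment name, number of DL1 instances.
raise_max_vcpus() {
  aws batch update-compute-environment \
    --compute-environment "$1" \
    --compute-resources "maxvCpus=$(max_vcpus_for "$2")"
}

# Example usage:
#   max_vcpus_for 2                 # 2 instances * 96 vCPUs = 192
#   ce_status dl1_mnp_ce            # wait for VALID before creating the job queue
#   raise_max_vcpus dl1_mnp_ce 4    # allow up to 4 DL1 instances (384 vCPUs)
```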
    

Create AWS Batch Job Queue

A job queue holds submitted jobs until they are scheduled to run in a compute environment. Follow the steps below:

  1. Create a dl1_batch_jq.json file with the following configuration. The computeEnvironment value must match the name of the compute environment created in the previous section:

    {
        "jobQueueName": "dl1_mnp_jq",
        "state": "ENABLED",
        "priority": 1,
        "computeEnvironmentOrder": [
            {
                "order": 1,
                "computeEnvironment": "dl1_mnp_ce"
            }
        ]
    }
    
  2. Create a job queue:

    aws batch create-job-queue --cli-input-json file://dl1_batch_jq.json
    

    The output should look similar to the following:

    {
        "jobQueueName": "dl1_mnp_jq",
        "jobQueueArn": "arn:aws:batch:us-west-2:xxxxxxxxxxxx:job-queue/dl1_mnp_jq"
    }
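
Before submitting jobs, it can help to confirm that the job queue has finished creating. The following is a minimal polling sketch, assuming the queue name dl1_mnp_jq from the example above. The function name wait_for_job_queue and its retry and delay defaults are hypothetical choices, not values from this guide.

```shell
#!/usr/bin/env bash
# Sketch only: function name and default retry/delay values are hypothetical.

# Poll a job queue until its status is VALID.
# Arguments: queue name, max attempts (default 30), delay in seconds (default 10).
# Prints VALID and returns 0 on success; returns 1 on INVALID or timeout.
wait_for_job_queue() {
  local name="$1" tries="${2:-30}" delay="${3:-10}" status="" i=0
  while [ "$i" -lt "$tries" ]; do
    status="$(aws batch describe-job-queues --job-queues "$name" \
      --query 'jobQueues[0].status' --output text)"
    case "$status" in
      VALID)   echo "$status"; return 0 ;;
      INVALID) echo "$status" >&2; return 1 ;;
    esac
    i=$((i + 1))
    sleep "$delay"
  done
  echo "timed out waiting for $name (last status: $status)" >&2
  return 1
}

# Example usage:
#   wait_for_job_queue dl1_mnp_jq && echo "job queue is ready"
```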