Create AWS Batch Compute Environment¶
AWS Batch consists of a compute environment, a job queue, a job definition, and jobs.
This section focuses on:
Creating an EC2 Launch Template
Creating a Compute Environment
Creating a Job Queue
Create an EC2 Launch Template¶
A Launch Template defines the base configuration for each node in the AWS Batch cluster. To create a launch template, follow the steps below:
Find the latest Habana ECS AMI ID to set up the node:
aws ec2 describe-images --region us-west-2 --filters "Name=name,Values=habanalabs-ecs*" --query 'Images[].{Name: Name, ImageID: ImageId}'
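The command above lists every matching AMI. As a minimal sketch, the `--query` expression can instead sort the results by `CreationDate` and return only the newest image ID. The function name `latest_habana_ami` is hypothetical, and the sketch assumes that the most recently created AMI is the one you want:

```shell
# Hypothetical helper: prints the ID of the most recently created Habana ECS AMI.
# Assumption: "latest" means the image with the newest CreationDate.
latest_habana_ami() {
  region="${1:-us-west-2}"
  aws ec2 describe-images \
    --region "$region" \
    --filters "Name=name,Values=habanalabs-ecs*" \
    --query 'sort_by(Images, &CreationDate)[-1].ImageId' \
    --output text
}

# Example: AMI_ID=$(latest_habana_ami us-west-2)
```

Review the full list from the original command first if you need a specific release rather than the newest image.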
Note
Having more than one EFA network interface prevents allocating a public IP address. To connect to the instance, one option is to assign an Elastic IP address.
Create an ecs_launch_template.json file with the following configuration, and update the placeholders as described in the table below:
{
  "DryRun": false,
  "LaunchTemplateName": "ECS_DL1_EFS",
  "VersionDescription": "Override Template",
  "LaunchTemplateData": {
    "IamInstanceProfile": {
      "Arn": "INSTANCE_PROFILE"
    },
    "BlockDeviceMappings": [
      {
        "DeviceName": "/dev/xvda",
        "Ebs": {
          "VolumeSize": 200,
          "DeleteOnTermination": true
        }
      }
    ],
    "NetworkInterfaces": [
      {
        "AssociatePublicIpAddress": false,
        "DeleteOnTermination": true,
        "DeviceIndex": 0,
        "Groups": ["SECURITY_GROUP"],
        "InterfaceType": "efa",
        "Ipv6AddressCount": 0,
        "SubnetId": "SUBNET",
        "NetworkCardIndex": 0
      },
      {
        "AssociatePublicIpAddress": false,
        "DeleteOnTermination": true,
        "DeviceIndex": 1,
        "Groups": ["SECURITY_GROUP"],
        "InterfaceType": "efa",
        "Ipv6AddressCount": 0,
        "SubnetId": "SUBNET",
        "NetworkCardIndex": 1
      },
      {
        "AssociatePublicIpAddress": false,
        "DeleteOnTermination": true,
        "DeviceIndex": 2,
        "Groups": ["SECURITY_GROUP"],
        "InterfaceType": "efa",
        "Ipv6AddressCount": 0,
        "SubnetId": "SUBNET",
        "NetworkCardIndex": 2
      },
      {
        "AssociatePublicIpAddress": false,
        "DeleteOnTermination": true,
        "DeviceIndex": 3,
        "Groups": ["SECURITY_GROUP"],
        "InterfaceType": "efa",
        "Ipv6AddressCount": 0,
        "SubnetId": "SUBNET",
        "NetworkCardIndex": 3
      }
    ],
    "ImageId": "AMI_ID",
    "KeyName": "PEM_KEY_NAME",
    "Monitoring": {
      "Enabled": true
    },
    "DisableApiTermination": false,
    "InstanceInitiatedShutdownBehavior": "stop",
    "UserData": "",
    "TagSpecifications": [
      {
        "ResourceType": "instance",
        "Tags": [
          { "Key": "purpose", "Value": "batch multinode training" }
        ]
      }
    ],
    "MetadataOptions": {
      "HttpTokens": "required",
      "HttpPutResponseHopLimit": 5,
      "HttpEndpoint": "enabled"
    }
  },
  "TagSpecifications": [
    {
      "ResourceType": "launch-template",
      "Tags": [
        { "Key": "purpose", "Value": "batch training" }
      ]
    }
  ]
}
Placeholder         Replace with
INSTANCE_PROFILE    arn:aws:iam::xxxxxxxxxxx:instance-profile/ecsInstanceRole
SECURITY_GROUP      sg-xxxxxxxxx
SUBNET              subnet-xxxxxxxxx
PEM_KEY_NAME        key_name_no_extension
AMI_ID              ami-xxxxxxxxxxxxxxxxx
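If you prefer not to edit the file by hand, the placeholder substitution can be sketched with `sed`. The function name `fill_launch_template` and the file name `ecs_launch_template.template.json` are hypothetical. The sketch assumes the placeholders appear as quoted JSON strings, as in the configuration above, and that the replacement values contain no `|` or `&` characters:

```shell
# Hypothetical helper: writes the template to stdout with placeholders replaced.
# Arguments: <template-file> <instance-profile-arn> <security-group> <subnet> <key-name> <ami-id>
# Each placeholder is matched together with its surrounding quotes so that
# JSON keys such as "SubnetId" are never touched.
fill_launch_template() {
  sed \
    -e "s|\"INSTANCE_PROFILE\"|\"$2\"|g" \
    -e "s|\"SECURITY_GROUP\"|\"$3\"|g" \
    -e "s|\"SUBNET\"|\"$4\"|g" \
    -e "s|\"PEM_KEY_NAME\"|\"$5\"|g" \
    -e "s|\"AMI_ID\"|\"$6\"|g" \
    "$1"
}

# Example (every value shown is a placeholder, not a real resource):
# fill_launch_template ecs_launch_template.template.json \
#   arn:aws:iam::xxxxxxxxxxx:instance-profile/ecsInstanceRole \
#   sg-xxxxxxxxx subnet-xxxxxxxxx key_name_no_extension ami-xxxxxxxxxxxxxxxxx \
#   > ecs_launch_template.json
```

The output is redirected to a different file than the input because redirecting onto the input file would truncate it before `sed` reads it.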
Create a launch template:
aws ec2 create-launch-template --cli-input-json file://ecs_launch_template.json
The output should look similar to the following:
{
  "LaunchTemplate": {
    "LaunchTemplateId": "lt-xxxxxxxx",
    "LaunchTemplateName": "ECS_DL1_EFS",
    "CreateTime": "2022-09-06T20:58:49.000Z",
    "CreatedBy": "............",
    "DefaultVersionNumber": 1,
    "LatestVersionNumber": 1,
    "Tags": [
      { "Key": "purpose", "Value": "batch training" }
    ]
  }
}
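The `LaunchTemplateId` value is needed for the `LAUNCH_TEMPLATE` placeholder in the compute environment configuration. If the creation output is no longer at hand, a minimal sketch for looking the ID up by name follows. The function name `launch_template_id` is hypothetical, and the default name assumes you kept `ECS_DL1_EFS` from the configuration above:

```shell
# Hypothetical helper: prints the launch template ID for a given template name.
launch_template_id() {
  name="${1:-ECS_DL1_EFS}"
  aws ec2 describe-launch-templates \
    --launch-template-names "$name" \
    --query 'LaunchTemplates[0].LaunchTemplateId' \
    --output text
}

# Example: LAUNCH_TEMPLATE=$(launch_template_id ECS_DL1_EFS)
```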
Create AWS Batch Compute Environment¶
A Compute Environment manages and scales the AWS EC2 compute resources. Follow the steps below:
Create a dl1_batch_ce.json file with the following configuration, and update the placeholders as described in the table below:
{
  "computeEnvironmentName": "dl1_mnp_ce",
  "type": "MANAGED",
  "state": "ENABLED",
  "computeResources": {
    "type": "EC2",
    "minvCpus": 0,
    "maxvCpus": 192,
    "desiredvCpus": 0,
    "launchTemplate": {
      "launchTemplateId": "LAUNCH_TEMPLATE",
      "version": "1"
    },
    "instanceTypes": ["dl1.24xlarge"],
    "subnets": ["SUBNET"],
    "ec2KeyPair": "PEM_KEY_NAME",
    "instanceRole": "INSTANCE_ROLE",
    "tags": {
      "Name": "dl1_mnp_ce"
    }
  },
  "serviceRole": "BATCH_SERVICE_ROLE"
}
Placeholder           Replace with
INSTANCE_ROLE         arn:aws:iam::xxxxxxxx:instance-profile/ecsInstanceRole
LAUNCH_TEMPLATE       lt-xxxxxxxx
SUBNET                subnet-xxxxxxxxx
PEM_KEY_NAME          key_name_no_extension
BATCH_SERVICE_ROLE    arn:aws:iam::xxxxxxxxxxx:role/service-role/AWSBatchServiceRole
Note
Each DL1 instance requires 96 vCPUs. The maxvCpus field sets the ceiling on how many DL1 instances can run in this compute environment. This example runs 2 DL1 instances, which require 192 vCPUs (2 x 96). The maxvCpus value can be raised after creation to run more DL1 instances.
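The vCPU arithmetic in this note can be sketched as a small helper, together with one possible way to raise the ceiling later. The function names `dl1_max_vcpus` and `raise_max_vcpus` are hypothetical. The update call assumes the `--compute-resources` shorthand of `aws batch update-compute-environment` and should be checked against your CLI version before use:

```shell
VCPUS_PER_DL1=96  # vCPUs per DL1 instance, as stated in the note

# Prints the maxvCpus value needed for the given number of DL1 instances.
dl1_max_vcpus() {
  echo $(( $1 * VCPUS_PER_DL1 ))
}

# Hypothetical helper: raises maxvCpus on an existing compute environment
# so that it can hold the given number of DL1 instances.
# Arguments: <compute-environment-name> <number-of-dl1-instances>
raise_max_vcpus() {
  aws batch update-compute-environment \
    --compute-environment "$1" \
    --compute-resources "maxvCpus=$(dl1_max_vcpus "$2")"
}

# Example: dl1_max_vcpus 2   prints 192, the value used in dl1_batch_ce.json
```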
Create a compute environment:
aws batch create-compute-environment --cli-input-json file://dl1_batch_ce.json
The output should look similar to the following:
{
  "computeEnvironmentName": "dl1_mnp_ce",
  "computeEnvironmentArn": "arn:aws:batch:us-west-2:xxxxxxxxxxxx:compute-environment/dl1_mnp_ce"
}
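The create call returns before the compute environment is ready. A minimal sketch for polling until it reports a `VALID` status follows. The function name `wait_for_compute_env`, the 10-second interval, and the default of 30 attempts are assumptions chosen for illustration:

```shell
# Hypothetical helper: polls the compute environment status until it is VALID.
# Arguments: <compute-environment-name> [max-attempts]
# Returns 0 on VALID, 1 on INVALID or when the attempts run out.
wait_for_compute_env() {
  ce_name="$1"
  max_tries="${2:-30}"
  i=0
  while [ "$i" -lt "$max_tries" ]; do
    status=$(aws batch describe-compute-environments \
      --compute-environments "$ce_name" \
      --query 'computeEnvironments[0].status' \
      --output text)
    case "$status" in
      VALID)   echo "VALID"; return 0 ;;
      INVALID) echo "INVALID" >&2; return 1 ;;
    esac
    i=$((i + 1))
    sleep 10
  done
  echo "timed out waiting for $ce_name" >&2
  return 1
}

# Example: wait_for_compute_env dl1_mnp_ce
```

Waiting for `VALID` before creating the job queue avoids referencing a compute environment that is still being created.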
Create AWS Job Queue¶
A Job Queue holds submitted jobs until they are scheduled to run in a compute environment. Follow the steps below:
Create a dl1_batch_jq.json file with the following configuration. If you changed the compute environment name in the previous section, update the computeEnvironment value to match:
{
  "jobQueueName": "dl1_mnp_jq",
  "state": "ENABLED",
  "priority": 1,
  "computeEnvironmentOrder": [
    {
      "order": 1,
      "computeEnvironment": "dl1_mnp_ce"
    }
  ]
}
Create a job queue:
aws batch create-job-queue --cli-input-json file://dl1_batch_jq.json
The output should look similar to the following:
{
  "jobQueueName": "dl1_mnp_jq",
  "jobQueueArn": "arn:aws:batch:us-west-2:xxxxxxxxxxxx:job-queue/dl1_mnp_jq"
}