Create AWS Batch Compute Environment
AWS Batch consists of four components: compute environments, job queues, job definitions, and jobs.
This section focuses on:
Creating an EC2 Launch Template
Creating a Compute Environment
Creating a Job Queue
Create an EC2 Launch Template
A launch template defines the base configuration for each node in the AWS Batch cluster. To create a launch template, follow the steps below:
Find the latest Habana ECS AMI ID to set up the node:
aws ec2 describe-images --region us-west-2 --filters "Name=name,Values=habanalabs-ecs*" --query 'Images[].{Name: Name, ImageID: ImageId}'
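If the filter matches several images, the newest one can be selected by sorting on the creation date. The following is a minimal sketch, assuming the most recently created image matching habanalabs-ecs* is the one you want. The function name latest_habana_ami is hypothetical and not part of any Habana or AWS tooling.

```shell
# Hypothetical helper: print the ID of the newest image whose name matches
# habanalabs-ecs*. Assumes the most recently created image is the desired one.
# The region defaults to us-west-2, as in the command above.
latest_habana_ami() {
    aws ec2 describe-images \
        --region "${1:-us-west-2}" \
        --filters "Name=name,Values=habanalabs-ecs*" \
        --query 'sort_by(Images, &CreationDate)[-1].ImageId' \
        --output text
}
```

For example, AMI_ID=$(latest_habana_ami us-west-2) stores the result for use as the AMI_ID placeholder in the next step.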
Note
An instance launched with more than one network interface, such as the four EFA interfaces configured below, cannot be assigned a public IP address automatically. To connect to the instance, one option is to assign an Elastic IP address.
Create an ecs_launch_template.json file with the following configuration and update the placeholders:
{
  "DryRun": false,
  "LaunchTemplateName": "ECS_DL1_EFS",
  "VersionDescription": "Override Template",
  "LaunchTemplateData": {
    "IamInstanceProfile": {
      "Arn": "INSTANCE_PROFILE"
    },
    "BlockDeviceMappings": [
      {
        "DeviceName": "/dev/xvda",
        "Ebs": {
          "VolumeSize": 200,
          "DeleteOnTermination": true
        }
      }
    ],
    "NetworkInterfaces": [
      {
        "AssociatePublicIpAddress": false,
        "DeleteOnTermination": true,
        "DeviceIndex": 0,
        "Groups": [
          "SECURITY_GROUP"
        ],
        "InterfaceType": "efa",
        "Ipv6AddressCount": 0,
        "SubnetId": "SUBNET",
        "NetworkCardIndex": 0
      },
      {
        "AssociatePublicIpAddress": false,
        "DeleteOnTermination": true,
        "DeviceIndex": 1,
        "Groups": [
          "SECURITY_GROUP"
        ],
        "InterfaceType": "efa",
        "Ipv6AddressCount": 0,
        "SubnetId": "SUBNET",
        "NetworkCardIndex": 1
      },
      {
        "AssociatePublicIpAddress": false,
        "DeleteOnTermination": true,
        "DeviceIndex": 2,
        "Groups": [
          "SECURITY_GROUP"
        ],
        "InterfaceType": "efa",
        "Ipv6AddressCount": 0,
        "SubnetId": "SUBNET",
        "NetworkCardIndex": 2
      },
      {
        "AssociatePublicIpAddress": false,
        "DeleteOnTermination": true,
        "DeviceIndex": 3,
        "Groups": [
          "SECURITY_GROUP"
        ],
        "InterfaceType": "efa",
        "Ipv6AddressCount": 0,
        "SubnetId": "SUBNET",
        "NetworkCardIndex": 3
      }
    ],
    "ImageId": "AMI_ID",
    "KeyName": "PEM_KEY_NAME",
    "Monitoring": {
      "Enabled": true
    },
    "DisableApiTermination": false,
    "InstanceInitiatedShutdownBehavior": "stop",
    "UserData": "",
    "TagSpecifications": [
      {
        "ResourceType": "instance",
        "Tags": [
          {
            "Key": "purpose",
            "Value": "batch multinode training"
          }
        ]
      }
    ],
    "MetadataOptions": {
      "HttpTokens": "required",
      "HttpPutResponseHopLimit": 5,
      "HttpEndpoint": "enabled"
    }
  },
  "TagSpecifications": [
    {
      "ResourceType": "launch-template",
      "Tags": [
        {
          "Key": "purpose",
          "Value": "batch training"
        }
      ]
    }
  ]
}
Placeholder | Replace with
---|---
INSTANCE_PROFILE | arn:aws:iam::xxxxxxxxxxx:instance-profile/ecsInstanceRole
SECURITY_GROUP | sg-xxxxxxxxx
SUBNET | subnet-xxxxxxxxx
PEM_KEY_NAME | key_name_no_extension
AMI_ID | ami-xxxxxxxxxxxxxxxxx
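Editing the file by hand works, but the substitution can also be scripted. The following is a minimal sketch, assuming sed is available and that the replacement values are held in shell variables named after the placeholders. The function name fill_placeholders and the output file name in the usage line are hypothetical.

```shell
# Hypothetical helper: substitute the placeholders in a template file and
# write the result to stdout. Values come from shell variables named after
# the placeholders. Only quoted placeholders are replaced, so keys such as
# "SubnetId" are left alone. Assumes no value contains "|", "&" or "\".
fill_placeholders() {
    sed -e "s|\"INSTANCE_PROFILE\"|\"${INSTANCE_PROFILE}\"|g" \
        -e "s|\"SECURITY_GROUP\"|\"${SECURITY_GROUP}\"|g" \
        -e "s|\"SUBNET\"|\"${SUBNET}\"|g" \
        -e "s|\"PEM_KEY_NAME\"|\"${PEM_KEY_NAME}\"|g" \
        -e "s|\"AMI_ID\"|\"${AMI_ID}\"|g" \
        "$1"
}
```

After setting the five variables, fill_placeholders ecs_launch_template.json > ecs_launch_template.filled.json writes a copy with the values filled in. If you use this approach, pass the filled copy to the AWS CLI instead of the original file.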
Run the following AWS CLI command to create the launch template:
aws ec2 create-launch-template --cli-input-json file://ecs_launch_template.json
The output should look similar to the following:
{
  "LaunchTemplate": {
    "LaunchTemplateId": "lt-xxxxxxxx",
    "LaunchTemplateName": "ECS_DL1_EFS",
    "CreateTime": "2022-09-06T20:58:49.000Z",
    "CreatedBy": "............",
    "DefaultVersionNumber": 1,
    "LatestVersionNumber": 1,
    "Tags": [
      {
        "Key": "purpose",
        "Value": "batch training"
      }
    ]
  }
}
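The LaunchTemplateId from this output is needed when configuring the compute environment. If the output is no longer at hand, the ID can be looked up by template name. The following is a minimal sketch; the function name launch_template_id is hypothetical.

```shell
# Hypothetical helper: print the ID of a launch template given its name.
launch_template_id() {
    aws ec2 describe-launch-templates \
        --launch-template-names "$1" \
        --query 'LaunchTemplates[0].LaunchTemplateId' \
        --output text
}
```

For example, LAUNCH_TEMPLATE=$(launch_template_id ECS_DL1_EFS) stores the ID of the template created above.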
Create AWS Batch Compute Environment
A compute environment manages and scales the EC2 compute resources that run your jobs. Follow the steps below:
Create a dl1_batch_ce.json file with the following configuration and update the placeholders:
{
  "computeEnvironmentName": "dl1_mnp_ce",
  "type": "MANAGED",
  "state": "ENABLED",
  "computeResources": {
    "type": "EC2",
    "minvCpus": 0,
    "maxvCpus": 192,
    "desiredvCpus": 0,
    "launchTemplate": {
      "launchTemplateId": "LAUNCH_TEMPLATE",
      "version": "1"
    },
    "instanceTypes": [
      "dl1.24xlarge"
    ],
    "subnets": [
      "SUBNET"
    ],
    "ec2KeyPair": "PEM_KEY_NAME",
    "instanceRole": "INSTANCE_ROLE",
    "tags": {
      "Name": "dl1_mnp_ce"
    }
  },
  "serviceRole": "BATCH_SERVICE_ROLE"
}
Placeholder | Replace with
---|---
INSTANCE_ROLE | arn:aws:iam::xxxxxxxx:instance-profile/ecsInstanceRole
LAUNCH_TEMPLATE | lt-xxxxxxxx
SUBNET | subnet-xxxxxxxxx
PEM_KEY_NAME | key_name_no_extension
BATCH_SERVICE_ROLE | arn:aws:iam::xxxxxxxxxxx:role/service-role/AWSBatchServiceRole
Note
Each DL1 instance requires 96 vCPUs. The maxvCpus field caps the number of DL1 instances that can run in this compute environment. This example runs two DL1 instances, which requires 192 vCPUs. You can raise maxvCpus after creation to run more DL1 instances.
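The arithmetic in this note, and the later update it mentions, can be sketched as follows. This is an illustration under stated assumptions, not a required step: the function names max_vcpus_for and scale_compute_environment are hypothetical, and the update uses the aws batch update-compute-environment command with the shorthand maxvCpus=N form.

```shell
# vCPUs needed per DL1 instance, as stated in the note above.
DL1_VCPUS=96

# Hypothetical helper: print the maxvCpus value needed for N DL1 instances.
max_vcpus_for() {
    echo $(( $1 * DL1_VCPUS ))
}

# Hypothetical helper: raise the vCPU ceiling of an existing compute
# environment ($1) so that it can run N DL1 instances ($2).
scale_compute_environment() {
    aws batch update-compute-environment \
        --compute-environment "$1" \
        --compute-resources "maxvCpus=$(max_vcpus_for "$2")"
}
```

For example, max_vcpus_for 2 gives the 192 used in this configuration, and scale_compute_environment dl1_mnp_ce 4 would request a ceiling of 384 vCPUs once the environment exists.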
Run the following AWS CLI command to create the compute environment:
aws batch create-compute-environment --cli-input-json file://dl1_batch_ce.json
The output should look similar to the following:
{
  "computeEnvironmentName": "dl1_mnp_ce",
  "computeEnvironmentArn": "arn:aws:batch:us-west-2:xxxxxxxxxxxx:compute-environment/dl1_mnp_ce"
}
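The create call returns before the environment is ready, so it can help to wait until its status is VALID before attaching a job queue. The following is a minimal polling sketch. The function name wait_for_compute_environment, the POLL_SECONDS variable, and the default of 30 attempts are all assumptions made for illustration.

```shell
# Hypothetical helper: poll until the compute environment ($1) reports VALID.
# Returns non-zero if it becomes INVALID or the attempts ($2, default 30)
# run out. POLL_SECONDS (default 10) sets the delay between attempts.
wait_for_compute_environment() {
    name="$1"
    attempts="${2:-30}"
    while [ "$attempts" -gt 0 ]; do
        status=$(aws batch describe-compute-environments \
            --compute-environments "$name" \
            --query 'computeEnvironments[0].status' \
            --output text)
        case "$status" in
            VALID) return 0 ;;
            INVALID) echo "compute environment $name is INVALID" >&2; return 1 ;;
        esac
        attempts=$(( attempts - 1 ))
        sleep "${POLL_SECONDS:-10}"
    done
    echo "timed out waiting for $name" >&2
    return 1
}
```

For example, wait_for_compute_environment dl1_mnp_ce blocks until the environment created above is usable or reports a failure.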
Create AWS Batch Job Queue
A job queue holds submitted jobs until they are scheduled to run in a compute environment. Follow the steps below:
Create a dl1_batch_jq.json file with the following configuration. The computeEnvironment value must match the name of the compute environment created above:
{
  "jobQueueName": "dl1_mnp_jq",
  "state": "ENABLED",
  "priority": 1,
  "computeEnvironmentOrder": [
    {
      "order": 1,
      "computeEnvironment": "dl1_mnp_ce"
    }
  ]
}
Run the following AWS CLI command to create the job queue:
aws batch create-job-queue --cli-input-json file://dl1_batch_jq.json
The output should look similar to the following:
{
  "jobQueueName": "dl1_mnp_jq",
  "jobQueueArn": "arn:aws:batch:us-west-2:xxxxxxxxxxxx:job-queue/dl1_mnp_jq"
}
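Before submitting jobs, you may want to confirm that the queue is enabled and valid. The following is a minimal sketch; the function name job_queue_status is hypothetical.

```shell
# Hypothetical helper: print the state and status fields of a job queue.
# A ready queue is expected to report ENABLED and VALID.
job_queue_status() {
    aws batch describe-job-queues \
        --job-queues "$1" \
        --query 'jobQueues[0].[state, status]' \
        --output text
}
```

For example, job_queue_status dl1_mnp_jq reports on the queue created above.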