Create and Submit AWS Batch Job

Job definitions are the blueprints for how jobs run. This section covers:

  • Creating a Job Definition to run MNIST training

  • Submitting a job and viewing the results

Create AWS Batch Job Definition

Job definitions specify how jobs run. Many job definition parameters can be overridden at submission time. Follow the steps below:

  1. Create dl1_batch_jd.json with the following configuration and update the placeholders:

{
    "jobDefinitionName": "dl1_mnist_batch_jd",
    "type": "multinode",
    "nodeProperties": {
        "numNodes": 2,
        "mainNode": 0,
        "nodeRangeProperties": [
            {
                "targetNodes": "0:",
                "container": {
                    "image": "IMAGE_NAME",
                    "command": [],
                    "jobRoleArn": "TASK_EXEC_ROLE",
                    "resourceRequirements": [
                        {
                            "type": "MEMORY",
                            "value": "760000"
                        },
                        {
                            "type": "VCPU",
                            "value": "96"
                        }
                    ],
                    "mountPoints": [],
                    "volumes": [],
                    "environment": [],
                    "ulimits": [],
                    "instanceType": "dl1.24xlarge",
                    "linuxParameters": {
                        "devices": [
                            {
                                "hostPath": "/dev/infiniband/uverbs0",
                                "containerPath": "/dev/infiniband/uverbs0",
                                "permissions": [
                                    "READ",
                                    "WRITE",
                                    "MKNOD"
                                ]
                            },
                            {
                                "hostPath": "/dev/infiniband/uverbs1",
                                "containerPath": "/dev/infiniband/uverbs1",
                                "permissions": [
                                    "READ",
                                    "WRITE",
                                    "MKNOD"
                                ]
                            },
                            {
                                "hostPath": "/dev/infiniband/uverbs2",
                                "containerPath": "/dev/infiniband/uverbs2",
                                "permissions": [
                                    "READ",
                                    "WRITE",
                                    "MKNOD"
                                ]
                            },
                            {
                                "hostPath": "/dev/infiniband/uverbs3",
                                "containerPath": "/dev/infiniband/uverbs3",
                                "permissions": [
                                    "READ",
                                    "WRITE",
                                    "MKNOD"
                                ]
                            }
                        ]
                    },
                    "privileged": true
                }
            }
        ]
    }
}

Placeholder       Replace with
IMAGE_NAME        xxxxxxxxx.dkr.ecr.us-west-2.amazonaws.com/dl1_batch_training:v1
TASK_EXEC_ROLE    arn:aws:iam::xxxxxxx:role/ecsTaskExecutionRole

  2. Run the following AWS CLI command to register the job definition:

aws batch register-job-definition --cli-input-json file://dl1_batch_jd.json

# Expected Results
{
    "jobDefinitionName": "dl1_mnist_batch_jd",
    "jobDefinitionArn": "arn:aws:batch:us-west-2:xxxxxxxxxxxx:job-definition/dl1_mnist_batch_jd:1",
    "revision": 1
}
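The placeholder substitution in step 1 can also be scripted. The following is a minimal sketch, assuming the file still contains the literal strings IMAGE_NAME and TASK_EXEC_ROLE. The function name fill_placeholders and the output filename in the usage comment are hypothetical and not part of the original instructions.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print TEMPLATE with both placeholders substituted.
# Usage: fill_placeholders TEMPLATE IMAGE ROLE_ARN > OUTPUT
fill_placeholders() {
    local template="$1" image="$2" role_arn="$3"
    # '|' is used as the sed delimiter because both values contain '/'.
    sed -e "s|IMAGE_NAME|${image}|g" \
        -e "s|TASK_EXEC_ROLE|${role_arn}|g" \
        "$template"
}

# Example (values are the masked samples from the table in step 1):
#   fill_placeholders dl1_batch_jd.json \
#       "xxxxxxxxx.dkr.ecr.us-west-2.amazonaws.com/dl1_batch_training:v1" \
#       "arn:aws:iam::xxxxxxx:role/ecsTaskExecutionRole" > dl1_batch_jd.filled.json
#   python3 -m json.tool dl1_batch_jd.filled.json > /dev/null   # optional JSON syntax check
```

Writing to a separate file keeps the template reusable. If you take this route, pass the filled file to register-job-definition instead of the template.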

Submit AWS Batch Job

Run the following AWS CLI command to submit the job:

aws batch submit-job --job-name dl1_mnp_batch --job-definition dl1_mnist_batch_jd --job-queue dl1_mnp_jq --node-overrides numNodes=2

# Expected Results
{
    "jobArn": "arn:aws:batch:us-west-2:xxxxxxxxxxxx:job/a434b6e9-5fda-415d-befb-079b04c95a97",
    "jobName": "dl1_mnp_batch",
    "jobId": "a434b6e9-5fda-415d-befb-079b04c95a97"
}
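To follow a job after submission without opening the console, its status can be polled from the CLI. The following is a minimal sketch, assuming the aws CLI is configured for the same account and region. wait_for_job is a hypothetical helper name, and the 30-second default interval is an arbitrary choice.

```shell
#!/usr/bin/env bash
# Hypothetical helper: poll a Batch job until it reaches a terminal state.
# Usage: wait_for_job JOB_ID [INTERVAL_SECONDS]
# Returns 0 on SUCCEEDED, 1 on FAILED, 2 if the describe-jobs call itself fails.
wait_for_job() {
    local job_id="$1" interval="${2:-30}" status
    while true; do
        status=$(aws batch describe-jobs --jobs "$job_id" \
            --query 'jobs[0].status' --output text) || return 2
        echo "status: ${status}"
        case "$status" in
            SUCCEEDED) return 0 ;;
            FAILED)    return 1 ;;
        esac
        sleep "$interval"
    done
}

# Example: capture the jobId at submission time, then wait on it.
#   JOB_ID=$(aws batch submit-job --job-name dl1_mnp_batch \
#       --job-definition dl1_mnist_batch_jd --job-queue dl1_mnp_jq \
#       --node-overrides numNodes=2 --query jobId --output text)
#   wait_for_job "$JOB_ID"
```

The loop has no timeout, so a job that stays in RUNNABLE (for example, because the compute environment cannot scale) keeps it polling until interrupted.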

Note

Jobs can also be submitted, and their status viewed, through the AWS Batch console.

Observe Submitted AWS Batch Job Logs

AWS Batch sends each job's logs to CloudWatch Logs. Follow "View Log Data Sent to CloudWatch Logs" for specific instructions.
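The same logs can also be read from the command line. The following is a rough sketch, assuming the job uses the default awslogs configuration (log group /aws/batch/job) and that each node of a multi-node job is addressed as JOB_ID#NODE_INDEX. print_node_logs is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the log messages of one node of a multi-node job.
# Usage: print_node_logs JOB_ID [NODE_INDEX]
print_node_logs() {
    local job_id="$1" node="${2:-0}" stream
    # Look up the CloudWatch log stream attached to the node's container.
    stream=$(aws batch describe-jobs --jobs "${job_id}#${node}" \
        --query 'jobs[0].container.logStreamName' --output text) || return 1
    # Assumption: default Batch log group. Only the first page of events is returned.
    aws logs get-log-events \
        --log-group-name /aws/batch/job \
        --log-stream-name "$stream" \
        --start-from-head \
        --query 'events[*].[message]' --output text
}

# Example: print_node_logs a434b6e9-5fda-415d-befb-079b04c95a97 0
```

For long training runs the output is truncated to a single page of events, so the console view remains the more convenient option for browsing complete logs.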