Create and Submit AWS Batch Job

Job definitions are blueprints for how jobs run. This section covers:

  • Creating a job definition to run MNIST training

  • Submitting a job and viewing the results

Create AWS Batch Job Definition

Job definitions specify how jobs run. Many job definition parameters can be overridden at runtime or at submission. Follow the steps below:

  1. Create dl1_batch_jd.json with the following configuration, and update the placeholders as described in the table below:

    {
        "jobDefinitionName": "dl1_mnist_batch_jd",
        "type": "multinode",
        "nodeProperties": {
            "numNodes": 2,
            "mainNode": 0,
            "nodeRangeProperties": [
                {
                    "targetNodes": "0:",
                    "container": {
                        "image": "IMAGE_NAME",
                        "command": [],
                        "jobRoleArn": "TASK_EXEC_ROLE",
                        "resourceRequirements": [
                            {
                                "type": "MEMORY",
                                "value": "760000"
                            },
                            {
                                "type": "VCPU",
                                "value": "96"
                            }
                        ],
                        "mountPoints": [],
                        "volumes": [],
                        "environment": [],
                        "ulimits": [],
                        "instanceType": "dl1.24xlarge",
                        "linuxParameters": {
                            "devices": [
                                {
                                    "hostPath": "/dev/infiniband/uverbs0",
                                    "containerPath": "/dev/infiniband/uverbs0",
                                    "permissions": [
                                        "READ",
                                        "WRITE",
                                        "MKNOD"
                                    ]
                                },
                                {
                                    "hostPath": "/dev/infiniband/uverbs1",
                                    "containerPath": "/dev/infiniband/uverbs1",
                                    "permissions": [
                                        "READ",
                                        "WRITE",
                                        "MKNOD"
                                    ]
                                },
                                {
                                    "hostPath": "/dev/infiniband/uverbs2",
                                    "containerPath": "/dev/infiniband/uverbs2",
                                    "permissions": [
                                        "READ",
                                        "WRITE",
                                        "MKNOD"
                                    ]
                                },
                                {
                                    "hostPath": "/dev/infiniband/uverbs3",
                                    "containerPath": "/dev/infiniband/uverbs3",
                                    "permissions": [
                                        "READ",
                                        "WRITE",
                                        "MKNOD"
                                    ]
                                }
                            ],
                            "sharedMemorySize": 258048
                        },
                        "privileged": true
                    }
                }
            ]
        }
    }
    

    Placeholder       Replace with
    IMAGE_NAME        xxxxxxxxx.dkr.ecr.us-west-2.amazonaws.com/dl1_batch_training:v1
    TASK_EXEC_ROLE    arn:aws:iam::xxxxxxx:role/ecsTaskExecutionRole

  2. Register the job definition:

    aws batch register-job-definition --cli-input-json file://dl1_batch_jd.json
    
    # Expected Results
    {
        "jobDefinitionName": "dl1_mnist_batch_jd",
        "jobDefinitionArn": "arn:aws:batch:us-west-2:xxxxxxxxxxxx:job-definition/dl1_mnist_batch_jd:1",
        "revision": 1
    }
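
The placeholder substitution in step 1 can also be scripted, which makes it easy to check the file for JSON syntax errors before registering it in step 2. The following is a minimal sketch, not part of the original procedure: fill_job_definition is a hypothetical helper name, the output filename is arbitrary, and the validation step assumes python3 is available on the machine.

```shell
# Sketch: substitute the two placeholders in a job definition file and
# check that the result is valid JSON before registering it.
# fill_job_definition is a hypothetical helper, not part of the AWS CLI.

fill_job_definition() {
    local template="$1"   # file containing IMAGE_NAME and TASK_EXEC_ROLE
    local image="$2"      # container image URI
    local role_arn="$3"   # IAM role ARN
    local output="$4"     # where to write the filled-in definition

    # '|' is the sed delimiter because image URIs and ARNs contain '/'.
    sed -e "s|IMAGE_NAME|${image}|g" \
        -e "s|TASK_EXEC_ROLE|${role_arn}|g" \
        "$template" > "$output" || return 1

    # Fails on syntax errors such as a trailing comma (assumes python3).
    python3 -m json.tool "$output" > /dev/null
}

# Example usage, with the example values listed for the placeholders:
# fill_job_definition dl1_batch_jd.json \
#     "xxxxxxxxx.dkr.ecr.us-west-2.amazonaws.com/dl1_batch_training:v1" \
#     "arn:aws:iam::xxxxxxx:role/ecsTaskExecutionRole" \
#     dl1_batch_jd.filled.json \
#   && aws batch register-job-definition \
#        --cli-input-json file://dl1_batch_jd.filled.json
```

Because the helper returns a non-zero status when validation fails, chaining it with && keeps a malformed file from reaching register-job-definition.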
    

Submit AWS Batch Job

To submit a job, run the following command:

aws batch submit-job --job-name dl1_mnp_batch --job-definition dl1_mnist_batch_jd --job-queue dl1_mnp_jq --node-overrides numNodes=2

# Expected Results
{
    "jobArn": "arn:aws:batch:us-west-2:xxxxxxxxxxxx:job/a434b6e9-5fda-415d-befb-079b04c95a97",
    "jobName": "dl1_mnp_batch",
    "jobId": "a434b6e9-5fda-415d-befb-079b04c95a97"
}
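
After submission, the job moves through several states before it finishes. As a rough sketch of how to follow it from the command line, the helper below polls aws batch describe-jobs until the job reports SUCCEEDED or FAILED. The name wait_for_batch_job and the 30-second default interval are assumptions made for illustration.

```shell
# Sketch: poll a Batch job until it reaches a terminal state.
# wait_for_batch_job is a hypothetical helper; the default interval is arbitrary.

wait_for_batch_job() {
    local job_id="$1"
    local interval="${2:-30}"   # seconds between polls
    local status

    while true; do
        status=$(aws batch describe-jobs --jobs "$job_id" \
            --query 'jobs[0].status' --output text) || return 2
        echo "$(date '+%H:%M:%S') ${job_id}: ${status}"
        case "$status" in
            SUCCEEDED) return 0 ;;
            FAILED)    return 1 ;;
            None)      return 2 ;;   # no job with this ID was returned
        esac
        sleep "$interval"
    done
}

# Example usage: capture the job ID at submission, then wait on it.
# job_id=$(aws batch submit-job --job-name dl1_mnp_batch \
#     --job-definition dl1_mnist_batch_jd --job-queue dl1_mnp_jq \
#     --node-overrides numNodes=2 --query 'jobId' --output text)
# wait_for_batch_job "$job_id"
```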

Note

Jobs can also be submitted, and their status viewed, through the AWS Batch console.

Observe Submitted AWS Batch Job Logs

AWS Batch writes each job's log to CloudWatch Logs. For specific instructions, see View Log Data Sent to CloudWatch Logs.
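
The logs can also be read from the command line. The sketch below makes two assumptions that should be checked against your setup: that each node of a multi-node job is addressable as "<jobId>#<nodeIndex>", and that the job logs to the default Batch log group /aws/batch/job. show_node_logs is a hypothetical helper name.

```shell
# Sketch: print the CloudWatch log messages for one node of a multi-node job.
# Assumptions: child node jobs are addressed as "<jobId>#<nodeIndex>", and the
# job uses the default Batch log group /aws/batch/job.

show_node_logs() {
    local job_id="$1"
    local node_index="${2:-0}"   # defaults to the main node
    local stream

    # Look up the log stream name recorded for this node's container.
    stream=$(aws batch describe-jobs --jobs "${job_id}#${node_index}" \
        --query 'jobs[0].container.logStreamName' --output text) || return 1

    if [ -z "$stream" ] || [ "$stream" = "None" ]; then
        echo "No log stream found for ${job_id}#${node_index}" >&2
        return 1
    fi

    # Print one log message per line, oldest first.
    aws logs get-log-events \
        --log-group-name /aws/batch/job \
        --log-stream-name "$stream" \
        --start-from-head \
        --query 'events[*].[message]' --output text
}

# Example usage with the job ID returned by submit-job:
# show_node_logs a434b6e9-5fda-415d-befb-079b04c95a97 0
```

A log stream typically appears only once the container has started, so the helper reports an error rather than an empty log for jobs that are still queued.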