Scalable AI Workload Orchestration with AWS Batch

Question

Pulumi · Accepted Answer

AWS Batch is a cloud service that enables you to easily and efficiently run hundreds of thousands of batch computing jobs on AWS. It dynamically provisions the optimal quantity and type of compute resources (e.g., CPU or memory-optimized instances) based on the volume and specific resource requirements of the batch jobs submitted.

With AWS Batch, you can set up a scalable and cost-effective computing environment without having to manage the underlying infrastructure. It takes care of scheduling and executing your batch jobs across a range of AWS compute services, such as AWS Fargate or Amazon EC2.

Here's an overview of how we'll set up a scalable AI workload orchestration using AWS Batch with Pulumi:

1. **Compute Environment**: This is where your jobs will run. You can choose managed environments that are maintained by AWS Batch, or unmanaged ones that you control.
2. **Job Queue**: This is where jobs are submitted. They reside here until computing resources to execute the job become available.
3. **Job Definitions**: These are templates for your batch jobs, specifying how jobs are to run (e.g., which Docker image to use, what commands to run, what memory and CPU requirements are needed, etc.).
4. **Scheduling Policy** (if necessary): Controls how a job queue shares compute among jobs through fair-share scheduling, instead of running them strictly first-in, first-out.

Now, let's create a Pulumi program that sets up these AWS Batch components using the `pulumi_aws` package.

```python
import pulumi
import pulumi_aws as aws

# Create an IAM role that AWS Batch can assume
batch_service_role = aws.iam.Role("batch_service_role",
    assume_role_policy="""{
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {
                "Service": "batch.amazonaws.com"
            },
            "Action": "sts:AssumeRole"
        }]
    }""")

# Attach the AWS managed Batch service policy to the role
batch_service_policy_attachment = aws.iam.RolePolicyAttachment("batch_service_policy_attachment",
    role=batch_service_role.name,
    policy_arn="arn:aws:iam::aws:policy/service-role/AWSBatchServiceRole")

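# Create an IAM role and instance profile for the EC2 container instances
# (EC2 compute environments require an instance role)
ecs_instance_role = aws.iam.Role("ecs_instance_role",
    assume_role_policy="""{
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {
                "Service": "ec2.amazonaws.com"
            },
            "Action": "sts:AssumeRole"
        }]
    }""")

ecs_instance_policy_attachment = aws.iam.RolePolicyAttachment("ecs_instance_policy_attachment",
    role=ecs_instance_role.name,
    policy_arn="arn:aws:iam::aws:policy/service-role/AmazonEC2ContainerServiceforEC2Role")

ecs_instance_profile = aws.iam.InstanceProfile("ecs_instance_profile",
    role=ecs_instance_role.name)

# Create a Compute Environment
compute_environment = aws.batch.ComputeEnvironment("compute_environment",
    service_role=batch_service_role.arn,
    type="MANAGED",
    compute_resources={
        "type": "EC2",
        "min_vcpus": 0,
        "max_vcpus": 100,
        "instance_types": ["c5.large"],
        "instance_role": ecs_instance_profile.arn,
        "subnets": ["subnet-04567c43b58cd6e18"],  # Replace this with your actual subnet
        "security_group_ids": ["sg-09a1234daac12345d"],  # Replace this with your actual security group id
        "allocation_strategy": "BEST_FIT"
    },
    # Wait for the service policy so Batch has its permissions when the environment is created
    opts=pulumi.ResourceOptions(depends_on=[batch_service_policy_attachment]))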

# Create a Job Queue
# Note: older provider releases instead take a plain list of ARNs via `compute_environments`
job_queue = aws.batch.JobQueue("job_queue",
    state="ENABLED",
    priority=1,
    compute_environment_orders=[
        {
            "order": 1,
            "compute_environment": compute_environment.arn
        }
    ])

# Create a Job Definition
# Replace "my-docker-image" with the URI of your Docker image
# (container_properties must be valid JSON, so it cannot contain comments)
job_definition = aws.batch.JobDefinition("job_definition",
    type="container",
    container_properties="""{
        "image": "my-docker-image",
        "vcpus": 1,
        "memory": 512
    }""")

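# Optional: create a Scheduling Policy
# "default" is a placeholder share identifier (a short alphanumeric label, not an ARN)
# To apply the policy, pass scheduling_policy.arn as scheduling_policy_arn on the job queue
scheduling_policy = aws.batch.SchedulingPolicy("scheduling_policy",
    fair_share_policy={
        "share_decay_seconds": 3600,
        "compute_reservation": 1,
        "share_distributions": [{"share_identifier": "default", "weight_factor": 1.0}]
    })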

# Export the ARNs of the created resources
pulumi.export('compute_environment_arn', compute_environment.arn)
pulumi.export('job_queue_arn', job_queue.arn)
pulumi.export('job_definition_arn', job_definition.arn)
pulumi.export('scheduling_policy_arn', scheduling_policy.arn)
```

In the code above, we defined the AWS Batch components along with the IAM resources they depend on. Let’s break down each section:

- **IAM Role**: We created an IAM role that AWS Batch can use to interact with other AWS services. We attached the AWS managed policy for Batch services to this role.
- **Compute Environment**: We defined a managed compute environment where our jobs will run. The environment uses EC2 instances, and it can scale from 0 to 100 vCPUs based on job demand. The `BEST_FIT` allocation strategy tells AWS Batch to pick the instance type that best fits each job's requirements, preferring the lowest-cost option.
- **Job Queue**: We set up a job queue and linked it to our compute environment. Jobs in this queue will run on the compute resources provided by our compute environment.
- **Job Definition**: We defined a job template providing details about how jobs should be executed. This includes the Docker image to run and the resources needed per job.
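- **Scheduling Policy**: This is an optional resource that helps prioritize jobs. We used a fair share policy, which balances compute time across share identifiers (labels that jobs supply at submission) rather than running jobs strictly in arrival order. A scheduling policy only takes effect once it is attached to a job queue.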

Remember to replace placeholder values such as subnets, security group IDs, and Docker image URIs with actual values from your setup.
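The job definition's container properties must be valid JSON, which is easy to break when the string is written by hand. As a minimal sketch, the same properties can be built from a Python dict and serialized with `json.dumps`. The helper name `container_properties`, the image URI, and the command below are hypothetical and only for illustration:

```python
import json


def container_properties(image, vcpus, memory_mib, command=None):
    """Return a JSON string with the container settings for a Batch job definition."""
    props = {"image": image, "vcpus": vcpus, "memory": memory_mib}
    if command:
        # The command is optional; when omitted, the image's default command runs.
        props["command"] = list(command)
    return json.dumps(props)


# Hypothetical image URI and command, for illustration only.
props_json = container_properties(
    "123456789012.dkr.ecr.us-east-1.amazonaws.com/trainer:latest",
    vcpus=1,
    memory_mib=512,
    command=["python", "train.py"],
)
print(props_json)
```

The resulting string could then be passed as `container_properties` to `aws.batch.JobDefinition` in place of the inline JSON literal.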

When the above program is executed with Pulumi, it will provision these resources in your AWS account, setting the stage for you to submit and run AI workloads at scale with AWS Batch. To run a workload, you submit a job to the job queue that references the `job_definition` created in this setup.
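Job submission happens outside the Pulumi program, for example through the AWS SDK. The sketch below assumes `boto3` is installed and AWS credentials are configured; the job name, command, environment variable, and the two ARN placeholders are hypothetical and would come from your own stack outputs (`job_queue_arn` and `job_definition_arn`). The helper names `build_submit_request` and `submit_job` are also illustrative:

```python
def build_submit_request(job_name, job_queue, job_definition,
                         command=None, environment=None):
    """Build the keyword arguments for the Batch SubmitJob API call."""
    request = {
        "jobName": job_name,
        "jobQueue": job_queue,
        "jobDefinition": job_definition,
    }
    overrides = {}
    if command:
        overrides["command"] = list(command)
    if environment:
        # Batch expects environment variables as a list of name/value pairs.
        overrides["environment"] = [
            {"name": name, "value": str(value)}
            for name, value in environment.items()
        ]
    if overrides:
        request["containerOverrides"] = overrides
    return request


def submit_job(request, region_name=None):
    """Submit the job and return its ID. Requires boto3 and AWS credentials."""
    import boto3  # imported lazily so the request builder works without boto3

    client = boto3.client("batch", region_name=region_name)
    return client.submit_job(**request)["jobId"]


# Placeholder values; substitute the ARNs exported by the Pulumi stack.
request = build_submit_request(
    "train-run-1",
    "<job_queue_arn>",
    "<job_definition_arn>",
    command=["python", "train.py"],
    environment={"EPOCHS": 10},
)
print(request)
# job_id = submit_job(request)  # uncomment to submit against a real account
```

Keeping the request construction separate from the API call makes it possible to inspect or log the request before anything is sent to AWS.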