High-Throughput Inference Serving with ECS Capacity Providers

Question

Pulumi · Accepted Answer

To accomplish high-throughput inference serving using Amazon Elastic Container Service (ECS) and its capacity providers, you will need to create an ECS cluster, define and register an ECS task definition with your inference serving container image, and then set up the ECS services to run these tasks with the appropriate capacity provider strategies.

ECS capacity providers manage the compute capacity that your tasks run on, so that the cluster has the right amount of resources at the right time. When you set up a capacity provider, you define how your cluster can scale out (add more instances) or scale in (remove instances) depending on demand.
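To make the scale-out and scale-in behavior concrete, the sketch below models the arithmetic behind a capacity provider's managed scaling under a simplifying assumption: ECS tracks a reservation percentage (instances needed divided by instances running, times 100) and adjusts the Auto Scaling group until that percentage matches the target capacity. The function names are hypothetical, and real managed scaling works through CloudWatch metrics and step-size limits rather than a single formula, so treat this as an illustration only.

```python
def reservation_percent(instances_needed: int, instances_running: int) -> float:
    """Share of the running instances that the current tasks need, in percent."""
    if instances_running <= 0:
        raise ValueError("instances_running must be positive")
    return instances_needed / instances_running * 100


def instances_for_target(instances_needed: int, target_capacity: int) -> int:
    """Instance count at which the reservation drops to the target capacity.

    Simplified model: solve needed / running * 100 = target_capacity for
    `running` and round up to a whole instance.
    """
    if not 1 <= target_capacity <= 100:
        raise ValueError("target_capacity must be between 1 and 100")
    # Integer ceiling division avoids floating-point rounding surprises.
    return (instances_needed * 100 + target_capacity - 1) // target_capacity


# With a target capacity of 75, tasks that need 3 instances lead to a group of 4.
group_size = instances_for_target(3, 75)
```

A target capacity of 100 keeps the group exactly as large as the tasks require, while lower values trade cost for spare room so that new tasks can start without waiting for an instance to launch.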

For this example, I'm going to assume that you use EC2 instances as the compute resources and that you already have an Auto Scaling group defined.

Let's walk through the Pulumi program step by step:

1. **Define an ECS Cluster**: An ECS cluster is a logical grouping of tasks or services. You need to create an ECS cluster before you can run tasks or services that use the EC2 launch type.
2. **Define an ECS Task Definition**: The task definition is like a blueprint for your application and defines the containers that will be run on the ECS service.
3. **Define an ECS Service and Capacity Providers**: The ECS service allows you to run and maintain a specified number of instances of a task definition simultaneously. Capacity providers allow your ECS services to scale out by adding more EC2 instances or scale in by terminating instances.

Here's the Pulumi program in Python that sets up the ECS cluster, capacity provider, and an ECS service:

```python
import json

import pulumi
import pulumi_aws as aws

# Assumed stack configuration keys (not part of the original answer).
# Set them with `pulumi config set <key> <value>` before deploying.
config = pulumi.Config()
container_image_url = config.require("containerImageUrl")
ecs_execution_role_arn = config.require("ecsExecutionRoleArn")

# Define an ECS Cluster
ecs_cluster = aws.ecs.Cluster("ecsCluster")

# Define an ECS Capacity Provider backed by an existing Auto Scaling group.
# Replace the placeholder ARN with the ARN of your group. With managed termination
# protection enabled, the group must have instance scale-in protection turned on.
ecs_capacity_provider = aws.ecs.CapacityProvider("ecsCapacityProvider",
    name="myCapacityProvider",
    auto_scaling_group_provider=aws.ecs.CapacityProviderAutoScalingGroupProviderArgs(
        auto_scaling_group_arn="arn:aws:autoscaling:region:account-id:autoScalingGroup:uuid:autoScalingGroupName/auto-scaling-group-name",
        managed_scaling=aws.ecs.CapacityProviderAutoScalingGroupProviderManagedScalingArgs(
            maximum_scaling_step_size=1000,
            minimum_scaling_step_size=1,
            status="ENABLED",
            target_capacity=75,
        ),
        managed_termination_protection="ENABLED",
    ),
    tags={"Name": "ecsCapacityProvider"}
)

# Associate the capacity provider with the cluster; a service can only use
# capacity providers that are attached to its cluster.
cluster_capacity_providers = aws.ecs.ClusterCapacityProviders("ecsClusterCapacityProviders",
    cluster_name=ecs_cluster.name,
    capacity_providers=[ecs_capacity_provider.name],
)

# Define an ECS Task Definition for the EC2 launch type
task_definition = aws.ecs.TaskDefinition("appTaskDefinition",
    family="service",
    cpu="256",
    memory="512",
    network_mode="awsvpc",
    requires_compatibilities=["EC2"],
    execution_role_arn=ecs_execution_role_arn,
    # Output.from_input accepts a plain string or an Output (for example an ECR URL);
    # json.dumps keeps the container definitions valid JSON.
    container_definitions=pulumi.Output.from_input(container_image_url).apply(
        lambda url: json.dumps([{
            "name": "my-inference-service",
            "image": url,
            "cpu": 256,
            "memory": 512,
            "essential": True,
            "portMappings": [{"containerPort": 80, "hostPort": 80}],
        }])
    ),
)

# Define an ECS Service with the Capacity Provider.
# launch_type is omitted because it cannot be combined with a capacity provider strategy.
ecs_service = aws.ecs.Service("appService",
    cluster=ecs_cluster.arn,
    desired_count=3,
    capacity_provider_strategies=[aws.ecs.ServiceCapacityProviderStrategyArgs(
        capacity_provider=ecs_capacity_provider.name,
        weight=1,
        base=1,
    )],
    task_definition=task_definition.arn,
    network_configuration=aws.ecs.ServiceNetworkConfigurationArgs(
        subnets=["subnet-abcdefgh", "subnet-12345678"],  # Specify your subnets here
        security_groups=["sg-12345678"],  # Specify your security groups here
        # assign_public_ip is not set: it is not supported for tasks on EC2 instances,
        # so use private subnets with a NAT gateway or VPC endpoints for outbound access.
    ),
    load_balancers=[aws.ecs.ServiceLoadBalancerArgs(
        target_group_arn="arn:aws:elasticloadbalancing:region:account-id:targetgroup/my-targets/1234567890123456",
        container_name="my-inference-service",
        container_port=80,
    )],
    # Wait until the capacity provider is attached to the cluster.
    opts=pulumi.ResourceOptions(depends_on=[cluster_capacity_providers])
)

# Export the service name and cluster name
pulumi.export('service_name', ecs_service.name)
pulumi.export('cluster_name', ecs_cluster.name)
```

In this program, you're setting up resources with the following roles:

- **ECS Cluster**: A cluster named 'ecsCluster' to organize your service's resources.
- **ECS Capacity Provider**: A capacity provider named 'myCapacityProvider', configured with a target capacity of 75%. Managed scaling adjusts the Auto Scaling group so that roughly 75% of its instances are in use by tasks, leaving about 25% as headroom for new tasks.
- **ECS Task Definition**: A task definition 'appTaskDefinition' that defines your inference serving application with the assumed placeholder `container_image_url`, which represents where your container image is stored.
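- **ECS Service**: A service 'appService' that ties together 'ecsCluster', the 'myCapacityProvider' capacity provider, and 'appTaskDefinition'. It keeps 3 tasks running (`desired_count=3`) and places them according to the capacity provider strategy; with a single provider in the strategy, all of them run on 'myCapacityProvider'.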
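The `base` and `weight` values in the service's capacity provider strategy only matter once a strategy lists more than one provider: `base` tasks are placed first, and the rest are divided in proportion to `weight`. The sketch below illustrates that split in plain Python. The `split_tasks` helper and the two-provider names in the usage note are hypothetical, and the rounding rule (largest remainder) is an assumption, since the exact way ECS rounds uneven splits is not covered here.

```python
def split_tasks(desired_count, strategy):
    """Estimate how many tasks each capacity provider receives.

    `strategy` is a list of dicts with the keys `capacity_provider`,
    `weight`, and `base`, mirroring the service arguments.
    """
    counts = {item["capacity_provider"]: 0 for item in strategy}
    remaining = desired_count

    # Step 1: satisfy each provider's base before any weight is considered.
    for item in strategy:
        take = min(item.get("base", 0), remaining)
        counts[item["capacity_provider"]] += take
        remaining -= take

    total_weight = sum(item.get("weight", 0) for item in strategy)
    if remaining == 0 or total_weight == 0:
        return counts

    # Step 2: divide the remaining tasks in proportion to weight.
    assigned = 0
    remainders = []
    for item in strategy:
        quota, rest = divmod(remaining * item.get("weight", 0), total_weight)
        counts[item["capacity_provider"]] += quota
        assigned += quota
        remainders.append((rest, item["capacity_provider"]))

    # Step 3 (assumed rounding rule): leftover tasks go to the largest remainders.
    remainders.sort(key=lambda pair: pair[0], reverse=True)
    for _, name in remainders[: remaining - assigned]:
        counts[name] += 1
    return counts


# The strategy from the program above: one provider, so it receives all 3 tasks.
single = split_tasks(3, [{"capacity_provider": "myCapacityProvider", "weight": 1, "base": 1}])
```

For a hypothetical two-provider strategy such as `providerA` (base 1, weight 1) and `providerB` (weight 4) with 11 tasks, the helper places 1 base task on `providerA` and splits the other 10 as 2 and 8, matching the documented "one task on A for every four on B" behavior.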

Replace the container image URL, execution role ARN, Auto Scaling group ARN, subnets, security groups, and target group ARN with the values for your own resources.

After setting up the resources, the program exports the service name and cluster name as stack outputs, which you can retrieve with the `pulumi stack output` command after deploying your Pulumi app.

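With the capacity provider's managed scaling in place, the Auto Scaling group adds or removes EC2 instances to fit the tasks the service places. The task count itself stays at `desired_count` unless you also configure service auto scaling, so add that if your inference load varies.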