Auto-Scaling Clusters for Distributed ML Training on ECS

Pulumi · Accepted Answer

Creating auto-scaling clusters for distributed machine learning (ML) training on Amazon ECS (Elastic Container Service) involves several components working together. ECS runs, scales, and secures Docker containers on AWS. With it, you describe your ML application as task definitions and services built from container images.

For auto-scaling, you would typically use an ECS Service with an Application Auto Scaling policy. This policy adjusts the desired count of your tasks within the service based on defined metrics and scaling policies.

Here is what you need to set up:

1. An ECS Cluster: This is the logical grouping of tasks and services within ECS. With the Fargate launch type used in the program below, you don't manage the underlying instances; AWS provisions the compute for you.

2. Task Definitions: These are blueprints for your application that define the containers you want to run, alongside their resource requirements and other configurations.

3. ECS Services: Services let you run and maintain a specified number of instances of a task definition simultaneously. If any of your tasks should fail or stop for any reason, the ECS service scheduler launches another instance of your task definition to replace it.

4. Auto Scaling Policies: These are used to scale the number of tasks up or down based on the load or other metrics. For ML workloads, this could be based on the volume of data that needs to be processed, CPU utilization, memory usage, etc.
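
The resource requirements mentioned in item 2 are constrained on Fargate, the launch type the program below uses: only certain CPU and memory pairs are accepted. The following is a minimal sketch of checking a pair before building the container definitions JSON. `FARGATE_MEMORY_MIB` and `build_container_definitions` are hypothetical names for illustration, not part of any AWS or Pulumi API, and the table deliberately lists only the smaller task sizes.

```python
import json

# Task-level CPU units -> allowed memory values in MiB on Fargate.
# Assumption: only the smaller sizes are listed; larger ones exist.
FARGATE_MEMORY_MIB = {
    256: [512, 1024, 2048],
    512: [1024, 2048, 3072, 4096],
    1024: list(range(2048, 8192 + 1, 1024)),
}


def build_container_definitions(name: str, image: str, cpu: int, memory: int) -> str:
    """Return a JSON string for a single-container task, or raise ValueError."""
    allowed = FARGATE_MEMORY_MIB.get(cpu)
    if allowed is None:
        raise ValueError(f"CPU value {cpu} is not in this sketch's table")
    if memory not in allowed:
        raise ValueError(
            f"{memory} MiB is not valid with {cpu} CPU units; choose from {allowed}"
        )
    return json.dumps([{
        "name": name,
        "image": image,
        "cpu": cpu,
        "memory": memory,
        "essential": True,
    }])


# Same sizes as the task definition in the program below.
definitions = build_container_definitions("ml-container", "your-docker-image", 256, 512)
```

Failing early in Python tends to be easier to debug than a rejected `RegisterTaskDefinition` call during deployment, though the service remains the final authority on which sizes are valid.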

Let's create a simple Python program that uses Pulumi to deploy an auto-scaling ECS cluster for distributed ML training purposes.

```python
import json

import pulumi
import pulumi_aws as aws

# Create an ECS cluster
cluster = aws.ecs.Cluster("ml-cluster")

# Define IAM roles
ecs_task_execution_role = aws.iam.Role(
    "ecsTaskExecutionRole",
    assume_role_policy=json.dumps({
        "Version": "2012-10-17",
        "Statement": [{
            "Action": "sts:AssumeRole",
            "Effect": "Allow",
            "Principal": {
                "Service": "ecs-tasks.amazonaws.com"
            }
        }]
    })
)

# Attach the task execution role policy
policy_attachment = aws.iam.RolePolicyAttachment("ecs-task-execution-role-policy",
    role=ecs_task_execution_role.name,
    policy_arn="arn:aws:iam::aws:policy/service-role/AmazonECSTaskExecutionRolePolicy"
)

# Define a task definition for ML training
task_definition = aws.ecs.TaskDefinition("ml-task",
    family="ml-task-family",
    cpu="256",
    memory="512",
    network_mode="awsvpc",
    requires_compatibilities=["FARGATE"],
    execution_role_arn=ecs_task_execution_role.arn,
    # The definitions contain no Pulumi outputs, so a plain JSON string is enough
    container_definitions=json.dumps([
        {
            "name": "ml-container",
            "image": "your-docker-image",  # Replace with your Docker image for ML model training
            "cpu": 256,
            "memory": 512,
            "essential": True,
            "portMappings": [{
                "containerPort": 80,
                "hostPort": 80
            }],
        }
    ])
)

# Define an ECS Service with auto-scaling
service = aws.ecs.Service("ml-service",
    cluster=cluster.arn,
    task_definition=task_definition.arn,
    launch_type="FARGATE",
    desired_count=1,
    network_configuration=aws.ecs.ServiceNetworkConfigurationArgs(
        subnets=["subnet-xxxxxxxxxxxxxxxxx"], # Replace with your VPC subnets
        security_groups=["sg-xxxxxxxxxxxxxxxxx"], # Replace with your security group
        assign_public_ip=True,
    ),
    load_balancers=[aws.ecs.ServiceLoadBalancerArgs(
        # Replace with the ARN of your own load balancer target group
        target_group_arn="arn:aws:elasticloadbalancing:us-west-2:123456789012:targetgroup/my-targets/6d0ecf831eec9f09",
        container_name="ml-container",
        container_port=80,
    )],
    opts=pulumi.ResourceOptions(depends_on=[policy_attachment])
)

# Enable auto-scaling for the service
scaling_target = aws.appautoscaling.Target("ecs-appautoscaling-target",
    max_capacity=10,
    min_capacity=1,
    resource_id=pulumi.Output.concat("service/", cluster.name, "/", service.name),
    scalable_dimension="ecs:service:DesiredCount",
    service_namespace="ecs",
)

# Define the scaling policy
scaling_policy = aws.appautoscaling.Policy("ecs-appautoscaling-policy",
    policy_type="TargetTrackingScaling",
    resource_id=scaling_target.resource_id,
    scalable_dimension=scaling_target.scalable_dimension,
    service_namespace=scaling_target.service_namespace,
    target_tracking_scaling_policy_configuration=aws.appautoscaling.PolicyTargetTrackingScalingPolicyConfigurationArgs(
        target_value=75.0,
        scale_in_cooldown=60,
        scale_out_cooldown=60,
        predefined_metric_specification=aws.appautoscaling.PolicyTargetTrackingScalingPolicyConfigurationPredefinedMetricSpecificationArgs(
            predefined_metric_type="ECSServiceAverageCPUUtilization"
        ),
    ),
    opts=pulumi.ResourceOptions(depends_on=[scaling_target])
)

# Export the ECS cluster name and service name
pulumi.export("cluster_name", cluster.name)
pulumi.export("service_name", service.name)
```

Make sure to replace `"your-docker-image"` with the image URI for your ML training application. Also replace the subnet ID, security group ID, and target group ARN with values from your own environment. This example uses AWS Fargate for serverless container execution, so there are no instances for you to manage.
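
Because a forgotten placeholder only surfaces as a deployment error, a small pre-flight check can help. This is a sketch under two assumptions: that subnet and security group IDs consist of a prefix followed by 8 or 17 hexadecimal characters, and that the image placeholder is the literal string used in the program above. `find_unreplaced` is a hypothetical helper, not a Pulumi or AWS function.

```python
import re

# Assumption: IDs look like "subnet-" or "sg-" plus 8 or 17 hex characters.
_ID_PATTERN = re.compile(r"^(subnet|sg)-([0-9a-f]{8}|[0-9a-f]{17})$")


def find_unreplaced(values: dict) -> list:
    """Return the names of settings that still look like placeholders."""
    problems = []
    for name, value in values.items():
        if name == "image":
            # The image is checked against the literal placeholder string.
            if value == "your-docker-image":
                problems.append(name)
        elif not _ID_PATTERN.match(value):
            problems.append(name)
    return problems


settings = {
    "image": "your-docker-image",
    "subnet": "subnet-xxxxxxxxxxxxxxxxx",
    "security_group": "sg-0123456789abcdef0",
}
print(find_unreplaced(settings))  # prints ['image', 'subnet']
```

You could call such a check at the top of the Pulumi program and raise an exception when the returned list is non-empty.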

The auto-scaling policy in this example scales the service between 1 and 10 tasks to hold average CPU utilization near 75%, with 60-second cooldowns after each scale-in and scale-out. You can customize the policy further based on your specific ML workload needs.
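
To get a feel for how a 75% CPU target translates into task counts, the sketch below uses a simple proportional estimate clamped to the capacity limits from the program above. This is an approximation for intuition only, not the exact algorithm Application Auto Scaling runs: the real service also honors cooldowns and tends to scale in more cautiously. `estimate_desired_count` is a hypothetical name.

```python
import math


def estimate_desired_count(current_tasks: int, observed_cpu: float,
                           target_cpu: float = 75.0,
                           min_capacity: int = 1, max_capacity: int = 10) -> int:
    """Proportional estimate of the task count that brings CPU back to target."""
    # If 4 tasks run at 90% CPU, about 4 * 90 / 75 = 4.8 tasks' worth of work exists.
    raw = math.ceil(current_tasks * observed_cpu / target_cpu)
    # Clamp to the min_capacity and max_capacity set on the scaling target.
    return max(min_capacity, min(max_capacity, raw))


print(estimate_desired_count(4, 90.0))    # 5: scale out by one task
print(estimate_desired_count(4, 30.0))    # 2: scale in
print(estimate_desired_count(10, 100.0))  # 10: capped at max_capacity
```

For training jobs that stay near 100% CPU by design, an estimate like this suggests the service would sit at `max_capacity`, which is one reason a queue depth or custom metric may suit ML workloads better than CPU.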

With this configuration, you have established a starting point for an auto-scaling ECS cluster that can adjust resources based on the demand for your ML training workload.