1. Auto-scaling Inference Services for LLMs on AWS ECS.


    To deploy auto-scaling inference services for Large Language Models (LLMs) on AWS using the Elastic Container Service (ECS), you will need to create the following resources:

    1. ECS Cluster: A logical grouping of tasks or services. This is where your container workloads run.
    2. Task Definition: A blueprint for your application that specifies the Docker container configurations.
    3. ECS Service: This maintains the desired count of instances of the task definition in an ECS cluster. For your LLMs, this service should be configured to use auto-scaling to handle varying loads.
    4. Auto Scaling: AWS Application Auto Scaling to automatically adjust the number of tasks running based on demand.

    Here is a detailed Pulumi program written in Python that sets up these components. The program:

    • Defines a new ECS cluster.
    • Creates a task definition with the necessary container image and specifications for the inference service.
    • Sets up an ECS service with a load balancer to distribute incoming requests.
    • Configures auto-scaling using AWS Application Auto Scaling with target tracking policies to adjust the task count automatically based on CPU utilization.
    import json

    import pulumi
    import pulumi_aws as aws

    # Define an ECS cluster where the inference services will be deployed.
    cluster = aws.ecs.Cluster("inference-cluster")

    # Look up the default VPC and its subnets for networking.
    default_vpc = aws.ec2.get_vpc(default=True)
    default_subnets = aws.ec2.get_subnets(
        filters=[aws.ec2.GetSubnetsFilterArgs(name="vpc-id", values=[default_vpc.id])]
    )

    # Security group allowing inbound HTTP traffic to the load balancer and tasks.
    security_group = aws.ec2.SecurityGroup(
        "inference-sg",
        vpc_id=default_vpc.id,
        ingress=[aws.ec2.SecurityGroupIngressArgs(
            protocol="tcp", from_port=80, to_port=80, cidr_blocks=["0.0.0.0/0"],
        )],
        egress=[aws.ec2.SecurityGroupEgressArgs(
            protocol="-1", from_port=0, to_port=0, cidr_blocks=["0.0.0.0/0"],
        )],
    )

    # Task execution role that lets ECS pull images and write logs.
    execution_role = aws.iam.Role(
        "ecsExecutionRole",
        assume_role_policy=json.dumps({
            "Version": "2012-10-17",
            "Statement": [{
                "Action": "sts:AssumeRole",
                "Effect": "Allow",
                "Principal": {"Service": "ecs-tasks.amazonaws.com"},
            }],
        }),
    )
    aws.iam.RolePolicyAttachment(
        "ecsExecutionRolePolicy",
        role=execution_role.name,
        policy_arn="arn:aws:iam::aws:policy/service-role/AmazonECSTaskExecutionRolePolicy",
    )

    # Define an ECS task definition with the Docker image and required
    # specifications for the LLM inference service.
    task_definition = aws.ecs.TaskDefinition(
        "inference-task",
        family="inference",
        cpu="256",     # Adjust based on your needs
        memory="512",  # Adjust based on your needs
        network_mode="awsvpc",
        requires_compatibilities=["FARGATE"],  # Using Fargate for serverless compute
        execution_role_arn=execution_role.arn,
        container_definitions=json.dumps([{
            "name": "inference-service",
            "image": "my-llm-inference-service-image",  # Replace with your Docker image
            "cpu": 256,
            "memory": 512,
            "essential": True,
            "portMappings": [{"containerPort": 80, "hostPort": 80}],
            "environment": [
                # Replace with any environment variables your service needs.
                {"name": "ENV_VAR_NAME", "value": "some-value"},
            ],
        }]),
    )

    # Create a load balancer to distribute incoming traffic to the inference service.
    load_balancer = aws.lb.LoadBalancer(
        "inference-lb",
        internal=False,
        load_balancer_type="application",
        security_groups=[security_group.id],
        subnets=default_subnets.ids,
    )

    # Create a target group for the load balancer.
    target_group = aws.lb.TargetGroup(
        "inference-tg",
        port=80,
        protocol="HTTP",
        target_type="ip",
        vpc_id=default_vpc.id,
    )

    # Forward incoming HTTP requests to the target group.
    listener = aws.lb.Listener(
        "inference-listener",
        load_balancer_arn=load_balancer.arn,
        port=80,
        protocol="HTTP",
        default_actions=[aws.lb.ListenerDefaultActionArgs(
            type="forward",
            target_group_arn=target_group.arn,
        )],
    )

    # Establish the ECS service which governs the lifecycle of the inference tasks.
    service = aws.ecs.Service(
        "inference-service",
        cluster=cluster.arn,
        task_definition=task_definition.arn,
        desired_count=2,  # Start with 2 tasks; auto-scaling adjusts this as needed
        launch_type="FARGATE",
        load_balancers=[aws.ecs.ServiceLoadBalancerArgs(
            target_group_arn=target_group.arn,
            container_name="inference-service",
            container_port=80,
        )],
        network_configuration=aws.ecs.ServiceNetworkConfigurationArgs(
            subnets=default_subnets.ids,
            security_groups=[security_group.id],
            assign_public_ip=True,  # Needed for Fargate tasks to pull images from a public registry
        ),
        opts=pulumi.ResourceOptions(depends_on=[listener]),
    )

    # Configure Application Auto Scaling for the ECS service.
    scaling_target = aws.appautoscaling.Target(
        "inference-scaling-target",
        max_capacity=10,  # Upper limit on the number of tasks
        min_capacity=2,   # Lower limit on the number of tasks
        resource_id=pulumi.Output.all(cluster.name, service.name).apply(
            lambda args: f"service/{args[0]}/{args[1]}"
        ),
        scalable_dimension="ecs:service:DesiredCount",
        service_namespace="ecs",
    )

    scaling_policy = aws.appautoscaling.Policy(
        "inference-scaling-policy",
        policy_type="TargetTrackingScaling",
        resource_id=scaling_target.resource_id,
        scalable_dimension=scaling_target.scalable_dimension,
        service_namespace=scaling_target.service_namespace,
        target_tracking_scaling_policy_configuration=aws.appautoscaling.PolicyTargetTrackingScalingPolicyConfigurationArgs(
            target_value=70.0,  # Target CPU utilization percentage for scaling
            predefined_metric_specification=aws.appautoscaling.PolicyTargetTrackingScalingPolicyConfigurationPredefinedMetricSpecificationArgs(
                predefined_metric_type="ECSServiceAverageCPUUtilization",
            ),
        ),
    )

    # Export the load balancer's DNS name and zone ID to access the inference service.
    pulumi.export("load_balancer_dns", load_balancer.dns_name)
    pulumi.export("load_balancer_zone_id", load_balancer.zone_id)
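
    LLM inference containers are often constrained by memory rather than CPU. As an optional variation on the policy above (not part of the original program), you could attach a second target tracking policy that scales on the ECSServiceAverageMemoryUtilization predefined metric; the 75% target below is an illustrative value.

    # Optional: also scale on average memory utilization (illustrative target value).
    memory_scaling_policy = aws.appautoscaling.Policy(
        "inference-memory-scaling-policy",
        policy_type="TargetTrackingScaling",
        resource_id=scaling_target.resource_id,
        scalable_dimension=scaling_target.scalable_dimension,
        service_namespace=scaling_target.service_namespace,
        target_tracking_scaling_policy_configuration=aws.appautoscaling.PolicyTargetTrackingScalingPolicyConfigurationArgs(
            target_value=75.0,  # Illustrative target; tune for your model's memory profile
            predefined_metric_specification=aws.appautoscaling.PolicyTargetTrackingScalingPolicyConfigurationPredefinedMetricSpecificationArgs(
                predefined_metric_type="ECSServiceAverageMemoryUtilization",
            ),
        ),
    )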

    This program provisions AWS ECS infrastructure that scales automatically based on CPU utilization, ensuring the inference service has enough capacity to handle incoming traffic while remaining cost-efficient.
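
    Once deployed, you can verify the service end-to-end by sending a request to the load balancer. The snippet below is a minimal sketch: it assumes your container exposes an HTTP endpoint on port 80 at a path such as /generate, and the payload shape is a placeholder, not something defined by the program above. Retrieve the DNS name with pulumi stack output load_balancer_dns.

    import requests  # pip install requests

    # Replace with the value of `pulumi stack output load_balancer_dns`.
    LB_DNS = "<load-balancer-dns>"

    # Hypothetical endpoint and payload; adjust to match your inference container's API.
    response = requests.post(
        f"http://{LB_DNS}/generate",
        json={"prompt": "Hello, world!", "max_tokens": 64},
        timeout=60,
    )
    print(response.status_code, response.json())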

    Before running this code:

    • Configure your AWS credentials using the Pulumi CLI.
    • Replace "my-llm-inference-service-image" with the actual Docker image for your LLM service.
    • Update the CPU and memory settings to match your inference workload; the placeholder values (256 CPU units, 512 MiB) are far below what most LLM inference containers require.
    • Set up the appropriate environment variables your service might need.
    • Ensure your VPC, subnets, and security groups are correctly configured for external access where applicable.
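
    For reference, the container image itself is not defined by this program. Below is a minimal, hypothetical sketch of what such an image might run: a FastAPI app wrapping a Hugging Face text-generation pipeline, listening on port 80 to match the task definition. The framework, model name, and endpoint path are assumptions for illustration, not requirements of the ECS setup.

    # Hypothetical inference server for the container image (not part of the Pulumi program).
    # Assumes FastAPI, uvicorn, and transformers are installed in the image.
    from fastapi import FastAPI
    from pydantic import BaseModel
    from transformers import pipeline

    app = FastAPI()
    generator = pipeline("text-generation", model="gpt2")  # Placeholder model; substitute your LLM

    class GenerateRequest(BaseModel):
        prompt: str
        max_tokens: int = 64

    @app.post("/generate")
    def generate(req: GenerateRequest):
        # Run the model and return the generated text.
        output = generator(req.prompt, max_new_tokens=req.max_tokens)
        return {"text": output[0]["generated_text"]}

    # Run with: uvicorn app:app --host 0.0.0.0 --port 80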