Auto-scaling Inference Services for LLMs on AWS ECS.
To deploy auto-scaling inference services for Large Language Models (LLMs) on AWS with the Elastic Container Service (ECS), you need the following resources:
- ECS Cluster: A logical grouping of tasks or services. This is where your container workloads run.
- Task Definition: A blueprint for your application that specifies the Docker container configurations.
- ECS Service: This maintains the desired count of instances of the task definition in an ECS cluster. For your LLMs, this service should be configured to use auto-scaling to handle varying loads.
- Auto Scaling: AWS Application Auto Scaling automatically adjusts the number of running tasks based on demand.
Here is a detailed Pulumi program written in Python that sets up these components. The program:
- Defines a new ECS cluster.
- Creates a task definition with the necessary container image and specifications for the inference service.
- Sets up an ECS service with a load balancer to distribute incoming requests.
- Configures auto-scaling using AWS Application Auto Scaling with target tracking policies to adjust the task count automatically based on CPU utilization.
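```python
import json

import pulumi
import pulumi_aws as aws

# Look up the default VPC and its subnets (assumes the account still has a default VPC)
vpc = aws.ec2.get_vpc(default=True)
subnets = aws.ec2.get_subnets(
    filters=[aws.ec2.GetSubnetsFilterArgs(name="vpc-id", values=[vpc.id])]
)

# One security group shared by the load balancer and the tasks: HTTP in, all traffic out.
# Tighten these rules before using this for anything beyond a test deployment.
security_group = aws.ec2.SecurityGroup("inference-sg",
    vpc_id=vpc.id,
    ingress=[aws.ec2.SecurityGroupIngressArgs(
        protocol="tcp", from_port=80, to_port=80, cidr_blocks=["0.0.0.0/0"])],
    egress=[aws.ec2.SecurityGroupEgressArgs(
        protocol="-1", from_port=0, to_port=0, cidr_blocks=["0.0.0.0/0"])])

# Define an ECS Cluster where the inference services will be deployed
cluster = aws.ecs.Cluster("inference-cluster")

# Execution role that lets ECS pull the image and write logs on behalf of the task
execution_role = aws.iam.Role("ecsExecutionRole",
    assume_role_policy=json.dumps({
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"Service": "ecs-tasks.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }],
    }))
aws.iam.RolePolicyAttachment("ecsExecutionRolePolicy",
    role=execution_role.name,
    policy_arn="arn:aws:iam::aws:policy/service-role/AmazonECSTaskExecutionRolePolicy")

# Define an ECS Task Definition with the Docker image and required specifications for the LLM service.
# The container definitions are built with json.dumps because JSON allows neither comments nor trailing commas.
task_definition = aws.ecs.TaskDefinition("inference-task",
    family="inference",
    cpu="256",     # Adjust based on your needs
    memory="512",  # Adjust based on your needs
    network_mode="awsvpc",
    requires_compatibilities=["FARGATE"],  # Using Fargate for serverless compute
    execution_role_arn=execution_role.arn,
    container_definitions=json.dumps([{
        "name": "inference-service",
        "image": "my-llm-inference-service-image",  # Replace with your Docker image
        "cpu": 256,
        "memory": 512,
        "essential": True,
        "portMappings": [{"containerPort": 80, "hostPort": 80}],
        "environment": [
            # Replace with any environment variables your service needs
            {"name": "ENV_VAR_NAME", "value": "some-value"},
        ],
    }]))

# Create a Load Balancer to distribute incoming traffic to the inference service
load_balancer = aws.lb.LoadBalancer("inference-lb",
    internal=False,
    load_balancer_type="application",
    security_groups=[security_group.id],
    subnets=subnets.ids)

# Create a target group for the Load Balancer
target_group = aws.lb.TargetGroup("inference-tg",
    port=80,
    protocol="HTTP",
    target_type="ip",
    vpc_id=vpc.id)

# Forward HTTP traffic from the Load Balancer to the target group.
# Without a listener, the target group is not attached to the Load Balancer and the service cannot be created.
listener = aws.lb.Listener("inference-listener",
    load_balancer_arn=load_balancer.arn,
    port=80,
    protocol="HTTP",
    default_actions=[aws.lb.ListenerDefaultActionArgs(
        type="forward", target_group_arn=target_group.arn)])

# Establish the ECS Service which will govern the lifecycle of our inference tasks
service = aws.ecs.Service("inference-service",
    cluster=cluster.arn,
    task_definition=task_definition.arn,
    desired_count=2,  # Start with 2 tasks, this will be adjusted by auto-scaling
    launch_type="FARGATE",
    load_balancers=[aws.ecs.ServiceLoadBalancerArgs(
        target_group_arn=target_group.arn,
        container_name="inference-service",
        container_port=80)],
    network_configuration=aws.ecs.ServiceNetworkConfigurationArgs(
        assign_public_ip=True,  # Needed to pull images from public subnets without a NAT gateway
        subnets=subnets.ids,
        security_groups=[security_group.id]),
    opts=pulumi.ResourceOptions(
        depends_on=[listener],
        ignore_changes=["desiredCount"]))  # Keep later updates from undoing auto-scaling changes

# Configure Auto Scaling for the ECS Service
scaling_target = aws.appautoscaling.Target("inference-scaling-target",
    max_capacity=10,  # Upper limit on the number of tasks
    min_capacity=2,   # Lower limit on the number of tasks
    resource_id=pulumi.Output.all(cluster.name, service.name).apply(
        lambda args: f"service/{args[0]}/{args[1]}"),
    scalable_dimension="ecs:service:DesiredCount",
    service_namespace="ecs")

scaling_policy = aws.appautoscaling.Policy("inference-scaling-policy",
    policy_type="TargetTrackingScaling",
    resource_id=scaling_target.resource_id,
    scalable_dimension=scaling_target.scalable_dimension,
    service_namespace=scaling_target.service_namespace,
    target_tracking_scaling_policy_configuration={
        "target_value": 70.0,  # Target CPU utilization percentage for scaling
        "predefined_metric_specification": {
            "predefined_metric_type": "ECSServiceAverageCPUUtilization",
        },
    })

# Export the DNS name and hosted zone ID of the Load Balancer to access the inference service
pulumi.export("load_balancer_dns", load_balancer.dns_name)
pulumi.export("load_balancer_zone_id", load_balancer.zone_id)
```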
This code sets up infrastructure on AWS ECS that scales automatically based on CPU usage, so the inference service has enough capacity to handle incoming traffic while remaining cost-efficient.
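To build intuition for what a 70% CPU target with a task range of 2 to 10 means, here is a rough sketch of the proportional reasoning behind target tracking. It is a simplified model, not the algorithm AWS runs: the real policy works from CloudWatch alarms, applies cooldowns, and scales in more conservatively than it scales out. The function name estimate_desired_tasks and its parameters are made up for this illustration.

```python
import math


def estimate_desired_tasks(current_tasks, observed_cpu, target_cpu=70.0,
                           min_tasks=2, max_tasks=10):
    """Estimate the task count that would bring average CPU back to the target.

    Simplified proportional model (an assumption for illustration only):
    if N tasks run at X% average CPU, roughly N * X / target tasks are
    needed to run at the target utilization.
    """
    if current_tasks <= 0 or target_cpu <= 0:
        raise ValueError("current_tasks and target_cpu must be positive")
    needed = math.ceil(current_tasks * observed_cpu / target_cpu)
    # Application Auto Scaling never goes outside the min/max capacity bounds
    return max(min_tasks, min(max_tasks, needed))


if __name__ == "__main__":
    print(estimate_desired_tasks(2, 95.0))   # 2 tasks at 95% CPU
    print(estimate_desired_tasks(4, 35.0))   # 4 tasks at 35% CPU
    print(estimate_desired_tasks(8, 140.0))  # demand beyond max capacity
```

Under this model, 2 tasks at 95% CPU call for 3 tasks, 4 tasks at 35% can shrink to 2, and any demand above the upper bound is capped at 10. For LLM inference, CPU utilization may not be the bottleneck, so it is worth checking whether request count or latency tracks load more closely for your service.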
Before running this code:
- Configure AWS credentials and a region that Pulumi can use, for example through the AWS CLI, environment variables, or Pulumi configuration.
- Replace "my-llm-inference-service-image" with the actual Docker image for your LLM service.
- Update the CPU and memory settings if your inference tasks require different specifications.
- Set up the appropriate environment variables your service might need.
- Ensure your VPC, subnets, and security groups are correctly configured for external access where applicable.
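When updating the CPU and memory settings, keep in mind that Fargate only accepts certain CPU and memory pairs, and an invalid pair is rejected when the task definition is registered. The sketch below checks a pair before it reaches the task definition. The table reflects the Fargate task sizes as I understand them at the time of writing, so verify it against the current AWS documentation; the names FARGATE_MEMORY_BY_CPU and check_task_size are hypothetical helpers, not part of Pulumi or AWS.

```python
# Memory values (MiB) accepted for each Fargate CPU size (CPU units, 1024 = 1 vCPU).
# Assumption: values as documented by AWS at the time of writing.
FARGATE_MEMORY_BY_CPU = {
    256: [512, 1024, 2048],
    512: list(range(1024, 4096 + 1, 1024)),
    1024: list(range(2048, 8192 + 1, 1024)),
    2048: list(range(4096, 16384 + 1, 1024)),
    4096: list(range(8192, 30720 + 1, 1024)),
    8192: list(range(16384, 61440 + 1, 4096)),
    16384: list(range(32768, 122880 + 1, 8192)),
}


def check_task_size(cpu, memory):
    """Validate a CPU/memory pair and return it as the strings a task definition expects."""
    cpu, memory = int(cpu), int(memory)
    if cpu not in FARGATE_MEMORY_BY_CPU:
        raise ValueError(f"unsupported Fargate CPU value: {cpu}")
    if memory not in FARGATE_MEMORY_BY_CPU[cpu]:
        raise ValueError(
            f"memory {memory} MiB is not valid for cpu {cpu}; "
            f"allowed values: {FARGATE_MEMORY_BY_CPU[cpu]}")
    return str(cpu), str(memory)


if __name__ == "__main__":
    cpu, memory = check_task_size(4096, 16384)
    print(cpu, memory)
```

The returned strings can be passed as the cpu and memory arguments of the task definition. Fargate tasks do not provide GPUs, so models that need GPU inference would have to run on EC2-backed ECS capacity instead.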