1. Auto-Scaling EC2 Environments for Real-time AI Inference


    Auto-scaling EC2 environments are an advanced AWS use case: computing resources are adjusted dynamically to match the workload of applications like real-time AI inference. In AWS, this can be achieved with a combination of services such as EC2 Auto Scaling Groups, EC2 Spot Fleets, or AWS Batch, depending on your cost preferences, fault tolerance, and the level of control you need.

    Auto Scaling Groups help to maintain the desired count of instances and automatically add or remove instances according to the conditions you define. Spot Instances offer a cost-effective choice for workloads that can tolerate interruptions.

    For real-time AI inference where availability and low latency are typically crucial, we might prioritize using On-Demand or Reserved Instances within an Auto Scaling Group. This ensures that instances are always available when needed but can also scale down when demand decreases.
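    If some interruption tolerance is acceptable, an Auto Scaling Group can also blend purchase options through a mixed instances policy. The snippet below is a minimal sketch of that pattern, separate from the main program that follows: mixed instances policies require a launch template rather than a launch configuration, and the AMI, subnet, instance types, and distribution percentages shown are placeholders to adapt.

    import pulumi_aws as aws

    # Hypothetical launch template; mixed instances policies require launch
    # templates rather than launch configurations.
    launch_template = aws.ec2.LaunchTemplate("inferenceTemplate",
        image_id="ami-12345678",   # placeholder AMI
        instance_type="t3.medium")

    mixed_asg = aws.autoscaling.Group("mixedAsg",
        vpc_zone_identifiers=["subnet-12345678"],  # placeholder subnet
        min_size=1,
        max_size=10,
        mixed_instances_policy=aws.autoscaling.GroupMixedInstancesPolicyArgs(
            instances_distribution=aws.autoscaling.GroupMixedInstancesPolicyInstancesDistributionArgs(
                on_demand_base_capacity=1,                    # always keep one On-Demand instance
                on_demand_percentage_above_base_capacity=50,  # split the rest 50/50 with Spot
                spot_allocation_strategy="capacity-optimized"),
            launch_template=aws.autoscaling.GroupMixedInstancesPolicyLaunchTemplateArgs(
                launch_template_specification=aws.autoscaling.GroupMixedInstancesPolicyLaunchTemplateLaunchTemplateSpecificationArgs(
                    launch_template_id=launch_template.id,
                    version="$Latest"),
                overrides=[
                    aws.autoscaling.GroupMixedInstancesPolicyLaunchTemplateOverrideArgs(instance_type="t3.medium"),
                    aws.autoscaling.GroupMixedInstancesPolicyLaunchTemplateOverrideArgs(instance_type="t3.large"),
                ])))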

    In the program below, we will create an Auto Scaling Group of On-Demand instances; if you want the cost/availability trade-off of a Spot mix, the mixed instances sketch above drops in as an extension. The group will scale based on CPU utilization, a common metric for inference loads, but you should adjust this to the specifics of your workload (e.g., network throughput or GPU utilization; a GPU-based alarm is sketched at the end of this section).

    The Pulumi program below outlines how to set up an Auto Scaling Group for real-time AI inference:

    import pulumi
    import pulumi_aws as aws

    # Define the AMI (Amazon Machine Image) for the EC2 instances.
    # Generally, this should be an image with your AI inference software and
    # dependencies pre-installed.
    ami_id = "ami-12345678"

    # Define the instance type. For AI inference, consider GPU-accelerated or
    # compute-optimized instances.
    instance_type = "t3.medium"

    # Set up an IAM role and instance profile for the EC2 instances.
    # Attach whatever policies your inference workload needs.
    role = aws.iam.Role("role",
        assume_role_policy="""{
            "Version": "2012-10-17",
            "Statement": [{
                "Action": "sts:AssumeRole",
                "Effect": "Allow",
                "Principal": {"Service": "ec2.amazonaws.com"}
            }]
        }""")

    instance_profile = aws.iam.InstanceProfile("instanceProfile", role=role.name)

    # Use the default VPC and its subnets for this example (you could also
    # create a dedicated VPC).
    default_vpc = aws.ec2.get_vpc(default=True)
    default_subnets = aws.ec2.get_subnets(
        filters=[aws.ec2.GetSubnetsFilterArgs(name="vpc-id", values=[default_vpc.id])])

    # Set up a security group. The rules below allow SSH and HTTP traffic for
    # simplicity; tighten them to match your requirements.
    security_group = aws.ec2.SecurityGroup("securityGroup",
        vpc_id=default_vpc.id,
        description="Allow SSH and HTTP inbound traffic",
        ingress=[
            {"protocol": "tcp", "from_port": 22, "to_port": 22, "cidr_blocks": ["0.0.0.0/0"]},
            {"protocol": "tcp", "from_port": 80, "to_port": 80, "cidr_blocks": ["0.0.0.0/0"]},
        ])

    # Define a launch configuration: the template for the instances the group
    # will launch.
    launch_configuration = aws.ec2.LaunchConfiguration("launchConfiguration",
        image_id=ami_id,
        instance_type=instance_type,
        iam_instance_profile=instance_profile.arn,
        key_name="my-key-name",  # Replace with your key pair name for SSH access.
        security_groups=[security_group.id],
        user_data="""#!/bin/bash
    echo "User data script to set up the instance"
    # Include any other setup steps here.
    """)

    # Create the Auto Scaling Group.
    auto_scaling_group = aws.autoscaling.Group("autoScalingGroup",
        vpc_zone_identifiers=default_subnets.ids,
        desired_capacity=1,
        min_size=1,
        max_size=10,
        launch_configuration=launch_configuration.id,
        target_group_arns=[],  # Attach load balancer target groups here if you have them.
        health_check_type="EC2",
        health_check_grace_period=300,
        force_delete=True,
        tags=[{"key": "Name", "value": "auto-scaling-group", "propagate_at_launch": True}])

    # Define CPU-based simple scaling policies for the Auto Scaling Group.
    scale_up_policy = aws.autoscaling.Policy("scaleUpPolicy",
        autoscaling_group_name=auto_scaling_group.name,
        adjustment_type="ChangeInCapacity",
        scaling_adjustment=2,
        cooldown=300)

    scale_down_policy = aws.autoscaling.Policy("scaleDownPolicy",
        autoscaling_group_name=auto_scaling_group.name,
        adjustment_type="ChangeInCapacity",
        scaling_adjustment=-1,
        cooldown=300)

    # Configure CloudWatch alarms to trigger the scaling policies.
    cpu_utilization_high_alarm = aws.cloudwatch.MetricAlarm("cpuUtilizationHighAlarm",
        comparison_operator="GreaterThanOrEqualToThreshold",
        evaluation_periods=2,
        metric_name="CPUUtilization",
        namespace="AWS/EC2",
        period=300,
        statistic="Average",
        threshold=75,
        alarm_actions=[scale_up_policy.arn],
        dimensions={"AutoScalingGroupName": auto_scaling_group.name})

    cpu_utilization_low_alarm = aws.cloudwatch.MetricAlarm("cpuUtilizationLowAlarm",
        comparison_operator="LessThanOrEqualToThreshold",
        evaluation_periods=2,
        metric_name="CPUUtilization",
        namespace="AWS/EC2",
        period=300,
        statistic="Average",
        threshold=25,
        alarm_actions=[scale_down_policy.arn],
        dimensions={"AutoScalingGroupName": auto_scaling_group.name})

    # Export the Auto Scaling Group's name so it is easy to find in the console.
    pulumi.export("autoScalingGroup", auto_scaling_group.name)

    In this program, we've created:

    1. An IAM role and instance profile to give our EC2 instances the correct permissions.
    2. A security group to control traffic to instances.
    3. A launch configuration to define the properties of the instances we'll be creating. Normally, your AMI would have your AI model and software stack already set up.
    4. An Auto Scaling Group to manage our fleet of EC2 instances. This group will handle the scaling of instances and the registration/deregistration from any load balancer you might have.
    5. Auto Scaling Policies and CloudWatch Alarms to scale up and scale down our instances based on CPU utilization, a common indicator of inference workload.

    This configuration ensures that at least one instance is running at all times (enforced by min_size, with desired_capacity as the starting count) and that the group can grow to a maximum of 10 instances. When CPU utilization stays consistently high, the group launches additional instances (up to that maximum), and it terminates surplus instances when utilization falls.
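    As an alternative to the simple scaling policies and alarms used above, a target tracking policy lets the Auto Scaling service create and manage the CloudWatch alarms itself: you declare a target (say, 60% average CPU) and the group scales to hold it. A minimal sketch, assuming the auto_scaling_group defined earlier:

    # Target tracking: the service manages the alarms, scaling the group to
    # hold average CPU near the target value.
    target_tracking_policy = aws.autoscaling.Policy("targetTrackingPolicy",
        autoscaling_group_name=auto_scaling_group.name,
        policy_type="TargetTrackingScaling",
        target_tracking_configuration=aws.autoscaling.PolicyTargetTrackingConfigurationArgs(
            predefined_metric_specification=aws.autoscaling.PolicyTargetTrackingConfigurationPredefinedMetricSpecificationArgs(
                predefined_metric_type="ASGAverageCPUUtilization"),
            target_value=60.0))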

    Remember to replace dummy values like ami-12345678 and my-key-name with actual values relevant to your AWS setup.
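    Rather than hard-coding an AMI ID, you can also resolve one at deployment time with aws.ec2.get_ami. The name pattern below is only an illustration (it assumes an Amazon-owned deep learning image); substitute whatever pattern matches your own inference image:

    # Look up a recent AMI at deployment time instead of hard-coding an ID.
    ami = aws.ec2.get_ami(
        most_recent=True,
        owners=["amazon"],
        filters=[aws.ec2.GetAmiFilterArgs(
            name="name",
            values=["Deep Learning AMI GPU*"])])  # assumed pattern; replace as needed

    ami_id = ami.id  # use in place of the hard-coded "ami-12345678"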

    Please note that while this program offers a foundation, you should adjust scaling policies, health check grace periods, and other parameters to meet your specific use case needs. Auto-scaling is a complex domain, and finding the right configuration often requires observation and iteration.
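    For instance, if your instances publish GPU metrics through the CloudWatch agent, the alarm pattern shown above carries over to a custom metric. The namespace and metric name below are assumptions that depend entirely on how the agent is configured on your instances:

    # Alarm on a custom GPU metric instead of CPU. "CWAgent" and
    # "nvidia_smi_utilization_gpu" are assumed values; check your agent config.
    gpu_utilization_high_alarm = aws.cloudwatch.MetricAlarm("gpuUtilizationHighAlarm",
        comparison_operator="GreaterThanOrEqualToThreshold",
        evaluation_periods=2,
        metric_name="nvidia_smi_utilization_gpu",
        namespace="CWAgent",
        period=300,
        statistic="Average",
        threshold=75,
        alarm_actions=[scale_up_policy.arn],
        dimensions={"AutoScalingGroupName": auto_scaling_group.name})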