High Availability for AI Model Serving with AWS ALB

Pulumi · Accepted Answer

To achieve high availability for AI model serving with an AWS Application Load Balancer (ALB), we'll have the ALB distribute incoming traffic across multiple targets, such as EC2 instances, in more than one availability zone. The application can then survive the failure of individual instances without losing overall uptime. The ALB itself scales with request volume, integrates with Auto Scaling groups, and provides a central point for managing incoming traffic.

Here's how you would set up a high availability architecture for AI model serving using Pulumi:

1. **Load Balancer (AWS ALB):** This is the entry point for all incoming traffic to your AI model serving application. It will distribute the traffic across the available targets (like EC2 instances) in multiple availability zones to ensure high availability and fault tolerance.

2. **Target Group:** This routes requests to one or more registered targets. The ALB runs health checks against each target and sends traffic only to the healthy ones.

3. **Listeners:** These wait for incoming connection requests and are configured with rules to forward those requests to target groups based on conditions such as path or header values.

4. **Listener Rules:** Optionally, you can have more specific routing rules for your listener, so different paths or request types can be routed to different target groups.

5. **EC2 Instances:** These will be the actual servers where your AI models are loaded and served. The ALB will route requests to these instances based on the rules specified.

6. **Auto-Scaling Group:** It helps to adjust the number of EC2 instances in response to the incoming application load, ensuring that you have enough capacity to handle traffic spikes without manual intervention.
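The health checks in item 2 only work if each target answers on the health check path. Here is a minimal sketch of such an endpoint using only the Python standard library. The `/health` path matches the one used in the program that follows; the `MODEL_READY` flag and the `health_response`, `HealthHandler`, and `serve` names are hypothetical, and a real model server would more likely expose this route through its own web framework.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer
from typing import Tuple

# Hypothetical readiness flag: the serving process sets it once the model is loaded.
MODEL_READY = threading.Event()


def health_response(model_ready: bool) -> Tuple[int, bytes]:
    """Return the HTTP status and JSON body for a health check."""
    # 200 marks the target as healthy; 503 keeps it out of rotation until the model is ready.
    status = 200 if model_ready else 503
    body = json.dumps({"ready": model_ready}).encode()
    return status, body


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/health":
            self.send_error(404)
            return
        status, body = health_response(MODEL_READY.is_set())
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the periodic health checks out of the access log


def serve(port: int = 80) -> None:
    """Serve the health endpoint; blocks until the process is stopped."""
    ThreadingHTTPServer(("0.0.0.0", port), HealthHandler).serve_forever()
```

Returning 503 while the model is still loading keeps a freshly launched instance out of rotation until it can answer requests, which matters when large models take a while to load. To try it, call `MODEL_READY.set()` after loading the model and then `serve()`; binding to port 80 usually needs elevated privileges, so a higher port is easier for local testing.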

Below is the Pulumi Python program that sets up a highly available environment for AI model serving on AWS. It creates a security group, an Application Load Balancer, a target group, and a listener. It omits the EC2 instances and the Auto Scaling group for brevity; those are explained after the code.

```python
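import pulumi
import pulumi_aws as aws

# Read the IDs of your existing VPC and subnets from stack configuration.
# The key names `vpcId` and `subnetIds` are placeholders chosen for this example. Set them with:
#   pulumi config set vpcId <your-vpc-id>
#   pulumi config set --path 'subnetIds[0]' <subnet-id-in-first-az>
#   pulumi config set --path 'subnetIds[1]' <subnet-id-in-second-az>
config = pulumi.Config()
vpc_id = config.require('vpcId')
subnet_ids = config.require_object('subnetIds')  # at least two subnets in different availability zones

# Create a new security group for the ALB in the same VPC as the subnets
security_group = aws.ec2.SecurityGroup('alb-security-group',
    description='Enable HTTP access',
    vpc_id=vpc_id,
    ingress=[
        {
            'protocol': 'tcp',
            'from_port': 80,
            'to_port': 80,
            'cidr_blocks': ['0.0.0.0/0']
        }
    ],
    egress=[
        {
            'protocol': '-1',
            'from_port': 0,
            'to_port': 0,
            'cidr_blocks': ['0.0.0.0/0']
        }
    ]
)

# Create an Application Load Balancer
alb = aws.lb.LoadBalancer('app-lb',
    internal=False,
    security_groups=[security_group.id],
    subnets=subnet_ids,
    load_balancer_type="application"
)

# Create a default target group
default_target_group = aws.lb.TargetGroup('default-target-group',
    port=80,
    protocol='HTTP',
    vpc_id=vpc_id,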
    health_check={
        'path': '/health',  # Replace with the path used for the health check of your AI model serving application
        'protocol': 'HTTP',
        'interval': 30,
        'timeout': 3,
        'healthy_threshold': 2,
        'unhealthy_threshold': 2
    },
    stickiness={
        'type': 'lb_cookie',
        'enabled': False,  # Sticky sessions may not be needed for stateless AI model serving
    }
)

# Create a listener for incoming HTTP traffic
listener = aws.lb.Listener('listener',
    load_balancer_arn=alb.arn,
    port=80,
    default_actions=[{
        'type': 'forward',
        'target_group_arn': default_target_group.arn
    }]
)

# (Optional) Additional Listener Rules can be created for content-based routing if needed here

# Export the DNS name of the ALB to access the application
pulumi.export('alb_dns_name', alb.dns_name)
```

Here's what each part of this program does:

- We start by creating a security group that allows inbound HTTP traffic and unrestricted outbound traffic.

- We then create an internet-facing Application Load Balancer. It is attached to two subnets, which must be in different availability zones; an ALB requires subnets in at least two zones.

- Next, we create a target group with a health check configuration. This health check ensures that traffic is only sent to healthy instances serving your AI models.

- A listener is added to our load balancer to forward HTTP traffic (port 80) to the default target group.

- Lastly, we export the DNS name of the Application Load Balancer so you can use it to send traffic to your application.
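The optional listener rules mentioned in the program's comment could look like the following sketch, which continues the program above and so assumes `listener` and `default_target_group` are in scope. The second target group, the `/models/b/*` path pattern, and the priority value are made up for illustration.

```python
import pulumi
import pulumi_aws as aws

# Hypothetical second target group, e.g. for a different model or model version.
model_b_target_group = aws.lb.TargetGroup('model-b-target-group',
    port=80,
    protocol='HTTP',
    vpc_id=default_target_group.vpc_id,  # same VPC as the default target group
    health_check={
        'path': '/health',
        'protocol': 'HTTP'
    }
)

# Forward requests whose path matches the pattern to the second target group.
# Requests that match no rule fall through to the listener's default action.
model_b_rule = aws.lb.ListenerRule('model-b-rule',
    listener_arn=listener.arn,
    priority=100,  # lower numbers are evaluated first
    actions=[{
        'type': 'forward',
        'target_group_arn': model_b_target_group.arn
    }],
    conditions=[{
        'path_pattern': {
            'values': ['/models/b/*']  # placeholder path
        }
    }]
)
```

Path-based rules like this let one load balancer front several models while each model keeps its own target group and health check.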

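To complete the setup, you would define how instances are launched, typically with `aws.ec2.LaunchTemplate`, and create an Auto Scaling group using `aws.autoscaling.Group` that is attached to your target group (for example through its `target_group_arns` argument), so that new instances are registered automatically. Standalone `aws.ec2.Instance` resources can instead be registered one by one with `aws.lb.TargetGroupAttachment`, but nothing replaces them if they fail. The Auto Scaling group handles the scaling and replacement of instances to maintain high availability.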
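As a rough sketch of that remaining piece, the following continues the program above and assumes `security_group` and `default_target_group` are in scope. The configuration keys `amiId` and `instanceSubnetIds`, the instance type, and the group sizes are placeholders rather than recommendations; the AMI is assumed to start your model server on port 80 at boot.

```python
import pulumi
import pulumi_aws as aws

config = pulumi.Config()
ami_id = config.require('amiId')  # hypothetical key: an AMI with your model server baked in
instance_subnet_ids = config.require_object('instanceSubnetIds')  # hypothetical key: subnets in 2+ AZs

# Security group for the instances: accept HTTP only from the ALB's security group.
instance_security_group = aws.ec2.SecurityGroup('instance-security-group',
    description='Allow HTTP from the ALB only',
    vpc_id=default_target_group.vpc_id,
    ingress=[{
        'protocol': 'tcp',
        'from_port': 80,
        'to_port': 80,
        'security_groups': [security_group.id]
    }],
    egress=[{
        'protocol': '-1',
        'from_port': 0,
        'to_port': 0,
        'cidr_blocks': ['0.0.0.0/0']
    }]
)

# Launch template describing each model-serving instance.
launch_template = aws.ec2.LaunchTemplate('model-server-template',
    image_id=ami_id,
    instance_type='g4dn.xlarge',  # placeholder; choose a type that fits your model
    vpc_security_group_ids=[instance_security_group.id]
)

# Auto Scaling group spread across the subnets and registered with the target group.
asg = aws.autoscaling.Group('model-server-asg',
    min_size=2,  # at least two instances so one failure does not take the service down
    max_size=6,
    desired_capacity=2,
    vpc_zone_identifiers=instance_subnet_ids,
    target_group_arns=[default_target_group.arn],
    health_check_type='ELB',  # replace instances that fail the ALB health check
    health_check_grace_period=300,  # seconds to allow for model loading before checks count
    launch_template={
        'id': launch_template.id,
        'version': '$Latest'
    }
)
```

Setting `health_check_type` to `'ELB'` means the group replaces an instance when the load balancer reports it unhealthy, not only when the EC2 status checks fail, which is usually what you want when a model server can hang while the machine stays up.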

You can consult Pulumi's [AWS documentation](https://www.pulumi.com/docs/reference/pkg/aws/) for more detail on each of these components and instructions on adding EC2 instances and the auto-scaling group to your Pulumi program.