1. Auto-Scaling ML Model Inference with EKS


    Auto-scaling an ML model inference workload on Amazon EKS (Elastic Kubernetes Service) involves creating a Kubernetes cluster, deploying machine learning model containers as pods, and setting up horizontal pod autoscaling to scale pods out and in based on workload demands. In this scenario, Amazon EKS manages the Kubernetes infrastructure, and Pulumi is used to provision the cluster and its supporting AWS resources; the ML models themselves run as containerized workloads on the cluster.

    Let's break down the program:

    1. Creating an IAM Role for EKS: The EKS cluster requires an IAM role with the necessary permissions to create and manage the Kubernetes resources.
    2. Creating an EKS Cluster: We need to create an EKS cluster which will be the base of our Kubernetes workloads.
    3. Node Group Configuration: We will also define a node group, which is a set of worker nodes that will run our ML workload.
    4. Deploying an Auto-Scaler: We will deploy the Kubernetes Cluster Autoscaler, which watches for pending pods and adjusts the number of worker nodes, while the Horizontal Pod Autoscaler adjusts the number of pod replicas.
    5. ML Model Setup: In an actual project you'd define your ML model container here; in this program we describe only the placeholder steps for setting up the ML model inference service and scaling it based on traffic.
    6. Scaling Setup: Lastly, we will deploy the auto-scaling policies to manage the scaling of our ML model pod replicas.
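    Steps 5 and 6 remain placeholders in the program further below, but the shape of the Kubernetes objects involved is worth sketching. The following is a minimal sketch expressed as plain Python dicts, in the same form a `pulumi_kubernetes` resource or a YAML manifest would take; the `ml-inference` name, the container image, the port, and the CPU target are illustrative assumptions, not part of the original program.

```python
# Sketch only: the inference Deployment and its HorizontalPodAutoscaler as
# plain dicts, in the shape pulumi_kubernetes or kubectl would accept.
# The name, image, port, and CPU target below are placeholder assumptions.

def inference_deployment(name: str, image: str, replicas: int = 2) -> dict:
    """Deployment running the ML inference containers."""
    labels = {"app": name}
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name, "labels": labels},
        "spec": {
            "replicas": replicas,
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    "containers": [{
                        "name": name,
                        "image": image,  # placeholder model image
                        "ports": [{"containerPort": 8080}],
                        # CPU requests are required for CPU-based HPA to work.
                        "resources": {"requests": {"cpu": "500m"}},
                    }],
                },
            },
        },
    }

def inference_hpa(name: str, min_replicas: int = 1,
                  max_replicas: int = 10, cpu_target: int = 70) -> dict:
    """HPA scaling the Deployment on average CPU utilization."""
    return {
        "apiVersion": "autoscaling/v2",
        "kind": "HorizontalPodAutoscaler",
        "metadata": {"name": name},
        "spec": {
            "scaleTargetRef": {
                "apiVersion": "apps/v1",
                "kind": "Deployment",
                "name": name,
            },
            "minReplicas": min_replicas,
            "maxReplicas": max_replicas,
            "metrics": [{
                "type": "Resource",
                "resource": {
                    "name": "cpu",
                    "target": {"type": "Utilization",
                               "averageUtilization": cpu_target},
                },
            }],
        },
    }
```

    Once the cluster is up and you have its kubeconfig, dicts of this shape can be applied with `kubectl apply -f` (after dumping to YAML) or created through `pulumi_kubernetes` resources.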

    Each section of the code below is commented to explain its purpose and functionality.

    Here's the Pulumi program in Python that will set up an autoscaling ML model inference service with EKS:

    import pulumi
    import pulumi_aws as aws
    import pulumi_eks as eks

    # IAM role that the EKS control plane assumes.
    eks_role = aws.iam.Role("eksRole",
        assume_role_policy="""{
            "Version": "2012-10-17",
            "Statement": [{
                "Action": "sts:AssumeRole",
                "Effect": "Allow",
                "Principal": {"Service": "eks.amazonaws.com"}
            }]
        }""")

    # Attach the AmazonEKSClusterPolicy to the control-plane role.
    aws.iam.RolePolicyAttachment("eksRoleAttachment",
        role=eks_role.name,
        policy_arn="arn:aws:iam::aws:policy/AmazonEKSClusterPolicy")

    # Separate IAM role for the worker nodes; it is assumed by EC2, not EKS.
    node_role = aws.iam.Role("nodeRole",
        assume_role_policy="""{
            "Version": "2012-10-17",
            "Statement": [{
                "Action": "sts:AssumeRole",
                "Effect": "Allow",
                "Principal": {"Service": "ec2.amazonaws.com"}
            }]
        }""")

    # Worker nodes need these three AWS-managed policies.
    for i, policy in enumerate([
            "arn:aws:iam::aws:policy/AmazonEKSWorkerNodePolicy",
            "arn:aws:iam::aws:policy/AmazonEKS_CNI_Policy",
            "arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly"]):
        aws.iam.RolePolicyAttachment(f"nodeRoleAttachment-{i}",
            role=node_role.name,
            policy_arn=policy)

    # Create a VPC for our cluster (example CIDR range).
    vpc = aws.ec2.Vpc("vpc",
        cidr_block="10.0.0.0/16",
        enable_dns_hostnames=True,
        enable_dns_support=True,
        tags={"Name": "pulumi-eks-vpc"})

    # Create a subnet for the VPC. You'd want to create multiple subnets
    # across availability zones to ensure high availability.
    subnet = aws.ec2.Subnet("subnet",
        cidr_block="10.0.1.0/24",
        vpc_id=vpc.id,
        availability_zone="us-west-2a",
        map_public_ip_on_launch=True,
        tags={"Name": "pulumi-vpc-subnet"})

    # Security group that allows HTTP ingress and unrestricted egress.
    sg = aws.ec2.SecurityGroup("sg",
        vpc_id=vpc.id,
        description="Enable HTTP access",
        ingress=[{
            "protocol": "tcp",
            "from_port": 80,
            "to_port": 80,
            "cidr_blocks": ["0.0.0.0/0"],
        }],
        egress=[{
            "protocol": "-1",
            "from_port": 0,
            "to_port": 0,
            "cidr_blocks": ["0.0.0.0/0"],
        }])

    # Create the EKS cluster. We skip the default node group so we can
    # define our own with explicit scaling bounds below.
    cluster = eks.Cluster("cluster",
        vpc_id=vpc.id,
        public_subnet_ids=[subnet.id],
        service_role=eks_role,
        instance_roles=[node_role],
        skip_default_node_group=True)

    # The EKS control plane is up and running after this point, but we still
    # need to configure worker nodes and the autoscaling pieces.

    # Create a managed node group for the cluster. Adjust min_size and
    # max_size to your projected traffic.
    node_group = aws.eks.NodeGroup("nodeGroup",
        cluster_name=cluster.eks_cluster.name,
        node_role_arn=node_role.arn,
        subnet_ids=[subnet.id],
        instance_types=["m5.large"],
        scaling_config=aws.eks.NodeGroupScalingConfigArgs(
            desired_size=2,
            min_size=1,
            max_size=4),
        labels={"ondemand": "true"})

    # At this point, your EKS cluster is operational with a node group.
    # The remaining pieces depend on your specific ML model and setup:
    #   ...deploying the ML model inference service...
    #   ...configuring Horizontal Pod Autoscaling (HPA) for the ML service...

    # Output the cluster's kubeconfig and API endpoint so kubectl and other
    # tooling can connect.
    pulumi.export("kubeconfig", cluster.kubeconfig)
    pulumi.export("clusterEndpoint", cluster.eks_cluster.endpoint)
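    One note on step 4: the Cluster Autoscaler is typically installed into the cluster afterwards (its Helm chart is the common route) with auto-discovery enabled, and it then finds scalable node groups by Auto Scaling group tags. EKS managed node groups apply these tags automatically, but if you manage the Auto Scaling group yourself they must be present. A small helper showing the expected tags (the cluster name is a placeholder for your own):

```python
# The Cluster Autoscaler's auto-discovery mode finds Auto Scaling groups
# by these two tags. EKS managed node groups add them automatically.

def autoscaler_discovery_tags(cluster_name: str) -> dict:
    """ASG tags the Cluster Autoscaler looks for in auto-discovery mode."""
    return {
        "k8s.io/cluster-autoscaler/enabled": "true",
        f"k8s.io/cluster-autoscaler/{cluster_name}": "owned",
    }

# Example with a placeholder cluster name:
tags = autoscaler_discovery_tags("cluster")
```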

    This program sets up the cluster with a node group that scales between 1 and 4 instances. For a real-world scenario, you'd also deploy your ML model pods and configure autoscaling based on CPU or on custom metrics tied to your model's resource consumption. You might also add the security hardening, monitoring, logging, and compliance features that are essential for managing production workloads.
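    For intuition on how that pod-level autoscaling behaves, the Horizontal Pod Autoscaler's core rule from the Kubernetes documentation is: desiredReplicas = ceil(currentReplicas × currentMetricValue / targetMetricValue), clamped to the configured min/max bounds. A small illustration (the bounds mirror the 1-4 range used above, and the CPU figures are made up for the example):

```python
import math

def desired_replicas(current_replicas: int, current_metric: float,
                     target_metric: float, min_replicas: int = 1,
                     max_replicas: int = 4) -> int:
    """Desired replica count per the HPA scaling rule, clamped to bounds."""
    desired = math.ceil(current_replicas * current_metric / target_metric)
    return max(min_replicas, min(max_replicas, desired))

# Two pods averaging 90% CPU against a 50% target: ceil(2 * 90 / 50) = 4.
print(desired_replicas(2, 90.0, 50.0))  # -> 4
```

    This is why resource requests on the pods matter: CPU utilization is measured relative to the requested amount, so without requests the HPA has no target to compare against.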