EKS Autoscaling for AI/ML Batch Jobs

Pulumi · Accepted Answer

To set up autoscaling for AI/ML batch jobs on Amazon EKS (Elastic Kubernetes Service), you define an EKS cluster, configure NodeGroups with scaling limits, and, if needed, add Horizontal Pod Autoscalers. Pulumi lets you define this infrastructure as code, so you can create, modify, and version your setup programmatically, which helps with the complex configurations common in AI/ML workloads.

Here's how you would achieve this with Pulumi in Python:

1. Define the EKS Cluster: This is the foundation of your Kubernetes setup on AWS.
2. Configure NodeGroups with autoscaling: This sets the minimum and maximum number of nodes (EC2 instances) between which your cluster can scale as resource demand changes.
3. Deploy your AI/ML workload: You can deploy Kubernetes resources, such as Deployments or Jobs, which represent your batch processing tasks.
4. (Optional) Define a Horizontal Pod Autoscaler: This Kubernetes resource automatically scales the number of pods in a Deployment or ReplicaSet. It does not apply to Jobs, whose pod count is set by `parallelism`.
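To make step 4 more concrete, the scaling rule that the Horizontal Pod Autoscaler applies can be sketched in plain Python. This is only an illustration of the documented formula (desired replicas = ceil(current replicas × current metric / target metric)); the function name `desired_replicas` and its parameters are hypothetical, and the 0.1 tolerance is the Kubernetes default as far as I know.

```python
import math


def desired_replicas(current_replicas: int, current_metric: float,
                     target_metric: float, min_replicas: int,
                     max_replicas: int, tolerance: float = 0.1) -> int:
    """Approximate the HPA replica calculation for a single metric."""
    ratio = current_metric / target_metric
    # Small deviations from the target are ignored to avoid flapping.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    wanted = math.ceil(current_replicas * ratio)
    # The result is always kept within the configured replica bounds.
    return max(min_replicas, min(max_replicas, wanted))


# 4 pods at 90% average CPU with a 60% target scale out to 6 pods.
print(desired_replicas(4, 90, 60, min_replicas=1, max_replicas=10))  # 6
```

The real controller also handles missing metrics, multiple metrics, and stabilization windows, which this sketch leaves out.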

Below is a Pulumi program written in Python that sets up an EKS cluster with an autoscaling NodeGroup suitable for running AI/ML batch jobs. The program assumes that you have already set up your AWS provider credentials. If you haven't, please refer to [Pulumi's documentation on AWS setup](https://www.pulumi.com/docs/intro/cloud-providers/aws/setup/).

```python
import json

import pulumi
import pulumi_aws as aws
import pulumi_eks as eks

# Define the IAM role that the worker nodes (EC2 instances) assume
node_role = aws.iam.Role('nodegroup-role',
    assume_role_policy=json.dumps({
        'Version': '2012-10-17',
        'Statement': [{
            'Effect': 'Allow',
            'Principal': {'Service': 'ec2.amazonaws.com'},
            'Action': 'sts:AssumeRole',
        }],
    }),
)

# Attach necessary policies to the node role
aws.iam.RolePolicyAttachment('nodegroup-AmazonEKSWorkerNodePolicy',
    policy_arn='arn:aws:iam::aws:policy/AmazonEKSWorkerNodePolicy',
    role=node_role.name,
)
aws.iam.RolePolicyAttachment('nodegroup-AmazonEKS_CNI_Policy',
    policy_arn='arn:aws:iam::aws:policy/AmazonEKS_CNI_Policy',
    role=node_role.name,
)
aws.iam.RolePolicyAttachment('nodegroup-AmazonEC2ContainerRegistryReadOnly',
    policy_arn='arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly',
    role=node_role.name,
)

# eks.NodeGroup takes an instance profile rather than a role ARN
instance_profile = aws.iam.InstanceProfile('nodegroup-instance-profile',
    role=node_role.name,
)

# Create an EKS cluster. The default node group is skipped because the
# NodeGroup below provides the worker nodes; instance_roles lets nodes
# using node_role join the cluster.
cluster = eks.Cluster('ai-ml-cluster',
    skip_default_node_group=True,
    instance_roles=[node_role],
    # Define other cluster-related configurations as needed
)

# Create a NodeGroup with scaling bounds
node_group = eks.NodeGroup('ai-ml-nodegroup',
    cluster=cluster.core,
    instance_type='m5.large',  # General-purpose instance type with a balance of compute, memory, and networking
    desired_capacity=2,        # Start with 2 worker nodes
    min_size=1,                # At minimum, have 1 worker node
    max_size=4,                # Can scale up to 4 worker nodes for heavier workloads
    instance_profile=instance_profile,
    labels={'ondemand': 'true'},  # Custom labels can be used for assigning workloads
    # Define other NodeGroup-related configurations as needed
)

# Output the cluster's kubeconfig
pulumi.export('kubeconfig', cluster.kubeconfig)

# When the program is updated or changed, the `pulumi up` command will deploy the changes to AWS.
```

This program does the following:

- It creates an EKS cluster. Choose the instance types and capacity settings based on your workload needs.
- It defines the IAM role for the worker nodes, so they have the required permissions to operate within the EKS cluster.
- It then creates an EKS NodeGroup with scaling bounds. The `desired_capacity` defines the initial number of nodes, `min_size` is the minimum number of nodes that should always be running, and `max_size` is the maximum number of nodes the group can scale out to. These values only set limits: to add or remove nodes in response to pending pods, you also need a node autoscaler such as the Kubernetes Cluster Autoscaler or Karpenter running in the cluster.
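How `min_size` and `max_size` constrain the node count can be sketched in plain Python. This is a simplified estimate, not Pulumi or AWS API code and not how any real autoscaler decides: `nodes_needed` and its parameters are hypothetical, and the 2000 millicores per node assumes an `m5.large` (2 vCPUs) with nothing reserved for system components.

```python
def nodes_needed(pending_pods: int, pod_cpu_millis: int, node_cpu_millis: int,
                 min_size: int, max_size: int) -> int:
    """Estimate the node count for a batch of pods, kept within the group bounds."""
    if min_size > max_size:
        raise ValueError('min_size must not exceed max_size')
    total_cpu = pending_pods * pod_cpu_millis
    # Integer ceiling division: round up to whole nodes.
    wanted = -(-total_cpu // node_cpu_millis)
    return max(min_size, min(max_size, wanted))


# 10 pods requesting 500m CPU each need 3 nodes of 2000m.
print(nodes_needed(10, 500, 2000, min_size=1, max_size=4))  # 3
# 40 such pods would need 10 nodes, but the group stops at max_size.
print(nodes_needed(40, 500, 2000, min_size=1, max_size=4))  # 4
```

If batch jobs regularly hit the upper bound, pods stay pending until capacity frees up, which is a signal to raise `max_size` or pick a larger instance type.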

The Horizontal Pod Autoscaler and other workload-specific resources (like K8s Deployments, Jobs, etc.) would be defined similarly using the Pulumi Kubernetes provider, targeting the created EKS cluster. You can explore this through the [Pulumi Kubernetes documentation](https://www.pulumi.com/docs/reference/pkg/kubernetes/).
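As a rough sketch of what such a workload definition might look like, the snippet below declares a Kubernetes Job through the Pulumi Kubernetes provider (`pulumi_kubernetes`, an extra package to install). Several things here are assumptions rather than part of the program above: the helper names `job_sizing` and `define_batch_job`, the resource names, the CPU and memory requests, and the idea that `cluster.kubeconfig` must be serialized to JSON before it is passed to the provider. The node selector reuses the `ondemand` label from the NodeGroup.

```python
import json


def job_sizing(total_tasks: int, max_nodes: int, pods_per_node: int) -> dict:
    """Hypothetical helper: cap parallelism at what the node group can host at max size."""
    parallelism = max(1, min(total_tasks, max_nodes * pods_per_node))
    return {'completions': total_tasks, 'parallelism': parallelism}


def define_batch_job(cluster, image: str, sizing: dict):
    """Declare a Kubernetes Job on a pulumi_eks cluster (call inside a Pulumi program)."""
    # Imported lazily so job_sizing can be used without Pulumi installed.
    import pulumi
    import pulumi_kubernetes as k8s

    # Assumption: cluster.kubeconfig is an object, so it is serialized to a JSON string.
    provider = k8s.Provider('eks-k8s',
        kubeconfig=cluster.kubeconfig.apply(lambda cfg: json.dumps(cfg)),
    )

    return k8s.batch.v1.Job('ai-ml-batch-job',
        spec=k8s.batch.v1.JobSpecArgs(
            completions=sizing['completions'],
            parallelism=sizing['parallelism'],
            backoff_limit=2,  # Retry failed pods at most twice
            template=k8s.core.v1.PodTemplateSpecArgs(
                spec=k8s.core.v1.PodSpecArgs(
                    restart_policy='Never',
                    node_selector={'ondemand': 'true'},  # Schedule onto the labeled NodeGroup
                    containers=[k8s.core.v1.ContainerArgs(
                        name='worker',
                        image=image,
                        resources=k8s.core.v1.ResourceRequirementsArgs(
                            requests={'cpu': '500m', 'memory': '1Gi'},
                        ),
                    )],
                ),
            ),
        ),
        opts=pulumi.ResourceOptions(provider=provider),
    )
```

Inside the Pulumi program this could be called as `define_batch_job(cluster, image, job_sizing(100, 4, 2))`, where `image` is your own container image. Because a Job's pod count comes from `parallelism`, a Horizontal Pod Autoscaler would only be relevant for long-running Deployments, such as a pool of queue workers.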

Install the required Pulumi packages before running this program: use `pip` to install `pulumi`, `pulumi_aws`, and `pulumi_eks`.

To apply these changes to your cloud environment, run this program with the Pulumi CLI:

```
pulumi up
```

This will provision the resources as defined in your Python code.