Auto-Scaling AI Workloads with Kubernetes Elastic Agent

Question

Pulumi · Accepted Answer

Auto-scaling AI workloads on Kubernetes lets capacity follow demand, so the service stays responsive under load without keeping idle pods running. To enable this, we can use the Horizontal Pod Autoscaler (HPA), which automatically scales the number of pods in a replication controller, deployment, replica set, or stateful set based on observed CPU utilization (or, with custom metrics support, on other application-provided metrics).

In this context, an "elastic agent" might be a pod that runs some component of the AI workload—perhaps a machine learning model inference service. As demand for that service grows or shrinks, the HPA will adjust the number of running instances to match.

Here's a high-level overview of the steps you'd take to enable this:

1. Define workloads using Kubernetes objects like Deployments or StatefulSets.
2. Configure resource requests for these workloads; the HPA computes CPU utilization as a percentage of the requested CPU.
3. Install the Metrics Server in the cluster so the HPA can read resource utilization metrics.
4. Define an HPA resource targeting the workload with specific scaling parameters.

Below, I'll provide a Pulumi Python program that sets up a simple AI workload with autoscaling. We'll simulate the AI aspect by creating a deployment that serves a dummy workload—the specifics of the AI service are not important for the purpose of learning how to configure auto-scaling.

**Note**: The following program assumes:
- You have a Kubernetes cluster and have set up kubectl access to it.
- You have installed the Pulumi CLI and set up a Pulumi project.
- You have the Pulumi Kubernetes Python SDK installed.

Now let's start writing the Pulumi program:

```python
import pulumi
import pulumi_kubernetes as k8s

# Create a Kubernetes Deployment to run your AI workload pods.
ai_deployment = k8s.apps.v1.Deployment(
    "ai-deployment",
    spec=k8s.apps.v1.DeploymentSpecArgs(
        replicas=2,  # Initial replica count; the HPA adjusts it at runtime.
        selector=k8s.meta.v1.LabelSelectorArgs(
            match_labels={"app": "ai-service"},  # Label selector to manage pods.
        ),
        template=k8s.core.v1.PodTemplateSpecArgs(
            metadata=k8s.meta.v1.ObjectMetaArgs(
                labels={"app": "ai-service"},
            ),
            spec=k8s.core.v1.PodSpecArgs(
                containers=[
                    k8s.core.v1.ContainerArgs(
                        name="ai-container",
                        image="your-ai-application-image",  # Replace with your AI application container image.
                        resources=k8s.core.v1.ResourceRequirementsArgs(
                            requests={
                                "cpu": "500m",  # Half a CPU core; HPA utilization is measured against this request.
                                "memory": "512Mi",  # 512 MiB of memory per pod.
                            },
                        ),
                    ),
                ],
            ),
        ),
    ),
)

# Define a Horizontal Pod Autoscaler to scale your AI workload Deployment.
ai_hpa = k8s.autoscaling.v1.HorizontalPodAutoscaler(
    "ai-hpa",
    spec=k8s.autoscaling.v1.HorizontalPodAutoscalerSpecArgs(
        max_replicas=10,  # Set the maximum number of replicas.
        min_replicas=2,  # Set the minimum number of replicas.
        scale_target_ref=k8s.autoscaling.v1.CrossVersionObjectReferenceArgs(
            api_version="apps/v1",
            kind="Deployment",
            name=ai_deployment.metadata["name"],
        ),
        target_cpu_utilization_percentage=50,  # Average CPU utilization, as a percentage of requested CPU, that the HPA tries to maintain.
    ),
)

# Export the name of the deployment and HPA
pulumi.export("deployment_name", ai_deployment.metadata["name"])
pulumi.export("hpa_name", ai_hpa.metadata["name"])
```

In this Pulumi program:
- We define a `Deployment` representing our AI workload, with a simple container image placeholder that you need to replace with your actual AI application's container image.
- The `Deployment` starts with 2 replicas, and each pod requests `500m` of CPU (500 millicores, or half a core) and `512Mi` of memory. These figures are illustrative and may vary depending on the specific requirements of your AI workloads.
- A `HorizontalPodAutoscaler` then targets the `Deployment`. It is configured to maintain an average CPU utilization of 50% across all replicas of the deployment, measured against each pod's CPU request (so 50% corresponds to about `250m` per pod here). If the average CPU usage across all pods rises above this target, the HPA adds pods; if it falls below, the HPA removes pods, always staying within the specified min and max replica range.
- The `pulumi.export` statements at the end of the script output the names of the deployment and HPA once the program has been deployed. To deploy it, run `pulumi up` from the project directory.
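
To make the 50% target more concrete, here is a minimal sketch of the replica calculation described in the Kubernetes HPA documentation: `desiredReplicas = ceil(currentReplicas * currentMetricValue / desiredMetricValue)`, skipped when the ratio is within a tolerance of 1.0 (0.1 by default). The function name `desired_replicas` and its parameters are hypothetical, and the real controller also accounts for pod readiness, missing metrics, and stabilization windows, so treat this as an approximation rather than the controller's exact behavior.

```python
import math


def desired_replicas(
    current_replicas: int,
    current_utilization: float,
    target_utilization: float = 50.0,
    min_replicas: int = 2,
    max_replicas: int = 10,
    tolerance: float = 0.1,
) -> int:
    """Approximate the HPA replica calculation (hypothetical helper).

    Utilization values are percentages of the pods' CPU request,
    averaged across all pods of the target workload.
    """
    ratio = current_utilization / target_utilization

    if abs(ratio - 1.0) <= tolerance:
        # Close enough to the target: keep the current replica count.
        desired = current_replicas
    else:
        desired = math.ceil(current_replicas * ratio)

    # Clamp to the min/max range configured on the HPA.
    return max(min_replicas, min(max_replicas, desired))


if __name__ == "__main__":
    # Two pods averaging 90% of their CPU request against a 50% target.
    print(desired_replicas(current_replicas=2, current_utilization=90))
```

With the values from the program above (target 50%, between 2 and 10 replicas), sustained load well above the target scales the deployment up, while the result never leaves the configured range.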

**Additional information**:
- The Metrics Server must be installed and running in the Kubernetes cluster to supply the CPU and memory usage data. It is a prerequisite for the HPA: without these metrics, the HPA cannot compute utilization and will not scale the workload.
- The actual container image for the AI workload should be created, published to a container registry, and its path provided in the `image` field of the Deployment definition. This setup is not covered in the script above.
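
The program above uses the `autoscaling/v1` API, which scales on CPU utilization only. If memory should also drive scaling, the `autoscaling/v2` API accepts a list of metrics. The sketch below builds such a manifest as a plain Python dictionary; the helper name `build_hpa_manifest`, the `-hpa` name suffix, the 70% memory target, and the example deployment name are illustrative assumptions, not part of the original program. The Pulumi Kubernetes SDK also provides an `autoscaling.v2` module for defining the same resource in the program itself.

```python
import json


def build_hpa_manifest(
    deployment_name: str,
    min_replicas: int = 2,
    max_replicas: int = 10,
    cpu_target: int = 50,
    memory_target: int = 70,  # Assumed value, chosen only for illustration.
) -> dict:
    """Build an autoscaling/v2 HPA manifest (hypothetical helper)."""

    def resource_metric(name: str, average_utilization: int) -> dict:
        # Utilization targets are percentages of the pods' resource requests.
        return {
            "type": "Resource",
            "resource": {
                "name": name,
                "target": {
                    "type": "Utilization",
                    "averageUtilization": average_utilization,
                },
            },
        }

    return {
        "apiVersion": "autoscaling/v2",
        "kind": "HorizontalPodAutoscaler",
        "metadata": {"name": f"{deployment_name}-hpa"},
        "spec": {
            "scaleTargetRef": {
                "apiVersion": "apps/v1",
                "kind": "Deployment",
                "name": deployment_name,
            },
            "minReplicas": min_replicas,
            "maxReplicas": max_replicas,
            "metrics": [
                resource_metric("cpu", cpu_target),
                resource_metric("memory", memory_target),
            ],
        },
    }


if __name__ == "__main__":
    # "ai-deployment-abc123" stands in for the exported deployment_name value.
    print(json.dumps(build_hpa_manifest("ai-deployment-abc123"), indent=2))
```

The printed JSON can be saved to a file and applied with `kubectl apply -f`, since kubectl accepts JSON manifests as well as YAML. When several metrics are listed, the HPA evaluates each one and uses the largest resulting replica count.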