1. Auto-Scaling AI Workloads with Kubernetes HPA & Kuma


    To enable auto-scaling for AI workloads in a Kubernetes cluster, you would typically use the Horizontal Pod Autoscaler (HPA) resource to dynamically scale the number of pods in a deployment or replica set. HPA adjusts the number of running pods in response to observed CPU utilization or to custom metrics provided by third-party metrics systems.

    Kuma, on the other hand, is a service mesh that can run on Kubernetes, providing features like observability, traffic control, security, and discovery. It's not directly involved in auto-scaling but could be used in conjunction with Kubernetes to enhance the network and security aspects of your services.
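    If you do layer Kuma onto the cluster later, workloads join the mesh through annotation rather than through any change to the HPA. As a rough sketch (assuming Kuma's control plane is already installed; the `ai-workloads` namespace name is illustrative, and the exact injection key may vary across Kuma versions), opting a namespace into sidecar injection looks like:

```yaml
# Hypothetical namespace manifest; assumes Kuma is already installed.
# The kuma.io/sidecar-injection key tells Kuma to inject its data-plane
# proxy into pods created in this namespace.
apiVersion: v1
kind: Namespace
metadata:
  name: ai-workloads
  labels:
    kuma.io/sidecar-injection: enabled
```

    Pods scaled up by the HPA in such a namespace would automatically receive the Kuma sidecar, so autoscaling and the mesh compose without extra wiring.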

    Here, I'll guide you through a Pulumi program that sets up an HPA for your Kubernetes workloads. The program does not directly integrate Kuma, but you can install Kuma as a service mesh in your cluster to manage the microservices that make up your AI workloads. For simplicity, the example below focuses on the Kubernetes Horizontal Pod Autoscaler:

    Detailed Explanation

    1. Kubernetes Python SDK: We'll use Pulumi's Kubernetes SDK in Python to define our Kubernetes resources.
    2. HorizontalPodAutoscaler: This resource automatically scales the number of pods in a replication controller, deployment, or replica set based on observed CPU utilization.
    3. Deployment: Before we can autoscale our pods, we need to have a deployment that defines the pods we want to scale. This will be a simple AI service running as a container.
    4. Resource Requirements: We'll define resource requests and limits for our containers, which HPA can use to make scaling decisions.

    Here's how you would typically set up the HPA using pulumi_kubernetes for an AI workload:

```python
import pulumi
import pulumi_kubernetes as kubernetes

# Create a Kubernetes Deployment
app_labels = {"app": "ai-service"}
deployment = kubernetes.apps.v1.Deployment(
    "ai-service-deployment",
    spec=kubernetes.apps.v1.DeploymentSpecArgs(
        selector=kubernetes.meta.v1.LabelSelectorArgs(
            match_labels=app_labels,
        ),
        replicas=3,  # Start with 3 replicas
        template=kubernetes.core.v1.PodTemplateSpecArgs(
            metadata=kubernetes.meta.v1.ObjectMetaArgs(
                labels=app_labels,
            ),
            spec=kubernetes.core.v1.PodSpecArgs(
                containers=[
                    kubernetes.core.v1.ContainerArgs(
                        name="ai-service",
                        image="ai-service:latest",  # Replace with your actual AI service image
                        resources=kubernetes.core.v1.ResourceRequirementsArgs(
                            requests={
                                "cpu": "500m",    # Request half a CPU core
                                "memory": "1Gi",  # Request 1 GiB of memory
                            },
                            limits={
                                "cpu": "1",       # Limit to one CPU core
                                "memory": "2Gi",  # Limit to 2 GiB of memory
                            },
                        ),
                    ),
                ],
            ),
        ),
    ),
)

# Create a HorizontalPodAutoscaler to automatically scale our AI workload
hpa = kubernetes.autoscaling.v1.HorizontalPodAutoscaler(
    "ai-service-hpa",
    spec=kubernetes.autoscaling.v1.HorizontalPodAutoscalerSpecArgs(
        max_replicas=10,  # Scale up to at most 10 replicas
        min_replicas=3,   # Scale down to at least 3 replicas
        scale_target_ref=kubernetes.autoscaling.v1.CrossVersionObjectReferenceArgs(
            api_version="apps/v1",
            kind="Deployment",
            name=deployment.metadata.name,
        ),
        target_cpu_utilization_percentage=50,  # Target 50% average CPU utilization
    ),
)

# Export the number of replicas for observation
pulumi.export("ai_service_replicas", deployment.spec.replicas)
```

    In the program above:

    • We create a Kubernetes Deployment that has 3 replicas to begin with.
    • We define CPU and memory requests and limits for the container. This is crucial: the HPA computes utilization as a percentage of the container's CPU request, so without a request it cannot make scaling decisions.
    • We create an HPA resource that targets our deployment and adjusts the replica count, between 3 and 10, to keep the average CPU utilization across the pods near 50%.
    • We export the number of replicas to observe it from the Pulumi CLI or the Pulumi Console after deployment.
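    The scaling decision itself follows the documented HPA algorithm: desiredReplicas = ceil(currentReplicas × currentMetricValue / desiredMetricValue), clamped to the min/max bounds. A small sketch of that arithmetic, plugging in the bounds and 50% target from the program above (the function name is mine, for illustration):

```python
import math

def desired_replicas(current_replicas: int,
                     current_utilization: float,
                     target_utilization: float,
                     min_replicas: int = 3,
                     max_replicas: int = 10) -> int:
    """HPA scaling rule: ceil(current * current/target), clamped to bounds."""
    raw = math.ceil(current_replicas * current_utilization / target_utilization)
    return max(min_replicas, min(max_replicas, raw))

# 3 pods averaging 80% of their CPU request against a 50% target -> scale up
print(desired_replicas(3, 80, 50))   # 5
# 3 pods averaging 20% -> the min_replicas floor keeps 3 pods running
print(desired_replicas(3, 20, 50))   # 3
```

    In practice the controller also applies tolerances and stabilization windows before acting, so small fluctuations around the target do not cause churn.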

    Please replace "ai-service:latest" with the actual image of your AI service.

    To apply this program, save it as your project's entry point (by default, Pulumi Python projects use __main__.py), ensure you have Pulumi installed and configured for Kubernetes, and run pulumi up to create or update resources according to the Pulumi program.
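    Assuming the Pulumi CLI is installed and a Kubernetes context is configured, a typical session looks roughly like this (the project name is illustrative):

```
# Create a new Pulumi Python project for Kubernetes
pulumi new kubernetes-python --name ai-autoscaling

# Preview, then apply the resources defined in __main__.py
pulumi preview
pulumi up

# Watch the exported replica count after deployment
pulumi stack output ai_service_replicas
```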