1. Kubernetes for Auto-Scaling AI/ML Workloads


    Auto-scaling workloads in Kubernetes is essential for AI/ML applications that often experience variable compute demands. To manage these demands efficiently, Kubernetes provides the Horizontal Pod Autoscaler (HPA), which automatically scales the number of pod replicas in a deployment, replication controller, replica set, or stateful set based on observed CPU utilization (or, with custom metrics support, on some other application-provided metrics).
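    Conceptually, the HPA controller picks a replica count by comparing the observed metric to the target and scaling proportionally, rounding up. A rough sketch of that calculation in plain Python (the numbers are only illustrative):

    import math

    def desired_replicas(current_replicas: int, current_utilization: float, target_utilization: float) -> int:
        # HPA rule: desiredReplicas = ceil(currentReplicas * currentMetric / targetMetric)
        return math.ceil(current_replicas * (current_utilization / target_utilization))

    # Example: 4 replicas at 80% average CPU with a 50% target -> 7 replicas.
    print(desired_replicas(4, 80, 50))

    In practice the controller also applies a tolerance and stabilization windows, so small fluctuations around the target do not cause constant churn.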

    Here's how you can set up a basic auto-scaling environment for an AI/ML workload on Kubernetes:

    1. Deployment: This defines the desired state of your application. For AI/ML workloads, you'll specify the container image that contains your machine learning model or application logic.

    2. Horizontal Pod Autoscaler: This targets a deployment and defines the scaling policy (for example, target CPU utilization, minimum and maximum number of replicas).

    3. Metrics Server: The Horizontal Pod Autoscaler relies on metrics to make scaling decisions. In most cases, you'll need a metrics collection solution such as the Kubernetes Metrics Server deployed in your cluster; it collects resource metrics from the kubelets and exposes them through the Kubernetes Metrics API.

    In the following program, we'll define a Kubernetes Deployment and a Horizontal Pod Autoscaler using Pulumi's Python SDK. The HPA will automatically adjust the number of replicas based on the pods' average CPU utilization.

    import pulumi
    import pulumi_kubernetes as kubernetes

    # Define the Kubernetes Deployment for the AI/ML application.
    ai_ml_deployment = kubernetes.apps.v1.Deployment(
        "ai-ml-deployment",
        spec=kubernetes.apps.v1.DeploymentSpecArgs(
            replicas=1,
            selector=kubernetes.meta.v1.LabelSelectorArgs(
                match_labels={"app": "ai-ml"},
            ),
            template=kubernetes.core.v1.PodTemplateSpecArgs(
                metadata=kubernetes.meta.v1.ObjectMetaArgs(
                    labels={"app": "ai-ml"},
                ),
                spec=kubernetes.core.v1.PodSpecArgs(
                    containers=[
                        kubernetes.core.v1.ContainerArgs(
                            name="ml-container",
                            image="your-ai-ml-container-image:latest",  # Replace with your AI/ML container image.
                            # Define resource requests and limits for your AI/ML container.
                            # These values should be adjusted based on your workload's requirements.
                            resources=kubernetes.core.v1.ResourceRequirementsArgs(
                                requests={"cpu": "500m", "memory": "512Mi"},
                                limits={"cpu": "1000m", "memory": "1Gi"},
                            ),
                        ),
                    ],
                ),
            ),
        ),
    )

    # Define the Horizontal Pod Autoscaler that targets the above deployment.
    # autoscaling/v2 is the stable HPA API in current Kubernetes versions.
    ai_ml_hpa = kubernetes.autoscaling.v2.HorizontalPodAutoscaler(
        "ai-ml-hpa",
        spec=kubernetes.autoscaling.v2.HorizontalPodAutoscalerSpecArgs(
            scale_target_ref=kubernetes.autoscaling.v2.CrossVersionObjectReferenceArgs(
                api_version="apps/v1",
                kind="Deployment",
                name=ai_ml_deployment.metadata.name,
            ),
            min_replicas=1,   # The minimum number of replicas for your AI/ML deployment.
            max_replicas=10,  # The maximum number of replicas for your AI/ML deployment.
            # Define the metrics for autoscaling.
            metrics=[
                kubernetes.autoscaling.v2.MetricSpecArgs(
                    type="Resource",
                    resource=kubernetes.autoscaling.v2.ResourceMetricSourceArgs(
                        name="cpu",
                        target=kubernetes.autoscaling.v2.MetricTargetArgs(
                            type="Utilization",
                            average_utilization=50,  # The target average CPU utilization.
                        ),
                    ),
                ),
            ],
        ),
    )

    # Export the names of the deployment and the HPA.
    pulumi.export("deployment_name", ai_ml_deployment.metadata.name)
    pulumi.export("hpa_name", ai_ml_hpa.metadata.name)

    This program performs the following actions:

    • Creates a Kubernetes deployment named ai-ml-deployment with a single replica.
    • The deployment contains a pod with one container running your AI/ML application (replace your-ai-ml-container-image:latest with your actual container image).
    • The container is configured with both requests and limits for CPU and memory, which helps avoid the noisy-neighbor problem and gives the pod a defined QoS class.
    • It then creates a Horizontal Pod Autoscaler named ai-ml-hpa that targets the just-defined deployment.
    • The HPA is configured to maintain an average CPU utilization of 50% across all pods and will adjust the replica count to hold that average; a sketch of scaling on a custom, application-provided metric instead follows this list.
    • Finally, the deployment and HPA names are exported for easy querying using the Pulumi CLI, which might be useful for CI/CD integrations or automation scripts.
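
    CPU is often a weak proxy for load on inference services, where queue depth or in-flight requests track saturation more closely. Below is a minimal, hypothetical sketch of a Pods-type metric that could replace (or accompany) the CPU entry in the metrics list above. It assumes a custom metrics adapter (for example, the Prometheus Adapter) is installed and exposes a per-pod metric; the name inference_queue_length is purely illustrative and not something Kubernetes provides on its own.

    import pulumi_kubernetes as kubernetes

    # Hypothetical custom-metric spec: scale so that, on average, each pod has
    # at most 5 items in its inference queue. Requires a custom metrics adapter
    # (e.g. Prometheus Adapter) serving the "inference_queue_length" metric.
    custom_metric = kubernetes.autoscaling.v2.MetricSpecArgs(
        type="Pods",
        pods=kubernetes.autoscaling.v2.PodsMetricSourceArgs(
            metric=kubernetes.autoscaling.v2.MetricIdentifierArgs(
                name="inference_queue_length",  # Illustrative metric name.
            ),
            target=kubernetes.autoscaling.v2.MetricTargetArgs(
                type="AverageValue",
                average_value="5",  # Target average value per pod.
            ),
        ),
    )

    This object would be appended to the metrics list in HorizontalPodAutoscalerSpecArgs in place of, or alongside, the CPU-based entry.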

    Please replace "your-ai-ml-container-image:latest" with the actual container image you wish to use for your AI/ML application. Adjust the CPU and memory requests and limits according to the needs of your workload. The autoscaling parameters should be tuned based on the performance characteristics of your application and the responsiveness you desire from the autoscaling system.
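
    One concrete way to tune that responsiveness with the autoscaling/v2 API is the optional behavior field, which controls how aggressively the HPA scales in each direction. The snippet below is a sketch of how the HPA above could be extended; the window lengths and policy values are illustrative, not recommendations.

    import pulumi_kubernetes as kubernetes

    # Optional scaling behavior for the HPA spec above (autoscaling/v2).
    # Values here are illustrative and should be tuned for your workload.
    behavior = kubernetes.autoscaling.v2.HorizontalPodAutoscalerBehaviorArgs(
        scale_up=kubernetes.autoscaling.v2.HPAScalingRulesArgs(
            stabilization_window_seconds=0,  # React to load spikes immediately.
            policies=[
                kubernetes.autoscaling.v2.HPAScalingPolicyArgs(
                    type="Pods", value=4, period_seconds=60,  # Add at most 4 pods per minute.
                ),
            ],
        ),
        scale_down=kubernetes.autoscaling.v2.HPAScalingRulesArgs(
            stabilization_window_seconds=300,  # Wait 5 minutes before scaling down.
            policies=[
                kubernetes.autoscaling.v2.HPAScalingPolicyArgs(
                    type="Percent", value=50, period_seconds=60,  # Remove at most 50% of pods per minute.
                ),
            ],
        ),
    )
    # Pass behavior=behavior inside HorizontalPodAutoscalerSpecArgs to apply it.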

    You need to have Metrics Server running in your cluster for CPU-based autoscaling to work. If you're using a cloud-provided Kubernetes service like GKE, EKS or AKS, the Metrics Server may already be installed or can be easily enabled. For other Kubernetes installations, you may need to deploy Metrics Server yourself.
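
    If you do need to install it yourself, one option is to manage it from the same Pulumi program via the metrics-server Helm chart. The sketch below assumes the kubernetes-sigs chart repository and default chart values; verify the chart version and any TLS-related settings against your cluster before relying on it.

    import pulumi_kubernetes as kubernetes

    # Deploy the Metrics Server into kube-system using its Helm chart.
    # Assumes the kubernetes-sigs chart repository and default values.
    metrics_server = kubernetes.helm.v3.Release(
        "metrics-server",
        chart="metrics-server",
        namespace="kube-system",
        repository_opts=kubernetes.helm.v3.RepositoryOptsArgs(
            repo="https://kubernetes-sigs.github.io/metrics-server/",
        ),
    )

    Depending on how your stack is organized, you may also want the HPA resource to depend on this release (via pulumi.ResourceOptions(depends_on=[...])) so that resource metrics are available by the time the autoscaler starts evaluating them.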