1. Auto-Scaling ML Workloads with Kubernetes Horizontal Pod Autoscaler


    The Kubernetes Horizontal Pod Autoscaler (HPA) automatically scales the number of pods in a deployment or replica set based on observed CPU utilization, memory usage, or other supported metrics. It is useful for ML workloads whose load varies over time and which need more replicas at peak times to maintain performance.
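    To make the scaling behaviour concrete: for each metric, the HPA controller computes the desired replica count from the ratio of the observed value to the target value (ignoring the tolerance and stabilization windows it also applies). The short Python sketch below only illustrates that documented formula; it is not part of the Pulumi program.

    import math

    def desired_replicas(current_replicas: int, current_utilization: float, target_utilization: float) -> int:
        # desiredReplicas = ceil(currentReplicas * currentMetricValue / desiredMetricValue)
        return math.ceil(current_replicas * current_utilization / target_utilization)

    # Example: 4 pods averaging 80% CPU against a 50% target scale out to 7 pods.
    print(desired_replicas(4, 80, 50))  # 7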

    Below I will guide you through the basics of setting up auto-scaling for ML workloads using Pulumi and the Kubernetes HPA. The program creates an HPA resource that targets a Kubernetes deployment you specify.

    The main components to set up auto-scaling for ML workloads are:

    • A Kubernetes Deployment that runs your ML application.
    • A Service to expose the application, if it needs to be accessible.
    • A HorizontalPodAutoscaler to automatically scale the deployment based on defined rules.

    In the Pulumi program, we first make sure a deployment exists for the HPA to target, and then define the HPA with the rules for scaling up and down.

    In this example, let's consider the scenario where an ML application is packaged in a Docker image, and we want to scale based on CPU utilization.
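    If the deployment does not exist yet, a minimal version of it (plus a Service to expose it) could look like the following sketch. The image name, port, and resource values are placeholders to replace with your own; note that CPU-utilization scaling only works if the containers declare CPU requests, since utilization is measured relative to those requests. If you create the deployment in the same Pulumi program, you can also pass the resource object to the HPA directly instead of looking it up with .get() as done further below.

    import pulumi_kubernetes as k8s

    app_name = 'ml-app'
    app_labels = {'app': app_name}

    # Hypothetical Deployment running the containerized ML application.
    ml_deployment = k8s.apps.v1.Deployment(
        'ml-app-deployment',
        metadata=k8s.meta.v1.ObjectMetaArgs(name=app_name, namespace='default'),
        spec=k8s.apps.v1.DeploymentSpecArgs(
            replicas=2,
            selector=k8s.meta.v1.LabelSelectorArgs(match_labels=app_labels),
            template=k8s.core.v1.PodTemplateSpecArgs(
                metadata=k8s.meta.v1.ObjectMetaArgs(labels=app_labels),
                spec=k8s.core.v1.PodSpecArgs(
                    containers=[k8s.core.v1.ContainerArgs(
                        name=app_name,
                        image='my-registry/ml-app:latest',  # placeholder image
                        ports=[k8s.core.v1.ContainerPortArgs(container_port=8080)],
                        # CPU requests are required for utilization-based autoscaling.
                        resources=k8s.core.v1.ResourceRequirementsArgs(
                            requests={'cpu': '500m', 'memory': '512Mi'},
                        ),
                    )],
                ),
            ),
        ),
    )

    # Optional Service exposing the application inside the cluster.
    ml_service = k8s.core.v1.Service(
        'ml-app-service',
        metadata=k8s.meta.v1.ObjectMetaArgs(name=app_name, namespace='default'),
        spec=k8s.core.v1.ServiceSpecArgs(
            selector=app_labels,
            ports=[k8s.core.v1.ServicePortArgs(port=80, target_port=8080)],
        ),
    )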

    Here is a Pulumi program that sets up auto-scaling for such a workload:

    import pulumi
    import pulumi_kubernetes as k8s
    from pulumi_kubernetes.autoscaling.v2beta2 import HorizontalPodAutoscaler

    # Assume that the user already has a deployment they want to scale.
    # This is how you would get a reference to an existing deployment in the
    # namespace 'default' (the ID passed to .get() has the form '<namespace>/<name>').
    app_name = 'ml-app'
    app_labels = {'app': app_name}

    existing_deployment = k8s.apps.v1.Deployment.get(
        'existing-deployment',
        f'default/{app_name}'
    )

    # Define the HPA
    ml_app_hpa = HorizontalPodAutoscaler(
        'ml-app-hpa',
        metadata=k8s.meta.v1.ObjectMetaArgs(
            name='ml-app-hpa',
            namespace='default'
        ),
        spec=k8s.autoscaling.v2beta2.HorizontalPodAutoscalerSpecArgs(
            max_replicas=10,  # Maximum number of replicas to which the application can be scaled
            min_replicas=2,   # Minimum number of replicas of the application
            scale_target_ref=k8s.autoscaling.v2beta2.CrossVersionObjectReferenceArgs(
                api_version='apps/v1',
                kind='Deployment',
                name=existing_deployment.metadata['name']
            ),
            metrics=[k8s.autoscaling.v2beta2.MetricSpecArgs(
                type='Resource',
                resource=k8s.autoscaling.v2beta2.ResourceMetricSourceArgs(
                    name='cpu',
                    target=k8s.autoscaling.v2beta2.MetricTargetArgs(
                        # Target percentage of CPU utilization at which to scale
                        type='Utilization',
                        average_utilization=50,
                    )
                )
            )]
        )
    )

    pulumi.export('horizontal_pod_autoscaler_name', ml_app_hpa.metadata['name'])

    In this program:

    • We define app_name and app_labels, the name and labels that identify the existing ML application deployment you wish to auto-scale.
    • We use pulumi_kubernetes.apps.v1.Deployment.get to look up the existing deployment by its ID, given in the form '<namespace>/<name>'.
    • We set up the HorizontalPodAutoscaler with a spec whose min_replicas and max_replicas bound the minimum and maximum number of pods that may run.
    • Under metrics, we tell the autoscaler to use CPU utilization as the scaling metric. Setting average_utilization to 50 means the HPA adds pods when average CPU utilization across the pods rises above 50% of their requested CPU, and removes pods when it falls well below that target.
    • A CrossVersionObjectReferenceArgs is used to reference the target deployment we want to scale based on the defined metrics.
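    One caveat about API versions: autoscaling/v2beta2 is deprecated and was removed in Kubernetes 1.26, so on current clusters you would define the same HPA against the stable autoscaling/v2 API. The spec has the same shape; assuming the same existing_deployment reference, an equivalent resource might look like this sketch:

    # Equivalent HPA using the stable autoscaling/v2 API (GA since Kubernetes 1.23).
    ml_app_hpa = k8s.autoscaling.v2.HorizontalPodAutoscaler(
        'ml-app-hpa',
        metadata=k8s.meta.v1.ObjectMetaArgs(name='ml-app-hpa', namespace='default'),
        spec=k8s.autoscaling.v2.HorizontalPodAutoscalerSpecArgs(
            min_replicas=2,
            max_replicas=10,
            scale_target_ref=k8s.autoscaling.v2.CrossVersionObjectReferenceArgs(
                api_version='apps/v1',
                kind='Deployment',
                name=existing_deployment.metadata['name'],
            ),
            metrics=[k8s.autoscaling.v2.MetricSpecArgs(
                type='Resource',
                resource=k8s.autoscaling.v2.ResourceMetricSourceArgs(
                    name='cpu',
                    target=k8s.autoscaling.v2.MetricTargetArgs(
                        type='Utilization',
                        average_utilization=50,
                    ),
                ),
            )],
        ),
    )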

    The final pulumi.export publishes the HPA's name as a stack output, which you can read back later with pulumi stack output horizontal_pod_autoscaler_name.

    Make sure the Pulumi CLI is installed and your Kubernetes cluster context is set correctly so Pulumi can reach the cluster, and note that resource-metric autoscaling requires the Kubernetes Metrics Server (or another metrics provider) to be running in the cluster. Run the program with pulumi up to create the resources and pulumi destroy to clean them up. By setting up the Horizontal Pod Autoscaler you instruct Kubernetes to manage the number of pods dynamically, which is essential for handling the varying load of ML applications efficiently.