Auto-Scaling ML Pipelines with Kubernetes HPA
To set up auto-scaling for machine learning (ML) pipelines in a Kubernetes environment, we can leverage the Horizontal Pod Autoscaler (HPA). The HPA automatically scales the number of pod replicas in a Deployment or ReplicaSet based on observed CPU utilization or other selected metrics, adjusting the replica count so that the observed average CPU utilization matches the target specified by the user.
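The scaling decision itself follows a simple formula from the Kubernetes documentation: desiredReplicas = ceil(currentReplicas × currentMetricValue / desiredMetricValue). Here is a minimal sketch of that arithmetic in Python; it is illustrative only, not code the HPA controller actually runs:

```python
import math

def desired_replicas(current_replicas: int,
                     current_cpu_utilization: float,
                     target_cpu_utilization: float) -> int:
    """Illustrative version of the HPA scaling formula:
    desiredReplicas = ceil(currentReplicas * currentMetric / targetMetric)."""
    return math.ceil(current_replicas * current_cpu_utilization / target_cpu_utilization)

# Example: 2 replicas averaging 120% CPU against an 80% target -> 3 replicas.
print(desired_replicas(2, 120, 80))  # 3
```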
Below, I am going to illustrate a Pulumi program in Python that defines an ML pipeline deployment and a corresponding HPA resource. The program will:
- Create a Kubernetes Deployment that defines the desired state of the ML pipeline, including the container image to run.
- Define an HPA resource associated with the Deployment created in the first step. The HPA will target a certain average CPU utilization across all pods controlled by the deployment, scaling the number of replicas up or down based on this target.
Assumptions:
- You have a Kubernetes cluster up and running.
- The Pulumi CLI and the necessary providers are already set up.
- The Docker image for the ML pipeline is available (replace `your-ml-pipeline-image:latest` in the code with your actual image).
- You have configured `kubectl` to communicate with your Kubernetes cluster.
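If you want to verify that last assumption programmatically, a quick sanity check is possible with the official `kubernetes` Python client. This is an optional sketch, assuming the client is installed (`pip install kubernetes`):

```python
from kubernetes import client, config

# Load credentials from the same kubeconfig that kubectl uses.
config.load_kube_config()

# Query the API server's version; a successful call confirms connectivity.
version = client.VersionApi().get_code()
print(f"Connected to Kubernetes {version.git_version}")
```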
Now, let's start with the Pulumi program:
```python
import pulumi
import pulumi_kubernetes as k8s

# Define the ML pipeline deployment.
ml_pipeline_deployment = k8s.apps.v1.Deployment(
    "ml-pipeline-deployment",
    spec=k8s.apps.v1.DeploymentSpecArgs(
        replicas=2,  # Starting number of replicas
        selector=k8s.meta.v1.LabelSelectorArgs(
            match_labels={"app": "ml-pipeline"},
        ),
        template=k8s.core.v1.PodTemplateSpecArgs(
            metadata=k8s.meta.v1.ObjectMetaArgs(
                labels={"app": "ml-pipeline"},
            ),
            spec=k8s.core.v1.PodSpecArgs(
                containers=[
                    k8s.core.v1.ContainerArgs(
                        name="ml-pipeline-container",
                        image="your-ml-pipeline-image:latest",  # Replace with your actual image
                        resources=k8s.core.v1.ResourceRequirementsArgs(
                            requests={
                                "cpu": "500m",  # Requested CPU resources
                            },
                            limits={
                                "cpu": "1000m",  # CPU resource limits
                            },
                        ),
                    ),
                ],
            ),
        ),
    ),
)

# Define the Horizontal Pod Autoscaler.
ml_pipeline_hpa = k8s.autoscaling.v1.HorizontalPodAutoscaler(
    "ml-pipeline-hpa",
    metadata=k8s.meta.v1.ObjectMetaArgs(
        name="ml-pipeline-hpa",
        labels={"app": "ml-pipeline"},
    ),
    spec=k8s.autoscaling.v1.HorizontalPodAutoscalerSpecArgs(
        scale_target_ref=k8s.autoscaling.v1.CrossVersionObjectReferenceArgs(
            api_version="apps/v1",
            kind="Deployment",
            name=ml_pipeline_deployment.metadata.name,
        ),
        min_replicas=1,   # Minimum number of replicas
        max_replicas=10,  # Maximum number of replicas
        target_cpu_utilization_percentage=80,  # Target CPU utilization percentage
    ),
)

# Export the name of the deployment.
pulumi.export("ml_pipeline_deployment_name", ml_pipeline_deployment.metadata.name)

# Export the name of the HPA.
pulumi.export("ml_pipeline_hpa_name", ml_pipeline_hpa.metadata.name)
```
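The program above uses the `autoscaling/v1` API, which only supports CPU-based scaling. If your cluster serves `autoscaling/v2`, the same HPA can be expressed with the richer metrics API, which also opens the door to memory and custom metrics. Here is a sketch of the equivalent v2 resource, assuming your pulumi_kubernetes version exposes the `autoscaling.v2` module:

```python
# Equivalent HPA using the autoscaling/v2 metrics API (CPU utilization target of 80%).
ml_pipeline_hpa_v2 = k8s.autoscaling.v2.HorizontalPodAutoscaler(
    "ml-pipeline-hpa-v2",
    spec=k8s.autoscaling.v2.HorizontalPodAutoscalerSpecArgs(
        scale_target_ref=k8s.autoscaling.v2.CrossVersionObjectReferenceArgs(
            api_version="apps/v1",
            kind="Deployment",
            name=ml_pipeline_deployment.metadata.name,
        ),
        min_replicas=1,
        max_replicas=10,
        metrics=[
            k8s.autoscaling.v2.MetricSpecArgs(
                type="Resource",
                resource=k8s.autoscaling.v2.ResourceMetricSourceArgs(
                    name="cpu",
                    target=k8s.autoscaling.v2.MetricTargetArgs(
                        type="Utilization",
                        average_utilization=80,
                    ),
                ),
            ),
        ],
    ),
)
```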
Here's what each part of this Pulumi program is doing:
- The `ml_pipeline_deployment` resource creates a Kubernetes Deployment that specifies the desired state for the ML pipeline. We start with two replicas, and the pods carry the `app: ml-pipeline` label so the HPA can identify them.
- The container in the deployment specifies CPU resource requests and limits. Requests reserve a guaranteed amount of CPU for the container, while limits cap the maximum it can use. Note that the HPA measures utilization relative to the request, not the limit (see the sketch after this list).
- The `ml_pipeline_hpa` resource defines a Horizontal Pod Autoscaler that targets the Deployment created above. The `scale_target_ref` links the HPA to the Deployment by its API version, kind, and name.
- The `min_replicas` and `max_replicas` fields set the lower and upper bounds for the number of replicas the HPA can scale to.
- The `target_cpu_utilization_percentage` is the average CPU utilization the HPA tries to maintain across all the pods; here it is set to 80%.
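To make that last point concrete: with a request of 500m, the 80% target corresponds to an average usage of 400m per pod. A minimal sketch of the arithmetic (illustrative only; the real computation happens in the HPA controller using metrics-server data):

```python
# CPU utilization as the HPA sees it: usage divided by the container's *request*.
cpu_request_millicores = 500   # matches the "500m" request in the deployment
target_utilization_pct = 80    # matches target_cpu_utilization_percentage

# The usage level at which the deployment sits exactly at the target:
threshold_millicores = cpu_request_millicores * target_utilization_pct / 100
print(f"Scale-up begins above {threshold_millicores:.0f}m average usage per pod")  # 400m
```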
Finally, we export the names of the deployment and the HPA so they can be easily retrieved from the Pulumi stack.
To run this program, save it in a file (for example `autoscaling_ml_pipeline.py`), then deploy it with the Pulumi CLI by running `pulumi up`. Assuming you have named your Pulumi project and your Kubernetes cluster is correctly configured, this will set up autoscaling for your ML pipeline.