1. Auto-Scaling Inference Services with Kubernetes


    Auto-scaling in Kubernetes automatically adjusts the number of running pods in a deployment based on the current load or on other metrics (such as CPU or memory usage). It ensures that your applications have the resources they need, when they need them, without over-provisioning and incurring unnecessary cost.
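    As a rough illustration of the arithmetic behind this, the Horizontal Pod Autoscaler introduced below computes its desired replica count as ceil(currentReplicas * currentMetricValue / targetMetricValue); the real controller also applies tolerances and stabilization windows. The numbers in this minimal sketch are hypothetical:

    import math

    # Hypothetical numbers: 2 running pods averaging 90% CPU against a 50% target.
    current_replicas = 2
    current_cpu_utilization = 90  # observed average, in percent
    target_cpu_utilization = 50   # autoscaler target, in percent

    desired_replicas = math.ceil(current_replicas * current_cpu_utilization / target_cpu_utilization)
    print(desired_replicas)  # 4 -- the autoscaler would scale the deployment up to 4 pods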

    To implement auto-scaling for inference services or any other services running in a Kubernetes cluster, you'll typically need two main components:

    1. A Horizontal Pod Autoscaler (HPA): It automatically scales the number of pods in a replication controller, deployment, replica set, or stateful set based on observed CPU utilization or, with custom metrics support, on some other application-provided metrics.

    2. Metrics Server: It is an aggregator of resource usage data in your cluster, and it is required for the HPA to function.

    In a Pulumi program, you can use the kubernetes.autoscaling.v1.HorizontalPodAutoscaler class to create an HPA; it corresponds directly to the HPA resource in the Kubernetes API. Below is a Python program that demonstrates how to create an auto-scaling deployment in Kubernetes using Pulumi.

    import pulumi
    import pulumi_kubernetes as k8s

    # Define the application's deployment.
    app_labels = {"app": "inference-service"}

    deployment = k8s.apps.v1.Deployment(
        "inference-deployment",
        spec={
            "selector": {"matchLabels": app_labels},
            "replicas": 1,
            "template": {
                "metadata": {"labels": app_labels},
                "spec": {
                    "containers": [
                        {
                            "name": "inference-container",
                            "image": "my-inference-service:latest",
                            # Define other container properties as required by the inference service.
                        }
                    ]
                },
            },
        })

    # Define a Horizontal Pod Autoscaler for the deployment, targeting 50% CPU utilization.
    hpa = k8s.autoscaling.v1.HorizontalPodAutoscaler(
        "inference-hpa",
        spec={
            "maxReplicas": 10,
            "minReplicas": 1,
            "scaleTargetRef": {
                "apiVersion": "apps/v1",
                "kind": "Deployment",
                "name": deployment.metadata["name"],
            },
            "targetCPUUtilizationPercentage": 50,
        },
        metadata={"labels": app_labels})

    # Export the name of the deployment.
    pulumi.export("deployment_name", deployment.metadata["name"])

    This Pulumi program performs the following steps:

    1. It declares the pod labels as a dictionary; these labels tie the pod template to the Deployment's selector and are also attached to the HPA's metadata.
    2. It creates a Kubernetes Deployment, which describes the desired state of the inference service: the initial number of replicas and the pod template (container image and other settings).
    3. It then creates a HorizontalPodAutoscaler, which targets the deployment created in step 2 via scaleTargetRef. The HPA is configured to maintain an average CPU utilization of 50% across all pods and to scale the replica count between 1 and 10 to hold that target (an equivalent written against the newer autoscaling/v2 API is sketched after this list).
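
    If you later want to scale on metrics other than CPU (memory or the application-provided metrics mentioned earlier), the newer autoscaling/v2 API is the usual route. As a minimal sketch, the same 1-10 replica range and 50% CPU target could be expressed with Pulumi's autoscaling.v2 class as follows; it assumes the deployment object from the program above and is an alternative to, not a requirement of, the v1 example:

    # Sketch: the same scaling policy written against the autoscaling/v2 API,
    # which also accepts memory and custom/external metrics.
    hpa_v2 = k8s.autoscaling.v2.HorizontalPodAutoscaler(
        "inference-hpa-v2",
        spec={
            "minReplicas": 1,
            "maxReplicas": 10,
            "scaleTargetRef": {
                "apiVersion": "apps/v1",
                "kind": "Deployment",
                "name": deployment.metadata["name"],  # the Deployment defined earlier
            },
            "metrics": [
                {
                    "type": "Resource",
                    "resource": {
                        "name": "cpu",
                        "target": {"type": "Utilization", "averageUtilization": 50},
                    },
                }
            ],
        })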

    To deploy this on a Kubernetes cluster with the Pulumi CLI, make sure Pulumi is installed and configured to access your cluster, and that the Pulumi Python packages used above (pulumi and pulumi-kubernetes) are installed. Save this code in a __main__.py file and run pulumi up to deploy your changes. The deployment's name is exported at the end of the program, which is handy if you need to query or manage the deployment outside of Pulumi.

    To make this work, your Kubernetes cluster must have the Metrics Server deployed so that the HPA can retrieve metrics on which to base its scaling decisions. If it's not already installed, you can install it using Pulumi or by other means, such as a Helm chart (a Pulumi-based sketch follows). The Horizontal Pod Autoscaler will then adjust the number of pod replicas based on the CPU usage reported by the Metrics Server.
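
    If you manage cluster add-ons with Pulumi as well, one option is to install the Metrics Server from its Helm chart within the same program. This is a sketch only; the chart name, repository URL, and target namespace below are assumptions to verify against the metrics-server documentation:

    # Sketch: install the Metrics Server via its Helm chart (verify chart/repo/namespace).
    metrics_server = k8s.helm.v3.Release(
        "metrics-server",
        chart="metrics-server",
        namespace="kube-system",
        repository_opts=k8s.helm.v3.RepositoryOptsArgs(
            repo="https://kubernetes-sigs.github.io/metrics-server/",
        ),
    )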