1. Auto-scaling ML Model Serving with Kubernetes Runner


    Auto-scaling a machine learning model serving system on Kubernetes involves a few pieces. You will need a running Kubernetes cluster and a container image that serves your machine learning model, ready for deployment.

    In this scenario, the HorizontalPodAutoscaler resource from the Kubernetes API is used. This resource can automatically scale the number of pods in a deployment or replica set based on observed CPU utilization or on some other, application-provided metrics.

    Here's how you can create a deployment that serves a machine learning model and an associated HorizontalPodAutoscaler to ensure it can handle varying loads:

    1. Deployment: The Kubernetes deployment object defines the desired state for our ML model serving application. This includes the container image to use (which should contain your pre-built model, ready to serve requests) and other specifications such as the number of initial replicas, resource requests and limits, ports, and so on.

    2. Service: A service is created to route traffic to the pods controlled by the deployment. With type ClusterIP the application is reachable only inside the cluster; use type LoadBalancer (or an Ingress) if it must be exposed outside the cluster.

    3. HorizontalPodAutoscaler: This object will monitor the CPU or memory usage (or a custom metric) of the pods in the deployment and automatically adjust the number of replicas in response to the observed load.

    Let's go through creating a simple Python program with Pulumi that sets up these resources for a fictional machine learning model serving application.

```python
import pulumi
import pulumi_kubernetes as k8s

# Define your machine learning model serving container.
# Replace the image with the Docker image that contains your machine learning model.
model_serving_container = {
    "name": "your-model-serving-application",
    "image": "your-docker-hub-username/your-model-serving-application:latest",
    "ports": [{"containerPort": 8080}],
}

# Define a Kubernetes deployment for the serving app.
app_labels = {"app": "model-serving"}
model_serving_deployment = k8s.apps.v1.Deployment(
    "model-serving-deployment",
    spec={
        "selector": {"matchLabels": app_labels},
        "replicas": 1,  # Start with one replica.
        "template": {
            "metadata": {"labels": app_labels},
            "spec": {
                "containers": [model_serving_container],
            },
        },
    },
)

# Define a Kubernetes service to expose the serving app.
model_serving_service = k8s.core.v1.Service(
    "model-serving-service",
    spec={
        "ports": [{"port": 80, "targetPort": 8080}],
        "selector": app_labels,
        "type": "ClusterIP",  # Change to LoadBalancer if you need external access.
    },
)

# Define a HorizontalPodAutoscaler to automatically scale the serving app.
model_serving_hpa = k8s.autoscaling.v1.HorizontalPodAutoscaler(
    "model-serving-hpa",
    spec={
        "scaleTargetRef": {
            "apiVersion": "apps/v1",
            "kind": "Deployment",
            "name": model_serving_deployment.metadata["name"],
        },
        "minReplicas": 1,
        "maxReplicas": 10,  # Maximum number of replicas.
        "targetCPUUtilizationPercentage": 50,  # Scale up when CPU exceeds 50%.
    },
)

# Export the name of the service. If you switch the service type to
# LoadBalancer, you can instead export its external address from the status.
pulumi.export("model_serving_endpoint", model_serving_service.metadata["name"])
```
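If you do switch the service to type LoadBalancer, the external address is not known until the cloud provider provisions it, so it has to be read from the service's status output. A sketch of that export, assuming the cloud load balancer reports an IP (some providers report a hostname instead):

```python
import pulumi

# Assumes `model_serving_service` is the Service defined above, with
# "type": "LoadBalancer". The status resolves only after deployment.
external_ip = model_serving_service.status.apply(
    lambda status: status.load_balancer.ingress[0].ip
    if status.load_balancer.ingress
    else None
)
pulumi.export("model_serving_external_ip", external_ip)
```

This fragment belongs in the same Pulumi program as the code above and only runs inside a Pulumi stack.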

    The Pulumi program above is a simplified example of how to set up auto-scaling for a model serving application on Kubernetes. The service and deployment specifications will depend on your actual application; in particular, resource requests and limits should be tuned to what your model needs.
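As a sketch of what that tuning might look like, the container definition could carry a resources block. The CPU and memory values below are illustrative placeholders, not recommendations:

```python
# Hypothetical resource settings for the model-serving container; size
# the values to your model's real CPU and memory footprint.
model_serving_container = {
    "name": "your-model-serving-application",
    "image": "your-docker-hub-username/your-model-serving-application:latest",
    "ports": [{"containerPort": 8080}],
    "resources": {
        "requests": {"cpu": "500m", "memory": "512Mi"},  # guaranteed baseline
        "limits": {"cpu": "1", "memory": "1Gi"},         # hard ceiling
    },
}
```

Note that the CPU request also matters for autoscaling: targetCPUUtilizationPercentage is measured relative to the requested CPU, so the HPA cannot work on CPU utilization if the container has no CPU request.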

    Please remember to replace the image in the model_serving_container variable with the path to your actual Docker image that contains your machine learning model.

    The HorizontalPodAutoscaler is targeting CPU utilization to determine when to scale the deployment. If the observed CPU utilization exceeds 50%, it will increase the number of pods, and if it falls below that threshold, it will reduce the number of pods, within the defined minimum and maximum replica range.
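The scaling behavior described above follows the HPA's documented rule, desiredReplicas = ceil(currentReplicas * currentMetric / targetMetric), clamped to the min/max range. A small illustrative sketch of that rule (not the controller's actual code, which also applies tolerances and stabilization windows):

```python
import math

def desired_replicas(current_replicas: int,
                     current_cpu_pct: float,
                     target_cpu_pct: float = 50.0,
                     min_replicas: int = 1,
                     max_replicas: int = 10) -> int:
    """Approximate the HPA rule: scale proportionally to the ratio of
    observed to target utilization, clamped to [min, max] replicas."""
    desired = math.ceil(current_replicas * current_cpu_pct / target_cpu_pct)
    return max(min_replicas, min(max_replicas, desired))

# CPU at 100% with 2 replicas and a 50% target -> double to 4 replicas.
print(desired_replicas(2, 100.0))  # 4
# CPU at 20% with 4 replicas -> scale down toward 2.
print(desired_replicas(4, 20.0))   # 2
```

With the values used in the program above (target 50%, 1 to 10 replicas), a sustained spike drives the count toward 10, and an idle service drifts back toward 1.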

    For more about Kubernetes resources and their properties in Pulumi, see the Pulumi Kubernetes provider documentation.