1. Scaling AI Model Serving with Kubernetes Autoscaling


    Scaling AI model serving can be managed efficiently with Kubernetes' built-in autoscaling. Kubernetes autoscaling automatically adjusts the number of running Pods based on observed CPU utilization or custom metrics, so your AI models are served with the resources they need while avoiding over-provisioning and costly idle capacity.

    There are two primary resources involved in Kubernetes autoscaling:

    1. Horizontal Pod Autoscaler (HPA): This resource automatically scales the number of Pods in a Deployment, ReplicaSet, or StatefulSet based on observed CPU utilization or custom metrics. It increases or decreases the number of Pods to meet the demand.

    2. Deployment: While not strictly an autoscaling resource, a Deployment defines the desired state of your application. It allows Kubernetes to manage and scale your application with the correct number of Pods. We'll need a Deployment to which the HPA can attach.

    Here's a simple Pulumi program in Python that creates a Kubernetes Deployment and sets up an HPA to scale our AI model service. This example uses the pulumi_kubernetes module.

    First, ensure that you have the pulumi and pulumi_kubernetes modules installed:

    pip install pulumi pulumi_kubernetes
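
    By default, Pulumi uses your current kubeconfig context. If you need to target a specific cluster instead, you can optionally create an explicit Kubernetes provider; here is a minimal sketch, where the context name "my-cluster-context" is a placeholder:

    import pulumi
    import pulumi_kubernetes as kubernetes

    # Optional: point Pulumi at a specific kubeconfig context instead of the
    # ambient default ("my-cluster-context" is a placeholder).
    k8s_provider = kubernetes.Provider(
        "ai-serving-cluster",
        context="my-cluster-context",
    )

    # Pass opts=pulumi.ResourceOptions(provider=k8s_provider) to the resources
    # below if you want them created through this provider.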

    Now, let's create the Pulumi program:

    import pulumi
    import pulumi_kubernetes as kubernetes

    # Name for our resources
    app_name = "ai-model-service"

    # Configuring a Kubernetes Deployment for the AI model serving application
    app_labels = {"app": app_name}
    deployment = kubernetes.apps.v1.Deployment(
        app_name,
        spec=kubernetes.apps.v1.DeploymentSpecArgs(
            replicas=1,  # Starting with one Pod
            selector=kubernetes.meta.v1.LabelSelectorArgs(match_labels=app_labels),
            template=kubernetes.core.v1.PodTemplateSpecArgs(
                metadata=kubernetes.meta.v1.ObjectMetaArgs(labels=app_labels),
                spec=kubernetes.core.v1.PodSpecArgs(
                    containers=[kubernetes.core.v1.ContainerArgs(
                        name=app_name,
                        image="my-ai-model-serving-image:latest",  # Replace with your image
                        # Define the resource requests/limits for your application.
                        # A CPU request is required for the HPA's CPU-utilization
                        # target below to be meaningful.
                        resources=kubernetes.core.v1.ResourceRequirementsArgs(
                            requests={"cpu": "100m"},
                            limits={"cpu": "200m"},
                        ),
                        ports=[kubernetes.core.v1.ContainerPortArgs(container_port=80)],
                    )],
                ),
            ),
        ),
    )

    # Setting up a Horizontal Pod Autoscaler for the Deployment
    hpa = kubernetes.autoscaling.v1.HorizontalPodAutoscaler(
        app_name,
        # Metadata provides additional info like labels and annotations -
        # you can add more according to your needs.
        metadata=kubernetes.meta.v1.ObjectMetaArgs(labels=app_labels),
        spec=kubernetes.autoscaling.v1.HorizontalPodAutoscalerSpecArgs(
            scale_target_ref=kubernetes.autoscaling.v1.CrossVersionObjectReferenceArgs(
                api_version="apps/v1",
                kind="Deployment",
                # Reference the generated name of the Deployment created above
                name=deployment.metadata["name"],
            ),
            min_replicas=1,   # Minimum number of Pods
            max_replicas=10,  # Maximum number of Pods
            # Target CPU utilization for scaling (as a percentage of the requested CPU)
            target_cpu_utilization_percentage=50,
        ),
    )

    # Export the names of the Deployment and the HPA
    pulumi.export('deployment_name', deployment.metadata['name'])
    pulumi.export('hpa_name', hpa.metadata['name'])

    This program begins by defining a Kubernetes Deployment with a single Pod. The Pod runs one container with your AI model serving image and requests a baseline of CPU (100m requested, 200m limit). The CPU request matters for autoscaling: the HPA's utilization target is measured as a percentage of it. Memory can be requested and limited in the same way, as sketched below.
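
    The example above only sets CPU values; if your model server is memory-hungry, the same ResourceRequirementsArgs can carry memory values as well. A minimal sketch, with placeholder sizes you should adjust to your model's actual footprint:

    import pulumi_kubernetes as kubernetes

    # Hypothetical sizing: tune these to your model's real memory footprint.
    resources = kubernetes.core.v1.ResourceRequirementsArgs(
        requests={"cpu": "100m", "memory": "512Mi"},
        limits={"cpu": "200m", "memory": "1Gi"},
    )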

    Next, we define a HorizontalPodAutoscaler (HPA). Its scale_target_ref points at the Deployment created above (using the Deployment's generated name). The HPA monitors the average CPU utilization of the Pods managed by that Deployment and automatically scales the number of replicas up or down to meet the 50% utilization target, keeping the replica count between 1 and 10 as specified by min_replicas and max_replicas. The scaling rule itself is sketched below.
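
    For intuition, the HPA computes desiredReplicas = ceil(currentReplicas * currentMetricValue / targetMetricValue). A quick worked example against the 50% target configured above, where the observed utilization figure is hypothetical:

    import math

    current_replicas = 4
    current_cpu_utilization = 90  # average % of requested CPU across Pods (hypothetical)
    target_cpu_utilization = 50   # the target configured in the HPA above

    # desiredReplicas = ceil(currentReplicas * currentMetric / targetMetric)
    desired = math.ceil(current_replicas * current_cpu_utilization / target_cpu_utilization)
    print(desired)  # 8 -- still within the 1..10 bounds set by min_replicas/max_replicas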

    Finally, we export two variables – the names of the Deployment and the HPA – so that they can be easily accessed outside of the Pulumi program.

    To apply your Pulumi program, run:

    pulumi up

    This command previews the changes and, after confirmation, deploys them. The AI model serving application is then running with autoscaling in place: as demand for model predictions changes, Kubernetes adapts the number of Pods to keep operations efficient and cost-effective.
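
    Once the update completes, the exported names can be read back with the Pulumi CLI, for example:

    pulumi stack output deployment_name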