1. Autoscaling AI Model Serving with Kubernetes


    Autoscaling AI model serving is an advanced use case in which a Kubernetes cluster automatically scales the number of pods running your AI models up or down based on the load they receive. This ensures that you have enough resources to serve your models when demand is high, while scaling down to save costs when the load is low.

    Here's a step-by-step guide and Pulumi program that sets up autoscaling for AI model serving on a Kubernetes cluster:

    1. Create a Deployment that defines the desired state for your AI model serving application: which containers to run, which images they use, and how many replicas (pods) should be running.

    2. Configure a HorizontalPodAutoscaler, which automatically scales the number of pod replicas in a Deployment based on observed CPU utilization or on custom metrics reported to Kubernetes.

    3. Create a Service to expose your application to the internet or internally within the cluster, with options for load balancing and service discovery.

    Below is a Pulumi program written in Python:

    import pulumi
    import pulumi_kubernetes as k8s

    # Step 1: Define the AI model serving Deployment
    model_serving_deployment = k8s.apps.v1.Deployment(
        "ai-model-serving",
        spec=k8s.apps.v1.DeploymentSpecArgs(
            replicas=1,  # start with 1 replica
            selector=k8s.meta.v1.LabelSelectorArgs(
                match_labels={"app": "ai-model-serving"},
            ),
            template=k8s.core.v1.PodTemplateSpecArgs(
                metadata=k8s.meta.v1.ObjectMetaArgs(
                    labels={"app": "ai-model-serving"},
                ),
                spec=k8s.core.v1.PodSpecArgs(
                    containers=[k8s.core.v1.ContainerArgs(
                        name="model-serving-container",
                        image="<your-ai-model-serving-container-image>",  # replace with your actual image
                        resources=k8s.core.v1.ResourceRequirementsArgs(
                            limits={"cpu": "500m", "memory": "512Mi"},
                            requests={"cpu": "500m", "memory": "512Mi"},
                        ),
                        ports=[k8s.core.v1.ContainerPortArgs(
                            container_port=80,  # assuming your app serves on port 80/tcp
                        )],
                    )],
                ),
            ),
        ),
    )

    # Step 2: Set up the HorizontalPodAutoscaler for autoscaling
    model_serving_autoscaler = k8s.autoscaling.v2.HorizontalPodAutoscaler(
        "ai-model-autoscaler",
        spec=k8s.autoscaling.v2.HorizontalPodAutoscalerSpecArgs(
            scale_target_ref=k8s.autoscaling.v2.CrossVersionObjectReferenceArgs(
                api_version="apps/v1",
                kind="Deployment",
                name=model_serving_deployment.metadata.name,
            ),
            min_replicas=1,   # minimum number of replicas
            max_replicas=10,  # maximum number of replicas
            # Define the metrics for autoscaling (CPU utilization in this case)
            metrics=[k8s.autoscaling.v2.MetricSpecArgs(
                type="Resource",
                resource=k8s.autoscaling.v2.ResourceMetricSourceArgs(
                    name="cpu",
                    target=k8s.autoscaling.v2.MetricTargetArgs(
                        type="Utilization",
                        average_utilization=80,  # target CPU utilization percentage to scale at
                    ),
                ),
            )],
        ),
    )

    # Step 3: Create a Kubernetes Service to expose the AI model serving application
    model_serving_service = k8s.core.v1.Service(
        "ai-model-serving-service",
        spec=k8s.core.v1.ServiceSpecArgs(
            selector={"app": "ai-model-serving"},
            ports=[k8s.core.v1.ServicePortArgs(
                protocol="TCP",
                port=80,
                target_port=80,
            )],
            type="LoadBalancer",  # Expose the service outside the cluster
        ),
    )

    # Export the model serving service's IP address
    pulumi.export(
        "model_serving_service_ip",
        model_serving_service.status.apply(lambda status: status.load_balancer.ingress[0].ip),
    )
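
    Once you run pulumi up, you can watch the autoscaler react to load with kubectl get hpa and kubectl get pods; the replica count should move between the configured minimum and maximum as average CPU utilization crosses the 80% target.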

    What this program does:

    • Sets up a Kubernetes Deployment named ai-model-serving that starts with a single replica.
    • The Deployment uses a placeholder image for serving the AI model; replace it with the actual image you want to deploy.
    • The HorizontalPodAutoscaler monitors the CPU utilization of the pods and scales the number of replicas between 1 and 10 based on the load (a custom-metric variant is sketched after this list).
    • A Service of type LoadBalancer exposes the deployment on an IP accessible outside of the Kubernetes cluster.
    • Exports the external IP address of the model serving service for easy access.
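
    If CPU utilization is not a good proxy for load on your model server, the same HorizontalPodAutoscaler can target a custom metric instead. The following is only a sketch: it assumes a metrics adapter (for example, Prometheus Adapter) is installed and exposes a per-pod metric named requests_per_second; the metric name and target value are placeholders, and you would use this in place of the CPU-based autoscaler above rather than alongside it.

    # Sketch: scale on a custom per-pod metric instead of CPU.
    # Assumes a metrics adapter exposes a pods metric named "requests_per_second" (hypothetical).
    custom_metric_autoscaler = k8s.autoscaling.v2.HorizontalPodAutoscaler(
        "ai-model-autoscaler-custom",
        spec=k8s.autoscaling.v2.HorizontalPodAutoscalerSpecArgs(
            scale_target_ref=k8s.autoscaling.v2.CrossVersionObjectReferenceArgs(
                api_version="apps/v1",
                kind="Deployment",
                name=model_serving_deployment.metadata.name,
            ),
            min_replicas=1,
            max_replicas=10,
            metrics=[k8s.autoscaling.v2.MetricSpecArgs(
                type="Pods",
                pods=k8s.autoscaling.v2.PodsMetricSourceArgs(
                    metric=k8s.autoscaling.v2.MetricIdentifierArgs(
                        name="requests_per_second",  # hypothetical metric name
                    ),
                    target=k8s.autoscaling.v2.MetricTargetArgs(
                        type="AverageValue",
                        average_value="100",  # placeholder: target average requests per second per pod
                    ),
                ),
            )],
        ),
    )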

    What you need to provide:

    • The image name for your AI model serving container.
    • Any port configurations that your model serving application requires.
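
    Rather than hard-coding these values, you can read them from Pulumi stack configuration. A minimal sketch, assuming hypothetical config keys named modelImage and modelPort:

    import pulumi

    config = pulumi.Config()
    # Hypothetical config keys; set them with `pulumi config set modelImage <image>`
    # and `pulumi config set modelPort <port>`.
    model_image = config.require("modelImage")      # e.g. your registry path and tag
    model_port = config.get_int("modelPort") or 80  # default to port 80 if unset

    # Reference model_image and model_port in ContainerArgs, ContainerPortArgs,
    # and ServicePortArgs instead of the literal values shown above.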

    Before running this program, you'll need to set up Pulumi with a Kubernetes provider and supply your own image for model serving. CPU-based autoscaling also requires that the cluster exposes the resource metrics API, which is typically provided by the metrics-server component. Please note that this is a high-level overview; a production setup would additionally need to address security, appropriate resource limits, monitoring, and logging.
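
    If your cluster does not already run metrics-server (many managed Kubernetes offerings include it), you can install it from the same Pulumi program. This is a sketch using the Helm Release resource; the chart name and repository URL are the ones published by the Kubernetes SIGs project, but verify them against your environment before relying on this:

    # Sketch: install metrics-server so the HorizontalPodAutoscaler can read CPU utilization.
    metrics_server = k8s.helm.v3.Release(
        "metrics-server",
        chart="metrics-server",
        namespace="kube-system",
        repository_opts=k8s.helm.v3.RepositoryOptsArgs(
            repo="https://kubernetes-sigs.github.io/metrics-server/",
        ),
    )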