1. Auto-scaling ML Model Deployments with Kubernetes


    Auto-scaling Machine Learning (ML) model deployments with Kubernetes lets you handle varying load efficiently, dynamically adjusting the number of running pods to match current demand while keeping costs in check. Kubernetes provides the Horizontal Pod Autoscaler (HPA), a resource that automatically scales the number of pod replicas in a Deployment or ReplicaSet based on observed CPU utilization or other selected metrics.
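    Under the hood, the HPA control loop computes the desired replica count as desiredReplicas = ceil(currentReplicas × currentMetricValue / targetMetricValue). Here is a small, purely illustrative Python sketch of that formula; it is explanatory only and not part of the deployment code:

    ```python
    import math

    # Illustrative sketch of the HPA control-loop formula:
    # desiredReplicas = ceil(currentReplicas * currentMetric / targetMetric)
    def desired_replicas(current_replicas: int, current_cpu_pct: float, target_cpu_pct: float) -> int:
        return math.ceil(current_replicas * (current_cpu_pct / target_cpu_pct))

    # 4 pods averaging 120% of their CPU request, with an 80% target -> scale out to 6 pods.
    print(desired_replicas(4, 120.0, 80.0))  # 6
    ```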

    Here's a step-by-step guide with a Pulumi Python program that creates a Kubernetes Deployment for an ML model and sets up auto-scaling using HPA:

    1. Define the ML Model Deployment: We'll create a Kubernetes Deployment that defines the desired state of our ML model service. This includes the container image for the ML model, the number of replicas, resource requests (to specify the minimum amount of CPU and memory the container needs), and other relevant settings.

    2. Set Up the Horizontal Pod Autoscaler: We'll define an HPA that targets the deployment and sets the criteria for scaling up or down (e.g., a target CPU utilization percentage).

    3. Export Relevant Information: After provisioning the resources, we'll export the deployment name and the HPA name.

    Below is the Pulumi program written in Python to accomplish this:

    ```python
    import pulumi
    import pulumi_kubernetes as k8s

    # Define the ML model deployment.
    # Replace 'YOUR_CONTAINER_IMAGE' with the Docker image URI of your ML model.
    ml_model_deployment = k8s.apps.v1.Deployment(
        "ml-model-deployment",
        spec=k8s.apps.v1.DeploymentSpecArgs(
            replicas=1,  # Start with a single replica.
            selector=k8s.meta.v1.LabelSelectorArgs(
                match_labels={"app": "ml-model"},
            ),
            template=k8s.core.v1.PodTemplateSpecArgs(
                metadata=k8s.meta.v1.ObjectMetaArgs(
                    labels={"app": "ml-model"},
                ),
                spec=k8s.core.v1.PodSpecArgs(
                    containers=[
                        k8s.core.v1.ContainerArgs(
                            name="ml-container",
                            image="YOUR_CONTAINER_IMAGE",
                            ports=[k8s.core.v1.ContainerPortArgs(container_port=80)],
                            resources=k8s.core.v1.ResourceRequirementsArgs(
                                requests={  # Minimum resources required.
                                    "cpu": "500m",
                                    "memory": "512Mi",
                                },
                                limits={  # Maximum resources allowed.
                                    "cpu": "1",
                                    "memory": "1Gi",
                                },
                            ),
                        ),
                    ],
                ),
            ),
        ),
    )

    # Set up a horizontal pod autoscaler to auto-scale the ML model deployment.
    ml_model_hpa = k8s.autoscaling.v1.HorizontalPodAutoscaler(
        "ml-model-hpa",
        spec=k8s.autoscaling.v1.HorizontalPodAutoscalerSpecArgs(
            scale_target_ref=k8s.autoscaling.v1.CrossVersionObjectReferenceArgs(
                api_version="apps/v1",
                kind="Deployment",
                name=ml_model_deployment.metadata.apply(lambda metadata: metadata.name),
            ),
            min_replicas=1,   # Minimum number of replicas.
            max_replicas=10,  # Maximum number of replicas.
            target_cpu_utilization_percentage=80,  # Target CPU utilization that triggers scaling.
        ),
    )

    # Export the deployment name and the HPA name.
    pulumi.export('deployment_name', ml_model_deployment.metadata.apply(lambda metadata: metadata.name))
    pulumi.export('hpa_name', ml_model_hpa.metadata.apply(lambda metadata: metadata.name))
    ```

    In the above program, replace 'YOUR_CONTAINER_IMAGE' with the URI of the Docker image that hosts your ML model.
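    If you prefer not to hardcode the image URI, one option is to read it from Pulumi stack configuration instead. This is a minimal sketch, assuming a config key named mlModelImage (an arbitrary name chosen for this example, not something the program above requires):

    ```python
    import pulumi

    config = pulumi.Config()
    # Read the image URI from stack configuration rather than hardcoding it.
    # Set it with: pulumi config set mlModelImage <your-image-uri>
    ml_model_image = config.require("mlModelImage")

    # ...then pass it to the container definition instead of the literal string:
    # image=ml_model_image
    ```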

    The ml_model_deployment resource defines a Kubernetes Deployment that starts with a single replica. It specifies the container image to use, exposes port 80, and declares the CPU and memory requests and limits for the container.

    The ml_model_hpa resource targets the deployment created above and specifies the auto-scaling policy: when average CPU utilization across the pods rises above the 80% target, Kubernetes adds replicas, and when it falls back below the target it removes them, always staying within the defined range of 1-10 replicas. Utilization is measured relative to each container's CPU request (500m here), and the cluster's metrics server must be running for the HPA to act.
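    Once everything is running, you can watch the autoscaler react by inspecting its status. Below is a minimal sketch using the official Kubernetes Python client; it is separate from the Pulumi program, and because Pulumi appends a random suffix to resource names, you should pass the name from the hpa_name stack output rather than the placeholder used here:

    ```python
    from kubernetes import client, config

    # Load credentials from your local kubeconfig.
    config.load_kube_config()

    autoscaling = client.AutoscalingV1Api()
    # Use the actual HPA name from the `hpa_name` stack output;
    # "ml-model-hpa" below is only a placeholder.
    hpa = autoscaling.read_namespaced_horizontal_pod_autoscaler(
        name="ml-model-hpa", namespace="default"
    )

    print("current replicas:", hpa.status.current_replicas)
    print("desired replicas:", hpa.status.desired_replicas)
    print("current CPU %:", hpa.status.current_cpu_utilization_percentage)
    ```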

    Finally, the program exports the names of the deployment and HPA so you can reference them using the Pulumi CLI or in other Pulumi programs. This is useful for debugging, referencing in other parts of infrastructure, or triggering actions based on resource creation.
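    For example, another Pulumi program can read these outputs through a StackReference. In this sketch, "org/project/stack" is a placeholder for your own organization, project, and stack names:

    ```python
    import pulumi

    # Reference the stack that created the deployment and HPA.
    # Replace "org/project/stack" with your actual org/project/stack path.
    infra = pulumi.StackReference("org/project/stack")

    deployment_name = infra.get_output("deployment_name")
    hpa_name = infra.get_output("hpa_name")

    # Re-export them (or feed them into other resources in this program).
    pulumi.export("referenced_deployment", deployment_name)
    pulumi.export("referenced_hpa", hpa_name)
    ```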

    For more detailed documentation on the resources used in this program, you can refer to the Pulumi Kubernetes provider reference for the Deployment and HorizontalPodAutoscaler resources.