Auto-Scaling AI Services with Kubernetes HPA

Pulumi · Accepted Answer

Autoscaling AI services on Kubernetes helps your applications maintain performance under load while keeping costs in check. We'll use the Kubernetes Horizontal Pod Autoscaler (HPA) resource to scale the pods in a deployment. The HPA automatically increases or decreases the number of pods based on CPU utilization or other selected metrics.

The `HorizontalPodAutoscaler` resource is part of the autoscaling API group in Kubernetes. It allows you to specify how the performance of the application should be measured and when to add or remove pods based on these metrics.

Here's how we can define an HPA with Pulumi:

1. **Deployment**: First, we need a Kubernetes deployment in place. This deployment controls the set of pods that run our AI service.

2. **Metrics**: Then, we decide on the metrics for scaling. CPU and memory usage are common metrics. Custom metrics can also be used if you need to scale based on the specific behavior of your application (like queue length).

3. **HPA Resource**: With Pulumi, we can define a `HorizontalPodAutoscaler` resource that targets our deployment. We'll set the minimum and maximum number of pods and define the target CPU utilization percentage that triggers scaling.

Here is a Python program using Pulumi to create an HPA resource that scales an AI service based on CPU utilization:

```python
import pulumi
import pulumi_kubernetes as k8s

# Define a Kubernetes deployment for the AI service.
app_labels = {"app": "ai-service"}
ai_service_deployment = k8s.apps.v1.Deployment(
    "aiServiceDeployment",
    spec=k8s.apps.v1.DeploymentSpecArgs(
        selector=k8s.meta.v1.LabelSelectorArgs(match_labels=app_labels),
        replicas=2, # initial replica count
        template=k8s.core.v1.PodTemplateSpecArgs(
            metadata=k8s.meta.v1.ObjectMetaArgs(labels=app_labels),
            spec=k8s.core.v1.PodSpecArgs(
                containers=[k8s.core.v1.ContainerArgs(
                    name="ai-service",
                    image="your-ai-service-image:latest", # replace with your actual image
                    resources=k8s.core.v1.ResourceRequirementsArgs(
                        limits={"cpu": "500m", "memory": "512Mi"},
                        requests={"cpu": "500m", "memory": "512Mi"}
                    )
                )]
            )
        )
    )
)

# Define a HorizontalPodAutoscaler for the AI service.
ai_service_hpa = k8s.autoscaling.v2.HorizontalPodAutoscaler(
    "aiServiceHPA",
    spec=k8s.autoscaling.v2.HorizontalPodAutoscalerSpecArgs(
        scale_target_ref=k8s.autoscaling.v2.CrossVersionObjectReferenceArgs(
            api_version="apps/v1",
            kind="Deployment",
            name=ai_service_deployment.metadata.name
        ),
        min_replicas=2,
        max_replicas=10,
        metrics=[k8s.autoscaling.v2.MetricSpecArgs(
            type="Resource",
            resource=k8s.autoscaling.v2.ResourceMetricSourceArgs(
                name="cpu",
                target=k8s.autoscaling.v2.MetricTargetArgs(
                    type="Utilization",
                    average_utilization=80  # target average CPU utilization, as a percentage of each pod's CPU request
                ),
            ),
        )],
    )
)

# Export the name of the HPA
pulumi.export('horizontal_pod_autoscaler', ai_service_hpa.metadata.name)
```

In this program:

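- We create a `Deployment` with the Pulumi resource name `aiServiceDeployment` for the AI service. Because `metadata.name` is not set, Pulumi generates the Kubernetes object name, which is why the HPA references `ai_service_deployment.metadata.name` rather than a hard-coded string.
- We specify resource `requests` and `limits` for CPU and memory. The CPU request matters for autoscaling: the HPA computes utilization as a percentage of each pod's requested CPU.
- A `HorizontalPodAutoscaler` with the resource name `aiServiceHPA` is then linked to this deployment. It monitors average CPU utilization across all the pods managed by the deployment. CPU figures come from the Kubernetes resource metrics API, which typically requires metrics-server to be running in the cluster.
- `scale_target_ref` points the HPA at the deployment we wish to scale.
- `min_replicas` and `max_replicas` define the lower and upper bounds on the number of pods.
- `average_utilization` is set to 80, so when average CPU usage rises above roughly 80% of the requested CPU (about 400m of the 500m request per pod), the HPA adds pods until the load is balanced or `max_replicas` is reached. When utilization falls well below the target, it removes pods, down to `min_replicas`.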
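To make the 80% target concrete, here is a minimal sketch of the replica calculation described in the Kubernetes HPA documentation: `desired = ceil(current_replicas * current_utilization / target_utilization)`, clamped to the min and max bounds. The function names are our own, and the 0.1 tolerance is the controller's documented default, assumed here. The sketch ignores details such as pods that are not ready, missing metrics, and stabilization windows, so treat it as an approximation rather than the controller's exact behavior.

```python
import math


def cpu_utilization_percent(usage_millicores, request_millicores):
    """CPU utilization as the HPA sees it: usage relative to the pod's CPU request."""
    return 100.0 * usage_millicores / request_millicores


def desired_replicas(current_replicas, current_utilization, target_utilization,
                     min_replicas, max_replicas, tolerance=0.1):
    """Approximate the HPA replica count for a single utilization metric."""
    ratio = current_utilization / target_utilization
    if abs(ratio - 1.0) <= tolerance:
        # Close enough to the target: the replica count is left unchanged.
        desired = current_replicas
    else:
        desired = math.ceil(current_replicas * ratio)
    # Never go below min_replicas or above max_replicas.
    return max(min_replicas, min(max_replicas, desired))


# Two pods, each using 450m of its 500m request, against the 80% target.
utilization = cpu_utilization_percent(450, 500)  # 90.0
print(desired_replicas(2, utilization, 80, min_replicas=2, max_replicas=10))  # 3
```

With the values from the program above, two pods running at 90% of their request scale out to three, while a reading of 84% stays within the tolerance band and leaves the replica count at two.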

Please replace `your-ai-service-image:latest` with the actual Docker image you want to deploy.

This Pulumi code creates a deployment for your AI service and an autoscaler that ensures your service scales with the demand. The autoscaler will monitor the CPU usage of your service, and will scale in (reduce the number of pods) or scale out (increase the number of pods) based on the defined criteria.
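The metrics step above mentions memory and custom metrics such as queue length. As a rough sketch, this is what a `metrics` list combining them could look like in plain `autoscaling/v2` manifest form (camelCase field names, where the Pulumi args classes use snake_case such as `average_utilization`). The helper functions, the 75% memory target, and the metric name `inference_queue_length` with its target of 30 are hypothetical; a `Pods` metric also assumes that a custom metrics adapter is installed in the cluster to serve it.

```python
import json


def resource_metric(name, average_utilization):
    """Resource metric (cpu or memory) targeting a percentage of the pod's request."""
    return {
        "type": "Resource",
        "resource": {
            "name": name,
            "target": {"type": "Utilization", "averageUtilization": average_utilization},
        },
    }


def pods_metric(metric_name, average_value):
    """Per-pod custom metric, averaged across the pods of the scale target."""
    return {
        "type": "Pods",
        "pods": {
            "metric": {"name": metric_name},
            "target": {"type": "AverageValue", "averageValue": average_value},
        },
    }


metrics = [
    resource_metric("cpu", 80),
    resource_metric("memory", 75),                # hypothetical target
    pods_metric("inference_queue_length", "30"),  # hypothetical metric name and target
]
print(json.dumps(metrics, indent=2))
```

When several metrics are listed, the HPA computes a desired replica count for each one and uses the largest, so adding a metric can only make scaling more aggressive, never less.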