1. Orchestrating AI Model Serving Workloads with Kubernetes


    Orchestrating AI model serving workloads with Kubernetes typically involves creating multiple resources that work together to manage the lifecycle and scaling of your model serving application. For such a workload, you might define a Deployment to manage the pods that run your serving containers, a Service to expose the application within the cluster or to the outside world, an optional Ingress if you need HTTP routing, a HorizontalPodAutoscaler for automatic scaling, and perhaps a CronJob for periodic tasks such as retraining.

    In a Pulumi program, you'd write Python code that uses the Pulumi Kubernetes SDK to define these resources. The SDK provides classes and functions that correspond to Kubernetes resources, which you can assemble into a program to declaratively define the desired state of your cluster.
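    As a minimal sketch of that correspondence, assuming pulumi and pulumi-kubernetes are installed (pip install pulumi pulumi-kubernetes) and your kubeconfig points at the target cluster, a single resource looks like this; the ConfigMap name and data keys below are purely illustrative:

    from pulumi_kubernetes.core.v1 import ConfigMap

    # A ConfigMap declared through the SDK; the constructor arguments mirror the
    # fields of the equivalent Kubernetes manifest (metadata, data, ...).
    model_config = ConfigMap(
        "model-config",
        metadata={"name": "model-config"},
        data={"MODEL_NAME": "my-model", "BATCH_SIZE": "8"})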

    Below is a Python program using Pulumi to orchestrate an AI model serving workload on Kubernetes. The program:

    1. Sets up a Namespace for organizing resources.
    2. Defines a Deployment for running model serving containers.
    3. Creates a Service to expose the deployment within the cluster.
    4. Sets up a HorizontalPodAutoscaler to automatically scale the number of pods.
    5. Optionally, creates a CronJob for periodic AI model retraining tasks.

    Let's look at a sample Pulumi program:

    import pulumi
    from pulumi_kubernetes.core.v1 import Namespace, Service
    from pulumi_kubernetes.apps.v1 import Deployment
    from pulumi_kubernetes.autoscaling.v2 import HorizontalPodAutoscaler
    from pulumi_kubernetes.batch.v1 import CronJob

    # Common labels that tie the Deployment, Service, and autoscaler together
    app_labels = {"app": "ai-model-serving"}

    # Define a Namespace for your AI workloads for better isolation and management
    ai_namespace = Namespace(
        "ai-model-serving-ns",
        metadata={"name": "ai-model-serving"})

    # Define a Deployment for the AI model serving workload
    ai_model_deployment = Deployment(
        "ai-model-serving-deployment",
        metadata={
            "namespace": ai_namespace.metadata["name"],
        },
        spec={
            "selector": {
                "matchLabels": app_labels,
            },
            "replicas": 2,  # Start with 2 replicas
            "template": {
                "metadata": {
                    "labels": app_labels,
                },
                "spec": {
                    "containers": [{
                        "name": "model-container",
                        "image": "my-registry/my-model-serving-image:latest",  # Replace with your container image
                        "ports": [{"containerPort": 8080}],
                    }],
                },
            },
        })

    # Create a Service to expose the Deployment in the cluster
    ai_model_service = Service(
        "ai-model-serving-service",
        metadata={
            "namespace": ai_namespace.metadata["name"],
            "labels": app_labels,
        },
        spec={
            "ports": [{"port": 8080}],
            "selector": app_labels,
            "type": "ClusterIP",  # Use "LoadBalancer" for external access if needed
        })

    # Define a HorizontalPodAutoscaler for automatic scaling based on CPU usage
    ai_model_hpa = HorizontalPodAutoscaler(
        "ai-model-serving-hpa",
        metadata={
            "namespace": ai_namespace.metadata["name"],
        },
        spec={
            "scale_target_ref": {
                "api_version": "apps/v1",
                "kind": "Deployment",
                "name": ai_model_deployment.metadata["name"],
            },
            "min_replicas": 2,
            "max_replicas": 10,
            "metrics": [{
                "type": "Resource",
                "resource": {
                    "name": "cpu",
                    "target": {
                        "type": "Utilization",
                        "average_utilization": 80,
                    },
                },
            }],
        })

    # Optional: Define a CronJob for periodic tasks like retraining the AI model
    ai_model_cron_job = CronJob(
        "ai-model-serving-cronjob",
        metadata={
            "namespace": ai_namespace.metadata["name"],
        },
        spec={
            "schedule": "0 2 * * *",  # Run every day at 2 am
            "job_template": {
                "spec": {
                    "template": {
                        "spec": {
                            "containers": [{
                                "name": "retraining-container",
                                "image": "my-registry/my-model-retraining-image:latest",  # Replace with your retraining image
                            }],
                            "restart_policy": "OnFailure",
                        },
                    },
                },
            },
        })

    # Export the Service name and ClusterIP to access the model serving API
    pulumi.export('service_name', ai_model_service.metadata['name'])
    pulumi.export('service_cluster_ip', ai_model_service.spec['cluster_ip'])

    In the program above, ai_namespace creates a separate namespace for our AI workloads, providing a way to group related resources for easier management and access control. The ai_model_deployment defines the desired state for deploying our model serving application. It specifies the container image, initial replica count, and the container port that should be exposed.
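    One thing the sample leaves out is resource requests and limits for the model container. The sketch below is illustrative only, not a sizing recommendation; note that the Utilization-based autoscaler discussed next measures CPU usage against the container's declared CPU request, so the container needs one for that scaling mode to work.

    # Illustrative requests/limits for the serving container; set this dict as the
    # container's "resources" field in the Deployment spec above. The CPU request
    # is also the baseline the Utilization-based HPA compares against.
    model_container_resources = {
        "requests": {"cpu": "500m", "memory": "1Gi"},
        "limits": {"cpu": "2", "memory": "4Gi"},
    }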

    The ai_model_service resource creates a Kubernetes Service that allows other pods in the cluster to communicate with the pods managed by ai_model_deployment. The ClusterIP service type makes it accessible only within the cluster; if external access is needed, LoadBalancer could be used instead, as sketched below.
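    As a rough sketch of that external-access variant, assuming the cluster can provision cloud load balancers (e.g. EKS, GKE, or AKS), you could add a second Service of type LoadBalancer that selects the same pods; on some providers the ingress exposes a hostname rather than an IP, and the port numbers here are illustrative:

    # Expose the same pods externally through a cloud load balancer (sketch only)
    external_service = Service(
        "ai-model-serving-external",
        metadata={"namespace": ai_namespace.metadata["name"]},
        spec={
            "ports": [{"port": 80, "target_port": 8080}],
            "selector": {"app": "ai-model-serving"},
            "type": "LoadBalancer",
        })

    # The cloud provider fills in the load balancer address once it is provisioned;
    # depending on the provider this may be .hostname instead of .ip.
    pulumi.export(
        "external_ip",
        external_service.status.apply(lambda s: s.load_balancer.ingress[0].ip))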

    ai_model_hpa sets up a Horizontal Pod Autoscaler that automatically scales the serving pods between 2 and 10 replicas based on CPU utilization, measured against the CPU requests declared on the container.
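    If you also want to scale on memory, the autoscaling/v2 API accepts multiple entries in the metrics list. Below is a hedged sketch of an extra entry that could be appended to the HPA's metrics list above; the 75% target is arbitrary, and both CPU and memory utilization scaling assume a metrics source such as metrics-server is running in the cluster.

    # An additional Resource metric entry for the HPA "metrics" list above;
    # the 75% utilization target is illustrative.
    memory_metric = {
        "type": "Resource",
        "resource": {
            "name": "memory",
            "target": {
                "type": "Utilization",
                "average_utilization": 75,
            },
        },
    }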

    The optional ai_model_cron_job schedules periodic jobs, which is useful for tasks like retraining your model. The cron expression "0 2 * * *" runs it once a day at 2 am.
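    Retraining jobs can be long-running, so you may also want to keep runs from overlapping and limit how many finished Jobs are retained. A sketch of extra keys that could be added to the CronJob spec above (the values are illustrative):

    # Optional CronJob spec fields: forbid overlapping runs and trim job history.
    # These map to concurrencyPolicy, successfulJobsHistoryLimit, and
    # failedJobsHistoryLimit in the Kubernetes API.
    retraining_cron_options = {
        "concurrency_policy": "Forbid",
        "successful_jobs_history_limit": 3,
        "failed_jobs_history_limit": 1,
    }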

    At the end, we export the name and the internal ClusterIP of the Service so they can be read as stack outputs and used to reach the model serving API from inside the cluster.
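    Once the program is in place, pulumi up previews and applies these resources. Because Pulumi auto-names Kubernetes objects with a random suffix unless metadata.name is set explicitly, exporting the Service name is how you discover the actual object name; you can read it with pulumi stack output service_name and, for a quick local test, forward a port to it with kubectl -n ai-model-serving port-forward svc/<service_name> 8080:8080 (the port mapping here is only an illustration for the sample container above).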