1. Real-Time AI Workload Autoscaling with Knative


    To enable real-time AI workload autoscaling with Knative, we will need to deploy a Kubernetes cluster, install Knative Serving, and then configure autoscaling to handle the dynamic nature of AI workloads effectively.

    Knative Serving provides a request-driven compute model in which applications autoscale in near real time based on incoming traffic. This makes it well suited to real-time AI workloads, which often receive sporadic or unpredictable traffic.
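
    At its core, Knative's Pod Autoscaler (KPA) sizes a deployment from the observed number of in-flight requests divided by a per-pod concurrency target. The snippet below is a deliberately simplified sketch of that calculation, not the actual autoscaler implementation (which also applies stable/panic windows, scale rates, and activator behavior); the request counts and targets are hypothetical.

import math

def desired_pods(in_flight_requests: float, target_concurrency: float,
                 min_scale: int = 0, max_scale: int = 10) -> int:
    """Simplified illustration of Knative's concurrency-based scaling decision."""
    if in_flight_requests <= 0:
        return min_scale  # scale to zero (or minScale) when there is no traffic
    raw = math.ceil(in_flight_requests / target_concurrency)
    return max(min_scale, min(max_scale, raw))

# Example: 45 in-flight requests with a target of 10 concurrent requests per pod
print(desired_pods(45, 10))  # -> 5 pods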

    First, we use Pulumi to deploy a Kubernetes cluster. The specific configuration depends on the cloud provider you choose; for simplicity, let's assume you're using Google Kubernetes Engine (GKE). We'll set up a GKE cluster, install Knative Serving on top of it, and then configure Knative to autoscale our AI application workloads based on request-driven metrics, such as the number of in-flight requests (concurrency).

    Here's a Pulumi program that will:

    1. Create a GKE cluster.
    2. Deploy Knative Serving.
    3. Configure Knative Serving for autoscaling.

    Please ensure you have Pulumi and kubectl installed and configured for your GCP account.

import pulumi
import pulumi_gcp as gcp
import pulumi_kubernetes as k8s

# Step 1: Create a GKE cluster
cluster = gcp.container.Cluster(
    "ai-workload-cluster",
    initial_node_count=3,
    min_master_version="latest",
    node_config={
        "machine_type": "n1-standard-4",
        "oauth_scopes": [
            "https://www.googleapis.com/auth/compute",
            "https://www.googleapis.com/auth/devstorage.read_only",
            "https://www.googleapis.com/auth/logging.write",
            "https://www.googleapis.com/auth/monitoring",
        ],
    },
)

# Step 2: Set up the Kubernetes provider for Pulumi using the generated kubeconfig
kubeconfig = pulumi.Output.all(cluster.name, cluster.endpoint, cluster.master_auth).apply(
    lambda args: """apiVersion: v1
clusters:
- cluster:
    certificate-authority-data: {0}
    server: https://{1}
  name: gke_cluster
contexts:
- context:
    cluster: gke_cluster
    user: gke_cluster_user
  name: gke_cluster
current-context: gke_cluster
kind: Config
preferences: {{}}
users:
- name: gke_cluster_user
  user:
    auth-provider:
      config:
        cmd-args: config config-helper --format=json
        cmd-path: gcloud
        expiry-key: '{{.credential.token_expiry}}'
        token-key: '{{.credential.access_token}}'
      name: gcp
""".format(args[2]["cluster_ca_certificate"], args[1])
)

k8s_provider = k8s.Provider("gke-k8s", kubeconfig=kubeconfig)

# Step 3: Install Knative Serving using a Helm chart
knative_chart = k8s.helm.v3.Chart(
    "knative-serving",
    config=k8s.helm.v3.ChartOpts(
        chart="knative-serving",
        version="0.21.0",
        fetch_opts=k8s.helm.v3.FetchOpts(
            repo="https://knative.dev/charts",
        ),
    ),
    opts=pulumi.ResourceOptions(provider=k8s_provider),
)

# Step 4: Configure Knative Serving for autoscaling.
# Here you would deploy your Knative Service and set autoscaling parameters
# such as minScale, maxScale, and the concurrency target, depending on the
# workload characteristics. This step assumes you have a Knative Service
# manifest ready for deployment.
#
# knative_serving_yaml = ...  # path to your Knative Service definition (YAML file)
# knative_serving = k8s.yaml.ConfigGroup(
#     "knative-serving-config",
#     files=[knative_serving_yaml],
#     opts=pulumi.ResourceOptions(provider=k8s_provider),
# )

# Running `pulumi up` deploys the infrastructure and Knative as defined above.
pulumi.export("kubeconfig", kubeconfig)

    In the above program:

    • Step 1: We create a GKE cluster with the necessary OAuth scopes and a machine type suitable for an AI workload.
    • Step 2: We configure the Kubernetes provider in Pulumi with the kubeconfig of the created GKE cluster.
    • Step 3: We install Knative Serving on the cluster using a Helm chart.
    • Step 4: Here you'd deploy your own Knative Service. This is where you set the Knative autoscaling parameters; since these depend on the specifics of your workload, they are left as a commented-out placeholder in the code.

    Remember that before applying this configuration, you must have configured your Pulumi CLI for GCP and Kubernetes.

    Once you deploy the Knative services that define your AI workloads, you can configure autoscaling properties such as minScale, maxScale, and the target concurrency for each service to ensure that they can automatically scale in real-time based on the workload's characteristics.
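
    As a hedged sketch of what such a service could look like, the snippet below defines a Knative Service through Pulumi's generic CustomResource, using the standard autoscaling.knative.dev annotations on the revision template. The container image, resource limits, and scale bounds are placeholder values you would replace for your own model server, and it assumes the k8s_provider created in the program above.

import pulumi
import pulumi_kubernetes as k8s

# Assumes `k8s_provider` is the Provider created for the GKE cluster above.
ai_service = k8s.apiextensions.CustomResource(
    "ai-inference-service",
    api_version="serving.knative.dev/v1",
    kind="Service",
    metadata={"name": "ai-inference", "namespace": "default"},
    spec={
        "template": {
            "metadata": {
                "annotations": {
                    # Scale bounds and per-pod concurrency target (placeholder values)
                    "autoscaling.knative.dev/minScale": "1",
                    "autoscaling.knative.dev/maxScale": "20",
                    "autoscaling.knative.dev/target": "10",
                },
            },
            "spec": {
                "containers": [{
                    # Placeholder image for your model-serving container
                    "image": "gcr.io/my-project/my-model-server:latest",
                    "resources": {"limits": {"cpu": "2", "memory": "4Gi"}},
                }],
            },
        },
    },
    opts=pulumi.ResourceOptions(provider=k8s_provider),
)

    With minScale set to 1, at least one pod stays warm to avoid cold starts; raising the target annotation lets each pod absorb more concurrent requests before Knative adds replicas.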

    Knative Serving watches traffic in real time and scales the number of pods running your AI workload up or down to match the incoming request volume.

    Please replace the knative_serving_yaml placeholder with your actual Knative service definitions to proceed with your use case.
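
    If you prefer to keep the service definition as YAML, as the commented-out Step 4 suggests, wiring a manifest file into the same Pulumi program might look like the sketch below; the file name knative-service.yaml is a hypothetical placeholder, and k8s_provider again refers to the provider from the main program.

import pulumi
import pulumi_kubernetes as k8s

# Assumes `k8s_provider` is the Provider created for the GKE cluster above, and that
# ./knative-service.yaml (hypothetical name) contains your Knative Service manifest.
knative_serving = k8s.yaml.ConfigGroup(
    "knative-serving-config",
    files=["knative-service.yaml"],
    opts=pulumi.ResourceOptions(provider=k8s_provider),
)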