Event-driven Deep Learning Inference with KEDA ScaledObjects

Pulumi · Accepted Answer

To run event-driven deep learning inference with KEDA (Kubernetes-based Event-Driven Autoscaling), you deploy a small set of resources into a Kubernetes cluster. Because Pulumi manages Kubernetes resources as infrastructure as code, you can define and deploy all of them from a single program.

Here is a high-level overview of the steps you'll take to achieve this:

1. Set up a deep learning inference service as a Kubernetes Deployment, which can handle incoming requests.
2. Configure a KEDA ScaledObject to monitor events (such as messages in a queue) and automatically scale the number of inference service pods based on the load.

For illustration, let's assume you have a message queue (like RabbitMQ, AWS SQS, Azure Service Bus, etc.) holding inference requests, and you want to scale your deep learning service based on the length of this queue.

Here’s how you can accomplish this with Pulumi, using Python as the programming language:

- First, you need a running Kubernetes cluster with KEDA installed, plus the `pulumi_kubernetes` SDK in your Python environment.
- Next, you'll define a `Deployment` for your deep learning inference service.
- After that, you'll define a KEDA `ScaledObject` that specifies the event source (e.g., a message queue) and the scaling parameters.

Below is an example Pulumi program written in Python that sets up an inference service and scales it using KEDA based on the number of messages in an Azure Service Bus queue:

```python
import pulumi
import pulumi_kubernetes as k8s

# Define the Kubernetes Deployment for the deep learning inference service.
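inference_service_deployment = k8s.apps.v1.Deployment(
    "inference-deployment",
    # Set an explicit name. Without it, Pulumi auto-names the Deployment with a
    # random suffix and the ScaledObject's scaleTargetRef below would not match.
    metadata=k8s.meta.v1.ObjectMetaArgs(name="inference-deployment"),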
    spec=k8s.apps.v1.DeploymentSpecArgs(
        replicas=1,  # Start with one pod.
        selector=k8s.meta.v1.LabelSelectorArgs(
            match_labels={"app": "inference-service"},
        ),
        template=k8s.core.v1.PodTemplateSpecArgs(
            metadata=k8s.meta.v1.ObjectMetaArgs(
                labels={"app": "inference-service"},
            ),
            spec=k8s.core.v1.PodSpecArgs(
                containers=[k8s.core.v1.ContainerArgs(
                    name="inference-container",
                    image="your-docker-image",  # Replace with your own deep learning inference Docker image.
                    ports=[k8s.core.v1.ContainerPortArgs(container_port=80)],
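                    # Define environment variables, resources, volume mounts, etc. here.
                    # The env var named by connectionFromEnv below (SERVICEBUS_CONNECTION_STRING)
                    # must be set on this container, ideally from a Kubernetes Secret.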
                )],
            ),
        ),
    ),
)

# Define the KEDA ScaledObject to scale the deployment based on Azure Service Bus queue length.
keda_scaledobject = k8s.apiextensions.CustomResource(
    "keda-scaledobject",
    api_version="keda.sh/v1alpha1",
    kind="ScaledObject",
    metadata=k8s.meta.v1.ObjectMetaArgs(
        name="inference-service-scaler",
        labels={"deploymentName": "inference-deployment"},
    ),
    spec={
        "scaleTargetRef": {
            "name": "inference-deployment",  # The name of the deployment defined above.
        },
        "pollingInterval": 30,             # How often KEDA polls the Azure Service Bus queue (in seconds).
        "cooldownPeriod":  300,            # Seconds to wait after the last active trigger before scaling to zero; only takes effect when minReplicaCount is 0.
        "minReplicaCount": 1,              # Minimum number of pods.
        "maxReplicaCount": 10,             # Maximum number of pods.
        "triggers": [
            {
                "type": "azure-servicebus",
                "metadata": {
                    "queueName": "your-service-bus-queue-name",  # Replace with your Service Bus queue name.
                    "namespace": "your-service-bus-namespace",   # Replace with your Service Bus namespace.
                    "connectionFromEnv": "SERVICEBUS_CONNECTION_STRING",  # Name of the env var on the target container that holds the connection string.
                },
            },
        ],
    },
)

# Export the deployment name and scaler name
pulumi.export('deployment_name', inference_service_deployment.metadata["name"])
pulumi.export('scaler_name', keda_scaledobject.metadata["name"])
```

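Make sure to replace `your-docker-image`, `your-service-bus-queue-name`, and `your-service-bus-namespace` with the appropriate values for your environment. Note that `connectionFromEnv` names an environment variable that KEDA looks up on the scale target's container, so `SERVICEBUS_CONNECTION_STRING` must be defined on the inference container in the `Deployment` (ideally sourced from a Kubernetes Secret), not on the `ScaledObject` itself.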

This code defines a simple `Deployment` for the inference service and a `ScaledObject` that tells KEDA to scale its pods based on the configured trigger. The `minReplicaCount` and `maxReplicaCount` fields set the lower and upper bounds on the number of pods KEDA will scale between.
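To get a feel for how queue length maps to pod count, here is a rough sketch of the arithmetic. It assumes the Azure Service Bus trigger's `messageCount` target is left at its default of 5 messages per pod (the example above does not set it) and that the Horizontal Pod Autoscaler divides the backlog by that target and rounds up. The function name `estimate_replicas` is hypothetical, and the sketch ignores the autoscaler's tolerance and stabilization behavior, so treat the numbers as approximate.

```python
import math


def estimate_replicas(
    queue_length: int,
    messages_per_replica: int = 5,  # assumed default messageCount target
    min_replicas: int = 1,          # minReplicaCount from the ScaledObject
    max_replicas: int = 10,         # maxReplicaCount from the ScaledObject
) -> int:
    """Approximate the pod count for a given queue backlog."""
    if messages_per_replica <= 0:
        raise ValueError("messages_per_replica must be positive")
    # One pod per `messages_per_replica` queued messages, rounded up.
    desired = math.ceil(queue_length / messages_per_replica)
    # Clamp to the bounds configured on the ScaledObject.
    return max(min_replicas, min(max_replicas, desired))


for backlog in (0, 12, 200):
    print(backlog, estimate_replicas(backlog))
```

With these settings an empty queue still keeps one pod running, and any backlog of 50 or more messages saturates at ten pods. For GPU-backed inference pods that are slow to start, a lower per-pod target or a higher `minReplicaCount` may be worth considering.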

Please note that KEDA supports a variety of scalers for different event sources. You can check out the [KEDA documentation](https://keda.sh/docs/scalers/) to find out more about the different types of scalers and the configuration options that they support.
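As an illustration of swapping scalers, the sketch below builds one entry for the `triggers` list for either Azure Service Bus or RabbitMQ. The helper `build_trigger` and the `RABBITMQ_HOST` variable name are hypothetical, and the metadata fields shown are my understanding of those two scalers, so check them against the scaler documentation for your KEDA version before relying on them.

```python
def build_trigger(kind: str, queue_name: str, target_length: int) -> dict:
    """Return one entry for the ScaledObject's "triggers" list."""
    # KEDA trigger metadata is a string-to-string map, so numbers are stringified.
    target = str(target_length)
    if kind == "azure-servicebus":
        metadata = {
            "queueName": queue_name,
            "messageCount": target,
            "connectionFromEnv": "SERVICEBUS_CONNECTION_STRING",
        }
    elif kind == "rabbitmq":
        metadata = {
            "queueName": queue_name,
            "mode": "QueueLength",
            "value": target,
            "hostFromEnv": "RABBITMQ_HOST",  # hypothetical env var name
        }
    else:
        raise ValueError(f"unsupported trigger kind: {kind}")
    return {"type": kind, "metadata": metadata}


triggers = [build_trigger("rabbitmq", "inference-requests", 20)]
print(triggers)
```

The resulting list can be passed as the `"triggers"` value in the `ScaledObject` spec; the rest of the spec (`scaleTargetRef`, replica bounds, polling interval) stays the same regardless of the event source.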

To put this into action, you'll need to have Pulumi installed and configured to communicate with your Kubernetes cluster, and you'll need to deploy KEDA to your cluster if it's not already present. After that, running `pulumi up` with the above code should create the necessary resources within your Kubernetes cluster and set up auto-scaling for your inference service.