Rate-Limited Ingress for AI Model Serving with Kubernetes
When serving AI models with Kubernetes, you might want to limit the rate of incoming requests to ensure the stability and high availability of your service. Rate limiting can prevent your service from being overwhelmed by too many requests at once, which could lead to resource exhaustion and increased latency. In a Kubernetes environment, you can implement rate limiting at the ingress level.
To do this, we'll use the following resources:
- Ingress: Kubernetes `Ingress` objects manage external access to the services in a cluster, typically HTTP. An `Ingress` can provide load balancing, SSL termination, and name-based virtual hosting.
- nginx-ingress: We'll use the NGINX Ingress Controller to handle the traffic, as it provides powerful ways to manage traffic, including rate limiting. The NGINX Ingress Controller uses a ConfigMap to store its configuration.
Here's how to set up rate-limited ingress for AI model serving with Kubernetes using Pulumi with Python:
- Deploy an `nginx-ingress` controller, configured with rate limiting. We'll define annotations in our Ingress resources to use this rate limiting.
- Deploy a Kubernetes `Ingress` resource for the service that serves your AI model. It will use the `nginx-ingress` controller and its rate limiting configuration.
Below is a Pulumi program that sets up an `nginx-ingress` controller with a simple rate limiting configuration and deploys an `Ingress` resource with annotations for rate limiting.

```python
import pulumi
import pulumi_kubernetes as k8s

# Deploying the NGINX Ingress Controller using Helm
nginx_ingress_controller = k8s.helm.v3.Chart(
    'nginx-ingress',
    k8s.helm.v3.ChartOpts(
        chart='ingress-nginx',
        version='3.36.0',
        namespace='default',
        fetch_opts=k8s.helm.v3.FetchOpts(
            repo='https://kubernetes.github.io/ingress-nginx'
        ),
        values={
            'controller': {
                'config': {
                    'rate-limiting-enable': 'true',
                    # Define other rate limiting configurations here
                }
            }
        }
    )
)

# Service to be exposed by the Ingress, assumed to already exist
ai_model_service_name = 'ai-model-service'

# Deploying an Ingress resource with rate limiting for the AI model service
ai_model_ingress = k8s.networking.v1.Ingress(
    'ai-model-ingress',
    metadata=k8s.meta.v1.ObjectMetaArgs(
        name='ai-model-ingress',
        annotations={
            # Enabling rate limiting on this specific Ingress resource
            'nginx.ingress.kubernetes.io/limit-rpm': '30',  # This limits to 30 requests per minute
            # Can also set 'limit-rps' for requests per second
        },
    ),
    spec=k8s.networking.v1.IngressSpecArgs(
        ingress_class_name='nginx',  # Using the NGINX Ingress class
        rules=[k8s.networking.v1.IngressRuleArgs(
            http=k8s.networking.v1.HTTPIngressRuleValueArgs(
                paths=[k8s.networking.v1.HTTPIngressPathArgs(
                    path='/',
                    path_type='Prefix',
                    backend=k8s.networking.v1.IngressBackendArgs(
                        service=k8s.networking.v1.IngressServiceBackendArgs(
                            name=ai_model_service_name,
                            port=k8s.networking.v1.ServiceBackendPortArgs(
                                number=80,  # Port on which the AI model service is listening
                            ),
                        ),
                    ),
                )],
            ),
        )],
    ),
)

# Export the Ingress status as an output - this will include the assigned load balancer IP or hostname
pulumi.export('ingress_status', ai_model_ingress.status)
```
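The `limit-rpm` annotation used above is only one of the rate-limiting annotations the NGINX Ingress Controller understands. As an illustrative sketch (annotation names and defaults should be checked against the ingress-nginx documentation for your controller version), you could merge a richer set of annotations into the same `ObjectMetaArgs`:

```python
# Additional per-Ingress rate-limiting annotations for the NGINX Ingress Controller.
# Verify names and semantics against the ingress-nginx docs for your controller version.
rate_limit_annotations = {
    'nginx.ingress.kubernetes.io/limit-rps': '5',               # requests per second per client IP
    'nginx.ingress.kubernetes.io/limit-burst-multiplier': '3',  # allowed burst = limit * multiplier
    'nginx.ingress.kubernetes.io/limit-connections': '10',      # concurrent connections per client IP
    'nginx.ingress.kubernetes.io/limit-whitelist': '10.0.0.0/8',  # CIDRs exempt from the limits
}
```

Passing a dictionary like this as the `annotations` argument of `ObjectMetaArgs` (instead of, or merged with, the `limit-rpm` annotation above) lets you tune per-second limits, burst behaviour, concurrent connections, and exemptions independently per Ingress.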
To explain the important parts of this program:
- We used `pulumi_kubernetes.helm.v3.Chart` to deploy the NGINX Ingress Controller from its Helm chart. Helm is a package manager for Kubernetes that lets us deploy applications as a collection of pre-configured Kubernetes resources.
- In the `nginx-ingress` configuration we set `'rate-limiting-enable': 'true'` under the `controller.config` property. This enables rate limiting at the controller level. The actual rate limiting annotation is then applied per Ingress resource, giving you granular control.
- We created an `Ingress` resource for an existing `ai-model-service`. The annotation `'nginx.ingress.kubernetes.io/limit-rpm': '30'` enforces a rate limit of 30 requests per minute.
- Finally, we export the status of the Ingress, which includes valuable information such as the external IP address or hostname assigned by the ingress controller; a short sketch of turning that status into a single endpoint value follows below.
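The `ingress_status` export returns the raw status object. If you prefer a single endpoint value, you can drill into the status with `Output.apply`. The sketch below is appended to the same program (so `pulumi` and `ai_model_ingress` are already in scope) and assumes a load balancer that reports either an IP or a hostname; the exact shape of `status` varies by cluster:

```python
# Derive a single endpoint string from the Ingress status: IP on most cloud
# load balancers, hostname on e.g. AWS ELB. Falls back to 'pending' while the
# load balancer is still being provisioned.
ingress_endpoint = ai_model_ingress.status.apply(
    lambda status: (
        (status.load_balancer.ingress[0].ip or status.load_balancer.ingress[0].hostname)
        if status and status.load_balancer and status.load_balancer.ingress
        else 'pending'
    )
)
pulumi.export('ingress_endpoint', ingress_endpoint)
```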
Deploying this Pulumi program sets up an environment where your AI model service can serve requests at a controlled rate, adding a layer of protection against traffic spikes.
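If you want to confirm that the limit actually takes effect, a quick client-side check is to send more requests per minute than the limit allows and watch the status codes change. The snippet below is a hedged sketch, not part of the Pulumi program: the endpoint placeholder must be replaced with the exported ingress address, and the rejection status code (commonly 503, or 429 if configured) depends on your controller version and settings.

```python
# verify_rate_limit.py -- quick client-side check that the ingress rate limit kicks in.
# Replace the placeholder below with the IP/hostname exported by the Pulumi program.
import requests

ENDPOINT = 'http://<ingress-ip-or-hostname>/'  # placeholder, fill in your endpoint

status_counts = {}
for _ in range(60):  # more requests than the 30-per-minute limit allows
    try:
        code = requests.get(ENDPOINT, timeout=5).status_code
    except requests.RequestException:
        code = 'error'
    status_counts[code] = status_counts.get(code, 0) + 1

# Expect a mix of successful responses and rejections (commonly 503 or 429) once the limit is hit.
print(status_counts)
```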