1. Rate-Limited Ingress for AI Model Serving with Kubernetes


    When serving AI models with Kubernetes, you might want to limit the rate of incoming requests to ensure the stability and high availability of your service. Rate limiting can prevent your service from being overwhelmed by too many requests at once, which could lead to resource exhaustion and increased latency. In a Kubernetes environment, you can implement rate limiting at the ingress level.

    To do this, we'll use the following resources:

    • Ingress: Kubernetes Ingress objects manage external access to the services in a cluster, typically HTTP. An Ingress can provide load balancing, SSL termination, and name-based virtual hosting.
    • nginx-ingress: We'll use the NGINX Ingress Controller to handle traffic, as it provides powerful traffic-management features, including rate limiting. The controller reads its global configuration from a ConfigMap, while per-Ingress behavior such as rate limiting is controlled through annotations.

    Here's how to set up rate-limited ingress for AI model serving with Kubernetes using Pulumi with Python:

    1. Deploy an nginx-ingress controller. The controller enforces the rate limits that we declare per Ingress resource through annotations.
    2. Deploy a Kubernetes Ingress resource for the service that serves your AI model. It will use the nginx-ingress controller and its rate limiting configuration.

    Below is a Pulumi program that sets up an nginx-ingress controller with a simple rate limiting configuration and deploys an Ingress resource with annotations for rate limiting.

```python
import pulumi
import pulumi_kubernetes as k8s

# Deploy the NGINX Ingress Controller using its Helm chart.
nginx_ingress_controller = k8s.helm.v3.Chart(
    'nginx-ingress',
    k8s.helm.v3.ChartOpts(
        chart='ingress-nginx',
        version='3.36.0',
        namespace='default',
        fetch_opts=k8s.helm.v3.FetchOpts(
            repo='https://kubernetes.github.io/ingress-nginx',
        ),
        values={
            'controller': {
                'config': {
                    # Optional: return 429 instead of the default 503 when a
                    # request is rate limited. The limit-rpm/limit-rps
                    # annotations below work without any global switch.
                    'limit-req-status-code': '429',
                },
            },
        },
    ),
)

# Service to be exposed by the Ingress, assumed to already exist.
ai_model_service_name = 'ai-model-service'

# Deploy an Ingress resource with rate-limiting annotations for the AI model service.
ai_model_ingress = k8s.networking.v1.Ingress(
    'ai-model-ingress',
    metadata=k8s.meta.v1.ObjectMetaArgs(
        name='ai-model-ingress',
        annotations={
            # Limit each client IP to 30 requests per minute.
            'nginx.ingress.kubernetes.io/limit-rpm': '30',
            # 'nginx.ingress.kubernetes.io/limit-rps' limits requests per second instead.
        },
    ),
    spec=k8s.networking.v1.IngressSpecArgs(
        ingress_class_name='nginx',  # Use the NGINX Ingress class
        rules=[k8s.networking.v1.IngressRuleArgs(
            http=k8s.networking.v1.HTTPIngressRuleValueArgs(
                paths=[k8s.networking.v1.HTTPIngressPathArgs(
                    path='/',
                    path_type='Prefix',
                    backend=k8s.networking.v1.IngressBackendArgs(
                        service=k8s.networking.v1.IngressServiceBackendArgs(
                            name=ai_model_service_name,
                            port=k8s.networking.v1.ServiceBackendPortArgs(
                                number=80,  # Port the AI model service listens on
                            ),
                        ),
                    ),
                )],
            ),
        )],
    ),
)

# Export the Ingress status, which includes the assigned load balancer IP or hostname.
pulumi.export('ingress_status', ai_model_ingress.status)
```

    To explain the important parts of this program:

    • We used pulumi_kubernetes.helm.v3.Chart to deploy the NGINX Ingress Controller from its Helm chart. Helm is a package manager for Kubernetes, which allows us to deploy applications as a collection of pre-configured Kubernetes resources.

    • In the Helm values we set 'limit-req-status-code': '429' under the controller.config property, so throttled clients receive 429 Too Many Requests instead of the default 503. Note that NGINX's rate limiting needs no global enable switch: the limits themselves are applied per Ingress resource through annotations, giving you granular control.

    • We created an Ingress resource for an existing ai-model-service. The annotation 'nginx.ingress.kubernetes.io/limit-rpm': '30' limits each client IP to 30 requests per minute.
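    If several model-serving Ingresses need rate limits, it can help to build the annotation dictionaries in one place. The sketch below defines `rate_limit_annotations`, a convenience helper of our own (not part of pulumi_kubernetes), wrapping the documented nginx.ingress.kubernetes.io/limit-* annotation keys, including limit-burst-multiplier, which NGINX multiplies by the base rate to size the burst allowance (default multiplier is 5):

```python
def rate_limit_annotations(rpm=None, rps=None, burst_multiplier=None):
    """Build NGINX Ingress rate-limiting annotations.

    Hypothetical convenience helper: returns a dict suitable for the
    `annotations` field of an Ingress's metadata.
    """
    prefix = 'nginx.ingress.kubernetes.io/'
    annotations = {}
    if rpm is not None:
        annotations[prefix + 'limit-rpm'] = str(rpm)
    if rps is not None:
        annotations[prefix + 'limit-rps'] = str(rps)
    if burst_multiplier is not None:
        # Burst size = base rate * multiplier (NGINX defaults to 5).
        annotations[prefix + 'limit-burst-multiplier'] = str(burst_multiplier)
    return annotations


print(rate_limit_annotations(rpm=30))
# {'nginx.ingress.kubernetes.io/limit-rpm': '30'}
```

    The returned dict can be passed directly as the `annotations` argument of `ObjectMetaArgs` in the program above.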

    • Finally, we export the status of the ingress, which includes valuable information such as the external IP address or hostname that's been assigned by the ingress controller.
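    The exported status follows the Kubernetes Ingress.status shape: a `load_balancer.ingress` list whose entries carry either an `ip` or a `hostname`. The helper below is a sketch operating on a plain dict of that shape (`first_lb_address` is our own name); in the Pulumi program you would apply it to the output with `ai_model_ingress.status.apply(first_lb_address)`:

```python
def first_lb_address(status):
    """Return the first load balancer IP or hostname from an Ingress status dict.

    `status` mirrors the Kubernetes Ingress.status structure (snake_case keys,
    as pulumi_kubernetes outputs use). Returns None until the controller has
    assigned an address.
    """
    entries = (status or {}).get('load_balancer', {}).get('ingress') or []
    for entry in entries:
        address = entry.get('ip') or entry.get('hostname')
        if address:
            return address
    return None


# Example status as the NGINX controller might populate it:
status = {'load_balancer': {'ingress': [{'ip': '203.0.113.10'}]}}
print(first_lb_address(status))  # 203.0.113.10
```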

    Deploying this Pulumi program sets up an environment where your AI model service can serve requests at a controlled rate, adding a layer of protection against traffic spikes.
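    To build intuition for what limit-rpm '30' means in practice, the sketch below simulates a simplified leaky-bucket limiter in the spirit of NGINX's limit_req: on average one request is admitted every 60/rpm seconds, with an optional burst allowance absorbing short spikes. This is a conceptual model for reasoning about limits, not the controller's exact implementation:

```python
def simulate_rate_limit(arrival_times, rpm, burst=0):
    """Simplified leaky-bucket model: admit about `rpm` requests per minute,
    tolerating up to `burst` queued requests during spikes.

    `arrival_times` are request timestamps in seconds, ascending.
    Returns one boolean per request (True = admitted).
    """
    rate = rpm / 60.0   # requests leaked (admitted) per second
    excess = 0.0        # current bucket depth, in requests
    last = None         # timestamp of the last admitted request
    admitted = []
    for t in arrival_times:
        # Drain the bucket for the time elapsed since the last admission.
        leaked = excess - (t - last) * rate if last is not None else 0.0
        current = max(leaked, 0.0)
        if current > burst:
            admitted.append(False)  # bucket full: request rejected
        else:
            excess = current + 1.0
            last = t
            admitted.append(True)
    return admitted


# At 30 rpm (one request per 2 s), back-to-back requests are throttled:
print(simulate_rate_limit([0, 1, 2, 2.5, 4], rpm=30))
# [True, False, True, False, True]
```

    With a nonzero burst, a short spike is absorbed before rejections start, which mirrors how the limit-burst-multiplier annotation softens strict per-second pacing.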