Rate-Limited Ingress for AI Model Serving with Kubernetes
When serving AI models with Kubernetes, you might want to limit the rate of incoming requests to ensure the stability and high availability of your service. Rate limiting can prevent your service from being overwhelmed by too many requests at once, which could lead to resource exhaustion and increased latency. In a Kubernetes environment, you can implement rate limiting at the ingress level.
To do this, we'll use the following resources:
- Ingress: Kubernetes `Ingress` objects manage external access to the services in a cluster, typically HTTP. An `Ingress` can provide load balancing, SSL termination, and name-based virtual hosting.
- nginx-ingress: We'll use the NGINX Ingress Controller to handle the traffic, as it provides powerful ways to manage traffic, including rate limiting. The NGINX Ingress Controller uses a ConfigMap to store its configuration.
Here's how to set up rate-limited ingress for AI model serving with Kubernetes using Pulumi with Python:
- Deploy an `nginx-ingress` controller, configured with rate limiting. We'll define annotations in our Ingress resources to use this rate limiting.
- Deploy a Kubernetes `Ingress` resource for the service that serves your AI model. It will use the `nginx-ingress` controller and its rate limiting configuration.
Below is a Pulumi program that sets up an `nginx-ingress` controller with a simple rate limiting configuration and deploys an `Ingress` resource with annotations for rate limiting.

```python
import pulumi
import pulumi_kubernetes as k8s

# Deploying the NGINX Ingress Controller using Helm
nginx_ingress_controller = k8s.helm.v3.Chart(
    'nginx-ingress',
    k8s.helm.v3.ChartOpts(
        chart='ingress-nginx',
        version='3.36.0',
        namespace='default',
        fetch_opts=k8s.helm.v3.FetchOpts(
            repo='https://kubernetes.github.io/ingress-nginx'
        ),
        values={
            'controller': {
                'config': {
                    'rate-limiting-enable': 'true',
                    # Define other rate limiting configurations here
                }
            }
        }
    )
)

# Service to be exposed by the Ingress, assumed to already exist
ai_model_service_name = 'ai-model-service'

# Deploying an Ingress resource with rate limiting for the AI model service
ai_model_ingress = k8s.networking.v1.Ingress(
    'ai-model-ingress',
    metadata=k8s.meta.v1.ObjectMetaArgs(
        name='ai-model-ingress',
        annotations={
            # Enabling rate limiting on this specific Ingress resource
            'nginx.ingress.kubernetes.io/limit-rpm': '30',  # This limits to 30 requests per minute
            # Can also set 'limit-rps' for requests per second
        },
    ),
    spec=k8s.networking.v1.IngressSpecArgs(
        ingress_class_name='nginx',  # Using the NGINX Ingress class
        rules=[k8s.networking.v1.IngressRuleArgs(
            http=k8s.networking.v1.HTTPIngressRuleValueArgs(
                paths=[k8s.networking.v1.HTTPIngressPathArgs(
                    path='/',
                    path_type='Prefix',
                    backend=k8s.networking.v1.IngressBackendArgs(
                        service=k8s.networking.v1.IngressServiceBackendArgs(
                            name=ai_model_service_name,
                            port=k8s.networking.v1.ServiceBackendPortArgs(
                                number=80,  # Port on which the AI model service is listening
                            ),
                        ),
                    ),
                )],
            ),
        )],
    ),
)

# Export the Ingress status as an output - this will include the assigned load balancer IP or hostname
pulumi.export('ingress_status', ai_model_ingress.status)
```
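The `limit-rpm` annotation used above is only one of the rate-limiting annotations the NGINX Ingress Controller understands. As an illustrative sketch (annotation names and defaults should be checked against the ingress-nginx documentation for your controller version), you could merge a richer set of annotations into the same `ObjectMetaArgs`:

```python
# Additional per-Ingress rate-limiting annotations for the NGINX Ingress Controller.
# Verify names and semantics against the ingress-nginx docs for your controller version.
rate_limit_annotations = {
    'nginx.ingress.kubernetes.io/limit-rps': '5',               # requests per second per client IP
    'nginx.ingress.kubernetes.io/limit-burst-multiplier': '3',  # allowed burst = limit * multiplier
    'nginx.ingress.kubernetes.io/limit-connections': '10',      # concurrent connections per client IP
    'nginx.ingress.kubernetes.io/limit-whitelist': '10.0.0.0/8',  # CIDRs exempt from the limits
}
```

Passing a dictionary like this as the `annotations` argument of `ObjectMetaArgs` (instead of, or merged with, the `limit-rpm` annotation above) lets you tune per-second limits, burst behaviour, concurrent connections, and exemptions independently per Ingress.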
To explain the important parts of this program:
- We used `pulumi_kubernetes.helm.v3.Chart` to deploy the NGINX Ingress Controller from its Helm chart. Helm is a package manager for Kubernetes that lets us deploy applications as a collection of pre-configured Kubernetes resources.
- In the `nginx-ingress` configuration we set `'rate-limiting-enable': 'true'` under the `controller.config` property. This enables rate limiting at the controller level. The actual rate limiting annotation is then applied per Ingress resource, giving you granular control.
- We created an `Ingress` resource for an existing `ai-model-service`. The annotation `'nginx.ingress.kubernetes.io/limit-rpm': '30'` enforces a rate limit of 30 requests per minute.
- Finally, we export the status of the Ingress, which includes valuable information such as the external IP address or hostname assigned by the ingress controller; a short sketch of turning that status into a single endpoint value follows below.
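The `ingress_status` export returns the raw status object. If you prefer a single endpoint value, you can drill into the status with `Output.apply`. The sketch below is appended to the same program (so `pulumi` and `ai_model_ingress` are already in scope) and assumes a load balancer that reports either an IP or a hostname; the exact shape of `status` varies by cluster:

```python
# Derive a single endpoint string from the Ingress status: IP on most cloud
# load balancers, hostname on e.g. AWS ELB. Falls back to 'pending' while the
# load balancer is still being provisioned.
ingress_endpoint = ai_model_ingress.status.apply(
    lambda status: (
        (status.load_balancer.ingress[0].ip or status.load_balancer.ingress[0].hostname)
        if status and status.load_balancer and status.load_balancer.ingress
        else 'pending'
    )
)
pulumi.export('ingress_endpoint', ingress_endpoint)
```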
Deploying this Pulumi program sets up an environment where your AI model service can serve requests at a controlled rate, adding a layer of protection against traffic spikes.
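If you want to confirm that the limit actually takes effect, a quick client-side check is to send more requests per minute than the limit allows and watch the status codes change. The snippet below is a hedged sketch, not part of the Pulumi program: the endpoint placeholder must be replaced with the exported ingress address, and the rejection status code (commonly 503, or 429 if configured) depends on your controller version and settings.

```python
# verify_rate_limit.py -- quick client-side check that the ingress rate limit kicks in.
# Replace the placeholder below with the IP/hostname exported by the Pulumi program.
import requests

ENDPOINT = 'http://<ingress-ip-or-hostname>/'  # placeholder, fill in your endpoint

status_counts = {}
for _ in range(60):  # more requests than the 30-per-minute limit allows
    try:
        code = requests.get(ENDPOINT, timeout=5).status_code
    except requests.RequestException:
        code = 'error'
    status_counts[code] = status_counts.get(code, 0) + 1

# Expect a mix of successful responses and rejections (commonly 503 or 429) once the limit is hit.
print(status_counts)
```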