Scaling Ingress for Multi-Tenant AI Services
To scale Ingress for multi-tenant AI services in Kubernetes, you need a scalable ingress controller, sensible resource requests and limits, and, optionally, a Horizontal Pod Autoscaler (HPA) to adjust the number of ingress controller replicas based on load.
The NGINX Ingress Controller is a popular choice that can be scaled to meet the demands of multi-tenant services. We will deploy it along with a Kubernetes Horizontal Pod Autoscaler to enable automatic scaling based on CPU or memory usage.
Below is a Pulumi program that defines:
- An NGINX Ingress Controller.
- A Horizontal Pod Autoscaler targeting the NGINX Ingress Controller's deployment.
- The Ingress resource defining routing rules for tenants.
The Pulumi program uses the pulumi_kubernetes package to communicate with your Kubernetes cluster. Here is the code:
import pulumi
import pulumi_kubernetes as kubernetes

# Create a Kubernetes namespace for the Ingress controller.
ingress_namespace = kubernetes.core.v1.Namespace("ingress-namespace",
    metadata={"name": "nginx-ingress"}
)

# Deploy the NGINX Ingress Controller using the Helm chart.
nginx_ingress_controller = kubernetes.helm.v3.Chart("nginx-ingress",
    kubernetes.helm.v3.ChartOpts(
        chart="ingress-nginx",
        version="3.30.0",  # Choose a version of the Helm chart that suits your requirements.
        namespace=ingress_namespace.metadata["name"],
        fetch_opts=kubernetes.helm.v3.FetchOpts(
            repo="https://kubernetes.github.io/ingress-nginx"
        ),
        values={
            "controller": {
                "replicaCount": 2,  # Start with 2 replicas; the HPA will adjust this as needed.
                "resources": {
                    "requests": {"cpu": "100m", "memory": "90Mi"},
                    "limits": {"cpu": "200m", "memory": "180Mi"}
                }
            }
        }
    )
)

# Deploy the Horizontal Pod Autoscaler targeting the NGINX Ingress Controller deployment.
# The stable autoscaling/v2 API is used here; v2beta1 has been removed from recent Kubernetes versions.
nginx_ingress_hpa = kubernetes.autoscaling.v2.HorizontalPodAutoscaler("nginx-ingress-hpa",
    metadata=kubernetes.meta.v1.ObjectMetaArgs(
        namespace=ingress_namespace.metadata["name"]
    ),
    spec=kubernetes.autoscaling.v2.HorizontalPodAutoscalerSpecArgs(
        scale_target_ref=kubernetes.autoscaling.v2.CrossVersionObjectReferenceArgs(
            api_version="apps/v1",
            kind="Deployment",
            name="nginx-ingress-ingress-nginx-controller"  # Make sure this matches the deployment name created by the Helm chart.
        ),
        min_replicas=2,
        max_replicas=10,  # Scale up to 10 replicas.
        metrics=[
            kubernetes.autoscaling.v2.MetricSpecArgs(
                type="Resource",
                resource=kubernetes.autoscaling.v2.ResourceMetricSourceArgs(
                    name="cpu",
                    target=kubernetes.autoscaling.v2.MetricTargetArgs(
                        type="Utilization",
                        average_utilization=50  # Target 50% CPU utilization before scaling.
                    )
                )
            )
        ]
    )
)

# Create an Ingress resource that defines routing rules for tenants.
# This is a simple example showing routing based on the host. Adjust according to your needs.
tenant_ingress = kubernetes.networking.v1.Ingress("tenant-ingress",
    metadata=kubernetes.meta.v1.ObjectMetaArgs(
        namespace=ingress_namespace.metadata["name"],
        annotations={
            "kubernetes.io/ingress.class": "nginx"  # Use the NGINX Ingress class.
        }
    ),
    spec=kubernetes.networking.v1.IngressSpecArgs(
        rules=[
            kubernetes.networking.v1.IngressRuleArgs(
                host="tenant1.example.com",  # Replace with your tenant's domain.
                http=kubernetes.networking.v1.HTTPIngressRuleValueArgs(
                    paths=[
                        kubernetes.networking.v1.HTTPIngressPathArgs(
                            path="/",
                            path_type="Prefix",
                            backend=kubernetes.networking.v1.IngressBackendArgs(
                                service=kubernetes.networking.v1.IngressServiceBackendArgs(
                                    name="tenant1-service",  # Replace with your tenant's service name.
                                    port=kubernetes.networking.v1.ServiceBackendPortArgs(
                                        number=80  # The port your service is exposed on inside the cluster.
                                    )
                                )
                            )
                        )
                    ]
                )
            )
            # Add more rules for additional tenants.
        ]
    )
)

# Export the Ingress IP address if using a cloud provider that provisions an IP for the Ingress.
# Otherwise, use `kubectl get ingress` to inspect the status.
ingress_ip = tenant_ingress.status.load_balancer.ingress[0].ip
pulumi.export('ingress_ip', ingress_ip)
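The program above scales on CPU utilization only. If you also want memory-based scaling, as mentioned earlier, a second metric can be appended to the HPA's metrics list. Below is a minimal sketch using the same autoscaling/v2 input types; the 70% target is an illustrative value, not a recommendation:

# Hypothetical memory metric to add alongside the CPU metric in the HPA spec.
memory_metric = kubernetes.autoscaling.v2.MetricSpecArgs(
    type="Resource",
    resource=kubernetes.autoscaling.v2.ResourceMetricSourceArgs(
        name="memory",
        target=kubernetes.autoscaling.v2.MetricTargetArgs(
            type="Utilization",
            average_utilization=70  # Scale out when average memory utilization exceeds 70%.
        )
    )
)
# Append memory_metric to the metrics list in HorizontalPodAutoscalerSpecArgs.

When both metrics are present, the HPA scales based on whichever metric proposes the higher replica count.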
With this Pulumi program, you can deploy an NGINX Ingress Controller that scales with the load generated by your AI services. The HPA monitors the CPU utilization of the ingress controller pods and adjusts the number of replicas within the configured minimum and maximum. Each tenant's traffic is routed to the correct service based on the host in the Ingress resource, allowing for a multi-tenant setup. Each new tenant can be added by including an additional IngressRuleArgs entry with the appropriate host and service information, as shown below.
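For example, a second tenant could be routed by appending another IngressRuleArgs to the rules list of the Ingress above. The host and service name here (tenant2.example.com, tenant2-service) are placeholders for your own values:

# Hypothetical rule for a second tenant; replace the host and service name with your own.
kubernetes.networking.v1.IngressRuleArgs(
    host="tenant2.example.com",
    http=kubernetes.networking.v1.HTTPIngressRuleValueArgs(
        paths=[
            kubernetes.networking.v1.HTTPIngressPathArgs(
                path="/",
                path_type="Prefix",
                backend=kubernetes.networking.v1.IngressBackendArgs(
                    service=kubernetes.networking.v1.IngressServiceBackendArgs(
                        name="tenant2-service",
                        port=kubernetes.networking.v1.ServiceBackendPortArgs(number=80)
                    )
                )
            )
        ]
    )
)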