1. Load Balancing for AI Model Serving on Kubernetes

    When designing a solution for serving AI models on Kubernetes, you generally want it to handle a high volume of requests, provide low latency, and scale to accommodate varying loads. To achieve this, load balancing is an essential component: it distributes incoming AI model inference requests across a pool of available serving pods.

    In Kubernetes, this load balancing is typically achieved with a combination of Services and Ingress controllers:

    • Kubernetes Service: A Service functions as an internal load balancer. It provides a single, stable endpoint through which clients reach one or more pods that host your AI model. These pods can be scaled horizontally, and the Service automatically balances traffic across all available pods. The Service object can be of different types, such as ClusterIP (for internal access), NodePort (which exposes the Service on each node's IP at a static port), and LoadBalancer (which provisions an external load balancer for you); a LoadBalancer variant is sketched after this list.

    • Ingress: When you need to expose your Service to external traffic, you use an Ingress, which manages external access to the services in a cluster, typically via HTTP/HTTPS. An Ingress can also provide load balancing, SSL/TLS termination, and name-based virtual hosting.
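
    If your cluster runs on a cloud provider, a minimal sketch of the LoadBalancer variant might look like the following. The app: ai-model label and the port numbers are assumptions carried over from the main example below:

    import pulumi
    import pulumi_kubernetes as k8s

    # Sketch: a LoadBalancer-type Service that asks the cloud provider to
    # provision an external load balancer in front of the serving pods.
    external_service = k8s.core.v1.Service(
        "ai-model-external",
        spec=k8s.core.v1.ServiceSpecArgs(
            selector={"app": "ai-model"},  # assumed pod label
            ports=[k8s.core.v1.ServicePortArgs(port=80, target_port=8080)],
            type="LoadBalancer",           # cloud provider provisions the LB
        ),
    )

    # The external IP or hostname appears in the Service status once provisioned.
    pulumi.export(
        "external_ip",
        external_service.status.apply(
            lambda s: s.load_balancer.ingress[0].ip
            if s and s.load_balancer and s.load_balancer.ingress
            else "pending",
        ),
    )

    In this guide, however, we keep the Service internal (ClusterIP) and expose it through an Ingress instead.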

    Below is an example Pulumi program using Python that sets up a Kubernetes Service and Ingress resource that could be used for serving AI models. This program assumes that you have a Deployment already running that serves your AI model.

    First, let's define the Kubernetes Service that internally load balances requests to your serving pods:

    import pulumi
    import pulumi_kubernetes as k8s

    # A Service is created to expose your AI model serving pods within the cluster.
    model_service = k8s.core.v1.Service(
        "ai-model-service",
        metadata=k8s.meta.v1.ObjectMetaArgs(
            name="ai-model-service",
        ),
        spec=k8s.core.v1.ServiceSpecArgs(
            selector={"app": "ai-model"},  # Assumes your pods have the `app: ai-model` label
            ports=[k8s.core.v1.ServicePortArgs(
                port=80,           # Port exposed by the Service
                target_port=8080,  # Port on which your pods serve the AI model
            )],
            type="ClusterIP",      # Internal cluster IP
        ),
    )

    # Reference for ServiceSpec arguments:
    # https://www.pulumi.com/registry/packages/kubernetes/api-docs/core/v1/servicespec/

    # Export the Service name.
    pulumi.export("service_name", model_service.metadata.apply(lambda m: m.name))

    Next, we'll define the Ingress resource to expose the Service to external traffic. We are going to use the NGINX Ingress controller, which is a popular choice in the Kubernetes community:

    # An Ingress is used to expose the Service to the outside world for HTTP(S) access.
    model_ingress = k8s.networking.v1.Ingress(
        "ai-model-ingress",
        metadata=k8s.meta.v1.ObjectMetaArgs(
            name="ai-model-ingress",
            annotations={
                # NGINX-specific annotation: rewrite the request path to "/" before it reaches the pods.
                "nginx.ingress.kubernetes.io/rewrite-target": "/",
            },
        ),
        spec=k8s.networking.v1.IngressSpecArgs(
            rules=[k8s.networking.v1.IngressRuleArgs(
                http=k8s.networking.v1.HTTPIngressRuleValueArgs(
                    paths=[k8s.networking.v1.HTTPIngressPathArgs(
                        path="/model",        # External path to access your service
                        path_type="Prefix",
                        backend=k8s.networking.v1.IngressBackendArgs(
                            service=k8s.networking.v1.IngressServiceBackendArgs(
                                name=model_service.metadata.apply(lambda m: m.name),
                                port=k8s.networking.v1.ServiceBackendPortArgs(number=80),
                            ),
                        ),
                    )],
                ),
            )],
        ),
    )

    # Reference for IngressSpec arguments:
    # https://www.pulumi.com/registry/packages/kubernetes/api-docs/networking.v1/ingressspec/

    # Export the Ingress endpoint used to access the AI model service.
    pulumi.export(
        "ingress_endpoint",
        model_ingress.status.apply(
            lambda s: s.load_balancer.ingress[0].ip
            if s and s.load_balancer and s.load_balancer.ingress
            else "Not assigned",
        ),
    )

    This Pulumi program defines two resources:

    1. A Service to load balance requests across pods matching the app: ai-model label.
    2. An Ingress to manage external traffic routing to your Service.

    Clients can send requests to the /model path, and the NGINX Ingress controller will route those requests to your AI model's Service, which in turn load balances them across available pods.
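
    Once the Ingress has an external address, a client can call the model over HTTP. Below is a minimal sketch using Python's requests library; the endpoint IP, request payload, and response format are assumptions about your particular model server:

    import requests

    # Hypothetical values: replace with the address exported as `ingress_endpoint`
    # and whatever request/response schema your model server expects.
    INGRESS_ENDPOINT = "203.0.113.10"

    response = requests.post(
        f"http://{INGRESS_ENDPOINT}/model",
        json={"inputs": [[1.0, 2.0, 3.0]]},  # assumed input format
        timeout=10,
    )
    response.raise_for_status()
    print(response.json())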

    Make sure the NGINX Ingress controller is installed in your cluster; if it is not, install it before applying the Ingress resource. The nginx.ingress.kubernetes.io/rewrite-target annotation in the Ingress metadata is specific to the NGINX Ingress controller: it rewrites the path the client used (here /model) to the root path / before the request reaches your pods.
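
    If you manage the cluster with Pulumi as well, one way to install the controller is via its Helm chart. Here is a minimal sketch using pulumi_kubernetes's Helm support; the chart name, repository URL, and namespace are the commonly documented values for ingress-nginx, so verify them against the chart's documentation:

    import pulumi_kubernetes as k8s

    # Install the NGINX Ingress controller from its Helm chart.
    nginx_ingress = k8s.helm.v3.Release(
        "nginx-ingress",
        chart="ingress-nginx",
        namespace="ingress-nginx",
        create_namespace=True,
        repository_opts=k8s.helm.v3.RepositoryOptsArgs(
            repo="https://kubernetes.github.io/ingress-nginx",
        ),
    )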

    Don't forget that your serving pods need to be labeled with app: ai-model to match the selector in the Service definition. The port configuration assumes that your pods serve the AI model on port 8080 and that the Service and Ingress expose port 80; adjust these ports to match your deployment.
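
    For reference, a minimal sketch of a matching Deployment could look like this. The image name and replica count are placeholders; the label and container port match the assumptions above:

    # Hypothetical Deployment for the model server; the image is a placeholder.
    model_deployment = k8s.apps.v1.Deployment(
        "ai-model-deployment",
        spec=k8s.apps.v1.DeploymentSpecArgs(
            replicas=3,
            selector=k8s.meta.v1.LabelSelectorArgs(
                match_labels={"app": "ai-model"},  # must match the Service selector
            ),
            template=k8s.core.v1.PodTemplateSpecArgs(
                metadata=k8s.meta.v1.ObjectMetaArgs(
                    labels={"app": "ai-model"},
                ),
                spec=k8s.core.v1.PodSpecArgs(
                    containers=[k8s.core.v1.ContainerArgs(
                        name="model-server",
                        image="registry.example.com/ai-model:latest",  # placeholder image
                        ports=[k8s.core.v1.ContainerPortArgs(container_port=8080)],
                    )],
                ),
            ),
        ),
    )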

    After applying this Pulumi program, the resulting infrastructure will automatically load balance incoming requests to your AI model serving endpoints, and you will be able to access the service externally at the IP address provided by the Ingress resource.