1. Scalable Inference Serving with Kubernetes

    To serve a machine learning model using Kubernetes, we deploy the model as a microservice that can be accessed via REST API calls. The service should scale to handle a varying load of inference requests, which means Kubernetes should increase or decrease the number of pods running the service based on demand. A typical setup might include:

    1. A Docker container encapsulating your machine learning model and inference code (a minimal sketch of such a server appears after this list).
    2. A Kubernetes Deployment to manage the desired state of your pods.
    3. A Kubernetes Service to allow network access to the pods.
    4. A Horizontal Pod Autoscaler (HPA) to automatically scale the number of pods in response to CPU utilization or custom metrics.
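
    For item 1, the container typically wraps a small web server that loads the model and answers prediction requests on port 8080. Below is a minimal sketch of such a server using Flask; the /predict route, payload shape, and placeholder model logic are illustrative assumptions rather than part of the Pulumi program that follows.

    from flask import Flask, jsonify, request

    app = Flask(__name__)

    # In a real service you would load your trained model once at startup,
    # e.g. model = joblib.load('model.joblib').
    model = None

    @app.route('/predict', methods=['POST'])
    def predict():
        payload = request.get_json(force=True)
        features = payload.get('features', [])
        # Replace this placeholder with a real call such as model.predict([features]).
        prediction = sum(features)
        return jsonify({'prediction': prediction})

    if __name__ == '__main__':
        # Listen on all interfaces, on the port the Kubernetes manifests expect.
        app.run(host='0.0.0.0', port=8080)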

    For demonstration purposes, let's say you have a container image my-inference-service:latest that serves predictions on port 8080. The Pulumi program below sets up a scalable inference service from that image using the following Kubernetes resources:

    • kubernetes.apps.v1.Deployment: Defines a deployment object that manages the state of your inference service pods.
    • kubernetes.core.v1.Service: Exposes your inference service as a network service to receive external traffic.
    • kubernetes.autoscaling.v2.HorizontalPodAutoscaler: Automatically adjusts the number of running pods based on a defined CPU utilization target.

    Here is a Pulumi program that accomplishes these goals:

    import pulumi
    from pulumi_kubernetes import Provider
    from pulumi_kubernetes.apps.v1 import Deployment
    from pulumi_kubernetes.core.v1 import Service
    from pulumi_kubernetes.autoscaling.v2 import HorizontalPodAutoscaler

    # First, create a Kubernetes provider to deploy resources to a specific cluster.
    # This assumes that you have a kubeconfig file available on your machine and
    # correctly set up to point to your Kubernetes cluster.
    k8s_provider = Provider(resource_name='k8s')

    # Define the deployment of the inference service.
    inference_deployment = Deployment(
        'inference-deployment',
        spec={
            'selector': {'matchLabels': {'app': 'inference-service'}},
            'replicas': 1,  # Start with one replica
            'template': {
                'metadata': {'labels': {'app': 'inference-service'}},
                'spec': {
                    'containers': [{
                        'name': 'inference-container',
                        'image': 'my-inference-service:latest',
                        'ports': [{'containerPort': 8080}],
                        # Resource requests and limits for the inference container
                        'resources': {
                            'requests': {'cpu': '500m', 'memory': '512Mi'},
                            'limits': {'cpu': '1000m', 'memory': '1024Mi'},
                        },
                    }],
                },
            },
        },
        opts=pulumi.ResourceOptions(provider=k8s_provider),
    )

    # Create a service to expose the inference deployment.
    inference_service = Service(
        'inference-service',
        spec={
            'type': 'LoadBalancer',  # Use LoadBalancer to expose the service externally
            'selector': {'app': 'inference-service'},
            'ports': [{'port': 80, 'targetPort': 8080}],
        },
        opts=pulumi.ResourceOptions(provider=k8s_provider),
    )

    # Define a HorizontalPodAutoscaler (autoscaling/v2) that scales the inference
    # pods when average CPU usage exceeds 50% of the requested amount.
    inference_hpa = HorizontalPodAutoscaler(
        'inference-hpa',
        spec={
            'scaleTargetRef': {
                'apiVersion': 'apps/v1',
                'kind': 'Deployment',
                'name': inference_deployment.metadata['name'],
            },
            'minReplicas': 1,
            'maxReplicas': 10,  # Maximum number of replicas to scale out to
            'metrics': [{
                'type': 'Resource',
                'resource': {
                    'name': 'cpu',
                    'target': {
                        'type': 'Utilization',
                        'averageUtilization': 50,  # Scale when CPU usage exceeds 50%
                    },
                },
            }],
        },
        opts=pulumi.ResourceOptions(provider=k8s_provider),
    )

    # Export the load balancer's external IP so the inference service can be reached.
    pulumi.export(
        'inference_service_ip',
        inference_service.status.apply(lambda status: status.load_balancer.ingress[0].ip),
    )

    This program does the following:

    • Initializes a Kubernetes provider to interact with your cluster.
    • Defines a Kubernetes deployment that contains your inference application, including the number of replicas and the necessary resources (CPU/memory).
    • Creates a Kubernetes service to make your deployment accessible over the network. The service type LoadBalancer makes the application available through an external IP.
    • Sets up a horizontal pod autoscaler to automatically scale the number of pods up or down based on CPU utilization, adjusting between a minimum of one and a maximum of ten pods. Note that CPU-based autoscaling requires the Kubernetes metrics server to be running in the cluster.

    After deploying this program, the inference_service_ip stack output will contain the external IP of the load balancer. You can use this IP to send inference requests to your model, as sketched below.
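
    For example, assuming the container exposes a JSON endpoint such as the hypothetical /predict route sketched earlier (the endpoint path and payload shape depend entirely on your inference code), a client request might look like this:

    import requests

    # Replace with the value of the inference_service_ip stack output,
    # e.g. from `pulumi stack output inference_service_ip`.
    service_ip = '203.0.113.10'  # placeholder IP

    response = requests.post(
        f'http://{service_ip}/predict',       # the Service listens on port 80
        json={'features': [1.0, 2.5, 3.3]},   # payload shape depends on your model
        timeout=10,
    )
    print(response.json())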

    You should tailor the CPU and memory request and limit values, as well as the target utilization percentage, based on the resource requirements and expected load on your inference service.
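
    If these values differ between environments, one option is to read them from Pulumi stack configuration instead of hard-coding them. A minimal sketch (the configuration key names below are arbitrary choices, not required by Pulumi or Kubernetes):

    import pulumi

    config = pulumi.Config()

    # Fall back to the values used above when no configuration is set;
    # override per stack with e.g. `pulumi config set cpuRequest 250m`.
    cpu_request = config.get('cpuRequest') or '500m'
    cpu_limit = config.get('cpuLimit') or '1000m'
    memory_request = config.get('memoryRequest') or '512Mi'
    memory_limit = config.get('memoryLimit') or '1024Mi'
    target_cpu_utilization = config.get_int('targetCpuUtilization') or 50
    max_replicas = config.get_int('maxReplicas') or 10

    These variables can then be substituted for the literal values in the Deployment's resources block and in the HorizontalPodAutoscaler spec.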