1. Scalable Inference Serving with Kubernetes


    To serve a machine learning model using Kubernetes, we need to deploy the model as a microservice, which can be accessed via REST API calls. The service should be scalable to handle varying loads of inference requests, which means that Kubernetes should increase or decrease the number of pods running the service based on demand. A typical setup might include:

    1. A Docker container encapsulating your machine learning model and inference code.
    2. A Kubernetes Deployment to manage the desired state of your pods.
    3. A Kubernetes Service to allow network access to the pods.
    4. Horizontal Pod Autoscaling (HPA) to automatically scale the number of pods in response to metrics like CPU utilization or custom metrics.
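
    To make step 1 concrete, here is a minimal sketch of the kind of inference code such a container might run, using only the Python standard library. The "model" here is a hypothetical stand-in (a fixed weighted sum); a real service would load trained weights at startup. It assumes a JSON POST body of the form {"features": [...]} and the port 8080 used throughout this example.

    ```python
    import json
    from http.server import BaseHTTPRequestHandler, HTTPServer

    def predict(features):
        # Placeholder model: a fixed weighted sum of the input features.
        # A real service would run a trained model here instead.
        weights = [0.5, -0.25, 1.0]
        return sum(w * x for w, x in zip(weights, features))

    class InferenceHandler(BaseHTTPRequestHandler):
        def do_POST(self):
            # Read the JSON request body and run the model on it.
            length = int(self.headers.get("Content-Length", 0))
            payload = json.loads(self.rfile.read(length))
            body = json.dumps({"prediction": predict(payload["features"])}).encode()
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    # To serve on the port the deployment expects:
    # HTTPServer(("", 8080), InferenceHandler).serve_forever()
    ```

    Any HTTP framework (Flask, FastAPI, a model server such as TorchServe, etc.) would work equally well; all Kubernetes requires is that the container listens on the declared containerPort.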

    For demonstration purposes, let's say you have a container image my-inference-service:latest that serves predictions on port 8080. Below is a Pulumi program that sets up a scalable inference service using Kubernetes:

    • kubernetes.apps.v1.Deployment: Defines a deployment object that manages the state of your inference service pods.
    • kubernetes.core.v1.Service: Exposes your inference service as a network service to receive external traffic.
    • kubernetes.autoscaling.v2.HorizontalPodAutoscaler: Automatically adjusts the number of pods based on a defined CPU utilization target. (The autoscaling/v2 API is required here because the spec uses the metrics field, which autoscaling/v1 does not support.)

    Here is a Pulumi program that accomplishes these goals:

    import pulumi
    from pulumi_kubernetes import Provider
    from pulumi_kubernetes.apps.v1 import Deployment
    from pulumi_kubernetes.core.v1 import Service
    # The autoscaling/v2 API is needed for the 'metrics' field used below.
    from pulumi_kubernetes.autoscaling.v2 import HorizontalPodAutoscaler

    # Firstly, you create a Kubernetes provider to deploy resources to a specific K8s cluster.
    # This assumes that you have a kubeconfig file available on your machine and
    # correctly set up to point to your Kubernetes cluster.
    k8s_provider = Provider(resource_name='k8s')

    # We'll define the deployment of the inference service
    inference_deployment = Deployment(
        'inference-deployment',
        # Specification of the deployment
        spec={
            'selector': {'matchLabels': {'app': 'inference-service'}},
            'replicas': 1,  # Start with one replica
            'template': {
                'metadata': {'labels': {'app': 'inference-service'}},
                'spec': {
                    'containers': [{
                        'name': 'inference-container',
                        'image': 'my-inference-service:latest',
                        'ports': [{'containerPort': 8080}],
                        # Define resource requirements and limits for the inference container
                        'resources': {
                            'requests': {'cpu': '500m', 'memory': '512Mi'},
                            'limits': {'cpu': '1000m', 'memory': '1024Mi'}
                        }
                    }]
                }
            }
        },
        opts=pulumi.ResourceOptions(provider=k8s_provider)
    )

    # Create a service to expose the inference deployment
    inference_service = Service(
        'inference-service',
        spec={
            'type': 'LoadBalancer',  # Use LoadBalancer to expose the service externally
            'selector': {'app': 'inference-service'},
            'ports': [{'port': 80, 'targetPort': 8080}]
        },
        opts=pulumi.ResourceOptions(provider=k8s_provider)
    )

    # Define a HorizontalPodAutoscaler to scale the inference service pods
    inference_hpa = HorizontalPodAutoscaler(
        'inference-hpa',
        # HPA spec to scale when the average CPU exceeds 50% of the requested amount
        spec={
            'scaleTargetRef': {
                'apiVersion': 'apps/v1',
                'kind': 'Deployment',
                'name': inference_deployment.metadata['name']
            },
            'minReplicas': 1,
            'maxReplicas': 10,  # Maximum number of replicas to scale out to
            'metrics': [{
                'type': 'Resource',
                'resource': {
                    'name': 'cpu',
                    # Scale if average CPU usage exceeds 50% of the request
                    'target': {'type': 'Utilization', 'averageUtilization': 50}
                }
            }]
        },
        opts=pulumi.ResourceOptions(provider=k8s_provider)
    )

    # Export the inference service's load balancer external IP to access the service
    pulumi.export(
        'inference_service_ip',
        inference_service.status['load_balancer']['ingress'][0]['ip']
    )

    This program does the following:

    • Initializes a Kubernetes provider to interact with your cluster.
    • Defines a Kubernetes deployment that contains your inference application, including the number of replicas and the necessary resources (CPU/memory).
    • Creates a Kubernetes service to make your deployment accessible over the network. The service type LoadBalancer makes the application available through an external IP.
    • Sets up a horizontal pod autoscaler to automatically scale the number of pods up or down based on CPU utilization; it adjusts between a minimum of one and a maximum of ten pods.

    After deploying this program, the inference_service_ip stack output will contain the external IP assigned by the load balancer. This IP can be used to send inference requests to your model.
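
    As a rough sketch of what a client might look like, the snippet below sends one JSON inference request to the exported IP using only the standard library. The /predict path and the {"features": [...]} payload shape are assumptions about your model's API; adjust both to match your actual service.

    ```python
    import json
    from urllib import request

    def build_inference_request(service_ip, features):
        # Port 80 on the LoadBalancer forwards to containerPort 8080.
        # The '/predict' path and payload shape are hypothetical; adapt as needed.
        payload = json.dumps({"features": features}).encode()
        return request.Request(
            f"http://{service_ip}/predict",
            data=payload,
            headers={"Content-Type": "application/json"},
        )

    def request_prediction(service_ip, features):
        # Send the request and return the decoded JSON response.
        with request.urlopen(build_inference_request(service_ip, features)) as resp:
            return json.loads(resp.read())
    ```

    For example, request_prediction("<inference_service_ip>", [1.0, 2.0, 3.0]) would POST those features and return the service's JSON reply.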

    You should tailor the CPU and memory request and limit values, as well as the target utilization percentage, based on the resource requirements and expected load on your inference service.
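
    When tuning these values, it helps to know the rule the autoscaler applies: the HPA sets the desired replica count to ceil(currentReplicas × currentUtilization / targetUtilization), clamped between minReplicas and maxReplicas. A small sketch of that calculation:

    ```python
    import math

    def desired_replicas(current_replicas, current_utilization, target_utilization,
                         min_replicas=1, max_replicas=10):
        # Core HPA scaling rule: ceil(current * current/target),
        # clamped to the configured replica bounds.
        desired = math.ceil(current_replicas * current_utilization / target_utilization)
        return max(min_replicas, min(max_replicas, desired))
    ```

    For instance, with a 50% target, two pods averaging 90% CPU scale out to ceil(2 × 90 / 50) = 4 pods, while four pods averaging 20% scale in to ceil(4 × 20 / 50) = 2. (The real controller also applies tolerances and stabilization windows, so it reacts more conservatively than this bare formula.)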