1. Scalable ML Model Serving with GKE


    To serve a scalable machine learning (ML) model on Google Kubernetes Engine (GKE), you need to perform the following steps:

    1. Create a GKE cluster where your ML model will be deployed and served.
    2. Define a Deployment that references your ML model's Docker image. The image should contain your model code plus a web server (such as Flask or FastAPI) that exposes an API endpoint for predictions; a minimal server sketch follows this list.
    3. Use a Horizontal Pod Autoscaler (HPA) to scale the Deployment up and down automatically based on load; an HPA example is sketched at the end of this section.
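
    For step 2, a minimal prediction server could look like the sketch below. It assumes FastAPI with a hypothetical /predict route and a dummy stand-in model; the file name app.py and the request schema are illustrative, so swap in your real model-loading and inference code.

# app.py -- a minimal FastAPI prediction server (sketch only; replace the dummy model).
from typing import List

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()


class PredictRequest(BaseModel):
    features: List[float]


class DummyModel:
    """Stands in for your real model; load it with joblib/torch/etc. instead."""

    def predict(self, features: List[float]) -> float:
        # Dummy "prediction": the sum of the inputs. Swap in real inference.
        return sum(features)


model = DummyModel()


@app.get("/healthz")
def healthz():
    # Handy target for Kubernetes liveness/readiness probes.
    return {"status": "ok"}


@app.post("/predict")
def predict(request: PredictRequest):
    return {"prediction": model.predict(request.features)}

    Running, for example, `uvicorn app:app --host 0.0.0.0 --port 8080` inside the container starts the server; whatever port uvicorn listens on is the value to use for `<your-container-port>` in the Deployment manifest below.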

    Below is a program written in Python using Pulumi to provision a GKE cluster and then apply the Kubernetes Deployment and Service that serve an ML model. It uses the pulumi_gcp package for the cluster and the pulumi_kubernetes package to apply the YAML manifests against that cluster.

import pulumi
import pulumi_gcp as gcp
import pulumi_kubernetes as k8s

# Create a GKE cluster.
cluster = gcp.container.Cluster(
    "ml-cluster",
    initial_node_count=3,
    node_config=gcp.container.ClusterNodeConfigArgs(
        machine_type="n1-standard-1",
        oauth_scopes=[
            "https://www.googleapis.com/auth/compute",
            "https://www.googleapis.com/auth/devstorage.read_only",
            "https://www.googleapis.com/auth/logging.write",
            "https://www.googleapis.com/auth/monitoring",
        ],
    ),
)

# Build a kubeconfig for the new cluster so the Kubernetes provider can reach it.
# This assumes the gke-gcloud-auth-plugin is installed wherever Pulumi runs.
kubeconfig = pulumi.Output.all(cluster.name, cluster.endpoint, cluster.master_auth).apply(
    lambda args: f"""apiVersion: v1
clusters:
- cluster:
    certificate-authority-data: {args[2]['cluster_ca_certificate']}
    server: https://{args[1]}
  name: {args[0]}
contexts:
- context:
    cluster: {args[0]}
    user: {args[0]}
  name: {args[0]}
current-context: {args[0]}
kind: Config
preferences: {{}}
users:
- name: {args[0]}
  user:
    exec:
      apiVersion: client.authentication.k8s.io/v1beta1
      command: gke-gcloud-auth-plugin
"""
)

# A Kubernetes provider that targets the GKE cluster created above.
k8s_provider = k8s.Provider("gke-k8s", kubeconfig=kubeconfig)

# Deploy your ML model using the Kubernetes Deployment defined in YAML.
ml_deployment = k8s.yaml.ConfigFile(
    "ml-deployment",
    file="ml-model-deployment.yaml",
    resource_prefix="ml-cluster",
    opts=pulumi.ResourceOptions(provider=k8s_provider),
)

# Create a Kubernetes Service to expose the ML model Deployment.
ml_service = k8s.yaml.ConfigFile(
    "ml-service",
    file="ml-model-service.yaml",
    resource_prefix="ml-cluster",
    opts=pulumi.ResourceOptions(
        provider=k8s_provider,
        depends_on=[ml_deployment],  # Ensure the Deployment exists before exposing it.
    ),
)

# Export the GKE cluster name and endpoint.
pulumi.export("cluster_name", cluster.name)
pulumi.export("cluster_endpoint", cluster.endpoint)

# Replace the placeholders in 'ml-model-deployment.yaml':
# - `<your-docker-image>` with the URL of your ML model's Docker image.
# - `<your-container-port>` with the port on which your web server listens inside the container.
# Replace the placeholders in 'ml-model-service.yaml':
# - `<your-service-port>` with the port on which you want to expose the Service.
# - `<target-port>` with the port your application receives traffic on.

    Before running this program, you should have two YAML files: ml-model-deployment.yaml and ml-model-service.yaml, which define the Kubernetes Deployment and Service, respectively, for your ML model.

    Here's an example of what ml-model-deployment.yaml might look like:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: ml-model-deployment
spec:
  replicas: 3
  selector:
    matchLabels:
      app: ml-model
  template:
    metadata:
      labels:
        app: ml-model
    spec:
      containers:
        - name: ml-model
          image: <your-docker-image>
          ports:
            - containerPort: <your-container-port>

    And ml-model-service.yaml would look like this:

apiVersion: v1
kind: Service
metadata:
  name: ml-model-service
spec:
  type: LoadBalancer
  selector:
    app: ml-model
  ports:
    - protocol: TCP
      port: <your-service-port>
      targetPort: <target-port>

    Make sure to replace the placeholder values with the actual details of your ML model and Docker configuration. These files will be applied to the cluster to create the necessary Kubernetes resources.
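
    Once the load balancer has assigned the Service an external IP (visible via kubectl get service ml-model-service), you could exercise the endpoint with a small client like the sketch below. The IP, port, and /predict path are placeholders that depend on how you filled in the manifests and how your web server routes requests.

# client.py -- send a test prediction request to the exposed service (sketch only).
import requests

# Replace with the external IP of ml-model-service and the service port you chose.
SERVICE_URL = "http://<external-ip>:<your-service-port>/predict"

payload = {"features": [1.0, 2.0, 3.0]}
response = requests.post(SERVICE_URL, json=payload, timeout=10)
response.raise_for_status()
print(response.json())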

    The program uses Pulumi's GCP package to set up a GKE cluster (pulumi_gcp.container.Cluster) with a specified node count and a node configuration that includes the machine type and the OAuth scopes the nodes need.

    Then, Pulumi applies the Kubernetes YAML manifests (pulumi_kubernetes.yaml.ConfigFile) to deploy the ML model and expose it via a Kubernetes Service. An explicit pulumi_kubernetes.Provider built from the cluster's kubeconfig ensures the manifests are applied to the newly created GKE cluster rather than whatever cluster your local kubeconfig points at, and the resource_prefix parameter prefixes the generated Pulumi resource names with the cluster's logical name to avoid name clashes.

    After applying this program with Pulumi, you will have scalable infrastructure ready to serve your ML model behind a simple API endpoint. The GKE cluster's node pool provides the capacity, and a Kubernetes Horizontal Pod Autoscaler, sketched below, keeps the number of serving pods in line with demand.
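
    The Horizontal Pod Autoscaler itself is not created by the program above. As a sketch, you could add one to the same Pulumi program through the pulumi_kubernetes provider; the example below targets the ml-model-deployment Deployment and scales on CPU utilization, which assumes the container in your Deployment declares CPU resource requests. The replica bounds and the 70% threshold are illustrative, not prescriptive.

# Add to the Pulumi program above: an HPA for the ML model Deployment.
import pulumi
import pulumi_kubernetes as k8s

ml_hpa = k8s.autoscaling.v1.HorizontalPodAutoscaler(
    "ml-model-hpa",
    spec=k8s.autoscaling.v1.HorizontalPodAutoscalerSpecArgs(
        scale_target_ref=k8s.autoscaling.v1.CrossVersionObjectReferenceArgs(
            api_version="apps/v1",
            kind="Deployment",
            name="ml-model-deployment",
        ),
        min_replicas=3,
        max_replicas=10,
        # Scale out when average CPU utilization across the pods exceeds 70%.
        target_cpu_utilization_percentage=70,
    ),
    opts=pulumi.ResourceOptions(
        provider=k8s_provider,       # the Kubernetes provider defined in the program above
        depends_on=[ml_deployment],  # create the HPA only after the Deployment exists
    ),
)

    With this in place, Kubernetes adds or removes serving pods between the configured minimum and maximum as load on the model changes.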