1. Orchestrating TensorFlow Serving with Kubernetes for Model Inference

    To orchestrate TensorFlow Serving with Kubernetes for model inference, we will create a Kubernetes Deployment that runs TensorFlow Serving and expose it with a Service that accepts inference requests. Here are the steps involved:

    1. Create a TensorFlow Serving Docker Image: Package your trained TensorFlow model into a TensorFlow Serving Docker image (a sketch of such a Dockerfile follows this list). This image will be used to create containers in the Kubernetes cluster.

    2. Deploy with Kubernetes: Write a Pulumi program to define a Kubernetes Deployment resource that will manage TensorFlow Serving containers. This deployment will ensure the desired state of running TensorFlow Serving containers is maintained.

    3. Expose the Service: Define a Kubernetes Service resource that exposes the TensorFlow Serving Deployment so it can receive inference requests.

    4. Optional - Ingress: If external access is required, create an Ingress resource to expose your service to the internet.
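
    As a sketch of step 1, a minimal Dockerfile based on the official tensorflow/serving image could look like the following. The model directory name my_model and the image tag used later are placeholders, so substitute your own.

    # Start from the official TensorFlow Serving image.
    FROM tensorflow/serving

    # Copy the exported SavedModel into the directory TensorFlow Serving loads from.
    # The version subdirectories (e.g. my_model/1/) live underneath this folder.
    COPY ./my_model /models/my_model

    # Tell TensorFlow Serving which model to serve.
    ENV MODEL_NAME=my_model

    Build this image and push it to a registry your cluster can pull from, then reference that tag in the Pulumi program below.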

    Let's write the Pulumi program:

    import pulumi
    import pulumi_kubernetes as k8s

    # Configuration
    # The Docker image for TensorFlow Serving that includes your trained model.
    # Replace this with the image you have prepared.
    tf_serving_image = "your-docker-hub-user/tensorflow-serving:latest"

    # Kubernetes Deployment for TensorFlow Serving
    tf_deployment = k8s.apps.v1.Deployment(
        "tf-serving-deployment",
        spec=k8s.apps.v1.DeploymentSpecArgs(
            replicas=1,  # Number of pods to run
            selector=k8s.meta.v1.LabelSelectorArgs(
                match_labels={"app": "tensorflow-serving"},
            ),
            template=k8s.core.v1.PodTemplateSpecArgs(
                metadata=k8s.meta.v1.ObjectMetaArgs(
                    labels={"app": "tensorflow-serving"},
                ),
                spec=k8s.core.v1.PodSpecArgs(
                    containers=[
                        k8s.core.v1.ContainerArgs(
                            name="tensorflow-serving",
                            image=tf_serving_image,
                            # Exposing port 8501 for REST API access
                            ports=[k8s.core.v1.ContainerPortArgs(container_port=8501)],
                            # Add readiness and liveness probes as needed
                        ),
                    ],
                ),
            ),
        ))

    # Kubernetes Service that exposes TensorFlow Serving
    tf_service = k8s.core.v1.Service(
        "tf-serving-service",
        spec=k8s.core.v1.ServiceSpecArgs(
            selector={"app": "tensorflow-serving"},
            ports=[k8s.core.v1.ServicePortArgs(
                port=8501,         # Port that the service will serve on
                target_port=8501,  # Target port of the TensorFlow Serving container
            )],
            # If you're running on a cloud provider that supports LoadBalancers, use this type.
            type="LoadBalancer",
        ))

    # Export the Service name and endpoint
    pulumi.export("tf_serving_service_name", tf_service.metadata["name"])
    # This will be the endpoint for sending inference requests.
    pulumi.export(
        "tf_serving_service_url",
        tf_service.status["load_balancer"]["ingress"][0]["ip"].apply(
            lambda ip: f"http://{ip}:8501"))

    This Pulumi program defines a Kubernetes Deployment (tf-serving-deployment) that runs the Docker image containing your TensorFlow model. The Deployment maintains a single replica; change the replicas value to scale out as needed.

    The program then defines a Kubernetes Service (tf-serving-service) that exposes the TensorFlow Serving REST API on port 8501. The Service uses the LoadBalancer type, which cloud-managed Kubernetes offerings such as GKE, EKS, and AKS use to provision an externally reachable endpoint.

    After deploying this with pulumi up, you should have a public IP address that you can use to send inference requests to TensorFlow Serving, for example with the request sketched below.
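
    As a quick smoke test, you can POST to TensorFlow Serving's standard REST prediction endpoint. The sketch below assumes the exported service URL, a model named my_model, and an input that matches your model's signature; all three are placeholders to replace.

    import requests

    # Replace with the exported tf_serving_service_url and your model's name.
    serving_url = "http://<load-balancer-ip>:8501"
    model_name = "my_model"

    # TensorFlow Serving's REST API expects a JSON body with an "instances" list.
    payload = {"instances": [[1.0, 2.0, 5.0]]}  # example input; shape must match your model

    response = requests.post(
        f"{serving_url}/v1/models/{model_name}:predict",
        json=payload,
        timeout=10,
    )
    response.raise_for_status()
    print(response.json())  # {"predictions": [...]}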

    Make sure to replace your-docker-hub-user/tensorflow-serving:latest with the actual Docker image path for your TensorFlow model. Note also that some providers (for example EKS) report a hostname rather than an IP for the LoadBalancer, in which case the URL export needs adjusting. If your cloud provider does not support LoadBalancer services, change the Service type (for example to ClusterIP) and add an Ingress to expose the service to the internet; a sketch follows below.
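
    If you do go the Ingress route, a minimal sketch could look like the following. It assumes an ingress controller (for example NGINX) is already installed in the cluster, that the Service above has been switched to ClusterIP, and that serving.example.com is a placeholder hostname.

    # Ingress routing external traffic to the TensorFlow Serving Service defined above.
    tf_ingress = k8s.networking.v1.Ingress(
        "tf-serving-ingress",
        spec=k8s.networking.v1.IngressSpecArgs(
            ingress_class_name="nginx",  # assumes an NGINX ingress controller
            rules=[k8s.networking.v1.IngressRuleArgs(
                host="serving.example.com",  # placeholder hostname
                http=k8s.networking.v1.HTTPIngressRuleValueArgs(
                    paths=[k8s.networking.v1.HTTPIngressPathArgs(
                        path="/",
                        path_type="Prefix",
                        backend=k8s.networking.v1.IngressBackendArgs(
                            service=k8s.networking.v1.IngressServiceBackendArgs(
                                name=tf_service.metadata["name"],
                                port=k8s.networking.v1.ServiceBackendPortArgs(number=8501),
                            ),
                        ),
                    )],
                ),
            )],
        ))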

    Keep in mind that readiness and liveness probes are recommended for production deployments to increase the reliability of your service. Add them under the container specification for TensorFlow Serving, for example as sketched below.
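
    One possible sketch: TensorFlow Serving reports model status over its REST port at /v1/models/<model name>, which both probes can poll. The model name my_model and the probe timings below are placeholders to tune for your deployment.

    # Drop-in replacement for the ContainerArgs in the Deployment above, with probes added.
    tf_container = k8s.core.v1.ContainerArgs(
        name="tensorflow-serving",
        image=tf_serving_image,
        ports=[k8s.core.v1.ContainerPortArgs(container_port=8501)],
        readiness_probe=k8s.core.v1.ProbeArgs(
            http_get=k8s.core.v1.HTTPGetActionArgs(
                path="/v1/models/my_model",  # returns the model's availability state
                port=8501,
            ),
            initial_delay_seconds=15,
            period_seconds=10,
        ),
        liveness_probe=k8s.core.v1.ProbeArgs(
            http_get=k8s.core.v1.HTTPGetActionArgs(
                path="/v1/models/my_model",
                port=8501,
            ),
            initial_delay_seconds=30,
            period_seconds=30,
        ),
    )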