Serverless Inference with GCP Cloud Run and TensorFlow
To create a serverless inference service with GCP Cloud Run and TensorFlow, we are going to follow these high-level steps:
- Containerize the TensorFlow Model: Package your trained TensorFlow model into a Docker container that exposes the model's inference capabilities as a REST API (a minimal server sketch follows this list).
- Push the Container to Container Registry: Push your Docker container image to Google Container Registry so that it can be deployed to Cloud Run.
- Deploy to Cloud Run: Create a Cloud Run service that pulls the container image from Container Registry and runs it.
For the purposes of the Pulumi program, I will assume you have already containerized your TensorFlow model and that your GCP (Google Cloud) credentials are set up so Pulumi can interact with your GCP account. The following Pulumi program is written in Python and shows how to deploy the containerized model to Cloud Run:
- Import the necessary GCP module for Cloud Run.
- Define the Cloud Run service using the `Service` class.
- Set up necessary permissions if needed (see the IAM sketch after the code walkthrough below).
Here's the Pulumi program that performs these steps:
```python
import pulumi
import pulumi_gcp as gcp

# Replace these with your GCP project ID and the URL of your container image.
PROJECT_ID = 'your-gcp-project-id'
CONTAINER_IMAGE_URL = 'gcr.io/your-project-id/your-tensorflow-model-container'

# Create a Cloud Run service
tensorflow_service = gcp.cloudrun.Service("tensorflow-service",
    location="us-central1",
    template=gcp.cloudrun.ServiceTemplateArgs(
        spec=gcp.cloudrun.ServiceTemplateSpecArgs(
            containers=[
                gcp.cloudrun.ServiceTemplateSpecContainerArgs(
                    image=CONTAINER_IMAGE_URL,
                    ports=[gcp.cloudrun.ServiceTemplateSpecContainerPortArgs(
                        container_port=8080
                    )]
                )
            ],
            # A timeout is useful in case prediction/inference takes too long;
            # tune this to your model's latency.
            timeout_seconds=300,
        )),
    metadata=gcp.cloudrun.ServiceMetadataArgs(
        # The namespace of a Cloud Run service is the project ID.
        namespace=PROJECT_ID,
    ),
    traffics=[gcp.cloudrun.ServiceTrafficArgs(
        percent=100,
        # 'latest_revision' always routes traffic to the most recent revision.
        latest_revision=True
    )],
    project=PROJECT_ID,
    autogenerate_revision_name=True
)

# Export the URL of the Cloud Run service. The status URL already includes
# the 'https://' scheme, so it can be exported as-is.
pulumi.export('url', tensorflow_service.statuses[0].url)
```
Here's what each section of the code is doing:
- The `pulumi_gcp` Python package is used to create resources on Google Cloud Platform.
- The `gcp.cloudrun.Service` class defines a new managed Cloud Run service. It expects parameters like `location` for where to deploy the service, `template` to describe the container that runs on Cloud Run, `metadata` to provide additional information like the namespace, and `traffics` to configure how incoming requests are routed.
- The `template` parameter is particularly important here, as it is where you define the container images and their configurations, including the `image` property with the path to the container in Google Container Registry and the `ports` property to configure the container port your application listens on.
- `gcp.cloudrun.ServiceTrafficArgs` specifies how to route traffic to revisions of this service. Here, we route 100% of the traffic to the latest revision with `latest_revision=True`.
- The exported URL, `pulumi.export('url', tensorflow_service.statuses[0].url)`, is the endpoint where you can send inference requests to the deployed TensorFlow model.
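One note on the "permissions" bullet from earlier: Cloud Run services reject unauthenticated requests by default. If the inference endpoint should be publicly callable, one option (a sketch, and a deliberate security trade-off) is to grant the `run.invoker` role to `allUsers` with an `IamMember` resource appended to the program above:

```python
# Optional: allow unauthenticated invocations of the service. Omit this
# resource if callers should authenticate with IAM instead.
public_invoker = gcp.cloudrun.IamMember("public-invoker",
    location=tensorflow_service.location,
    project=PROJECT_ID,
    service=tensorflow_service.name,
    role="roles/run.invoker",
    member="allUsers")
```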
We just created a Cloud Run service that can be used to perform serverless inference with a TensorFlow model. The service scales automatically, including down to zero, and requires no server management, making it a convenient and cost-effective way to serve machine learning inference workloads.
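Once `pulumi up` completes, you can verify the deployment with an ordinary HTTP client. Here is a sketch using the `requests` library, assuming the container exposes the hypothetical `/predict` route from the server sketch above (substitute the URL exported by Pulumi):

```python
import requests

# Placeholder URL; use the 'url' value that Pulumi exported.
SERVICE_URL = "https://tensorflow-service-xxxxxxxx-uc.a.run.app"

response = requests.post(
    f"{SERVICE_URL}/predict",
    json={"instances": [[5.1, 3.5, 1.4, 0.2]]},  # example feature vector
    timeout=60,
)
response.raise_for_status()
print(response.json())
```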