Serverless Inference with GCP Cloud Run and TensorFlow
To create a serverless inference service with GCP Cloud Run and TensorFlow, we are going to follow these high-level steps:
- Containerize the TensorFlow Model: Package your trained TensorFlow model into a Docker container that exposes the model's inference capabilities as a REST API (a minimal server sketch follows this list).
- Push the Container to Container Registry: Push your Docker container image to Google Container Registry so that it can be deployed to Cloud Run.
- Deploy to Cloud Run: Create a Cloud Run service that pulls the container image from Container Registry and runs it.
For the purposes of the Pulumi program, I will assume you have already containerized your TensorFlow model and that your GCP (Google Cloud) credentials are set up so Pulumi can interact with your GCP account. The following Pulumi program is written in Python and shows how to deploy the containerized model to Cloud Run:
- Import the necessary GCP module for Cloud Run.
- Define the Cloud Run service using the `Service` class.
- Set up necessary permissions if needed (see the IAM sketch after the code walkthrough below).
Here's the Pulumi program that performs these steps:
```python
import pulumi
import pulumi_gcp as gcp

# Replace these with your GCP project ID and the URL of your container image.
PROJECT_ID = 'your-gcp-project-id'
CONTAINER_IMAGE_URL = 'gcr.io/your-project-id/your-tensorflow-model-container'

# Create a Cloud Run service
tensorflow_service = gcp.cloudrun.Service("tensorflow-service",
    location="us-central1",
    template=gcp.cloudrun.ServiceTemplateArgs(
        spec=gcp.cloudrun.ServiceTemplateSpecArgs(
            containers=[
                gcp.cloudrun.ServiceTemplateSpecContainerArgs(
                    image=CONTAINER_IMAGE_URL,
                    ports=[gcp.cloudrun.ServiceTemplateSpecContainerPortArgs(
                        container_port=8080
                    )]
                )
            ],
            # A timeout is useful in case prediction/inference takes too long;
            # tune this to your model's latency.
            timeout_seconds=300,
        )),
    metadata=gcp.cloudrun.ServiceMetadataArgs(
        # The namespace of a Cloud Run service is the project ID.
        namespace=PROJECT_ID,
    ),
    traffics=[gcp.cloudrun.ServiceTrafficArgs(
        percent=100,
        # 'latest_revision' always routes traffic to the most recent revision.
        latest_revision=True
    )],
    project=PROJECT_ID,
    autogenerate_revision_name=True
)

# Export the URL of the Cloud Run service. The status URL already includes
# the 'https://' scheme, so it can be exported as-is.
pulumi.export('url', tensorflow_service.statuses[0].url)
```
Here's what each section of the code is doing:
- The `pulumi_gcp` Python package is used to create resources on Google Cloud Platform.
- The `gcp.cloudrun.Service` class defines a new managed Cloud Run service. It expects parameters like `location` for where to deploy the service, `template` to describe the container that runs on Cloud Run, `metadata` to provide additional information like the namespace, and `traffics` to configure how incoming requests are routed.
- The `template` parameter is particularly important here, as it is where you define the container images and their configurations, including the `image` property with the path to the container in Google Container Registry and the `ports` property to configure the container port your application listens on.
- `gcp.cloudrun.ServiceTrafficArgs` specifies how to route traffic to revisions of this service. Here, we route 100% of the traffic to the latest revision with `latest_revision=True`.
- The exported URL, `pulumi.export('url', tensorflow_service.statuses[0].url)`, is the endpoint where you can send inference requests to the deployed TensorFlow model.
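One note on the "permissions" bullet from earlier: Cloud Run services reject unauthenticated requests by default. If the inference endpoint should be publicly callable, one option (a sketch, and a deliberate security trade-off) is to grant the `run.invoker` role to `allUsers` with an `IamMember` resource appended to the program above:

```python
# Optional: allow unauthenticated invocations of the service. Omit this
# resource if callers should authenticate with IAM instead.
public_invoker = gcp.cloudrun.IamMember("public-invoker",
    location=tensorflow_service.location,
    project=PROJECT_ID,
    service=tensorflow_service.name,
    role="roles/run.invoker",
    member="allUsers")
```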
We just created a Cloud Run service that can be used to perform serverless inference with a TensorFlow model. The service scales automatically, including down to zero, and requires no server management, making it a convenient and cost-effective way to serve machine learning inference workloads.
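Once `pulumi up` completes, you can verify the deployment with an ordinary HTTP client. Here is a sketch using the `requests` library, assuming the container exposes the hypothetical `/predict` route from the server sketch above (substitute the URL exported by Pulumi):

```python
import requests

# Placeholder URL; use the 'url' value that Pulumi exported.
SERVICE_URL = "https://tensorflow-service-xxxxxxxx-uc.a.run.app"

response = requests.post(
    f"{SERVICE_URL}/predict",
    json={"instances": [[5.1, 3.5, 1.4, 0.2]]},  # example feature vector
    timeout=60,
)
response.raise_for_status()
print(response.json())
```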