1. Serverless Inference Endpoints with Cloud Run

    To create serverless endpoints for inference with Google Cloud Run, you would typically need to do the following:

    1. Deploy a Container: Package your machine learning model and inference code into a container and deploy it to Cloud Run. Google Cloud Run is a fully managed compute platform that automatically scales your stateless containers. (A minimal sketch of such inference code follows this list.)

    2. Expose Endpoints: Once deployed, Cloud Run will provide you with a URL to access your service.

    3. Manage Permissions: Optionally, manage IAM permissions for your Cloud Run service to control who can invoke your endpoints (a sketch of this appears after the program walkthrough below).
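
    As a reference for step 1, here is a minimal sketch of the kind of inference code you might package into the container. Flask, the model.joblib file name, and the /predict route are assumptions for illustration, not requirements of Cloud Run.

    # app.py -- a hypothetical inference server for the container
    import os

    import joblib
    from flask import Flask, jsonify, request

    app = Flask(__name__)

    # Load the model once at startup so each request only runs prediction.
    model = joblib.load('model.joblib')

    @app.route('/predict', methods=['POST'])
    def predict():
        # Expect a JSON body like {"instances": [[...feature values...], ...]}.
        payload = request.get_json(force=True)
        predictions = model.predict(payload['instances'])
        return jsonify({'predictions': predictions.tolist()})

    if __name__ == '__main__':
        # Cloud Run injects the PORT environment variable; default to 8080 locally.
        app.run(host='0.0.0.0', port=int(os.environ.get('PORT', 8080)))

    In a production image you would typically run this behind a WSGI server such as gunicorn, but the routing and PORT handling stay the same.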

    For this program, we will:

    • Use the google-native.run/v1.Service resource to deploy a new service to Cloud Run.
    • Assume you have a Docker container image ready with your inference code and model, hosted on Google Container Registry or any other container image registry that Cloud Run can access (if you still need to build and push the image, see the sketch after this list).
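
    If you still need to build and push that image, one option is to do it from the same Pulumi program. The sketch below assumes the pulumi_docker provider (v4) and a Dockerfile under ./app, and it relies on your local Docker client being authenticated to gcr.io (for example via gcloud auth configure-docker); all of these are assumptions, not part of the program shown further down.

    import pulumi_docker as docker

    # Build the inference image locally and push it to Google Container Registry.
    inference_image = docker.Image(
        'inference-image',
        image_name='gcr.io/my-gcp-project/my-inference-image',
        build=docker.DockerBuildArgs(
            context='./app',         # directory containing the Dockerfile and inference code
            platform='linux/amd64',  # Cloud Run runs amd64 containers
        ),
    )

    # inference_image.image_name could then replace the hard-coded image string below.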

    Here's a Pulumi program for deploying a serverless inference endpoint using Cloud Run:

    import pulumi
    import pulumi_google_native as google_native

    # Configuration for the Cloud Run service
    project = 'my-gcp-project'
    location = 'us-central1'
    image = 'gcr.io/my-gcp-project/my-inference-image'  # Replace with your container image
    service_name = 'my-inference-service'

    # Create a Cloud Run service
    inference_service = google_native.run.v1.Service(
        service_name,
        metadata=google_native.run.v1.ObjectMetaArgs(
            name=service_name,
            namespace=project,
        ),
        spec=google_native.run.v1.ServiceSpecArgs(
            template=google_native.run.v1.RevisionTemplateArgs(
                spec=google_native.run.v1.RevisionSpecArgs(
                    containers=[google_native.run.v1.ContainerArgs(
                        image=image,
                    )],
                ),
            ),
        ),
        location=location,
        project=project,
    )

    # Export the URL of the service
    pulumi.export('service_url', inference_service.status.url)

    In this program:

    • We import the required Pulumi modules for Google Cloud.
    • We define the configuration, such as the project ID, location, image for the inference server, and the service name.
    • We create a Cloud Run service by defining a Service object with the necessary metadata and specifications, including the container image to deploy.
    • Lastly, we export the URL of the deployed service, which can be used to make inference requests.
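
    By default, the service created above only accepts authenticated invocations. If the endpoint should be publicly callable (step 3 above), you can grant the Cloud Run invoker role to allUsers. The sketch below uses the classic pulumi_gcp provider's cloudrun.IamMember resource rather than the google-native provider used above; treat it as one possible approach.

    import pulumi_gcp as gcp

    # Allow unauthenticated invocations of the inference service.
    # Bind specific identities instead of allUsers to keep the endpoint private.
    public_invoker = gcp.cloudrun.IamMember(
        'public-invoker',
        service=service_name,
        location=location,
        project=project,
        role='roles/run.invoker',
        member='allUsers',
    )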

    After running pulumi up with this program, the output will include the service URL. You can then send HTTP requests to this URL to perform inference using your machine learning model.
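
    For example, a client could call the deployed endpoint with a small script like the one below. The /predict path and the JSON shape match the hypothetical inference server sketched earlier and are assumptions about your own container's contract.

    import requests

    # Replace with the service_url value exported by pulumi up.
    service_url = 'https://my-inference-service-xxxxxxxx-uc.a.run.app'

    response = requests.post(
        f'{service_url}/predict',
        json={'instances': [[5.1, 3.5, 1.4, 0.2]]},
        timeout=30,
    )
    response.raise_for_status()
    print(response.json())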