1. Serverless Model Inference Endpoints with GCP Cloud Functions

    Serverless computing allows you to build and run applications without thinking about servers. With serverless applications, you can build backend services that automatically scale with the number of requests and only incur costs when the code is running.

    Google Cloud Functions is a serverless execution environment on Google Cloud Platform (GCP). It runs your code in response to events such as HTTP requests or Pub/Sub messages published by other GCP services. This makes it an excellent platform for deploying machine learning model inference endpoints that you can invoke via HTTP requests.

    In this guide, we'll create a Google Cloud Function that serves as a serverless model inference endpoint. We will define the Cloud Function resource using Pulumi with Python, and you'll see how Pulumi handles uploading the code as well as defining the necessary trigger.

    Here's an overview of the steps we're going to take:

    1. Create a Cloud Storage bucket and upload the function's source code to it as a zip archive.
    2. Define the Google Cloud Function that will serve model inference requests.
    3. Set up an HTTP trigger so the function can be invoked via HTTP requests.

    Let's start with the Pulumi program in Python:

    import pulumi
    import pulumi_gcp as gcp

    # Define the Cloud Storage bucket where the function's source code will be stored.
    bucket = gcp.storage.Bucket(
        'model-inference-bucket',
        location='US',  # Buckets require a location; pick one near your users.
    )

    # Upload the source code to the Cloud Storage bucket.
    # pulumi.FileArchive zips the directory automatically.
    source_archive_object = gcp.storage.BucketObject(
        'source-code',
        bucket=bucket.name,
        source=pulumi.FileArchive('./source'),  # Directory with the function's source code.
    )

    # Define the Cloud Function resource.
    # Replace 'ENTRY_POINT' with the name of the function in your code that handles inference.
    cloud_function = gcp.cloudfunctions.Function(
        'model-inference-endpoint',
        source_archive_bucket=bucket.name,
        source_archive_object=source_archive_object.name,
        runtime='python39',  # Specify the runtime that matches your function code.
        entry_point='ENTRY_POINT',  # Name of the model inference function in your code.
        trigger_http=True,  # Enable the HTTP trigger.
        available_memory_mb=128,  # Adjust memory based on your function's requirements.
        # Additional configuration such as environment variables, VPC settings, etc. can be added here.
    )

    # Export the function's endpoint URL so it can be accessed.
    pulumi.export('function_url', cloud_function.https_trigger_url)

    In the above program:

    • We create a Cloud Storage bucket to store the function's code. This bucket holds a zipped archive of the Python code that implements the model inference logic.
    • We upload the function's source code from the local directory ./source to the Cloud Storage bucket.
    • We define the Cloud Function with the relevant configuration and specify HTTP as the trigger, which allows us to invoke the function via HTTP requests.
    • The runtime is set to Python 3.9 (python39); your function's code must be compatible with this version.
    • Replace ENTRY_POINT with the name of the function in your source code that handles inference requests; a minimal sketch of such a handler follows after this list.
    • We export the https_trigger_url so that you can call the function once it's deployed.
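
    For reference, here's a minimal sketch of what that entry point could look like. It assumes a scikit-learn model serialized with joblib as model.pkl and shipped alongside the code; the function name predict and the instances request field are illustrative choices for this sketch, not requirements of the platform.

    # ./source/main.py -- illustrative sketch; the file, model, and field names are assumptions.
    import joblib

    # Load the model once at import time so warm instances can reuse it across requests.
    model = joblib.load('model.pkl')

    def predict(request):
        """HTTP handler: expects a JSON body like {"instances": [[...], ...]}."""
        payload = request.get_json(silent=True)
        if not payload or 'instances' not in payload:
            return ({'error': 'request body must contain "instances"'}, 400)
        predictions = model.predict(payload['instances']).tolist()
        return {'predictions': predictions}

    With this layout, you would set entry_point='predict' in the Pulumi program above.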

    Make sure the ./source directory contains your main function code along with a requirements.txt that lists its dependencies; Pulumi's FileArchive takes care of zipping the directory.
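
    For the sketch above, the deployment directory might be laid out like this (file names are illustrative):

    ./source/
    ├── main.py            # Defines the predict() entry point.
    ├── model.pkl          # Serialized model artifact.
    └── requirements.txt   # e.g. scikit-learn and joblib, pinned to the versions used in training.

    The Python runtime installs the packages listed in requirements.txt when the function is deployed, so dependencies don't need to be vendored into the archive.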

    Once deployed, you can send an HTTP request to the exported URL, which triggers your Cloud Function to run the model inference logic and respond with the result.
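
    For example, assuming the predict() handler sketched above, a call could look like this; the URL below is a placeholder for the actual function_url output of your stack:

    import requests

    # Retrieve the real URL with `pulumi stack output function_url`.
    url = 'https://REGION-PROJECT.cloudfunctions.net/model-inference-endpoint'

    response = requests.post(url, json={'instances': [[5.1, 3.5, 1.4, 0.2]]})
    print(response.status_code, response.json())

    Note that newly created HTTP functions require authenticated invocations by default; to allow public access, you would additionally grant the Cloud Functions Invoker role (for example, with a gcp.cloudfunctions.FunctionIamMember resource for the member allUsers).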