Serverless Inference Services for LLMs using Cloud Run
To create serverless inference services for Large Language Models (LLMs) using Google Cloud Run, we will need to:
- Package the inference code along with its dependencies into a container.
- Push the container image to Google Container Registry (GCR) or another container registry that Cloud Run can access.
- Deploy the container to Cloud Run to handle HTTP requests.
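As a concrete sketch of the first step, a minimal inference handler for the container might look like the following. The handler and its `main.py` filename are hypothetical; the model call is stubbed out, and in practice you would load your LLM (for example, a `transformers` pipeline) at startup:

```python
# main.py - a minimal, hypothetical inference handler for the container.
import json


def generate_text(prompt: str) -> str:
    # Stub for the real LLM call; replace with your model's inference code.
    return f"Echo: {prompt}"


def app(environ, start_response):
    # A tiny WSGI app: accepts {"prompt": ...} as JSON and returns a completion.
    try:
        length = int(environ.get("CONTENT_LENGTH") or 0)
        body = json.loads(environ["wsgi.input"].read(length) or b"{}")
        payload = {"completion": generate_text(body.get("prompt", ""))}
        start_response("200 OK", [("Content-Type", "application/json")])
        return [json.dumps(payload).encode()]
    except Exception as exc:
        start_response("500 Internal Server Error", [("Content-Type", "text/plain")])
        return [str(exc).encode()]


# In the container, serve `app` on the port Cloud Run provides, e.g.:
#   gunicorn --bind :$PORT main:app
```

Any WSGI/ASGI server works here; the only contract Cloud Run imposes is that the container listens for HTTP on the port given by the `PORT` environment variable.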
Below is a Pulumi program that outlines these steps in Python. It assumes you already have the inference service's code ready and Docker installed to build the container image. Here's the general process we'll follow in the code:
- Use Pulumi's Docker provider to build and push the image to GCR.
- Define a Cloud Run service using the pushed image.
- Configure IAM permissions if necessary, to control access to the Cloud Run service.
- Export the URL of the deployed service.
```python
import pulumi
import pulumi_docker as docker
import pulumi_gcp as gcp

config = pulumi.Config()

# 1. Build and push the image to Google Container Registry.
# Replace `./app` with the path to your application's directory
# that contains the Dockerfile.
app_image = docker.Image(
    "app-image",
    image_name=f"gcr.io/{gcp.config.project}/app-image",
    build=docker.DockerBuild(context="./app"),
    registry=docker.ImageRegistry(
        server="gcr.io",
        username="_json_key",
        # Here we assume your GCP credentials are stored in JSON format
        # and Pulumi is configured to use them.
        password=config.require_secret("gcp-key-json"),
    ),
)

# 2. Deploy the container to Cloud Run.
cloud_run_service = gcp.cloudrun.Service(
    "cloud-run-service",
    location="us-central1",  # Choose the appropriate region for your service.
    template=gcp.cloudrun.ServiceTemplateArgs(
        spec=gcp.cloudrun.ServiceTemplateSpecArgs(
            containers=[
                gcp.cloudrun.ServiceTemplateSpecContainerArgs(
                    image=app_image.image_name,
                )
            ],
            # You may cap the number of concurrent requests per container here
            # to balance per-request resources against instance count.
            container_concurrency=80,
        ),
    ),
)

# Allow unauthenticated access to the Cloud Run service.
iam_member = gcp.cloudrun.IamMember(
    "cloud-run-iam-member",
    location=cloud_run_service.location,
    project=cloud_run_service.project,
    service=cloud_run_service.name,
    role="roles/run.invoker",
    member="allUsers",
)

# 3. Export the URL of the deployed service.
pulumi.export("service_url", cloud_run_service.statuses.apply(lambda s: s[0].url))
```
This code first builds a Docker image from a directory containing a `Dockerfile` and your application code. It tags the image appropriately for Google Container Registry (GCR) and pushes it there. Make sure the Container Registry API is enabled in your GCP project.

Next, it creates a Cloud Run service referencing the Docker image we pushed to GCR. The `container_concurrency` parameter is optional and can be tuned to the expected request load: a lower value spreads requests across more container instances, giving each request more resources, though scaling out more aggressively can also trigger more cold starts.

Finally, we grant public access to the Cloud Run service using the role `roles/run.invoker` and the special member `allUsers`, which means anyone can invoke the service. If you want to restrict access to certain authenticated users, specify those users or service accounts instead of `allUsers`.

Remember to replace `"us-central1"` with the region where you want your Cloud Run service deployed. The exported `service_url` is the address where your service can be reached once it's deployed.

To authenticate with Google Cloud correctly, ensure your environment is set up with the proper credentials and that Pulumi is configured to use your GCP project, usually by setting the `gcp:project`, `gcp:region`, and `gcp:zone` configuration variables. The `gcp-key-json` configuration value should contain the contents of your service account key file in JSON format.
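For example, to restrict invocation to a single service account rather than the public, the IAM binding could be declared like this. The account name is hypothetical, and `cloud_run_service` refers to the resource defined in the program above; this fragment only runs inside a Pulumi program:

```python
import pulumi_gcp as gcp

# Hypothetical: grant invoke access to one service account instead of allUsers.
restricted_invoker = gcp.cloudrun.IamMember(
    "cloud-run-restricted-invoker",
    location=cloud_run_service.location,
    project=cloud_run_service.project,
    service=cloud_run_service.name,
    role="roles/run.invoker",
    member="serviceAccount:invoker@my-project.iam.gserviceaccount.com",
)
```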
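Once deployed, the service can be called over plain HTTPS. A small helper for building such a request might look like this; the URL and the `{"prompt": ...}` payload shape are illustrative and assume your inference code accepts JSON in that form:

```python
import json
import urllib.request


def build_inference_request(service_url: str, prompt: str) -> urllib.request.Request:
    # Build a POST request carrying the prompt as JSON.
    return urllib.request.Request(
        service_url,
        data=json.dumps({"prompt": prompt}).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# To actually call the deployed service:
#   with urllib.request.urlopen(build_inference_request(url, "Hello")) as resp:
#       print(json.loads(resp.read()))
```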