Load-Balanced TensorFlow Serving with GCP BackendService

Pulumi · Accepted Answer

To create a load-balanced TensorFlow Serving setup on Google Cloud Platform (GCP), you will need multiple components working together. Here's how they interact:

1. **TensorFlow Model Servers**: These servers will run the TensorFlow Serving application, which serves your machine learning models.

2. **Instance Groups**: The TensorFlow model servers will be grouped into managed instance groups for scalability and reliability.

3. **Backend Service**: This will manage traffic distribution to the instance groups, and includes health checks to ensure traffic is only sent to healthy instances.

4. **URL Map and HTTP(S) Proxy**: The URL map selects a backend service based on the host and path of the request, and the target HTTP(S) proxy uses that map to route each incoming request.

5. **Global Forwarding Rule and External IP**: The entry point for all incoming traffic, which routes requests to the HTTP(S) proxy.
   
I'll walk you through a basic example of setting up these components using the Pulumi SDK for GCP.

**Note**: Before you begin, make sure the Google Cloud SDK and the Pulumi CLI are installed and configured with credentials for your GCP project.

The Pulumi program below will use the following resources:

- `gcp.compute.InstanceTemplate` and `gcp.compute.InstanceGroupManager`: To define the TensorFlow Serving VMs and run them as a managed instance group.
- `gcp.compute.HealthCheck` and `gcp.compute.BackendService`: To check instance health and distribute traffic to healthy instances.
- `gcp.compute.URLMap`, `gcp.compute.TargetHttpProxy`, and `gcp.compute.GlobalForwardingRule`: To manage traffic routing.
- `gcp.compute.GlobalAddress`: To reserve the external IP address. A global forwarding rule needs a global address; a regional `gcp.compute.Address` cannot be attached to it.

```python
import pulumi
import pulumi_gcp as gcp

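# Create a health check to verify that the instances accept connections
# on TensorFlow Serving's REST port
health_check = gcp.compute.HealthCheck("tf-serving-health-check",
    check_interval_sec=5,
    timeout_sec=5,
    tcp_health_check=gcp.compute.HealthCheckTcpHealthCheckArgs(port=8501),
    description="Health check for TensorFlow Serving instances")

# Allow Google's load balancer and health-check probes to reach port 8501.
# Without this rule the probes are blocked and every instance is reported unhealthy.
firewall = gcp.compute.Firewall("tf-serving-allow-lb",
    network="default",
    allows=[gcp.compute.FirewallAllowArgs(protocol="tcp", ports=["8501"])],
    source_ranges=["130.211.0.0/22", "35.191.0.0/16"],
    target_tags=["tf-serving"])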

# Define the instance template for the TensorFlow Serving servers
instance_template = gcp.compute.InstanceTemplate("tf-serving-template",
    description="Template for TensorFlow Serving instances",
    machine_type="n1-standard-1",
    tags=["tf-serving"],
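    # `disks` and `network_interfaces` are plural and take lists.
    disks=[gcp.compute.InstanceTemplateDiskArgs(
        source_image="debian-cloud/debian-11",
        auto_delete=True,
        boot=True,
    )],
    network_interfaces=[gcp.compute.InstanceTemplateNetworkInterfaceArgs(
        network="default",
        # An empty access config assigns an ephemeral external IP so the
        # startup script can reach the apt repositories.
        access_configs=[gcp.compute.InstanceTemplateNetworkInterfaceAccessConfigArgs()],
    )],
    # Read-only Cloud Storage scope so the server can load the model from gs://.
    service_account=gcp.compute.InstanceTemplateServiceAccountArgs(
        scopes=["https://www.googleapis.com/auth/devstorage.read_only"],
    ),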
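    # Startup script: register the TensorFlow Serving apt repository, install the
    # server, and start it. Startup scripts run as root, so sudo is not needed.
    # The shebang must be the very first characters of the script.
    metadata_startup_script="""#!/bin/bash
        apt-get update
        apt-get install -y curl gnupg
        echo "deb [arch=amd64] http://storage.googleapis.com/tensorflow-serving-apt stable tensorflow-model-server tensorflow-model-server-universal" > /etc/apt/sources.list.d/tensorflow-serving.list
        curl -fsSL https://storage.googleapis.com/tensorflow-serving-apt/tensorflow-serving.release.pub.gpg | apt-key add -
        apt-get update
        apt-get install -y tensorflow-model-server
        tensorflow_model_server --port=8500 --rest_api_port=8501 --model_name=your_model --model_base_path=gs://your_model_bucket/
    """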
)

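# Create a managed instance group from the template. The named port maps the
# backend service's port_name ("http") to TensorFlow Serving's REST port.
instance_group_manager = gcp.compute.InstanceGroupManager("tf-serving-group",
    base_instance_name="tf-serving-instance",
    versions=[gcp.compute.InstanceGroupManagerVersionArgs(
        instance_template=instance_template.id,
    )],
    named_ports=[gcp.compute.InstanceGroupManagerNamedPortArgs(
        name="http",
        port=8501,
    )],
    target_size=2,
    zone="us-central1-f")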

# Create a backend service that sends traffic to the instance group and uses
# the health check. pulumi_gcp takes a single health check here, not a list.
backend_service = gcp.compute.BackendService("tf-serving-backend",
    backends=[gcp.compute.BackendServiceBackendArgs(
        group=instance_group_manager.instance_group,
    )],
    health_checks=health_check.id,
    port_name="http",
    protocol="HTTP",
    timeout_sec=10)

# Reserve a static external IP address. A global forwarding rule requires a
# global address, not a regional gcp.compute.Address.
address = gcp.compute.GlobalAddress("tf-serving-address")

# Create a URL map to route incoming requests to the backend service
url_map = gcp.compute.URLMap("tf-serving-url-map",
    default_service=backend_service.id)

# Setup an HTTP proxy to route requests from the global forwarding rule to URL map
http_proxy = gcp.compute.TargetHttpProxy("tf-serving-http-proxy",
    url_map=url_map.id)

# Create a global forwarding rule that routes incoming traffic to the HTTP proxy
global_forwarding_rule = gcp.compute.GlobalForwardingRule("tf-serving-forward-rule",
    ip_protocol="TCP",
    port_range="80",
    target=http_proxy.id,
    ip_address=address.address)

# Export the load balancer's external IP address
pulumi.export('tf_serving_ip', address.address)
```

Here's what each part of the code is doing:

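- A health check (`gcp.compute.HealthCheck`) verifies that the TensorFlow Serving instances are up. It uses a TCP check on port 8501, the port the startup script assigns to TensorFlow Serving's REST API. A TCP check only confirms that the port accepts connections, not that a model has loaded. Google's probes originate from `130.211.0.0/22` and `35.191.0.0/16`, so a firewall rule must allow those ranges to reach port 8501.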

- An instance template (`gcp.compute.InstanceTemplate`) is created with a startup script that installs TensorFlow Serving and runs a model server. Replace `your_model` and `gs://your_model_bucket/` with your model's name and the Cloud Storage path that holds its numbered version directories.

- An instance group manager (`gcp.compute.InstanceGroupManager`) uses that template to create a group of two instances (`target_size=2`) in `us-central1-f`, across which model serving requests are load balanced.

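- A backend service (`gcp.compute.BackendService`) connects the instance group with the health check so that traffic is balanced across healthy instances only. Its `port_name="http"` refers to a named port on the instance group, which must map to 8501 for requests to reach the REST API.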

- A reserved global IP address (`gcp.compute.GlobalAddress`) gives the load balancer a stable endpoint. A regional `gcp.compute.Address` cannot be attached to a global forwarding rule.

- A URL map (`gcp.compute.URLMap`) and HTTP proxy (`gcp.compute.TargetHttpProxy`) are created to direct traffic hitting the IP address to the backend service.

- Finally, a global forwarding rule (`gcp.compute.GlobalForwardingRule`) is created to forward traffic from the reserved IP address to the HTTP proxy.
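To make the health-check behavior concrete, here is a minimal sketch of what a TCP check does: it tries to open a connection and treats success as "healthy". It is not Google's prober, only an approximation using the Python standard library. The function name `tcp_probe` and the target address are hypothetical.

```python
import socket


def tcp_probe(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port opens within `timeout` seconds."""
    try:
        # A successful connect is all a TCP health check requires;
        # nothing is sent and no response is read.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    # Hypothetical target: replace with an instance address reachable from
    # where you run this. 8501 matches --rest_api_port in the startup script.
    print(tcp_probe("127.0.0.1", 8501, timeout=1.0))
```

Because the probe passes as soon as the port is open, it can report an instance as healthy before the model has finished loading. If that matters for your deployment, an HTTP check against TensorFlow Serving's model status path (`/v1/models/your_model`) is a stricter alternative.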

The `pulumi.export` at the end will output the IP address of your load balancer so that you know where to send requests for your model.
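Once the stack is up, a request to the exported address might look like the sketch below. It assumes the load balancer listens on port 80 and that the model is served under the name `your_model` through TensorFlow Serving's REST predict endpoint (`/v1/models/<name>:predict`). The helper names, the example IP, and the input shape are hypothetical; use whatever your model's signature expects.

```python
import json
import urllib.request


def build_predict_request(host: str, model_name: str, instances: list) -> urllib.request.Request:
    """Build a POST request for TensorFlow Serving's REST predict endpoint."""
    url = f"http://{host}/v1/models/{model_name}:predict"
    body = json.dumps({"instances": instances}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict(host: str, model_name: str, instances: list, timeout: float = 10.0):
    """Send the request and return the "predictions" field of the JSON response."""
    request = build_predict_request(host, model_name, instances)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)["predictions"]


# Example call (performs a network request, so it is left commented out).
# The IP is a placeholder for the value of `pulumi stack output tf_serving_ip`.
# predict("203.0.113.10", "your_model", [[1.0, 2.0, 3.0]])
```

A new load balancer can take a few minutes to start passing traffic, so early requests may fail even when the program has deployed successfully.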

Keep in mind that a real-world setup needs more than this program provides, such as securing communication with HTTPS and setting up a domain name. The program is intended primarily to demonstrate orchestrating the infrastructure with Pulumi.