1. Autoscaling ML Model Inference Servers on GCP


    To create an autoscaling ML model inference server on Google Cloud Platform (GCP) using Pulumi, we need to use a combination of GCP's Machine Learning (ML) and Compute resources.

    1. ML Model: First, we define an ML model using the gcp.ml.EngineModel resource. This will create a new model in Google Cloud Machine Learning Engine, which can be used for serving online predictions.

    2. Compute Instance Template: We define a Compute Engine instance template using the gcp.compute.InstanceTemplate resource. This template specifies the configuration of the instances that will serve the model, including the machine type, disk, and any startup scripts needed to install the inference server.

    3. Instance Group Manager: We utilize the gcp.compute.InstanceGroupManager to create and manage a group of identical instances based on the previously defined instance template. This manager can also be configured to automatically heal instances, replacing them if they become unhealthy (a minimal health-check sketch follows this list).

    4. Autoscaler: An autoscaler is created using the gcp.compute.Autoscaler resource, which automatically scales the number of instances in the managed instance group based on the defined utilization policy.
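    As mentioned in step 3, auto-healing needs a health check to decide when an instance is unhealthy. The sketch below is one way to wire this up, assuming a hypothetical inference server that answers HTTP health probes on port 8080 at /health; the port, path, and timing values are assumptions to adapt for your server.

    # A minimal auto-healing sketch; the port, path, and timing values are assumptions.
    import pulumi_gcp as gcp

    health_check = gcp.compute.HealthCheck("inference-health-check",
        check_interval_sec=10,
        timeout_sec=5,
        http_health_check={
            "port": 8080,               # port the inference server listens on (assumed)
            "request_path": "/health",  # health endpoint exposed by the server (assumed)
        },
    )

    # Attach it to the InstanceGroupManager defined in the program below by adding:
    #   auto_healing_policies={
    #       "health_check": health_check.id,
    #       "initial_delay_sec": 300,  # give the startup script time to finish (assumed)
    #   },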

    Below is a Pulumi program in Python that implements an autoscaling ML model inference server setup. This program assumes that you have a machine learning model ready to be served and the necessary scripts to install your ML inference server upon instance startup.

    import pulumi
    import pulumi_gcp as gcp

    # Replace the placeholders with the actual values for your resources
    project = "your-gcp-project"
    model_name = "your-ml-model-name"
    instance_template_name = "ml-model-server-template"
    instance_group_manager_name = "ml-model-server-group"
    autoscaler_name = "ml-model-server-autoscaler"
    region = "us-central1"

    # Create the Machine Learning Engine Model
    ml_model = gcp.ml.EngineModel("ml-engine-model",
        project=project,
        name=model_name,
        description="ML Model for online predictions",
        regions=[region],
        # Define other properties of your ML model here, if necessary
    )

    # Read the startup script that installs and runs the ML inference server
    # (metadata_startup_script expects the script contents, not a file path)
    with open("startup-script.sh") as f:
        startup_script = f.read()

    # Define the Compute Engine Instance Template for inference servers
    instance_template = gcp.compute.InstanceTemplate("instance-template",
        project=project,
        name=instance_template_name,
        machine_type="n1-standard-1",
        disks=[{
            "boot": True,
            "autoDelete": True,
            "sourceImage": "projects/debian-cloud/global/images/family/debian-11",
            # Specify any additional disk configuration here
        }],
        network_interfaces=[{
            "network": "default",
            # Omitting accessConfigs defaults to egress-only internet access
        }],
        # Startup script that installs and runs the ML inference server
        metadata_startup_script=startup_script,
    )

    # Create an Instance Group Manager for managing the group of inference instances
    instance_group_manager = gcp.compute.InstanceGroupManager("instance-group-manager",
        project=project,
        name=instance_group_manager_name,
        base_instance_name="inference-instance",
        versions=[{
            "instance_template": instance_template.self_link,
        }],
        target_size=1,  # Start with 1 instance and let the autoscaler scale as necessary
        zone=region + "-a",  # Specify the appropriate zone for your use case
        # Define other properties such as auto-healing policies if needed
    )

    # Set up an Autoscaler to automatically adjust the number of instances
    autoscaler = gcp.compute.Autoscaler("autoscaler",
        project=project,
        name=autoscaler_name,
        target=instance_group_manager.self_link,
        autoscaling_policy={
            "min_replicas": 1,
            "max_replicas": 5,  # Adjust max replicas based on your needs
            "cpu_utilization": {
                "target": 0.6,  # Target average CPU utilization as a fraction; scale out above this level
            },
            "cool_down_period": 90,  # Seconds to wait after a new instance starts before evaluating metrics
        },
        zone=region + "-a",
    )

    # Export the ID of the created model so it can be referenced later
    pulumi.export("ml_model_id", ml_model.id)

    This program sets up your infrastructure for autoscaling ML model inference servers. It starts with a single instance and scales out to more instances as CPU usage increases, helping to manage cost and performance.
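    To build intuition for how the CPU-based policy behaves, here is a back-of-the-envelope sketch of the sizing decision. The function name and numbers are illustrative assumptions only; GCP's actual algorithm also applies the cool-down period and scale-in stabilization.

    import math

    # Rough intuition only (not GCP's exact algorithm): the autoscaler grows the group so that
    # the average CPU utilization drops back toward the configured target.
    def approx_recommended_size(current_instances: int, avg_cpu: float, target: float = 0.6) -> int:
        return max(1, math.ceil(current_instances * avg_cpu / target))

    print(approx_recommended_size(2, 0.9))  # -> 3: two instances at 90% CPU suggest adding a third
    print(approx_recommended_size(4, 0.3))  # -> 2: sustained low usage lets the group shrink (down to min_replicas)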

    Things to Note:

    • Remember to replace placeholder variables (project, model_name, etc.) with the corresponding values for your GCP resources.
    • The program reads startup-script.sh from the local directory and passes its contents via metadata_startup_script, which expects the script text itself rather than a file path. This script is responsible for installing the necessary software and starting the inference server. To host the script in Google Cloud Storage instead, reference it with the startup-script-url metadata key (see the sketch after this list).
    • The autoscaler resource's cpu_utilization policy determines when to scale out or in. Set the target attribute to the average CPU utilization, expressed as a fraction (0.6 means 60%), that the autoscaler should maintain across the group.
    • Adjust min_replicas and max_replicas, which bound the number of instances the autoscaler can manage, according to your specific needs.
    • Resource names must be unique within a project and comply with GCP's naming conventions.
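    Following on the startup-script note above, here is a minimal sketch of referencing a script hosted in Cloud Storage via the standard startup-script-url metadata key instead of inlining its contents. It reuses the placeholder variables from the program above; the bucket and object names are placeholders.

    # Reference a startup script hosted in Cloud Storage rather than inlining its contents;
    # the bucket and object names are placeholders.
    instance_template = gcp.compute.InstanceTemplate("instance-template",
        project=project,
        name=instance_template_name,
        machine_type="n1-standard-1",
        disks=[{
            "boot": True,
            "autoDelete": True,
            "sourceImage": "projects/debian-cloud/global/images/family/debian-11",
        }],
        network_interfaces=[{"network": "default"}],
        metadata={"startup-script-url": "gs://your-bucket/startup-script.sh"},
    )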

    Running this program with Pulumi provisions these resources on GCP in accordance with your specifications, giving you an autoscaling environment for your ML model inference workload.