1. Autoscaling Inference Clusters for LLMs with GCP


    To autoscale inference clusters for large language models (LLMs) on Google Cloud Platform (GCP), you'd typically combine Google Kubernetes Engine (GKE) for container orchestration, Compute Engine's Instance Group Manager for managing groups of instances, and an Autoscaler to automatically adjust the number of virtual machine instances based on load.
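    If you go the GKE route instead, a node pool with autoscaling enabled plays the role that a managed instance group plus an Autoscaler play on Compute Engine. The sketch below is a minimal, standalone example; the cluster name, zone, machine type, and node counts are placeholder assumptions rather than values from the program later in this section.

    import pulumi
    import pulumi_gcp as gcp

    # A minimal GKE cluster; the default node pool is removed so we can manage our own.
    cluster = gcp.container.Cluster("inference-cluster",
        location="us-central1-a",          # Placeholder zone
        remove_default_node_pool=True,
        initial_node_count=1,
    )

    # A node pool with cluster autoscaling enabled; GKE adds or removes nodes
    # between min_node_count and max_node_count based on pending pods.
    node_pool = gcp.container.NodePool("inference-node-pool",
        cluster=cluster.name,
        location="us-central1-a",
        autoscaling=gcp.container.NodePoolAutoscalingArgs(
            min_node_count=1,
            max_node_count=5,              # Placeholder upper bound
        ),
        node_config=gcp.container.NodePoolNodeConfigArgs(
            machine_type="n1-standard-4",  # Placeholder machine type
            oauth_scopes=["https://www.googleapis.com/auth/cloud-platform"],
        ),
    )

    pulumi.export("cluster_name", cluster.name)

    The rest of this section sticks with the Compute Engine approach.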

    To set up autoscaling with Compute Engine, the Instance Group Manager creates a group of virtual machines that run the containers for the LLMs. By attaching an Autoscaler to the group, the number of instances scales up or down automatically based on criteria such as CPU utilization or custom metrics.

    Below is a Pulumi program written in Python that demonstrates how to create an instance template, a managed instance group, and an autoscaler that scales based on CPU usage.

    import pulumi
    import pulumi_gcp as gcp

    # Configuration values for the instance group and autoscaler
    project = 'my-gcp-project'      # Replace with your GCP project ID
    zone = 'us-central1-a'          # Replace with your desired GCP zone
    machine_type = 'n1-standard-4'  # Adjust as necessary
    image_family = 'cos-stable'     # Choose the relevant image family for your use case
    image_project = 'cos-cloud'     # The project of the image family

    # Create an instance template that defines the VMs running the inference containers
    instance_template = gcp.compute.InstanceTemplate("inference-instance-template",
        project=project,
        description="Instance template for inference cluster",
        machine_type=machine_type,
        disks=[gcp.compute.InstanceTemplateDiskArgs(
            source_image=f"https://www.googleapis.com/compute/v1/projects/{image_project}/global/images/family/{image_family}",
            auto_delete=True,
            boot=True,
        )],
        network_interfaces=[gcp.compute.InstanceTemplateNetworkInterfaceArgs(
            network="default",
        )],
    )

    # Create a managed instance group using the instance template
    instance_group_manager = gcp.compute.InstanceGroupManager("inference-instance-group-manager",
        project=project,
        zone=zone,
        base_instance_name="inference-instance",
        versions=[gcp.compute.InstanceGroupManagerVersionArgs(
            instance_template=instance_template.id,
        )],
        target_size=1,  # Start with 1 instance; the autoscaler adjusts this as needed
    )

    # Attach an autoscaler to the managed instance group
    autoscaler = gcp.compute.Autoscaler("inference-autoscaler",
        project=project,
        zone=zone,
        target=instance_group_manager.self_link,
        autoscaling_policy=gcp.compute.AutoscalerAutoscalingPolicyArgs(
            min_replicas=1,
            max_replicas=10,     # Limit the number of replicas to 10
            cpu_utilization=gcp.compute.AutoscalerAutoscalingPolicyCpuUtilizationArgs(
                target=0.8,      # Target 80% CPU utilization for scaling
            ),
            cooldown_period=60,  # Cooldown period in seconds after a scale event
        ),
    )

    # Export the URL of the managed instance group
    pulumi.export("instance_group_manager_url", instance_group_manager.instance_group)

    # Export the URL of the autoscaler
    pulumi.export("autoscaler_url", autoscaler.self_link)

    In this program:

    • We create an InstanceTemplate that will define the properties of the VM instances, such as machine type and disk image.
    • We then create an InstanceGroupManager to manage a group of instances created from the instance template. The target_size is initially set to 1, meaning it will start with one instance.
    • We attach an Autoscaler to the Instance Group Manager that is configured to scale on CPU usage: it adds instances when average CPU utilization exceeds 80%, up to a maximum of 10 instances to keep costs under control.
    • We then export the URLs of the managed Instance Group and the Autoscaler so that they can be easily accessed from the Pulumi dashboard or via the Pulumi CLI.
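    Besides the dashboard and CLI, exported values can also be consumed from another Pulumi stack with a StackReference. Here is a minimal sketch, assuming a hypothetical stack path of my-org/inference-infra/prod:

    import pulumi

    # Reference the stack that created the autoscaler (the path is a placeholder:
    # "<org>/<project>/<stack>").
    infra = pulumi.StackReference("my-org/inference-infra/prod")

    # Outputs come back as pulumi.Output values and can feed other resources.
    autoscaler_url = infra.get_output("autoscaler_url")
    pulumi.export("upstream_autoscaler_url", autoscaler_url)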

    This is a basic setup; you may need to adjust it to your workload, for example by using GPU instances for inference, adding Cloud Monitoring and/or Logging for detailed metrics, or scaling on a custom metric if your LLM inference load doesn't correspond directly to CPU utilization, as sketched below.
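    If you do switch to GPUs or a custom metric, the changes are confined to the instance template and the autoscaling policy. The following sketch is illustrative only: the accelerator type (nvidia-tesla-t4) and the custom metric name are assumptions, not part of the program above, and the metric would have to be published to Cloud Monitoring by your own serving code.

    import pulumi_gcp as gcp

    # Variant of the instance template with an attached GPU (accelerator type is a placeholder
    # and must be available in your chosen zone).
    gpu_instance_template = gcp.compute.InstanceTemplate("inference-gpu-template",
        machine_type="n1-standard-4",
        disks=[gcp.compute.InstanceTemplateDiskArgs(
            source_image="projects/cos-cloud/global/images/family/cos-stable",
            auto_delete=True,
            boot=True,
        )],
        network_interfaces=[gcp.compute.InstanceTemplateNetworkInterfaceArgs(network="default")],
        guest_accelerators=[gcp.compute.InstanceTemplateGuestAcceleratorArgs(
            type="nvidia-tesla-t4",  # Placeholder GPU type
            count=1,
        )],
        scheduling=gcp.compute.InstanceTemplateSchedulingArgs(
            on_host_maintenance="TERMINATE",  # Required for instances with GPUs
        ),
    )

    # Autoscaling policy driven by a custom Cloud Monitoring metric instead of CPU.
    # Pass this as autoscaling_policy= on the Autoscaler shown earlier.
    custom_metric_policy = gcp.compute.AutoscalerAutoscalingPolicyArgs(
        min_replicas=1,
        max_replicas=10,
        metrics=[gcp.compute.AutoscalerAutoscalingPolicyMetricArgs(
            name="custom.googleapis.com/inference/queue_depth",  # Hypothetical metric name
            target=30,      # Desired metric value per instance
            type="GAUGE",
        )],
        cooldown_period=120,
    )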

    Remember to replace 'my-gcp-project' with your Google Cloud project ID and 'us-central1-a' with the zone where you'd like to deploy your resources. Adjust the image family and project to those that best fit your LLM's operating system and environment requirements.