Auto-scaling Inference Services with GKE for LLMs

Pulumi · Accepted Answer

Auto-scaling an inference service for Large Language Models (LLMs) on Google Kubernetes Engine (GKE) means deploying the service to a GKE cluster and configuring it to scale up or down with workload demand. Two mechanisms do the work: the Kubernetes Horizontal Pod Autoscaler (HPA) changes the number of service replicas, and the GKE cluster autoscaler changes the number of nodes available to run them.

Here is how the process generally works:

1. **GKE Cluster**: First, we create a Kubernetes cluster in GKE, which is a set of nodes that run containerized applications. This cluster is the foundation on which our inference services run.

2. **Node Pools**: Within the GKE cluster, we define node pools, which are groups of nodes that share the same configuration. Each pool can use a machine type and other settings tailored to the requirements of LLMs.

3. **Deployment**: On the GKE cluster, we then deploy the inference service as a Kubernetes Deployment. It defines the desired state, including which container image to run and how many replicas (instances of the service).

4. **Horizontal Pod Autoscaler (HPA)**: To automatically adjust the number of running replicas of a containerized application (the inference service), we employ an HPA. The HPA scales the number of replicas up or down based on metrics like CPU utilization or custom metrics provided by the application.

5. **Cluster Autoscaler**: Scaling the number of pods isn't enough if the cluster lacks the nodes to schedule them. We configure the cluster autoscaler to add nodes when pods cannot be scheduled for lack of resources and to remove nodes that are underutilized. Its decisions are based on the resource requests of pods, not on their actual usage.
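The replica calculation in step 4 can be approximated in a few lines. This is a minimal sketch of the single-metric formula described in the Kubernetes HPA documentation (desired replicas = ceil(current replicas × current metric / target metric), with a default tolerance of 10%). The function name and the `min_replicas` / `max_replicas` defaults are illustrative assumptions; the real controller also handles multiple metrics, pod readiness, and stabilization windows.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=10, tolerance=0.1):
    """Approximate the HPA replica calculation for a single metric."""
    ratio = current_metric / target_metric
    # Close enough to the target: leave the replica count unchanged.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    wanted = math.ceil(current_replicas * ratio)
    # Clamp to the bounds configured on the autoscaler.
    return max(min_replicas, min(max_replicas, wanted))


# 4 replicas averaging 90% CPU against a 60% target
print(desired_replicas(4, 90, 60))  # prints 6
```

If the two extra replicas cannot be scheduled on the existing nodes, they stay pending, and that is the signal the cluster autoscaler in step 5 reacts to.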

Let's create a simple Pulumi program that sets up a GKE cluster capable of auto-scaling for inference services. The following program will:

- Create a GKE cluster.
- Define a node pool for workloads (like LLMs) that may need more computational resources.
- Enable node autoscaling on that pool so the number of nodes follows demand.

```python
import pulumi
import pulumi_gcp as gcp

# Create a GKE Cluster
# Details: https://www.pulumi.com/registry/packages/gcp/api-docs/container/cluster/
cluster = gcp.container.Cluster("llm-inference-cluster",
    initial_node_count=3,
    min_master_version="latest",
    node_version="latest",
    node_config={
        "machine_type": "e2-highmem-4",  # Choose the right machine type for your LLM needs
        "oauth_scopes": [
            "https://www.googleapis.com/auth/cloud-platform",
        ],
    })

# Create a separate, autoscaled node pool for the LLM inference workload
# Details: https://www.pulumi.com/registry/packages/gcp/api-docs/container/nodepool/
node_pool = gcp.container.NodePool("inference-node-pool",
    cluster=cluster.name,
    initial_node_count=1,  # Starting size; node_count should not be used alongside autoscaling
    autoscaling={
        "min_node_count": 1,
        "max_node_count": 4,  # Adjust maximum number based on your expected workload
    },
    node_config={
        "machine_type": "n1-standard-4",  # Machines with good balance of memory and CPU
        "preemptible": True,  # Use preemptible VMs for cost savings (optional)
        "oauth_scopes": [
            "https://www.googleapis.com/auth/cloud-platform",
        ],
    })

# Export important attributes for the cluster and the node pool
pulumi.export("cluster_name", cluster.name)
pulumi.export("node_pool_name", node_pool.name)
pulumi.export("cluster_endpoint", cluster.endpoint)
# This is the cluster's base64-encoded CA certificate, not a complete kubeconfig
pulumi.export("cluster_ca_certificate", cluster.master_auth.apply(lambda auth: auth.cluster_ca_certificate))
```

In this program:
- We create a GKE cluster (`llm-inference-cluster`) with an initial node count.
- The cluster's default nodes use the `e2-highmem-4` machine type (4 vCPUs, 32 GB of memory), which suits memory-intensive tasks such as CPU-based LLM inference. The `cloud-platform` OAuth scope lets the nodes call any Google Cloud API that their service account's IAM roles allow.
- We then define a separate node pool (`inference-node-pool`) for the inference workload, which the cluster autoscaler can resize between 1 and 4 nodes.
- The node pool uses the `n1-standard-4` machine type, which balances memory and CPU. These nodes are preemptible: they cost less than regular instances, but Google Cloud can reclaim them at any time and they run for at most 24 hours, so the service must tolerate losing a node.
- We export various attributes of the cluster and the node pool which can be valuable for configuring CI/CD pipelines or for programmatic access to the cluster.
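
For programmatic access, the cluster name, endpoint, and CA certificate are the pieces a kubeconfig needs. The helper below is a hedged sketch of how they could be assembled. `build_kubeconfig` and the `gke_<name>` context naming are assumptions made for illustration, and the output is JSON (which kubectl accepts, since JSON is valid YAML). It assumes the `gke-gcloud-auth-plugin` binary is installed wherever kubectl runs.

```python
import json


def build_kubeconfig(name, endpoint, ca_certificate):
    """Return a kubeconfig (as JSON) for a GKE cluster.

    ca_certificate is expected to be the base64-encoded value GKE reports.
    """
    context = f"gke_{name}"  # hypothetical naming scheme
    config = {
        "apiVersion": "v1",
        "kind": "Config",
        "clusters": [{
            "name": context,
            "cluster": {
                "server": f"https://{endpoint}",
                "certificate-authority-data": ca_certificate,
            },
        }],
        "contexts": [{
            "name": context,
            "context": {"cluster": context, "user": context},
        }],
        "current-context": context,
        "users": [{
            "name": context,
            "user": {
                # Credentials are obtained at call time by the GKE auth plugin.
                "exec": {
                    "apiVersion": "client.authentication.k8s.io/v1beta1",
                    "command": "gke-gcloud-auth-plugin",
                    "provideClusterInfo": True,
                },
            },
        }],
    }
    return json.dumps(config, indent=2)


# Inside the Pulumi program this could be wired up roughly as:
# kubeconfig = pulumi.Output.all(cluster.name, cluster.endpoint, cluster.master_auth).apply(
#     lambda args: build_kubeconfig(args[0], args[1], args[2].cluster_ca_certificate))
print(build_kubeconfig("demo", "203.0.113.10", "BASE64_CA_PLACEHOLDER"))
```

Keeping the helper a plain function makes it easy to check outside of a Pulumi run; if the result is exported from the stack, wrapping it with `pulumi.Output.secret` is worth considering.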

Keep in mind that the actual inference service (as a Kubernetes Deployment) and the Horizontal Pod Autoscaler setup are not included in this example. Those would need to be defined according to the specifics of the application you are running, including setting up metrics to determine when to scale the service instances.
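
One possible shape for those two objects is sketched here as plain Kubernetes manifests built in Python. Everything specific in it is an assumption for illustration: the `inference_manifests` helper, the image name, port 8080, the resource requests, and the 60% CPU target. CPU utilization is used only because it needs no extra metrics pipeline; an LLM service may scale better on a custom metric such as queue depth.

```python
import json


def inference_manifests(name, image, min_replicas=1, max_replicas=8, cpu_target=60):
    """Build a Deployment and an autoscaling/v2 HPA that targets it."""
    labels = {"app": name}
    deployment = {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name},
        "spec": {
            "replicas": min_replicas,
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    "containers": [{
                        "name": name,
                        "image": image,
                        "ports": [{"containerPort": 8080}],
                        # Utilization is measured relative to the CPU request,
                        # so the request must be set for the HPA to work.
                        "resources": {"requests": {"cpu": "2", "memory": "8Gi"}},
                    }],
                },
            },
        },
    }
    hpa = {
        "apiVersion": "autoscaling/v2",
        "kind": "HorizontalPodAutoscaler",
        "metadata": {"name": f"{name}-hpa"},
        "spec": {
            "scaleTargetRef": {"apiVersion": "apps/v1", "kind": "Deployment", "name": name},
            "minReplicas": min_replicas,
            "maxReplicas": max_replicas,
            "metrics": [{
                "type": "Resource",
                "resource": {
                    "name": "cpu",
                    "target": {"type": "Utilization", "averageUtilization": cpu_target},
                },
            }],
        },
    }
    return [deployment, hpa]


# Hypothetical image reference; replace with your own inference container.
manifests = inference_manifests("llm-inference", "gcr.io/my-project/llm-inference:latest")
# Wrap both objects in a v1 List so they can be applied in one step.
print(json.dumps({"apiVersion": "v1", "kind": "List", "items": manifests}, indent=2))
```

The printed JSON can be saved to a file and applied with `kubectl apply -f`, or the same dictionaries can serve as a starting point for resources defined through a Kubernetes provider in the Pulumi program.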