1. Auto-scaled GPU Instances for AI Workloads


    Auto-scaling GPU instances for AI workloads is an approach that dynamically adjusts the number of GPU instances in the cloud based on current demand, ensuring that your AI applications have enough computational power while keeping costs under control.

    To set up auto-scaled GPU instances, you generally need the following:

    1. Instance Template: This defines the blueprint for the instances that will be launched, including the machine types (with GPU), disk settings, network settings, and other instance configurations.

    2. Managed Instance Group: This group uses the instance template to manage the lifecycle of instances, handling creation, deletion, and automatic scaling.

    3. Autoscaler: The autoscaler resource is attached to the managed instance group and adjusts the number of instances in the group based on the defined scaling policies.

    In the context of Google Cloud Platform (GCP), which offers comprehensive support for GPU instances, you could use an InstanceTemplate to define the GPU-equipped setup, an InstanceGroupManager to manage the group of such instances, and an Autoscaler to handle scaling based on the workload.

    Below is a complete Pulumi program in Python that sets up an auto-scaled GPU instance group for AI workloads on GCP. It creates an instance template configured with GPUs, uses that template in a managed instance group, and finally defines an autoscaler that adjusts the group size based on CPU utilization.

    import pulumi
    import pulumi_gcp as gcp

    # Define the GPU instance template
    gpu_instance_template = gcp.compute.InstanceTemplate(
        "gpu-instance-template",
        machine_type="n1-standard-1",  # choose based on your workload's CPU/RAM requirements
        tags=["ai-gpu-instance"],  # tags come in handy for networking rules or cost tracking
        disks=[{
            "boot": True,
            "auto_delete": True,
            "device_name": "instance-template",
            "initialize_params": {  # setup of the boot disk
                "image": "your-os-image",  # provide your custom image or one of the public ones
                "size_gb": 50,
            },
        }],
        # GPUs are specified within the "guest_accelerators" field
        guest_accelerators=[{
            "type": "nvidia-tesla-k80",  # specify the type of GPU
            "count": 1,  # number of GPUs per instance
        }],
        scheduling={
            "on_host_maintenance": "TERMINATE",  # required for GPU instances, which cannot live-migrate
            "preemptible": False,  # not preemptible, since it hosts critical workloads
        },
        service_accounts=[{  # service account with the required permissions
            "email": "default",
            "scopes": ["https://www.googleapis.com/auth/cloud-platform"],
        }],
    )

    # Define the managed instance group using the instance template
    managed_instance_group = gcp.compute.InstanceGroupManager(
        "managed-instance-group",
        base_instance_name="ai-gpu-instance",  # name prefix for instances
        versions=[{
            "instance_template": gpu_instance_template.self_link,  # link to the template created above
        }],
        target_size=1,  # starting size of the group
        zone="us-central1-a",  # deployment zone; choose for proximity to your users or data
    )

    # Define the autoscaler
    autoscaler = gcp.compute.Autoscaler(
        "autoscaler",
        target=managed_instance_group.self_link,  # attach to our instance group
        zone="us-central1-a",
        autoscaling_policy={
            "min_replicas": 1,  # minimum number of running instances
            "max_replicas": 10,  # maximum number of instances
            "cpu_utilization": {
                "target": 0.6,  # target CPU utilization that triggers scaling actions
            },
            "cooldown_period": 45,  # seconds to wait between scaling actions
        },
    )

    # Export the autoscaler's name
    pulumi.export("autoscaler_name", autoscaler.name)

    In the program above:

    • We first create an instance template called gpu-instance-template, specifying the necessary configurations such as the machine type, boot disk image, accelerator type (GPU), and service account.

    • We then set up the managed instance group managed-instance-group that refers to our instance template. This group manages the lifecycle of instances it controls.

    • Finally, we define the autoscaler that will automatically scale the managed instance group in and out based on the CPU utilization of the instances.
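    To make the last point concrete: for CPU-based scaling, GCE's autoscaler roughly recommends ceil(current_replicas × current_utilization ÷ target_utilization) instances, clamped to the configured bounds. A minimal sketch of that sizing rule (plain Python, independent of Pulumi; the function name is ours):

```python
import math

def recommended_replicas(current_replicas: int,
                         current_utilization: float,
                         target_utilization: float = 0.6,
                         min_replicas: int = 1,
                         max_replicas: int = 10) -> int:
    """Approximate the GCE autoscaler's CPU-based sizing recommendation."""
    raw = math.ceil(current_replicas * current_utilization / target_utilization)
    return max(min_replicas, min(max_replicas, raw))

# Three instances running hot at 90% CPU against a 60% target -> scale out to 5
print(recommended_replicas(3, 0.9))
# One nearly idle instance -> stay at the minimum of 1
print(recommended_replicas(1, 0.1))
```

    With the policy above (target 0.6, bounds 1-10), three instances at 90% CPU would thus be scaled out to five.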

    The pulumi.export statement will output the name of the autoscaler when the deployment succeeds, which you can use to reference and manage this autoscaler resource.

    You can adjust the machine_type, image, type under guest_accelerators, and autoscaling_policy parameters based on the requirements of your AI workloads.
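    One convenient pattern is to gather those knobs in one place and sanity-check them before passing them to the resources above. A minimal sketch (plain Python; the helper name and the example values are illustrative, not part of the Pulumi API):

```python
# Tunable settings for the GPU instance group; adjust per workload.
WORKLOAD_SETTINGS = {
    "machine_type": "n1-standard-8",   # more vCPUs/RAM for heavier training jobs
    "gpu_type": "nvidia-tesla-t4",     # e.g. T4 for inference, V100/A100 for training
    "gpu_count": 1,
    "cpu_utilization_target": 0.6,     # lower -> scale out earlier
    "min_replicas": 1,
    "max_replicas": 10,
}

def validate_settings(s: dict) -> dict:
    """Fail fast on obviously invalid values before creating cloud resources."""
    assert s["gpu_count"] >= 1, "need at least one GPU per instance"
    assert 0.0 < s["cpu_utilization_target"] <= 1.0, "target must be a fraction in (0, 1]"
    assert 1 <= s["min_replicas"] <= s["max_replicas"], "replica bounds are inconsistent"
    return s

settings = validate_settings(WORKLOAD_SETTINGS)
```

    The validated dictionary can then feed the `machine_type`, `guest_accelerators`, and `autoscaling_policy` arguments in the program above.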