1. Auto-scaling Compute Engine instances for distributed ML training

    Auto-scaling in the cloud lets you adjust the number of compute instances automatically based on demand. This is especially useful for distributed machine learning (ML) training, where workloads can be highly variable.

    To achieve auto-scaling, you typically need a Managed Instance Group (MIG) that governs the instances dedicated to ML training. A MIG lets you operate a fleet of instances as a single entity, making it easier to scale them out or in. On top of that, you attach an Autoscaler, which automatically adjusts the size of the group based on load.

    Here's how to set this up with Pulumi in Python for Google Cloud:

    • Use the InstanceTemplate to define the properties for each instance in the group.
    • Create an InstanceGroupManager (the managed instance group) based on that instance template.
    • Attach an Autoscaler to the instance group manager that responds to the desired metrics, such as CPU utilization.

    I'll provide a Pulumi program that sets up these resources for a Google Cloud project, taking the following steps:

    1. Define an instance template (gcp.compute.InstanceTemplate) that specifies the configuration for instances in the managed instance group, including a machine type and disk configuration suitable for ML tasks.

    2. Create a managed instance group (gcp.compute.InstanceGroupManager) from this instance template, which allows the instances to be managed as a single entity.

    3. Attach an autoscaler (gcp.compute.Autoscaler) to the instance group manager, which automatically scales the number of instances in the group based on specified criteria, such as CPU load or a custom metric related to your ML training needs.

    Here is the complete Pulumi program written in Python to create auto-scaling Google Compute Engine instances for distributed ML training:

    import pulumi
    import pulumi_gcp as gcp

    # Define an instance template for ML training instances.
    ml_instance_template = gcp.compute.InstanceTemplate("ml-instance-template",
        machine_type="n1-standard-4",  # Adjust the machine type to your ML workload.
        disks=[gcp.compute.InstanceTemplateDiskArgs(
            boot=True,
            source_image="projects/debian-cloud/global/images/family/debian-11",  # Choose an OS image suitable for ML.
            disk_size_gb=50,  # Define the disk size.
        )],
        network_interfaces=[gcp.compute.InstanceTemplateNetworkInterfaceArgs(
            network="default",  # Replace with your VPC network if required.
        )],
        # Add any other properties required for your ML instances.
    )

    # Create a managed instance group using the instance template.
    ml_instance_group_manager = gcp.compute.InstanceGroupManager("ml-instance-group-manager",
        base_instance_name="ml-instance",  # Naming prefix for instances.
        versions=[gcp.compute.InstanceGroupManagerVersionArgs(
            instance_template=ml_instance_template.self_link,
        )],
        target_size=1,  # Initial number of instances; the autoscaler manages it from here.
        zone="us-central1-a",  # The zone where instances are deployed.
    )

    # Attach an autoscaler to the managed instance group.
    ml_autoscaler = gcp.compute.Autoscaler("ml-autoscaler",
        target=ml_instance_group_manager.self_link,
        zone="us-central1-a",
        autoscaling_policy=gcp.compute.AutoscalerAutoscalingPolicyArgs(
            max_replicas=10,  # Maximum number of instances.
            min_replicas=1,  # Minimum number of instances.
            cpu_utilization=gcp.compute.AutoscalerAutoscalingPolicyCpuUtilizationArgs(
                target=0.6,  # Target CPU utilization that triggers scaling actions.
            ),
            cooldown_period=60,  # Cooldown period after scaling actions, in seconds.
        ))

    # Export the URL of the managed instance group for monitoring or management tasks.
    pulumi.export("instance_group_url", ml_instance_group_manager.self_link)

    In this program:

    • We define ml_instance_template, specifying the compute resources for a typical ML training task: the machine_type, disk configuration, network settings, and any other requirements specific to your workload.
    • We create ml_instance_group_manager, which manages instances created from that template (referenced through its versions list). It starts with a target size of 1 instance, which the autoscaler then adjusts based on demand.
    • We attach ml_autoscaler to the instance group manager. Its policy aims to keep CPU utilization around 60%, scaling up to a maximum of 10 instances and down to a minimum of 1. A custom-metric variant is sketched below.
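
    CPU utilization is often a weak proxy for training load, so the autoscaling policy can instead follow a custom Cloud Monitoring metric. The sketch below is minimal and rests on assumptions not in the program above: the metric name custom.googleapis.com/ml/pending_training_tasks is hypothetical and would have to be published by your training workers, and since a MIG can only have one autoscaler, this would replace ml_autoscaler rather than run alongside it.

    # Hedged sketch: scale on a custom Cloud Monitoring metric instead of CPU.
    # The metric name is hypothetical; your training workers must publish it.
    ml_metric_autoscaler = gcp.compute.Autoscaler("ml-metric-autoscaler",
        target=ml_instance_group_manager.self_link,
        zone="us-central1-a",
        autoscaling_policy=gcp.compute.AutoscalerAutoscalingPolicyArgs(
            max_replicas=10,
            min_replicas=1,
            cooldown_period=60,
            metrics=[gcp.compute.AutoscalerAutoscalingPolicyMetricArgs(
                name="custom.googleapis.com/ml/pending_training_tasks",  # Hypothetical metric.
                target=5,  # Aim for roughly 5 pending tasks per running instance.
                type="GAUGE",
            )],
        ))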

    I recommend reviewing the Pulumi documentation for the InstanceTemplate, InstanceGroupManager, and Autoscaler resources to see all the available options and make sure the configuration meets your exact requirements for distributed ML training.
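
    For real distributed training you will likely also want accelerators on the instances. The following is a minimal sketch of an instance-template variant, assuming an NVIDIA T4 is available in us-central1-a and a hypothetical bootstrap script (/opt/scripts/start-training-worker.sh) that starts a training worker; neither assumption comes from the program above.

    # Hedged sketch: an instance-template variant with a GPU attached.
    gpu_instance_template = gcp.compute.InstanceTemplate("ml-gpu-instance-template",
        machine_type="n1-standard-4",
        disks=[gcp.compute.InstanceTemplateDiskArgs(
            boot=True,
            source_image="projects/debian-cloud/global/images/family/debian-11",
            disk_size_gb=50,
        )],
        network_interfaces=[gcp.compute.InstanceTemplateNetworkInterfaceArgs(
            network="default",
        )],
        guest_accelerators=[gcp.compute.InstanceTemplateGuestAcceleratorArgs(
            type="nvidia-tesla-t4",  # Must be available in the deployment zone.
            count=1,
        )],
        # GPU instances cannot live-migrate; maintenance must terminate them.
        scheduling=gcp.compute.InstanceTemplateSchedulingArgs(
            on_host_maintenance="TERMINATE",
            automatic_restart=True,
        ),
        # Hypothetical bootstrap that starts a training worker on boot.
        metadata_startup_script="#!/bin/bash\n/opt/scripts/start-training-worker.sh\n",
    )

    You would then reference gpu_instance_template in the instance group manager's versions list in place of ml_instance_template. As with the main program, deploy with pulumi up after setting gcp:project (and any other required config) via pulumi config set.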