1. Autoscaling GPU clusters for Deep Learning in OCI


    Autoscaling GPU clusters are well suited to deep learning workloads because they adjust compute capacity to match demand, scaling up during training runs and back down when idle. In Oracle Cloud Infrastructure (OCI), you can create such an autoscaling cluster using a combination of services: Compute Instances, Instance Pools, and Autoscaling Configurations.

    Below is an explanation and corresponding Pulumi Python program that sets up an autoscaling GPU cluster suitable for deep learning tasks in OCI.

    Firstly, we create a Compute Instance Configuration which describes the setup of each instance in the cluster, including GPU shapes and other specifications necessary for deep learning workloads.

    Then, we establish an Instance Pool, a collection of identical compute instances managed as a single entity, which utilizes the instance configuration previously defined.

    Next, an Autoscaling Configuration is attached to the Instance Pool to automatically adjust the number of instances in response to the workload. It defines rules based on metric thresholds (such as CPU or memory utilization) that dictate when to scale out (add instances) or scale in (remove instances).
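    The threshold logic can be pictured with a small standalone sketch (plain Python, not part of the Pulumi program; the thresholds and step size are illustrative, mirroring the policy defined later):

```python
def scaling_delta(cpu_utilization: float,
                  scale_out_threshold: float = 75.0,
                  scale_in_threshold: float = 25.0,
                  step: int = 1) -> int:
    """Mimic a threshold autoscaling policy: return the change in
    instance count for a given average CPU utilization (percent)."""
    if cpu_utilization > scale_out_threshold:
        return step    # scale out: add instances
    if cpu_utilization < scale_in_threshold:
        return -step   # scale in: remove instances
    return 0           # within the band: no change


def apply_delta(current: int, delta: int,
                minimum: int = 1, maximum: int = 10) -> int:
    """Clamp the new pool size to the policy's capacity bounds."""
    return max(minimum, min(maximum, current + delta))


print(apply_delta(3, scaling_delta(90.0)))   # high load: grows to 4
print(apply_delta(3, scaling_delta(10.0)))   # low load: shrinks to 2
print(apply_delta(10, scaling_delta(90.0)))  # already at max: stays 10
```

    In the real service, OCI evaluates the pool's aggregated metric over a cooldown window and applies the matching rule for you; this sketch only shows the decision shape.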

    The following Pulumi Python program demonstrates how you might set these up. Note that the pulumi_oci package is used for resources in Oracle Cloud Infrastructure:

```python
import pulumi
import pulumi_oci as oci

# Configuration variables for your OCI environment.
# Make sure to replace these with your own values or look them up dynamically.
compartment_id = 'YOUR_COMPARTMENT_ID'
availability_domain = 'YOUR_AVAILABILITY_DOMAIN'
subnet_id = 'YOUR_SUBNET_ID'
image_id = 'YOUR_GPU_INSTANCE_IMAGE_ID'  # The image ID for your GPU instance
shape = 'YOUR_GPU_SHAPE'                 # The specific GPU shape for deep learning

# Create an instance configuration for GPU instances
gpu_instance_config = oci.core.InstanceConfiguration("gpuInstanceConfig",
    compartment_id=compartment_id,
    instance_details=oci.core.InstanceConfigurationInstanceDetailsArgs(
        instance_type="compute",
        launch_details=oci.core.InstanceConfigurationInstanceDetailsLaunchDetailsArgs(
            availability_domain=availability_domain,
            compartment_id=compartment_id,
            display_name="DeepLearningInstance",
            image_id=image_id,
            shape=shape,
            create_vnic_details=oci.core.InstanceConfigurationInstanceDetailsLaunchDetailsCreateVnicDetailsArgs(
                subnet_id=subnet_id,
            ),
            # Further properties such as metadata, agent configurations, and
            # attached block volumes can be specified here.
        ),
    ))

# Create an instance pool using the instance configuration
gpu_instance_pool = oci.core.InstancePool("gpuInstancePool",
    compartment_id=compartment_id,
    instance_configuration_id=gpu_instance_config.id,
    size=1,  # Start with a pool size of 1; autoscaling will adjust this
    placement_configurations=[oci.core.InstancePoolPlacementConfigurationArgs(
        availability_domain=availability_domain,
        primary_subnet_id=subnet_id,
        # Specify secondary VNICs and fault domains if necessary
    )])

# Define the autoscaling configuration
autoscale_config = oci.autoscaling.AutoScalingConfiguration("autoscaleConfig",
    compartment_id=compartment_id,
    display_name="DeepLearningAutoscaleConfig",
    is_enabled=True,
    cool_down_in_seconds=300,  # Minimum time between scaling actions
    auto_scaling_resources=oci.autoscaling.AutoScalingConfigurationAutoScalingResourcesArgs(
        id=gpu_instance_pool.id,
        type="instancePool",
    ),
    policies=[oci.autoscaling.AutoScalingConfigurationPolicyArgs(
        policy_type="threshold",
        capacity=oci.autoscaling.AutoScalingConfigurationPolicyCapacityArgs(
            initial=1,
            max=10,  # Set your maximum number of instances
            min=1,   # Set your minimum number of instances
        ),
        rules=[
            oci.autoscaling.AutoScalingConfigurationPolicyRuleArgs(
                display_name="scale-out",
                action=oci.autoscaling.AutoScalingConfigurationPolicyRuleActionArgs(
                    type="CHANGE_COUNT_BY",
                    value=1,  # Number of instances to add when scaling out
                ),
                metric=oci.autoscaling.AutoScalingConfigurationPolicyRuleMetricArgs(
                    metric_type="CPU_UTILIZATION",  # You can also define custom metrics
                    threshold=oci.autoscaling.AutoScalingConfigurationPolicyRuleMetricThresholdArgs(
                        operator="GT",  # Scale out when utilization exceeds the value
                        value=75,       # Target utilization percentage to trigger scaling out
                    ),
                ),
            ),
            oci.autoscaling.AutoScalingConfigurationPolicyRuleArgs(
                display_name="scale-in",
                action=oci.autoscaling.AutoScalingConfigurationPolicyRuleActionArgs(
                    type="CHANGE_COUNT_BY",
                    value=-1,  # Number of instances to remove when scaling in
                ),
                metric=oci.autoscaling.AutoScalingConfigurationPolicyRuleMetricArgs(
                    metric_type="CPU_UTILIZATION",
                    threshold=oci.autoscaling.AutoScalingConfigurationPolicyRuleMetricThresholdArgs(
                        operator="LT",  # Scale in when utilization drops below the value
                        value=25,
                    ),
                ),
            ),
        ],
    )])

# Export the instance pool and autoscale configuration IDs
pulumi.export("gpu_instance_pool_id", gpu_instance_pool.id)
pulumi.export("autoscale_configuration_id", autoscale_config.id)
```

    This program starts by defining a Compute Instance Configuration for the GPU instances, which is then used to create an Instance Pool. The size of this pool is initially set to 1 and is managed automatically based on the Autoscaling Configuration defined afterwards. The autoscaling policy includes rules that trigger scaling actions when the pool's average CPU utilization crosses a specified threshold.

    Please replace 'YOUR_COMPARTMENT_ID', 'YOUR_AVAILABILITY_DOMAIN', 'YOUR_SUBNET_ID', 'YOUR_GPU_INSTANCE_IMAGE_ID', and 'YOUR_GPU_SHAPE' with the actual values that pertain to your environment in Oracle Cloud Infrastructure.

    To deploy this infrastructure, first ensure you've installed the Pulumi CLI and configured it for use with Oracle Cloud Infrastructure. Then save this program in a file named __main__.py inside an initialized Pulumi project, install the OCI Pulumi plugin by running pulumi plugin install resource oci <VERSION>, and run pulumi up to create the resources. The pulumi.export lines will output the IDs of the created resources, which is helpful for further management and referencing within your OCI environment.

    Keep in mind that real-world setups for deep learning tasks might require additional considerations such as specific GPU drivers, deep learning libraries, and storage configurations, which should be addressed within instance provisioning scripts or setup commands.
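    One common way to handle such provisioning is to pass a cloud-init script through the instance metadata. The sketch below is illustrative only: the package names and versions are placeholders, not values taken from the program above, and OCI requires the user_data field to be base64-encoded:

```python
import base64

# Hypothetical provisioning script; substitute the GPU drivers and deep
# learning libraries your workload actually needs.
cloud_init = """#!/bin/bash
set -euo pipefail
apt-get update
apt-get install -y nvidia-driver-535 python3-pip
pip3 install torch
"""

# OCI expects user_data to be base64-encoded in the instance metadata.
metadata = {"user_data": base64.b64encode(cloud_init.encode()).decode()}
```

    This metadata dict could then be supplied via the launch details of the instance configuration, so every instance the pool creates runs the script on first boot.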