1. Scaling GPU Instances for Deep Learning Workloads


    To scale GPU instances for deep learning workloads, you can use a cloud provider such as AWS, GCP, or Azure to deploy VMs or containerized applications backed by GPU resources. For this task, we define an auto-scaling configuration that dynamically adjusts the number of GPU instances based on the workload's demand.

    Below is a Pulumi program that demonstrates setting up an auto-scaling, GPU-enabled Virtual Machine Scale Set in Azure. The azure_native.compute.VirtualMachineScaleSet resource creates a managed group of identical, load-balanced, auto-scaling VMs. Additionally, azure_native.compute.VirtualMachineScaleSetExtension can be used to install custom applications or perform configuration-management tasks on the VM instances.
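    To give a feel for what such an extension carries, here is a small sketch of the JSON settings document that the Linux CustomScript extension (publisher 'Microsoft.Azure.Extensions', type 'CustomScript') expects. The install command shown is a placeholder, not a recommended bootstrap script; substitute your own.

```python
import json

def custom_script_settings(command: str) -> str:
    """Return the JSON 'settings' document the CustomScript extension expects.

    The extension runs 'command' on each VM instance at provisioning time.
    """
    return json.dumps({"commandToExecute": command})

# Placeholder bootstrap command; replace with your own framework installation.
settings = custom_script_settings(
    "apt-get update && apt-get install -y python3-pip && pip3 install torch"
)
print(settings)
```

    You would pass a payload like this as the extension's settings when attaching it to the scale set.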

    In this example, the scale set is configured with a custom auto-scaling rule that scales up or down based on CPU load. Deep learning workloads are typically GPU-bound, so CPU load is only a proxy; the example uses it for simplicity, but in practice you would scale on metrics that reflect GPU utilization, if your cloud provider's monitoring system exposes them.
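    The scale-out rule generalizes directly to such a metric. The sketch below builds the trigger/action pair as a plain dictionary so the shape is easy to see; note that 'gpu_utilization' is a hypothetical custom metric you would publish to Azure Monitor yourself (for example, from nvidia-smi or DCGM running on each instance), not a built-in VMSS metric.

```python
def scale_out_rule(metric_name: str, threshold: float) -> dict:
    """Build a scale-out rule: add one instance when the 5-minute
    average of the given metric exceeds the threshold."""
    return {
        "metric_trigger": {
            "metric_name": metric_name,
            "time_grain": "PT1M",
            "statistic": "Average",
            "time_window": "PT5M",
            "time_aggregation": "Average",
            "operator": "GreaterThan",
            "threshold": threshold,
        },
        "scale_action": {
            "direction": "Increase",
            "type": "ChangeCount",
            "value": "1",
            "cooldown": "PT5M",
        },
    }

cpu_rule = scale_out_rule("Percentage CPU", 75)   # built-in VMSS metric
gpu_rule = scale_out_rule("gpu_utilization", 80)  # hypothetical custom metric
```

    Only the metric name and threshold change between the two rules; the trigger window and scale action stay the same.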

    Let's go through the steps:

    1. Define the VM scale set, specifying parameters such as the VM size (Standard_NC6, an Azure VM size that includes a single NVIDIA K80 GPU), the initial capacity, and the image to use for the VMs.
    2. Define the auto-scale settings with rules based on CPU metrics.
    3. Apply any custom extensions necessary for your workload, such as installing deep learning frameworks or tools.

    Remember that before you run this code, you need to set up Azure credentials for Pulumi using the Azure CLI and have the appropriate permissions to create and manage these resources.

```python
import pulumi
import pulumi_azure_native as azure_native

# Define the resource group where the resources will be deployed.
resource_group = azure_native.resources.ResourceGroup('gpu_resource_group')

# Define the virtual network and subnet for the VM scale set.
# The address ranges below are placeholders; adjust them to your network plan.
network = azure_native.network.VirtualNetwork(
    'gpu_vnet',
    resource_group_name=resource_group.name,
    address_space=azure_native.network.AddressSpaceArgs(
        address_prefixes=['10.0.0.0/16'],
    ),
)

subnet = azure_native.network.Subnet(
    'gpu_subnet',
    resource_group_name=resource_group.name,
    virtual_network_name=network.name,
    address_prefix='10.0.0.0/24',
)

# Create the virtual machine scale set with GPU instances.
vmss = azure_native.compute.VirtualMachineScaleSet(
    'gpu_vmss',
    resource_group_name=resource_group.name,
    location=resource_group.location,
    sku=azure_native.compute.SkuArgs(
        name='Standard_NC6',  # Azure VM size with a single K80 GPU; the SKU determines the GPU type.
        tier='Standard',
        capacity=1,  # Initial instance count; the autoscale setting below adjusts it.
    ),
    overprovision=True,
    upgrade_policy=azure_native.compute.UpgradePolicyArgs(
        mode='Manual',
    ),
    virtual_machine_profile=azure_native.compute.VirtualMachineScaleSetVMProfileArgs(
        os_profile=azure_native.compute.VirtualMachineScaleSetOSProfileArgs(
            computer_name_prefix='gpuvm',
            admin_username='adminuser',
            # Note: handle secrets more securely via Azure Key Vault or Pulumi config secrets.
            admin_password='P@ssw0rd1234!',
        ),
        storage_profile=azure_native.compute.VirtualMachineScaleSetStorageProfileArgs(
            image_reference=azure_native.compute.ImageReferenceArgs(
                publisher='Canonical',
                offer='UbuntuServer',
                sku='18.04-LTS',
                version='latest',
            ),
        ),
        network_profile=azure_native.compute.VirtualMachineScaleSetNetworkProfileArgs(
            network_interface_configurations=[
                azure_native.compute.VirtualMachineScaleSetNetworkConfigurationArgs(
                    name='nicconfig1',
                    primary=True,
                    enable_accelerated_networking=True,
                    ip_configurations=[
                        azure_native.compute.VirtualMachineScaleSetIPConfigurationArgs(
                            name='IPConfiguration',
                            subnet=azure_native.compute.ApiEntityReferenceArgs(
                                id=subnet.id,
                            ),
                        )
                    ],
                )
            ],
        ),
    ),
)

# Define the autoscale setting with scale-out and scale-in rules.
autoscale_setting = azure_native.insights.AutoscaleSetting(
    'autoscaleSetting',
    resource_group_name=resource_group.name,
    location=resource_group.location,
    target_resource_uri=vmss.id,
    profiles=[azure_native.insights.AutoscaleProfileArgs(
        name='autoScaleProfile',
        capacity=azure_native.insights.ScaleCapacityArgs(
            default='1',
            minimum='1',
            maximum='10',
        ),
        rules=[
            # Scale out by one instance when average CPU exceeds 75% over 5 minutes.
            azure_native.insights.ScaleRuleArgs(
                metric_trigger=azure_native.insights.MetricTriggerArgs(
                    metric_name='Percentage CPU',
                    metric_resource_uri=vmss.id,
                    time_grain='PT1M',
                    statistic='Average',
                    time_window='PT5M',
                    time_aggregation='Average',
                    operator='GreaterThan',
                    threshold=75,
                ),
                scale_action=azure_native.insights.ScaleActionArgs(
                    direction='Increase',
                    type='ChangeCount',
                    value='1',
                    cooldown='PT5M',
                ),
            ),
            # Scale in by one instance when average CPU drops below 25% over 5 minutes.
            azure_native.insights.ScaleRuleArgs(
                metric_trigger=azure_native.insights.MetricTriggerArgs(
                    metric_name='Percentage CPU',
                    metric_resource_uri=vmss.id,
                    time_grain='PT1M',
                    statistic='Average',
                    time_window='PT5M',
                    time_aggregation='Average',
                    operator='LessThan',
                    threshold=25,
                ),
                scale_action=azure_native.insights.ScaleActionArgs(
                    direction='Decrease',
                    type='ChangeCount',
                    value='1',
                    cooldown='PT5M',
                ),
            ),
        ],
    )],
)

# Export the VMSS ID and the autoscale setting ID.
pulumi.export('vmss_id', vmss.id)
pulumi.export('autoscale_setting_id', autoscale_setting.id)
```

    This creates a basic, auto-scaling, GPU-enabled Azure Virtual Machine Scale Set that adds an instance when the average CPU load exceeds 75% over a 5-minute window and removes one when it drops below 25%.
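    The resulting behavior can be sketched as a minimal simulation, ignoring the cooldown periods for simplicity: the capacity moves by one instance per evaluation and is always clamped to the profile's [1, 10] bounds.

```python
def next_capacity(avg_metric: float, current: int,
                  high: float = 75, low: float = 25,
                  minimum: int = 1, maximum: int = 10) -> int:
    """Return the next instance count under the autoscale rules above
    (change-by-one, clamped to the profile's capacity bounds)."""
    if avg_metric > high:
        current += 1
    elif avg_metric < low:
        current -= 1
    return max(minimum, min(maximum, current))

print(next_capacity(80, 5))   # above 75% -> scales out to 6
print(next_capacity(20, 5))   # below 25% -> scales in to 4
print(next_capacity(50, 5))   # inside the band -> stays at 5
print(next_capacity(20, 1))   # clamped at the minimum of 1
```

    In the real service, the cooldown also prevents a new scale action within 5 minutes of the last one, which this sketch does not model.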

    In addition, you can customize the scaling rules further around the performance indicators that matter for your deep learning workload. It is also possible to integrate with Application Insights for richer metrics and control, and the Azure Machine Learning service can complement this setup with a more comprehensive environment for managing machine learning projects.