Scalable VM Storage Backend for AI Clusters

Question

Pulumi · Accepted Answer

A scalable VM storage backend for AI clusters usually combines two tiers: high-performance block storage attached to the compute VMs for working data, and object storage for data that must persist independently of any single VM.

This example uses Azure as the cloud provider. Azure Managed Disks provide high-performance block storage suitable for VMs in compute clusters, and Azure Blob Storage is an object storage service for large-scale data.

We will create a Pulumi program that sets up the following resources:
1. A resource group to organize the resources.
2. A storage account for blob storage.
3. A virtual network and subnet for the cluster's VMs.
4. A virtual machine scale set whose instances each have an Azure Managed Disk attached as a data disk.

The virtual machine scale set stands in for the compute cluster's VMs, which would process the AI workloads. The attached Managed Disks serve as block storage for computation, and blob containers in the storage account can hold datasets or the outputs of AI models.
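
Because each instance in the scale set gets its own data disk, total block capacity grows linearly with the instance count. The following is a minimal sketch of that arithmetic in plain Python, using the example values from the program below (3 instances, one 100 GB data disk each). The `ScaleSetStorage` class and its field names are hypothetical helpers for illustration and are not part of Pulumi or Azure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScaleSetStorage:
    """Hypothetical helper describing per-instance data disks in a scale set."""

    instances: int           # VM count (the `capacity` of the scale set SKU)
    disks_per_instance: int  # data disks attached to each VM
    disk_size_gb: int        # size of each data disk in GB

    def total_gb(self) -> int:
        # Every instance gets its own set of data disks,
        # so capacity scales linearly with the instance count.
        return self.instances * self.disks_per_instance * self.disk_size_gb


example = ScaleSetStorage(instances=3, disks_per_instance=1, disk_size_gb=100)
print(example.total_gb())  # 3 * 1 * 100 = 300
```

This capacity is spread across instances rather than shared, which is why data that must outlive an instance belongs in Blob Storage.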

Here's how you would set up such a backend using Pulumi and Python:

```python
import pulumi
from pulumi_azure_native import compute, network, resources, storage

# Create an Azure Resource Group
resource_group = resources.ResourceGroup("ai_storage_resource_group")

# Create an Azure Storage Account (Blob) for persistent data storage
storage_account = storage.StorageAccount(
    "aistorageaccount",
    resource_group_name=resource_group.name,
    sku=storage.SkuArgs(name=storage.SkuName.STANDARD_LRS),
    kind=storage.Kind.STORAGE_V2
)

# Create a Virtual Network and Subnet for the VM scale set
virtual_network = network.VirtualNetwork(
    "aivnet",
    resource_group_name=resource_group.name,
    address_space=network.AddressSpaceArgs(address_prefixes=["10.0.0.0/16"])
)

subnet = network.Subnet(
    "aisubnet",
    resource_group_name=resource_group.name,
    virtual_network_name=virtual_network.name,
    address_prefix="10.0.1.0/24",
)

# Read the admin password from encrypted stack config instead of hardcoding it:
#   pulumi config set --secret adminPassword <value>
config = pulumi.Config()
admin_password = config.require_secret("adminPassword")

# Create a Virtual Machine Scale Set with attached Managed Disks for computation
vmss = compute.VirtualMachineScaleSet(
    "aivmss",
    resource_group_name=resource_group.name,
    sku=compute.SkuArgs(name="Standard_DS1_v2", capacity=3),  # Choose appropriate VM size
    upgrade_policy=compute.UpgradePolicyArgs(mode=compute.UpgradeMode.MANUAL),
    # The OS, network, and storage profiles all belong inside the VM profile
    virtual_machine_profile=compute.VirtualMachineScaleSetVMProfileArgs(
        os_profile=compute.VirtualMachineScaleSetOSProfileArgs(
            computer_name_prefix="aivm",
            admin_username="adminuser",
            admin_password=admin_password,
        ),
        network_profile=compute.VirtualMachineScaleSetNetworkProfileArgs(
            network_interface_configurations=[
                compute.VirtualMachineScaleSetNetworkConfigurationArgs(
                    name="aivmnic",
                    primary=True,
                    ip_configurations=[
                        compute.VirtualMachineScaleSetIPConfigurationArgs(
                            name="aivmssipconfig",
                            subnet=compute.ApiEntityReferenceArgs(id=subnet.id),
                        )
                    ],
                )
            ]
        ),
        storage_profile=compute.VirtualMachineScaleSetStorageProfileArgs(
            # Example Linux image (an assumption); replace with the image your workload needs
            image_reference=compute.ImageReferenceArgs(
                publisher="Canonical",
                offer="0001-com-ubuntu-server-jammy",
                sku="22_04-lts",
                version="latest",
            ),
            os_disk=compute.VirtualMachineScaleSetOSDiskArgs(
                caching=compute.CachingTypes.READ_WRITE,
                create_option=compute.DiskCreateOptionTypes.FROM_IMAGE,
                managed_disk=compute.VirtualMachineScaleSetManagedDiskParametersArgs(
                    storage_account_type=compute.StorageAccountTypes.PREMIUM_LRS
                ),
            ),
            # One empty 100 GB managed data disk per instance
            data_disks=[
                compute.VirtualMachineScaleSetDataDiskArgs(
                    lun=0,
                    caching=compute.CachingTypes.READ_WRITE,
                    disk_size_gb=100,
                    create_option=compute.DiskCreateOptionTypes.EMPTY,
                )
            ],
        ),
    ),
)

# Output the connection info for the AI Cluster
pulumi.export('vmss_id', vmss.id)
pulumi.export('storage_account', storage_account.name)
pulumi.export('storage_account_primary_endpoints', storage_account.primary_endpoints)
```
This program sets up a scalable VM storage backend suitable for AI cluster workloads. It creates all resources in one resource group, which makes them easy to manage and tear down together. The VM scale set uses a small instance type for this example; select an instance size and OS image that match your workload. The Managed Disks give each instance high-performance block storage for computation, and Blob Storage in the storage account acts as persistent storage for data that needs to outlive the scale set instances, such as datasets for AI models.
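
When scaling the instance count, the subnet's address range becomes a ceiling, since each instance's network interface takes one private IP address from it. The sketch below uses Python's standard `ipaddress` module to check that the example subnet sits inside the virtual network and to estimate how many addresses are available. The function names are hypothetical, and the estimate assumes nothing else is deployed in the subnet; Azure reserves five addresses in every subnet.

```python
import ipaddress

# Azure reserves 5 addresses in each subnet (network, broadcast, and 3 for Azure services)
AZURE_RESERVED_PER_SUBNET = 5


def subnet_fits(vnet_cidr: str, subnet_cidr: str) -> bool:
    """Return True if the subnet range lies entirely within the VNet range."""
    vnet = ipaddress.ip_network(vnet_cidr)
    subnet = ipaddress.ip_network(subnet_cidr)
    return subnet.subnet_of(vnet)


def usable_addresses(subnet_cidr: str) -> int:
    """Estimate the addresses left for VM network interfaces in a subnet."""
    subnet = ipaddress.ip_network(subnet_cidr)
    return max(subnet.num_addresses - AZURE_RESERVED_PER_SUBNET, 0)


# Values taken from the example program
print(subnet_fits("10.0.0.0/16", "10.0.1.0/24"))  # True
print(usable_addresses("10.0.1.0/24"))            # 256 - 5 = 251
```

With the example `/24` subnet, the scale set can grow to roughly 251 instances before a larger subnet is needed.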