1. ReadWriteMany Volumes for Concurrent AI Model Training


    When deploying a concurrent AI model training infrastructure on the cloud, you typically need a mechanism to share data across multiple training instances. A common solution for this is to use ReadWriteMany (RWX) volumes that allow multiple nodes to read from and write to the same filesystem concurrently.

    For this purpose, Kubernetes is a popular choice because it supports Persistent Volume Claims (PVCs) with the RWX access mode. On cloud providers such as AWS or Azure, you can use their native file services (like AWS EFS or Azure Files) to back these PVCs.

    To create such an infrastructure, you would:

    1. Set up a Kubernetes cluster.
    2. Deploy a storage service that supports RWX volumes.
    3. Create a Persistent Volume (PV) that references the storage service.
    4. Create a Persistent Volume Claim (PVC) with the ReadWriteMany access mode (a minimal sketch follows this list).
    5. Deploy your AI training applications, referencing the PVC for shared storage.
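    For orientation, here is a minimal sketch of step 4 using Pulumi's Kubernetes SDK in Python. It assumes the cluster already has an RWX-capable storage class; the class name 'azurefile' and the requested size are illustrative, not prescriptive.

    from pulumi_kubernetes import core

    # Minimal sketch of step 4: a PVC requesting the ReadWriteMany access mode.
    # The storage class name is an assumption and must refer to an RWX-capable
    # provisioner (for example, one backed by Azure Files or AWS EFS).
    shared_pvc = core.v1.PersistentVolumeClaim(
        'shared-training-data',
        spec=core.v1.PersistentVolumeClaimSpecArgs(
            access_modes=['ReadWriteMany'],
            storage_class_name='azurefile',  # illustrative RWX-capable class
            resources=core.v1.ResourceRequirementsArgs(
                requests={'storage': '100Gi'}),
        ))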

    Here we'll write a full Pulumi program in Python that sets up a Kubernetes cluster and deploys an Azure Files-backed PVC with the RWX access mode.

    The resources utilised will be:

    • Azure Kubernetes Service (AKS) for the Kubernetes cluster.
    • Azure Files, a fully managed file-sharing service for cloud or on-premises deployments that supports SMB and NFS protocols and RWX access mode for Kubernetes PVCs.

    Let's start with the Pulumi program.

    import base64

    import pulumi
    from pulumi_azure_native import containerservice, resources, storage
    from pulumi_kubernetes import Provider, core

    # Resource Group that holds all of the infrastructure
    resource_group = resources.ResourceGroup('ai_rg')

    # Azure Storage Account and File Share that back the RWX volume
    storage_account = storage.StorageAccount(
        'aistorageaccount',
        resource_group_name=resource_group.name,
        kind=storage.Kind.STORAGE_V2,
        sku=storage.SkuArgs(name=storage.SkuName.STANDARD_LRS))

    file_share = storage.FileShare(
        'aifileshare',
        resource_group_name=resource_group.name,
        account_name=storage_account.name,
        share_name='aisharedfiles',
        enabled_protocols=storage.EnabledProtocols.SMB,
        # Additional properties like a quota can be set here
    )

    # Azure Kubernetes Service (AKS) cluster for the training workloads
    managed_cluster = containerservice.ManagedCluster(
        'aimanagedcluster',
        resource_group_name=resource_group.name,
        agent_pool_profiles=[{
            'count': 3,  # Number of nodes in the node pool
            'max_pods': 110,
            'mode': 'System',
            'name': 'agentpool',
            'os_disk_size_gb': 30,
            'os_type': 'Linux',
            'vm_size': 'Standard_DS2_v2',
        }],
        dns_prefix='ai-dns',
        enable_rbac=True,
        # The azure-native provider requires an identity (or service principal)
        identity=containerservice.ManagedClusterIdentityArgs(type='SystemAssigned'),
        # We could pin a Kubernetes version; for this example we use the default
    )

    # The azure-native ManagedCluster resource does not expose a raw kubeconfig,
    # so fetch the user credentials and base64-decode the embedded kubeconfig
    creds = containerservice.list_managed_cluster_user_credentials_output(
        resource_group_name=resource_group.name,
        resource_name=managed_cluster.name)
    kubeconfig = creds.kubeconfigs[0].value.apply(
        lambda encoded: base64.b64decode(encoded).decode('utf-8'))

    # Kubernetes provider pointing at the newly created AKS cluster
    k8s_provider = Provider('k8sprovider', kubeconfig=kubeconfig)

    # The azureFile volume driver reads the storage account credentials from a
    # Kubernetes Secret, so look up the account key and create that Secret
    storage_keys = storage.list_storage_account_keys_output(
        resource_group_name=resource_group.name,
        account_name=storage_account.name)

    azure_secret = core.v1.Secret(
        'azure-secret',
        metadata={'name': 'azure-secret'},
        string_data={
            'azurestorageaccountname': storage_account.name,
            'azurestorageaccountkey': storage_keys.keys[0].value,
        },
        opts=pulumi.ResourceOptions(provider=k8s_provider))

    # Kubernetes Persistent Volume (PV) that points to the Azure File share
    azure_file_pv = core.v1.PersistentVolume(
        'aifilepv',
        metadata={'name': 'aifilepv'},
        spec=core.v1.PersistentVolumeSpecArgs(
            capacity={'storage': '5Gi'},
            access_modes=['ReadWriteMany'],
            azure_file=core.v1.AzureFilePersistentVolumeSourceArgs(
                secret_name='azure-secret',  # Must match the Secret created above
                share_name=file_share.name,
                read_only=False,
            ),
            persistent_volume_reclaim_policy='Retain',
        ),
        opts=pulumi.ResourceOptions(provider=k8s_provider, depends_on=[azure_secret]))

    # Persistent Volume Claim (PVC) that multiple pods can mount and write to concurrently
    azure_file_pvc = core.v1.PersistentVolumeClaim(
        'aifilepvc',
        metadata={'name': 'aifilepvc'},
        spec=core.v1.PersistentVolumeClaimSpecArgs(
            access_modes=['ReadWriteMany'],
            resources=core.v1.ResourceRequirementsArgs(
                requests={'storage': '5Gi'}),
            storage_class_name='',  # Bind to the pre-created PV, not a dynamically provisioned one
            volume_name=azure_file_pv.metadata['name'],
        ),
        opts=pulumi.ResourceOptions(provider=k8s_provider))

    # Export the kubeconfig (marked secret) for use outside of Pulumi, e.g. with kubectl
    pulumi.export('kubeconfig', pulumi.Output.secret(kubeconfig))

    This program does the following:

    • Creates an Azure Resource Group to organize all resources.
    • Provisions an Azure Storage Account and an Azure File Share to be used as the backing store for the Kubernetes Persistent Volumes.
    • Sets up an Azure Kubernetes Service (AKS) cluster with a node pool to host the training workloads.
    • Fetches the cluster's user credentials, decodes the embedded kubeconfig, and creates a Pulumi Kubernetes provider that communicates with the AKS cluster.
    • Creates a Kubernetes Secret holding the storage account name and key, which the azureFile volume driver needs in order to mount the share.
    • Uses the Kubernetes provider to create a Persistent Volume (PV) backed by Azure Files, and a Persistent Volume Claim (PVC) which can be mounted by multiple pods at the same time (with ReadWriteMany access mode).

    This infrastructure setup will allow you to deploy pods in AKS that can concurrently access the shared Azure File storage, making it suitable for distributed AI model training tasks.
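    To sanity-check the concurrent-write behavior once the stack is up, a throwaway sketch like the following could be appended to the program. The pod names, the busybox image, and the log paths are illustrative only; k8s_provider and the aifilepvc claim come from the program above.

    # Hypothetical smoke test: two pods mount the same PVC and append to files
    # on it at the same time, which only works with an RWX volume.
    for i in range(2):
        core.v1.Pod(
            f'rwx-writer-{i}',
            spec=core.v1.PodSpecArgs(
                containers=[core.v1.ContainerArgs(
                    name='writer',
                    image='busybox',
                    command=['sh', '-c',
                             f'while true; do date >> /mnt/shared/writer-{i}.log; sleep 5; done'],
                    volume_mounts=[core.v1.VolumeMountArgs(
                        name='shared', mount_path='/mnt/shared')],
                )],
                volumes=[core.v1.VolumeArgs(
                    name='shared',
                    persistent_volume_claim=core.v1.PersistentVolumeClaimVolumeSourceArgs(
                        claim_name='aifilepvc'),
                )],
            ),
            opts=pulumi.ResourceOptions(provider=k8s_provider))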

    Remember that before running the Pulumi program, you need to install and configure the Pulumi and Azure CLIs and authenticate with permissions sufficient to create these resources. After that, use the following commands to work with the program:

    • pulumi up - to preview and deploy changes
    • pulumi destroy - to clean up resources
    • pulumi stack - to manage Pulumi stacks (their state)

    The kubeconfig exported at the end can be used to interact with the Kubernetes cluster directly; since it is exported as a secret, retrieve it with pulumi stack output kubeconfig --show-secrets and point kubectl at the resulting file.

    Now you would typically move forward to deploying your AI model training applications, which would mount this RWX PVC and perform concurrent training tasks, sharing data through this centrally managed volume.
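    As a starting point, the hedged sketch below runs several replicas that all mount the shared claim at /data. The Deployment name, labels, image, and command are placeholders for a real training workload; k8s_provider and the aifilepvc claim come from the program above.

    import pulumi
    from pulumi_kubernetes import apps, core, meta

    # Illustrative only: a Deployment whose replicas all mount the shared RWX PVC.
    # Replace the image and command with your actual training entrypoint.
    training = apps.v1.Deployment(
        'ai-training',
        spec=apps.v1.DeploymentSpecArgs(
            replicas=3,
            selector=meta.v1.LabelSelectorArgs(match_labels={'app': 'ai-training'}),
            template=core.v1.PodTemplateSpecArgs(
                metadata={'labels': {'app': 'ai-training'}},
                spec=core.v1.PodSpecArgs(
                    containers=[core.v1.ContainerArgs(
                        name='trainer',
                        image='python:3.11-slim',          # placeholder training image
                        command=['sleep', 'infinity'],     # placeholder training command
                        volume_mounts=[core.v1.VolumeMountArgs(
                            name='training-data', mount_path='/data')],
                    )],
                    volumes=[core.v1.VolumeArgs(
                        name='training-data',
                        persistent_volume_claim=core.v1.PersistentVolumeClaimVolumeSourceArgs(
                            claim_name='aifilepvc'),
                    )],
                ),
            ),
        ),
        opts=pulumi.ResourceOptions(provider=k8s_provider))

    Because every replica mounts the same claim, each trainer can read the shared dataset from /data and write checkpoints or logs there for the others to see.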