1. Scalable AI Model Training on Azure Kubernetes Service


    To train AI models at scale on Azure Kubernetes Service (AKS), you first need an AKS cluster that can be configured to run your training jobs. Training can consume significant resources, so AKS node pools can be scaled to match workload demands. Here's how to create such an infrastructure using Pulumi with the Python programming language.

    First, you'll need to create a Managed Kubernetes Cluster using the azure-native.containerservice.ManagedCluster class, which represents an AKS cluster in Azure.

    In the example below, we'll set up a basic AKS cluster with the following characteristics:

    • A single node pool, which can be scaled manually or automatically as needed.
    • A Linux-based environment, as many AI and ML tools have strong Linux support.
    • RBAC enabled for better access control.
    • An SSH key for accessing nodes directly for troubleshooting and configuration.

    Below is a Pulumi program written in Python that sets up such an AKS cluster. Detailed explanations of the important sections appear in the comments within the code:

    import base64

    import pulumi
    from pulumi_azure_native import containerservice, resources

    # Read the SSH public key for the cluster nodes from the stack configuration.
    config = pulumi.Config()
    ssh_public_key = config.require("sshPublicKey")

    # First, you need to provide a Resource Group for AKS.
    resource_group = resources.ResourceGroup("ai-model-training-rg")

    # Set up the Managed Cluster.
    managed_cluster = containerservice.ManagedCluster(
        "ai-model-training-cluster",
        # Define the location where your cluster will be created. Typically, you
        # choose a region close to your users or data sources.
        location=resource_group.location,
        resource_group_name=resource_group.name,
        # The DNS prefix that is used to create the FQDN for the AKS cluster.
        dns_prefix="aimodeltrainingdns",
        # Enable RBAC for secure access.
        enable_rbac=True,
        identity=containerservice.ManagedClusterIdentityArgs(
            type="SystemAssigned",
        ),
        # The agent pool profile defines the number and type of nodes in the cluster.
        agent_pool_profiles=[containerservice.ManagedClusterAgentPoolProfileArgs(
            name="defaultpool",
            # The first node pool of a cluster must be a System pool.
            mode="System",
            # Auto-scaling requires a scale-set-backed node pool.
            type="VirtualMachineScaleSets",
            # Define the size of the VMs in the node pool.
            vm_size="Standard_DS2_v2",
            # Starting count of nodes in the pool.
            count=3,
            # Enable auto-scaling for the node pool to scale the number of nodes
            # up and down as required.
            enable_auto_scaling=True,
            # Minimum number of nodes for auto-scaling.
            min_count=1,
            # Maximum number of nodes for auto-scaling.
            max_count=5,
            # 'Linux' is generally preferred for AI/ML workloads.
            os_type="Linux",
        )],
        # Define the Linux profile for the AKS cluster. It includes the SSH key
        # that will be used to manage the nodes.
        linux_profile=containerservice.ContainerServiceLinuxProfileArgs(
            admin_username="aksuser",
            ssh=containerservice.ContainerServiceSshConfigurationArgs(
                public_keys=[
                    containerservice.ContainerServiceSshPublicKeyArgs(
                        key_data=ssh_public_key,
                    ),
                ],
            ),
        ),
    )

    # Fetch the user credentials for the cluster. Azure returns the kubeconfig
    # base64-encoded, so decode it before exporting.
    creds = containerservice.list_managed_cluster_user_credentials_output(
        resource_group_name=resource_group.name,
        resource_name=managed_cluster.name,
    )
    kubeconfig = creds.kubeconfigs[0].value.apply(
        lambda enc: base64.b64decode(enc).decode("utf-8")
    )

    # Export the kubeconfig (as a secret) for connecting to the Kubernetes
    # cluster once it's up and running.
    pulumi.export("kubeconfig", pulumi.Output.secret(kubeconfig))
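    Note that the program reads an existing SSH public key from the stack configuration rather than generating one; the sshPublicKey config name is just the convention used in this example. Assuming you already have a key pair, you can set the value with:

    pulumi config set sshPublicKey "$(cat ~/.ssh/id_rsa.pub)"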

    In this code, we do the following:

    1. Import the necessary modules, including Pulumi's Azure Native provider.
    2. Read the SSH public key for the cluster nodes from the Pulumi stack configuration; it is needed to securely connect to the AKS nodes.
    3. Create a new Resource Group dedicated to our AI model training infrastructure, such as the Kubernetes cluster and related resources.
    4. Define a Managed Kubernetes Cluster with a single agent pool profile.
      • We specify the desired region, the number of nodes, the type of VMs for the nodes, and that we want these nodes to be Linux machines with an SSH key.
      • The node pool is configured with auto-scaling enabled to automatically adjust the number of nodes based on demand, within the defined range (min_count and max_count).
    5. Export the kubeconfig, which is necessary to communicate with your Kubernetes cluster using kubectl or any Kubernetes client library.

    To access the kubeconfig, which will allow you to manage the Kubernetes cluster, you can run the following (the --show-secrets flag is needed because the kubeconfig is exported as a secret):

    pulumi stack output kubeconfig --show-secrets > kubeconfig.yaml
    export KUBECONFIG=$(pwd)/kubeconfig.yaml
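    For example, to verify that the cluster nodes are up and ready:

    kubectl get nodes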

    Then you can manage your Kubernetes resources using kubectl or other Kubernetes-compatible tools, such as deploying your AI model training jobs as Kubernetes Jobs, Deployments, or StatefulSets depending on your requirements; a sketch of the Job approach follows below.
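    As an illustration, here is a minimal sketch of how a training run could be deployed as a Kubernetes Job from the same Pulumi program, using the kubeconfig computed above. It assumes the pulumi_kubernetes package is installed, and the image name (myregistry.azurecr.io/trainer:latest) and train.py entrypoint are placeholders for your own training container:

    import pulumi
    import pulumi_kubernetes as k8s

    # Point a Kubernetes provider at the AKS cluster created above, reusing the
    # kubeconfig computed in the same program.
    k8s_provider = k8s.Provider("aks-provider", kubeconfig=kubeconfig)

    # Model a one-off training run as a Kubernetes Job.
    training_job = k8s.batch.v1.Job(
        "ai-training-job",
        spec=k8s.batch.v1.JobSpecArgs(
            template=k8s.core.v1.PodTemplateSpecArgs(
                spec=k8s.core.v1.PodSpecArgs(
                    restart_policy="Never",
                    containers=[k8s.core.v1.ContainerArgs(
                        name="trainer",
                        # Placeholder image: replace with your training container.
                        image="myregistry.azurecr.io/trainer:latest",
                        command=["python", "train.py"],
                        # Request resources so the cluster autoscaler has a
                        # signal to scale the node pool on.
                        resources=k8s.core.v1.ResourceRequirementsArgs(
                            requests={"cpu": "2", "memory": "4Gi"},
                        ),
                    )],
                ),
            ),
            # Retry a failed training pod at most twice.
            backoff_limit=2,
        ),
        opts=pulumi.ResourceOptions(provider=k8s_provider),
    )

    Because the Job is created through a provider bound to the new cluster, Pulumi deploys it only once the cluster and its credentials are available.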

    With Pulumi's infrastructure-as-code approach, you can easily replicate this setup in other environments, share it with team members, or version-control it alongside your application code.