GPU-Accelerated Workloads in AKS
To deploy GPU-accelerated workloads on Azure Kubernetes Service (AKS), you need to create an AKS cluster with a node pool that includes GPU-enabled virtual machines. In Pulumi, you accomplish this by using the `azure-native.containerservice.ManagedCluster` resource to create a Kubernetes cluster and the `azure-native.containerservice.AgentPool` resource to define the node pool necessary for GPU workloads.

Azure provides various types of virtual machines suitable for GPU-accelerated workloads, such as the NC and NV series. When creating the node pool, specify an appropriate VM size (e.g., `Standard_NC6`) that comes equipped with GPUs.

Here is a step-by-step Python program in Pulumi to create an AKS cluster with a node pool suitable for GPU-accelerated workloads:
- Import the necessary Pulumi modules.
- Create a new AKS cluster.
- Create a new agent pool with GPU-enabled virtual machines, selecting a VM size that supports GPUs, such as `Standard_NC6`.
- Export the required outputs, such as the Kubernetes cluster name and the kubeconfig needed to access the cluster.
Let's move on to the program:
```python
import base64

import pulumi
import pulumi_azure_native as azure_native

# Create a new resource group to contain the AKS cluster
resource_group = azure_native.resources.ResourceGroup("gpu_workloads_rg")

# Define the settings for the AKS cluster
managed_cluster = azure_native.containerservice.ManagedCluster(
    "gpu_workloads_cluster",
    resource_group_name=resource_group.name,
    dns_prefix="gpuk8s",  # Replace with your DNS prefix
    agent_pool_profiles=[{
        "count": 1,  # Number of nodes in the initial node pool
        "vm_size": "Standard_NC6",  # GPU-enabled VM size
        "name": "gpunodepool",
        "os_type": "Linux",
        "mode": "System",  # Every AKS cluster needs at least one system node pool
    }],
    identity={
        "type": "SystemAssigned",
    },
    # Additional cluster settings go here...
)

# Define the node pool for GPU workloads
gpu_node_pool = azure_native.containerservice.AgentPool(
    "gpu_node_pool",
    agent_pool_name="gpupool",  # Name of the GPU node pool
    mode="User",  # User node pool, not used for system services
    os_type="Linux",
    vm_size="Standard_NC6",  # GPU-enabled VM size
    count=1,  # Number of nodes
    resource_group_name=resource_group.name,
    resource_name_=managed_cluster.name,  # The managed cluster this pool belongs to
    # Customizable settings like scaling, taints, and labels can be specified here...
)

# Export the cluster name and kubeconfig for accessing the cluster
pulumi.export("cluster_name", managed_cluster.name)

creds = azure_native.containerservice.list_managed_cluster_user_credentials_output(
    resource_group_name=resource_group.name,
    resource_name=managed_cluster.name,
)
# The credential value is base64-encoded; decode it and mark it as a secret
kubeconfig = creds.kubeconfigs[0].value.apply(
    lambda enc: base64.b64decode(enc).decode("utf-8")
)
pulumi.export("kubeconfig", pulumi.Output.secret(kubeconfig))
```
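One detail in the final export is worth calling out: the AKS credentials API returns the kubeconfig content base64-encoded, so it must be decoded before it is usable as YAML. In isolation, the decode step looks like this (the sample content is a stand-in):

```python
import base64

# AKS returns kubeconfig content base64-encoded; round-trip a stand-in
# value here to show the decode step in isolation.
sample_kubeconfig = "apiVersion: v1\nkind: Config\n"
encoded = base64.b64encode(sample_kubeconfig.encode("utf-8")).decode("ascii")
decoded = base64.b64decode(encoded).decode("utf-8")
print(decoded.splitlines()[0])  # apiVersion: v1
```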
In the program above:
- We created a new resource group to host the AKS cluster. You can alternatively use an existing resource group.
- We then set up an AKS cluster with an initial agent pool configuration. The pool is defined with GPU-enabled virtual machines by specifying the `vm_size`.
- We added a dedicated node pool specifically configured with GPU resources, again specifying a VM size that includes a GPU. You can adjust the `count` parameter to have more than one node if you need more GPU resources.
- Finally, we export the `cluster_name` and `kubeconfig` to allow access to the AKS cluster. The `kubeconfig` is marked as a secret because it contains sensitive information used to authenticate to the cluster.
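With the kubeconfig exported, you can deploy GPU workloads to the cluster. Kubernetes schedules a pod onto the GPU nodes when the pod requests the `nvidia.com/gpu` resource, which requires the NVIDIA device plugin to be running on those nodes. Here is a minimal sketch of such a pod manifest as a plain Python dict; the pod name and container image are illustrative:

```python
# Sketch of a pod manifest that requests one GPU via the nvidia.com/gpu
# resource limit. The name and image are illustrative; any CUDA-capable
# image works. Apply it with kubectl or a Pulumi Kubernetes provider.
gpu_pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "gpu-smoke-test"},
    "spec": {
        "restartPolicy": "Never",
        "containers": [{
            "name": "cuda",
            "image": "nvidia/cuda:12.2.0-base-ubuntu22.04",
            "command": ["nvidia-smi"],  # prints GPU info if scheduling worked
            "resources": {"limits": {"nvidia.com/gpu": 1}},
        }],
    },
}
print(gpu_pod["spec"]["containers"][0]["resources"]["limits"])
```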
For any AKS deployment, you might also need to consider other settings such as networking, storage, monitoring, and autoscaling, depending on your application requirements. The above program can be augmented with additional properties to manage these aspects.
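For instance, the GPU node pool can grow and shrink with demand by enabling the cluster autoscaler; the `AgentPool` resource accepts `enable_auto_scaling`, `min_count`, and `max_count` arguments for this. A sketch of the extra arguments, with illustrative bounds:

```python
# Extra AgentPool arguments that enable the cluster autoscaler for the
# GPU pool; the min/max bounds here are illustrative.
autoscaling_args = {
    "enable_auto_scaling": True,  # let AKS add/remove GPU nodes on demand
    "min_count": 1,               # keep at least one GPU node available
    "max_count": 3,               # cap the number of costly GPU VMs
}
# These would be passed alongside the other AgentPool arguments, e.g.:
# azure_native.containerservice.AgentPool("gpu_node_pool", ..., **autoscaling_args)
print(sorted(autoscaling_args))
```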