1. GPU-Accelerated Workloads in AKS


    To deploy GPU-accelerated workloads on Azure Kubernetes Service (AKS), you need an AKS cluster with a node pool of GPU-enabled virtual machines. In Pulumi, you accomplish this with the azure-native.containerservice.ManagedCluster resource to create the Kubernetes cluster and the azure-native.containerservice.AgentPool resource to define the node pool for GPU workloads.

    Azure provides several VM families suitable for GPU-accelerated workloads, such as the NC and NV series. When creating the node pool, specify an appropriate VM size (e.g., Standard_NC6) that comes equipped with GPUs; check current regional availability, since some older GPU sizes have been retired.
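    As a rough sizing guide, GPU counts vary by VM size. The mapping below is an illustrative sketch only — a few common sizes, not an exhaustive or current list — so verify any size against Azure's documentation before using it:

```python
# Illustrative map of a few Azure GPU-enabled VM sizes to their GPU counts.
# This is a sketch for sizing discussions, not an authoritative list.
GPU_VM_SIZES = {
    "Standard_NC6": 1,      # NC series (NVIDIA K80)
    "Standard_NC12": 2,     # NC series (NVIDIA K80)
    "Standard_NV6": 1,      # NV series (NVIDIA M60)
    "Standard_NC6s_v3": 1,  # NCv3 series (NVIDIA V100)
}

def gpus_for_size(vm_size: str) -> int:
    """Return the GPU count for a known size; raise for unrecognized ones."""
    if vm_size not in GPU_VM_SIZES:
        raise ValueError(f"unknown or non-GPU VM size: {vm_size}")
    return GPU_VM_SIZES[vm_size]

print(gpus_for_size("Standard_NC6"))  # → 1
```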

    Here is a step-by-step Python program in Pulumi to create an AKS cluster with a node pool suitable for GPU-accelerated workloads:

    1. Import necessary Pulumi modules.

    2. Create a new AKS cluster.

    3. Create a new agent pool with GPU-enabled virtual machines. You would typically select a VM size that supports GPUs, such as Standard_NC6.

    4. Export the required outputs, such as the Kubernetes cluster name and the kubeconfig needed to access the cluster.

    Let's move on to the program:

    import base64

    import pulumi
    import pulumi_azure_native as azure_native

    # Create a new resource group to contain the AKS cluster
    resource_group = azure_native.resources.ResourceGroup("gpu_workloads_rg")

    # Define the settings for the AKS cluster
    managed_cluster = azure_native.containerservice.ManagedCluster(
        "gpu_workloads_cluster",
        resource_group_name=resource_group.name,
        dns_prefix="gpuk8s",  # Replace with your DNS prefix
        agent_pool_profiles=[{
            "count": 1,                 # Number of nodes in the initial pool
            "vm_size": "Standard_NC6",  # GPU-enabled VM size
            "name": "gpunodepool",
            "os_type": "Linux",
            "mode": "System",           # Every cluster needs at least one system pool
        }],
        identity={
            "type": "SystemAssigned",
        },
        # Additional cluster settings go here...
    )

    # Define a dedicated node pool for GPU workloads
    gpu_node_pool = azure_native.containerservice.AgentPool(
        "gpu_node_pool",
        agent_pool_name="gpupool",  # Name of the GPU node pool
        mode="User",                # User node pool, not used for system services
        os_type="Linux",
        vm_size="Standard_NC6",     # GPU-enabled VM size
        count=1,                    # Number of nodes
        resource_group_name=resource_group.name,
        resource_name_=managed_cluster.name,  # The cluster this pool belongs to
        # Customizable settings like scaling, taints, and labels go here...
    )

    # Export the cluster name and kubeconfig for accessing the cluster
    pulumi.export("cluster_name", managed_cluster.name)

    creds = azure_native.containerservice.list_managed_cluster_user_credentials_output(
        resource_group_name=resource_group.name,
        resource_name=managed_cluster.name,
    )
    # The returned kubeconfig is base64-encoded; decode it and mark it as a secret
    kubeconfig = creds.kubeconfigs[0].value.apply(
        lambda enc: base64.b64decode(enc).decode("utf-8")
    )
    pulumi.export("kubeconfig", pulumi.Output.secret(kubeconfig))

    In the program above:

    • We created a new resource group to host the AKS cluster. You can alternatively use an existing resource group.

    • We then set up an AKS cluster with an initial agent pool configuration. The pool is defined with GPU-enabled virtual machines by specifying the vm_size.

    • We have a dedicated node pool specifically configured with GPU resources. Again, we specify the VM size that includes a GPU. You can adjust the count parameter to have more than one node if you need more GPU resources.

    • Finally, we export the cluster_name and kubeconfig to allow access to the AKS cluster. The kubeconfig is marked as a secret because it contains sensitive information used to authenticate to the cluster.
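    The decoding step can be seen in isolation: the Azure credentials API returns the kubeconfig base64-encoded, so it must be decoded before use. Here is that step sketched with a dummy payload standing in for the real API response:

```python
import base64

def decode_kubeconfig(encoded: str) -> str:
    """Decode a base64-encoded kubeconfig string, as returned by the
    cluster user-credentials API."""
    return base64.b64decode(encoded).decode("utf-8")

# Dummy payload standing in for the real API response
sample = base64.b64encode(b"apiVersion: v1\nkind: Config\n").decode("ascii")
print(decode_kubeconfig(sample))
```

    After `pulumi up`, the decoded kubeconfig can be retrieved with `pulumi stack output kubeconfig --show-secrets`.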

    For any AKS deployment, you might also need to consider other settings such as networking, storage, monitoring, and autoscaling, depending on your application requirements. The above program can be augmented with additional properties to manage these aspects.
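    Provisioning the node pool is only half the story: pods actually reach the GPUs by requesting the nvidia.com/gpu extended resource, which is exposed once the NVIDIA device plugin is running on the nodes. The helper and container image below are illustrative assumptions, not part of the program above:

```python
# Sketch of a Pod manifest requesting GPUs via the nvidia.com/gpu resource.
# Assumes the NVIDIA device plugin is installed on the cluster; the function
# name and image are hypothetical, for illustration only.
def gpu_pod_manifest(name: str, image: str, gpus: int = 1) -> dict:
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "containers": [{
                "name": name,
                "image": image,
                "resources": {
                    # GPUs can only be requested as limits, in whole units
                    "limits": {"nvidia.com/gpu": str(gpus)},
                },
            }],
        },
    }

manifest = gpu_pod_manifest("cuda-smoke-test", "nvidia/cuda:12.2.0-base-ubuntu22.04")
print(manifest["spec"]["containers"][0]["resources"]["limits"])
```

    A dict like this could then be deployed with the pulumi_kubernetes provider, configured against the exported kubeconfig.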