GPU-Accelerated Workloads in AKS
To deploy GPU-accelerated workloads on Azure Kubernetes Service (AKS), you need to create an AKS cluster with a node pool that includes GPU-enabled virtual machines. In Pulumi, you accomplish this by using the `azure-native.containerservice.ManagedCluster` resource to create a Kubernetes cluster and the `azure-native.containerservice.AgentPool` resource to define the node pool necessary for GPU workloads.

Azure provides various types of virtual machines suitable for GPU-accelerated workloads, such as the NC and NV series. When creating the node pool, specify an appropriate VM size (e.g., `Standard_NC6`) that comes equipped with GPUs.

Here is a step-by-step Python program in Pulumi to create an AKS cluster with a node pool suitable for GPU-accelerated workloads:
- Import the necessary Pulumi modules.
- Create a new AKS cluster.
- Create a new agent pool with GPU-enabled virtual machines, selecting a VM size that supports GPUs, such as `Standard_NC6`.
- Export the required outputs, such as the Kubernetes cluster name and the kubeconfig needed to access the cluster.
Let's move on to the program:
```python
import base64

import pulumi
import pulumi_azure_native as azure_native

# Create a new resource group to contain the AKS cluster
resource_group = azure_native.resources.ResourceGroup("gpu_workloads_rg")

# Define the settings for the AKS cluster
managed_cluster = azure_native.containerservice.ManagedCluster(
    "gpu_workloads_cluster",
    resource_group_name=resource_group.name,
    dns_prefix="gpuk8s",  # Replace with your DNS prefix
    agent_pool_profiles=[{
        "count": 1,  # Number of nodes in the initial node pool
        "vm_size": "Standard_NC6",  # GPU-enabled VM size
        "name": "gpunodepool",
        "os_type": "Linux",
        "mode": "System",  # Every AKS cluster needs at least one system node pool
    }],
    identity={
        "type": "SystemAssigned",
    },
    # Additional cluster settings go here...
)

# Define the node pool for GPU workloads
gpu_node_pool = azure_native.containerservice.AgentPool(
    "gpu_node_pool",
    agent_pool_name="gpupool",  # Name of the GPU node pool
    mode="User",  # User node pool, not used for system services
    os_type="Linux",
    vm_size="Standard_NC6",  # GPU-enabled VM size
    count=1,  # Number of nodes
    resource_group_name=resource_group.name,
    resource_name_=managed_cluster.name,  # The managed cluster this pool belongs to
    # Customizable settings like scaling, taints, and labels can be specified here...
)

# Export the cluster name and kubeconfig for accessing the cluster
pulumi.export("cluster_name", managed_cluster.name)

creds = azure_native.containerservice.list_managed_cluster_user_credentials_output(
    resource_group_name=resource_group.name,
    resource_name=managed_cluster.name,
)
# The credential value is base64-encoded; decode it and mark it as a secret
kubeconfig = creds.kubeconfigs[0].value.apply(
    lambda enc: base64.b64decode(enc).decode("utf-8")
)
pulumi.export("kubeconfig", pulumi.Output.secret(kubeconfig))
```
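One detail in the final export is worth calling out: the AKS credentials API returns the kubeconfig content base64-encoded, so it must be decoded before it is usable as YAML. In isolation, the decode step looks like this (the sample content is a stand-in):

```python
import base64

# AKS returns kubeconfig content base64-encoded; round-trip a stand-in
# value here to show the decode step in isolation.
sample_kubeconfig = "apiVersion: v1\nkind: Config\n"
encoded = base64.b64encode(sample_kubeconfig.encode("utf-8")).decode("ascii")
decoded = base64.b64decode(encoded).decode("utf-8")
print(decoded.splitlines()[0])  # apiVersion: v1
```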
In the program above:
- We created a new resource group to host the AKS cluster. You can alternatively use an existing resource group.
- We then set up an AKS cluster with an initial agent pool configuration. The pool is defined with GPU-enabled virtual machines by specifying the `vm_size`.
- We added a dedicated node pool specifically configured with GPU resources, again specifying a VM size that includes a GPU. You can adjust the `count` parameter to have more than one node if you need more GPU resources.
- Finally, we export the `cluster_name` and `kubeconfig` to allow access to the AKS cluster. The `kubeconfig` is marked as a secret because it contains sensitive information used to authenticate to the cluster.
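With the kubeconfig exported, you can deploy GPU workloads to the cluster. Kubernetes schedules a pod onto the GPU nodes when the pod requests the `nvidia.com/gpu` resource, which requires the NVIDIA device plugin to be running on those nodes. Here is a minimal sketch of such a pod manifest as a plain Python dict; the pod name and container image are illustrative:

```python
# Sketch of a pod manifest that requests one GPU via the nvidia.com/gpu
# resource limit. The name and image are illustrative; any CUDA-capable
# image works. Apply it with kubectl or a Pulumi Kubernetes provider.
gpu_pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "gpu-smoke-test"},
    "spec": {
        "restartPolicy": "Never",
        "containers": [{
            "name": "cuda",
            "image": "nvidia/cuda:12.2.0-base-ubuntu22.04",
            "command": ["nvidia-smi"],  # prints GPU info if scheduling worked
            "resources": {"limits": {"nvidia.com/gpu": 1}},
        }],
    },
}
print(gpu_pod["spec"]["containers"][0]["resources"]["limits"])
```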
For any AKS deployment, you might also need to consider other settings such as networking, storage, monitoring, and autoscaling, depending on your application requirements. The above program can be augmented with additional properties to manage these aspects.
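For instance, the GPU node pool can grow and shrink with demand by enabling the cluster autoscaler; the `AgentPool` resource accepts `enable_auto_scaling`, `min_count`, and `max_count` arguments for this. A sketch of the extra arguments, with illustrative bounds:

```python
# Extra AgentPool arguments that enable the cluster autoscaler for the
# GPU pool; the min/max bounds here are illustrative.
autoscaling_args = {
    "enable_auto_scaling": True,  # let AKS add/remove GPU nodes on demand
    "min_count": 1,               # keep at least one GPU node available
    "max_count": 3,               # cap the number of costly GPU VMs
}
# These would be passed alongside the other AgentPool arguments, e.g.:
# azure_native.containerservice.AgentPool("gpu_node_pool", ..., **autoscaling_args)
print(sorted(autoscaling_args))
```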