1. Configuring Azure Kubernetes Service (AKS) for AI Workloads via Portal


    Configuring Azure Kubernetes Service (AKS) for AI (Artificial Intelligence) workloads means making sure the cluster can actually run your AI applications: it needs sufficient compute resources, the necessary services installed, and network settings that support both performance and security.

    In the context of using Pulumi for infrastructure as code (IaC), rather than configuring AKS manually through the Azure Portal, you would define your infrastructure programmatically in a language like Python using the azure-native package. This allows for repeatable and consistent deployments.

    I will provide you with a Pulumi program written in Python that demonstrates how to create an AKS cluster that could be used for AI workloads. The code will include comments explaining each part of the process. This Pulumi program will cover:

    • Creating an AKS cluster with a default node pool.
    • Configuring the Kubernetes version.
    • Setting up network profiles for Kubernetes.
    • Enabling RBAC (Role-Based Access Control).
    • Creating an Azure Container Registry, if needed, for storing and managing private Docker container images used in AI workloads.

    Below is the Python program for creating an AKS cluster using Pulumi:

    import pulumi
    import pulumi_azure_native as azure_native

    # Define the resource group in which all resources will be created
    resource_group = azure_native.resources.ResourceGroup('ai-workloads-rg')

    # Create an AKS managed cluster
    managed_cluster = azure_native.containerservice.ManagedCluster(
        "ai-workloads-aks",
        resource_group_name=resource_group.name,
        location=resource_group.location,
        # DNS prefix for the API server hostname (example value)
        dns_prefix="ai-workloads-aks",
        # Define the properties for the AKS cluster suitable for AI workloads
        identity=azure_native.containerservice.ManagedClusterIdentityArgs(
            type="SystemAssigned"
        ),
        agent_pool_profiles=[
            azure_native.containerservice.ManagedClusterAgentPoolProfileArgs(
                name="defaultpool",
                # Specify the VM size depending on your workload needs
                vm_size="Standard_NC6",  # Example GPU-enabled VM size; confirm it is offered in your region
                count=3,  # Number of VMs in the node pool
                os_type="Linux",
                mode="System",
            )
        ],
        linux_profile=azure_native.containerservice.ContainerServiceLinuxProfileArgs(
            admin_username="azureuser",
            ssh=azure_native.containerservice.SshConfigurationArgs(
                public_keys=[
                    # Replace '<ssh-rsa-key>' with your actual SSH RSA public key
                    azure_native.containerservice.SshPublicKeyArgs(
                        key_data="<ssh-rsa-key>"
                    )
                ]
            )
        ),
        # Configure networking features such as DNS, routes, and ports
        network_profile=azure_native.containerservice.ContainerServiceNetworkProfileArgs(
            network_plugin="kubenet"  # or "azure" for Azure CNI
        ),
        enable_rbac=True,  # Enable RBAC for security best practices
        # Example only; use a version AKS currently supports in your region
        kubernetes_version="1.21.2",
    )

    # If using private container images, create an Azure Container Registry
    container_registry = azure_native.containerregistry.Registry(
        "ai-workloads-acr",
        resource_group_name=resource_group.name,
        location=resource_group.location,
        sku=azure_native.containerregistry.SkuArgs(
            name="Basic",  # Choose between Basic, Standard, and Premium based on needs
        ),
        admin_user_enabled=True  # Enable admin user for simplicity (consider disabling for production)
    )

    # Export the cluster name and the registry login server, which will be needed to interact with ACR
    pulumi.export('cluster_name', managed_cluster.name)
    pulumi.export('registry_login_server', container_registry.login_server)

    In this program, we create an AKS cluster and an Azure Container Registry (ACR) within a resource group. The node pool uses virtual machines (VMs) suited to AI workloads; the example size, Standard_NC6, is a GPU-enabled NC-series size. Check that the size you choose is available in your region and subscription.
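
    The node pool settings described above can be sanity-checked locally before a deployment. The following is a minimal sketch, not part of Pulumi or the Azure SDK: validate_node_pool is a hypothetical helper that applies the AKS naming rule for Linux node pools (lowercase letters and digits, starting with a letter, at most 12 characters) and assumes a system pool needs at least one node.

```python
import re
from typing import List

# AKS Linux node pool names: lowercase alphanumeric, first character a letter, 1-12 characters.
_POOL_NAME = re.compile(r"[a-z][a-z0-9]{0,11}")


def validate_node_pool(name: str, count: int) -> List[str]:
    """Return a list of human-readable problems; an empty list means the settings look fine."""
    problems = []
    if not _POOL_NAME.fullmatch(name):
        problems.append(
            f"node pool name {name!r} must be 1-12 lowercase letters or digits and start with a letter"
        )
    if count < 1:
        # Assumption in this sketch: the pool is a system pool, which cannot be empty.
        problems.append(f"node count must be at least 1, got {count}")
    return problems


if __name__ == "__main__":
    # Values taken from the program above.
    for problem in validate_node_pool("defaultpool", 3):
        print(problem)
```

    Running the check before the deployment surfaces a bad name or count immediately instead of partway through provisioning.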

    We specify the desired Kubernetes version and enable RBAC for security. The version shown is only an example; use one that AKS currently supports in your region. Networking uses the "kubenet" plugin, which can be changed to Azure CNI by replacing it with "azure". Adding SSH keys allows secure access to the nodes for maintenance purposes.
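
    A common slip is deploying with the "<ssh-rsa-key>" placeholder still in place. As a rough sketch, a hypothetical helper such as looks_like_ssh_rsa_key below could check the value first. It assumes the OpenSSH single-line public key format, in which the base64 body of an RSA key begins with AAAAB3NzaC1yc2E; it does not verify that the key is otherwise valid.

```python
# Base64 encoding of the length-prefixed algorithm name "ssh-rsa" that opens every RSA key blob.
_RSA_BLOB_PREFIX = "AAAAB3NzaC1yc2E"


def looks_like_ssh_rsa_key(key_data: str) -> bool:
    """Cheap structural check for an OpenSSH-format RSA public key line."""
    parts = key_data.strip().split()
    # Expected layout: "ssh-rsa <base64 blob> [optional comment]"
    if len(parts) < 2:
        return False
    return parts[0] == "ssh-rsa" and parts[1].startswith(_RSA_BLOB_PREFIX)


if __name__ == "__main__":
    # The unreplaced placeholder from the program fails the check.
    print(looks_like_ssh_rsa_key("<ssh-rsa-key>"))
```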

    For deploying AI workloads that may require private container images, we've included the setup for an ACR with the Basic SKU. Admin user access is enabled for ease of use; for production environments, consider disabling it and granting the cluster pull access to the registry through its managed identity instead.

    Run the program with the Pulumi CLI (pulumi up). Pulumi then provisions the AKS cluster and ACR as defined above.

    Before deploying, replace "<ssh-rsa-key>" with your actual SSH RSA public key. Once the deployment is finished, you can use the AKS cluster for your AI workloads. The stack outputs include the AKS cluster name and the ACR login server, which you use when tagging and pushing Docker images to the registry.
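
    Image names pushed to ACR are prefixed with the registry's login server. The helper below is an illustrative sketch of composing such a reference from the registry_login_server output; image_reference, the repository name, and the example registry host are all hypothetical.

```python
def image_reference(login_server: str, repository: str, tag: str = "latest") -> str:
    """Build '<login-server>/<repository>:<tag>' for use when tagging and pushing an image."""
    server = login_server.strip().lower()
    # Tolerate a value pasted as a URL by dropping the scheme and any trailing slash.
    for scheme in ("https://", "http://"):
        if server.startswith(scheme):
            server = server[len(scheme):]
    server = server.rstrip("/")
    if not server or not repository:
        raise ValueError("login server and repository must both be non-empty")
    return f"{server}/{repository}:{tag}"


if __name__ == "__main__":
    # Hypothetical login server value, as it might appear in the stack outputs.
    print(image_reference("myregistry.azurecr.io", "trainer", "v1"))
```

    The resulting string is what you would pass as the image name when tagging a local image and pushing it to the registry.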