Multi-tenant AI Model Serving with Argo and Kubernetes

Question

Pulumi · Accepted Answer

To achieve multi-tenant AI model serving with Argo and Kubernetes using Pulumi, we’ll need to create a Kubernetes cluster, deploy Argo Workflows, and set up a way to serve different AI models. In this scenario, we will focus on the infrastructure setup using Pulumi with an example in Python.

The basic workflow will be as follows:

1. Create a Kubernetes cluster where our workloads will run. For this example, we’ll use Azure Kubernetes Service (AKS), as it offers managed Kubernetes clusters that abstract away some of the more complicated aspects of Kubernetes setup.
2. With the Kubernetes cluster in place, deploy Argo Workflows, a Kubernetes-native workflow engine for orchestrating parallel jobs on Kubernetes.
3. Finally, set up a model serving tool such as Seldon or KServe (formerly KFServing). To keep things simple, we won’t dive into the specifics of AI model serving here; in a real-world scenario, you'd likely use containerized models served via a tool designed for this purpose.

Let me guide you through the Pulumi code required to set up the infrastructure for this.

First, we need a Kubernetes cluster on Azure. We can use the Azure Native Pulumi provider to create an AKS cluster (ManagedCluster). The program defines a resource group to contain the cluster and a virtual network with a subnet for its nodes, and it takes the credentials of an existing service principal, which the AKS cluster uses to interact with other Azure services.

We will not cover the Argo Workflows setup or the model serving configuration in detail, but instructions for these can typically be found in the respective official documentation. Once the infrastructure is ready, you would use Kubernetes resources such as Deployments, Services, and Ingress to deploy your ML models and expose them as necessary.

Here's a basic Pulumi program that creates an AKS cluster to get you started:

import base64

import pulumi
from pulumi_azure_native import containerservice, network, resources

# Create an Azure Resource Group
resource_group = resources.ResourceGroup("aiResourceGroup")

# Create an Azure Virtual Network for the AKS cluster
vnet = network.VirtualNetwork(
    "aiVnet",
    resource_group_name=resource_group.name,
    address_space=network.AddressSpaceArgs(
        address_prefixes=["10.2.0.0/16"],
    ),
)

# Create a subnet for the AKS nodes (with Azure CNI, pods get IPs from it too)
subnet = network.Subnet(
    "aiSubnet",
    resource_group_name=resource_group.name,
    virtual_network_name=vnet.name,
    address_prefix="10.2.1.0/24",
)

# Create the AKS cluster
aks_cluster = containerservice.ManagedCluster(
    "aiManagedCluster",
    resource_group_name=resource_group.name,
    agent_pool_profiles=[
        containerservice.ManagedClusterAgentPoolProfileArgs(
            name="agentpool",
            mode="System",
            count=1,  # number of nodes in the pool
            max_pods=110,
            os_disk_size_gb=30,
            os_type="Linux",
            type="VirtualMachineScaleSets",
            vm_size="Standard_DS2_v2",
            vnet_subnet_id=subnet.id,  # the subnet is attached per node pool
        )
    ],
    dns_prefix="ai-k8s",
    enable_rbac=True,
    # kubernetes_version is left unset so AKS picks its current default;
    # pin it to a version your region supports if you need reproducibility.
    linux_profile=containerservice.ContainerServiceLinuxProfileArgs(
        admin_username="adminuser",
        ssh=containerservice.ContainerServiceSshConfigurationArgs(
            public_keys=[
                containerservice.ContainerServiceSshPublicKeyArgs(
                    key_data="<YOUR_SSH_PUBLIC_KEY>"
                )
            ]
        ),
    ),
    network_profile=containerservice.ContainerServiceNetworkProfileArgs(
        network_plugin="azure",
        # Service range kept outside the VNet's 10.2.0.0/16 so it cannot
        # clash with current or future subnets.
        service_cidr="10.3.0.0/24",
        dns_service_ip="10.3.0.10",
        load_balancer_sku="standard",
    ),
    service_principal_profile=containerservice.ManagedClusterServicePrincipalProfileArgs(
        client_id="<Your_Service_Principal_Client_ID>",
        secret="<Your_Service_Principal_Secret>",
    ),
)

# Fetch the cluster's user credentials and decode the base64-encoded kubeconfig
creds = containerservice.list_managed_cluster_user_credentials_output(
    resource_group_name=resource_group.name,
    resource_name=aks_cluster.name,
)
kubeconfig = creds.kubeconfigs[0].value.apply(
    lambda encoded: base64.b64decode(encoded).decode("utf-8")
)

# Export the kubeconfig as a secret so it is encrypted in the stack state
pulumi.export("kubeconfig", pulumi.Output.secret(kubeconfig))

Please replace <YOUR_SSH_PUBLIC_KEY>, <Your_Service_Principal_Client_ID>, and <Your_Service_Principal_Secret> with the actual values you want to use.
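The address ranges passed to the cluster have to fit together, and a mismatch usually only shows up as a failed deployment. As a rough pre-flight check, here is a minimal sketch using only Python's standard library. The function name check_aks_ranges and the example values are illustrative, and the rules it encodes (subnet inside the VNet, service CIDR not overlapping the node subnet, DNS service IP inside the service CIDR but not its first address) are my reading of the AKS networking guidance, so verify them against the current Azure documentation.

```python
import ipaddress


def check_aks_ranges(vnet_cidr, subnet_cidr, service_cidr, dns_service_ip):
    """Return a list of problems; an empty list means the ranges look consistent."""
    vnet = ipaddress.ip_network(vnet_cidr)
    subnet = ipaddress.ip_network(subnet_cidr)
    service = ipaddress.ip_network(service_cidr)
    dns_ip = ipaddress.ip_address(dns_service_ip)

    problems = []
    # The node subnet must be carved out of the virtual network.
    if not subnet.subnet_of(vnet):
        problems.append(f"subnet {subnet} is not inside VNet {vnet}")
    # Service IPs are virtual and must not collide with node/pod addresses.
    if service.overlaps(subnet):
        problems.append(f"service CIDR {service} overlaps node subnet {subnet}")
    # The cluster DNS service needs an address from the service range,
    # but not the first one, which Kubernetes takes for its own API service.
    if dns_ip not in service:
        problems.append(f"DNS service IP {dns_ip} is outside service CIDR {service}")
    elif dns_ip == service.network_address + 1:
        problems.append(f"DNS service IP {dns_ip} is the first address of {service}")
    return problems


# Illustrative values: a consistent layout, then a deliberately broken one.
print(check_aks_ranges("10.2.0.0/16", "10.2.1.0/24", "10.3.0.0/24", "10.3.0.10"))
print(check_aks_ranges("10.2.0.0/16", "10.2.1.0/24", "10.2.1.0/24", "10.2.1.1"))
```

Running a check like this before pulumi up is cheap; it does not replace Azure's own validation, which also considers subnet size relative to node count and max_pods.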

The code above gives you the base Kubernetes cluster. The next steps would be to install Argo Workflows and deploy your AI models, for example with kubectl using the exported kubeconfig. You would typically encapsulate the deployment instructions for Argo and the models in Kubernetes YAML manifests or Helm charts once the cluster is ready.
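To make the manifest step more concrete, here is a minimal sketch that generates a Deployment and a Service for one tenant's model server as plain Python dictionaries and prints them as a JSON List, which kubectl apply also accepts. Everything specific in it is an assumption for illustration: the helper names, the tenant name tenant-a, the model name sentiment, the image registry.example.com/tenant-a/sentiment:1.0, and port 8080. It also assumes a namespace named after the tenant already exists.

```python
import json


def model_serving_manifests(tenant, model_name, image, port=8080, replicas=1):
    """Build a Deployment and a ClusterIP Service for one tenant's model server."""
    labels = {"app": model_name, "tenant": tenant}
    deployment = {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": model_name, "namespace": tenant, "labels": labels},
        "spec": {
            "replicas": replicas,
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    "containers": [
                        {
                            "name": "model-server",
                            "image": image,
                            "ports": [{"containerPort": port}],
                        }
                    ]
                },
            },
        },
    }
    service = {
        "apiVersion": "v1",
        "kind": "Service",
        "metadata": {"name": model_name, "namespace": tenant, "labels": labels},
        # Route cluster-internal traffic on port 80 to the model container.
        "spec": {"selector": labels, "ports": [{"port": 80, "targetPort": port}]},
    }
    return [deployment, service]


def as_kubectl_list(manifests):
    """Wrap manifests in a v1 List so they can be applied in one command."""
    return json.dumps({"apiVersion": "v1", "kind": "List", "items": manifests}, indent=2)


# Hypothetical tenant, model, and image names.
manifests = model_serving_manifests(
    "tenant-a", "sentiment", "registry.example.com/tenant-a/sentiment:1.0"
)
print(as_kubectl_list(manifests))
```

The output could be piped into kubectl apply -f -, or the same dictionaries could be adapted for a Helm chart or a Pulumi Kubernetes resource. A dedicated serving tool such as Seldon or KServe would replace the plain Deployment with its own custom resource.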

Remember that managing multi-tenant services can be complex, involving careful design of networking, storage, and compute resources, along with appropriate security, logging, and monitoring services. The provided code is a starting point, and additional work would be needed on top of this to cater to those concerns.
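One common starting point for tenant separation is a namespace per tenant, with a ResourceQuota to cap compute and a NetworkPolicy to restrict traffic. The sketch below builds those three manifests as Python dictionaries. The function name, the tenant name, and the quota figures are hypothetical, and this is only one of several possible isolation models, not a complete security design.

```python
import json


def tenant_isolation_manifests(tenant, cpu_limit="8", memory_limit="16Gi", max_pods="20"):
    """Build a Namespace, ResourceQuota, and NetworkPolicy for one tenant."""
    namespace = {
        "apiVersion": "v1",
        "kind": "Namespace",
        "metadata": {"name": tenant, "labels": {"tenant": tenant}},
    }
    quota = {
        "apiVersion": "v1",
        "kind": "ResourceQuota",
        "metadata": {"name": "tenant-quota", "namespace": tenant},
        # Cap the total CPU/memory limits and pod count in the namespace.
        "spec": {
            "hard": {
                "limits.cpu": cpu_limit,
                "limits.memory": memory_limit,
                "pods": max_pods,
            }
        },
    }
    network_policy = {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "same-namespace-only", "namespace": tenant},
        "spec": {
            # An empty podSelector applies the policy to every pod in the namespace.
            "podSelector": {},
            "policyTypes": ["Ingress"],
            # Allow ingress only from pods in this same namespace.
            "ingress": [{"from": [{"podSelector": {}}]}],
        },
    }
    return [namespace, quota, network_policy]


# Hypothetical tenant names; one set of manifests per tenant.
for tenant in ["tenant-a", "tenant-b"]:
    for manifest in tenant_isolation_manifests(tenant):
        print(json.dumps(manifest))
```

A few caveats apply. NetworkPolicy objects are only enforced when the cluster runs a network policy engine, which on AKS has to be enabled in the cluster's network profile. Once a quota on limits.cpu and limits.memory exists, pods that do not declare those limits are rejected, so workloads need explicit limits or a LimitRange providing defaults. The policy shown also blocks traffic from an ingress controller in another namespace, so it would need an extra rule if models are exposed externally.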