1. Auto-Scaling AI Workloads with Kubernetes Horizontal Pod Autoscaler


    The Kubernetes Horizontal Pod Autoscaler (HPA) automatically scales the number of pods in a replication controller, deployment, replica set, or stateful set based on observed CPU utilization or other selected metrics. This is particularly useful for workloads such as AI tasks, which are computationally expensive and may have varying resource requirements over time.

    In Pulumi, you declare resources by instantiating resource classes in a Python program. For managing Kubernetes resources, Pulumi provides an SDK that maps directly to Kubernetes API objects. To leverage an HPA, you would typically define a Kubernetes Deployment for your AI workload and then a HorizontalPodAutoscaler resource to manage scaling that deployment.

    Below is a Pulumi program written in Python that demonstrates how to deploy an AI workload and configure auto-scaling using the HPA. This particular program uses the pulumi_kubernetes Python package.

    import pulumi
    import pulumi_kubernetes as k8s

    # Define the Kubernetes Deployment for the AI workload
    ai_workload = k8s.apps.v1.Deployment(
        "ai-workload",
        spec=k8s.apps.v1.DeploymentSpecArgs(
            selector=k8s.meta.v1.LabelSelectorArgs(match_labels={"app": "ai-workload"}),
            replicas=1,
            template=k8s.core.v1.PodTemplateSpecArgs(
                metadata=k8s.meta.v1.ObjectMetaArgs(labels={"app": "ai-workload"}),
                spec=k8s.core.v1.PodSpecArgs(
                    containers=[
                        k8s.core.v1.ContainerArgs(
                            name="ai-container",
                            image="gcr.io/my-ai-project/ai-service:latest",
                            resources=k8s.core.v1.ResourceRequirementsArgs(
                                requests={"cpu": "500m", "memory": "500Mi"},
                                limits={"cpu": "1000m", "memory": "1000Mi"},
                            ),
                        )
                    ]
                ),
            ),
        ),
    )

    # Define the Horizontal Pod Autoscaler that targets the AI workload Deployment
    ai_workload_hpa = k8s.autoscaling.v1.HorizontalPodAutoscaler(
        "ai-workload-hpa",
        spec=k8s.autoscaling.v1.HorizontalPodAutoscalerSpecArgs(
            scale_target_ref=k8s.autoscaling.v1.CrossVersionObjectReferenceArgs(
                api_version="apps/v1",
                kind="Deployment",
                name=ai_workload.metadata["name"],
            ),
            min_replicas=1,
            max_replicas=10,
            target_cpu_utilization_percentage=80,
        ),
    )

    # Export the names of the workload and the HPA
    pulumi.export("workload_name", ai_workload.metadata["name"])
    pulumi.export("hpa_name", ai_workload_hpa.metadata["name"])

    What this program does:

    1. It defines a Kubernetes Deployment called ai_workload. This deployment specifies the desired state of your AI service, like the number of replicas, the container image to use, and resource requirements.

    2. Then it defines a HorizontalPodAutoscaler called ai_workload_hpa. The HPA automatically adjusts the number of pods as needed to maintain an average CPU utilization across all pods of 80%. It does that by scaling the number of replicas in the deployment between a minimum of 1 and a maximum of 10.

    The min_replicas and max_replicas fields specify the scaling range, and target_cpu_utilization_percentage sets the average CPU utilization the autoscaler tries to maintain.
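    To make the 80% target concrete, the HPA v1 controller's core rule can be sketched in plain Python (this is an approximation for illustration only, not part of the Pulumi program; the real controller also applies tolerances and stabilization windows):

    ```python
    import math

    def desired_replicas(current_replicas, current_cpu_percent,
                         target_cpu_percent=80, min_replicas=1, max_replicas=10):
        """Approximate the HPA v1 rule:
        desired = ceil(current * currentUtilization / targetUtilization),
        clamped to [min_replicas, max_replicas]."""
        desired = math.ceil(current_replicas * current_cpu_percent / target_cpu_percent)
        return max(min_replicas, min(max_replicas, desired))

    # If 4 pods average 120% CPU utilization against the 80% target,
    # the HPA scales up: ceil(4 * 120 / 80) = 6.
    print(desired_replicas(4, 120))  # 6

    # If they average only 40%, it scales down: ceil(4 * 40 / 80) = 2.
    print(desired_replicas(4, 40))   # 2
    ```

    Note how the min/max bounds cap the result: even under extreme load, the replica count never exceeds max_replicas.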

    The exported values at the end are useful for confirming the names of the deployed resources and might be used as identifiers in other Pulumi stacks or for querying the Kubernetes cluster about these resources.
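    As a sketch of that cross-stack usage, another Pulumi program could consume these outputs through a StackReference (the stack path "my-org/ai-infra/prod" below is a placeholder; this only runs inside a Pulumi deployment):

    ```python
    import pulumi

    # Reference the stack that deployed the workload; the stack path is a placeholder.
    infra = pulumi.StackReference("my-org/ai-infra/prod")

    # Exported values arrive as Pulumi Output objects in the consuming program.
    workload_name = infra.get_output("workload_name")
    hpa_name = infra.get_output("hpa_name")
    ```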

    Remember to adjust the container image, requests, and limits to fit your actual AI workload requirements, and to modify min_replicas, max_replicas, and target_cpu_utilization_percentage to match your scaling preferences and expected load pattern.

    When you run the program with Pulumi, these resources are deployed to your Kubernetes cluster. Pulumi also prints outputs after the run, showing the workload_name and hpa_name, which can be useful for tracking and managing your resources.
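    A typical workflow from the command line might look like the following (assuming Pulumi and kubectl are installed and configured for your cluster):

    ```shell
    # Preview and deploy the stack
    pulumi up

    # Read the exported names after deployment
    pulumi stack output workload_name
    pulumi stack output hpa_name

    # Optionally inspect the live HPA with kubectl, using the exported name
    kubectl get hpa "$(pulumi stack output hpa_name)"
    ```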

    For further reading on HorizontalPodAutoscaler and how to configure it, please refer to the HorizontalPodAutoscaler v1 documentation.