1. Auto-Scaling ML Workloads with Kubernetes Horizontal Pod Autoscaler


    The Kubernetes Horizontal Pod Autoscaler (HPA) automatically scales the number of pods in a deployment or replica set based on observed CPU utilization, memory usage, or other supported metrics. It is useful for ML workloads whose load varies over time and which need more replicas at peak times to maintain performance.
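    To make the scaling behaviour concrete: for each metric, the HPA controller computes the desired replica count from the ratio of the observed value to the target value (ignoring the tolerance and stabilization windows it also applies). The short Python sketch below only illustrates that documented formula; it is not part of the Pulumi program.

    import math

    def desired_replicas(current_replicas: int, current_utilization: float, target_utilization: float) -> int:
        # desiredReplicas = ceil(currentReplicas * currentMetricValue / desiredMetricValue)
        return math.ceil(current_replicas * current_utilization / target_utilization)

    # Example: 4 pods averaging 80% CPU against a 50% target scale out to 7 pods.
    print(desired_replicas(4, 80, 50))  # 7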

    Below I will guide you through the basics of setting up auto-scaling for ML workloads using Pulumi and the Kubernetes HPA. The program creates an HPA resource that targets a Kubernetes deployment you specify.

    The main components to set up auto-scaling for ML workloads are:

    • A Kubernetes Deployment that runs your ML application.
    • A Service to expose the application, if it needs to be accessible.
    • A HorizontalPodAutoscaler to automatically scale the deployment based on defined rules.

    In the Pulumi program, we first make sure a deployment exists for the HPA to target, and then define the HPA with the rules for scaling up and down.

    In this example, let's consider the scenario where an ML application is packaged in a Docker image, and we want to scale based on CPU utilization.
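    If the deployment does not exist yet, a minimal version of it (plus a Service to expose it) could look like the following sketch. The image name, port, and resource values are placeholders to replace with your own; note that CPU-utilization scaling only works if the containers declare CPU requests, since utilization is measured relative to those requests. If you create the deployment in the same Pulumi program, you can also pass the resource object to the HPA directly instead of looking it up with .get() as done further below.

    import pulumi_kubernetes as k8s

    app_name = 'ml-app'
    app_labels = {'app': app_name}

    # Hypothetical Deployment running the containerized ML application.
    ml_deployment = k8s.apps.v1.Deployment(
        'ml-app-deployment',
        metadata=k8s.meta.v1.ObjectMetaArgs(name=app_name, namespace='default'),
        spec=k8s.apps.v1.DeploymentSpecArgs(
            replicas=2,
            selector=k8s.meta.v1.LabelSelectorArgs(match_labels=app_labels),
            template=k8s.core.v1.PodTemplateSpecArgs(
                metadata=k8s.meta.v1.ObjectMetaArgs(labels=app_labels),
                spec=k8s.core.v1.PodSpecArgs(
                    containers=[k8s.core.v1.ContainerArgs(
                        name=app_name,
                        image='my-registry/ml-app:latest',  # placeholder image
                        ports=[k8s.core.v1.ContainerPortArgs(container_port=8080)],
                        # CPU requests are required for utilization-based autoscaling.
                        resources=k8s.core.v1.ResourceRequirementsArgs(
                            requests={'cpu': '500m', 'memory': '512Mi'},
                        ),
                    )],
                ),
            ),
        ),
    )

    # Optional Service exposing the application inside the cluster.
    ml_service = k8s.core.v1.Service(
        'ml-app-service',
        metadata=k8s.meta.v1.ObjectMetaArgs(name=app_name, namespace='default'),
        spec=k8s.core.v1.ServiceSpecArgs(
            selector=app_labels,
            ports=[k8s.core.v1.ServicePortArgs(port=80, target_port=8080)],
        ),
    )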

    Here is a Pulumi program that sets up auto-scaling for such a workload:

    import pulumi
    import pulumi_kubernetes as k8s
    from pulumi_kubernetes.autoscaling.v2beta2 import HorizontalPodAutoscaler

    # Assume that the user already has a deployment they want to scale.
    # This is how you would get a reference to an existing deployment in the
    # namespace 'default' (the ID passed to .get() has the form '<namespace>/<name>').
    app_name = 'ml-app'
    app_labels = {'app': app_name}

    existing_deployment = k8s.apps.v1.Deployment.get(
        'existing-deployment',
        f'default/{app_name}'
    )

    # Define the HPA
    ml_app_hpa = HorizontalPodAutoscaler(
        'ml-app-hpa',
        metadata=k8s.meta.v1.ObjectMetaArgs(
            name='ml-app-hpa',
            namespace='default'
        ),
        spec=k8s.autoscaling.v2beta2.HorizontalPodAutoscalerSpecArgs(
            max_replicas=10,  # Maximum number of replicas to which the application can be scaled
            min_replicas=2,   # Minimum number of replicas of the application
            scale_target_ref=k8s.autoscaling.v2beta2.CrossVersionObjectReferenceArgs(
                api_version='apps/v1',
                kind='Deployment',
                name=existing_deployment.metadata['name']
            ),
            metrics=[k8s.autoscaling.v2beta2.MetricSpecArgs(
                type='Resource',
                resource=k8s.autoscaling.v2beta2.ResourceMetricSourceArgs(
                    name='cpu',
                    target=k8s.autoscaling.v2beta2.MetricTargetArgs(
                        # Target percentage of CPU utilization at which to scale
                        type='Utilization',
                        average_utilization=50,
                    )
                )
            )]
        )
    )

    pulumi.export('horizontal_pod_autoscaler_name', ml_app_hpa.metadata['name'])

    In this program:

    • We define app_name and app_labels, the name and labels that identify the existing ML application deployment you wish to auto-scale.
    • We use pulumi_kubernetes.apps.v1.Deployment.get to look up the existing deployment by its ID, given in the form '<namespace>/<name>'.
    • We set up the HorizontalPodAutoscaler with a spec whose min_replicas and max_replicas bound the minimum and maximum number of pods that may run.
    • Under metrics, we tell the autoscaler to use CPU utilization as the scaling metric. Setting average_utilization to 50 means the HPA adds pods when average CPU utilization across the pods rises above 50% of their requested CPU, and removes pods when it falls well below that target.
    • A CrossVersionObjectReferenceArgs is used to reference the target deployment we want to scale based on the defined metrics.
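    One caveat about API versions: autoscaling/v2beta2 is deprecated and was removed in Kubernetes 1.26, so on current clusters you would define the same HPA against the stable autoscaling/v2 API. The spec has the same shape; assuming the same existing_deployment reference, an equivalent resource might look like this sketch:

    # Equivalent HPA using the stable autoscaling/v2 API (GA since Kubernetes 1.23).
    ml_app_hpa = k8s.autoscaling.v2.HorizontalPodAutoscaler(
        'ml-app-hpa',
        metadata=k8s.meta.v1.ObjectMetaArgs(name='ml-app-hpa', namespace='default'),
        spec=k8s.autoscaling.v2.HorizontalPodAutoscalerSpecArgs(
            min_replicas=2,
            max_replicas=10,
            scale_target_ref=k8s.autoscaling.v2.CrossVersionObjectReferenceArgs(
                api_version='apps/v1',
                kind='Deployment',
                name=existing_deployment.metadata['name'],
            ),
            metrics=[k8s.autoscaling.v2.MetricSpecArgs(
                type='Resource',
                resource=k8s.autoscaling.v2.ResourceMetricSourceArgs(
                    name='cpu',
                    target=k8s.autoscaling.v2.MetricTargetArgs(
                        type='Utilization',
                        average_utilization=50,
                    ),
                ),
            )],
        ),
    )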

    The final pulumi.export publishes the HPA's name as a stack output, which you can read back later with pulumi stack output horizontal_pod_autoscaler_name.

    Make sure the Pulumi CLI is installed and your Kubernetes cluster context is set correctly so Pulumi can reach the cluster, and note that resource-metric autoscaling requires the Kubernetes Metrics Server (or another metrics provider) to be running in the cluster. Run the program with pulumi up to create the resources and pulumi destroy to clean them up. By setting up the Horizontal Pod Autoscaler you instruct Kubernetes to manage the number of pods dynamically, which is essential for handling the varying load of ML applications efficiently.