Auto-Scaling ML Pipelines with Kubernetes HPA
To set up auto-scaling for machine learning (ML) pipelines in a Kubernetes environment, we can leverage the Horizontal Pod Autoscaler (HPA). The HPA automatically scales the number of pod replicas in a Deployment or ReplicaSet based on observed CPU utilization or other selected metrics, adjusting the replica count so that the observed average CPU utilization matches the target specified by the user.
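The scaling decision itself follows a simple formula from the Kubernetes documentation: desiredReplicas = ceil(currentReplicas × currentMetricValue / desiredMetricValue). Here is a minimal sketch of that arithmetic in Python; it is illustrative only, not code the HPA controller actually runs:

```python
import math

def desired_replicas(current_replicas: int,
                     current_cpu_utilization: float,
                     target_cpu_utilization: float) -> int:
    """Illustrative version of the HPA scaling formula:
    desiredReplicas = ceil(currentReplicas * currentMetric / targetMetric)."""
    return math.ceil(current_replicas * current_cpu_utilization / target_cpu_utilization)

# Example: 2 replicas averaging 120% CPU against an 80% target -> 3 replicas.
print(desired_replicas(2, 120, 80))  # 3
```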
Below, I am going to illustrate a Pulumi program in Python that defines an ML pipeline deployment and a corresponding HPA resource. The program will:
- Create a Kubernetes Deployment that defines the desired state of the ML pipeline, including the container image to run.
- Define an HPA resource associated with the Deployment created in the first step. The HPA will target a certain average CPU utilization across all pods controlled by the deployment, scaling the number of replicas up or down based on this target.
Assumptions:
- You have a Kubernetes cluster up and running.
- The Pulumi CLI and the necessary providers are already set up.
- The Docker image for the ML pipeline is available (replace `your-ml-pipeline-image:latest` in the code with your actual image).
- You have configured `kubectl` to communicate with your Kubernetes cluster.
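If you want to verify that last assumption programmatically, a quick sanity check is possible with the official `kubernetes` Python client. This is an optional sketch, assuming the client is installed (`pip install kubernetes`):

```python
from kubernetes import client, config

# Load credentials from the same kubeconfig that kubectl uses.
config.load_kube_config()

# Query the API server's version; a successful call confirms connectivity.
version = client.VersionApi().get_code()
print(f"Connected to Kubernetes {version.git_version}")
```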
Now, let's start with the Pulumi program:
```python
import pulumi
import pulumi_kubernetes as k8s

# Define the ML pipeline deployment.
ml_pipeline_deployment = k8s.apps.v1.Deployment(
    "ml-pipeline-deployment",
    spec=k8s.apps.v1.DeploymentSpecArgs(
        replicas=2,  # Starting number of replicas
        selector=k8s.meta.v1.LabelSelectorArgs(
            match_labels={"app": "ml-pipeline"},
        ),
        template=k8s.core.v1.PodTemplateSpecArgs(
            metadata=k8s.meta.v1.ObjectMetaArgs(
                labels={"app": "ml-pipeline"},
            ),
            spec=k8s.core.v1.PodSpecArgs(
                containers=[
                    k8s.core.v1.ContainerArgs(
                        name="ml-pipeline-container",
                        image="your-ml-pipeline-image:latest",  # Replace with your actual image
                        resources=k8s.core.v1.ResourceRequirementsArgs(
                            requests={
                                "cpu": "500m",  # Requested CPU resources
                            },
                            limits={
                                "cpu": "1000m",  # CPU resource limits
                            },
                        ),
                    ),
                ],
            ),
        ),
    ),
)

# Define the Horizontal Pod Autoscaler.
ml_pipeline_hpa = k8s.autoscaling.v1.HorizontalPodAutoscaler(
    "ml-pipeline-hpa",
    metadata=k8s.meta.v1.ObjectMetaArgs(
        name="ml-pipeline-hpa",
        labels={"app": "ml-pipeline"},
    ),
    spec=k8s.autoscaling.v1.HorizontalPodAutoscalerSpecArgs(
        scale_target_ref=k8s.autoscaling.v1.CrossVersionObjectReferenceArgs(
            api_version="apps/v1",
            kind="Deployment",
            name=ml_pipeline_deployment.metadata.name,
        ),
        min_replicas=1,   # Minimum number of replicas
        max_replicas=10,  # Maximum number of replicas
        target_cpu_utilization_percentage=80,  # Target CPU utilization percentage
    ),
)

# Export the name of the deployment.
pulumi.export("ml_pipeline_deployment_name", ml_pipeline_deployment.metadata.name)

# Export the name of the HPA.
pulumi.export("ml_pipeline_hpa_name", ml_pipeline_hpa.metadata.name)
```
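The program above uses the `autoscaling/v1` API, which only supports CPU-based scaling. If your cluster serves `autoscaling/v2`, the same HPA can be expressed with the richer metrics API, which also opens the door to memory and custom metrics. Here is a sketch of the equivalent v2 resource, assuming your pulumi_kubernetes version exposes the `autoscaling.v2` module:

```python
# Equivalent HPA using the autoscaling/v2 metrics API (CPU utilization target of 80%).
ml_pipeline_hpa_v2 = k8s.autoscaling.v2.HorizontalPodAutoscaler(
    "ml-pipeline-hpa-v2",
    spec=k8s.autoscaling.v2.HorizontalPodAutoscalerSpecArgs(
        scale_target_ref=k8s.autoscaling.v2.CrossVersionObjectReferenceArgs(
            api_version="apps/v1",
            kind="Deployment",
            name=ml_pipeline_deployment.metadata.name,
        ),
        min_replicas=1,
        max_replicas=10,
        metrics=[
            k8s.autoscaling.v2.MetricSpecArgs(
                type="Resource",
                resource=k8s.autoscaling.v2.ResourceMetricSourceArgs(
                    name="cpu",
                    target=k8s.autoscaling.v2.MetricTargetArgs(
                        type="Utilization",
                        average_utilization=80,
                    ),
                ),
            ),
        ],
    ),
)
```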
Here's what each part of this Pulumi program is doing:
- The `ml_pipeline_deployment` resource creates a Kubernetes Deployment that specifies the desired state for the ML pipeline. We start with two replicas, and the pods carry the `app: ml-pipeline` label so the HPA can identify them.
- The container in the deployment specifies CPU resource requests and limits. Requests reserve a guaranteed amount of CPU for the container, while limits cap the maximum it can use. Note that the HPA measures utilization relative to the request, not the limit (see the sketch after this list).
- The `ml_pipeline_hpa` resource defines a Horizontal Pod Autoscaler that targets the Deployment created above. The `scale_target_ref` links the HPA to the Deployment by its API version, kind, and name.
- The `min_replicas` and `max_replicas` fields set the lower and upper bounds for the number of replicas the HPA can scale to.
- The `target_cpu_utilization_percentage` is the average CPU utilization the HPA tries to maintain across all the pods; here it is set to 80%.
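To make that last point concrete: with a request of 500m, the 80% target corresponds to an average usage of 400m per pod. A minimal sketch of the arithmetic (illustrative only; the real computation happens in the HPA controller using metrics-server data):

```python
# CPU utilization as the HPA sees it: usage divided by the container's *request*.
cpu_request_millicores = 500   # matches the "500m" request in the deployment
target_utilization_pct = 80    # matches target_cpu_utilization_percentage

# The usage level at which the deployment sits exactly at the target:
threshold_millicores = cpu_request_millicores * target_utilization_pct / 100
print(f"Scale-up begins above {threshold_millicores:.0f}m average usage per pod")  # 400m
```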
Finally, we export the names of the deployment and the HPA so they can be easily retrieved from the Pulumi stack.
To run this program, save it in a file (for example `autoscaling_ml_pipeline.py`), then deploy it with the Pulumi CLI by running `pulumi up`. Assuming you have named your Pulumi project and your Kubernetes cluster is correctly configured, this will set up autoscaling for your ML pipeline.