1. Autoscaling ML Workloads with ServiceMonitor Metrics


    Autoscaling machine learning (ML) workloads is essential for managing resources efficiently and ensuring that your ML models have enough computing power when needed, while minimizing costs during periods of low usage. To achieve this, you'll typically monitor certain metrics that indicate the workload's performance and demand, and scale your resources based on those metrics.

    In Kubernetes, the HorizontalPodAutoscaler (HPA) is a resource that automatically scales the number of pod replicas in a Deployment or ReplicaSet based on observed CPU utilization or custom metrics. For ML workloads you will often want to scale on custom metrics, for example metrics scraped by Prometheus via a ServiceMonitor. Note that a ServiceMonitor only tells Prometheus what to scrape; for the HPA to act on those metrics, your cluster also needs a metrics adapter (such as prometheus-adapter) that serves them through the custom metrics API.
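
    For context, here is a minimal sketch of what such a ServiceMonitor might look like when defined with Pulumi's generic CustomResource. It assumes the Prometheus Operator CRDs are already installed; the Service labels, the named metrics port, and the 'release: prometheus' label are illustrative placeholders that must match your own setup.

    import pulumi_kubernetes as k8s

    # A ServiceMonitor (Prometheus Operator CRD) that tells Prometheus to scrape
    # the Service fronting the ML workload. All names and labels are placeholders.
    service_monitor = k8s.apiextensions.CustomResource(
        'ml-workload-service-monitor',
        api_version='monitoring.coreos.com/v1',
        kind='ServiceMonitor',
        metadata=k8s.meta.v1.ObjectMetaArgs(
            name='ml-workload-service-monitor',
            namespace='default',
            labels={'release': 'prometheus'},  # Must match the Prometheus Operator's serviceMonitorSelector.
        ),
        spec={
            'selector': {
                'matchLabels': {'app': 'ml-workload'},  # Labels on the Service to scrape.
            },
            'endpoints': [{
                'port': 'metrics',   # Named port on the Service that exposes /metrics.
                'interval': '30s',
            }],
        },
    )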

    Below is a Pulumi program in Python that defines a HorizontalPodAutoscaler scaling on a custom metric collected via a ServiceMonitor. The example assumes a functioning Kubernetes cluster with Prometheus and the custom metrics API installed and configured. It uses the autoscaling/v2beta2 API; that API was removed in Kubernetes 1.26, and a variant using the stable autoscaling/v2 types follows the program.

    import pulumi
    import pulumi_kubernetes as k8s

    # The name of your Kubernetes deployment that you want to autoscale.
    deployment_name = 'ml-workload-deployment'

    # Specification for the HorizontalPodAutoscaler that uses custom metrics.
    hpa = k8s.autoscaling.v2beta2.HorizontalPodAutoscaler(
        'ml-workload-hpa',
        metadata=k8s.meta.v1.ObjectMetaArgs(
            name='ml-workload-hpa',
            namespace='default',  # Adjust the namespace according to your setup.
        ),
        spec=k8s.autoscaling.v2beta2.HorizontalPodAutoscalerSpecArgs(
            scale_target_ref=k8s.autoscaling.v2beta2.CrossVersionObjectReferenceArgs(
                kind='Deployment',
                name=deployment_name,
                api_version='apps/v1',
            ),
            min_replicas=1,   # Minimum number of replicas.
            max_replicas=10,  # Maximum number of replicas.
            metrics=[k8s.autoscaling.v2beta2.MetricSpecArgs(
                type='Object',  # For custom metrics from a single object, use type "Object".
                object=k8s.autoscaling.v2beta2.ObjectMetricSourceArgs(
                    metric=k8s.autoscaling.v2beta2.MetricIdentifierArgs(
                        name='service_monitor_metric_name',  # Replace with your ServiceMonitor metric name.
                        selector=k8s.meta.v1.LabelSelectorArgs(
                            match_labels={
                                'key': 'value',  # Specify the labels that the ServiceMonitor uses.
                            },
                        ),
                    ),
                    target=k8s.autoscaling.v2beta2.MetricTargetArgs(
                        type='Value',  # Use 'Value' or 'AverageValue' based on the metric type.
                        value='100',   # Specify the target value for your custom metric.
                    ),
                    described_object=k8s.autoscaling.v2beta2.CrossVersionObjectReferenceArgs(
                        kind='Service',
                        name='ml-workload-service',  # Name of the service that the metric is coming from.
                        api_version='v1',
                    ),
                ),
            )],
        ),
    )

    # Export the name of the HPA.
    pulumi.export('hpa_name', hpa.metadata.apply(lambda metadata: metadata.name))
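
    Because autoscaling/v2beta2 was removed in Kubernetes 1.26, newer clusters need the stable autoscaling/v2 API instead. Recent pulumi_kubernetes SDKs expose those types with the same argument shapes, so a condensed sketch of the same HPA (reusing the imports and deployment_name from the program above) might look like this; treat it as a variant under that assumption rather than a drop-in replacement:

    hpa_v2 = k8s.autoscaling.v2.HorizontalPodAutoscaler(
        'ml-workload-hpa-v2',
        metadata=k8s.meta.v1.ObjectMetaArgs(name='ml-workload-hpa', namespace='default'),
        spec=k8s.autoscaling.v2.HorizontalPodAutoscalerSpecArgs(
            scale_target_ref=k8s.autoscaling.v2.CrossVersionObjectReferenceArgs(
                kind='Deployment', name=deployment_name, api_version='apps/v1',
            ),
            min_replicas=1,
            max_replicas=10,
            metrics=[k8s.autoscaling.v2.MetricSpecArgs(
                type='Object',
                object=k8s.autoscaling.v2.ObjectMetricSourceArgs(
                    metric=k8s.autoscaling.v2.MetricIdentifierArgs(name='service_monitor_metric_name'),
                    target=k8s.autoscaling.v2.MetricTargetArgs(type='Value', value='100'),
                    described_object=k8s.autoscaling.v2.CrossVersionObjectReferenceArgs(
                        kind='Service', name='ml-workload-service', api_version='v1',
                    ),
                ),
            )],
        ),
    )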

    This program does the following:

    • Imports the necessary Pulumi modules.
    • Creates a HorizontalPodAutoscaler named ml-workload-hpa that targets a deployment called ml-workload-deployment.
    • Configures the autoscaler to have a minimum of 1 replica and a maximum of 10 replicas.
    • Specifies the metrics to be used for autoscaling. In this case, it's a custom metric named service_monitor_metric_name from a Service named ml-workload-service.
    • Exports the name of the HorizontalPodAutoscaler so that you can reference it externally if needed.

    Make sure to replace 'service_monitor_metric_name', 'ml-workload-service', and any other placeholder with the appropriate names based on your specific use case and cluster setup.
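
    Also keep in mind that the metric only becomes visible to the HPA once an adapter serves it through the custom metrics API. As one possible approach, a prometheus-adapter rule can map a Prometheus series onto the metric name used above. The sketch below is illustrative only: the series name inference_queue_length, its labels, and the 'monitoring' namespace are placeholders, and how the adapter picks up this ConfigMap depends on how prometheus-adapter was installed (for example, through its Helm chart values).

    import pulumi_kubernetes as k8s

    # Example prometheus-adapter rule (placeholder series and label names) that exposes
    # a Prometheus series as "service_monitor_metric_name" on the custom metrics API.
    adapter_rules = """
    rules:
    - seriesQuery: 'inference_queue_length{namespace!="",service!=""}'
      resources:
        overrides:
          namespace: {resource: "namespace"}
          service: {resource: "service"}
      name:
        matches: "inference_queue_length"
        as: "service_monitor_metric_name"
      metricsQuery: 'sum(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>)'
    """

    adapter_config = k8s.core.v1.ConfigMap(
        'prometheus-adapter-config',
        metadata=k8s.meta.v1.ObjectMetaArgs(
            name='prometheus-adapter-config',
            namespace='monitoring',  # Namespace where prometheus-adapter runs.
        ),
        data={'config.yaml': adapter_rules},
    )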

    This autoscaling configuration is crucial for ML workloads whose compute needs vary significantly. For instance, a common pattern is high load during business hours, when models are being trained, and much lower load at other times. Custom metrics give you finer-grained control over autoscaling behavior because scaling decisions are driven by the actual demand on your ML workloads rather than CPU utilization alone.