1. Reliability Testing of Auto-Scaling Policies for ML Services

    Reliability testing of auto-scaling policies for Machine Learning (ML) services typically involves subjecting the service to varying levels of load and verifying that the infrastructure scales appropriately to handle that load without degrading performance or availability.

    To implement reliability testing for auto-scaling policies with Pulumi, you would provision ML services with auto-scaling capabilities through the cloud provider of your choice (AWS, Azure, GCP, and so on) and define auto-scaling policies that meet your requirements. This setup includes your ML model, a compute service to serve predictions, and an auto-scaling policy that defines how the service should scale in response to metrics such as CPU utilization, request rates, or custom metrics that are significant for your ML workload.

    Let's go through a hypothetical scenario where you want to set up and test the reliability of auto-scaling policies for an ML service on AWS. We'll use Amazon SageMaker as the ML service and rely on AWS Application Auto Scaling policies to manage the scaling. Here's a Pulumi Python program that sets up such a service, which you can later test by generating varying loads:

    ```python
    import json

    import pulumi
    import pulumi_aws as aws

    # Name of the SageMaker model
    model_name = "my-ml-model"

    # IAM role that SageMaker assumes to run the model
    sagemaker_execution_role = aws.iam.Role("sagemaker-execution-role",
        assume_role_policy=json.dumps({
            "Version": "2012-10-17",
            "Statement": [{
                "Effect": "Allow",
                "Principal": {"Service": "sagemaker.amazonaws.com"},
                "Action": "sts:AssumeRole",
            }],
        }))

    aws.iam.RolePolicyAttachment("sagemaker-execution-policy",
        role=sagemaker_execution_role.name,
        policy_arn="arn:aws:iam::aws:policy/AmazonSageMakerFullAccess")

    # Create a SageMaker model resource
    sagemaker_model = aws.sagemaker.Model(model_name,
        execution_role_arn=sagemaker_execution_role.arn,
        primary_container={
            "image": "<your-ml-model-image>",            # Replace with your ML model image
            "modelDataUrl": "<your-ml-model-data-url>",  # Replace with your ML model data URL
        })

    # Define an endpoint configuration for the SageMaker model
    endpoint_config = aws.sagemaker.EndpointConfiguration(f"{model_name}-endpoint-config",
        production_variants=[{
            "variantName": "variant-1",
            "modelName": sagemaker_model.name,
            "initialInstanceCount": 1,
            "instanceType": "ml.m4.xlarge",
        }])

    # Create a SageMaker endpoint using the endpoint configuration
    sagemaker_endpoint = aws.sagemaker.Endpoint(f"{model_name}-endpoint",
        endpoint_config_name=endpoint_config.name)

    # Register the endpoint variant as a scalable target; a scaling policy
    # can only be attached to a registered target
    scalable_target = aws.appautoscaling.Target(f"{model_name}-scalable-target",
        min_capacity=1,
        max_capacity=4,
        resource_id=pulumi.Output.concat("endpoint/", sagemaker_endpoint.name, "/variant/variant-1"),
        scalable_dimension="sagemaker:variant:DesiredInstanceCount",
        service_namespace="sagemaker")

    # Define a target-tracking auto-scaling policy for the SageMaker endpoint variant
    scaling_policy = aws.appautoscaling.Policy(f"{model_name}-auto-scaling-policy",
        policy_type="TargetTrackingScaling",
        resource_id=scalable_target.resource_id,
        scalable_dimension=scalable_target.scalable_dimension,
        service_namespace=scalable_target.service_namespace,
        target_tracking_scaling_policy_configuration={
            # Aim for ~75 invocations per minute on each instance
            "targetValue": 75.0,
            "predefinedMetricSpecification": {
                "predefinedMetricType": "SageMakerVariantInvocationsPerInstance",
            },
            "scaleInCooldown": 60,
            "scaleOutCooldown": 60,
        })

    # Export the SageMaker endpoint name and auto-scaling policy ARN
    pulumi.export("sagemaker_endpoint_name", sagemaker_endpoint.name)
    pulumi.export("scaling_policy_arn", scaling_policy.arn)
    ```

    In this program, we start by creating an IAM execution role that SageMaker can assume, then define a SageMaker model that uses this role along with the necessary container configuration, including where to fetch the model data. We then create an endpoint configuration where we specify the model, instance count, and instance type. Following this, we create a SageMaker endpoint.

    For the auto-scaling part, we first register the endpoint variant as a scalable target using the aws.appautoscaling.Target resource, then attach an aws.appautoscaling.Policy from the pulumi_aws Pulumi provider, which allows us to define a target-tracking scaling policy tied to the SageMaker endpoint variant.

    The target_tracking_scaling_policy_configuration section of this resource is where we set up the auto-scaling logic. We define a target value (here, 75 invocations per minute per instance), the metric to track (the predefined SageMakerVariantInvocationsPerInstance metric type), and cooldown periods for scaling in and scaling out.
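
    If the predefined metric type doesn't fit your workload, you can swap in a customizedMetricSpecification that tracks a CloudWatch metric of your choosing. Below is a minimal sketch, assuming you want to track the AWS/SageMaker InvocationsPerInstance metric directly; the resource name custom_metric_policy is illustrative, and the statistic and target value would need tuning for your workload:

    ```python
    # Hypothetical alternative policy: track the CloudWatch InvocationsPerInstance
    # metric explicitly instead of using the predefined metric type.
    custom_metric_policy = aws.appautoscaling.Policy(f"{model_name}-custom-metric-policy",
        policy_type="TargetTrackingScaling",
        resource_id=scalable_target.resource_id,
        scalable_dimension=scalable_target.scalable_dimension,
        service_namespace=scalable_target.service_namespace,
        target_tracking_scaling_policy_configuration={
            "targetValue": 75.0,
            "customizedMetricSpecification": {
                "metricName": "InvocationsPerInstance",  # emitted by SageMaker endpoints
                "namespace": "AWS/SageMaker",
                "statistic": "Sum",
                "dimensions": [
                    {"name": "EndpointName", "value": sagemaker_endpoint.name},
                    {"name": "VariantName", "value": "variant-1"},
                ],
            },
            "scaleInCooldown": 60,
            "scaleOutCooldown": 60,
        })
    ```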

    This configuration ensures that when the number of invocations per instance exceeds the target value, Application Auto Scaling provisions more instances (up to the registered maximum) to handle the load, and scales back in when invocations drop below the target.
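
    Once deployed, you can watch the policy act during a test run. The following is a small observation script, a sketch using boto3's application-autoscaling client outside of Pulumi; the endpoint name is a placeholder for the value exported by the stack below:

    ```python
    import boto3

    # Placeholder: substitute the endpoint name exported by the Pulumi stack
    ENDPOINT_NAME = "my-ml-model-endpoint"

    client = boto3.client("application-autoscaling")

    # List recent scale-out/scale-in activity for the endpoint variant
    response = client.describe_scaling_activities(
        ServiceNamespace="sagemaker",
        ResourceId=f"endpoint/{ENDPOINT_NAME}/variant/variant-1",
    )
    for activity in response["ScalingActivities"]:
        print(activity["StartTime"], activity["StatusCode"], activity["Description"])
    ```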

    Finally, we export the created SageMaker endpoint name and the ARN (Amazon Resource Name) of the scaling policy so that they can be referenced later or used in other parts of our Pulumi program or in the Pulumi Console.

    To perform reliability testing, you'd subject this service to load that mimics expected production patterns, then observe and evaluate how the auto-scaling policy responds. You'd want to do this in a controlled environment where you can simulate load increases and decreases and confirm that the endpoint scales as expected without downtime or performance degradation.
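
    As a starting point, here is a minimal, hypothetical load generator using boto3's sagemaker-runtime client; the endpoint name and payload are placeholders you would replace with values appropriate for your model:

    ```python
    import concurrent.futures

    import boto3

    # Placeholders: substitute your stack's endpoint name and a valid model payload
    ENDPOINT_NAME = "my-ml-model-endpoint"
    PAYLOAD = b'{"instances": [[1.0, 2.0, 3.0]]}'

    runtime = boto3.client("sagemaker-runtime")

    def invoke_once(_):
        # Each call counts toward the SageMakerVariantInvocationsPerInstance metric
        response = runtime.invoke_endpoint(
            EndpointName=ENDPOINT_NAME,
            ContentType="application/json",
            Body=PAYLOAD,
        )
        return response["ResponseMetadata"]["HTTPStatusCode"]

    # Fire 1,000 requests across 20 worker threads to push the variant past the
    # 75-invocations-per-minute target and provoke a scale-out event
    with concurrent.futures.ThreadPoolExecutor(max_workers=20) as pool:
        statuses = list(pool.map(invoke_once, range(1000)))

    print(f"{statuses.count(200)} of {len(statuses)} invocations succeeded")
    ```

    Running this repeatedly while watching the scaling activities from the earlier script lets you confirm that instances are added under load and removed again after the scale-in cooldown elapses.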