1. Time-based Auto Scaling for Inference Workloads


    Time-based (scheduled) auto-scaling automatically adjusts the number of compute resources in your cloud infrastructure at specific times, based on expected changes in demand. It is especially useful for inference workloads with predictable patterns of high and low usage, such as machine learning models that serve real-time predictions.

    When setting up time-based auto-scaling for inference workloads, you typically define scaling policies that control how and when to scale out (add more resources) or scale in (remove resources). You may specify schedules corresponding to anticipated workload changes, such as scaling out in preparation for peak demand times and scaling in during off-peak hours to save on costs.
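
    To make the idea of a schedule concrete, here is a minimal sketch in plain Python, independent of any cloud provider. It maps a point in time to the capacity bounds that should apply. The names PEAK_WINDOWS, OFF_PEAK_BOUNDS and capacity_bounds, and all the numbers, are assumptions chosen for illustration.

```python
from datetime import datetime, time

# Hypothetical schedule: (weekdays, start, end, min_capacity, max_capacity).
# Weekdays use datetime.weekday() numbering: Monday is 0, Sunday is 6.
PEAK_WINDOWS = [
    (range(0, 5), time(17, 0), time(23, 0), 2, 5),  # weekday evenings
]
OFF_PEAK_BOUNDS = (1, 2)  # assumed bounds outside every peak window


def capacity_bounds(now: datetime) -> tuple[int, int]:
    """Return the (min, max) instance count that applies at `now`."""
    for weekdays, start, end, min_cap, max_cap in PEAK_WINDOWS:
        # The window is half-open: it includes `start` and excludes `end`.
        if now.weekday() in weekdays and start <= now.time() < end:
            return (min_cap, max_cap)
    return OFF_PEAK_BOUNDS


if __name__ == "__main__":
    print(capacity_bounds(datetime(2024, 1, 3, 18, 30)))  # a Wednesday evening
    print(capacity_bounds(datetime(2024, 1, 6, 18, 30)))  # a Saturday evening
```

    A managed auto-scaling service performs this lookup for you: each window boundary becomes a scheduled action that sets new bounds at that moment.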

    In Pulumi, this means using cloud provider services that support auto-scaling and defining the resources and policies that implement these scaling parameters. Below is a Pulumi program written in Python that sets up scheduled scaling for a hypothetical inference workload on AWS. It uses Amazon SageMaker, a fully managed service for building, training, and deploying machine learning models, together with Application Auto Scaling, which adjusts the instance count of a SageMaker endpoint variant.

    This program demonstrates how to create a SageMaker endpoint and attach scheduled scaling that adjusts the instance count of the endpoint's production variant:

    import pulumi
    import pulumi_aws as aws

    # For the purpose of this example, we assume that you have already created
    # a SageMaker model. Replace 'MY-SAGEMAKER-MODEL-NAME' with its actual name.
    sagemaker_model = "MY-SAGEMAKER-MODEL-NAME"

    # Define the SageMaker endpoint configuration, including the initial instance count.
    endpoint_config = aws.sagemaker.EndpointConfiguration(
        "MyEndpointConfig",
        production_variants=[{
            "variant_name": "variant-1",
            "model_name": sagemaker_model,
            "instance_type": "ml.m4.xlarge",
            "initial_instance_count": 1,
        }],
    )

    # Create a SageMaker endpoint using the configuration defined above.
    endpoint = aws.sagemaker.Endpoint(
        "MyEndpoint",
        endpoint_config_name=endpoint_config.name,
        tags={
            "Environment": "prod",
            "Purpose": "Inference",
        },
    )

    # Register the endpoint variant as a scalable target. Policies and scheduled
    # actions both attach to this target. The bounds here are assumed defaults
    # that apply until a scheduled action changes them.
    scaling_target = aws.appautoscaling.Target(
        "MyScalingTarget",
        resource_id=pulumi.Output.concat("endpoint/", endpoint.name, "/variant/variant-1"),
        scalable_dimension="sagemaker:variant:DesiredInstanceCount",
        service_namespace="sagemaker",
        min_capacity=1,
        max_capacity=5,
    )

    # Define a step scaling policy. It only acts when a CloudWatch alarm
    # (not defined in this example) triggers it.
    autoscaling_policy = aws.appautoscaling.Policy(
        "MyAutoScalingPolicy",
        resource_id=scaling_target.resource_id,
        scalable_dimension=scaling_target.scalable_dimension,
        service_namespace=scaling_target.service_namespace,
        policy_type="StepScaling",
        step_scaling_policy_configuration={
            "adjustment_type": "ChangeInCapacity",
            "cooldown": 300,  # Cooldown period in seconds after a scaling activity.
            "step_adjustments": [{
                "scaling_adjustment": 1,  # Number of instances to add or remove.
                "metric_interval_lower_bound": "0",  # Lower bound for the metric interval.
            }],
        },
    )

    # Define the schedule for the scaling action. Scheduled actions are a
    # separate resource, not a property of the scaling policy.
    scheduled_action = aws.appautoscaling.ScheduledAction(
        "ScaleOutForPeakTimes",
        name="ScaleOutForPeakTimes",
        resource_id=scaling_target.resource_id,
        scalable_dimension=scaling_target.scalable_dimension,
        service_namespace=scaling_target.service_namespace,
        schedule="cron(0 17 ? * MON-FRI *)",  # Weekdays at 17:00 (UTC by default).
        scalable_target_action={
            "min_capacity": 2,  # Minimum capacity for the scheduled scaling action.
            "max_capacity": 5,  # Maximum capacity for the scheduled scaling action.
        },
    )

    # Export the endpoint name and auto-scaling policy ID.
    pulumi.export("endpoint_name", endpoint.name)
    pulumi.export("autoscaling_policy_id", autoscaling_policy.id)

    In the program above:

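    • We start with a placeholder (sagemaker_model) that stands in for your SageMaker model; replace it with your actual SageMaker model name.
    • We define a SageMaker Endpoint Configuration (MyEndpointConfig) with a production variant, specifying the model to use, the instance type, and the initial instance count.
    • We create a SageMaker endpoint (MyEndpoint) that serves the inference workload using the previously created endpoint configuration.
    • We set up an auto-scaling policy (MyAutoScalingPolicy) that targets the endpoint's variant, identified by a resource ID of the form endpoint/<endpoint-name>/variant/variant-1 and the scalable dimension sagemaker:variant:DesiredInstanceCount. The time-based behavior comes from a scheduled action, which sets the capacity bounds at the scheduled time.
    • The scheduled action uses a cron expression to scale our endpoint out at specific times. This particular expression, cron(0 17 ? * MON-FRI *), fires every weekday at 5 PM; Application Auto Scaling interprets schedules in UTC unless a time zone is specified. When it fires, the variant's minimum capacity becomes 2 and its maximum becomes 5.
    • We use a Step Scaling policy, which adds one instance (a scaling adjustment of 1 with ChangeInCapacity) when the triggering metric is at or above the alarm threshold, then waits out a 300-second cooldown. A step scaling policy acts only when a CloudWatch alarm invokes it, and this example does not define one.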
    • Finally, we export the endpoint name and auto-scaling policy ID for easy reference.
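
    The cron expression used for the schedule follows the six-field format of Application Auto Scaling: minute, hour, day-of-month, month, day-of-week, year. As a rough sketch, a small helper can make those fields explicit instead of hand-writing the string. The name weekly_cron is hypothetical and not part of any library.

```python
def weekly_cron(minute: int, hour: int, days_of_week: str) -> str:
    """Build a six-field cron expression that fires on the given weekdays.

    Field order: minute hour day-of-month month day-of-week year.
    Day-of-month is "?" because day-of-week is specified instead.
    """
    if not (0 <= minute <= 59 and 0 <= hour <= 23):
        raise ValueError("minute must be 0-59 and hour must be 0-23")
    return f"cron({minute} {hour} ? * {days_of_week} *)"


if __name__ == "__main__":
    # Reproduces the expression used in the program: weekdays at 17:00.
    print(weekly_cron(0, 17, "MON-FRI"))
```

    The helper only validates the numeric fields; the days_of_week string is passed through unchanged, so a typo there would surface only when the schedule is deployed.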

    Replace the placeholder values and configurations with details that match your use case. In particular, replace 'MY-SAGEMAKER-MODEL-NAME' with your SageMaker model's name, and adjust the cron expression in the scheduled action's schedule to match your workload's peak times.
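
    The program schedules only a scale-out. To drop back to lower capacity during off-peak hours, as described at the start of this section, a second scheduled action is needed. The sketch below builds the arguments for such a pair as plain dictionaries so the pairing is easy to see. The endpoint name my-endpoint, the 23:00 scale-in time, and the off-peak bounds of 1 and 2 are assumptions for illustration, as are the helper names.

```python
SERVICE_NAMESPACE = "sagemaker"
SCALABLE_DIMENSION = "sagemaker:variant:DesiredInstanceCount"


def variant_resource_id(endpoint_name: str, variant_name: str) -> str:
    """Resource ID format for a SageMaker endpoint variant."""
    return f"endpoint/{endpoint_name}/variant/{variant_name}"


def scheduled_action_args(resource_id: str, schedule: str,
                          min_capacity: int, max_capacity: int) -> dict:
    """Collect the arguments that describe one scheduled scaling action."""
    if min_capacity > max_capacity:
        raise ValueError("min_capacity must not exceed max_capacity")
    return {
        "service_namespace": SERVICE_NAMESPACE,
        "scalable_dimension": SCALABLE_DIMENSION,
        "resource_id": resource_id,
        "schedule": schedule,
        "scalable_target_action": {
            "min_capacity": min_capacity,
            "max_capacity": max_capacity,
        },
    }


# Hypothetical endpoint name; the variant name matches the program.
RESOURCE_ID = variant_resource_id("my-endpoint", "variant-1")

# Scale out on weekdays at 17:00, then back in at 23:00 (assumed off-peak start).
scale_out = scheduled_action_args(RESOURCE_ID, "cron(0 17 ? * MON-FRI *)", 2, 5)
scale_in = scheduled_action_args(RESOURCE_ID, "cron(0 23 ? * MON-FRI *)", 1, 2)
```

    Each dictionary could then be passed as keyword arguments, along with a name, to a scheduled-action resource such as aws.appautoscaling.ScheduledAction. In a real Pulumi program the endpoint name is an output value, so the resource ID would be assembled with pulumi.Output.concat rather than an f-string.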