Autoscaling Inference Endpoints with Amazon SageMaker

Pulumi · Accepted Answer

Autoscaling a SageMaker inference endpoint automatically adjusts the number of instances behind the endpoint in response to its workload. The endpoint scales out when load is high and scales in when load drops, which keeps latency acceptable without paying for idle capacity.

Creating an autoscaling inference endpoint in Amazon SageMaker with Pulumi typically involves four steps:

1. **Set up a SageMaker model**: Define the model in SageMaker, which involves specifying the location of model artifacts and the Docker container image that contains your inference code.

2. **Create an endpoint configuration**: An endpoint configuration defines the SageMaker resources, such as instance type and model to be deployed.

3. **Deploy an endpoint**: Use the endpoint configuration to deploy a SageMaker endpoint, which serves the inference requests.

4. **Configure autoscaling**: Register the endpoint variant as a scalable target with Application Auto Scaling, which sets the minimum and maximum number of instances, then attach a scaling policy that names the metric and target value that trigger scaling events.
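Step 4 depends on how Application Auto Scaling identifies a SageMaker variant: the resource ID has the form `endpoint/<endpoint-name>/variant/<variant-name>`, paired with the `sagemaker` service namespace and the `sagemaker:variant:DesiredInstanceCount` scalable dimension. The sketch below builds that ID from plain strings. The helper name `variant_resource_id` is hypothetical and used only for illustration; inside a Pulumi program the endpoint name is an `Output`, so you would build the same string with `pulumi.Output.concat` instead.

```python
# Identifiers Application Auto Scaling expects for a SageMaker endpoint variant.
SERVICE_NAMESPACE = "sagemaker"
SCALABLE_DIMENSION = "sagemaker:variant:DesiredInstanceCount"


def variant_resource_id(endpoint_name: str, variant_name: str) -> str:
    """Build the resource ID for one production variant of an endpoint.

    Hypothetical helper for illustration; it is not part of Pulumi or boto3.
    """
    if not endpoint_name or not variant_name:
        raise ValueError("endpoint_name and variant_name must be non-empty")
    return f"endpoint/{endpoint_name}/variant/{variant_name}"


if __name__ == "__main__":
    # "my-endpoint" is a placeholder endpoint name.
    print(variant_resource_id("my-endpoint", "variant-1"))
```

Both the scalable target and the scaling policy must use the same resource ID, including the variant part; an ID that stops at the endpoint name does not identify anything that can be scaled.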

The following Pulumi program creates a SageMaker endpoint and configures autoscaling for it. For simplicity, it assumes you have already created a model in Amazon SageMaker and know its name (an endpoint configuration references its model by name, not by ARN). Autoscaling is set to keep the endpoint between 1 and 3 instances based on the average number of invocations per instance.

```python
import pulumi
import pulumi_aws as aws

# A Pulumi program to deploy a SageMaker endpoint and configure autoscaling for it.

# Replace with the name of your existing SageMaker model.
# The endpoint configuration references the model by name, not by ARN.
model_name = "model-name"

# Create a SageMaker endpoint configuration
endpoint_config = aws.sagemaker.EndpointConfiguration("endpointConfig",
    production_variants=[aws.sagemaker.EndpointConfigurationProductionVariantArgs(
        instance_type="ml.m5.large",
        model_name=model_name,
        variant_name="variant-1",
        initial_instance_count=1,
    )]
)

# Deploy an endpoint using the endpoint configuration
endpoint = aws.sagemaker.Endpoint("endpoint",
    endpoint_config_name=endpoint_config.name
)

# Register the endpoint variant as a scalable target (1 to 3 instances).
# The resource ID must name both the endpoint and the variant.
autoscaling_target = aws.appautoscaling.Target("autoscalingTarget",
    max_capacity=3,
    min_capacity=1,
    resource_id=pulumi.Output.concat("endpoint/", endpoint.name, "/variant/variant-1"),
    scalable_dimension="sagemaker:variant:DesiredInstanceCount",
    service_namespace="sagemaker",
)

# Attach a target-tracking policy to the scalable target. Referencing the
# target's outputs makes Pulumi create the target before the policy.
autoscaling_policy = aws.appautoscaling.Policy("autoscalingPolicy",
    policy_type="TargetTrackingScaling",
    resource_id=autoscaling_target.resource_id,
    scalable_dimension=autoscaling_target.scalable_dimension,
    service_namespace=autoscaling_target.service_namespace,
    target_tracking_scaling_policy_configuration=aws.appautoscaling.PolicyTargetTrackingScalingPolicyConfigurationArgs(
        predefined_metric_specification=aws.appautoscaling.PolicyTargetTrackingScalingPolicyConfigurationPredefinedMetricSpecificationArgs(
            predefined_metric_type="SageMakerVariantInvocationsPerInstance",
        ),
        target_value=100.0,
        scale_in_cooldown=300,
        scale_out_cooldown=300,
    ),
)

# Export the endpoint name
pulumi.export("sagemaker_endpoint_name", endpoint.name)
```

This program deploys a SageMaker endpoint and scales it on the average number of invocations per instance. The `aws.appautoscaling.Target` resource registers the endpoint variant as scalable between 1 and 3 instances, and `autoscaling_policy` attaches a `TargetTrackingScaling` policy that adds or removes instances to hold the `SageMakerVariantInvocationsPerInstance` metric (average invocations per minute, per instance) near 100. The `scale_in_cooldown` and `scale_out_cooldown` values, in seconds, set how long Application Auto Scaling waits after a scale-in or scale-out activity before starting another activity of the same kind.
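To get a feel for what a target of 100 means in practice, here is a back-of-envelope sketch in plain Python. It is not the algorithm Application Auto Scaling runs (that is driven by CloudWatch alarms and the cooldowns); it only estimates the steady-state instance count for a given request rate. The function names are hypothetical, and the formula in `target_from_load_test` follows the load-testing guidance in the SageMaker documentation as I understand it (peak requests per second one instance can sustain, times a safety factor, times 60), so treat both as assumptions to verify against your own measurements.

```python
import math


def target_from_load_test(max_rps: float, safety_factor: float = 0.5) -> float:
    """Suggest a per-instance target in invocations per minute.

    max_rps is the peak requests per second that one instance sustained
    in a load test; safety_factor leaves headroom below that peak.
    """
    return max_rps * safety_factor * 60


def estimate_desired_instances(
    invocations_per_minute: float,
    target_per_instance: float = 100.0,
    min_capacity: int = 1,
    max_capacity: int = 3,
) -> int:
    """Estimate the instance count that keeps each instance near the target.

    Simplified model: total load divided by the per-instance target,
    rounded up, then clamped to the scalable target's capacity range.
    """
    raw = math.ceil(invocations_per_minute / target_per_instance)
    return max(min_capacity, min(max_capacity, raw))


if __name__ == "__main__":
    for load in (50, 150, 250, 1000):
        print(load, "invocations/min ->", estimate_desired_instances(load), "instance(s)")
```

With the values from the program above, any load beyond roughly 300 invocations per minute is capped at 3 instances, so per-instance load rises past the target; raise `max_capacity` if that ceiling is too low for your traffic.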

Make sure the Pulumi AWS provider (the `pulumi_aws` package) is installed and that your AWS credentials have permission to create these resources in your account. To deploy, run `pulumi up` from the Pulumi project directory that contains this program. Pulumi shows a preview of the changes and asks for confirmation before creating or updating any infrastructure.