Automated Anomaly Detection for AI Workloads using CloudWatch

To set up automated anomaly detection for AI workloads with AWS CloudWatch, create a CloudWatch alarm that evaluates a metric against an anomaly detection model. The model learns a baseline of normal behavior from the metric's history and produces a band of expected values around it. When the metric falls outside that band, the alarm changes state and can perform actions such as sending a notification to an SNS topic or invoking an AWS Lambda function.

Below is a Pulumi Python program that defines:
- An SNS topic where notifications will be sent if the alarm state changes.
- A topic policy that allows CloudWatch alarms to publish to that topic.
- A CloudWatch metric alarm that uses an anomaly detection model.

The alarm tracks a specific metric from your AI workload, which in this example is the number of invocations of an AWS Lambda function (this could be part of your AI workload). The alarm is configured to trigger if the invocations are higher or lower than expected, based on the anomaly detection model.

```python
import pulumi
import pulumi_aws as aws

# Create an SNS topic that will receive notifications when the alarm state changes
alarm_topic = aws.sns.Topic("alarmTopic")

# Define the necessary permission to allow CloudWatch alarms to publish to the SNS topic
alarm_topic_policy = aws.sns.TopicPolicy("alarmTopicPolicy",
    arn=alarm_topic.arn,
    policy=alarm_topic.arn.apply(lambda arn: """{
        "Version": "2012-10-17",
        "Id": "default",
        "Statement": [
            {
                "Sid": "AllowPublishFromCloudWatchAlarms",
                "Effect": "Allow",
                "Principal": {
                    "Service": "cloudwatch.amazonaws.com"
                },
                "Action": "SNS:Publish",
                "Resource": "%s"
            }
        ]
    }""" % arn)
)

# Define a CloudWatch metric alarm based on anomaly detection
anomaly_detection_alarm = aws.cloudwatch.MetricAlarm("anomalyDetectionAlarm",
    # Alarm when the metric leaves the band in either direction
    comparison_operator="LessThanLowerOrGreaterThanUpperThreshold",
    # Use the band produced by the query with id "e1" as the threshold
    threshold_metric_id="e1",
    # Number of consecutive periods that must breach before the alarm fires
    evaluation_periods=2,
    # Set actions to trigger when the alarm state changes
    alarm_actions=[alarm_topic.arn],
    ok_actions=[alarm_topic.arn],
    # Treat missing data as notBreaching; change this to suit your use case
    treat_missing_data="notBreaching",
    # With metric queries, the namespace, metric name, dimensions and statistic
    # belong inside the query, not at the top level of the alarm
    metric_queries=[
        aws.cloudwatch.MetricAlarmMetricQueryArgs(
            id="e1",
            expression="ANOMALY_DETECTION_BAND(m1, 2)",
            label="Invocations (Expected)",
            return_data=True,
        ),
        aws.cloudwatch.MetricAlarmMetricQueryArgs(
            id="m1",
            return_data=True,
            metric=aws.cloudwatch.MetricAlarmMetricQueryMetricArgs(
                # Use the namespace, metric name and dimensions of your AI workload
                metric_name="Invocations",
                namespace="AWS/Lambda",
                dimensions={
                    "FunctionName": "your-ai-lambda-function-name",
                },
                stat="Sum",
                period=300,
                unit="Count",
            ),
        ),
    ],
)

# Export the name of the topic and the ARN of the CloudWatch alarm for reference
pulumi.export('alarm_topic_name', alarm_topic.name)
pulumi.export('cloudwatch_alarm_arn', anomaly_detection_alarm.arn)
```
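The topic policy in the program is assembled with `%` string formatting, which becomes error-prone as a policy grows. As a hedged alternative, the same document could be built from a dictionary and serialized with `json.dumps`. The helper name `build_publish_policy` is hypothetical, and the ARN below is a placeholder used only to show the output shape.

```python
import json


def build_publish_policy(topic_arn: str) -> str:
    """Return an SNS topic policy that lets CloudWatch alarms publish to topic_arn."""
    policy = {
        "Version": "2012-10-17",
        "Id": "default",
        "Statement": [
            {
                "Sid": "AllowPublishFromCloudWatchAlarms",
                "Effect": "Allow",
                "Principal": {"Service": "cloudwatch.amazonaws.com"},
                "Action": "SNS:Publish",
                "Resource": topic_arn,
            }
        ],
    }
    return json.dumps(policy)


# Placeholder ARN (assumption), not a real topic
print(build_publish_policy("arn:aws:sns:us-east-1:123456789012:alarmTopic"))
```

In the Pulumi program, this helper could be passed directly to the output: `policy=alarm_topic.arn.apply(build_publish_policy)`.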

This Pulumi program configures anomaly detection for a hypothetical AWS Lambda function that's part of an AI workload. It uses the `aws.cloudwatch.MetricAlarm` resource ([CloudWatch Metric Alarm](https://www.pulumi.com/registry/packages/aws/api-docs/cloudwatch/metricAlarm/)) to define a metric alarm based on the number of function invocations.

The `metric_queries` parameter holds two `MetricAlarmMetricQueryArgs` entries. The entry with id `m1` defines the data to analyze: the sum of Lambda invocations over each 5-minute period (`period=300`). The entry with id `e1` applies `ANOMALY_DETECTION_BAND(m1, 2)`, which produces the band of expected values around `m1`. The `2` is the number of standard deviations that sets the band's width, so a larger value gives a wider band and fewer alarms. Setting `threshold_metric_id="e1"` tells the alarm to use that band as its threshold.
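To make the role of the band-width argument concrete, the sketch below computes a band as the mean plus or minus `k` standard deviations over a short history and checks whether a new value falls outside it. This is a deliberate simplification for illustration only: CloudWatch trains its own model, which also accounts for trend and seasonality, and the function names and sample data here are hypothetical.

```python
from statistics import mean, stdev


def band(history, k=2.0):
    """Return (lower, upper) as mean -/+ k sample standard deviations."""
    center = mean(history)
    spread = stdev(history)
    return center - k * spread, center + k * spread


def is_anomalous(value, history, k=2.0):
    """True when value lies outside the band built from history."""
    lower, upper = band(history, k)
    return value < lower or value > upper


# Made-up invocation counts per 5-minute period (assumption)
history = [100, 104, 98, 101, 97, 103, 99, 102]

print(is_anomalous(101, history))        # inside the band
print(is_anomalous(150, history))        # far above the band
print(is_anomalous(106, history, k=3))   # a wider band tolerates more deviation
```

Raising `k` here has the same qualitative effect as raising the second argument of `ANOMALY_DETECTION_BAND`: the band widens and fewer data points count as anomalies.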

When the metric stays outside the band for the configured number of evaluation periods, the alarm enters the `ALARM` state and runs the actions in `alarm_actions`. When the metric returns inside the band, the alarm goes back to `OK` and runs the actions in `ok_actions`. Here, both are configured to send a notification to the SNS topic defined in this program.

Remember to replace `"your-ai-lambda-function-name"` with the actual name of your Lambda function.

To use this program:
1. Install the Pulumi CLI and configure your AWS credentials.
2. Create a new Pulumi Python project (for example, with `pulumi new aws-python`).
3. Save this program as `__main__.py` in the project directory.
4. Run `pulumi up` to deploy the resources.

Once the deployment completes, the Pulumi CLI prints the exported values: the SNS topic name and the CloudWatch alarm ARN. Use them to confirm that the alarm is configured correctly, either by locating the alarm in the CloudWatch console or by subscribing to the SNS topic and watching for notifications.
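If a Lambda function is subscribed to the topic, notifications can also be inspected programmatically. The sketch below assumes the standard SNS-to-Lambda event shape and the `AlarmName`, `NewStateValue`, and `NewStateReason` fields that CloudWatch includes in alarm notifications. The handler and the sample event are hypothetical and are not part of the program above.

```python
import json


def handler(event, context=None):
    """Summarize CloudWatch alarm state changes delivered through SNS."""
    summaries = []
    for record in event.get("Records", []):
        # SNS delivers the alarm payload as a JSON string in Sns.Message
        message = json.loads(record["Sns"]["Message"])
        summaries.append({
            "alarm": message.get("AlarmName"),
            "state": message.get("NewStateValue"),
            "reason": message.get("NewStateReason"),
        })
    return summaries


# Hand-written sample event for local testing (assumption, not captured output)
sample_event = {
    "Records": [
        {
            "Sns": {
                "Message": json.dumps({
                    "AlarmName": "anomalyDetectionAlarm",
                    "NewStateValue": "ALARM",
                    "NewStateReason": "Invocations outside the expected band",
                })
            }
        }
    ]
}

print(handler(sample_event))
```

Running the handler locally against a sample event like this is a quick way to check the parsing logic before wiring up a real subscription.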