Automated Anomaly Detection in AI Systems with Datadog

Pulumi · Accepted Answer

To set up automated anomaly detection in AI systems with Datadog using Pulumi, we will use two primary resources from the Datadog Pulumi provider: `datadog.Monitor` and `datadog.MetricMetadata`. The `datadog.Monitor` resource allows you to create and manage Datadog monitor configurations, which can be used to detect anomalies based on various metrics. The `datadog.MetricMetadata` resource is used to assign metadata to custom metrics, helping with organization and interpretation.

Here's a detailed explanation of how we can create a Datadog-powered anomaly detection system:

1. **datadog.Monitor**: This is the core resource for anomaly detection. Its `type` property selects the kind of monitor; anomaly detection on a metric uses `query alert`. The `query` property specifies which metric to monitor, and Datadog's `anomalies` function can be used inside that query to detect unexpected behavior. We will also set properties such as `message` to notify our team when an anomaly is detected.

2. **datadog.MetricMetadata**: This is an optional step if you have custom metrics and want to set or update their metadata to help with clarity and filtering within the Datadog UI.

First, we will establish the monitor for anomaly detection. In our case, let's say that our AI system reports a metric called `ai.system.inference.time` which records the time taken for an inference. We want to monitor this metric for any unusual spikes or drops which could indicate potential problems.

To get started, install the Pulumi CLI and the Datadog provider package for Python (`pulumi-datadog`), and make sure your Datadog API and application keys are configured for the provider. You can then write the program with the Pulumi Python SDK:

```python
import pulumi
import pulumi_datadog as datadog

# Define the monitor for anomaly detection
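anomaly_detection_monitor = datadog.Monitor("anomalyDetectionMonitor",
    type="query alert",
    # Anomaly queries end with a threshold comparison. The compared value is the
    # fraction of anomalous points, so ">= 1" means every point in the window.
    query="avg(last_5m):anomalies(avg:ai.system.inference.time{environment:production}.fill(null), 'basic', 2) >= 1",
    name="AI System Inference Time Anomaly Detection",
    message="Notification Message @pagerduty",
    tags=["ai-system", "anomaly-detection", "production"],
    priority=1,
    # Assumed values: anomaly monitors also expect a critical threshold that
    # matches the query, plus trigger and recovery windows. Tune to your needs.
    monitor_thresholds=datadog.MonitorMonitorThresholdsArgs(critical="1"),
    monitor_threshold_windows=datadog.MonitorMonitorThresholdWindowsArgs(
        trigger_window="last_5m",
        recovery_window="last_5m",
    ),
)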

# Optionally define metric metadata if you have custom metrics
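metric_metadata = datadog.MetricMetadata("aiSystemInferenceTimeMetadata",
    metric="ai.system.inference.time",
    type="gauge",
    description="Time taken for the AI system to perform an inference",
    # The Python SDK uses snake_case argument names
    short_name="Inference Time",
    # Datadog unit names are singular
    unit="second",
    # Assumption: this must be a unit name Datadog recognizes; adjust if rejected
    per_unit="inference"
)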

# Export the anomaly detection monitor id
pulumi.export("anomalyDetectionMonitorId", anomaly_detection_monitor.id)
```

In the above program,

- We create a monitor with `type="query alert"`, the monitor type used for metric-based anomaly detection.
- The `query` uses Datadog's `anomalies` function to evaluate the metric over a time window (`last_5m` refers to the last 5 minutes) and to flag behavior outside the expected range (here the `basic` algorithm with a bound of 2 deviations).
- We define a `message` that includes a notification handle (in this case `@pagerduty`) so the right channel is notified when an anomaly is detected.
- The `datadog.Monitor` resource is tagged with relevant labels like "anomaly-detection" and "production" which helps with organizing and filtering monitors within Datadog.
- We create a `datadog.MetricMetadata` resource to add additional context for the `ai.system.inference.time` metric.
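
To make the parts of the query easier to see, here is a minimal sketch that assembles the string from its components. `build_anomaly_query` is a hypothetical helper, not part of the Pulumi or Datadog SDKs, and its defaults are assumptions for illustration. Datadog's anomaly monitor queries end with a threshold comparison, so the helper takes the threshold as a parameter.

```python
# Algorithm names that Datadog documents for the anomalies() function.
ALGORITHMS = {"basic", "agile", "robust"}


def build_anomaly_query(metric, scope, window="last_5m", algorithm="basic",
                        deviations=2, threshold=1):
    """Assemble an anomaly monitor query string (hypothetical helper)."""
    if algorithm not in ALGORITHMS:
        raise ValueError(f"unknown algorithm: {algorithm!r}")
    if deviations <= 0:
        raise ValueError("deviations must be positive")
    # Metric query: aggregation, metric name, and tag scope in braces.
    series = f"avg:{metric}{{{scope}}}"
    # Full query: evaluation window, anomalies() call, threshold comparison.
    return (f"avg({window}):anomalies({series}, '{algorithm}', {deviations})"
            f" >= {threshold}")


query = build_anomaly_query("ai.system.inference.time", "environment:production")
print(query)
```

The resulting string can be passed as the `query` argument of `datadog.Monitor`. Building it from parts makes it simpler to try a different algorithm or deviation bound without retyping the whole expression.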

This program is a starting point and can be expanded upon depending on the complexity and specifics of your AI system and the metrics you want to monitor.
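
One way to expand it is to generate the same kind of monitor for several metrics. The sketch below only builds the keyword arguments as plain dictionaries so it runs without a Pulumi stack. `monitor_args` and the second metric name are hypothetical, and the query shape is an assumption modeled on the program above with a threshold comparison added.

```python
def monitor_args(metric, environment, notify):
    """Build keyword arguments for one anomaly monitor (hypothetical helper)."""
    query = (f"avg(last_5m):anomalies(avg:{metric}{{environment:{environment}}}, "
             f"'basic', 2) >= 1")
    return {
        "type": "query alert",
        "name": f"{metric} anomaly ({environment})",
        "query": query,
        "message": f"Anomaly detected on {metric} in {environment}. {notify}",
        "tags": ["ai-system", "anomaly-detection", environment],
    }


# The second metric name is made up for illustration.
METRICS = ["ai.system.inference.time", "ai.system.inference.errors"]

# Map a resource name to the arguments for each monitor.
specs = {
    metric.replace(".", "-"): monitor_args(metric, "production", "@pagerduty")
    for metric in METRICS
}

# Inside a Pulumi program, each entry could then become a resource:
#   import pulumi_datadog as datadog
#   monitors = {name: datadog.Monitor(name, **args) for name, args in specs.items()}
for name, args in specs.items():
    print(name, "->", args["query"])
```

Keeping the argument construction separate from resource creation lets you check the generated queries before running `pulumi up`.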

You can learn more about these resources in the Pulumi Registry documentation for [Datadog Monitor](https://www.pulumi.com/registry/packages/datadog/api-docs/monitor/) and [Datadog MetricMetadata](https://www.pulumi.com/registry/packages/datadog/api-docs/metricmetadata/).