1. ML Pipeline Anomaly Detection through Datadog for AWS

    To implement anomaly detection for a machine learning (ML) pipeline on AWS with monitoring through Datadog, you will need to set up AWS resources for the ML pipeline and Datadog resources for observability. Here, I will focus on the Datadog part: setting up a monitor that detects anomalies in your pipeline's metrics.
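    For the monitor to have anything to evaluate, the pipeline itself has to emit the metric. Below is a minimal sketch of how pipeline code could report its execution time through DogStatsD; it assumes a Datadog Agent with DogStatsD enabled is reachable on localhost:8125, and the metric name is an arbitrary choice that simply has to match the monitor's query.

    import time
    from datadog import initialize, statsd

    # Assumption: a Datadog Agent with DogStatsD enabled listens on localhost:8125.
    initialize(statsd_host="localhost", statsd_port=8125)

    def run_pipeline():
        start = time.monotonic()
        try:
            ...  # your training / inference steps go here
        finally:
            elapsed_ms = (time.monotonic() - start) * 1000
            # The metric name must match the one the Datadog monitor queries.
            statsd.timing(
                "ml.pipeline.execution.time",
                elapsed_ms,
                tags=["environment:production", "team:data-science"],
            )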

    In Pulumi, we can create Datadog resources, such as metrics and monitors, using the pulumi_datadog package. For our use case, you'll want to create a Monitor that watches a specific metric (for example, the execution time of your ML pipeline, or its error rate) and alerts you when it detects values that deviate from the expected range, i.e. an anomaly.
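    Before the monitor can be created, the Datadog provider needs credentials. By default, the provider reads datadog:apiKey and datadog:appKey from Pulumi configuration; the sketch below shows the explicit-provider variant, which you only need when targeting multiple Datadog accounts or making the wiring visible. The resource name "datadog-provider" is an arbitrary choice.

    import pulumi
    import pulumi_datadog as datadog

    # Assumption: keys were stored beforehand with, e.g.,
    #   pulumi config set --secret datadog:apiKey <your-api-key>
    #   pulumi config set --secret datadog:appKey <your-app-key>
    cfg = pulumi.Config("datadog")

    datadog_provider = datadog.Provider(
        "datadog-provider",
        api_key=cfg.require_secret("apiKey"),
        app_key=cfg.require_secret("appKey"),
    )
    # Resources that should use this explicit provider take
    # opts=pulumi.ResourceOptions(provider=datadog_provider).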

    Below is a Pulumi program written in Python that sets up an anomaly detection monitor with Datadog. Remember that you should already have your AWS resources in place for the ML pipeline, and your Datadog provider configured with the necessary API and application keys as sketched above.

    import pulumi
    import pulumi_datadog as datadog

    # The custom metric the monitor watches (e.g., the pipeline's execution time).
    metric_name = "ml.pipeline.execution.time"

    # Anomaly detection query: evaluate the metric over the last 4 hours with the
    # 'basic' algorithm and a band of 2 deviations; the monitor alerts when at
    # least one anomalous point is found. Adjust the query as needed for your metric.
    anomaly_detection_query = f"avg(last_4h):anomalies(avg:{metric_name}{{*}}, 'basic', 2) >= 1"

    # Create a Datadog monitor for anomaly detection in the ML pipeline.
    ml_pipeline_monitor = datadog.Monitor(
        "mlPipelineAnomalyDetectionMonitor",
        name="ML Pipeline Anomaly Detection",
        type="query alert",
        query=anomaly_detection_query,
        message="Anomaly detected in ML pipeline!",
        escalation_message="Anomaly persists in ML pipeline!",  # Follow-up if the alert keeps firing.
        tags=["environment:production", "team:data-science"],
        priority=3,
        notify_no_data=True,
        no_data_timeframe=20,  # Minutes without data before a "no data" alert fires.
        renotify_interval=10,  # Minutes to wait before re-notifying on an unresolved alert.
        new_host_delay=300,    # Seconds to wait before evaluating newly reporting hosts.
        include_tags=True,     # Include the triggering tags in the notification title.
    )

    # Export the URL of the monitor in Datadog.
    pulumi.export(
        "datadog_monitor_url",
        pulumi.Output.concat("https://app.datadoghq.com/monitors/", ml_pipeline_monitor.id),
    )

    In this program, we define ml_pipeline_monitor using the Monitor class from the Pulumi Datadog package. When you create a monitor, you provide several parameters: a name for identification, a type (here query alert, for alerts driven by a metric query), a query that defines the metric to monitor and the condition for anomaly detection, and a message conveying what the alert is about when it triggers.
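    The anomaly query itself has several knobs: anomalies() takes the algorithm ('basic', 'agile', or 'robust') and the width of the expected band in deviations, while the evaluation window is set by the avg(last_...) prefix. The variants below are illustrative; the window and deviation values are assumptions you should tune against your metric's actual behavior.

    metric_name = "ml.pipeline.execution.time"

    # 'basic': rolling quantiles; suits metrics with no repeating seasonal pattern.
    basic_query = f"avg(last_4h):anomalies(avg:{metric_name}{{*}}, 'basic', 2) >= 1"

    # 'agile': adjusts quickly to level shifts; suits seasonal metrics that may shift.
    agile_query = f"avg(last_1d):anomalies(avg:{metric_name}{{*}}, 'agile', 3) >= 1"

    # 'robust': stable against short-lived spikes; suits steady seasonal metrics.
    robust_query = f"avg(last_1d):anomalies(avg:{metric_name}{{*}}, 'robust', 2) >= 1"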

    We also include tags to categorize and filter monitors in the Datadog dashboard, and a priority to indicate the monitor's urgency. Additional settings are passed as further arguments: new_host_delay, include_tags, notify_no_data with its no_data_timeframe, and an escalation_message, which is a follow-up message sent if the condition persists.
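    The message string is also where notification routing happens: Datadog expands @-handles into notifications and supports template conditionals, so the alert and recovery cases can notify different channels. A sketch follows; the Slack and PagerDuty handles are hypothetical and must correspond to integrations actually configured in your account.

    # Assumption: @slack-data-science and @pagerduty-ml-oncall are hypothetical
    # handles; replace them with integrations configured in your Datadog account.
    message = """
    {{#is_alert}}
    Anomaly detected in ML pipeline! @slack-data-science @pagerduty-ml-oncall
    {{/is_alert}}
    {{#is_recovery}}
    ML pipeline metric is back within the expected range. @slack-data-science
    {{/is_recovery}}
    """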

    After running the above program with pulumi up, Pulumi will provision a new monitor in Datadog that detects anomalies in your chosen metric with the settings you defined. When an anomaly is detected, you will receive an alert according to the notification settings configured in your Datadog account.