1. Anomaly Detection via Datadog for AI Operations Monitoring


    Anomaly detection is a crucial aspect of AI Operations Monitoring, where you look for patterns in data that do not conform to expected behavior. Datadog is a monitoring service for cloud-scale applications, and it provides capabilities to track anomalies in your operations data. By setting up a monitor in Datadog, you can get alerted if there are any unexpected changes in your metrics that might indicate a problem.

    In Pulumi, using the datadog provider, we can set up a monitor to perform anomaly detection. The datadog.Monitor resource allows you to define the conditions under which you'll receive notifications. Let's create an anomaly detection monitor using Pulumi with Python.

    Here's an example of how you could set up an anomaly monitor for a hypothetical metric called ai.response.time. Assume that this metric represents the response time of an AI service.

    First, you'll use the datadog.Monitor resource to create a new monitor. Datadog has no dedicated anomaly monitor type for metrics, so set type to query alert and request anomaly detection by wrapping the metric in the anomalies() function inside the query. The query property holds the full monitor query: an evaluation window (the past hour in this example), the anomalies() call, and a threshold comparison. Replace ai.response.time with your actual metric name.

    The message property contains the message that will be sent out when an anomaly is detected. The message often includes a directive to notify a user or service (e.g., @user to notify a specific user, @pagerduty for a PagerDuty service, etc.), followed by a descriptive message about the alert.
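
    As a rough illustration of the message format described above, the sketch below assembles a message from a summary and a list of notification handles. The helper name build_alert_message and the handle ops-oncall@example.com are hypothetical; the {{#is_alert}} and {{#is_no_data}} blocks are Datadog's conditional template variables, which show a section only for the matching notification type.

```python
def build_alert_message(summary: str, handles: list[str]) -> str:
    """Compose a Datadog monitor message with conditional sections.

    The handles are placeholders; use the users and integrations
    configured in your own Datadog account.
    """
    # Prefix each handle with "@" exactly once.
    mentions = " ".join(f"@{h.lstrip('@')}" for h in handles)
    return (
        "{{#is_alert}}\n"
        f"{summary}\n"
        "{{/is_alert}}\n"
        "{{#is_no_data}}\n"
        "No data has been received for this metric.\n"
        "{{/is_no_data}}\n"
        f"{mentions}"
    )


message = build_alert_message(
    "An anomaly has been detected in the AI operations response time.",
    ["pagerduty", "ops-oncall@example.com"],  # hypothetical handles
)
print(message)
```

    The resulting string can be passed as the message argument of datadog.Monitor.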

    Now, let's go ahead and create the Pulumi program to set up this monitor:

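        import pulumi
        import pulumi_datadog as datadog

        # Define an anomaly detection monitor for AI operations monitoring.
        anomaly_monitor = datadog.Monitor(
            "ai-ops-anomaly-detection",
            # Metric anomaly monitors use the "query alert" type.
            type="query alert",
            # Alert when ai.response.time leaves the band predicted by the
            # 'basic' algorithm (bounds of 2) over the past hour. Verify the
            # query in the Datadog monitor editor before deploying.
            query="avg(last_1h):anomalies(avg:ai.response.time{environment:production}, 'basic', 2) >= 1",
            name="AI Ops Anomaly Detection",
            message="Notice: An anomaly has been detected in the AI operations response time @pagerduty",
            # Anomaly monitors need trigger and recovery windows.
            monitor_threshold_windows=datadog.MonitorMonitorThresholdWindowsArgs(
                trigger_window="last_15m",
                recovery_window="last_15m",
            ),
            priority=3,
            tags=["ai-ops", "anomaly-detection"],
            notify_no_data=True,   # Send a notification if there is no data.
            no_data_timeframe=20,  # Minutes to wait before a no-data notification is sent.
        )

        # Export the ID of the monitor.
        pulumi.export("anomaly_monitor_id", anomaly_monitor.id)

    • type: The type of monitor. Anomaly detection on a metric uses the query alert type; the anomaly logic lives in the query.
    • query: The monitor's query. The leading avg(last_1h) sets the evaluation window, and the anomalies function performs the anomaly detection. Its first parameter is the metric to analyze, followed by the algorithm (set to 'basic' here) and the number 2, which is the width of the expected band and can be read as a number of standard deviations. The trailing >= 1 is the threshold comparison that triggers the alert. The original .as_count() modifier is dropped because it applies to count and rate metrics, not to a response-time value.
    • name: The name of the monitor; this can be anything you find descriptive.
    • message: Contains instructions for who to notify and what message to send when the alert is triggered.
    • monitor_threshold_windows: Required for anomaly monitors. trigger_window is how long the metric must be anomalous before alerting, and recovery_window is how long it must be normal before the alert recovers. The last_15m values are example settings.
    • priority: This indicates the importance of the monitor, where 1 is the highest and 5 is the lowest.
    • tags: These are useful for categorizing and filtering monitors in Datadog.
    • notify_no_data: This option will send a notification if no data is being received.
    • no_data_timeframe: This option specifies the number of minutes to wait before a no_data notification is sent.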
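
    The pieces of an anomaly query can be assembled programmatically. The helper below is a minimal sketch, not part of the Pulumi or Datadog SDKs: the function name anomaly_query and its defaults are assumptions chosen for illustration. It builds a query string in the form Datadog expects for a metric anomaly monitor and rejects algorithm names other than basic, agile, and robust.

```python
ALGORITHMS = {"basic", "agile", "robust"}  # Datadog's anomaly algorithms


def anomaly_query(
    metric: str,
    scope: str = "*",
    algorithm: str = "basic",
    bounds: int = 2,
    window: str = "last_1h",
    aggregator: str = "avg",
) -> str:
    """Build a Datadog anomaly monitor query string (hypothetical helper)."""
    if algorithm not in ALGORITHMS:
        raise ValueError(f"unknown algorithm: {algorithm!r}")
    if bounds <= 0:
        raise ValueError("bounds must be positive")
    # Doubled braces emit the literal { } that surround the tag scope.
    return (
        f"{aggregator}({window}):anomalies("
        f"{aggregator}:{metric}{{{scope}}}, '{algorithm}', {bounds}) >= 1"
    )


query = anomaly_query("ai.response.time", scope="environment:production")
print(query)
```

    The returned string can be passed as the query argument of datadog.Monitor. Raising bounds widens the expected band and makes the monitor less sensitive.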

    This Pulumi program will create a Datadog monitor that will alert you if ai.response.time becomes anomalous. You can customize this to track different metrics or to adjust the sensitivity of your anomaly detection.
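
    To build intuition for how the sensitivity setting behaves, here is a toy detector in plain Python. It is not Datadog's algorithm; it is a simplified stand-in that flags a point when it lies more than a given number of standard deviations from the mean of the preceding window. The sample response times are made up for illustration.

```python
from statistics import mean, stdev


def flag_anomalies(values, window=5, bounds=2.0):
    """Return indices of points far from the trailing-window mean.

    Simplified illustration only; Datadog's 'basic' algorithm differs.
    """
    flagged = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        center = mean(history)
        spread = stdev(history)
        # Flag the point if it falls outside center +/- bounds * spread.
        if spread > 0 and abs(values[i] - center) > bounds * spread:
            flagged.append(i)
    return flagged


# Made-up response times (ms): a small bump at index 5, a spike at index 9.
response_times = [100, 102, 98, 101, 99, 104, 100, 101, 99, 180]

sensitive = flag_anomalies(response_times, bounds=2.0)
tolerant = flag_anomalies(response_times, bounds=3.0)
print(sensitive, tolerant)
```

    With the narrower band the small bump is flagged along with the spike; with the wider band only the spike remains. The bounds argument of anomalies() trades sensitivity against noise in a similar way.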