Monitoring Large-Scale ML Workloads with Datadog
Datadog is a monitoring service for cloud-scale applications, providing observability for servers, databases, tools, and services through a SaaS-based data analytics platform. To monitor large-scale Machine Learning (ML) workloads with Datadog, we can use two main resources provided by the Datadog provider in Pulumi:
- `datadog.Monitor`: Creates and manages Datadog monitors, which notify us when certain conditions are met in the metrics we're tracking. Monitors can be set up with various types, such as anomaly, metric alert, or integration, depending on the requirement (see the anomaly sketch after this list).
- `datadog.MetricMetadata`: Defines the metadata for a specific metric, which helps describe and give more context to the metric being monitored.
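For instance, the same `datadog.Monitor` resource can express an anomaly monitor instead of a fixed-threshold alert. The following is a minimal, hypothetical sketch: it reuses the `ml_model.inference_count` metric from this guide and wraps the query in Datadog's `anomalies()` function; the resource name, algorithm, and bounds are illustrative and should be tuned for your own data.

```python
import pulumi_datadog as datadog

# Hypothetical anomaly monitor: alerts when inference counts deviate
# from their learned pattern rather than crossing a fixed threshold.
anomaly_monitor = datadog.Monitor("mlModelInferenceAnomalyMonitor",
    name="ML Model Inference Anomaly Monitor",
    type="query alert",
    # anomalies(<query>, <algorithm>, <bounds>) flags values outside the
    # expected band; 'basic' is the simplest built-in algorithm.
    query="avg(last_4h):anomalies(avg:ml_model.inference_count{*}, 'basic', 2) >= 1",
    message="Inference volume is anomalous for the ML model.",
    tags=["environment:production", "team:data-science"],
)
```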
Below, we'll define a simple Pulumi program that sets up a monitor for a hypothetical metric, `ml_model.inference_count`, which represents the number of inferences being made by an ML model. This monitor will notify us when the number of inferences exceeds a certain threshold, indicating high ML workload. We'll also include the metric metadata for this metric. Before we dive into the code, here's what we need to do step by step:
- Import the `pulumi_datadog` module.
- Set up `MetricMetadata` for our ML metric to ensure it's well-defined in Datadog.
- Create a `Monitor` that watches the metric, specifies the conditions under which an alert is triggered, and defines the message to be sent out.
Remember to have your Datadog API and Application keys configured as environment variables (`DATADOG_API_KEY` and `DATADOG_APP_KEY`) so Pulumi can authenticate with your Datadog account.
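If you prefer not to rely on environment variables, the provider can also be configured explicitly. The following is a minimal sketch, assuming you've stored the keys as Pulumi secrets under the hypothetical config names `apiKey` and `appKey`:

```python
import pulumi
import pulumi_datadog as datadog

# Hypothetical alternative to environment variables: read the keys from
# Pulumi secret config and pass them to an explicit provider instance.
config = pulumi.Config("datadog")
dd_provider = datadog.Provider("dd-provider",
    api_key=config.require_secret("apiKey"),
    app_key=config.require_secret("appKey"),
)

# Resources that should use this provider pass it via ResourceOptions:
# opts=pulumi.ResourceOptions(provider=dd_provider)
```

Here is the Pulumi program to accomplish our goal: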
```python
import pulumi
import pulumi_datadog as datadog

# Define the MetricMetadata for the 'ml_model.inference_count' metric
ml_model_inference_count_metadata = datadog.MetricMetadata("mlModelInferenceCountMetadata",
    metric="ml_model.inference_count",
    description="Tracks the number of inferences made by the ML model",
    short_name="inference_count",
    unit="count",
    per_unit="minute",
    type="gauge",
)

# Define the Monitor for the 'ml_model.inference_count' metric
ml_model_inference_monitor = datadog.Monitor("mlModelInferenceMonitor",
    name="ML Model Inference Count Monitor",
    type="query alert",
    message="""{{#is_alert}}
High inference load detected on ML model! Value: {{value}}
{{/is_alert}}
{{#is_recovery}}
Inference load back to normal levels. Value: {{value}}
{{/is_recovery}}""",
    # Alert when the per-host average inference count over the last
    # 5 minutes exceeds 100.
    query="avg(last_5m):avg:ml_model.inference_count{*} by {host} > 100",
    tags=["environment:production", "team:data-science"],
    notify_no_data=False,
    new_host_delay=300,
)

# Export key values which can be used to identify resources and their states outside of Pulumi
pulumi.export("metric_metadata_id", ml_model_inference_count_metadata.id)
pulumi.export("monitor_id", ml_model_inference_monitor.id)
```
In this code:
- The `MetricMetadata` resource describes the `ml_model.inference_count` metric, indicating that it's a gauge that measures the count of inferences per minute, with an easily identifiable short name.
- The `Monitor` resource sets up a query alert that evaluates the average inference count per host over the last five minutes and triggers when it exceeds 100, sending the templated alert and recovery messages defined in `message`.
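For this monitor to have anything to evaluate, the ML service must actually emit the metric. As an illustrative sketch (assuming the official `datadog` Python client and a DogStatsD agent listening on the default local endpoint), the inference service could report the gauge like this:

```python
from datadog import initialize, statsd

# Assumes a Datadog agent with DogStatsD running on the default
# localhost:8125 endpoint.
initialize(statsd_host="127.0.0.1", statsd_port=8125)

def record_inference_count(count: int) -> None:
    # Report the current number of inferences as a gauge, matching the
    # metric name the monitor watches.
    statsd.gauge("ml_model.inference_count", count, tags=["environment:production"])

record_inference_count(42)
```

Once both sides are in place, running `pulumi up` creates the metadata and the monitor, and Datadog starts alerting as soon as the threshold condition is met.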