Monitoring Large-Scale ML Workloads with Datadog
Datadog is a monitoring service for cloud-scale applications, providing observability for servers, databases, tools, and services through a SaaS-based data analytics platform. To monitor large-scale Machine Learning (ML) workloads with Datadog, we can use two main resources provided by the Datadog provider in Pulumi:
- `datadog.Monitor`: Creates and manages Datadog monitors, which notify us when certain conditions are met in the metrics we're tracking. Monitors can be set up with various types, such as anomaly, metric alert, or integration, depending on the requirement (see the anomaly sketch after this list).
- `datadog.MetricMetadata`: Defines the metadata for a specific metric, which helps describe and give more context to the metric being monitored.
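For instance, the same `datadog.Monitor` resource can express an anomaly monitor instead of a fixed-threshold alert. The following is a minimal, hypothetical sketch: it reuses the `ml_model.inference_count` metric from this guide and wraps the query in Datadog's `anomalies()` function; the resource name, algorithm, and bounds are illustrative and should be tuned for your own data.

```python
import pulumi_datadog as datadog

# Hypothetical anomaly monitor: alerts when inference counts deviate
# from their learned pattern rather than crossing a fixed threshold.
anomaly_monitor = datadog.Monitor("mlModelInferenceAnomalyMonitor",
    name="ML Model Inference Anomaly Monitor",
    type="query alert",
    # anomalies(<query>, <algorithm>, <bounds>) flags values outside the
    # expected band; 'basic' is the simplest built-in algorithm.
    query="avg(last_4h):anomalies(avg:ml_model.inference_count{*}, 'basic', 2) >= 1",
    message="Inference volume is anomalous for the ML model.",
    tags=["environment:production", "team:data-science"],
)
```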
Below, we'll define a simple Pulumi program that sets up a monitor for a hypothetical metric, `ml_model.inference_count`, which represents the number of inferences being made by an ML model. This monitor will notify us when the number of inferences exceeds a certain threshold, indicating high ML workload. We'll also include the metric metadata for this metric. Before we dive into the code, here's what we need to do step by step:
- Import the `pulumi_datadog` module.
- Set up `MetricMetadata` for our ML metric to ensure it's well-defined in Datadog.
- Create a `Monitor` that watches the metric, specifies the conditions under which an alert is triggered, and defines the message to be sent out.
Remember to have your Datadog API and Application keys configured as environment variables (`DATADOG_API_KEY` and `DATADOG_APP_KEY`) so Pulumi can authenticate with your Datadog account.
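If you prefer not to rely on environment variables, the provider can also be configured explicitly. The following is a minimal sketch, assuming you've stored the keys as Pulumi secrets under the hypothetical config names `apiKey` and `appKey`:

```python
import pulumi
import pulumi_datadog as datadog

# Hypothetical alternative to environment variables: read the keys from
# Pulumi secret config and pass them to an explicit provider instance.
config = pulumi.Config("datadog")
dd_provider = datadog.Provider("dd-provider",
    api_key=config.require_secret("apiKey"),
    app_key=config.require_secret("appKey"),
)

# Resources that should use this provider pass it via ResourceOptions:
# opts=pulumi.ResourceOptions(provider=dd_provider)
```

Here is the Pulumi program to accomplish our goal: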
```python
import pulumi
import pulumi_datadog as datadog

# Define the MetricMetadata for the 'ml_model.inference_count' metric
ml_model_inference_count_metadata = datadog.MetricMetadata("mlModelInferenceCountMetadata",
    metric="ml_model.inference_count",
    description="Tracks the number of inferences made by the ML model",
    short_name="inference_count",
    unit="count",
    per_unit="minute",
    type="gauge",
)

# Define the Monitor for the 'ml_model.inference_count' metric
ml_model_inference_monitor = datadog.Monitor("mlModelInferenceMonitor",
    name="ML Model Inference Count Monitor",
    type="query alert",
    message="""{{#is_alert}}
High inference load detected on ML model! Value: {{value}}
{{/is_alert}}
{{#is_recovery}}
Inference load back to normal levels. Value: {{value}}
{{/is_recovery}}""",
    # Alert when the per-host average inference count over the last
    # 5 minutes exceeds 100.
    query="avg(last_5m):avg:ml_model.inference_count{*} by {host} > 100",
    tags=["environment:production", "team:data-science"],
    notify_no_data=False,
    new_host_delay=300,
)

# Export key values which can be used to identify resources and their states outside of Pulumi
pulumi.export("metric_metadata_id", ml_model_inference_count_metadata.id)
pulumi.export("monitor_id", ml_model_inference_monitor.id)
```
In this code:
- The `MetricMetadata` resource describes the `ml_model.inference_count` metric, indicating that it's a gauge that measures the count of inferences per minute, with an easily identifiable short name.
- The `Monitor` resource sets up a query alert that evaluates the average inference count per host over the last five minutes and triggers when it exceeds 100, sending the templated alert and recovery messages defined in `message`.
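For this monitor to have anything to evaluate, the ML service must actually emit the metric. As an illustrative sketch (assuming the official `datadog` Python client and a DogStatsD agent listening on the default local endpoint), the inference service could report the gauge like this:

```python
from datadog import initialize, statsd

# Assumes a Datadog agent with DogStatsD running on the default
# localhost:8125 endpoint.
initialize(statsd_host="127.0.0.1", statsd_port=8125)

def record_inference_count(count: int) -> None:
    # Report the current number of inferences as a gauge, matching the
    # metric name the monitor watches.
    statsd.gauge("ml_model.inference_count", count, tags=["environment:production"])

record_inference_count(42)
```

Once both sides are in place, running `pulumi up` creates the metadata and the monitor, and Datadog starts alerting as soon as the threshold condition is met.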