1. Monitoring Large-Scale ML Workloads with Datadog


    Datadog is a SaaS monitoring and analytics platform for cloud-scale applications, covering servers, databases, tools, and services. To monitor large-scale Machine Learning (ML) workloads with Datadog, we can use two main resources from Pulumi's Datadog provider:

    1. datadog.Monitor: Creates and manages a Datadog monitor, which notifies us when the metrics we're tracking meet certain conditions. Monitors come in several types, such as metric alerts, anomaly detection, and integration monitors, so we can pick the one that fits the requirement.

    2. datadog.MetricMetadata: Defines the metadata for a specific metric (for example its description, unit, and type), which gives more context to the metric being monitored.

    Below, we'll define a simple Pulumi program that sets up a monitor for a hypothetical metric, ml_model.inference_count, which represents the number of inferences made by an ML model. The monitor notifies us when the five-minute average of that metric rises above a threshold (100 in this example), indicating a high ML workload. We'll also define the metadata for this metric.
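
    To make the alert condition concrete, here is a minimal sketch of the comparison the monitor is meant to perform: average the samples in the evaluation window and compare the result against the threshold. The function name and the sample values are hypothetical, and this only approximates the idea; Datadog evaluates the query on its own side.

```python
from statistics import mean
from typing import Sequence


def breaches_threshold(samples: Sequence[float], threshold: float = 100.0) -> bool:
    """Return True when the window average is strictly above the threshold.

    `samples` stands in for the gauge values reported during the evaluation
    window (for example, the last five minutes). An empty window is treated
    as "no alert", which is an assumption made for this sketch.
    """
    if not samples:
        return False
    return mean(samples) > threshold


# A sustained rise pushes the average over the threshold...
sustained = [90, 95, 110, 120, 130]
# ...while a single spike is absorbed by the averaging.
single_spike = [80, 90, 150, 85, 90]

print(breaches_threshold(sustained))
print(breaches_threshold(single_spike))
```

    Averaging over the window is what keeps a brief burst of inferences from paging anyone, while a sustained increase still triggers the alert.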

    Before we dive into the code, here's what we need to do step-by-step:

    • Import the pulumi_datadog module.
    • Set up MetricMetadata for our ML metric to ensure it's well-defined in Datadog.
    • Create a Monitor that watches the metric and specifies both the conditions under which an alert is triggered and the message to send.

    Remember to have your Datadog API and Application keys configured as environment variables (DATADOG_API_KEY and DATADOG_APP_KEY) for Pulumi to authenticate with your Datadog account.
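
    As a quick pre-flight check, a small helper like the one below can confirm that both variables are present before running the program. This is only a sketch: the helper name missing_datadog_keys is hypothetical, and it checks that the variables are set, not that the keys are valid.

```python
import os
from typing import List, Mapping

# The two environment variables named above.
REQUIRED_KEYS = ("DATADOG_API_KEY", "DATADOG_APP_KEY")


def missing_datadog_keys(env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the names of required variables that are unset or blank."""
    return [name for name in REQUIRED_KEYS if not env.get(name, "").strip()]


missing = missing_datadog_keys()
if missing:
    print("Missing Datadog credentials: " + ", ".join(missing))
else:
    print("Datadog credentials found in the environment.")
```

    Passing a plain dictionary instead of os.environ makes the helper easy to exercise without touching the real environment.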

    Here is the Pulumi program to accomplish our goal:

    import pulumi
    import pulumi_datadog as datadog

    # Define the MetricMetadata for the 'ml_model.inference_count' metric
    ml_model_inference_count_metadata = datadog.MetricMetadata(
        "mlModelInferenceCountMetadata",
        metric="ml_model.inference_count",
        description="Tracks the number of inferences made by the ML model",
        short_name="inference_count",
        unit="count",
        per_unit="minute",
        type="gauge",
    )

    # Define the Monitor for the 'ml_model.inference_count' metric
    ml_model_inference_monitor = datadog.Monitor(
        "mlModelInferenceMonitor",
        name="ML Model Inference Count Monitor",
        type="query alert",
        # Adjacent string literals are joined, so the message stays free of
        # stray indentation.
        message=(
            "{{#is_alert}}\n"
            "High inference load detected on ML model! Value: {{value}}\n"
            "{{/is_alert}}\n"
            "{{#is_recovery}}\n"
            "Inference load back to normal levels. Value: {{value}}\n"
            "{{/is_recovery}}"
        ),
        # Alert when the 5-minute average, grouped by host, exceeds 100.
        # The scope ({*}) and the "by {host}" grouping are separate clauses.
        query="avg(last_5m):avg:ml_model.inference_count{*} by {host} > 100",
        tags=["environment:production", "team:data-science"],
        notify_no_data=False,
        # Seconds to wait before evaluating a newly reporting host
        new_host_delay=300,
    )

    # Export key values that identify the resources outside of Pulumi
    pulumi.export("metric_metadata_id", ml_model_inference_count_metadata.id)
    pulumi.export("monitor_id", ml_model_inference_monitor.id)
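
    A monitor query is easy to get subtly wrong when typed by hand, so it can help to assemble it from its parts. The sketch below assumes the usual Datadog layout for metric monitor queries, time_aggr(window):space_aggr:metric{scope} by {group} comparator threshold. The helper build_threshold_query and its parameter names are hypothetical and not part of pulumi_datadog.

```python
from typing import Sequence, Union


def build_threshold_query(
    metric: str,
    threshold: Union[int, float],
    window: str = "last_5m",
    time_aggr: str = "avg",
    space_aggr: str = "avg",
    scope: str = "*",
    group_by: Sequence[str] = ("host",),
    comparator: str = ">",
) -> str:
    """Compose a metric monitor query string from its individual parts."""
    # Doubled braces in the f-string produce the literal { } around the scope.
    query = f"{time_aggr}({window}):{space_aggr}:{metric}{{{scope}}}"
    if group_by:
        query += " by {" + ",".join(group_by) + "}"
    return f"{query} {comparator} {threshold}"


print(build_threshold_query("ml_model.inference_count", 100))
```

    The returned string can be passed as the query argument of datadog.Monitor, which keeps the threshold and grouping in one place if several monitors share the same shape.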

    In this code:

    • The MetricMetadata resource describes the ml_model.inference_count metric, indicating that it's a gauge that measures the count of inferences per minute with an easily identifiable short name.
    • The Monitor resource sets up a