AI Infrastructure Health Alerts using Datadog
To set up infrastructure health alerts using Datadog with Pulumi, you will create a program that defines a Datadog monitor. A monitor can watch a variety of metrics or integrations and trigger notifications based on specified conditions. For example, you can create an alert to notify you if your server's CPU usage goes above a certain threshold for a specific period.
Below is a Pulumi program that demonstrates how to create a new monitor in Datadog which will trigger an alert if the average CPU usage goes over 80% on any host over the last 5 minutes.
Preparing the Environment
Before you begin, you should have:
- A Datadog account - You will need your Datadog API and application keys to authenticate with their service.
- Pulumi CLI installed - This will execute your code and provision the infrastructure.
- Python and Pulumi's Python SDK installed.
Structure of the Program
Now let's dive into the code:
- Imports: Import the necessary Pulumi and Datadog Python packages.
- Configuration: Set up your Datadog provider instance.
- Monitor Definition: Define your monitor, which includes the metric to check, the threshold, and the type of alert.
I'll now show you a Python program that sets up such a monitor.
    import pulumi
    import pulumi_datadog as datadog

    # Read the Datadog API key and app key from Pulumi config as secrets
    # so they never appear in the source code.
    config = pulumi.Config("datadog")
    datadog_provider = datadog.Provider(
        "datadog_provider",
        api_key=config.require_secret("api_key"),
        app_key=config.require_secret("app_key"),
    )

    # Define a metric monitor that alerts when a host's average user CPU
    # usage over the last 5 minutes exceeds 80%.
    cpu_usage_monitor = datadog.Monitor(
        "cpu_usage_monitor",
        type="metric alert",
        query="avg(last_5m):avg:system.cpu.user{*} by {host} > 80",
        name="High CPU usage alert on host",
        message="@pagerduty - CPU usage on {{host.name}} is over 80%.",
        tags=["environment:production", "team:core-infrastructure"],
        # Thresholds and the other monitor settings are top-level arguments
        # of datadog.Monitor; threshold values are passed as strings.
        monitor_thresholds=datadog.MonitorMonitorThresholdsArgs(critical="80"),
        notify_no_data=True,
        new_host_delay=300,
        evaluation_delay=60,
        no_data_timeframe=5,
        # An explicit provider is attached through resource options.
        opts=pulumi.ResourceOptions(provider=datadog_provider),
    )

    # Export the ID of the monitor to access it later.
    pulumi.export("cpu_usage_monitor_id", cpu_usage_monitor.id)
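The query string in the program follows a fixed pattern: time aggregation and window, space aggregation, metric, scope, optional grouping, comparator, and threshold. As a minimal sketch, a small helper can assemble it from its parts so that the threshold in the query and the monitor's critical threshold come from one value. The name build_metric_query and its parameters are hypothetical and not part of Pulumi or Datadog; the helper only builds a string.

```python
def build_metric_query(metric, threshold, window="last_5m", scope="*",
                       group_by=("host",), time_agg="avg", space_agg="avg",
                       comparator=">"):
    """Assemble a Datadog metric-alert query string (hypothetical helper)."""
    # Optional grouping clause, e.g. " by {host}"; omitted when group_by is empty.
    grouping = " by {" + ",".join(group_by) + "}" if group_by else ""
    # Pattern: <time_agg>(<window>):<space_agg>:<metric>{<scope>}<grouping> <cmp> <threshold>
    return (
        f"{time_agg}({window}):{space_agg}:{metric}"
        + "{" + scope + "}"
        + f"{grouping} {comparator} {threshold}"
    )


# One value feeds both the query and the critical threshold.
CPU_CRITICAL = 80
query = build_metric_query("system.cpu.user", CPU_CRITICAL)
print(query)  # avg(last_5m):avg:system.cpu.user{*} by {host} > 80
```

The resulting string can be passed as the query argument of the monitor, with the same CPU_CRITICAL value reused for its critical threshold.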
Explanation:
- Monitor: The datadog.Monitor resource creates a monitor that triggers when the condition defined in the query parameter is met. Here, the query asks Datadog to alert when the average user CPU usage (system.cpu.user) over the past 5 minutes is above 80%. The scope {*} covers every reporting host, and by {host} evaluates each host separately, so a single busy host is enough to trigger the alert.
- Options: The monitor also sets several additional parameters:
  - Critical threshold: Sets the critical threshold for the alert to 80, matching the value used in the query.
  - notify_no_data: Notifies if the metric stops reporting data.
  - new_host_delay: Delays evaluation for new hosts for 5 minutes (300 seconds).
  - evaluation_delay: Delays evaluation by 60 seconds so that late-arriving data points are available before the monitor evaluates.
  - no_data_timeframe: The number of minutes without data before the monitor sends a no-data notification. Datadog recommends at least twice the evaluation window for metric alerts, so a value larger than 5 may suit a 5-minute window better.
- Message: The message field defines the content of the notification that is sent when the monitor triggers. It supports template variables such as {{host.name}} as well as raw strings.
- Tags: Tags like environment:production and team:core-infrastructure provide at-a-glance information and allow filtering and grouping of monitors within Datadog.
- Provider: The datadog.Provider instance configures the credentials for the Datadog API. The api_key and app_key values are fetched from Pulumi's configuration system, which can store them as encrypted secrets.
- Export: At the end of the program, we export the ID of the monitor. This is useful if you need to reference this monitor in other infrastructure configurations or in alerts.
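The message field described above can also use Datadog's conditional template blocks, such as {{#is_alert}}, {{#is_no_data}}, and {{#is_recovery}}, so that each kind of notification carries different text. The following is a minimal sketch: build_message and its parameters are hypothetical names, and the wording of each section is only an illustration.

```python
def build_message(label, threshold, handles):
    """Build a Datadog notification message with conditional sections.

    Hypothetical helper: it only concatenates strings and makes no API calls.
    """
    sections = [
        # Shown when the monitor crosses the critical threshold.
        "{{#is_alert}}" + label + " on {{host.name}} is over "
        + str(threshold) + "%.{{/is_alert}}",
        # Shown when the monitor stops receiving data (relevant with notify_no_data).
        "{{#is_no_data}}No " + label
        + " data received from {{host.name}}.{{/is_no_data}}",
        # Shown when the monitor returns to a normal state.
        "{{#is_recovery}}" + label
        + " on {{host.name}} is back to normal.{{/is_recovery}}",
    ]
    # Notification handles such as "@pagerduty" go on the last line.
    return "\n".join(sections + [" ".join(handles)])


message = build_message("CPU usage", 80, ["@pagerduty"])
print(message)
```

The resulting string can be passed as the message argument of datadog.Monitor in place of a single fixed sentence.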
After setting up your Pulumi program, deploy it using the Pulumi CLI by running pulumi up. Once deployed, your new monitor will be active in Datadog, and you'll be notified when the defined conditions are met. The program reads the Datadog API and application keys from Pulumi configuration instead of hard-coding them, so store both as secrets (for example with pulumi config set --secret) before deploying, and manage them securely.