Datadog SLOs for AI Service Quality Assurance
Service Level Objectives (SLOs) are a key part of monitoring and managing the quality of services, including AI-powered services. SLOs help you define and track the reliability, performance, and other key metrics that matter to your users and stakeholders.
Here, I'll show you how to define SLOs for an AI service using Datadog with Pulumi in Python. We'll use the `datadog.ServiceLevelObjective` resource, which lets you declare an SLO that tracks the quality of your service over a specified time frame. An SLO can be based on a metric query, which typically tracks the ratio of "good" events to total events, or on one or more monitors, which is the approach used in this example.

To begin, we need to define a few terms relevant to SLOs:
- SLO: A Service Level Objective, which specifies a target level of reliability/availability/performance for a service.
- SLI: A Service Level Indicator, which is the metric you use to measure the performance or reliability of your service.
- Error budget: The maximum amount of time that a service can be unavailable or underperforming before it's considered out of compliance with its SLO.
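To make the error-budget definition concrete, here is a minimal sketch of the arithmetic for an availability target. The function names are illustrative and not part of Datadog or Pulumi, and the sketch assumes the budget is measured purely in minutes of downtime over the window.

```python
def error_budget_minutes(target_pct: float, window_days: int) -> float:
    """Return the minutes of downtime allowed by an availability target."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (100.0 - target_pct) / 100.0


def budget_remaining_pct(target_pct: float, window_days: int,
                         downtime_minutes: float) -> float:
    """Return the share of the error budget still unspent, as a percentage."""
    budget = error_budget_minutes(target_pct, window_days)
    if budget == 0:
        raise ValueError("a 100% target leaves no error budget")
    return 100.0 * (budget - downtime_minutes) / budget


if __name__ == "__main__":
    # A 99.9% target over 30 days allows roughly 43.2 minutes of downtime.
    print(f"{error_budget_minutes(99.9, 30):.1f} minutes")
```

A negative result from `budget_remaining_pct` means the budget is overspent and the SLO has been missed for that window.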
A monitor-based SLO in Datadog is measured against one or more monitors, which you need to set up in advance to track the uptime, response time, or error rate of your AI service.
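Conceptually, a monitor-based uptime SLI is the share of the window during which the monitor was healthy. The sketch below illustrates that idea only. It is a simplification rather than Datadog's actual calculation, and the state labels and the `monitor_uptime_pct` helper are hypothetical.

```python
from typing import Iterable, Tuple


def monitor_uptime_pct(intervals: Iterable[Tuple[str, float]]) -> float:
    """Compute uptime from (state, duration_minutes) pairs.

    Only intervals labeled "OK" count as good time; every other
    state is treated as downtime in this simplified model.
    """
    total = 0.0
    ok = 0.0
    for state, minutes in intervals:
        total += minutes
        if state == "OK":
            ok += minutes
    if total == 0:
        raise ValueError("no monitor history in the window")
    return 100.0 * ok / total


if __name__ == "__main__":
    # 30 days of history with a single 30-minute alert.
    history = [("OK", 20000.0), ("ALERT", 30.0), ("OK", 23170.0)]
    print(f"{monitor_uptime_pct(history):.3f}%")
```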
Below is a Pulumi program written in Python that creates a Datadog Service Level Objective (SLO). In this example, the SLO monitors the uptime of your AI service and checks that it meets a specific availability target over a 30-day rolling window.

    import pulumi
    import pulumi_datadog as datadog

    # Define a Datadog SLO for the AI Service Quality Assurance
    ai_service_slo = datadog.ServiceLevelObjective("aiServiceSlo",
        # Name of the SLO
        name="AI Service Uptime",
        # Type of SLO, which in this case is based on a monitor
        type="monitor",
        # Timeframe over which to measure SLO compliance, which could be 7 days, 30 days, et cetera
        timeframe="30d",
        # ID(s) of the monitor(s) that measure the service uptime
        # For this you need to have a Datadog monitor created that tracks uptime for the AI service
        monitor_ids=[123456789],  # Replace with the actual monitor ID
        # SLO thresholds; in this case, we are targeting 99.9% uptime
        thresholds=[
            datadog.ServiceLevelObjectiveThresholdArgs(
                timeframe="30d",
                target=99.9,    # The target percentage for uptime over the timeframe
                warning=99.95,  # A warning level above the target, reached before the target is breached
            )
        ],
        # Tags can be used to filter and group SLOs in Datadog
        tags=[
            "env:production",
            "team:ai",
        ],
        # An optional description of the SLO
        description="SLO to ensure AI Service maintains 99.9% uptime over a 30-day rolling window.",
    )

    # Export the SLO ID to reference it elsewhere, and for creating dashboards or alerts based on it
    pulumi.export("ai_service_slo_id", ai_service_slo.id)

This Pulumi program defines an SLO named "AI Service Uptime" that tracks the uptime of an AI service by referring to a specific Datadog monitor. It does so by creating a `ServiceLevelObjective` resource. The `timeframe` parameter is set to `30d`, meaning we want the AI service to maintain the target level of uptime over the past 30 days. The SLO is set to 99.9% uptime (`target=99.9`), with a warning threshold (`warning=99.95`) that sits above the target so that it flags a declining SLI before the service breaches the target itself. Crossing the warning threshold indicates that the AI service is experiencing more downtime than expected and that action is needed before compliance with the SLO is affected. The `monitor_ids` argument of the `ServiceLevelObjective` constructor should contain the ID(s) of the Datadog monitor(s) that measure your AI service's uptime.

Finally, the program exports the SLO's ID as a stack output, which can be useful for building dashboards, setting up alerts, or referencing the SLO in other parts of your infrastructure as code.
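The relationship between the measured uptime, the warning threshold, and the target can be sketched as a small classifier. This is an illustrative helper, not Datadog's implementation. It assumes the warning threshold is set above the target, so that the warning state is reached first as uptime declines, and the state names are hypothetical.

```python
def slo_state(sli_pct: float, target: float, warning: float) -> str:
    """Classify a measured SLI against a target and a warning threshold."""
    # Assumption: a useful warning threshold sits above the target.
    if warning <= target:
        raise ValueError("warning must be greater than target")
    if sli_pct < target:
        return "breached"
    if sli_pct < warning:
        return "warning"
    return "ok"


if __name__ == "__main__":
    # Illustrative thresholds: 99.9% target with a 99.95% warning level.
    for measured in (99.97, 99.93, 99.5):
        print(measured, slo_state(measured, target=99.9, warning=99.95))
```

An uptime of 99.93% falls between the two thresholds, so the SLO is still met but its error budget is running low.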
Remember to replace `123456789` with the actual ID of the monitor that you are using to track your AI service's uptime.

Note that before running this program, you'll need to have the Datadog provider configured with your API and app keys, and you need to have the corresponding monitors set up in Datadog.
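As a pre-flight check, a short script can confirm that credentials are present before you run the program. The sketch below assumes the keys are supplied through the environment variables `DATADOG_API_KEY` and `DATADOG_APP_KEY`. Treat those names as an assumption and adjust them to match your setup. The check does not apply if you store the keys in Pulumi stack configuration instead, and `missing_credentials` is a hypothetical helper.

```python
import os
from typing import List, Mapping, Sequence

# Assumed variable names; change them if your provider setup differs.
DEFAULT_REQUIRED = ("DATADOG_API_KEY", "DATADOG_APP_KEY")


def missing_credentials(env: Mapping[str, str],
                        required: Sequence[str] = DEFAULT_REQUIRED) -> List[str]:
    """Return the names of required variables that are unset or empty."""
    return [name for name in required if not env.get(name)]


if __name__ == "__main__":
    missing = missing_credentials(os.environ)
    if missing:
        print("Missing credentials: " + ", ".join(missing))
    else:
        print("Datadog credentials found in the environment.")
```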