Monitoring AI Workload Throughput with NewRelic NRQLAlertCondition

Question

Pulumi · Accepted Answer

Monitoring the throughput of an AI workload helps ensure that your application maintains its performance and meets its service level agreements (SLAs). With New Relic NRQL alert conditions, you can track specific metrics that indicate the health and performance of your AI systems.

New Relic's NRQL (New Relic Query Language) alert conditions let you write a query and trigger notifications based on its results. For example, you might want to be alerted when throughput (such as the number of transactions per minute) falls below a certain threshold or behaves unexpectedly.
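As a rough illustration of what such a throughput query can look like, here is a minimal sketch that assembles one as a plain string. The helper name `build_throughput_query` and its `event_type` parameter are hypothetical, and the sketch assumes your application reports `Transaction` events with an `appName` attribute.

```python
def build_throughput_query(app_name: str, event_type: str = "Transaction") -> str:
    """Return an NRQL query for the per-minute event rate of one application."""
    if not app_name or "'" in app_name:
        # Reject quotes outright rather than guess at NRQL escaping rules.
        raise ValueError("app_name must be non-empty and contain no single quotes")
    return (
        f"SELECT rate(count(*), 1 minute) FROM {event_type} "
        f"WHERE appName = '{app_name}'"
    )


query = build_throughput_query("YourAIApplicationName")
print(query)
```

Building the string in one place keeps the application name and event type easy to change when the same condition is reused for several workloads.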

Below, you will find a Pulumi program written in Python that sets up an NRQL alert condition. This alert condition will monitor the throughput of an AI workload, and it assumes you have already set up a New Relic policy under which this alert condition will be created.

Here's what each part of the code does:

Import Pulumi NewRelic: We import the pulumi_newrelic package which contains the classes and methods required to interact with New Relic resources.
Policy ID: You will need to provide the ID of an existing New Relic alert policy which groups together one or more alert conditions.
NRQL Alert Condition: We define an NRQL alert condition with a specific query that monitors the desired throughput metric, along with the critical threshold that, if crossed, will trigger an alert.
Export: We export the ID of the NRQL alert condition so you can easily reference it, for instance, in the New Relic UI or in further infrastructure as code scripts.

Please replace the placeholder <Your-Policy-ID> with your actual New Relic alert policy ID and modify the NRQL query to match the specifications of your workload and what exactly you want to monitor.

import pulumi
import pulumi_newrelic as newrelic

# Create a New Relic NRQL Alert Condition to monitor the throughput of an AI workload
ai_workload_throughput_alert = newrelic.NrqlAlertCondition(
    "aiWorkloadThroughputAlert",
    policy_id="<Your-Policy-ID>",  # Placeholder: replace with the ID of your existing alert policy
    # The NRQL query to run against your New Relic data. This will need to be structured based on your data schema.
    nrql={
        "query": "SELECT rate(count(*), 1 minute) FROM Transaction WHERE appName = 'YourAIApplicationName'"
    },
    critical={
        "operator": "below",
        "threshold": 50,  # Threshold for the alert, adjust this based on your workload needs
        "threshold_duration": 300,  # Seconds the threshold must be breached; a multiple of the aggregation window (60 seconds by default)
        "threshold_occurrences": "ALL"  # Every data point within threshold_duration must breach the threshold.
    },
    # Time limit, in seconds, after which a long-running incident is closed automatically.
    # 1200 seconds is 20 minutes; raise it if incidents on a variable workload should stay open longer.
    violation_time_limit_seconds=1200
)

# Export the ID of the Alert Condition
pulumi.export("ai_workload_throughput_alert_id", ai_workload_throughput_alert.id)

This Pulumi program gives you the scaffolding needed to start monitoring AI workload throughput with an NRQL alert condition. Consult the New Relic NRQL alert condition documentation for details on each field and the full range of available settings.
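To make the "below" operator and the threshold_occurrences setting more concrete, the following sketch models how a critical term could be evaluated over one window of data points. It is a simplified illustration, not New Relic's actual evaluation engine; the function name `term_is_violated` and the sample values are made up for the example.

```python
from typing import Sequence


def term_is_violated(
    samples: Sequence[float],
    threshold: float,
    occurrences: str = "all",
) -> bool:
    """Simplified model of a 'below' critical term over one evaluation window."""
    if not samples:
        return False  # No data points: this model treats the window as healthy.
    below = [value < threshold for value in samples]
    mode = occurrences.lower()
    if mode == "all":
        return all(below)  # Every data point must be under the threshold.
    if mode == "at_least_once":
        return any(below)  # A single low data point is enough.
    raise ValueError(f"unsupported occurrences value: {occurrences!r}")


# Five one-minute data points of transactions per minute.
steady_drop = [42, 38, 45, 40, 31]
brief_dip = [60, 48, 72, 65, 58]

print(term_is_violated(steady_drop, threshold=50))  # True
print(term_is_violated(brief_dip, threshold=50))  # False
print(term_is_violated(brief_dip, threshold=50, occurrences="at_least_once"))  # True
```

With "all", a single dip below the threshold does not open an incident, which tends to suit workloads whose throughput fluctuates from minute to minute.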

Adjust parameters such as the threshold, threshold_duration, and violation_time_limit_seconds based on your specific workload's behavior and requirements. Make sure that your New Relic instrumentation is correctly reporting the relevant data that your NRQL query will be operating on.
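When tuning these values, a small sanity check before deployment can catch settings the provider is likely to reject. The sketch below is a hypothetical helper, `check_alert_settings`; the limits it encodes (60 to 86400 seconds for threshold_duration on static conditions, a multiple of the aggregation window, and 300 to 2592000 seconds for violation_time_limit_seconds) reflect my reading of the provider documentation and should be verified against the current version.

```python
from typing import List


def check_alert_settings(
    threshold_duration: int,
    violation_time_limit_seconds: int,
    aggregation_window: int = 60,
) -> List[str]:
    """Return a list of human-readable problems; an empty list means no issues found."""
    problems: List[str] = []
    # Assumed limit for static conditions: 60 seconds to 24 hours.
    if not 60 <= threshold_duration <= 86400:
        problems.append("threshold_duration should be between 60 and 86400 seconds")
    # Assumed rule: the duration must span a whole number of aggregation windows.
    if threshold_duration % aggregation_window != 0:
        problems.append(
            f"threshold_duration should be a multiple of {aggregation_window} seconds"
        )
    # Assumed limit: 5 minutes to 30 days.
    if not 300 <= violation_time_limit_seconds <= 2592000:
        problems.append(
            "violation_time_limit_seconds should be between 300 and 2592000 seconds"
        )
    return problems


for problem in check_alert_settings(threshold_duration=5, violation_time_limit_seconds=1200):
    print(problem)
```

Running the check on a duration of 5 reports two problems, which is a hint that the value was probably meant as minutes rather than seconds.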