1. AI Model Inference Latency Tracking with GCP Monitoring

    To track AI model inference latency with Google Cloud Monitoring (GCP Monitoring), we'll set up a custom monitoring solution. This solution will include a custom metric for tracking latency, a dashboard for visualization, and an alert policy to notify you if latency exceeds a certain threshold.

    Here's how you might set this up with Pulumi:

    1. Create a Custom Metric: A gcp.monitoring.MetricDescriptor resource defines the custom metric that records your model inference latency.
    2. Report Latency Data: Your application sends latency data points to Cloud Monitoring using that custom metric (this step happens outside Pulumi; a reporting sketch follows the program below).
    3. Create a Dashboard: A dashboard with a widget to visualize latency over time using gcp.monitoring.Dashboard.
    4. Set up an Alert Policy: An alert policy will trigger if the latency goes above a specified threshold using gcp.monitoring.AlertPolicy.
    5. Notification Channel: Optionally, you can create a gcp.monitoring.NotificationChannel to receive alerts.

    Below is the Pulumi program that sets up the monitoring infrastructure described:

    import json

    import pulumi
    import pulumi_gcp as gcp

    # Step 1: Create a custom metric descriptor for tracking AI model inference latency.
    inference_latency_metric = gcp.monitoring.MetricDescriptor("inference_latency_metric",
        type="custom.googleapis.com/inference_latency",
        metric_kind="GAUGE",
        value_type="DOUBLE",
        unit="ms",
        labels=[{
            "key": "model_name",
            "value_type": "STRING",
            "description": "The name of the AI model",
        }],
        display_name="AI Model Inference Latency",
        description="The latency of AI model inferences in milliseconds",
    )

    # Step 2: Your application sends latency data points to Cloud Monitoring using this
    # custom metric (see the reporting sketch after this program).

    # Step 3: Create a dashboard that visualizes the inference latency as a line chart.
    dashboard = gcp.monitoring.Dashboard("dashboard",
        dashboard_json=inference_latency_metric.type.apply(lambda metric_type: json.dumps({
            "displayName": "AI Model Latency Dashboard",
            "gridLayout": {
                "widgets": [{
                    "title": "Inference Latency",
                    "xyChart": {
                        "dataSets": [{
                            "timeSeriesQuery": {
                                "timeSeriesFilter": {
                                    "filter": f'metric.type="{metric_type}"',
                                },
                            },
                        }],
                        "chartOptions": {"mode": "LINE"},
                    },
                }],
            },
        })),
    )

    # Step 5: Optionally, create a notification channel (an email channel in this example).
    # It is declared before the alert policy so the policy can reference it.
    notification_channel = gcp.monitoring.NotificationChannel("email_notifications",
        type="email",
        display_name="Email Notification Channel",
        labels={
            "email_address": "your-email@example.com",
        },
        enabled=True,
    )

    # Step 4: Set up an alert policy that fires when the mean latency exceeds
    # 1000 ms (1 s) over a one-minute window.
    alert_policy = gcp.monitoring.AlertPolicy("high_latency_alert",
        combiner="OR",
        conditions=[{
            "display_name": "High Inference Latency",
            "condition_threshold": {
                "filter": 'metric.type="custom.googleapis.com/inference_latency"'
                          ' AND metric.labels.model_name="your_model_name"',
                "duration": "60s",
                "comparison": "COMPARISON_GT",
                "threshold_value": 1000,  # Threshold of 1000 ms (1 s)
                "aggregations": [{
                    "alignment_period": "60s",
                    "per_series_aligner": "ALIGN_MEAN",
                }],
            },
        }],
        display_name="High Inference Latency Alert",
        notification_channels=[notification_channel.id],
        enabled=True,
    )

    pulumi.export("dashboard_id", dashboard.id)
    pulumi.export("alert_policy_name", alert_policy.display_name)
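    The Pulumi program above only provisions the monitoring resources; the latency data itself has to come from your application (Step 2). Below is a minimal sketch of that reporting side, assuming the google-cloud-monitoring client library (google.cloud.monitoring_v3); the project ID is a placeholder, and report_inference_latency is a hypothetical helper name, not part of any library.

    import time

    from google.cloud import monitoring_v3

    PROJECT_ID = "your-gcp-project-id"  # placeholder: your GCP project ID

    client = monitoring_v3.MetricServiceClient()
    project_name = f"projects/{PROJECT_ID}"

    def report_inference_latency(model_name: str, latency_ms: float) -> None:
        """Write one data point to the custom.googleapis.com/inference_latency metric."""
        series = monitoring_v3.TimeSeries()
        series.metric.type = "custom.googleapis.com/inference_latency"
        series.metric.labels["model_name"] = model_name
        series.resource.type = "global"
        series.resource.labels["project_id"] = PROJECT_ID

        # Timestamp the data point with the current time.
        now = time.time()
        seconds = int(now)
        nanos = int((now - seconds) * 10**9)
        interval = monitoring_v3.TimeInterval(
            {"end_time": {"seconds": seconds, "nanos": nanos}}
        )
        point = monitoring_v3.Point(
            {"interval": interval, "value": {"double_value": latency_ms}}
        )
        series.points = [point]

        client.create_time_series(name=project_name, time_series=[series])

    Cloud Monitoring limits how often you can write points to the same time series, so high-traffic services usually aggregate or sample latencies before reporting rather than writing on every request.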

    This program defines the necessary GCP Monitoring resources for tracking model inference latency. Make sure to replace placeholders like "your_model_name" and "your-email@example.com" with actual values relevant to your use case.

    Here’s what each part of the script is doing:

    • Creating a custom metric descriptor that defines the structure of the latency metric we'll be tracking.
    • The custom metric expects a model_name label, so the application must include it whenever it reports data points (see the usage sketch after this list).
    • Defining a dashboard with a widget that visualizes the latency data. It uses the custom metric to create a line chart representing inference latency over time.
    • Defining an alert policy that specifies the conditions under which an alert would be triggered. Here, we've set it to alert if the mean latency exceeds 1 second over a minute.
    • Setting up a notification channel that receives notifications when an alert is triggered. In this case, an email channel notifies a provided email address; it is created before the alert policy so the policy can reference it in notification_channels.
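
    To make the label requirement concrete, here is a sketch of how an inference call might be timed and reported; run_model is a hypothetical stand-in for your actual inference function, and report_inference_latency is the helper sketched earlier.

    import time

    def predict_with_latency_tracking(model_name: str, inputs):
        """Run an inference and report its latency in milliseconds."""
        start = time.perf_counter()
        prediction = run_model(inputs)  # hypothetical: your actual inference call
        latency_ms = (time.perf_counter() - start) * 1000.0

        # Include the model_name label expected by the custom metric descriptor.
        report_inference_latency(model_name=model_name, latency_ms=latency_ms)
        return prediction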

    Deploy this program with Pulumi after filling in the placeholder values noted above (model name, email address, project), and you'll have a monitoring system in place to track your AI model inference latency.
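
    For reference, a typical deployment from the project directory looks like this (assuming you are already authenticated to GCP; the project ID is a placeholder):

    pip install pulumi pulumi-gcp
    pulumi config set gcp:project your-gcp-project-id
    pulumi up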