1. AI Model Inference Latency Tracking with GCP Monitoring

    To track AI model inference latency with Google Cloud Monitoring (GCP Monitoring), we'll set up a custom monitoring solution. This solution will include a custom metric for tracking latency, a dashboard for visualization, and an alert policy to notify you if latency exceeds a certain threshold.

    Here's how you might set this up with Pulumi:

    1. Create a Custom Metric: A gcp.monitoring.MetricDescriptor resource defines the custom metric that records your model inference latency.
    2. Report Latency Data: Your application sends latency data points to Cloud Monitoring using that custom metric (this step happens outside Pulumi; a reporting sketch follows the program below).
    3. Create a Dashboard: A dashboard with a widget to visualize latency over time using gcp.monitoring.Dashboard.
    4. Set up an Alert Policy: An alert policy will trigger if the latency goes above a specified threshold using gcp.monitoring.AlertPolicy.
    5. Notification Channel: Optionally, you can create a gcp.monitoring.NotificationChannel to receive alerts.

    Below is the Pulumi program that sets up the monitoring infrastructure described:

    import json

    import pulumi
    import pulumi_gcp as gcp

    # Step 1: Create a custom metric descriptor for tracking AI model inference latency.
    inference_latency_metric = gcp.monitoring.MetricDescriptor("inference_latency_metric",
        type="custom.googleapis.com/inference_latency",
        metric_kind="GAUGE",
        value_type="DOUBLE",
        unit="ms",
        labels=[{
            "key": "model_name",
            "value_type": "STRING",
            "description": "The name of the AI model",
        }],
        display_name="AI Model Inference Latency",
        description="The latency of AI model inferences in milliseconds",
    )

    # Step 2: Your application sends latency data points to Cloud Monitoring using this
    # custom metric (see the reporting sketch after this program).

    # Step 3: Create a dashboard that visualizes the inference latency as a line chart.
    dashboard = gcp.monitoring.Dashboard("dashboard",
        dashboard_json=inference_latency_metric.type.apply(lambda metric_type: json.dumps({
            "displayName": "AI Model Latency Dashboard",
            "gridLayout": {
                "widgets": [{
                    "title": "Inference Latency",
                    "xyChart": {
                        "dataSets": [{
                            "timeSeriesQuery": {
                                "timeSeriesFilter": {
                                    "filter": f'metric.type="{metric_type}"',
                                },
                            },
                        }],
                        "chartOptions": {"mode": "LINE"},
                    },
                }],
            },
        })),
    )

    # Step 5: Optionally, create a notification channel (an email channel in this example).
    # It is declared before the alert policy so the policy can reference it.
    notification_channel = gcp.monitoring.NotificationChannel("email_notifications",
        type="email",
        display_name="Email Notification Channel",
        labels={
            "email_address": "your-email@example.com",
        },
        enabled=True,
    )

    # Step 4: Set up an alert policy that fires when the mean latency exceeds
    # 1000 ms (1 s) over a one-minute window.
    alert_policy = gcp.monitoring.AlertPolicy("high_latency_alert",
        combiner="OR",
        conditions=[{
            "display_name": "High Inference Latency",
            "condition_threshold": {
                "filter": 'metric.type="custom.googleapis.com/inference_latency"'
                          ' AND metric.labels.model_name="your_model_name"',
                "duration": "60s",
                "comparison": "COMPARISON_GT",
                "threshold_value": 1000,  # Threshold of 1000 ms (1 s)
                "aggregations": [{
                    "alignment_period": "60s",
                    "per_series_aligner": "ALIGN_MEAN",
                }],
            },
        }],
        display_name="High Inference Latency Alert",
        notification_channels=[notification_channel.id],
        enabled=True,
    )

    pulumi.export("dashboard_id", dashboard.id)
    pulumi.export("alert_policy_name", alert_policy.display_name)
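    The Pulumi program above only provisions the monitoring resources; the latency data itself has to come from your application (Step 2). Below is a minimal sketch of that reporting side, assuming the google-cloud-monitoring client library (google.cloud.monitoring_v3); the project ID is a placeholder, and report_inference_latency is a hypothetical helper name, not part of any library.

    import time

    from google.cloud import monitoring_v3

    PROJECT_ID = "your-gcp-project-id"  # placeholder: your GCP project ID

    client = monitoring_v3.MetricServiceClient()
    project_name = f"projects/{PROJECT_ID}"

    def report_inference_latency(model_name: str, latency_ms: float) -> None:
        """Write one data point to the custom.googleapis.com/inference_latency metric."""
        series = monitoring_v3.TimeSeries()
        series.metric.type = "custom.googleapis.com/inference_latency"
        series.metric.labels["model_name"] = model_name
        series.resource.type = "global"
        series.resource.labels["project_id"] = PROJECT_ID

        # Timestamp the data point with the current time.
        now = time.time()
        seconds = int(now)
        nanos = int((now - seconds) * 10**9)
        interval = monitoring_v3.TimeInterval(
            {"end_time": {"seconds": seconds, "nanos": nanos}}
        )
        point = monitoring_v3.Point(
            {"interval": interval, "value": {"double_value": latency_ms}}
        )
        series.points = [point]

        client.create_time_series(name=project_name, time_series=[series])

    Cloud Monitoring limits how often you can write points to the same time series, so high-traffic services usually aggregate or sample latencies before reporting rather than writing on every request.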

    This program defines the necessary GCP Monitoring resources for tracking model inference latency. Make sure to replace placeholders like "your_model_name" and "your-email@example.com" with actual values relevant to your use case.

    Here’s what each part of the script is doing:

    • Creating a custom metric descriptor that defines the structure of the latency metric we'll be tracking.
    • The custom metric expects a model_name label, so the application must include it whenever it reports data points (see the usage sketch after this list).
    • Defining a dashboard with a widget that visualizes the latency data. It uses the custom metric to create a line chart representing inference latency over time.
    • Defining an alert policy that specifies the conditions under which an alert would be triggered. Here, we've set it to alert if the mean latency exceeds 1 second over a minute.
    • Setting up a notification channel that receives notifications when an alert is triggered. In this case, an email channel notifies a provided email address; it is created before the alert policy so the policy can reference it in notification_channels.
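
    To make the label requirement concrete, here is a sketch of how an inference call might be timed and reported; run_model is a hypothetical stand-in for your actual inference function, and report_inference_latency is the helper sketched earlier.

    import time

    def predict_with_latency_tracking(model_name: str, inputs):
        """Run an inference and report its latency in milliseconds."""
        start = time.perf_counter()
        prediction = run_model(inputs)  # hypothetical: your actual inference call
        latency_ms = (time.perf_counter() - start) * 1000.0

        # Include the model_name label expected by the custom metric descriptor.
        report_inference_latency(model_name=model_name, latency_ms=latency_ms)
        return prediction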

    Deploy this program with Pulumi after filling in the placeholder values noted above (model name, email address, project), and you'll have a monitoring system in place to track your AI model inference latency.
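
    For reference, a typical deployment from the project directory looks like this (assuming you are already authenticated to GCP; the project ID is a placeholder):

    pip install pulumi pulumi-gcp
    pulumi config set gcp:project your-gcp-project-id
    pulumi up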