Add Monitoring to Your Project

By Pulumi Team

The Challenge

You need visibility into how your infrastructure and applications are performing in production. Without monitoring, you find out about problems when users report them. Dashboards, alarms, centralized logging, and distributed tracing let you detect issues before users are affected and diagnose root causes quickly when problems do occur.

What You'll Build

  • Dashboards for key metrics visualization
  • Alarms for critical metric thresholds
  • Centralized log aggregation and search
  • Distributed tracing across services
  • Alert notifications via email or messaging

Try This Prompt in Pulumi Neo

Run this prompt in Neo to deploy your infrastructure, or edit it to customize.

Best For

Use this prompt when you need to add observability to an existing deployment. This is essential before going to production, after scaling to multiple services, or when operational issues are taking too long to diagnose. The monitoring stack integrates with any cloud infrastructure and provides the visibility needed for production operations.

Architecture Overview

This architecture adds three pillars of observability to your existing infrastructure: metrics and dashboards, centralized logging, and distributed tracing. Together, they answer the three fundamental operational questions: “Is something broken?” (alarms), “What is happening right now?” (dashboards and logs), and “Where is the bottleneck?” (traces). These capabilities layer on top of your existing deployment without requiring changes to your application architecture.

Dashboards provide a real-time view of system health. They visualize CPU utilization, memory usage, request rates, error rates, and latency percentiles across your infrastructure. Good dashboards show enough information to assess system health at a glance, with the ability to drill down into specific services or time ranges when investigating an issue.

Alarms close the gap between “something went wrong” and “someone noticed.” Rather than watching dashboards continuously, alarms monitor specific metrics and trigger notifications when thresholds are breached. An alarm on error rate catches a bad deployment within minutes. An alarm on CPU utilization warns you before a server runs out of capacity. Alarms turn reactive troubleshooting into proactive incident detection.

Dashboards and Metrics

Dashboards organize metrics into visual panels that show the current state and recent trends of your infrastructure. Key metrics include CPU and memory utilization for compute resources, request count and latency for APIs, error rates for application health, and queue depth for async processing. Dashboards are most useful when organized around services or workflows rather than individual resources.
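
To make this concrete, here is a minimal Pulumi sketch in TypeScript (assuming the AWS provider and CloudWatch) of a two-panel dashboard: CPU utilization for one instance and p99 latency for an API. The instance ID, API name, and region are placeholders to replace with values from your own stack.

    import * as aws from "@pulumi/aws";

    // Placeholder identifiers -- substitute outputs from your own stack.
    const instanceId = "i-0123456789abcdef0";
    const region = "us-east-1";

    // A CloudWatch dashboard is defined by a JSON body describing its widgets.
    const dashboard = new aws.cloudwatch.Dashboard("service-dashboard", {
        dashboardName: "service-overview",
        dashboardBody: JSON.stringify({
            widgets: [
                {
                    type: "metric",
                    x: 0, y: 0, width: 12, height: 6,
                    properties: {
                        title: "EC2 CPU utilization",
                        region: region,
                        stat: "Average",
                        period: 300,
                        metrics: [["AWS/EC2", "CPUUtilization", "InstanceId", instanceId]],
                    },
                },
                {
                    type: "metric",
                    x: 12, y: 0, width: 12, height: 6,
                    properties: {
                        title: "API p99 latency",
                        region: region,
                        stat: "p99",
                        period: 300,
                        metrics: [["AWS/ApiGateway", "Latency", "ApiName", "my-api"]],
                    },
                },
            ],
        }),
    });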

Alarms and Notifications

Alarms evaluate metrics against configured thresholds and trigger actions when breached. A typical setup includes alarms for error rate exceeding a percentage threshold, p99 latency exceeding a time threshold, CPU utilization exceeding a capacity threshold, and available disk space dropping below a safety margin. Alarm notifications go to email, messaging channels, or on-call systems so the right people are informed immediately.
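
A minimal sketch of that setup with Pulumi and CloudWatch, assuming an SNS topic as the notification channel: an email subscription plus an alarm that fires when the API returns too many 5XX errors. The API name and email address are placeholders, and the alarm uses a raw error count rather than a percentage for simplicity.

    import * as aws from "@pulumi/aws";

    // Notification channel: an SNS topic with an email subscription.
    const alerts = new aws.sns.Topic("ops-alerts");
    new aws.sns.TopicSubscription("ops-alerts-email", {
        topic: alerts.arn,
        protocol: "email",
        endpoint: "oncall@example.com",   // placeholder address
    });

    // Alarm: more than 10 server-side errors in two consecutive 1-minute periods.
    new aws.cloudwatch.MetricAlarm("api-5xx-errors", {
        namespace: "AWS/ApiGateway",
        metricName: "5XXError",
        dimensions: { ApiName: "my-api" },   // placeholder API name
        statistic: "Sum",
        period: 60,
        evaluationPeriods: 2,
        threshold: 10,
        comparisonOperator: "GreaterThanThreshold",
        alarmDescription: "Elevated 5XX rate on the public API",
        alarmActions: [alerts.arn],
        okActions: [alerts.arn],
    });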

Centralized Logging

Log aggregation collects application and infrastructure logs from all services into a single searchable repository. Metric filters extract structured data from log streams, turning log patterns (like error messages or slow query warnings) into queryable metrics. This means you can alarm on patterns in your logs, not just on infrastructure metrics, catching application-level issues that raw metrics would miss.
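
A hedged sketch of that pattern with Pulumi and CloudWatch Logs: a log group with a retention policy, a metric filter that counts lines containing ERROR, and an alarm on the resulting metric. The log group, namespace, and filter pattern are illustrative.

    import * as aws from "@pulumi/aws";

    // Central log group with a retention policy (logs are not kept forever).
    const appLogs = new aws.cloudwatch.LogGroup("app-logs", {
        retentionInDays: 30,
    });

    // Metric filter: turn "ERROR" lines in the log stream into a countable metric.
    new aws.cloudwatch.LogMetricFilter("app-error-filter", {
        logGroupName: appLogs.name,
        pattern: "ERROR",
        metricTransformation: {
            name: "AppErrorCount",
            namespace: "MyApp/Logs",
            value: "1",
        },
    });

    // Alarm on the extracted metric, just like any infrastructure metric.
    new aws.cloudwatch.MetricAlarm("app-error-alarm", {
        namespace: "MyApp/Logs",
        metricName: "AppErrorCount",
        statistic: "Sum",
        period: 300,
        evaluationPeriods: 1,
        threshold: 5,
        comparisonOperator: "GreaterThanOrEqualToThreshold",
        alarmDescription: "Application is logging errors",
    });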

Distributed Tracing

Distributed tracing assigns a unique ID to each incoming request and propagates it across service boundaries. When a request flows from an API gateway through multiple backend services to a database, the trace captures timing information at each hop. This makes it straightforward to identify which service in a chain is adding latency or producing errors, which is difficult to determine from logs or metrics alone.
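
On AWS, one way to get started is to enable X-Ray tracing on each service. The sketch below, with placeholder handler code and role names, turns on active tracing for a Lambda function and grants it permission to publish trace segments; propagating the trace ID to downstream calls is then handled by the X-Ray or OpenTelemetry SDK inside the application.

    import * as pulumi from "@pulumi/pulumi";
    import * as aws from "@pulumi/aws";

    // Execution role allowed to write logs and publish X-Ray trace segments.
    const role = new aws.iam.Role("traced-fn-role", {
        assumeRolePolicy: JSON.stringify({
            Version: "2012-10-17",
            Statement: [{
                Action: "sts:AssumeRole",
                Effect: "Allow",
                Principal: { Service: "lambda.amazonaws.com" },
            }],
        }),
    });
    new aws.iam.RolePolicyAttachment("traced-fn-logs", {
        role: role.name,
        policyArn: "arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole",
    });
    new aws.iam.RolePolicyAttachment("traced-fn-xray", {
        role: role.name,
        policyArn: "arn:aws:iam::aws:policy/AWSXRayDaemonWriteAccess",
    });

    // Lambda with active tracing: each invocation produces a trace segment.
    const fn = new aws.lambda.Function("traced-fn", {
        runtime: "nodejs18.x",
        handler: "index.handler",
        role: role.arn,
        code: new pulumi.asset.AssetArchive({
            // placeholder handler -- replace with your application code
            "index.js": new pulumi.asset.StringAsset(
                "exports.handler = async () => ({ statusCode: 200 });"),
        }),
        tracingConfig: { mode: "Active" },
    });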

Common Customizations

  • Add synthetic monitoring: Extend the prompt to include synthetic checks that periodically test your endpoints from external locations, catching outages that internal monitoring might miss.
  • Add anomaly detection: Request machine-learning-based anomaly detection on key metrics to catch unusual patterns without manually setting thresholds; a sketch of the resulting alarm follows this list.
  • Add SLO tracking: Ask for service level objective dashboards that track your error budget consumption, showing how much room you have before violating your availability targets.
  • Add on-call integration: Request integration with an on-call management system so critical alarms automatically page the on-call engineer with relevant context.
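
For the anomaly detection customization, a hedged sketch of what the prompt might generate: CloudWatch can fit an expected band around a metric with the ANOMALY_DETECTION_BAND expression and alarm when the observed value leaves that band, so no fixed threshold is needed. The metric, dimensions, and band width here are placeholders.

    import * as aws from "@pulumi/aws";

    // Alarm when request latency leaves the model's expected band,
    // instead of comparing against a hand-picked static threshold.
    new aws.cloudwatch.MetricAlarm("latency-anomaly", {
        comparisonOperator: "GreaterThanUpperThreshold",
        evaluationPeriods: 3,
        thresholdMetricId: "band",
        alarmDescription: "API latency is outside its expected range",
        metricQueries: [
            {
                id: "band",
                expression: "ANOMALY_DETECTION_BAND(latency, 2)",
                label: "Expected latency",
                returnData: true,
            },
            {
                id: "latency",
                returnData: true,
                metric: {
                    namespace: "AWS/ApiGateway",
                    metricName: "Latency",
                    dimensions: { ApiName: "my-api" },   // placeholder
                    period: 300,
                    stat: "Average",
                },
            },
        ],
    });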