Monitoring and Alerting
Self-hosting is only available with Pulumi Business Critical. If you would like to evaluate the self-hosted Pulumi Cloud, sign up for the 30-day trial or contact us.
Effective monitoring is critical for maintaining a reliable self-hosted Pulumi Cloud deployment. This page covers a recommended alerting strategy and the key metrics to watch.
The API service exposes Prometheus metrics and supports OpenTelemetry for tracing. See OpenTelemetry configuration for setup details.
Three-tier alerting strategy
Implement alerts at three severity levels:
| Tier | Purpose | Action | Examples |
|---|---|---|---|
| Alert (page) | Service-impacting issues requiring immediate response | Pages on-call engineer | 5xx error rate > 5%, unhealthy hosts, container crash loops |
| Notification (Slack/email) | Degradation that needs attention during business hours | Notifies team channel | High CPU, replication lag, storage warnings |
| Information (dashboard) | Anomaly detection and capacity planning | Logged for review | Traffic pattern changes, signup anomalies |
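As a rough illustration of the table above, the sketch below maps each tier to its action. The `Tier` enum, the `ROUTES` mapping, and the destination strings are hypothetical names chosen for this example; they are not part of Pulumi Cloud or of any particular monitoring product.

```python
from enum import Enum


class Tier(Enum):
    ALERT = "alert"                # service-impacting, needs immediate response
    NOTIFICATION = "notification"  # degradation, handled during business hours
    INFORMATION = "information"    # anomaly detection and capacity planning


# Hypothetical destinations; replace with your own paging and chat integrations.
ROUTES = {
    Tier.ALERT: "page-on-call",
    Tier.NOTIFICATION: "team-channel",
    Tier.INFORMATION: "dashboard-log",
}


def route(tier: Tier) -> str:
    """Return the destination for an alert of the given tier."""
    return ROUTES[tier]
```

Keeping the mapping in one place makes it easier to review which conditions are allowed to page someone.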
Key metrics to monitor
Application health
- HTTP 5xx error rate as a percentage of total requests (not just raw count)
- Target group unhealthy host count (alert if > 0 for 3+ minutes)
- Container restart count (alert on repeated restarts)
- Request latency percentiles (p50, p95, p99)
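The first two checks in this list can be sketched in a few lines. This is a minimal example, assuming you already collect request counts per evaluation window and one unhealthy-host sample per minute. The function names are hypothetical, and the 5% default is taken from the paging example in the tier table.

```python
from typing import Sequence


def error_rate_percent(total_requests: int, errors_5xx: int) -> float:
    """Return 5xx responses as a percentage of all requests in a window."""
    if total_requests <= 0:
        return 0.0  # no traffic in the window, so there is no rate to report
    return 100.0 * errors_5xx / total_requests


def should_page_on_5xx(total_requests: int, errors_5xx: int,
                       threshold_percent: float = 5.0) -> bool:
    """Compare the error percentage, not the raw count, to the threshold."""
    return error_rate_percent(total_requests, errors_5xx) > threshold_percent


def sustained_unhealthy(samples: Sequence[int], required_minutes: int = 3) -> bool:
    """samples: per-minute unhealthy host counts, oldest first.

    True only if every one of the last `required_minutes` samples is > 0.
    """
    if len(samples) < required_minutes:
        return False
    return all(count > 0 for count in samples[-required_minutes:])
```

The same 60 errors trigger a page at 1,000 requests (6%) but not at 100,000 requests (0.06%), which is why the rate is a better signal than the raw count.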
Database
- CPU utilization - set separate thresholds for writer and reader instances
- Replication lag - alert if lag exceeds 1 second
- Freeable memory - alert when less than 10% remains
- Free storage space - notify when it falls to 50 GB and page when it falls to 20 GB
- Total IOPS (read + write combined)
- Connection count vs maximum connections
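The numeric thresholds in this list could be encoded roughly as follows. This is a sketch under stated assumptions: the function and constant names are hypothetical, the storage thresholds are treated as inclusive (the text does not say whether "at 50 GB" means at or below), and the memory check assumes you know the instance's total memory.

```python
from typing import Optional

NOTIFY_FREE_GB = 50.0               # from the list above
PAGE_FREE_GB = 20.0                 # from the list above
MAX_REPLICATION_LAG_SECONDS = 1.0   # from the list above
MIN_FREEABLE_MEMORY_PERCENT = 10.0  # from the list above


def storage_tier(free_gb: float) -> Optional[str]:
    """Return "page", "notify", or None for the remaining storage space."""
    if free_gb <= PAGE_FREE_GB:      # assumption: threshold is inclusive
        return "page"
    if free_gb <= NOTIFY_FREE_GB:    # assumption: threshold is inclusive
        return "notify"
    return None


def replication_lag_breached(lag_seconds: float) -> bool:
    """True when replication lag exceeds 1 second."""
    return lag_seconds > MAX_REPLICATION_LAG_SECONDS


def freeable_memory_breached(freeable_bytes: int, total_bytes: int) -> bool:
    """True when less than 10% of total memory is freeable."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    percent = 100.0 * freeable_bytes / total_bytes
    return percent < MIN_FREEABLE_MEMORY_PERCENT
```

Checking the page threshold before the notify threshold ensures that a value of 15 GB escalates to a page instead of stopping at a notification.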
Object storage
- Request error rate (4xx, 5xx)
- Replication lag (if cross-region replication is enabled)
- Bucket size growth rate
Compute
- CPU and memory utilization per service
- Auto-scaling group desired vs running instance count
- NAT gateway connection count and bandwidth utilization
Anomaly detection
Where your monitoring platform supports it, use anomaly detection (dynamic thresholds) rather than static thresholds for:
- Traffic volume (requests per minute)
- User signup rates
- API latency
This reduces alert noise from expected traffic variation while catching genuine anomalies.
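If your platform has no built-in anomaly detection, a dynamic threshold can be approximated by comparing each new sample to the mean and standard deviation of recent history. The sketch below shows that idea only; it is not how any specific monitoring product computes its anomaly bands. The function name and the default of three standard deviations are assumptions for illustration.

```python
from statistics import mean, stdev
from typing import Sequence


def is_anomalous(history: Sequence[float], current: float, k: float = 3.0) -> bool:
    """Flag `current` if it is more than k standard deviations from the
    mean of `history` (for example, requests per minute over recent windows)."""
    if len(history) < 2:
        return False  # not enough data to estimate variation
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return current != mu  # perfectly flat history: any change stands out
    return abs(current - mu) > k * sigma


# Example: traffic hovering around 100 requests per minute.
recent = [100, 102, 98, 101, 99]
spike = is_anomalous(recent, 150)   # far outside the usual range
normal = is_anomalous(recent, 103)  # within normal variation
```

Because the band is derived from recent data, it widens during naturally noisy periods and narrows when traffic is steady. A production setup would usually also account for daily and weekly seasonality, which this sketch ignores.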