
Monitoring and Alerting

    Self-hosting is only available with Pulumi Business Critical. If you would like to evaluate the self-hosted Pulumi Cloud, sign up for the 30-day trial or contact us.

    Effective monitoring is critical for maintaining a reliable self-hosted Pulumi Cloud deployment. This page covers a recommended alerting strategy and the key metrics to watch.

    The API service exposes Prometheus metrics and supports OpenTelemetry for tracing. See OpenTelemetry configuration for setup details.

    Three-tier alerting strategy

    Implement alerts at three severity levels:

    | Tier | Purpose | Action | Examples |
    |------|---------|--------|----------|
    | Alert (page) | Service-impacting issues requiring immediate response | Pages on-call engineer | 5xx error rate > 5%, unhealthy hosts, container crash loops |
    | Notification (Slack/email) | Degradation that needs attention during business hours | Notifies team channel | High CPU, replication lag, storage warnings |
    | Information (dashboard) | Anomaly detection and capacity planning | Logged for review | Traffic pattern changes, signup anomalies |

    Key metrics to monitor

    Application health

    • HTTP 5xx error rate as a percentage of total requests (not just raw count)
    • Target group unhealthy host count (alert if > 0 for 3+ minutes)
    • Container restart count (alert on repeated restarts)
    • Request latency percentiles (p50, p95, p99)
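
The application health checks above can be sketched in Python. This is a minimal illustration, not part of Pulumi Cloud or any monitoring product: the function names are hypothetical, the inputs are assumed to be counts and one-minute samples you have already pulled from your metrics backend, and the 5% and 3-minute defaults are taken from the examples on this page.

```python
import statistics


def error_rate_percent(errors_5xx: int, total_requests: int) -> float:
    """Return 5xx responses as a percentage of all requests."""
    if total_requests == 0:
        return 0.0  # no traffic in the window, so no measurable error rate
    return 100.0 * errors_5xx / total_requests


def sustained_minutes(samples_per_minute: list[int]) -> int:
    """Count the trailing run of one-minute samples whose value is above 0."""
    minutes = 0
    for value in reversed(samples_per_minute):
        if value <= 0:
            break
        minutes += 1
    return minutes


def should_page(errors_5xx: int, total_requests: int,
                unhealthy_hosts_per_minute: list[int],
                rate_threshold: float = 5.0,
                sustain_threshold: int = 3) -> bool:
    """Page when the 5xx rate exceeds the threshold or hosts stay unhealthy."""
    rate_breached = error_rate_percent(errors_5xx, total_requests) > rate_threshold
    hosts_breached = sustained_minutes(unhealthy_hosts_per_minute) >= sustain_threshold
    return rate_breached or hosts_breached


def latency_percentiles(latencies_ms: list[float]) -> dict[str, float]:
    """Return p50, p95, and p99 from raw latency samples (needs 2+ samples)."""
    cuts = statistics.quantiles(latencies_ms, n=100)  # 99 cut points
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}
```

Using a percentage means that 50 errors out of 1,000 requests and 50 errors out of 1,000,000 requests are treated differently, which a raw count cannot do. Requiring a sustained run before paging avoids waking someone for a single failed health check.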

    Database

    • CPU utilization - different thresholds for writer vs reader instances
    • Replication lag - alert if > 1 second
    • Freeable memory - alert when < 10% remaining
    • Storage space remaining - notify when 50 GB remain, page when 20 GB remain
    • Total IOPS (read + write combined)
    • Connection count vs maximum connections
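
As a rough sketch, the database thresholds above can be mapped onto the alerting tiers like this. The names are hypothetical and several choices are assumptions: the storage thresholds are treated as "at or below", replication lag is routed to the notification tier because the tier table lists it there, and freeable memory is routed to the notification tier because this page does not say which tier it belongs to.

```python
from enum import Enum


class Tier(Enum):
    OK = "ok"
    NOTIFY = "notify"  # Slack/email, handled during business hours
    PAGE = "page"      # pages the on-call engineer


def storage_tier(free_gb: float) -> Tier:
    """Notify at 50 GB remaining, page at 20 GB remaining."""
    if free_gb <= 20:
        return Tier.PAGE
    if free_gb <= 50:
        return Tier.NOTIFY
    return Tier.OK


def replication_lag_tier(lag_seconds: float) -> Tier:
    """Flag replication lag above 1 second."""
    return Tier.NOTIFY if lag_seconds > 1.0 else Tier.OK


def freeable_memory_tier(freeable_bytes: int, total_bytes: int) -> Tier:
    """Flag when less than 10% of instance memory is freeable.

    Assumes total_bytes is the instance's memory size and is positive.
    """
    percent_free = 100.0 * freeable_bytes / total_bytes
    return Tier.NOTIFY if percent_free < 10.0 else Tier.OK
```

Checking the page threshold before the notify threshold ensures that a value low enough to page is never downgraded to a notification.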

    Object storage

    • Request error rate (4xx, 5xx)
    • Replication lag (if cross-region replication is enabled)
    • Bucket size growth rate

    Compute

    • CPU and memory utilization per service
    • Auto-scaling group desired vs running instance count
    • NAT gateway connection/bandwidth utilization

    Anomaly detection

    Where your monitoring platform supports it, use anomaly detection (dynamic thresholds) rather than static thresholds for:

    • Traffic volume (requests per minute)
    • User signup rates
    • API latency

    Dynamic thresholds reduce alert noise from expected traffic variation while still catching genuine anomalies.
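
To illustrate the idea of a dynamic threshold, here is a minimal sketch that flags a value when it falls outside a band of k standard deviations around the recent mean. This is a simple stand-in for explanation only. Monitoring platforms typically use more sophisticated models (for example, ones that account for daily or weekly seasonality), and the function name, `k`, and `min_samples` are assumptions made for this example.

```python
import statistics


def is_anomalous(history: list[float], current: float,
                 k: float = 3.0, min_samples: int = 10) -> bool:
    """Return True if `current` lies more than k standard deviations
    from the mean of `history`.

    `history` is assumed to hold recent values of one metric, such as
    requests per minute over the previous hour.
    """
    if len(history) < min_samples:
        return False  # not enough data to define a baseline
    mean = statistics.fmean(history)
    stdev = statistics.pstdev(history)
    if stdev == 0:
        # Perfectly flat baseline: any deviation at all is unusual.
        return current != mean
    return abs(current - mean) > k * stdev


# Example: a baseline hovering around 100 requests per minute.
baseline = [100, 102, 98, 101, 99, 100, 103, 97, 100, 100]
normal_reading = is_anomalous(baseline, 104)   # inside the band
spike_reading = is_anomalous(baseline, 200)    # far outside the band
```

Because the band is derived from recent data, the same rule adapts to a service that normally handles 100 requests per minute and one that handles 10,000, where a single static threshold would be too tight for one and too loose for the other.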