Alert rules reference

WEM ships three Prometheus alert rule files at /usr/local/greenplum-db/wem/config/prometheus/alerts/ on the WEM host. For setup instructions, see Installing the observability stack.

Canary rules

Defined in wem-canary.rules.yml. These rules fire based on metrics collected by WEM's canary check system.

Lock contention

| Alert | Severity | Condition | Fires after |
|---|---|---|---|
| CanaryLockCheckWarning | Warning | Lock check duration > 50 ms | 5m |
| CanaryLockCheckCritical | Critical | Lock check duration > 200 ms | 2m |

The lock check measures how long it takes to acquire a test lock on the cluster. Elevated durations indicate lock contention that may be affecting query throughput.
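The "Fires after" column presumably describes how long a condition must hold without interruption before the alert fires, which Prometheus rule files normally express with the `for` field. The Python sketch below models that behavior using the lock check warning threshold. The function name and the sample durations are hypothetical, and this is an illustration rather than how WEM or Prometheus evaluate rules internally.

```python
from typing import Iterable, Tuple


def fires(samples: Iterable[Tuple[float, float]], threshold: float, hold_seconds: float) -> bool:
    """Return True if the value stays above threshold for at least hold_seconds.

    samples: (timestamp_seconds, value) pairs in ascending time order.
    """
    breach_start = None  # time at which the current unbroken breach began
    for ts, value in samples:
        if value > threshold:
            if breach_start is None:
                breach_start = ts
            if ts - breach_start >= hold_seconds:
                return True
        else:
            breach_start = None  # condition cleared, so the timer resets
    return False


# Hypothetical lock check durations in milliseconds, sampled once a minute.
steady = [(60 * i, 80.0) for i in range(7)]  # above 50 ms for six minutes
blip = [(0, 80.0), (60, 80.0), (120, 10.0), (180, 80.0),
        (240, 80.0), (300, 80.0), (360, 80.0)]  # dips below 50 ms at 120 s

print(fires(steady, threshold=50.0, hold_seconds=300))  # True
print(fires(blip, threshold=50.0, hold_seconds=300))    # False
```

In the second series the dip at 120 seconds resets the timer, so the condition has held for only three minutes by the final sample and the five-minute requirement is not met.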

Transaction wraparound

| Alert | Severity | Condition | Fires after |
|---|---|---|---|
| CanaryTransactionWraparound | Warning | Transactions since last VACUUM FREEZE > 500 million | 1m |
| CanaryTransactionWraparoundCritical | Critical | Transactions since last VACUUM FREEZE > 2 billion | 1m |

These alerts fire when the number of transactions processed since the last VACUUM FREEZE approaches the Postgres limit of approximately 2.1 billion. Exceeding this limit causes transaction ID wraparound, which makes older data invisible to queries. The critical threshold requires immediate VACUUM FREEZE to prevent this.
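To make the thresholds concrete, the following Python sketch classifies a transaction count against the two thresholds and reports the remaining headroom. It assumes the "approximately 2.1 billion" limit is 2^31 (2,147,483,648). The constant and function names are hypothetical and are not part of WEM.

```python
XID_LIMIT = 2**31                   # 2,147,483,648; assumed value of the ~2.1 billion limit
WARNING_THRESHOLD = 500_000_000     # from the warning rule
CRITICAL_THRESHOLD = 2_000_000_000  # from the critical rule


def classify_xid_age(xid_age: int) -> str:
    """Map transactions since the last VACUUM FREEZE to a severity level."""
    if xid_age > CRITICAL_THRESHOLD:
        return "critical"
    if xid_age > WARNING_THRESHOLD:
        return "warning"
    return "ok"


def headroom(xid_age: int) -> int:
    """Transactions remaining before the assumed limit is reached."""
    return max(XID_LIMIT - xid_age, 0)


print(classify_xid_age(600_000_000), headroom(600_000_000))      # warning 1547483648
print(classify_xid_age(2_000_000_001), headroom(2_000_000_001))  # critical 147483647
```

Under this assumption, the critical alert leaves fewer than 150 million transactions of headroom, which is why it calls for an immediate VACUUM FREEZE.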

Active connections

| Alert | Severity | Condition | Fires after |
|---|---|---|---|
| CanaryActiveConnectionsWarning | Warning | Active connections > 100 | 5m |
| CanaryActiveConnectionsCritical | Critical | Active connections > 300 | 2m |

These alerts fire when the number of concurrent sessions exceeds the defined thresholds. High connection counts can exhaust the cluster's max_connections limit and prevent new connections from being established.

Canary health

| Alert | Severity | Condition | Fires after |
|---|---|---|---|
| CanaryCheckStale | Warning | Check not run in the last 30 minutes | 5m |
| CanaryCheckCritical | Critical | Check status equals critical | 2m |

CanaryCheckStale fires when any individual canary check has not run in the last 30 minutes, which may indicate that the WEM scheduler has stopped. CanaryCheckCritical is a generic rule that fires when any check reports a critical status.

System rules

Defined in wem-system.rules.yml. These rules monitor the health of the WEM service itself.

| Alert | Severity | Condition | Fires after |
|---|---|---|---|
| WEMDown | Critical | Prometheus cannot scrape the /prom/metrics endpoint | 1m |
| WEMSchedulerNotActive | Warning | No WEM instance is holding the scheduler lock | 10m |

WEMDown fires when Prometheus loses the ability to scrape the /prom/metrics endpoint, indicating that the WEM service is unreachable. WEMSchedulerNotActive fires when no WEM instance holds the internal scheduler lock, which means canary checks are not running even though the service may appear healthy.

WHPG rules

Defined in wem-whpg.rules.yml. These rules fire based on WarehousePG cluster metrics collected by the Collector.

Query performance

| Alert | Severity | Condition | Fires after |
|---|---|---|---|
| WHPGLongRunningQuery | Warning | Active query running for > 300 seconds | 30s |

Fires when an active query in pg_stat_activity has been running for more than 300 seconds (5 minutes). The alert identifies the affected database and user.

Disk usage

| Alert | Severity | Condition | Fires after |
|---|---|---|---|
| WHPGHighDiskUsage | Warning | Disk usage > 80% on any filesystem | 5m |
| WHPGCriticalDiskUsage | Critical | Disk usage > 90% on any filesystem | 2m |

These alerts cover physical filesystems only; tmpfs, overlay, and other virtual filesystems are excluded. The annotation identifies the specific mount point and host.
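The filtering and threshold logic described here can be sketched in Python as follows. The record layout, the host and mount names, and the `VIRTUAL_FSTYPES` set are all hypothetical; the actual rule's exclusion list and metric names are not shown in this reference. The sketch also ignores the "Fires after" duration.

```python
VIRTUAL_FSTYPES = {"tmpfs", "overlay"}  # assumed; the real exclusion list may be longer


def disk_alerts(filesystems):
    """Return (host, mountpoint, severity) for each physical filesystem over a threshold.

    Only the most severe matching level is reported. Above 90% the warning
    condition (> 80%) is also true, so both rules would match in practice.
    """
    alerts = []
    for fs in filesystems:
        if fs["fstype"] in VIRTUAL_FSTYPES or fs["size_bytes"] == 0:
            continue  # skip virtual or empty filesystems
        used_pct = 100.0 * fs["used_bytes"] / fs["size_bytes"]
        if used_pct > 90:
            severity = "critical"
        elif used_pct > 80:
            severity = "warning"
        else:
            continue
        alerts.append((fs["host"], fs["mountpoint"], severity))
    return alerts


# Hypothetical sample data.
sample = [
    {"host": "sdw1", "mountpoint": "/data1", "fstype": "xfs", "size_bytes": 1000, "used_bytes": 850},
    {"host": "sdw2", "mountpoint": "/data1", "fstype": "xfs", "size_bytes": 1000, "used_bytes": 950},
    {"host": "sdw1", "mountpoint": "/run", "fstype": "tmpfs", "size_bytes": 100, "used_bytes": 99},
    {"host": "cdw", "mountpoint": "/", "fstype": "xfs", "size_bytes": 1000, "used_bytes": 500},
]

for alert in disk_alerts(sample):
    print(alert)
```

The tmpfs mount is 99% full but produces no alert, which matches the exclusion of virtual filesystems.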

Segment health

| Alert | Severity | Condition | Fires after |
|---|---|---|---|
| WHPGSegmentDown | Critical | Segment status equals 0 (down) | 1m |

Fires when any primary or mirror segment reports a down status. The annotation identifies the affected segment and host.

Data distribution

| Alert | Severity | Condition | Fires after |
|---|---|---|---|
| WHPGDataSkewDetected | Warning | Data skew coefficient of variation > 0.5 | 5m |

Fires when row distribution across segments is significantly uneven. A coefficient of variation above 0.5 indicates one or more segments are handling a disproportionate share of the data, which can slow query execution across the cluster.
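The coefficient of variation is the standard deviation divided by the mean. The Python sketch below computes it for two hypothetical row distributions. It assumes the population standard deviation; this reference does not state which variant WEM uses, and the function name and row counts are illustrative only.

```python
from statistics import mean, pstdev


def coefficient_of_variation(rows_per_segment):
    """Population standard deviation divided by the mean (0.0 for empty tables)."""
    avg = mean(rows_per_segment)
    if avg == 0:
        return 0.0
    return pstdev(rows_per_segment) / avg


# Hypothetical row counts per segment for two tables.
balanced = [1000, 1020, 980, 1000]
skewed = [4000, 100, 100, 100]

for name, rows in (("balanced", balanced), ("skewed", skewed)):
    cv = coefficient_of_variation(rows)
    print(f"{name}: cv={cv:.3f} alert={cv > 0.5}")
```

The balanced table has a coefficient of about 0.014 and stays well under the threshold. The skewed table, where one segment holds most of the rows, has a coefficient of about 1.57 and would satisfy the alert condition.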

