Alert rules reference

WEM ships three Prometheus alert rule files at /usr/local/greenplum-db/wem/config/prometheus/alerts/ on the WEM host. For setup instructions, see Installing the observability stack.

Canary rules

Defined in wem-canary.rules.yml. These rules fire based on metrics collected by WEM's canary check system.

Lock contention

Alert	Severity	Condition	Fires after
`CanaryLockCheckWarning`	Warning	Lock check duration > 50ms	5m
`CanaryLockCheckCritical`	Critical	Lock check duration > 200ms	2m

The lock check measures how long it takes to acquire a test lock on the cluster. Elevated durations indicate lock contention that may be affecting query throughput.
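
As a rough illustration of how the table columns map onto a Prometheus rule, the sketch below expresses the two lock alerts as a rule group. The metric name `wem_canary_lock_check_duration_seconds` and the group name are hypothetical placeholders, not taken from wem-canary.rules.yml; check the shipped file for the actual expressions. The "Fires after" column presumably corresponds to the rule's `for` field, which requires the condition to hold continuously for that long before the alert fires.

```yaml
# Sketch only: the metric and group names are assumed, not the shipped definitions.
groups:
  - name: example-canary-lock
    rules:
      - alert: CanaryLockCheckWarning
        # 50ms expressed in seconds
        expr: wem_canary_lock_check_duration_seconds > 0.05
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Canary lock check is taking longer than 50ms"
      - alert: CanaryLockCheckCritical
        # 200ms expressed in seconds
        expr: wem_canary_lock_check_duration_seconds > 0.2
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Canary lock check is taking longer than 200ms"
```

The critical rule pairs a higher threshold with a shorter `for` window, so severe contention is reported sooner than mild contention.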

Transaction wraparound

Alert	Severity	Condition	Fires after
`CanaryTransactionWraparound`	Warning	Transactions since last VACUUM FREEZE > 500 million	1m
`CanaryTransactionWraparoundCritical`	Critical	Transactions since last VACUUM FREEZE > 2 billion	1m

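These alerts fire when the number of transactions processed since the last VACUUM FREEZE approaches the Postgres transaction ID limit of approximately 2.1 billion (2^31). Past that limit, transaction IDs wrap around and older rows appear to belong to future transactions, which makes them invisible to queries. Postgres-based databases typically stop accepting new transactions shortly before this point to protect existing data. When the critical alert fires, run VACUUM FREEZE immediately.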

Active connections

Alert	Severity	Condition	Fires after
`CanaryActiveConnectionsWarning`	Warning	Active connections > 100	5m
`CanaryActiveConnectionsCritical`	Critical	Active connections > 300	2m

These alerts fire when the number of concurrent sessions exceeds the defined thresholds. High connection counts can exhaust the cluster's max_connections limit and prevent new connections from being established.

Canary health

Alert	Severity	Condition	Fires after
`CanaryCheckStale`	Warning	Check not run in the last 30 minutes	5m
`CanaryCheckCritical`	Critical	Check status equals critical	2m

CanaryCheckStale fires when any individual canary check stops running, which may indicate that the WEM scheduler has stopped. CanaryCheckCritical is a generic rule that fires when any check reports a critical status.
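
A staleness rule of this kind is commonly written by comparing the current time with a last-run timestamp. The sketch below assumes each canary check exports a Unix timestamp in a metric named `wem_canary_check_last_run_timestamp_seconds` with a `check` label; both names are hypothetical, and the shipped rule may be expressed differently.

```yaml
# Sketch only: the metric name and the "check" label are assumptions.
groups:
  - name: example-canary-health
    rules:
      - alert: CanaryCheckStale
        # time() is the evaluation time in Unix seconds; 1800s = 30 minutes
        expr: time() - wem_canary_check_last_run_timestamp_seconds > 1800
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Canary check {{ $labels.check }} has not run in the last 30 minutes"
```

Because the expression is evaluated per time series, one stalled check fires its own alert without affecting the others.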

System rules

Defined in wem-system.rules.yml. These rules monitor the health of the WEM service itself.

Alert	Severity	Condition	Fires after
`WEMDown`	Critical	Prometheus cannot scrape the /prom/metrics endpoint	1m
`WEMSchedulerNotActive`	Warning	No WEM instance is holding the scheduler lock	10m

WEMDown fires when Prometheus loses the ability to scrape the /prom/metrics endpoint, indicating that the WEM service is unreachable. WEMSchedulerNotActive fires when no WEM instance holds the internal scheduler lock, which means canary checks are not running even though the service may appear healthy.
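
Scrape failures are usually detected with the `up` metric, which Prometheus records for every scrape target (1 when the last scrape succeeded, 0 when it failed). The sketch below shows what a rule like WEMDown could look like under that approach; the job name `wem` is an assumption and must match whatever job name your scrape configuration uses for the /prom/metrics endpoint.

```yaml
# Sketch only: the job label value "wem" is assumed.
groups:
  - name: example-wem-system
    rules:
      - alert: WEMDown
        # up == 0 means the most recent scrape of this target failed
        expr: up{job="wem"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Prometheus cannot scrape WEM at {{ $labels.instance }}"
```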

WHPG rules

Defined in wem-whpg.rules.yml. These rules fire based on WarehousePG cluster metrics collected by the Collector.

Query performance

Alert	Severity	Condition	Fires after
`WHPGLongRunningQuery`	Warning	Active query running for > 300 seconds	30s

Fires when an active query in pg_stat_activity has been running for more than 300 seconds (5 minutes). The alert identifies the affected database and user.

Disk usage

Alert	Severity	Condition	Fires after
`WHPGHighDiskUsage`	Warning	Disk usage > 80% on any filesystem	5m
`WHPGCriticalDiskUsage`	Critical	Disk usage > 90% on any filesystem	2m

These alerts cover physical filesystems only — tmpfs, overlay, and other virtual filesystems are excluded. The annotation identifies the specific mount point and host.
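
To show how virtual filesystems can be excluded, the sketch below uses node_exporter-style metric and label names (`node_filesystem_avail_bytes`, `node_filesystem_size_bytes`, `fstype`, `mountpoint`). This is an assumption for illustration: the metrics the Collector actually exports, and the exact exclusion list in wem-whpg.rules.yml, may differ.

```yaml
# Sketch only: assumes node_exporter-style filesystem metrics.
groups:
  - name: example-whpg-disk
    rules:
      - alert: WHPGHighDiskUsage
        # Percentage used = (1 - available / size) * 100, skipping virtual filesystems
        expr: |
          (1 - node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"}
             / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"}) * 100 > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Disk usage above 80% on {{ $labels.mountpoint }} ({{ $labels.instance }})"
```

The critical variant would follow the same shape with a threshold of 90 and a `for` of 2m.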

Segment health

Alert	Severity	Condition	Fires after
`WHPGSegmentDown`	Critical	Segment status equals 0 (down)	1m

Fires when any primary or mirror segment reports a down status. The annotation identifies the affected segment and host.

Data distribution

Alert	Severity	Condition	Fires after
`WHPGDataSkewDetected`	Warning	Data skew coefficient of variation > 0.5	5m

Fires when row distribution across segments is significantly uneven. A coefficient of variation above 0.5 indicates one or more segments are handling a disproportionate share of the data, which can slow query execution across the cluster.
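
The coefficient of variation is the standard deviation of the per-segment row counts divided by their mean. For example, four segments holding 100, 100, 100, and 500 rows have a mean of 200 and a population standard deviation of about 173, giving a coefficient of roughly 0.87, well above the 0.5 threshold. The sketch below computes this in PromQL with the `stddev` and `avg` aggregation operators; the metric name `whpg_table_segment_rows` and the `schema` and `table` labels are hypothetical, and the shipped rule may read a precomputed skew metric instead.

```yaml
# Sketch only: the metric and label names are assumptions.
groups:
  - name: example-whpg-skew
    rules:
      - alert: WHPGDataSkewDetected
        # Aggregating by schema and table collapses the per-segment series,
        # so the ratio is the coefficient of variation across segments
        expr: |
          stddev by (schema, table) (whpg_table_segment_rows)
            / avg by (schema, table) (whpg_table_segment_rows) > 0.5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Data skew detected on {{ $labels.schema }}.{{ $labels.table }}"
```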