WEM ships three Prometheus alert rule files at /usr/local/greenplum-db/wem/config/prometheus/alerts/ on the WEM host. For setup instructions, see Installing the observability stack.
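To confirm that Prometheus has loaded the three rule files, you can query its standard `/api/v1/rules` HTTP endpoint and group the alert names by source file. The following is a minimal sketch, not part of WEM: the helper names `fetch_rules` and `alerts_by_file` are hypothetical, and the base URL is an assumption (Prometheus listens on port 9090 by default).

```python
import json
import os
from collections import defaultdict
from urllib.request import urlopen


def fetch_rules(base_url="http://localhost:9090"):
    """Fetch all loaded rule groups from the Prometheus HTTP API.

    base_url is an assumption; point it at your Prometheus server.
    """
    with urlopen(f"{base_url}/api/v1/rules") as resp:
        return json.load(resp)


def alerts_by_file(payload):
    """Map each rule file name to the alerting rules defined in it."""
    result = defaultdict(list)
    for group in payload["data"]["groups"]:
        file_name = os.path.basename(group["file"])
        for rule in group["rules"]:
            # Skip recording rules; only alerting rules are listed here.
            if rule.get("type") == "alerting":
                result[file_name].append(rule["name"])
    return dict(result)
```

Calling `alerts_by_file(fetch_rules())` should return one key per rule file, for example `wem-canary.rules.yml`, if the files were picked up by the Prometheus configuration.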
Canary rules
Defined in wem-canary.rules.yml. These rules fire based on metrics collected by WEM's canary check system.
Lock contention
| Alert | Severity | Condition | Fires after |
|---|---|---|---|
| CanaryLockCheckWarning | Warning | Lock check duration > 50ms | 5m |
| CanaryLockCheckCritical | Critical | Lock check duration > 200ms | 2m |
The lock check measures how long it takes to acquire a test lock on the cluster. Elevated durations indicate lock contention that may be affecting query throughput.
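The "Fires after" column is easiest to read as a hold period: the condition must stay true across consecutive evaluations for that long before the alert fires. The sketch below simulates that behavior, assuming the column corresponds to the `for` field of a Prometheus alerting rule. The function name and the sample data are hypothetical.

```python
def first_firing_time(samples, threshold, for_seconds):
    """Return the timestamp at which the alert would start firing, or None.

    samples is an iterable of (timestamp_seconds, value) pairs in time order.
    The alert is pending while value > threshold and fires once the condition
    has held continuously for at least for_seconds.
    """
    pending_since = None
    for ts, value in samples:
        if value > threshold:
            if pending_since is None:
                pending_since = ts  # condition just became true: pending
            if ts - pending_since >= for_seconds:
                return ts  # held long enough: firing
        else:
            pending_since = None  # condition cleared: reset the hold period
    return None


# Lock check duration in ms, sampled once per minute (made-up values).
lock_ms = [(0, 60), (60, 70), (120, 40), (180, 55), (240, 58),
           (300, 61), (360, 66), (420, 70), (480, 72)]
warning_at = first_firing_time(lock_ms, threshold=50, for_seconds=300)
```

In this example the dip to 40ms at 120 seconds resets the hold period, so the warning would fire at 480 seconds rather than at 300.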
Transaction wraparound
| Alert | Severity | Condition | Fires after |
|---|---|---|---|
| CanaryTransactionWraparound | Warning | Transactions since last VACUUM FREEZE > 500 million | 1m |
| CanaryTransactionWraparoundCritical | Critical | Transactions since last VACUUM FREEZE > 2 billion | 1m |
These alerts fire when the number of transactions processed since the last VACUUM FREEZE approaches the Postgres limit of approximately 2.1 billion (2^31). Exceeding this limit causes transaction ID wraparound, which makes older data invisible to queries. When the critical alert fires, run VACUUM FREEZE immediately to prevent this.
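To make the thresholds concrete, the sketch below classifies a transaction age against the two alert thresholds and reports how many transactions remain before the 2^31 limit. The function name is hypothetical, and how WEM obtains the age is not stated here; in Postgres, a comparable figure is available from `SELECT datname, age(datfrozenxid) FROM pg_database`.

```python
XID_WRAPAROUND_LIMIT = 2**31          # 2,147,483,648, the "approximately 2.1 billion" limit
WARNING_THRESHOLD = 500_000_000       # CanaryTransactionWraparound
CRITICAL_THRESHOLD = 2_000_000_000    # CanaryTransactionWraparoundCritical


def wraparound_status(xid_age):
    """Return (severity, transactions remaining before the limit)."""
    if xid_age > CRITICAL_THRESHOLD:
        severity = "critical"
    elif xid_age > WARNING_THRESHOLD:
        severity = "warning"
    else:
        severity = "ok"
    remaining = max(XID_WRAPAROUND_LIMIT - xid_age, 0)
    return severity, remaining


print(wraparound_status(600_000_000))    # ('warning', 1547483648)
print(wraparound_status(2_050_000_000))  # ('critical', 97483648)
```

The second call shows why the critical alert calls for immediate action: at 2.05 billion transactions, fewer than 100 million remain before the limit.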
Active connections
| Alert | Severity | Condition | Fires after |
|---|---|---|---|
| CanaryActiveConnectionsWarning | Warning | Active connections > 100 | 5m |
| CanaryActiveConnectionsCritical | Critical | Active connections > 300 | 2m |
These alerts fire when the number of concurrent sessions exceeds the defined thresholds. High connection counts can exhaust the cluster's max_connections limit and prevent new connections from being established.
Canary health
| Alert | Severity | Condition | Fires after |
|---|---|---|---|
| CanaryCheckStale | Warning | Check not run in the last 30 minutes | 5m |
| CanaryCheckCritical | Critical | Check status equals critical | 2m |
CanaryCheckStale fires when any individual canary check stops running, which may indicate that the WEM scheduler has stopped. CanaryCheckCritical is a generic rule that fires when any check reports a critical status.
System rules
Defined in wem-system.rules.yml. These rules monitor the health of the WEM service itself.
| Alert | Severity | Condition | Fires after |
|---|---|---|---|
| WEMDown | Critical | Prometheus cannot scrape the /prom/metrics endpoint | 1m |
| WEMSchedulerNotActive | Warning | No WEM instance is holding the scheduler lock | 10m |
WEMDown fires when Prometheus loses the ability to scrape the /prom/metrics endpoint, indicating that the WEM service is unreachable. WEMSchedulerNotActive fires when no WEM instance holds the internal scheduler lock, which means canary checks are not running even though the service may appear healthy.
WHPG rules
Defined in wem-whpg.rules.yml. These rules fire based on WarehousePG cluster metrics collected by the Collector.
Query performance
| Alert | Severity | Condition | Fires after |
|---|---|---|---|
| WHPGLongRunningQuery | Warning | Active query running for > 300 seconds | 30s |
Fires when an active query in pg_stat_activity has been running for more than 300 seconds (5 minutes). The alert identifies the affected database and user.
Disk usage
| Alert | Severity | Condition | Fires after |
|---|---|---|---|
| WHPGHighDiskUsage | Warning | Disk usage > 80% on any filesystem | 5m |
| WHPGCriticalDiskUsage | Critical | Disk usage > 90% on any filesystem | 2m |
These alerts cover physical filesystems only; tmpfs, overlay, and other virtual filesystems are excluded. The annotation identifies the specific mount point and host.
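The filtering and thresholds can be illustrated with a short sketch. It is an approximation only: the function name, the input record layout, and the exact set of excluded filesystem types beyond tmpfs and overlay are assumptions, not taken from wem-whpg.rules.yml.

```python
# tmpfs and overlay come from the text above; the other entries are assumed
# examples of "other virtual filesystems".
VIRTUAL_FSTYPES = {"tmpfs", "overlay", "devtmpfs", "squashfs"}


def disk_alerts(filesystems):
    """Return (mountpoint, severity) for each physical filesystem over threshold.

    Each filesystem is a dict with mountpoint, fstype, size_bytes, used_bytes.
    """
    alerts = []
    for fs in filesystems:
        if fs["fstype"] in VIRTUAL_FSTYPES or fs["size_bytes"] == 0:
            continue  # virtual or empty filesystems are not evaluated
        used_pct = 100.0 * fs["used_bytes"] / fs["size_bytes"]
        if used_pct > 90:
            alerts.append((fs["mountpoint"], "critical"))
        elif used_pct > 80:
            alerts.append((fs["mountpoint"], "warning"))
    return alerts


sample = [
    {"mountpoint": "/data1", "fstype": "xfs", "size_bytes": 100, "used_bytes": 85},
    {"mountpoint": "/run", "fstype": "tmpfs", "size_bytes": 100, "used_bytes": 99},
    {"mountpoint": "/data2", "fstype": "xfs", "size_bytes": 100, "used_bytes": 95},
    {"mountpoint": "/", "fstype": "ext4", "size_bytes": 100, "used_bytes": 40},
]
print(disk_alerts(sample))  # [('/data1', 'warning'), ('/data2', 'critical')]
```

The tmpfs mount at /run is skipped even though it is 99% full, which mirrors the exclusion described above.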
Segment health
| Alert | Severity | Condition | Fires after |
|---|---|---|---|
| WHPGSegmentDown | Critical | Segment status equals 0 (down) | 1m |
Fires when any primary or mirror segment reports a down status. The annotation identifies the affected segment and host.
Data distribution
| Alert | Severity | Condition | Fires after |
|---|---|---|---|
| WHPGDataSkewDetected | Warning | Data skew coefficient of variation > 0.5 | 5m |
Fires when row distribution across segments is significantly uneven. A coefficient of variation above 0.5 indicates one or more segments are handling a disproportionate share of the data, which can slow query execution across the cluster.
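The coefficient of variation is the standard deviation divided by the mean. The sketch below computes it from per-segment row counts, assuming the population standard deviation is used. The exact formula WEM applies is not stated here, and the function name and sample counts are hypothetical.

```python
from statistics import mean, pstdev


def skew_cv(rows_per_segment):
    """Coefficient of variation (stddev / mean) of per-segment row counts."""
    if not rows_per_segment:
        return 0.0
    avg = mean(rows_per_segment)
    if avg == 0:
        return 0.0  # an empty table has no skew
    return pstdev(rows_per_segment) / avg


balanced = skew_cv([1100, 900, 1000, 1000])  # about 0.07, well under 0.5
skewed = skew_cv([4000, 500, 500, 500])      # about 1.10, would trigger the alert
```

In the second case one segment holds roughly 73% of the rows, so queries on that table wait on the overloaded segment while the others sit idle.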