Background workers
The background worker is the execution engine for asynchronous AI pipelines in EDB Postgres AI. It enables high-volume data processing without blocking standard database transactions.
A background worker is activated automatically when a pipeline is created or updated with auto_processing => 'Background'. No additional setup is required for basic use — the worker starts polling immediately based on the configured background_sync_interval. The sections below describe optional tuning and the Postgres-level prerequisite you may need to check.
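As an illustrative sketch, creating a pipeline with background processing enabled might look like the following. The function name `aidb.create_knowledge_base` and the source parameters are assumptions, not taken from this page; only `auto_processing` and `background_sync_interval` come from the text above.

```sql
-- Illustrative sketch: the function name and source parameters below are
-- assumptions; only auto_processing and background_sync_interval are
-- described on this page.
SELECT aidb.create_knowledge_base(
    name                     => 'products_kb',
    source_table             => 'products',
    auto_processing          => 'Background',   -- worker starts polling automatically
    background_sync_interval => '5 minutes'     -- optional tuning, see below
);
```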
Core functionality
Asynchronous execution: When a pipeline is set to Background mode, processing occurs independently of the user session. This method ensures that queries or data modifications on the source table are not delayed by embedding generation or OCR tasks.
Batch processing: Background workers group records into configurable batch sizes. This optimizes throughput, especially when interacting with GPU-based models or remote AI service APIs.
Parallel operations: Within each batch, the worker runs pipeline steps (data retrieval, embedding computation, and storage) as parallel operations to maximize performance.
Continuous polling: The worker continuously monitors the source for changes based on the configured background_sync_interval.
Change detection
The background worker handles different source types using specific detection logic:
Table sources: Lightweight triggers capture change events like inserts, updates, and deletes and place them in a backlog. The background worker then processes this backlog at the next interval.
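The trigger-and-backlog mechanism can be sketched in plain SQL. This is illustration only: aidb installs and manages its own triggers and backlog internally, and every name below is invented.

```sql
-- Illustration only: aidb manages its own triggers and backlog internally.
CREATE TABLE backlog (
    record_id  bigint,
    op         text,
    changed_at timestamptz DEFAULT now()
);

CREATE FUNCTION log_change() RETURNS trigger AS $$
BEGIN
    -- Record which row changed and how; the worker drains this
    -- backlog at the next background_sync_interval.
    INSERT INTO backlog (record_id, op)
    VALUES (COALESCE(NEW.id, OLD.id), TG_OP);
    RETURN NULL;  -- return value is ignored for AFTER triggers
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER track_changes
AFTER INSERT OR UPDATE OR DELETE ON source_table
FOR EACH ROW EXECUTE FUNCTION log_change();
```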
Volume sources: The background worker performs a scan of the external storage. It compares the last_modified timestamps of files against a state table to identify new or changed documents.
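The comparison logic for volume sources can be sketched as a query, assuming a hypothetical state table keyed by file name (all table and column names below are invented for illustration):

```sql
-- Illustration only: find new or changed files by comparing the listing of
-- the external volume against the last recorded state.
SELECT f.file_name
FROM scanned_files f                      -- result of the storage scan
LEFT JOIN pipeline_state s USING (file_name)
WHERE s.file_name IS NULL                 -- file not seen before: new
   OR f.last_modified > s.last_modified;  -- timestamp moved: changed
```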
Configuration & constraints
Required
- Postgres max_worker_processes: Each pipeline in Background mode requires its own dedicated background worker process. Before running multiple background pipelines, verify that the Postgres setting max_worker_processes is set high enough to accommodate them all.
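Checking and raising the limit uses standard Postgres commands; note that changes to max_worker_processes take effect only after a server restart.

```sql
-- Check the current limit on worker processes.
SHOW max_worker_processes;

-- Raise the limit; takes effect only after a server restart.
ALTER SYSTEM SET max_worker_processes = 16;
```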
Optional tuning
- background_sync_interval: Controls how often the worker checks for new or changed data. For external volumes (like AWS S3), frequent polling can be expensive. Consider setting a longer interval (for example, once per day) to control costs.
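As a sketch, lengthening the interval for an S3-backed pipeline might look like the following; the function name is hypothetical, and the interval format is assumed to follow Postgres interval syntax.

```sql
-- Hypothetical function name; only the parameter names come from this page.
SELECT aidb.set_auto_knowledge_base(
    name                     => 'docs_kb',
    auto_processing          => 'Background',
    background_sync_interval => '1 day'   -- poll the S3 volume once per day
);
```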
Monitoring and observability
You can track the status and health of background workers using the aidb.pipeline_metrics view (also accessible as aidb.pipem). Key metrics include:
- Table: unprocessed rows: For table sources, the number of rows not yet processed.
- Volume: scans completed: For volume sources, the number of full scans completed.
- Count (source records): Total number of records in the source.
- Count (destination records): Total number of records in the destination.
- Status: Current pipeline status.
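Because aidb.pipeline_metrics is a view, the metrics above can be read with an ordinary query; the exact column names are not listed on this page, so SELECT * is used here.

```sql
-- Inspect status and progress for all pipelines.
SELECT * FROM aidb.pipeline_metrics;

-- The short alias works identically.
SELECT * FROM aidb.pipem;
```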