EDB Postgres Lakehouse Architecture v1.3

What is EDB Postgres Lakehouse

EDB Postgres Lakehouse extends PostgreSQL's analytical capabilities to operate directly on data lake storage, combining the reliability of traditional databases with the scalability and cost-efficiency of modern object storage. This architecture enables organizations to run complex SQL analytics against petabyte-scale datasets stored in Amazon S3, Azure Blob Storage, or Google Cloud Storage without moving data into expensive analytical databases.


The lakehouse approach eliminates the traditional separation between data warehouses and data lakes. Instead of maintaining duplicate data across systems, Analytics Accelerator (PGAA) queries data directly where it resides, using open table formats like Apache Iceberg and Delta Lake to provide ACID transactions, schema enforcement, and time travel capabilities on object storage.

Who Benefits from Lakehouse Architecture

Database Administrators leverage lakehouse capabilities to implement cost-effective tiered storage strategies, automatically offloading historical data from expensive PostgreSQL storage to object storage while maintaining full query access.

Data Engineers build simplified data pipelines that write once to object storage, eliminating complex ETL processes between operational and analytical systems. The lakehouse architecture supports both batch and streaming workloads through unified table formats.

Data Analysts access all organizational data through familiar SQL interfaces, querying both operational PostgreSQL data and historical lakehouse tables without understanding underlying storage complexity.

Platform Teams implement unified analytical architectures supporting multiple compute engines. Spark processes complex transformations while Analytics Accelerator serves interactive queries, all accessing the same underlying data.

How to Deploy Lakehouse Architecture

Basic Deployment Pattern

Deploy dedicated Lakehouse nodes through Hybrid Manager or self-managed installations:

-- Configure storage location for lakehouse data
SELECT pgfs.create_storage_location(
  'lakehouse_storage',
  's3://analytics-bucket/lakehouse',
  '{"region": "us-east-1"}'
);

-- Create external table over Iceberg data
CREATE TABLE sales_history () USING PGAA
WITH (
  pgaa.format = 'iceberg',
  pgaa.storage_location = 'lakehouse_storage',
  pgaa.path = 'warehouse/sales.db/transactions'
);

-- Query the Iceberg-backed table with standard SQL
SELECT
  DATE_TRUNC('month', sale_date) AS month,
  SUM(amount) AS revenue
FROM sales_history
WHERE sale_date >= '2024-01-01'
GROUP BY 1;

Unified OLTP + OLAP Architecture

Combine operational and analytical workloads in a single platform:

-- Hot data in PostgreSQL
CREATE TABLE sales_current (
  sale_id BIGINT PRIMARY KEY,
  sale_date DATE,
  amount DECIMAL(10,2)
);

-- Cold data in lakehouse
CREATE TABLE sales_archive () USING PGAA
WITH (
  pgaa.format = 'iceberg',
  pgaa.managed_by = 'lakehouse_catalog',
  pgaa.catalog_table = 'sales_historical'
);

-- Unified view across tiers
CREATE VIEW sales_complete AS
SELECT * FROM sales_current
UNION ALL
SELECT * FROM sales_archive;

Architecture Components

Stateless Lakehouse Nodes

Lakehouse nodes provide dedicated computational resources for analytical workloads without storing data locally. This stateless design enables elastic scaling based on query demand rather than data volume.

Each node runs the full PostgreSQL stack enhanced with Analytics Accelerator extensions. The familiar PostgreSQL interface ensures compatibility with existing tools and applications while the underlying vectorized engine delivers performance optimized for analytical queries.

Nodes automatically coordinate through shared metadata, enabling parallel query execution across multiple instances. Load balancers distribute queries across available nodes, providing both high availability and horizontal scalability.

Vectorized Query Engine

Apache DataFusion powers the vectorized execution engine, processing columnar data in batches rather than row-by-row. This approach maximizes CPU cache utilization and enables SIMD optimizations that dramatically accelerate analytical operations.

The engine automatically pushes filters and projections to the storage layer, minimizing data transfer. Partition pruning eliminates entire data files from consideration before reading begins. These optimizations combine to deliver query performance approaching specialized analytical databases while maintaining PostgreSQL compatibility.

Adaptive query execution adjusts processing strategies based on actual data characteristics discovered during execution. Initial statistics guide planning, but the engine dynamically optimizes join orders and aggregation strategies as queries progress.
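To see how much work is delegated to the vectorized engine rather than evaluated row-by-row in PostgreSQL, a standard EXPLAIN over a lakehouse table is often enough. The sketch below reuses the sales_history table defined earlier; the exact plan node names and detail shown depend on the Analytics Accelerator version in use.

-- Inspect how filters and aggregation are handled for a lakehouse table;
-- plan output varies by Analytics Accelerator version
EXPLAIN (VERBOSE)
SELECT
  DATE_TRUNC('month', sale_date) AS month,
  SUM(amount) AS revenue
FROM sales_history
WHERE sale_date >= '2024-01-01'
GROUP BY 1;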

Storage Abstraction Layer

PostgreSQL File System (PGFS) provides unified access to diverse object storage systems. This abstraction layer handles authentication, protocol differences, and performance optimizations transparently.

The system supports multiple storage backends simultaneously, enabling queries that span cloud providers or combine cloud and on-premises storage. Intelligent caching reduces repeated metadata operations while respecting consistency requirements of underlying table formats.

Connection pooling and retry logic handle transient network issues common in cloud environments. The storage layer automatically adjusts request patterns based on observed latency and throughput characteristics.
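As a minimal sketch of using more than one backend at once, a second storage location can be registered alongside the S3 location created earlier and referenced by its own tables. The gs:// URL, empty options document, and the events table and path below are illustrative assumptions, not a prescribed layout.

-- Register a Google Cloud Storage location alongside the S3 one;
-- the URL, options, and table path are illustrative assumptions
SELECT pgfs.create_storage_location(
  'gcs_archive',
  'gs://analytics-archive/lakehouse',
  '{}'
);

-- Tables reference whichever storage location holds their data
CREATE TABLE events_archive () USING PGAA
WITH (
  pgaa.format = 'iceberg',
  pgaa.storage_location = 'gcs_archive',
  pgaa.path = 'warehouse/events.db/clickstream'
);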

Table Format Integration

Native support for Apache Iceberg and Delta Lake enables interoperability with the broader data ecosystem. Analytics Accelerator reads table metadata to understand schema, partitioning, and file locations without requiring manual configuration.

For Iceberg tables, the system supports the complete feature set including schema evolution, hidden partitioning, and time travel queries. Catalog integration enables automatic table discovery and metadata synchronization across multiple compute engines.

Delta Lake support currently focuses on read operations, providing access to existing data lake investments. The transaction log parsing occurs transparently, presenting Delta tables as standard PostgreSQL relations.
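A Delta table is declared with the same pattern as the Iceberg examples above. In this sketch the pgaa.format value 'delta', the table name, and the path are assumptions for illustration; once defined, the table is queried like any other PostgreSQL relation.

-- Expose an existing Delta Lake table for read-only queries;
-- the format value and path shown here are illustrative assumptions
CREATE TABLE web_events_delta () USING PGAA
WITH (
  pgaa.format = 'delta',
  pgaa.storage_location = 'lakehouse_storage',
  pgaa.path = 'delta/web_events'
);

SELECT COUNT(*) FROM web_events_delta;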

Performance Characteristics

Query Optimization Strategies

The query optimizer understands lakehouse-specific characteristics when generating execution plans. Statistics from table format metadata guide join ordering and aggregation strategies. The optimizer balances partition pruning effectiveness against metadata operation overhead.

Predicate pushdown extends beyond simple filters to include complex expressions and joins. The system rewrites queries to maximize operations performed at the storage layer, reducing data movement and computational requirements.

Cost-based optimization considers network latency, data locality, and format-specific characteristics. Plans adapt to whether data resides in cache, regional storage, or requires cross-region transfer.
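How predicates are written strongly influences whether partition pruning can eliminate files. The sketch below assumes sales_history is partitioned by sale_date (directly or through an Iceberg partition transform) and contrasts a filter the planner can map to partitions with one that may prevent pruning.

-- Prunable: the partition column is compared directly to constants,
-- so data files outside the range can be skipped before reading
SELECT SUM(amount)
FROM sales_history
WHERE sale_date >= DATE '2024-01-01'
  AND sale_date <  DATE '2024-04-01';

-- Less prunable: wrapping the column in an expression can hide the
-- partition value from the planner, depending on the table's transforms
SELECT SUM(amount)
FROM sales_history
WHERE EXTRACT(YEAR FROM sale_date) = 2024;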

Caching Architecture

Multiple caching layers optimize repeated access patterns common in analytical workloads. Metadata caching eliminates repeated catalog lookups and schema discovery operations. File-level caching stores frequently accessed Parquet files in local SSDs.

Query result caching serves dashboard and reporting workloads where identical queries execute repeatedly. The cache invalidation mechanism respects table format versioning, ensuring consistency despite underlying data changes.

Adaptive cache management prioritizes retention based on access patterns, query cost, and available resources. Hot data remains cached while cold data ages out, maintaining optimal resource utilization.

Scalability Considerations

Horizontal scaling adds nodes to handle increased query concurrency without disrupting running workloads. The stateless architecture ensures new nodes immediately contribute to query processing capacity.

Vertical scaling adjusts individual node resources for complex analytical queries requiring more memory or CPU. The system automatically adjusts parallelism based on available resources, maximizing utilization without causing resource exhaustion.

Storage scaling occurs independently through object storage, supporting virtually unlimited data growth. The separation of compute and storage enables organizations to retain years of historical data without proportional compute costs.

Integration Patterns

Tiered Storage Implementation

Organizations implement tiered storage by combining PostgreSQL operational tables with lakehouse historical tables. Recent data requiring transactional consistency remains in PostgreSQL while aged data migrates to object storage.

Automated migration policies move data based on age, access patterns, or business rules. The migration process preserves data integrity while transforming row-oriented PostgreSQL data into columnar Parquet files optimized for analytical queries.

Transparent query routing directs queries to appropriate tiers based on predicates. Time-based queries automatically union results from both tiers, providing complete analytical views without manual query modification.
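Using the sales_complete view defined earlier, a single time-bounded query spans both tiers: recent rows come from the PostgreSQL table and older rows from the Iceberg archive, with no change to the SQL. This is a sketch of that pattern, not a prescribed retention boundary.

-- One query spans hot PostgreSQL rows and cold Iceberg archive rows
SELECT
  DATE_TRUNC('quarter', sale_date) AS quarter,
  SUM(amount) AS revenue
FROM sales_complete
WHERE sale_date >= CURRENT_DATE - INTERVAL '2 years'
GROUP BY 1
ORDER BY 1;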

Multi-Engine Analytics

The lakehouse architecture supports specialized compute engines for different workload types. Apache Spark handles complex ETL transformations, Analytics Accelerator serves interactive SQL queries, and machine learning platforms access training data directly.

Shared table formats ensure consistency regardless of which engine performs modifications. Transaction isolation prevents conflicts between concurrent operations from different engines. This specialization enables organizations to use the best tool for each job while maintaining a single source of truth.

Streaming Analytics

Modern streaming platforms write continuously to lakehouse tables using micro-batch commits. Analytics Accelerator provides near real-time visibility through snapshot isolation that ensures consistent query results despite ongoing updates.

Event streaming systems like Apache Kafka or Amazon Kinesis deliver data to lakehouse tables for durable storage and analytical processing. The combination of streaming ingestion and SQL analytics enables use cases like real-time fraud detection, operational monitoring, and customer behavior analysis.

Operational Considerations

Deployment Models

Managed Deployments through Hybrid Manager provide automated provisioning, scaling, and maintenance. The platform handles node lifecycle, security updates, and performance optimization transparently.

Self-Managed Deployments offer complete control over infrastructure and configuration. Organizations with specific compliance requirements or existing Kubernetes infrastructure often choose self-managed deployment.

Hybrid Deployments combine managed control plane with self-managed data plane. This approach balances operational simplicity with data sovereignty requirements common in regulated industries.

Monitoring Requirements

Effective monitoring tracks query performance, resource utilization, and storage access patterns. Query execution time, data scanned, and partition pruning effectiveness indicate optimization opportunities.
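One straightforward way to surface slow analytical statements is the standard pg_stat_statements extension, assuming it is enabled on the Lakehouse node (it must be listed in shared_preload_libraries). The sketch below lists the statements with the highest mean execution time.

-- Requires pg_stat_statements in shared_preload_libraries
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;

-- Surface the most expensive statements by mean execution time
SELECT
  LEFT(query, 80) AS query_start,
  calls,
  ROUND(mean_exec_time::numeric, 1) AS mean_ms,
  rows
FROM pg_stat_statements
ORDER BY mean_exec_time DESC
LIMIT 10;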

Storage metrics reveal access patterns that inform partitioning strategies and caching policies. Hot spots in data access suggest candidates for materialized views or alternative partition schemes.

Resource utilization metrics guide capacity planning and scaling decisions. CPU, memory, and network utilization patterns indicate whether workloads are compute, memory, or I/O bound.

Security Architecture

The lakehouse security model implements defense in depth across multiple layers. PostgreSQL role-based access control manages user permissions while object storage IAM policies control data access.
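Lakehouse tables are governed by the same GRANT model as any other PostgreSQL relation. A minimal sketch, with illustrative role and user names, restricts analysts to read-only access on the tables defined earlier:

-- Read-only analyst role over lakehouse tables; names are illustrative
CREATE ROLE analysts NOLOGIN;
GRANT USAGE ON SCHEMA public TO analysts;
GRANT SELECT ON sales_history, sales_archive TO analysts;

-- Individual users inherit lakehouse access through the role
CREATE ROLE jane LOGIN PASSWORD 'change-me' IN ROLE analysts;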

Encryption protects data at rest using cloud provider encryption services or customer-managed keys. Transport layer security encrypts data in motion between compute nodes and storage systems.

Audit logging tracks all data access for compliance and security analysis. Query logs, storage access logs, and authentication events provide complete visibility into system usage.