Apache Iceberg® Integration With Analytics Accelerator v1.3.4
Terminology: Hybrid Manager (HM) is EDB’s control plane for provisioning and operating Postgres, analytics, and AI across environments. Analytics Accelerator (PGAA) refers to EDB’s analytics capabilities that extend Postgres with lakehouse, tiering, and vectorized query features.
What is Apache Iceberg®
Apache Iceberg® is an open table format that brings database-like reliability to data lakes built on object storage. Unlike traditional file-based approaches, Iceberg provides ACID transactions, schema evolution, and time travel capabilities while maintaining compatibility with standard formats like Parquet and ORC. Analytics Accelerator (PGAA) leverages Iceberg to enable PostgreSQL queries against petabyte-scale data lakes without moving data from object storage.
This page is part of the Analytics Hub. For full navigation, visit: Analytics Hub — Concepts — How-Tos
Who Uses Iceberg Integration
Data Engineers implement Iceberg tables for reliable data pipelines that require schema evolution and concurrent read-write operations. They value Iceberg's ability to handle streaming ingestion while maintaining query consistency.
Data Analysts query Iceberg tables through Analytics Accelerator using familiar SQL syntax, accessing historical data through time travel queries without understanding underlying storage complexity.
Platform Teams deploy Iceberg as the foundation for multi-engine analytical environments where Spark, Trino, and PostgreSQL must access the same datasets consistently.
Data Scientists leverage Iceberg's versioning capabilities to reproduce experiments and access training data at specific points in time, ensuring model reproducibility.
How to Use Iceberg with Analytics Accelerator
Basic Setup
Connect Analytics Accelerator to an Iceberg catalog to discover and query existing tables:
-- Add Iceberg catalog SELECT pgaa.add_catalog( 'data_lake', 'iceberg-rest', '{"url": "https://catalog.company.com", "warehouse": "analytics", "token": "auth_token"}' ); -- Attach catalog for querying SELECT pgaa.attach_catalog('data_lake'); -- Query Iceberg tables directly SELECT * FROM data_lake.sales.transactions WHERE transaction_date >= '2024-01-01';
Creating External Tables
Define PostgreSQL tables that reference Iceberg data:
CREATE TABLE customer_events () USING PGAA WITH ( pgaa.format = 'iceberg', pgaa.managed_by = 'data_lake', pgaa.catalog_namespace = 'events', pgaa.catalog_table = 'customer_activity' );
Time Travel Queries
Access historical data without maintaining separate copies:
-- Query data as of specific timestamp SELECT COUNT(*) FROM sales_facts FOR SYSTEM_TIME AS OF '2024-12-31 23:59:59'; -- Compare current with historical state SELECT current.product_id, current.price as current_price, historical.price as year_end_price FROM sales_facts current JOIN sales_facts FOR SYSTEM_TIME AS OF '2024-12-31' historical ON current.product_id = historical.product_id WHERE current.price != historical.price;
Understanding Iceberg Architecture
Metadata Structure
Iceberg implements a three-layer metadata architecture that enables its advanced capabilities while maintaining performance at scale.
Metadata Files track the current table state including schema, partitioning, and snapshot history. Each table modification creates a new metadata file, preserving previous versions for time travel.
Manifest Lists organize collections of data files with statistics for partition pruning. Analytics Accelerator uses these statistics to eliminate unnecessary file scanning before query execution begins.
Manifest Files contain column-level statistics for individual data files. These statistics enable predicate pushdown, allowing Analytics Accelerator to skip files that cannot contain matching rows.
Hidden Partitioning
Iceberg abstracts partition complexity from users through hidden partitioning. Queries use natural column values while Iceberg automatically translates these into partition filters:
-- User writes simple query SELECT * FROM orders WHERE order_date = '2024-03-15'; -- Iceberg automatically prunes to day=2024-03-15 partition -- without user knowing partitioning scheme
This abstraction enables partition evolution without query modifications. Organizations can refine partitioning strategies as data volumes grow without breaking existing applications.
Transaction Model
Iceberg implements optimistic concurrency control using snapshot isolation. Writers create new snapshots that become visible atomically, ensuring readers observe consistent table states despite concurrent modifications.
Analytics Accelerator leverages this model to provide repeatable query results. Long-running analytical queries continue accessing their snapshot even as streaming systems write new data, eliminating read-write conflicts.
Schema Evolution Capabilities
Iceberg supports comprehensive schema modifications without rewriting data, exposed through Analytics Accelerator's standard SQL commands.
Safe Schema Changes
Add, rename, or drop columns without touching data files: