Apache Iceberg Integration With Analytics Accelerator v1.3

Terminology: Hybrid Manager (HM) is EDB’s control plane for provisioning and operating Postgres, analytics, and AI across environments. Analytics Accelerator (PGAA) refers to EDB’s analytics capabilities that extend Postgres with lakehouse, tiering, and vectorized query features.

What is Apache Iceberg

Apache Iceberg is an open table format that brings database-like reliability to data lakes built on object storage. Unlike traditional file-based approaches, Iceberg provides ACID transactions, schema evolution, and time travel capabilities while maintaining compatibility with standard formats like Parquet and ORC. Analytics Accelerator (PGAA) leverages Iceberg to enable PostgreSQL queries against petabyte-scale data lakes without moving data from object storage.

This page is part of the Analytics Hub. For full navigation, visit: Analytics HubConceptsHow-Tos

Who Uses Iceberg Integration

Data Engineers implement Iceberg tables for reliable data pipelines that require schema evolution and concurrent read-write operations. They value Iceberg's ability to handle streaming ingestion while maintaining query consistency.

Data Analysts query Iceberg tables through Analytics Accelerator using familiar SQL syntax, accessing historical data through time travel queries without understanding underlying storage complexity.

Platform Teams deploy Iceberg as the foundation for multi-engine analytical environments where Spark, Trino, and PostgreSQL must access the same datasets consistently.

Data Scientists leverage Iceberg's versioning capabilities to reproduce experiments and access training data at specific points in time, ensuring model reproducibility.

How to Use Iceberg with Analytics Accelerator

Basic Setup

Connect Analytics Accelerator to an Iceberg catalog to discover and query existing tables:

-- Add Iceberg catalog
SELECT pgaa.add_catalog(
  'data_lake',
  'iceberg-rest',
  '{"url": "https://catalog.company.com",
    "warehouse": "analytics",
    "token": "auth_token"}'
);

-- Attach catalog for querying
SELECT pgaa.attach_catalog('data_lake');

-- Query Iceberg tables directly
SELECT * FROM data_lake.sales.transactions
WHERE transaction_date >= '2024-01-01';

Creating External Tables

Define PostgreSQL tables that reference Iceberg data:

CREATE TABLE customer_events () USING PGAA
WITH (
  pgaa.format = 'iceberg',
  pgaa.managed_by = 'data_lake',
  pgaa.catalog_namespace = 'events',
  pgaa.catalog_table = 'customer_activity'
);

Time Travel Queries

Access historical data without maintaining separate copies:

-- Query data as of specific timestamp
SELECT COUNT(*) FROM sales_facts
FOR SYSTEM_TIME AS OF '2024-12-31 23:59:59';

-- Compare current with historical state
SELECT
  current.product_id,
  current.price as current_price,
  historical.price as year_end_price
FROM sales_facts current
JOIN sales_facts FOR SYSTEM_TIME AS OF '2024-12-31' historical
  ON current.product_id = historical.product_id
WHERE current.price != historical.price;

Understanding Iceberg Architecture

Metadata Structure

Iceberg implements a three-layer metadata architecture that enables its advanced capabilities while maintaining performance at scale.

Metadata Files track the current table state including schema, partitioning, and snapshot history. Each table modification creates a new metadata file, preserving previous versions for time travel.

Manifest Lists organize collections of data files with statistics for partition pruning. Analytics Accelerator uses these statistics to eliminate unnecessary file scanning before query execution begins.

Manifest Files contain column-level statistics for individual data files. These statistics enable predicate pushdown, allowing Analytics Accelerator to skip files that cannot contain matching rows.

Hidden Partitioning

Iceberg abstracts partition complexity from users through hidden partitioning. Queries use natural column values while Iceberg automatically translates these into partition filters:

-- User writes simple query
SELECT * FROM orders WHERE order_date = '2024-03-15';

-- Iceberg automatically prunes to day=2024-03-15 partition
-- without user knowing partitioning scheme

This abstraction enables partition evolution without query modifications. Organizations can refine partitioning strategies as data volumes grow without breaking existing applications.

Transaction Model

Iceberg implements optimistic concurrency control using snapshot isolation. Writers create new snapshots that become visible atomically, ensuring readers observe consistent table states despite concurrent modifications.

Analytics Accelerator leverages this model to provide repeatable query results. Long-running analytical queries continue accessing their snapshot even as streaming systems write new data, eliminating read-write conflicts.

Schema Evolution Capabilities

Iceberg supports comprehensive schema modifications without rewriting data, exposed through Analytics Accelerator's standard SQL commands.

Safe Schema Changes

Add, rename, or drop columns without touching data files:

-- Add column (instant operation)
ALTER TABLE customer_events ADD COLUMN region VARCHAR;

-- Rename column (metadata only)
ALTER TABLE customer_events RENAME COLUMN usr_id TO user_id;

-- Drop column (marks as deleted, data retained)
ALTER TABLE customer_events DROP COLUMN deprecated_field;

Type Promotions

Widen data types safely as requirements evolve:

-- Promote int to bigint
ALTER TABLE transactions ALTER COLUMN amount TYPE BIGINT;

-- Promote float to double
ALTER TABLE metrics ALTER COLUMN value TYPE DOUBLE PRECISION;

Catalog Integration Patterns

Analytics Accelerator supports multiple Iceberg catalog implementations, each suited to different operational requirements.

REST Catalogs

REST catalogs provide vendor-neutral integration through HTTP protocols. AWS Glue, Tabular, and custom implementations expose Iceberg tables through standardized REST APIs:

SELECT pgaa.add_catalog(
  'aws_catalog',
  'iceberg-rest',
  '{"url": "https://glue.us-east-1.amazonaws.com",
    "warehouse": "production",
    "credential": "aws_iam"}'
);

AWS S3 Tables

Native integration with AWS S3 Tables provides serverless Iceberg catalog functionality:

SELECT pgaa.add_catalog(
  's3_tables',
  'iceberg-s3tables',
  '{"arn": "arn:aws:s3tables:us-east-1:123456:bucket/analytics",
    "region": "us-east-1"}'
);

Project Nessie

Nessie adds Git-like branching to Iceberg tables, enabling experimental analyses:

-- Query production branch
SELECT * FROM catalog.main.sales_facts;

-- Query experimental branch
SELECT * FROM catalog.experiment_branch.sales_facts;

Performance Optimization Strategies

File Organization

Optimal file sizing balances parallelism with I/O efficiency. Target 128-512 MB Parquet files through regular compaction:

-- Compact small files for better performance
CALL pgaa.compact_table('catalog.schema.table',
  target_file_size => '256MB');

Sort Optimization

Sorting data within files improves compression and enables efficient skipping:

-- Z-order by frequently filtered columns
CALL pgaa.zorder_table('catalog.schema.table',
  columns => ARRAY['customer_id', 'order_date']);

Metadata Management

Regular maintenance keeps metadata efficient:

-- Expire old snapshots
CALL pgaa.expire_snapshots('catalog.schema.table',
  older_than => CURRENT_TIMESTAMP - INTERVAL '30 days');

-- Remove orphan files
CALL pgaa.remove_orphan_files('catalog.schema.table');

Common Integration Patterns

Tiered Storage Architecture

Combine PostgreSQL operational data with Iceberg historical data:

-- Unified view across tiers
CREATE VIEW sales_unified AS
SELECT * FROM postgres.sales_current
UNION ALL
SELECT * FROM iceberg.sales_historical;

Multi-Engine Processing

Iceberg's open format enables tool specialization:

  • Apache Spark: Complex ETL transformations
  • Analytics Accelerator: Interactive SQL queries
  • Trino: Federated queries across systems
  • Flink: Stream processing and real-time analytics

Streaming Data Integration

Process streaming data with consistent snapshots:

-- Query latest streaming data
SELECT COUNT(*) FROM events
WHERE event_time >= CURRENT_TIMESTAMP - INTERVAL '1 hour';

-- Streaming platforms write continuously
-- Analytics Accelerator queries see consistent snapshots

Operational Considerations

Monitoring Requirements

Track key metrics for optimal performance:

  • Snapshot count: Indicates metadata growth
  • File count per partition: Identifies compaction needs
  • Metadata operation latency: Reveals catalog performance
  • Query pruning effectiveness: Validates partitioning strategy

Capacity Planning

Iceberg separates storage from compute, enabling independent scaling:

  • Storage: Grows with data volume and snapshot retention
  • Compute: Scales with concurrent query load
  • Metadata: Requires periodic maintenance as tables evolve

Migration Strategy

Adopt Iceberg incrementally to minimize risk:

  1. Pilot Phase: Convert non-critical datasets for validation
  2. Production Migration: Move high-value analytical tables
  3. Full Adoption: Standardize on Iceberg for all analytical data