How-To Lakehouse Read With/Without A Catalog v1.3

Overview

This guide demonstrates how to configure PGAA Lakehouse clusters for accessing data lake formats, both with and without catalog integration. PGAA supports direct access to Delta Lake and Apache Iceberg tables, enabling analytical workloads on object storage without data movement.

Prerequisites

System Requirements

  • PGAA Lakehouse cluster provisioned and accessible
  • PostgreSQL File System (PGFS) version 2.0.1 or later recommended
  • Administrative privileges for catalog and storage configuration

Access Requirements

  • For catalog integration: Valid credentials for target catalog service
  • For direct access: Read permissions on target S3 buckets or storage locations
  • Network connectivity to storage endpoints

Configuration Approaches

PGAA supports three primary approaches for lakehouse data access:

  1. Catalog-managed access: Full integration with external catalog services
  2. Direct Delta Lake access: Query Delta tables without catalog dependency
  3. Direct Iceberg access: Query Iceberg tables using metadata files

AWS S3 Tables Integration

Python Client Configuration

AWS S3 Tables exposes its Iceberg catalog through a SigV4-signed REST endpoint, so Python clients connect to it as a REST catalog. Configure PyIceberg as follows:

from pyiceberg.catalog import load_catalog

# Define S3 Tables parameters
REGION = "eu-north-1"  # Your S3 Tables bucket region
ARN = "arn:aws:s3tables:eu-north-1:0123456789:bucket/your-bucket"

s3tables_catalog = load_catalog(
  "s3tables_catalog",
  **{
    "type": "rest",
    "warehouse": ARN,
    "uri": f"https://s3tables.{REGION}.amazonaws.com/iceberg",
    "rest.sigv4-enabled": "true",
    "rest.signing-name": "s3tables",
    "rest.signing-region": REGION,
  }
)

Note: PyIceberg versions ≤0.9.1 perform credential resolution at each API call and cannot pass S3 options directly during catalog definition.

PGAA Native Configuration

PGAA implements a native Rust S3 Tables client, eliminating REST proxy overhead:

SELECT bdr.replicate_ddl_command($$
  SELECT pgaa.delete_catalog('s3tables_catalog');
  SELECT pgaa.add_catalog(
    's3tables_catalog',
    'iceberg-s3tables',
    '{"arn": "arn:aws:s3tables:eu-north-1:0123456789:bucket/your-bucket",
      "region": "eu-north-1"}'
  );
  SELECT pgaa.import_catalog('s3tables_catalog');
$$);

Key Differences:

  • Catalog type: iceberg-s3tables (not iceberg-rest)
  • Required parameters: ARN and region only
  • Authentication: AWS credential chain (profiles, environment variables, IMDS)
  • Explicit credentials not supported; relies on SDK credential resolution
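
Once pgaa.import_catalog completes, the catalog's namespaces and tables are exposed as ordinary schemas and tables in the cluster. A quick sanity check is sketched below; the schema and table names in the second query are placeholders, so substitute the namespaces actually defined in your S3 Tables bucket:

-- List the schemas created by the catalog import
SELECT schema_name FROM information_schema.schemata ORDER BY schema_name;

-- Query an imported table (placeholder names for illustration)
SELECT count(*) FROM example_namespace.example_table;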

Direct Delta Lake Access

Storage Location Setup

Configure storage locations for Delta Lake tables:

-- PGFS 2.0.1 and later
SELECT pgfs.create_storage_location(
  'biganimal-sample-data',
  's3://beacon-analytics-demo-data-us-east-1-prod',
  '{"aws_skip_signature": "true"}'
);

-- Legacy syntax (before PGFS 2.0.1)
SELECT pgfs.create_storage_location(
  'biganimal-sample-data',
  's3://beacon-analytics-demo-data-eu-west-1-prod',
  NULL,  -- managed storage location ID (unused)
  '{"aws_skip_signature": "true"}',  -- options
  '{}'   -- credentials
);

Creating Delta Tables

Map Delta Lake tables using the PGAA table access method:

-- Create customer table
CREATE TABLE public.customer () USING PGAA
WITH (
  pgaa.storage_location = 'biganimal-sample-data',
  pgaa.path = 'tpch_sf_1/customer'
);

-- Verify table access
SELECT count(*) FROM public.customer;

-- Create additional TPC-H tables
CREATE TABLE public.lineitem () USING PGAA
WITH (pgaa.storage_location = 'biganimal-sample-data', pgaa.path = 'tpch_sf_1/lineitem');

CREATE TABLE public.orders () USING PGAA
WITH (pgaa.storage_location = 'biganimal-sample-data', pgaa.path = 'tpch_sf_1/orders');

CREATE TABLE public.nation () USING PGAA
WITH (pgaa.storage_location = 'biganimal-sample-data', pgaa.path = 'tpch_sf_1/nation');

Working with Large Datasets

Configure a separate schema for scale testing:

CREATE SCHEMA tpch;

-- Create 1TB scale factor tables
CREATE TABLE tpch.lineitem () USING PGAA
WITH (pgaa.storage_location = 'biganimal-sample-data', pgaa.path = 'tpch_sf_1000/lineitem');

CREATE TABLE tpch.customer () USING PGAA
WITH (pgaa.storage_location = 'biganimal-sample-data', pgaa.path = 'tpch_sf_1000/customer');

-- Additional tables follow same pattern...

Direct Iceberg Access

Configuring Iceberg Storage

-- Create storage location for Iceberg data
SELECT pgfs.create_storage_location(
  'biganimal-sample-data-dev',
  's3://beacon-analytics-demo-data-us-east-1-dev',
  '{"aws_skip_signature": "true"}'
);

-- Create Iceberg table reference
CREATE TABLE iceberg_table () USING PGAA
WITH (
  pgaa.storage_location = 'biganimal-sample-data-dev',
  pgaa.path = 'iceberg-example/default.db/iceberg_table',
  pgaa.format = 'iceberg'  -- Explicitly specify format
);

-- Query Iceberg data
SELECT * FROM iceberg_table ORDER BY key ASC;

Catalog Attachment

Connecting to Existing Catalogs

Attach and query data from configured catalogs:

-- Add REST catalog
SELECT pgaa.add_catalog(
  'lakekeeper-test',
  'iceberg-rest',
  '{
     "url": "https://catalog-endpoint.example.com",
     "token": "your-api-token",
     "warehouse": "warehouse-id",
     "danger_accept_invalid_certs": "true"
  }'
);

-- Attach catalog to current session
SELECT pgaa.attach_catalog('lakekeeper-test');

-- Query catalog tables
SELECT COUNT(*) FROM tpch_sf_1.lineitem;

Verifying Catalog Access

-- Check available schemas
SELECT * FROM information_schema.schemata
WHERE schema_name LIKE 'tpch%';

-- Verify offloaded data views
SELECT * FROM partitioned_table_offloaded LIMIT 10;

Query Validation

Sample Analytical Query

Validate the configuration with a multi-way join across the mapped TPC-H tables:

SELECT
    c_custkey,
    c_name,
    sum(l_extendedprice * (1 - l_discount)) AS revenue,
    c_acctbal,
    n_name,
    c_address,
    c_phone,
    c_comment
FROM
    customer,
    orders,
    lineitem,
    nation
WHERE
    c_custkey = o_custkey
    AND l_orderkey = o_orderkey
    AND o_orderdate >= CAST('1993-10-01' AS date)
    AND o_orderdate < CAST('1994-01-01' AS date)
    AND l_returnflag = 'R'
    AND c_nationkey = n_nationkey
GROUP BY
    c_custkey, c_name, c_acctbal,
    c_phone, n_name, c_address, c_comment
ORDER BY
    revenue DESC
LIMIT 20;

Expected Results:

  • Customer#000057040: revenue 734235.2455
  • Customer#000143347: revenue 721002.6948
  • Customer#000060838: revenue 679127.3077

Best Practices

Storage Configuration

  1. Use PGFS 2.0.1+ simplified syntax when available
  2. Configure region-appropriate endpoints for optimal performance
  3. Implement proper credential management through AWS credential chain

Catalog Management

  1. Regularly refresh OAuth tokens for REST catalogs (see the sketch after this list)
  2. Use native clients when available (S3 Tables)
  3. Import catalog metadata after configuration changes
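
One way to rotate an expiring token, using only the functions shown earlier in this guide, is to drop and re-add the catalog definition with the new value. The sketch below reuses the lakekeeper-test catalog from above; the token value is a placeholder:

-- Re-register the REST catalog with a refreshed token
SELECT pgaa.delete_catalog('lakekeeper-test');
SELECT pgaa.add_catalog(
  'lakekeeper-test',
  'iceberg-rest',
  '{
     "url": "https://catalog-endpoint.example.com",
     "token": "refreshed-api-token",
     "warehouse": "warehouse-id"
  }'
);
SELECT pgaa.attach_catalog('lakekeeper-test');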

Performance Optimization

  1. Partition large datasets appropriately
  2. Create materialized views for frequently accessed data (see the sketch after this list)
  3. Monitor query execution plans for optimization opportunities
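
As a sketch of the second point, a standard PostgreSQL materialized view can cache a frequently used aggregation over a lakehouse table. The example below reuses the public.orders mapping created earlier; the view name and aggregation are illustrative only:

-- Materialize a frequently accessed aggregation over the Delta Lake orders table
CREATE MATERIALIZED VIEW public.orders_by_month AS
SELECT
    date_trunc('month', o_orderdate) AS order_month,
    count(*) AS order_count,
    sum(o_totalprice) AS total_revenue
FROM public.orders
GROUP BY 1;

-- Refresh on whatever schedule suits the workload
REFRESH MATERIALIZED VIEW public.orders_by_month;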

Troubleshooting

Common Issues

Storage Access Errors

  • Verify S3 bucket permissions and network connectivity (a minimal check is sketched below)
  • Check credential chain configuration for AWS authentication
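
A minimal way to separate storage problems from catalog problems is to query the smallest directly mapped table; an error here points at the storage location, bucket permissions, or network path rather than at any catalog:

-- nation is the smallest TPC-H table, so this fails fast on storage misconfiguration
SELECT count(*) FROM public.nation;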

Catalog Connection Failures

  • Validate catalog endpoint URLs and credentials
  • Ensure proper SSL certificate validation settings

Query Performance Degradation

  • Review table statistics and partition pruning (see the EXPLAIN sketch below)
  • Consider data locality and caching strategies
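
Standard EXPLAIN works against PGAA-mapped tables; the exact plan nodes shown depend on the PGAA version, but the output reveals how filters are applied to the scan. The sketch below reuses the date predicate from the sample analytical query:

-- Inspect the plan for a filtered scan over the Delta Lake orders table
EXPLAIN (ANALYZE, VERBOSE)
SELECT count(*)
FROM public.orders
WHERE o_orderdate >= DATE '1993-10-01'
  AND o_orderdate < DATE '1994-01-01';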

Next Steps

  1. Configure monitoring for lakehouse queries
  2. Establish data refresh schedules for materialized views
  3. Implement cost optimization through intelligent caching
  4. Develop governance policies for catalog access control