How-To Lakehouse Read With/Without A Catalog v1.3

Overview

This guide demonstrates how to configure PGAA Lakehouse clusters for accessing data lake formats, both with and without catalog integration. PGAA supports direct access to Delta Lake and Apache Iceberg tables, enabling analytical workloads on object storage without data movement.

Prerequisites

System Requirements

  • PGAA Lakehouse cluster provisioned and accessible
  • PostgreSQL File System (PGFS) version 2.0.1 or later recommended
  • Administrative privileges for catalog and storage configuration

Access Requirements

  • For catalog integration: Valid credentials for target catalog service
  • For direct access: Read permissions on target S3 buckets or storage locations
  • Network connectivity to storage endpoints

Configuration Approaches

PGAA supports three primary approaches for lakehouse data access:

  1. Catalog-managed access: Full integration with external catalog services
  2. Direct Delta Lake access: Query Delta tables without catalog dependency
  3. Direct Iceberg access: Query Iceberg tables using metadata files

AWS S3 Tables Integration

Python Client Configuration

AWS S3 Tables exposes its Iceberg catalog through a SigV4-signed REST endpoint, so Python clients connect to it as a REST catalog. Configure PyIceberg as follows:

from pyiceberg.catalog import load_catalog

# Define S3 Tables parameters
REGION = "eu-north-1"  # Your S3 Tables bucket region
ARN = "arn:aws:s3tables:eu-north-1:0123456789:bucket/your-bucket"

s3tables_catalog = load_catalog(
  "s3tables_catalog",
  **{
    "type": "rest",
    "warehouse": ARN,
    "uri": f"https://s3tables.{REGION}.amazonaws.com/iceberg",
    "rest.sigv4-enabled": "true",
    "rest.signing-name": "s3tables",
    "rest.signing-region": REGION,
  }
)

Note: PyIceberg versions ≤0.9.1 perform credential resolution at each API call and cannot pass S3 options directly during catalog definition.

PGAA Native Configuration

PGAA implements a native Rust S3 Tables client, eliminating REST proxy overhead:

SELECT bdr.replicate_ddl_command($$
  SELECT pgaa.delete_catalog('s3tables_catalog');
  SELECT pgaa.add_catalog(
    's3tables_catalog',
    'iceberg-s3tables',
    '{"arn": "arn:aws:s3tables:eu-north-1:0123456789:bucket/your-bucket",
      "region": "eu-north-1"}'
  );
  SELECT pgaa.import_catalog('s3tables_catalog');
$$);

Key Differences:

  • Catalog type: iceberg-s3tables (not iceberg-rest)
  • Required parameters: ARN and region only
  • Authentication: AWS credential chain (profiles, environment variables, IMDS)
  • Explicit credentials not supported; relies on SDK credential resolution
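
Once pgaa.import_catalog completes, the catalog's namespaces and tables are exposed as ordinary schemas and tables in the cluster. A quick sanity check is sketched below; the schema and table names in the second query are placeholders, so substitute the namespaces actually defined in your S3 Tables bucket:

-- List the schemas created by the catalog import
SELECT schema_name FROM information_schema.schemata ORDER BY schema_name;

-- Query an imported table (placeholder names for illustration)
SELECT count(*) FROM example_namespace.example_table;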

Direct Delta Lake Access

Storage Location Setup

Configure storage locations for Delta Lake tables:

-- PGFS 2.0.1 and later
SELECT pgfs.create_storage_location(
  'biganimal-sample-data',
  's3://beacon-analytics-demo-data-us-east-1-prod',
  '{"aws_skip_signature": "true"}'
);

-- Legacy syntax (before PGFS 2.0.1)
SELECT pgfs.create_storage_location(
  'biganimal-sample-data',
  's3://beacon-analytics-demo-data-eu-west-1-prod',
  NULL,  -- managed storage location ID (unused)
  '{"aws_skip_signature": "true"}',  -- options
  '{}'   -- credentials
);

Creating Delta Tables

Map Delta Lake tables using the PGAA table access method:

-- Create customer table
CREATE TABLE public.customer () USING PGAA
WITH (
  pgaa.storage_location = 'biganimal-sample-data',
  pgaa.path = 'tpch_sf_1/customer'
);

-- Verify table access
SELECT count(*) FROM public.customer;

-- Create additional TPC-H tables
CREATE TABLE public.lineitem () USING PGAA
WITH (pgaa.storage_location = 'biganimal-sample-data', pgaa.path = 'tpch_sf_1/lineitem');

CREATE TABLE public.orders () USING PGAA
WITH (pgaa.storage_location = 'biganimal-sample-data', pgaa.path = 'tpch_sf_1/orders');

CREATE TABLE public.nation () USING PGAA
WITH (pgaa.storage_location = 'biganimal-sample-data', pgaa.path = 'tpch_sf_1/nation');

Working with Large Datasets

Configure a separate schema for scale testing:

CREATE SCHEMA tpch;

-- Create 1TB scale factor tables
CREATE TABLE tpch.lineitem () USING PGAA
WITH (pgaa.storage_location = 'biganimal-sample-data', pgaa.path = 'tpch_sf_1000/lineitem');

CREATE TABLE tpch.customer () USING PGAA
WITH (pgaa.storage_location = 'biganimal-sample-data', pgaa.path = 'tpch_sf_1000/customer');

-- Additional tables follow same pattern...

Direct Iceberg Access

Configuring Iceberg Storage

-- Create storage location for Iceberg data
SELECT pgfs.create_storage_location(
  'biganimal-sample-data-dev',
  's3://beacon-analytics-demo-data-us-east-1-dev',
  '{"aws_skip_signature": "true"}'
);

-- Create Iceberg table reference
CREATE TABLE iceberg_table () USING PGAA
WITH (
  pgaa.storage_location = 'biganimal-sample-data-dev',
  pgaa.path = 'iceberg-example/default.db/iceberg_table',
  pgaa.format = 'iceberg'  -- Explicitly specify format
);

-- Query Iceberg data
SELECT * FROM iceberg_table ORDER BY key ASC;

Catalog Attachment

Connecting to Existing Catalogs

Attach and query data from configured catalogs:

-- Add REST catalog
SELECT pgaa.add_catalog(
  'lakekeeper-test',
  'iceberg-rest',
  '{
     "url": "https://catalog-endpoint.example.com",
     "token": "your-api-token",
     "warehouse": "warehouse-id",
     "danger_accept_invalid_certs": "true"
  }'
);

-- Attach catalog to current session
SELECT pgaa.attach_catalog('lakekeeper-test');

-- Query catalog tables
SELECT COUNT(*) FROM tpch_sf_1.lineitem;

Verifying Catalog Access

-- Check available schemas
SELECT * FROM information_schema.schemata
WHERE schema_name LIKE 'tpch%';

-- Verify offloaded data views
SELECT * FROM partitioned_table_offloaded LIMIT 10;

Query Validation

Sample Analytical Query

Validate the configuration with a multi-way join across the mapped TPC-H tables:

SELECT
    c_custkey,
    c_name,
    sum(l_extendedprice * (1 - l_discount)) AS revenue,
    c_acctbal,
    n_name,
    c_address,
    c_phone,
    c_comment
FROM
    customer,
    orders,
    lineitem,
    nation
WHERE
    c_custkey = o_custkey
    AND l_orderkey = o_orderkey
    AND o_orderdate >= CAST('1993-10-01' AS date)
    AND o_orderdate < CAST('1994-01-01' AS date)
    AND l_returnflag = 'R'
    AND c_nationkey = n_nationkey
GROUP BY
    c_custkey, c_name, c_acctbal,
    c_phone, n_name, c_address, c_comment
ORDER BY
    revenue DESC
LIMIT 20;

Expected Results:

  • Customer#000057040: revenue 734235.2455
  • Customer#000143347: revenue 721002.6948
  • Customer#000060838: revenue 679127.3077

Best Practices

Storage Configuration

  1. Use PGFS 2.0.1+ simplified syntax when available
  2. Configure region-appropriate endpoints for optimal performance
  3. Implement proper credential management through AWS credential chain

Catalog Management

  1. Regularly refresh OAuth tokens for REST catalogs (see the sketch after this list)
  2. Use native clients when available (S3 Tables)
  3. Import catalog metadata after configuration changes
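
One way to rotate an expiring token, using only the functions shown earlier in this guide, is to drop and re-add the catalog definition with the new value. The sketch below reuses the lakekeeper-test catalog from above; the token value is a placeholder:

-- Re-register the REST catalog with a refreshed token
SELECT pgaa.delete_catalog('lakekeeper-test');
SELECT pgaa.add_catalog(
  'lakekeeper-test',
  'iceberg-rest',
  '{
     "url": "https://catalog-endpoint.example.com",
     "token": "refreshed-api-token",
     "warehouse": "warehouse-id"
  }'
);
SELECT pgaa.attach_catalog('lakekeeper-test');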

Performance Optimization

  1. Partition large datasets appropriately
  2. Create materialized views for frequently accessed data (see the sketch after this list)
  3. Monitor query execution plans for optimization opportunities
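
As a sketch of the second point, a standard PostgreSQL materialized view can cache a frequently used aggregation over a lakehouse table. The example below reuses the public.orders mapping created earlier; the view name and aggregation are illustrative only:

-- Materialize a frequently accessed aggregation over the Delta Lake orders table
CREATE MATERIALIZED VIEW public.orders_by_month AS
SELECT
    date_trunc('month', o_orderdate) AS order_month,
    count(*) AS order_count,
    sum(o_totalprice) AS total_revenue
FROM public.orders
GROUP BY 1;

-- Refresh on whatever schedule suits the workload
REFRESH MATERIALIZED VIEW public.orders_by_month;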

Troubleshooting

Common Issues

Storage Access Errors

  • Verify S3 bucket permissions and network connectivity (a minimal check is sketched below)
  • Check credential chain configuration for AWS authentication
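
A minimal way to separate storage problems from catalog problems is to query the smallest directly mapped table; an error here points at the storage location, bucket permissions, or network path rather than at any catalog:

-- nation is the smallest TPC-H table, so this fails fast on storage misconfiguration
SELECT count(*) FROM public.nation;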

Catalog Connection Failures

  • Validate catalog endpoint URLs and credentials
  • Ensure proper SSL certificate validation settings

Query Performance Degradation

  • Review table statistics and partition pruning (see the EXPLAIN sketch below)
  • Consider data locality and caching strategies
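
Standard EXPLAIN works against PGAA-mapped tables; the exact plan nodes shown depend on the PGAA version, but the output reveals how filters are applied to the scan. The sketch below reuses the date predicate from the sample analytical query:

-- Inspect the plan for a filtered scan over the Delta Lake orders table
EXPLAIN (ANALYZE, VERBOSE)
SELECT count(*)
FROM public.orders
WHERE o_orderdate >= DATE '1993-10-01'
  AND o_orderdate < DATE '1994-01-01';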

Next Steps

  1. Configure monitoring for lakehouse queries
  2. Establish data refresh schedules for materialized views
  3. Implement cost optimization through intelligent caching
  4. Develop governance policies for catalog access control