How to Integrate with Third-Party Iceberg Catalogs v1.3

Overview

This guide provides instructions for configuring PostgreSQL Advanced Analytics (PGAA) to work with third-party Apache Iceberg catalogs, enabling seamless data lake integration for analytical workloads. PGAA supports reading from and writing to Iceberg tables through external catalog services, allowing you to leverage existing data lake infrastructure alongside your PostgreSQL deployments.

Context and Goals

What This Guide Covers

This documentation demonstrates how to:

  • Configure PGAA to connect with third-party Iceberg catalogs
  • Enable bidirectional data flow between PostgreSQL and Iceberg tables
  • Set up catalog integration for PGD (Postgres Distributed) offloads
  • Ensure compatibility with external analytical tools

Supported Catalog Types

PGAA currently supports integration with:

  • Snowflake Open Data Catalog (Apache Polaris): Verified compatibility with manual OAuth token management
  • AWS S3 Tables: Native integration using table buckets

Prerequisites

Before beginning this integration, ensure you have:

Required Components

  • Active PGAA installation with appropriate licensing
  • PostgreSQL cluster with BDR (Bi-Directional Replication) configured
  • Administrative access to PGAA catalog management functions

For Snowflake Polaris Integration

  • Snowflake trial or production account
  • AWS S3 bucket configured in your organization's AWS account
  • IT approval for Snowflake to access your S3 bucket
  • OAuth client credentials from Snowflake

For AWS S3 Tables Integration

  • AWS account with S3 Tables enabled
  • Appropriate IAM permissions for S3 Tables management
  • S3 bucket configured as a table bucket

Practical Applications

This integration enables several key use cases:

  1. Analytical Offloading: Move historical data from operational PostgreSQL databases to cost-effective object storage
  2. Data Lake Integration: Query Iceberg tables directly from PostgreSQL without data movement
  3. Hybrid Analytics: Combine real-time PostgreSQL data with historical Iceberg data in unified queries (see the sketch after this list)
  4. Cross-Platform Analytics: Share data between PostgreSQL and other analytical engines through a common Iceberg catalog
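
For example, a hybrid query can combine a live PostgreSQL table with an Iceberg table exposed through a configured catalog in a single statement. The sketch below is illustrative only: the catalog name snowflake_polaris matches the configuration shown later in this guide, while the analytics schema and the orders/orders_archive tables are hypothetical.

-- Combine current operational rows with archived rows stored in Iceberg
SELECT order_id, order_date, amount
FROM orders                                       -- local PostgreSQL table (hypothetical)
UNION ALL
SELECT order_id, order_date, amount
FROM snowflake_polaris.analytics.orders_archive   -- Iceberg table via the catalog (hypothetical)
WHERE order_date < DATE '2024-01-01';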

Configuration Instructions

Setting Up Snowflake Polaris Integration

Step 1: Obtain OAuth Token

Due to current PGAA limitations, the OAuth client credentials flow must be performed manually outside of PGAA, and the resulting tokens have a 1-hour lifetime. Use the following Python script to retrieve your token:

from pyiceberg.catalog.rest import RestCatalog

# Replace with your Snowflake OAuth client credentials
CLIENT_ID = "your_client_id"
CLIENT_SECRET = "your_client_secret"
CREDENTIAL = f"{CLIENT_ID}:{CLIENT_SECRET}"

# Replace with your Snowflake Polaris endpoint
URI = "https://your-account.snowflakecomputing.com/polaris/api/catalog"

# Connecting performs the OAuth client credentials exchange and stores the
# resulting access token in the catalog properties
catalog = RestCatalog(
    name="PGAA Test",
    warehouse="pgaa_test",
    uri=URI,
    credential=CREDENTIAL,
    scope="PRINCIPAL_ROLE:read_write_all"
)

# This will print your temporary (1-hour) token
print(catalog.properties["token"])

Important: Save the token output - you'll need it for the next step.

Step 2: Configure PGAA Catalog

Execute the following SQL commands to add and configure the Snowflake Polaris catalog in PGAA:

-- Remove existing catalog if present
SELECT bdr.replicate_ddl_command($$
  SELECT pgaa.delete_catalog('snowflake_polaris');
$$);

-- Add new catalog configuration
SELECT bdr.replicate_ddl_command($$
  SELECT pgaa.add_catalog(
    'snowflake_polaris',
    'iceberg-rest',
    '{"url": "https://your-account.snowflakecomputing.com/polaris/api/catalog",
      "token": "YOUR_OAUTH_TOKEN_HERE",
      "warehouse": "pgaa_test"}'
  );
$$);

-- Import catalog metadata
SELECT bdr.replicate_ddl_command($$
  SELECT pgaa.import_catalog('snowflake_polaris');
$$);

Step 3: Configure Write Permissions

Enable write operations to the catalog by configuring your BDR node group:

SELECT bdr.alter_node_group_option(
  (SELECT node_group_name FROM bdr.node_group
   WHERE node_group_name != 'world' LIMIT 1),
  'analytics_write_catalog',
  'snowflake_polaris'
);
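
As a quick sanity check, you can confirm which node group the option was applied to; the query below only reuses the bdr.node_group catalog already referenced in the command above.

-- List node groups other than 'world' to confirm the target group
SELECT node_group_name FROM bdr.node_group WHERE node_group_name != 'world';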

Setting Up AWS S3 Tables Integration

For S3 Tables integration, follow the AWS S3 Tables Getting Started Guide to create your table bucket, then use iceberg-s3tables as the catalog type when configuring PGAA:

SELECT bdr.replicate_ddl_command($$
  SELECT pgaa.add_catalog(
    's3_tables_catalog',
    'iceberg-s3tables',
    '{"bucket": "your-table-bucket",
      "region": "us-east-1",
      "access_key": "YOUR_ACCESS_KEY",
      "secret_key": "YOUR_SECRET_KEY"}'
  );
$$);
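
As with the Polaris catalog, you will typically import the catalog metadata after adding it so that existing Iceberg tables become visible. The sketch below simply reuses pgaa.import_catalog from Step 2 with the catalog name defined above.

SELECT bdr.replicate_ddl_command($$
  SELECT pgaa.import_catalog('s3_tables_catalog');
$$);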

Verification and Testing

After configuration, verify your integration by testing these operations:

Reading from Catalog

-- List available tables
SELECT * FROM pgaa.catalog_tables('your_catalog_name');

-- Query an Iceberg table
SELECT * FROM your_catalog_name.schema_name.table_name LIMIT 10;

Writing to Catalog

-- Create a new Iceberg table
CREATE TABLE your_catalog_name.schema_name.new_table AS
SELECT * FROM your_postgresql_table;

PGD Offloads

Verify that PGD can successfully offload data to the configured catalog according to your retention policies.
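
One simple check is to compare row counts between an offloaded table and its counterpart in the catalog. The names below are purely illustrative; substitute the catalog, schema, and tables covered by your retention policies.

-- Rows still held in the operational PostgreSQL table (hypothetical name)
SELECT count(*) AS local_rows FROM orders;

-- Rows visible in the offloaded Iceberg table through the catalog (hypothetical names)
SELECT count(*) AS offloaded_rows FROM snowflake_polaris.analytics.orders;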

External Tool Compatibility

Test that third-party analytical tools can read both tables and views created through PGAA.

Known Limitations

  • Token Management: Snowflake Polaris tokens expire after 1 hour and must be manually refreshed
  • Naming Restrictions: Catalogs with spaces in their names may cause query errors
  • OAuth Flow: The OAuth client credentials flow must be performed manually outside of PGAA

Troubleshooting

Common Issues

Token Expiration Error

  • Symptom: Queries fail with authentication errors after approximately 1 hour
  • Solution: Regenerate the OAuth token and update the catalog configuration
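
One way to update the configuration is to re-register the catalog with a freshly generated token. The sketch below reuses the pgaa.delete_catalog and pgaa.add_catalog calls from Step 2; rerun the Python script from Step 1 first to obtain the new token.

-- Re-register the catalog with the regenerated token
SELECT bdr.replicate_ddl_command($$
  SELECT pgaa.delete_catalog('snowflake_polaris');
$$);

SELECT bdr.replicate_ddl_command($$
  SELECT pgaa.add_catalog(
    'snowflake_polaris',
    'iceberg-rest',
    '{"url": "https://your-account.snowflakecomputing.com/polaris/api/catalog",
      "token": "YOUR_NEW_OAUTH_TOKEN",
      "warehouse": "pgaa_test"}'
  );
$$);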

Catalog Import Failures

  • Symptom: pgaa.import_catalog() returns errors
  • Solution: Verify network connectivity and ensure the catalog URL is accessible from all PostgreSQL nodes

Query Performance Issues

  • Symptom: Slow query performance when accessing Iceberg tables
  • Solution: Ensure proper partition pruning and consider creating materialized views for frequently accessed data
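
If a hot subset of the Iceberg data is queried repeatedly, a local materialized view can avoid re-reading object storage on every query. This is a minimal sketch in standard PostgreSQL; the catalog, schema, table, and column names are hypothetical.

-- Cache a frequently accessed slice of an Iceberg table locally
CREATE MATERIALIZED VIEW recent_orders AS
SELECT order_id, order_date, amount
FROM snowflake_polaris.analytics.orders
WHERE order_date >= DATE '2024-01-01';

-- Refresh periodically to pick up newly offloaded data
REFRESH MATERIALIZED VIEW recent_orders;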

Next Steps

After successfully configuring catalog integration:

  1. Establish token refresh procedures for production environments
  2. Configure monitoring for catalog connectivity
  3. Implement data lifecycle policies for automated offloading
  4. Optimize query patterns for mixed PostgreSQL-Iceberg workloads