How-To Integrate with Third-Party Iceberg Catalogs v1.3
Overview
This guide provides instructions for configuring PostgreSQL Advanced Analytics (PGAA) to work with third-party Apache Iceberg catalogs, enabling seamless data lake integration for analytical workloads. PGAA supports reading from and writing to Iceberg tables through external catalog services, allowing you to leverage existing data lake infrastructure alongside your PostgreSQL deployments.
Context and Goals
What This Guide Covers
This documentation demonstrates how to:
- Configure PGAA to connect with third-party Iceberg catalogs
- Enable bidirectional data flow between PostgreSQL and Iceberg tables
- Set up catalog integration for PGD (Postgres Distributed) offloads
- Ensure compatibility with external analytical tools
Supported Catalog Types
PGAA currently supports integration with:
- Snowflake Open Data Catalog (Apache Polaris): Verified compatibility with manual OAuth token management
- AWS S3 Tables: Native integration using table buckets
Prerequisites
Before beginning this integration, ensure you have:
Required Components
- Active PGAA installation with appropriate licensing
- PostgreSQL cluster with BDR (Bi-Directional Replication) configured
- Administrative access to PGAA catalog management functions
For Snowflake Polaris Integration
- Snowflake trial or production account
- AWS S3 bucket configured in your organization's AWS account
- IT approval for Snowflake to access your S3 bucket
- OAuth client credentials from Snowflake
For AWS S3 Tables Integration
- AWS account with S3 Tables enabled
- Appropriate IAM permissions for S3 Tables management
- S3 bucket configured as a table bucket
Practical Applications
This integration enables several key use cases:
- Analytical Offloading: Move historical data from operational PostgreSQL databases to cost-effective object storage
- Data Lake Integration: Query Iceberg tables directly from PostgreSQL without data movement
- Hybrid Analytics: Combine real-time PostgreSQL data with historical Iceberg data in unified queries
- Cross-Platform Analytics: Share data between PostgreSQL and other analytical engines through a common Iceberg catalog
Configuration Instructions
Setting Up Snowflake Polaris Integration
Step 1: Obtain OAuth Token
Due to current limitations, OAuth tokens must be obtained manually and have a 1-hour lifetime. Use the following Python script to retrieve your token:
```python
from pyiceberg.catalog.rest import RestCatalog

# Replace with your Snowflake OAuth credentials
CLIENT_ID = "your_client_id"
CLIENT_SECRET = "your_client_secret"
TOKEN = f"{CLIENT_ID}:{CLIENT_SECRET}"

# Replace with your Snowflake Polaris endpoint
URI = "https://your-account.snowflakecomputing.com/polaris/api/catalog"

catalog = RestCatalog(
    name="PGAA Test",
    warehouse="pgaa_test",
    uri=URI,
    credential=TOKEN,
    scope="PRINCIPAL_ROLE:read_write_all",
)

# This will print your temporary token
print(catalog.properties["token"])
```
Important: Save the token output - you'll need it for the next step.
Step 2: Configure PGAA Catalog
Execute the following SQL commands to add and configure the Snowflake Polaris catalog in PGAA:
```sql
-- Remove existing catalog if present
SELECT bdr.replicate_ddl_command($$
    SELECT pgaa.delete_catalog('snowflake_polaris');
$$);

-- Add new catalog configuration
SELECT bdr.replicate_ddl_command($$
    SELECT pgaa.add_catalog(
        'snowflake_polaris',
        'iceberg-rest',
        '{"url": "https://your-account.snowflakecomputing.com/polaris/api/catalog", "token": "YOUR_OAUTH_TOKEN_HERE", "warehouse": "pgaa_test"}'
    );
$$);

-- Import catalog metadata
SELECT bdr.replicate_ddl_command($$
    SELECT pgaa.import_catalog('snowflake_polaris');
$$);
```
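Before moving on, it can be useful to confirm that the endpoint and token are usable outside PostgreSQL. The following is a minimal sanity-check sketch, assuming pyiceberg is installed; the URI, token, and warehouse placeholders are the same values used in the configuration above, and the catalog name is local to the script.

```python
from pyiceberg.catalog.rest import RestCatalog

# Same endpoint, token, and warehouse as in the pgaa.add_catalog() call above
URI = "https://your-account.snowflakecomputing.com/polaris/api/catalog"
TOKEN = "YOUR_OAUTH_TOKEN_HERE"  # token obtained in Step 1

catalog = RestCatalog(
    name="preflight_check",  # local name only; not the PGAA catalog name
    uri=URI,
    token=TOKEN,
    warehouse="pgaa_test",
)

# If this lists namespaces without raising an error, the endpoint and token are usable
print(catalog.list_namespaces())
```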
Step 3: Configure Write Permissions
Enable write operations to the catalog by configuring your BDR node group:
```sql
SELECT bdr.alter_node_group_option(
    (SELECT node_group_name FROM bdr.node_group WHERE node_group_name != 'world' LIMIT 1),
    'analytics_write_catalog',
    'snowflake_polaris'
);
```
Setting Up AWS S3 Tables Integration
For S3 Tables integration, follow the AWS S3 Tables Getting Started Guide to create your table bucket, then use `iceberg-s3tables` as the catalog type when configuring PGAA:
```sql
SELECT bdr.replicate_ddl_command($$
    SELECT pgaa.add_catalog(
        's3_tables_catalog',
        'iceberg-s3tables',
        '{"bucket": "your-table-bucket", "region": "us-east-1", "access_key": "YOUR_ACCESS_KEY", "secret_key": "YOUR_SECRET_KEY"}'
    );
$$);
```
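When this statement fails, the cause is often that the table bucket does not exist in the expected region or the credentials cannot see it. The sketch below is one way to check from Python with boto3's S3 Tables client; the bucket name and region are the placeholders from the configuration above, and the response field names follow the current S3 Tables API, so verify them against your boto3 version.

```python
import boto3

# Assumes AWS credentials with S3 Tables permissions are available in the
# environment (the same credentials referenced in the catalog configuration).
client = boto3.client("s3tables", region_name="us-east-1")

# List the table buckets visible to these credentials and check that the
# bucket named in the PGAA catalog configuration is among them.
response = client.list_table_buckets()
bucket_names = [bucket["name"] for bucket in response.get("tableBuckets", [])]

if "your-table-bucket" in bucket_names:
    print("Table bucket found; PGAA should be able to reach it")
else:
    print("Table bucket not found; check the bucket name, region, and IAM permissions")
```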
Verification and Testing
After configuration, verify your integration by testing these operations:
Reading from Catalog
```sql
-- List available tables
SELECT * FROM pgaa.catalog_tables('your_catalog_name');

-- Query an Iceberg table
SELECT * FROM your_catalog_name.schema_name.table_name LIMIT 10;
```
Writing to Catalog
```sql
-- Create a new Iceberg table
CREATE TABLE your_catalog_name.schema_name.new_table AS
SELECT * FROM your_postgresql_table;
```
PGD Offloads
Verify that PGD can successfully offload data to the configured catalog according to your retention policies.
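One way to check this from outside PostgreSQL is to list the catalog's contents and confirm that offloaded tables appear. The sketch below uses pyiceberg against the Snowflake Polaris endpoint from Step 2; `offload_namespace` is a hypothetical placeholder for whichever namespace your PGD offloads write to.

```python
from pyiceberg.catalog.rest import RestCatalog

# Connection details from Step 2; the OAuth token must still be valid
catalog = RestCatalog(
    name="offload_check",
    uri="https://your-account.snowflakecomputing.com/polaris/api/catalog",
    token="YOUR_OAUTH_TOKEN_HERE",
    warehouse="pgaa_test",
)

# "offload_namespace" is a placeholder; use the namespace your offloads target
for identifier in catalog.list_tables("offload_namespace"):
    print(identifier)
```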
External Tool Compatibility
Test that third-party analytical tools can read both tables and views created through PGAA.
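As a quick stand-in for a full BI or query-engine test, you can read a PGAA-created table directly with pyiceberg. This is a minimal sketch, assuming pyiceberg and pyarrow are installed, and it reuses the connection details and the table name from the writing test above.

```python
from pyiceberg.catalog.rest import RestCatalog

# Connect as an independent, external reader (no PostgreSQL involved)
catalog = RestCatalog(
    name="external_reader",
    uri="https://your-account.snowflakecomputing.com/polaris/api/catalog",
    token="YOUR_OAUTH_TOKEN_HERE",
    warehouse="pgaa_test",
)

# "schema_name.new_table" matches the table created in the writing test above
table = catalog.load_table("schema_name.new_table")

# Scan a few rows into an Arrow table to confirm the data is readable externally
print(table.scan(limit=10).to_arrow())
```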
Known Limitations
- Token Management: Snowflake Polaris tokens expire after 1 hour and must be manually refreshed
- Naming Restrictions: Catalogs with spaces in their names may cause query errors
- OAuth Flow: The OAuth client credentials flow must be performed manually outside of PGAA
Troubleshooting
Common Issues
Token Expiration Error
- Symptom: Queries fail with authentication errors after approximately 1 hour
- Solution: Regenerate the OAuth token and update the catalog configuration; a minimal refresh sketch follows below
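A minimal refresh sketch, assuming pyiceberg is installed: it re-runs the client-credentials exchange from Step 1 and prints a ready-to-paste `pgaa.add_catalog()` statement containing the fresh token. Depending on your PGAA version, you may need to delete and re-add the catalog as in Step 2 rather than re-adding it directly.

```python
from pyiceberg.catalog.rest import RestCatalog

# Same credentials and endpoint as in Step 1
URI = "https://your-account.snowflakecomputing.com/polaris/api/catalog"
CLIENT_ID = "your_client_id"
CLIENT_SECRET = "your_client_secret"

def fetch_polaris_token() -> str:
    """Re-run the client-credentials exchange from Step 1 and return a fresh token."""
    catalog = RestCatalog(
        name="token_refresh",
        warehouse="pgaa_test",
        uri=URI,
        credential=f"{CLIENT_ID}:{CLIENT_SECRET}",
        scope="PRINCIPAL_ROLE:read_write_all",
    )
    return catalog.properties["token"]

if __name__ == "__main__":
    token = fetch_polaris_token()
    # Print the Step 2 statement with the new token, ready to paste into psql
    print(f"""SELECT bdr.replicate_ddl_command($$
  SELECT pgaa.add_catalog(
    'snowflake_polaris',
    'iceberg-rest',
    '{{"url": "{URI}", "token": "{token}", "warehouse": "pgaa_test"}}'
  );
$$);""")
```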
Catalog Import Failures
- Symptom: `pgaa.import_catalog()` returns errors
- Solution: Verify network connectivity and ensure the catalog URL is accessible from all PostgreSQL nodes
Query Performance Issues
- Symptom: Slow query performance when accessing Iceberg tables
- Solution: Ensure proper partition pruning and consider creating materialized views for frequently accessed data
Next Steps
After successfully configuring catalog integration:
- Establish token refresh procedures for production environments
- Configure monitoring for catalog connectivity
- Implement data lifecycle policies for automated offloading
- Optimize query patterns for mixed PostgreSQL-Iceberg workloads