How-To Lakehouse Read With/Without A Catalog v1.3
Overview
This guide demonstrates how to configure PGAA Lakehouse clusters for accessing data lake formats, both with and without catalog integration. PGAA supports direct access to Delta Lake and Apache Iceberg tables, enabling analytical workloads on object storage without data movement.
Prerequisites
System Requirements
- PGAA Lakehouse cluster provisioned and accessible
- PostgreSQL File System (PGFS) version 2.0.1 or later recommended
- Administrative privileges for catalog and storage configuration
Access Requirements
- For catalog integration: Valid credentials for target catalog service
- For direct access: Read permissions on target S3 buckets or storage locations
- Network connectivity to storage endpoints
Configuration Approaches
PGAA supports three primary approaches for lakehouse data access:
- Catalog-managed access: Full integration with external catalog services
- Direct Delta Lake access: Query Delta tables without catalog dependency
- Direct Iceberg access: Query Iceberg tables using metadata files
AWS S3 Tables Integration
Python Client Configuration
Python clients access AWS S3 Tables through its Iceberg REST-compatible endpoint. Configure PyIceberg as follows:
from pyiceberg.catalog import load_catalog

# Define S3 Tables parameters
REGION = "eu-north-1"  # Your S3 Tables bucket region
ARN = "arn:aws:s3tables:eu-north-1:0123456789:bucket/your-bucket"

s3tables_catalog = load_catalog(
    "s3tables_catalog",
    **{
        "type": "rest",
        "warehouse": ARN,
        "uri": f"https://s3tables.{REGION}.amazonaws.com/iceberg",
        "rest.sigv4-enabled": "true",
        "rest.signing-name": "s3tables",
        "rest.signing-region": REGION,
    },
)
Note: PyIceberg versions ≤0.9.1 perform credential resolution at each API call and cannot pass S3 options directly during catalog definition.
PGAA Native Configuration
PGAA implements a native Rust S3 Tables client, eliminating REST proxy overhead:
SELECT bdr.replicate_ddl_command($$
  -- Drop any previous definition, then register the S3 Tables catalog natively
  SELECT pgaa.delete_catalog('s3tables_catalog');
  SELECT pgaa.add_catalog(
    's3tables_catalog',
    'iceberg-s3tables',
    '{"arn": "arn:aws:s3tables:eu-north-1:0123456789:bucket/your-bucket", "region": "eu-north-1"}'
  );
  -- Import the catalog metadata so its tables become queryable
  SELECT pgaa.import_catalog('s3tables_catalog');
$$);
Key Differences:
- Catalog type: iceberg-s3tables (not iceberg-rest)
- Required parameters: ARN and region only
- Authentication: AWS credential chain (profiles, environment variables, IMDS)
- Explicit credentials not supported; relies on SDK credential resolution
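Once pgaa.import_catalog() completes, the imported namespaces should appear as schemas in the cluster. The following is a minimal verification sketch; the namespace and table names are hypothetical and depend on what exists in your S3 Tables bucket:

-- Confirm that the imported namespaces are visible as schemas
SELECT schema_name
FROM information_schema.schemata
ORDER BY schema_name;

-- Query a table from an imported namespace (names are illustrative)
SELECT count(*) FROM example_namespace.example_table;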
Direct Delta Lake Access
Storage Location Setup
Configure storage locations for Delta Lake tables:
-- PGFS 2.0.1 and later
SELECT pgfs.create_storage_location(
  'biganimal-sample-data',
  's3://beacon-analytics-demo-data-us-east-1-prod',
  '{"aws_skip_signature": "true"}'
);

-- Legacy syntax (before PGFS 2.0.1)
SELECT pgfs.create_storage_location(
  'biganimal-sample-data',
  's3://beacon-analytics-demo-data-eu-west-1-prod',
  NULL,                              -- managed storage location ID (unused)
  '{"aws_skip_signature": "true"}',  -- options
  '{}'                               -- credentials
);
Creating Delta Tables
Map Delta Lake tables using the PGAA table access method:
-- Create customer table
CREATE TABLE public.customer ()
  USING PGAA
  WITH (
    pgaa.storage_location = 'biganimal-sample-data',
    pgaa.path = 'tpch_sf_1/customer'
  );

-- Verify table access
SELECT count(*) FROM public.customer;

-- Create additional TPC-H tables
CREATE TABLE public.lineitem ()
  USING PGAA
  WITH (pgaa.storage_location = 'biganimal-sample-data', pgaa.path = 'tpch_sf_1/lineitem');

CREATE TABLE public.orders ()
  USING PGAA
  WITH (pgaa.storage_location = 'biganimal-sample-data', pgaa.path = 'tpch_sf_1/orders');

CREATE TABLE public.nation ()
  USING PGAA
  WITH (pgaa.storage_location = 'biganimal-sample-data', pgaa.path = 'tpch_sf_1/nation');
Working with Large Datasets
Configure a separate schema for scale testing:
CREATE SCHEMA tpch;

-- Create 1TB scale factor tables
CREATE TABLE tpch.lineitem ()
  USING PGAA
  WITH (pgaa.storage_location = 'biganimal-sample-data', pgaa.path = 'tpch_sf_1000/lineitem');

CREATE TABLE tpch.customer ()
  USING PGAA
  WITH (pgaa.storage_location = 'biganimal-sample-data', pgaa.path = 'tpch_sf_1000/customer');

-- Additional tables follow the same pattern...
Direct Iceberg Access
Configuring Iceberg Storage
-- Create storage location for Iceberg data
SELECT pgfs.create_storage_location(
  'biganimal-sample-data-dev',
  's3://beacon-analytics-demo-data-us-east-1-dev',
  '{"aws_skip_signature": "true"}'
);

-- Create Iceberg table reference
CREATE TABLE iceberg_table ()
  USING PGAA
  WITH (
    pgaa.storage_location = 'biganimal-sample-data-dev',
    pgaa.path = 'iceberg-example/default.db/iceberg_table',
    pgaa.format = 'iceberg'  -- Explicitly specify format
  );

-- Query Iceberg data
SELECT * FROM iceberg_table ORDER BY key ASC;
Catalog Attachment
Connecting to Existing Catalogs
Attach and query data from configured catalogs:
-- Add REST catalog
SELECT pgaa.add_catalog(
  'lakekeeper-test',
  'iceberg-rest',
  '{
    "url": "https://catalog-endpoint.example.com",
    "token": "your-api-token",
    "warehouse": "warehouse-id",
    "danger_accept_invalid_certs": "true"
  }'
);

-- Attach catalog to current session
SELECT pgaa.attach_catalog('lakekeeper-test');

-- Query catalog tables
SELECT COUNT(*) FROM tpch_sf_1.lineitem;
Verifying Catalog Access
-- Check available schemas
SELECT *
FROM information_schema.schemata
WHERE schema_name LIKE 'tpch%';

-- Verify offloaded data views
SELECT * FROM partitioned_table_offloaded LIMIT 10;
Query Validation
Sample Analytical Query
Validate the configuration with a multi-table analytical query (TPC-H Q10):
SELECT
  c_custkey,
  c_name,
  sum(l_extendedprice * (1 - l_discount)) AS revenue,
  c_acctbal,
  n_name,
  c_address,
  c_phone,
  c_comment
FROM customer, orders, lineitem, nation
WHERE c_custkey = o_custkey
  AND l_orderkey = o_orderkey
  AND o_orderdate >= CAST('1993-10-01' AS date)
  AND o_orderdate < CAST('1994-01-01' AS date)
  AND l_returnflag = 'R'
  AND c_nationkey = n_nationkey
GROUP BY c_custkey, c_name, c_acctbal, c_phone, n_name, c_address, c_comment
ORDER BY revenue DESC
LIMIT 20;
Expected Results:
- Customer#000057040: revenue 734235.2455
- Customer#000143347: revenue 721002.6948
- Customer#000060838: revenue 679127.3077
Best Practices
Storage Configuration
- Use PGFS 2.0.1+ simplified syntax when available
- Configure region-appropriate endpoints for optimal performance
- Implement proper credential management through AWS credential chain
Catalog Management
- Regularly refresh OAuth tokens for REST catalogs
- Use native clients when available (S3 Tables)
- Import catalog metadata after configuration changes (a re-registration sketch follows this list)
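As a sketch of the token-refresh and re-import workflow, reusing the add/delete/import calls shown earlier: the catalog name, endpoint, warehouse ID, and token values are illustrative.

-- Re-register the catalog with the rotated token
SELECT pgaa.delete_catalog('lakekeeper-test');
SELECT pgaa.add_catalog(
  'lakekeeper-test',
  'iceberg-rest',
  '{"url": "https://catalog-endpoint.example.com", "token": "new-api-token", "warehouse": "warehouse-id"}'
);

-- Re-import catalog metadata so new or changed tables become visible
SELECT pgaa.import_catalog('lakekeeper-test');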
Performance Optimization
- Partition large datasets appropriately
- Create materialized views for frequently accessed data (a sketch follows this list)
- Monitor query execution plans for optimization opportunities
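A minimal sketch of the materialized-view suggestion above, assuming the TPC-H public.orders table mapped earlier; the view name and aggregation are illustrative:

-- Materialize a frequently used aggregate over a PGAA-backed Delta table
CREATE MATERIALIZED VIEW public.monthly_order_revenue AS
SELECT date_trunc('month', o_orderdate) AS order_month,
       sum(o_totalprice)                AS total_revenue
FROM public.orders
GROUP BY 1;

-- Refresh after the underlying lakehouse data changes
REFRESH MATERIALIZED VIEW public.monthly_order_revenue;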
Troubleshooting
Common Issues
Storage Access Errors
- Verify S3 bucket permissions and network connectivity (a smoke test is sketched after this list)
- Check credential chain configuration for AWS authentication
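One way to isolate storage problems is a smoke test against a small, known-good table. This sketch reuses the public sample storage location and the TPC-H nation path from the Delta Lake section; substitute your own storage location and path as needed:

-- Map a small table and confirm the bucket is reachable and readable
CREATE TABLE public.nation_smoke_test ()
  USING PGAA
  WITH (pgaa.storage_location = 'biganimal-sample-data', pgaa.path = 'tpch_sf_1/nation');

SELECT count(*) FROM public.nation_smoke_test;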
Catalog Connection Failures
- Validate catalog endpoint URLs and credentials
- Ensure proper SSL certificate validation settings
Query Performance Degradation
- Review table statistics and partition pruning (an EXPLAIN sketch follows this list)
- Consider data locality and caching strategies
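To see how a query is actually executed, inspect its plan with standard PostgreSQL EXPLAIN. This generic sketch assumes the public.lineitem table mapped earlier; the filter and aggregate are illustrative:

-- Inspect the plan and actual run time for a representative query
EXPLAIN (ANALYZE)
SELECT l_returnflag, count(*)
FROM public.lineitem
WHERE l_shipdate >= DATE '1994-01-01'
  AND l_shipdate <  DATE '1995-01-01'
GROUP BY l_returnflag;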
Next Steps
- Configure monitoring for lakehouse queries
- Establish data refresh schedules for materialized views
- Implement cost optimization through intelligent caching
- Develop governance policies for catalog access control