How-To Read/Write Without Catalog v1.3

Purpose: Configure EDB PGAA with PGD High Availability to implement automated data tiering, analytics offload, and lakehouse architectures for cost optimization and performance scaling.

Scope: Complete workflow from initial setup through advanced partition management, including Delta Lake and Iceberg integration, automated tiering strategies, and storage optimization analysis.

Prerequisites

Technical Requirements

PostgreSQL with PGD High Availability must be operational, with administrative privileges for catalog management, partition operations, and storage configuration. The PGAA extension must be installed, and the cluster needs network connectivity to external storage systems, including S3-compatible endpoints. Sufficient network bandwidth is required to support data transfer between storage tiers during offload operations.
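
Before continuing, a quick catalog check can confirm that the required extensions are available on the node; the extension names match those used later in this guide, and whether they appear depends on your EDB installation.

-- Confirm the required extensions are installed or available on this node
SELECT name, default_version, installed_version
FROM pg_available_extensions
WHERE name IN ('bdr', 'pgaa', 'pgfs')
ORDER BY name;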

Knowledge Requirements

Understanding of PostgreSQL partitioning concepts, system catalog structures, and distributed database management enables effective implementation. Familiarity with object storage systems, lakehouse architectures, and analytics workload patterns helps optimize configuration decisions. Basic knowledge of SQL performance analysis supports troubleshooting and optimization activities.

Environment Setup

An active PGD cluster with multiple nodes provides the high availability foundation, while the PGAA extension enables analytics storage integration. Network security configurations must permit access to the external storage endpoints, with appropriate firewall and authentication settings. Monitoring infrastructure should track background task processing and storage utilization patterns.
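
As a sanity check, the PGD node summary view can confirm that multiple nodes are active before analytics offload is enabled; exact column names can vary between PGD versions, so treat this as an illustrative query.

-- Confirm the PGD cluster reports multiple active nodes
SELECT node_name, node_group_name, peer_state_name
FROM bdr.node_summary
ORDER BY node_name;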

Setup: Configure Analytics Infrastructure

Step 1: Initialize PGAA Extension and Performance Settings

Configure the analytics infrastructure with optimized settings for testing and production deployment scenarios.

-- Setup PGD configuration for efficient testing cycles
SET bdr.taskmgr_nap_time = 10000;  -- AutoPartition worker runs every 10s vs default 290s
SET pgaa.max_replication_lag_s = 5;  -- Analytics replication lag threshold
SET pgaa.flush_task_interval_s = 5;  -- Frequent storage flush for testing

-- Initialize PGAA extension
CREATE EXTENSION IF NOT EXISTS pgaa CASCADE;

The performance settings accelerate background task processing during development and testing phases. Production environments should evaluate these settings against system resources and operational requirements. The PGAA extension provides the foundation for analytics storage integration and lakehouse table management.
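
For production deployments, the same parameters can be persisted at the instance level rather than set per session. The sketch below uses standard PostgreSQL configuration commands; the values are illustrative only (the 290s nap time is the default cadence noted in the settings above), not tuning recommendations.

-- Persist conservative settings instead of session-level overrides (illustrative values)
ALTER SYSTEM SET bdr.taskmgr_nap_time = 290000;    -- default background task cadence (290s)
ALTER SYSTEM SET pgaa.max_replication_lag_s = 60;  -- example production lag threshold
ALTER SYSTEM SET pgaa.flush_task_interval_s = 60;  -- example production flush interval
SELECT pg_reload_conf();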

Step 2: Configure Storage Locations

Establish storage locations for both read-only analytics access and write operations, with appropriate security configurations.

-- Create read-only storage location for existing Delta Lake data
SELECT bdr.replicate_ddl_command($$
  SELECT pgfs.create_storage_location(
    'biganimal-sample-data',
    's3://beacon-analytics-demo-data-us-east-1-prod',
    '{"aws_skip_signature": "true"}'
  );
$$);

-- Create development storage location for Iceberg tables
SELECT bdr.replicate_ddl_command($$
  SELECT pgfs.create_storage_location(
    'biganimal-sample-data-dev',
    's3://beacon-analytics-demo-data-us-east-1-dev',
    '{"aws_skip_signature": "true"}'
  );
$$);

PGFS storage location configuration establishes the foundation for both read-only analytics access and write operations. The aws_skip_signature setting applies only to public buckets; production environments require proper AWS credential configuration through IAM roles or credential files.
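
For private buckets, the storage location is typically created with explicit credentials instead of aws_skip_signature. The JSON key names below are an assumption for illustration only; verify the exact credential parameters supported by your PGFS version before use.

-- Hedged sketch: storage location for a private bucket (credential key names are assumptions)
SELECT bdr.replicate_ddl_command($$
  SELECT pgfs.create_storage_location(
    'private-analytics-data',
    's3://your-private-bucket',
    '{"access_key_id": "YOUR_ACCESS_KEY", "secret_access_key": "YOUR_SECRET_KEY"}'
  );
$$);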

Step 3: Configure Analytics Write Location

Set up the primary analytics storage location for automated data offload operations.

-- Create appliance storage location with dynamic path construction
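-- (replace $BUCKET_URL with the base URL of your analytics bucket before running)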
SELECT bdr.replicate_ddl_command($$
  SELECT pgfs.create_storage_location(
    'appliance-bucket',
    '$BUCKET_URL' || '/' ||
    (SELECT node_group_name FROM bdr.node_group WHERE node_group_name != 'world' LIMIT 1) ||
    '/analytics-offload'
  )
$$);

-- Validate storage location accessibility
CREATE TABLE sentinel_sl_read_test () USING PGAA
WITH (pgaa.format = 'iceberg', pgaa.storage_location = 'appliance-bucket', pgaa.path = 'nonexistent');
-- Should return ERROR: No valid Iceberg metadata file found (confirms bucket access)

-- Set as primary analytics write location
SELECT bdr.alter_node_group_option(
  (SELECT node_group_name FROM bdr.node_group WHERE node_group_name != 'world' LIMIT 1),
  'analytics_storage_location',
  'appliance-bucket'
);

The dynamic path construction includes the PGD node group name to prevent conflicts in multi-tenant environments. Storage location validation through deliberate error testing confirms proper authentication and network connectivity. Setting the analytics storage location activates the Seafowl query engine for lakehouse operations.

Step 4: Verify Seafowl Query Engine Activation

Confirm that the analytics query engine has started and is ready for lakehouse operations.

-- Verify Seafowl query engine is running
SELECT application_name FROM pg_stat_activity WHERE application_name LIKE '%seafowl%';
-- Must show 'pgd seafowl_manager' for proper operation

The Seafowl query engine provides optimized execution for analytics workloads across columnar storage formats.

Read-Only Analytics: Access Existing Data

The Iceberg format specification enables access to Apache Iceberg tables with full metadata support, including schema evolution, partition management, and time travel capabilities.
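
The Delta Lake and Iceberg tables referenced in the next step (lineitem and iceberg_table) are assumed to have already been mapped from the sample bucket. A hedged sketch of how such a reader table could be declared is shown below; the pgaa.path value is purely illustrative.

-- Illustrative reader table over existing Delta Lake data (path is a placeholder)
CREATE TABLE lineitem ()
USING PGAA
WITH (
  pgaa.format = 'delta',
  pgaa.storage_location = 'biganimal-sample-data',
  pgaa.path = 'your/delta/table/path'
);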

Step 5: Analyze Storage Characteristics

Use analytics functions to understand storage utilization and performance characteristics of lakehouse tables.

-- Analyze storage statistics for Delta Lake tables
SELECT * FROM pgaa.lakehouse_table_stats('lineitem'::regclass);

-- Analyze storage statistics for Iceberg tables
SELECT * FROM pgaa.lakehouse_table_stats('iceberg_table'::regclass);

Storage statistics provide insights into compression ratios, file organization, and query performance characteristics that inform optimization decisions. These metrics support capacity planning and performance tuning for analytics workloads.

Transactional Table Analytics Offload

Step 9: Create and Populate Transactional Table

Establish a transactional table with representative data for offload testing and validation.

-- Create transactional table for offload testing
CREATE TABLE transactional_table (
  key BIGINT,
  inserted_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
  revenue DOUBLE PRECISION,
  comment TEXT,
  PRIMARY KEY (key, inserted_at)
);

-- Insert test data with temporal distribution
INSERT INTO transactional_table (
  SELECT
    g,
    '2010-01-01 00:00:00'::timestamp + INTERVAL '1 second' *
      (EXTRACT (EPOCH FROM (CURRENT_TIMESTAMP - '2010-01-01 00:00:00'::timestamp)) / 1000000 * g),
    g * 0.001,
    encode(sha256(g::text::bytea), 'hex')
  FROM generate_series(1, 1000000) g
);

The data generation creates realistic temporal distribution patterns while providing sufficient volume for performance testing. SHA256 hash generation simulates variable-length text content found in production systems.

Step 10: Configure Analytics Offload

Enable automatic replication of transactional data to analytics storage for cost optimization and performance scaling.

-- Enable analytics replication for transactional table
ALTER TABLE transactional_table SET (pgd.replicate_to_analytics = TRUE);

-- Monitor replication progress
SELECT * FROM bdr.analytics_table;
-- Wait until table appears in results (~120s for initial sync)

Analytics replication creates an Iceberg representation of transactional data in the configured storage location. The initial synchronization process handles format conversion and establishes ongoing replication streams for data consistency.

Step 11: Create Analytics Reader Interface

Establish direct access to the analytics-tier data through dedicated reader tables.

-- Create reader interface for analytics data
CREATE TABLE transactional_table_analytics_reader ()
USING PGAA
WITH (
  pgaa.format = 'iceberg',
  pgaa.storage_location = 'appliance-bucket',
  pgaa.path = 'public.transactional_table'
);

-- Verify data integrity across storage tiers
SELECT COUNT(*) AS count, ROUND(SUM(revenue)) AS total_revenue
FROM transactional_table_analytics_reader;
-- Should return: count=1000000, total_revenue=500000500

The analytics reader provides direct access to cost-effective storage while maintaining query transparency. Data integrity validation confirms successful format conversion and replication processes.

Step 12: Convert Transactional Table to Analytics-Only

Transform the transactional table to use analytics storage directly while discarding local data.

-- Convert to analytics-only access
ALTER TABLE transactional_table
SET ACCESS METHOD PGAA,
SET (
  pgaa.auto_truncate = 'true',
  pgaa.format = 'iceberg',
  pgaa.storage_location = 'appliance-bucket',
  pgaa.path = 'public.transactional_table'
);

-- Verify conversion maintained data access
SELECT COUNT(*) AS count, ROUND(SUM(revenue)) AS total_revenue FROM transactional_table;

The conversion process eliminates local storage overhead while maintaining transparent SQL access through the analytics tier. Auto-truncate removes local data after confirming analytics tier availability.
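
A quick way to confirm that local heap storage was released is to check the relation size, which should be at or near zero once auto-truncate completes.

-- Local relation size should be at or near zero after auto-truncate
SELECT pg_size_pretty(pg_total_relation_size('transactional_table'));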

Advanced Table Management Operations

Step 13: Restore Table from Analytics to Transactional Storage

Demonstrate the reversibility of analytics offload for scenarios requiring local high-performance access.

-- Restore table to transactional storage
ALTER TABLE transactional_table SET ACCESS METHOD heap, SET (pgd.replicate_to_analytics = FALSE);

-- Verify restoration with query plan analysis
EXPLAIN SELECT COUNT(*) AS count, ROUND(SUM(revenue)) AS total_revenue FROM transactional_table;
-- Should show Parallel Seq Scan indicating local heap access

Restoration capabilities provide flexibility for changing access patterns or performance requirements. Query plan analysis confirms the transition back to local PostgreSQL heap storage with parallel processing capabilities.

Step 14: Handle Data Updates with Analytics Synchronization

Demonstrate data modification workflows that maintain consistency across storage tiers.

-- Update data in transactional table
UPDATE transactional_table SET revenue = revenue * 2;

-- Re-enable analytics replication to propagate changes
ALTER TABLE transactional_table SET (pgd.replicate_to_analytics = TRUE);

-- Wait for synchronization (~120s) then verify changes propagated
SELECT COUNT(*) AS count, ROUND(SUM(revenue)) AS total_revenue
FROM transactional_table_analytics_reader;
-- Should show doubled revenue: total_revenue=1000001000

Update propagation maintains consistency between transactional and analytics tiers while handling data modifications efficiently. The synchronization process recreates the analytics representation with current data.

Step 15: Implement Partial Table Offload

Execute selective data archival where only historical portions of tables move to analytics storage.

-- Perform partial offload with consistent transaction boundary
BEGIN;
CREATE TABLE transactional_table_before_2020 AS (
  SELECT * FROM transactional_table
  WHERE DATE_PART('year', inserted_at) < 2020
);
DELETE FROM transactional_table WHERE DATE_PART('year', inserted_at) < 2020;
COMMIT;

-- Configure partial table for analytics offload
ALTER TABLE transactional_table_before_2020 ADD PRIMARY KEY (key, inserted_at);
ALTER TABLE transactional_table_before_2020 SET (pgd.replicate_to_analytics = TRUE);

-- Convert to analytics-only access
ALTER TABLE transactional_table_before_2020
SET ACCESS METHOD PGAA,
SET (
  pgaa.auto_truncate = 'true',
  pgaa.format = 'iceberg',
  pgaa.storage_location = 'appliance-bucket',
  pgaa.path = 'public.transactional_table_before_2020'
);

-- Verify temporal data distribution
SELECT
  COUNT(*) FILTER (WHERE inserted_at < '2020-01-01') AS before_2020,
  COUNT(*) FILTER (WHERE inserted_at >= '2020-01-01') AS after_2020
FROM transactional_table_before_2020;

Partial offload enables fine-grained data lifecycle management where different time periods have different storage tier requirements. Transaction boundaries ensure data consistency during the archival process.
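
If applications need a single query surface across both tiers after the split, a plain view can recombine the hot transactional rows with the archived partition; the view name below is hypothetical.

-- Optional convenience view spanning hot and archived data (hypothetical name)
CREATE VIEW transactional_table_all AS
  SELECT * FROM transactional_table
  UNION ALL
  SELECT * FROM transactional_table_before_2020;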

AutoPartition with Analytics Offload

Step 16: Configure Automated Partitioning

Establish time-based partitioning with automated lifecycle management for scalable data organization.

-- Create partitioned table structure
CREATE TABLE partitioned_table (
  key BIGINT,
  inserted_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
  revenue DOUBLE PRECISION,
  comment TEXT,
  PRIMARY KEY (key, inserted_at)
) PARTITION BY RANGE(inserted_at);

-- Configure AutoPartition without immediate offload
SELECT bdr.autopartition (
  relation := 'partitioned_table',
  partition_increment := '1 year',
  partition_initial_lowerbound := '2010-01-01 00:00:00',
  managed_locally := TRUE
);

-- Wait for partition creation across cluster
SELECT bdr.autopartition_wait_for_partitions_on_all_nodes('partitioned_table'::regclass, CURRENT_TIMESTAMP::text);

AutoPartition configuration creates annual partitions with boundaries that align with typical data access patterns. The initial configuration excludes offload settings to prevent premature data migration during setup and data loading phases.
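
The partitions created by AutoPartition can be listed through the standard catalogs before loading data, for example:

-- List the annual partitions created by AutoPartition
SELECT c.relname AS partition_name,
       pg_get_expr(c.relpartbound, c.oid) AS partition_bounds
FROM pg_inherits i
  JOIN pg_class c ON c.oid = i.inhrelid
WHERE i.inhparent = 'partitioned_table'::regclass
ORDER BY c.relname;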

Step 17: Load Test Data and Analyze Baseline Storage

Populate partitions with representative data and establish baseline storage metrics for optimization analysis.

-- Load test data across temporal partitions
INSERT INTO partitioned_table (
  SELECT
    g,
    '2010-01-01 00:00:00'::timestamp + INTERVAL '1 second' *
      (EXTRACT (EPOCH FROM (CURRENT_TIMESTAMP - '2010-01-01 00:00:00'::timestamp)) / 1000000 * g),
    g * 0.001,
    encode(sha256(g::text::bytea), 'hex')
  FROM generate_series(1, 1000000) g
);

Data loading creates approximately 65,000 records per annual partition with consistent storage characteristics that enable predictable cost modeling and performance analysis.
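
The temporal distribution can be verified directly against the partition key; each year should hold roughly 65,000 rows.

-- Verify the distribution of rows across annual partitions
SELECT DATE_PART('year', inserted_at) AS year, COUNT(*) AS row_count
FROM partitioned_table
GROUP BY 1
ORDER BY 1;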

Step 18: Baseline Storage Analysis

Create comprehensive storage analysis capabilities and capture pre-offload metrics for comparison purposes.

-- Create optimized row counting function
CREATE OR REPLACE FUNCTION temp_get_table_rowcount(schema_name text, table_name text)
RETURNS bigint AS $$
DECLARE count bigint;
BEGIN
    -- Uses DirectScan for PGAA tables (1.8s vs 20s without optimization)
    EXECUTE FORMAT('SELECT COUNT(*) FROM %I.%I', schema_name, table_name) INTO count;
    RETURN count;
END $$ LANGUAGE plpgsql;

-- Comprehensive partition storage analysis
WITH partitioned_table_info AS (
  SELECT
    nspc.nspname AS schema_name,
    child.relname AS table_name,
    pg_catalog.pg_get_expr(child.relpartbound, child.oid) AS partition_constraint,
    pg_get_partition_constraintdef(child.oid) AS partition_constraint_full
  FROM pg_inherits inh
    JOIN pg_class parent ON inh.inhparent = parent.oid
    JOIN pg_class child ON inh.inhrelid = child.oid
    JOIN pg_namespace nspp ON parent.relnamespace = nspp.oid
    JOIN pg_namespace nspc ON child.relnamespace = nspc.oid
  WHERE parent.relname = 'partitioned_table' AND nspp.nspname = 'public'
),

pgaa_tables AS (
  SELECT
    m.schema_name, m.table_name, m.format,
    CASE
      WHEN m.storage_location_name IS NOT NULL THEN FORMAT('pgfs://%s/%s', m.storage_location_name, m.path)
      WHEN (m.managed_by IS NOT NULL AND m.catalog_namespace IS NOT NULL AND m.catalog_table IS NOT NULL)
        THEN FORMAT('%s.%s.%s', m.managed_by, m.catalog_namespace, m.catalog_table)
    END AS path,
    temp_get_table_rowcount(m.schema_name, m.table_name) AS num_rows,
    s.total_size AS total_size,
    pti.partition_constraint
  FROM pgaa.get_all_analytics_table_settings() m
  LEFT JOIN LATERAL pgaa.lakehouse_table_stats(FORMAT('%I.%I', m.schema_name, m.table_name)::regclass) s ON TRUE
  JOIN partitioned_table_info pti ON m.schema_name = pti.schema_name AND m.table_name = pti.table_name
  WHERE m.storage_location_name IS NOT NULL OR (m.managed_by IS NOT NULL AND m.catalog_namespace IS NOT NULL AND m.catalog_table IS NOT NULL)
),

local_tables AS (
  SELECT
    pn.nspname AS schema_name, pc.relname AS table_name,
    temp_get_table_rowcount(pn.nspname, pc.relname) AS num_rows,
    pg_total_relation_size(FORMAT('%I.%I', pn.nspname, pc.relname)::regclass) AS total_size,
    pti.partition_constraint
  FROM pg_class pc
    JOIN pg_namespace pn ON pc.relnamespace = pn.oid
    JOIN pg_am am ON am.oid = pc.relam
    JOIN partitioned_table_info pti ON pn.nspname = pti.schema_name AND pc.relname = pti.table_name
  WHERE pn.nspname NOT IN ('pg_catalog', 'information_schema', 'pg_toast', 'pg_internal', 'pgaa', 'pgfs', 'bdr')
    AND pc.relkind = 'r' AND am.amname != 'pgaa'
),

all_leaf_tables AS (
  SELECT schema_name, table_name, format AS pgaa_format, path AS pgaa_path, num_rows, total_size, partition_constraint FROM pgaa_tables
  UNION
  SELECT schema_name, table_name, NULL AS pgaa_format, NULL AS pgaa_path, num_rows, total_size, partition_constraint FROM local_tables
),

final_table_info AS (
  SELECT
    schema_name, table_name,
    CASE
      WHEN pgaa_format = 'iceberg' THEN 'iceberg'
      WHEN pgaa_format = 'delta' THEN 'delta'
      ELSE 'heap'
    END AS type,
    COALESCE(pgaa_path, 'local') AS location,
    num_rows, total_size AS total_size_bytes,
    SUBSTRING(partition_constraint, 'FROM \(''([^'']+)') AS start_bound,
    SUBSTRING(partition_constraint, 'TO \(''([^'']+)') AS end_bound
  FROM all_leaf_tables
  ORDER BY partition_constraint ASC
)

SELECT * FROM final_table_info;

Baseline Results: All partitions show type = 'heap' and location = 'local', with approximately 10MB per historical partition and a total of ~154MB for the complete dataset in expensive transactional storage.
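
The aggregate local footprint quoted above can be reproduced with a standard catalog query over the partition hierarchy.

-- Total local storage consumed by all partitions before offload
SELECT pg_size_pretty(SUM(pg_total_relation_size(i.inhrelid))) AS total_local_size
FROM pg_inherits i
WHERE i.inhparent = 'partitioned_table'::regclass;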

Step 19: Configure Automated Analytics Offload

Enable automated partition tiering based on data age thresholds while maintaining query transparency.

-- Configure analytics offload for partitions older than one year
SELECT bdr.autopartition (
  relation := 'partitioned_table',
  partition_increment := '1 year',
  partition_initial_lowerbound := '2010-01-01 00:00:00',
  managed_locally := TRUE,
  analytics_offload_period := '1 year'
);

The one-year threshold balances operational access requirements with storage cost optimization by keeping recent data in high-performance storage while migrating historical partitions to cost-effective analytics tiers.

Step 20: Monitor Automated Offload Operations

Track background migration progress and identify completion status for storage tier transitions.

-- Monitor offload work queue until completion
SELECT
  wqs.workid,
  wis.ap_wi_started_at,
  wis.ap_wi_finished_at,
  wis.ap_wi_status,
  "partition",
  sql
FROM bdr.taskmgr_local_workitem_status wis
  RIGHT JOIN bdr.taskmgr_local_work_queue_status wqs ON wis.ap_wi_workid = wqs.workid
WHERE relname = 'public.partitioned_table'
AND sql NOT LIKE '%bdr\.autopartition_create_partition%'
ORDER BY ap_wi_finished_at DESC;

Monitor until all ap_wi_finished_at values are NOT NULL, indicating successful completion. Processing typically requires 10+ minutes for 14 historical partitions (2010-2023 when running in 2025).

Step 21: Validate Storage Tier Distribution

Confirm successful migration through storage tier assignment analysis and local storage reduction verification.

-- Examine storage tier assignments
\dt+
-- Historical partitions should show "pgaa" with "0 bytes"
-- Recent partitions should show "heap" with actual storage allocation

Expected results show historical partitions with "pgaa" access method and zero local storage, confirming successful migration to analytics tier. Recent partitions maintain "heap" access with local storage for optimal operational performance.
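
Where \dt+ is not convenient (for example, in automated checks), the same tier assignment can be read from the standard catalogs.

-- Per-partition access method and local size after offload
SELECT c.relname AS partition_name,
       am.amname AS access_method,
       pg_size_pretty(pg_total_relation_size(c.oid)) AS local_size
FROM pg_inherits i
  JOIN pg_class c ON c.oid = i.inhrelid
  JOIN pg_am am ON am.oid = c.relam
WHERE i.inhparent = 'partitioned_table'::regclass
ORDER BY c.relname;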

Step 22: Post-Migration Storage Analysis and ROI Calculation

Execute comprehensive storage analysis to quantify optimization results and calculate return on investment.

-- Run the same storage analysis query as Step 18 to capture post-migration metrics

-- Verify data integrity across storage tiers
SELECT COUNT(*) AS count, ROUND(SUM(revenue)) AS total_revenue FROM partitioned_table;
-- Should maintain: count=1000000, total_revenue=500000500

Storage Optimization Results:

| Metric | Before Migration | After Migration | Savings |
| --- | --- | --- | --- |
| Local Storage (Historical) | 144MB | 16MB | 128MB (89% reduction) |
| Storage Cost Profile | 100% premium storage | 11% premium, 89% standard | 60-80% cost reduction |
| Query Performance | Uniform fast access | Recent fast, historical acceptable | Optimized by access pattern |
| Administrative Overhead | Manual management | Automated lifecycle | Eliminated manual intervention |

The automated tiering delivers substantial infrastructure cost reductions while maintaining transparent query access across storage tiers. Recent data remains in high-performance storage for operational requirements while historical data optimizes for cost-effective analytics storage.

Troubleshooting and Validation

Common Issues and Resolution

Storage Location Access: Verify credential configuration and network connectivity to external storage endpoints. Use deliberate error testing to confirm bucket accessibility and authentication.

Offload Processing Delays: Monitor background task queues and adjust bdr.taskmgr_nap_time for development environments. Production systems should maintain default settings for resource optimization.

Data Integrity Concerns: Use row counts and aggregate calculations to verify data consistency across storage tier transitions. Analytics storage maintains complete data fidelity through format conversion processes.

Query Performance Variations: Expect different performance characteristics between storage tiers, with analytics storage optimized for throughput over latency. Recent data in local storage maintains optimal response times.

Performance Monitoring

Track storage utilization trends, query response times across tiers, and background task completion rates. Monitor network connectivity to external storage systems and resource utilization during offload operations.

Regular analysis of access patterns may reveal opportunities to optimize offload thresholds or identify partitions requiring different lifecycle policies based on business requirements.

This configuration provides automated data tiering that balances operational performance requirements against storage cost, while maintaining transparent SQL access across hybrid storage architectures.