CDC Failover support v5.7.0

Availability

This is a PGD 5.7 and later feature. It is not supported on earlier versions of PGD.

Background

Earlier versions of PGD have allowed the creation of logical replication slots on nodes that can provide a feed of the logical changes happening to the data in the database. These logical replication slots have been local to the node and not replicated. Apart from only replicating changes on the particular node, this behavior has presented challenges when faced with node failover in the cluster. In that scenario, a consumer of the logical replication off a node that fails has no replica of the slot on another node to continue consuming from.

While solutions to this can be engineered using a subscriber-only node as an intermediary, it significantly raises the cost of logical replication.

CDC Failover support

To address this need, PGD 5.7 introduces CDC Failover support. This is an optionally enabled feature that activates automatic logical slot replication across the cluster. This, in turn, allows a consumer of a logical slot’s replication to receive change data from any node when a failure occurs.

How CDC Failover works

When a logical slot is created on a node with CDC Failover support enabled, the slot is replicated across the cluster. This means that the slot is available for consumption on any node in the cluster. When a node fails, the slot can be consumed from another node in the cluster. This allows for continuing the logical replication stream without interruption.

If, though, the consumer of the slot connects to a different node in the cluster, the previous connection the consumer had will be closed by PGD. This behavior is to ensure that the slot isn't being consumed from multiple nodes at the same time. In the background, PGD is using its Raft consensus protocol to ensure that the slot is being consumed from only one node at a time. This means that the guarantee of only one slot being consumed at a time doesn't hold in split-brain scenarios.

Currently CDC Failover support is a global option that's controlled by a top-group option. The failover_slot_scope top-group option can currently be set to (and defaults to) local, which disables replication of logical slots, or global. The global setting enables the replication of all non-temporary logical slots created in the PGD database.

Temporary logical slots aren't replicated, as they have a lifetime scoped to the session that created them and will go away when that session ends.

At-least-once delivery guarantees

CDC Failover support takes steps to ensure that the consumer receives all changes at least once. This is done by holding back slots until delivery has been confirmed, at which point the slot is then advanced on all nodes in an asynchronous manner. In the case of a failure on the node where the slot was being consumed, the slot is held until the consumer connects to a node in the cluster. This then allows the slot to progress.

Important

If a consuming application disconnects and doesn’t reconnect, the slot will remain held back on every node in the cluster. As this consumes disk and memory, it's essential to avoid this situation. Applications that consume slots must return to consuming as soon as possible.

Exactly-once delivery

Currently, there's no way to ensure exactly-once delivery, and we expect consuming applications to manage the discarding of previously completed transactions.

Enabling CDC Failover support

To enable CDC Failover support run the SQL command and call the bdr.alter_node_group_option function with the following parameters:

select bdr.alter_node_group_option(<top-level group name>,
     'failover_slot_scope',
     'global');

Replace <top-level group name> with the name of your cluster’s top-level group. If you don't know the name, it's the group with a node_group_parent_id equal to 0 in [bdr.node_group](/pgd/5/reference/catalogs-visible#bdrnode_group).

If you do not know the name, it is the group with a node_group_parent_id equal to 0 in [bdr.node_group](/pgd/latest/reference/catalogs-visible#bdrnode_group). You can also use:

SELECT bdr.alter_node_group_option(
     node_group_name,
     'failover_slot_scope',
     'global')
    from bdr.node_group	
    where node_group_parent_id=0;

This command ensures you're setting the correct top-level group’s option.

Once CDC Failover is enabled, to create a new globally replicated slot, you can use:

SELECT pg_create_logical_replication_slot('myslot',
                                          'test_decoding');

Logical replication slots created before the option was set to global aren't replicated. Only new slots are replicated.

Failover slots can also be created with the CREATE_REPLICATION_SLOT command on a replication connection.

The status of failover slots is tracked in the bdr.failover_replication_slots table.

CDC Failover support with Postgres 17+

For Postgres 17 and later, support for failover was added to allow standbys to be resumed. Use an option in pg_create_logical_replication_slot named failover for this purpose. This new setting requires that, no matter what the setting of failover_slot_scope, you must also set failover to true.

SELECT pg_create_logical_replication_slot('myslot',
                                          'test_decoding',
                                          failover=>true);

Limitations

The CDC Failover Slot support comes with certain limitations:

  • CDC Failover slot support requires the latest versions of EDB Postgres Distributed (PGD) 5.7+ and the latest minor releases of Postgres Extended or EDB Postgres Advanced Server (available Feb 2025).
  • CDC Failover support is a global option and can't be set on a per-slot basis. Because changing the enabled status of CDC Failover doesn't affect previously provisioned slots, it's possible to enable it (set to global), create a replicated slot, then disable it (set to local) to create a singular replicated slot.
  • CDC Failover support isn't supported on temporary slots.
  • CDC Failover support isn't supported on slots created with the failover option set to false.
  • CDC Failover support works with EDB Postgres Advanced Server and EDB Postgres Extended Server only. It isn't supported on community Postgres installations.
  • Existing slots aren't converted into failover slots when the option is enabled.
  • While Postgres’s built-in functions such as pg_logical_slot_get_changes() can be used, they won’t ensure that the slot isn't being decoded anywhere else and can’t update replication progress accurately across the cluster. Therefore, we recommend that you don't rely on the function to receive decoded changes.