Hybrid Manager Multi-DC deployment guide v1.3
Overview
Why run Hybrid Manager across multiple data centers?
Multi-DC gives you high availability and disaster recovery for Postgres workloads and the Hybrid Manager (HM) control plane (CP):
Survive a site loss (DR): Keep a warm Secondary site ready. If the Primary DC is unavailable, promote replicas in the Secondary and restore service.
Minimize downtime (HA): Perform maintenance or migrations on one site while workloads continue on the other.
Protect data (RPO): Continuous replication to a second DC reduces potential data loss compared to single-site backups only.
Reduce blast radius: Faults, misconfigurations, or noisy neighbors in one DC don’t take down the other.
Meet compliance/sovereignty: Keep copies in a specific region or facility while still centralizing control.
Operate at scale: Split read traffic, stage upgrades, or run blue/green cutovers across DCs.
RTO/RPO at a glance
- RTO (time to restore service): Typically minutes, driven by your promotion/cutover runbook and DNS/LB changes.
- RPO (data loss window):
  - Async replication (common across DCs): very low, but not zero (best-effort seconds).
  - Sync replication (latency-sensitive): can approach zero data loss, but adds cross-DC latency and requires robust low-latency links.
What this guide helps you do
Connect two HM clusters (Primary ↔ Secondary) on the same provider/on-prem family.
Align object storage (identical edb-object-storage secret) so backups/artifacts are usable in both DCs.
Enable SPIRE federation (trust domains via 8444/TCP) so platform identities work cross-site.
Wire the Agent (Beacon) so the Primary can register the Secondary as a managed location and provision there (9445/TCP).
(Optional) Federate telemetry (Thanos/Loki) for cross-site metrics and logs.
Prepare a Postgres topology with a primary in one DC and replicas in the other; perform manual failover by promoting replicas.
Current limitations
Two sites (Primary and Secondary).
Manual failover: Promote replicas in the Secondary if the Primary is down.
Same cloud/on-prem family: Cross-CSP multi-DC is not supported.
Architecture at a glance
Control plane: Two HM clusters, federated via SPIRE; Primary “manages” the Secondary as a Location through Beacon.
Data nodes: Postgres primary in DC-A, replicas in DC-B (async by default).
Storage: Shared/consistent object store config across sites for backups/artifacts.
Telemetry (optional): Thanos/Loki federation to view metrics/logs across sites.
Who is this for?
This is for teams that need higher resilience than a single DC can provide, and are comfortable running a manual, well-rehearsed failover playbook with clearly defined RTO/RPO targets.
Prerequisites
Architecture prereqs
Two HM clusters available: Primary and Secondary (Kubernetes contexts configured).
Each cluster has a unique SPIRE trust domain.
Network connectivity:
8444/TCP open between clusters (SPIRE bundle endpoint).
9445/TCP from Secondary → Primary (Beacon gRPC).
Tooling: kubectl, jq (and yq if editing YAML locally).
Same provider/on-prem family (no cross-cloud).
Tools:
- jq
- yq
- AWS CLI
Scripts:
- Several helper scripts are referenced throughout this procedure. Contact EDB Professional Services or the EDB Support Center for access to them.
Discover and set necessary environment variables
Tell kubectl which clusters are Primary vs. Secondary
Set the kube contexts you’ll use for all subsequent discovery commands:

```shell
export KUBE_CONFIG_PRIMARY_CONTEXT="<primary-kube-ctx-or-arn>"
export KUBE_CONFIG_SECONDARY_CONTEXT="<secondary-kube-ctx-or-arn>"
```
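A quick, read-only sanity check that both contexts resolve before you rely on them (if either command fails, the context name/ARN above is wrong or your kubeconfig lacks access):

```shell
# Each command should print node names from the intended cluster.
kubectl --context "$KUBE_CONFIG_PRIMARY_CONTEXT" get nodes -o name | head -n 3
kubectl --context "$KUBE_CONFIG_SECONDARY_CONTEXT" get nodes -o name | head -n 3
```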
Discover the Primary portal FQDN (used by Beacon/telemetry)
Query the Primary cluster for the Beacon gateway host, then export it as PRIMARY_PORTAL_URL:

```shell
export PRIMARY_PORTAL_URL="$(
  kubectl --context "$KUBE_CONFIG_PRIMARY_CONTEXT" \
    -n upm-beacon get gw beacon-server -o json \
    | jq -r '.spec.servers[1].hosts[0]'
)"
```
(Optional) Discover the Secondary portal FQDN
If you’ll also federate telemetry, capture the Secondary portal host as SECONDARY_PORTAL_URL:

```shell
export SECONDARY_PORTAL_URL="$(
  kubectl --context "$KUBE_CONFIG_SECONDARY_CONTEXT" \
    -n upm-beacon get gw beacon-server -o json \
    | jq -r '.spec.servers[1].hosts[0]'
)"
```
Derive the Beacon gRPC endpoint for the Primary
Beacon listens on :9445 on the Primary portal host.

```shell
export BEACON_SERVER_ENDPOINT_PRIMARY="${PRIMARY_PORTAL_URL}:9445"
```
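As a minimal reachability sketch, you can probe the two required ports from a machine on a comparable network path using bash’s /dev/tcp (this checks TCP connectivity only, not TLS or application health):

```shell
# Probe Beacon gRPC (9445) and the SPIRE bundle endpoint (8444) on the
# Primary portal host. A timeout or refusal means firewall rules or the
# service itself need attention before you continue.
for port in 9445 8444; do
  if timeout 5 bash -c "exec 3<>/dev/tcp/${PRIMARY_PORTAL_URL}/${port}"; then
    echo "OK: ${PRIMARY_PORTAL_URL}:${port} reachable"
  else
    echo "FAIL: ${PRIMARY_PORTAL_URL}:${port} not reachable"
  fi
done
```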
Discover the SPIRE trust domain for the Primary
Read the SPIRE server config from the Primary and export TRUST_DOMAIN_PRIMARY:

```shell
export TRUST_DOMAIN_PRIMARY="$(
  kubectl --context "$KUBE_CONFIG_PRIMARY_CONTEXT" \
    -n spire-system get cm spire-server \
    -o jsonpath="{['data']['server\.conf']}" \
    | jq -r '.server.trust_domain'
)"
```
Discover the SPIRE trust domain for the Secondary
Do the same for the Secondary and export TRUST_DOMAIN_SECONDARY:

```shell
export TRUST_DOMAIN_SECONDARY="$(
  kubectl --context "$KUBE_CONFIG_SECONDARY_CONTEXT" \
    -n spire-system get cm spire-server \
    -o jsonpath="{['data']['server\.conf']}" \
    | jq -r '.server.trust_domain'
)"
```
Choose a label for the managed Secondary location
Pick any unique label (it will show up as managed-<label> on the Primary).

```shell
export SECONDARY_LOCATION_NAME="secondary"
```
(Optional for EKS/IRSA helpers) Set the EKS identifiers
Only do this if you’ll use the helper to update the S3 trust policy and copy the object-store secret. This step requires PRIMARY_EKS, SECONDARY_EKS, and AWS_PROFILE.

```shell
export PRIMARY_EKS="<region>:<primary-eks-name>"
export SECONDARY_EKS="<region>:<secondary-eks-name>"
export AWS_PROFILE="<aws-profile>"
```
Sanity check before proceeding
Verify the key variables are set (these will error if any are missing):

```shell
: "${KUBE_CONFIG_PRIMARY_CONTEXT:?}"; : "${KUBE_CONFIG_SECONDARY_CONTEXT:?}"
: "${PRIMARY_PORTAL_URL:?}"; : "${BEACON_SERVER_ENDPOINT_PRIMARY:?}"
: "${TRUST_DOMAIN_PRIMARY:?}"; : "${TRUST_DOMAIN_SECONDARY:?}"
: "${SECONDARY_LOCATION_NAME:?}"
```

Optional checks if you’re doing telemetry federation or using the EKS helper. These use plain warnings rather than `:?`, because a failed `:?` expansion aborts a non-interactive shell even with `|| true` appended:

```shell
[ -n "${SECONDARY_PORTAL_URL:-}" ] || echo "WARN: SECONDARY_PORTAL_URL not set (needed for telemetry federation)"
[ -n "${PRIMARY_EKS:-}" ]          || echo "WARN: PRIMARY_EKS not set (needed for EKS helper)"
[ -n "${SECONDARY_EKS:-}" ]        || echo "WARN: SECONDARY_EKS not set (needed for EKS helper)"
[ -n "${AWS_PROFILE:-}" ]          || echo "WARN: AWS_PROFILE not set (needed for EKS helper)"
```
Note
If you open a new shell or run a new CI step later, re-run these exports (or save them to a file and source it).
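One way to make re-sourcing painless is a small env file; this is a sketch (the filename multi-dc.env is arbitrary):

```shell
# Save the exports once...
cat > multi-dc.env <<EOF
export KUBE_CONFIG_PRIMARY_CONTEXT="$KUBE_CONFIG_PRIMARY_CONTEXT"
export KUBE_CONFIG_SECONDARY_CONTEXT="$KUBE_CONFIG_SECONDARY_CONTEXT"
export PRIMARY_PORTAL_URL="$PRIMARY_PORTAL_URL"
export SECONDARY_PORTAL_URL="${SECONDARY_PORTAL_URL:-}"
export BEACON_SERVER_ENDPOINT_PRIMARY="$BEACON_SERVER_ENDPOINT_PRIMARY"
export TRUST_DOMAIN_PRIMARY="$TRUST_DOMAIN_PRIMARY"
export TRUST_DOMAIN_SECONDARY="$TRUST_DOMAIN_SECONDARY"
export SECONDARY_LOCATION_NAME="$SECONDARY_LOCATION_NAME"
EOF

# ...then, in any new shell or CI step:
source multi-dc.env
```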
Object storage across locations
HM uses an object store for backups, artifacts, WAL, and internal bundles. In multi-DC, both clusters must use the same object store configuration.
Key requirement
Each cluster must have an identical Kubernetes secret named edb-object-storage in the default namespace.
Note
Create/sync edb-object-storage before installing HM at any secondary location.
You create the initial secret when setting up object storage during installation for the Primary. Replicate it now to the Secondary:
```shell
# Clean slate on Secondary
kubectl delete secret \
  --context="$KUBE_CONFIG_SECONDARY_CONTEXT" \
  -n default edb-object-storage || true

# Copy Primary → Secondary
kubectl get secret \
  --context="$KUBE_CONFIG_PRIMARY_CONTEXT" \
  -n default edb-object-storage -o yaml | \
kubectl apply \
  --context="$KUBE_CONFIG_SECONDARY_CONTEXT" \
  -n default -f -
```
EKS (IRSA) trust policy
If you use S3 + IAM Roles for Service Accounts (IRSA), the role must trust both clusters’ OIDC providers.
Either update the trust policy manually to include both OIDC issuers, or
Retrieve the object-storage-on-multi-dc.sh helper script, then use it to copy the secret and append OIDC providers to the role’s trust policy automatically:

```shell
./object-storage-on-multi-dc.sh \
  -p <region>:<primary-eks-name> \
  -s <region>:<secondary-eks-name>[,<region>:<another-eks>] \
  -a <aws-profile>
```
The object-storage-on-multi-dc.sh helper script reads the IAM role ARN from the Primary’s edb-object-storage secret, copies the identical secret to each Secondary, and appends each Secondary cluster’s OIDC provider to the role trust policy if missing.
Validation checklist
- Secrets identical (compare .data only).
- Both clusters can list/write the bucket (quick Pod/Job test); see the sketch after this list.
- IRSA role trust includes both OIDC providers (if EKS).
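A minimal sketch of the first two checks, run from a workstation for brevity (an in-cluster Pod/Job exercises IRSA more faithfully). The bucket name is an assumption you must substitute:

```shell
# Check 1: the secret payloads match byte-for-byte (compare .data only).
diff \
  <(kubectl --context "$KUBE_CONFIG_PRIMARY_CONTEXT"   -n default get secret edb-object-storage -o json | jq -S '.data') \
  <(kubectl --context "$KUBE_CONFIG_SECONDARY_CONTEXT" -n default get secret edb-object-storage -o json | jq -S '.data') \
  && echo "OK: secrets identical"

# Check 2: quick write/read round trip (hypothetical bucket name).
BUCKET="<your-object-store-bucket>"   # assumption: substitute your real bucket
echo "multi-dc probe $(date -u +%FT%TZ)" > /tmp/probe.txt
aws s3 cp /tmp/probe.txt "s3://${BUCKET}/multi-dc-probe.txt" --profile "$AWS_PROFILE"
aws s3 cp "s3://${BUCKET}/multi-dc-probe.txt" - --profile "$AWS_PROFILE"
```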
Setup options
Choose Option A (one-shot) or Option B (step-by-step). You’ll reach the same end state.
Option A — Quick start (master script)
The master script runs object storage sync, SPIRE federation, Beacon wiring, and optional telemetry in one pass.
Retrieve the master-install.sh script, then run it:

```shell
cd scripts/multi-dc
./master-install.sh
```
Useful flags:
```shell
./master-install.sh --dry-run \
  --skip-object-store --skip-federation --skip-beacon --skip-telemetry
```
Verify that SPIRE federation is listed:

```shell
kubectl -n spire-system exec svc/spire-server -c spire-server -- \
  /opt/spire/bin/spire-server federation list
```

Verify that the Secondary location is registered on the Primary:

```shell
kubectl get location
```
Option B — Manual setup (advanced/customizable)
Retrieve the necessary scripts
To set up multi-DC manually, retrieve the following scripts and run them in the order listed:
1. update_objectstore_secrets.sh
2. apply-federated-domain.sh
3. configure-beacon-primary.sh
4. configure-beacon-secondary.sh
5. install.sh (if setting up telemetry federation)
The steps below walk through each in turn.
Object storage sync:
```shell
./update_objectstore_secrets.sh

# (EKS only, if IRSA)
./eks-object-storage-on-multi-dc.sh -p $PRIMARY_EKS -s $SECONDARY_EKS -a $AWS_PROFILE
```
SPIRE federation:
```shell
./apply-federated-domain.sh $KUBE_CONFIG_PRIMARY_CONTEXT $KUBE_CONFIG_SECONDARY_CONTEXT
```
Validate:
```shell
kubectl -n spire-system exec svc/spire-server -c spire-server -- \
  /opt/spire/bin/spire-server federation list
```
Beacon wiring:
```shell
./configure-beacon-primary.sh $TRUST_DOMAIN_SECONDARY
./configure-beacon-secondary.sh $BEACON_SERVER_ENDPOINT_PRIMARY $TRUST_DOMAIN_PRIMARY $SECONDARY_LOCATION_NAME
```
Validate:
```shell
kubectl get location
```
(Optional) Telemetry federation
```shell
cd thanos
./install.sh -l secondary -p $PRIMARY_PORTAL_URL -s $SECONDARY_PORTAL_URL
./install.sh -l primary   -p $PRIMARY_PORTAL_URL -s $SECONDARY_PORTAL_URL
cd ../fluent-bit
./install.sh -l primary   -p $PRIMARY_PORTAL_URL -s $SECONDARY_PORTAL_URL
./install.sh -l secondary -p $PRIMARY_PORTAL_URL -s $SECONDARY_PORTAL_URL
```
SPIRE federation details (what/why/how)
SPIRE federation lets each SPIRE server trust the peer trust domain and continuously refresh its bundle (requires 8444/TCP). It can be configured via CRDs or spire-server federation CLI.
Typical flow (CRD-based):
1. Generate a ClusterFederatedTrustDomain manifest from each cluster (helper script).
2. Cross-apply them (A → B, B → A); a hand-written sketch follows this list.
3. Validate with spire-server federation list.
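The helper script generates these manifests for you; for reference, a hand-written sketch following the upstream spire-controller-manager API might look like the following. The bundle endpoint host is an assumption (this guide only guarantees 8444/TCP between clusters), so verify values against your generated manifests:

```shell
# Sketch: make the Primary trust the Secondary's trust domain.
# Assumption: the Secondary's SPIRE bundle endpoint is served on the
# Secondary portal host at :8444 — confirm against the generated manifests.
cat <<EOF | kubectl --context "$KUBE_CONFIG_PRIMARY_CONTEXT" apply -f -
apiVersion: spire.spiffe.io/v1alpha1
kind: ClusterFederatedTrustDomain
metadata:
  name: federation-secondary          # hypothetical name
spec:
  trustDomain: ${TRUST_DOMAIN_SECONDARY}
  bundleEndpointURL: https://${SECONDARY_PORTAL_URL}:8444
  bundleEndpointProfile:
    type: https_spiffe
    endpointSPIFFEID: spiffe://${TRUST_DOMAIN_SECONDARY}/spire/server
EOF
# Repeat on the Secondary with the Primary's values (B → A).
```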
Post-federation: Any workload identity that needs to cross sites must have a ClusterSPIFFEID with federatesWith: "<peer-trust-domain>"; a minimal sketch follows.
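For illustration, such a ClusterSPIFFEID might look like this (spire-controller-manager API; the name, selector, and template are hypothetical — HM manages its own identities, and this is shown only to make the federatesWith field concrete):

```shell
cat <<EOF | kubectl apply -f -
apiVersion: spire.spiffe.io/v1alpha1
kind: ClusterSPIFFEID
metadata:
  name: cross-site-example            # hypothetical name
spec:
  spiffeIDTemplate: "spiffe://{{ .TrustDomain }}/ns/{{ .PodMeta.Namespace }}/sa/{{ .PodSpec.ServiceAccountName }}"
  podSelector:
    matchLabels:
      app: cross-site-example         # hypothetical selector
  federatesWith:
    - "<peer-trust-domain>"
EOF
```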
Beacon configuration (Primary & Secondary)
Beacon enables the Primary HM to register the Secondary as a managed “location” and provision there.
Primary: allow Secondary trust domain
Export the current values and edit:

```shell
kubectl get configmap -n edbpgai-bootstrap -l app=edbpgai-bootstrap -o yaml \
  | yq '.items[0].data["values.yaml"]' > /tmp/primary-boot-values.yaml
```
Edit /tmp/primary-boot-values.yaml:

```yaml
beaconServer:
  additionalTrustDomains:
    - "<secondary-location-trust-domain>"
```
Retrieve the Agent (Beacon) install script, install-dev.sh, then reinstall Beacon:

```shell
./install-dev.sh -f <provider> -a install -c upm-beacon -v <beacon-version> -p /tmp/primary-boot-values.yaml
```
Secondary: point agent to Primary Beacon
From the Secondary:

```shell
kubectl get configmap -n edbpgai-bootstrap -l app=edbpgai-bootstrap -o yaml \
  | yq '.items[0].data["values.yaml"]' > /tmp/secondary-boot-values.yaml
```
Edit /tmp/secondary-boot-values.yaml:

```yaml
parameters:
  upm-beacon:
    beacon_location_id: "secondary"   # unique label
    beaconAgent:
      beaconServerAddress: "<primary-portal-fqdn>:9445"
      beaconServerTrustDomain: "<primary-trust-domain>"
      plaintext: false
      tlsInsecure: false
      inCluster: false
```
Retrieve the Agent (Beacon) install script, install-dev.sh, then reinstall Beacon:

```shell
./install-dev.sh -f <provider> -a install -c upm-beacon -v <beacon-version> -p /tmp/secondary-boot-values.yaml
```
Federate Agent (Beacon) SPIFFE IDs (required)
Retrieve the federate-beacon-spiffe-ids.sh script, then run it from the Secondary (include the Primary trust domain):

```shell
./federate-beacon-spiffe-ids.sh "$TRUST_DOMAIN_PRIMARY"
kubectl rollout restart -n upm-beacon deploy/upm-beacon-server
kubectl rollout restart -n upm-beacon deploy/upm-beacon-agent-k8s
```
Run from the Primary (include the Secondary trust domain):

```shell
./federate-beacon-spiffe-ids.sh "$TRUST_DOMAIN_SECONDARY"
kubectl rollout restart -n upm-beacon deploy/upm-beacon-server
kubectl rollout restart -n upm-beacon deploy/upm-beacon-agent-k8s
```
Validate registration (Primary):

```shell
kubectl get location
# Expect: managed-<SECONDARY_LOCATION_NAME> with recent LASTHEARTBEAT
```
Telemetry federation (optional)
If you need cross-site metrics/logs:
Thanos (metrics)
Retrieve the Thanos install script, install.sh.
Secondary:

```shell
./install.sh -l secondary -p $PRIMARY_PORTAL_URL -s $SECONDARY_PORTAL_URL
```

Primary:

```shell
./install.sh -l primary -p $PRIMARY_PORTAL_URL -s $SECONDARY_PORTAL_URL
```
Validate: port-forward thanos-query and hit /api/v1/stores for entries containing thanos-query-federated.
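A minimal sketch of that check (the namespace is an assumption, and 10902 is the Thanos default HTTP port; your Service may expose a different one):

```shell
# Forward the Thanos query HTTP port locally.
kubectl -n <thanos-namespace> port-forward svc/thanos-query 10902:10902 &
sleep 2

# Look for the federated peer among the registered stores.
curl -s http://localhost:10902/api/v1/stores \
  | jq -r '.data | to_entries[].value[].name' \
  | grep thanos-query-federated
```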
Fluent Bit/Loki (logs)
Retrieve the Loki install script, install.sh.
Primary:

```shell
./install.sh -l primary -p $PRIMARY_PORTAL_URL -s $SECONDARY_PORTAL_URL
```

Secondary:

```shell
./install.sh -l secondary -p $PRIMARY_PORTAL_URL -s $SECONDARY_PORTAL_URL
```
Validate: port-forward loki-read and query for {app="fluent-forward"}.
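A sketch of that query using Loki’s HTTP API (the namespace and port 3100 are assumptions; adjust to your deployment):

```shell
kubectl -n <loki-namespace> port-forward svc/loki-read 3100:3100 &
sleep 2

# Count recent log streams forwarded from the peer site; expect > 0.
curl -s -G http://localhost:3100/loki/api/v1/query_range \
  --data-urlencode 'query={app="fluent-forward"}' \
  --data-urlencode 'limit=10' \
  | jq '.data.result | length'
```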
Note
Use distinct prefixes for metrics and logs in values:
- global.metrics_storage_prefix
- upm-loki.logs_storage_prefix
Database topology and cross-DC provisioning
Goal: Wire Beacon so the Primary HM can treat the Secondary as a managed location and you can provision PG primary/replicas across DCs.
Prereqs recap
Two HM Kubernetes clusters (same provider/on-prem family), with SPIRE federation already configured.
Shared object store (edb-object-storage secret) present and identical in both clusters.
Network open: 8444/TCP (SPIRE bundle endpoint), 9445/TCP (Beacon gRPC), plus your Postgres replication ports.
Tools installed: jq, yq.
Discover important values (run against each cluster where noted):
Retrieve the Primary portal host used by Beacon (append :9445):

```shell
kubectl get gw beacon-server -n upm-beacon -o json \
  | jq -r '.spec.servers[1].hosts[0]'
```
Retrieve the trust domain (run in each cluster):

```shell
kubectl get cm spire-server -n spire-system -o jsonpath="{['data']['server\.conf']}" \
  | jq -r '.server.trust_domain'
```
Export environment variables you’ll reuse:

```shell
export KUBE_CONFIG_PRIMARY_CONTEXT="<primary-kube-ctx-or-arn>"
export KUBE_CONFIG_SECONDARY_CONTEXT="<secondary-kube-ctx-or-arn>"
export TRUST_DOMAIN_PRIMARY="<primary-trust-domain>"
export TRUST_DOMAIN_SECONDARY="<secondary-trust-domain>"
export BEACON_SERVER_ENDPOINT_PRIMARY="<primary-portal-fqdn>:9445"
export SECONDARY_LOCATION_NAME="secondary"   # any unique label
```
Multi-DC Beacon Helm install (Primary)
Extract the current values.yaml for the Primary:

```shell
kubectl get configmap -n edbpgai-bootstrap -l app=edbpgai-bootstrap -o yaml \
  | yq eval '.items.0.data["values.yaml"]' > /tmp/primary-boot-values.yaml
```
Edit /tmp/primary-boot-values.yaml to allow the Secondary trust domain:

```yaml
beaconServer:
  # ...
  additionalTrustDomains:
    - "<secondary-location-trust-domain>"
```
Install/reinstall Beacon on the Primary (example):

```shell
./install-dev.sh -f <provider> -a install -c upm-beacon -v <upm-beacon-version> -p /tmp/primary-boot-values.yaml
```
Multi-DC Beacon Helm install (Secondary)
Extract the current values.yaml for the Secondary:

```shell
kubectl get configmap -n edbpgai-bootstrap -l app=edbpgai-bootstrap -o yaml \
  | yq eval '.items.0.data["values.yaml"]' > /tmp/secondary-boot-values.yaml
```
Edit /tmp/secondary-boot-values.yaml to register this cluster as a managed location and to point the agent back to the Primary:

```yaml
parameters:
  upm-beacon:
    # name can be anything, just not the same as Primary
    beacon_location_id: "secondary"
    beaconAgent:
      beaconServerAddress: "<primary-portal-fqdn>:9445"
      beaconServerTrustDomain: "<primary-trust-domain>"
      plaintext: false
      tlsInsecure: false
      inCluster: false
```
Install/reinstall Beacon on the Secondary:

```shell
./install-dev.sh -f <provider> -a install -c upm-beacon -v <upm-beacon-version> -p /tmp/secondary-boot-values.yaml
```
Beacon ClusterSPIFFEID federation (both directions)
Each Beacon SPIFFE ID that crosses DCs must include the peer trust domain in federatesWith.
Use your helper script on both clusters.
From the Secondary (add the Primary trust domain):

```shell
./federate-beacon-spiffe-ids.sh "$TRUST_DOMAIN_PRIMARY"
kubectl rollout restart -n upm-beacon deploy/upm-beacon-server
kubectl rollout restart -n upm-beacon deploy/upm-beacon-agent-k8s
```
From the Primary (add the Secondary trust domain):

```shell
./federate-beacon-spiffe-ids.sh "$TRUST_DOMAIN_SECONDARY"
kubectl rollout restart -n upm-beacon deploy/upm-beacon-server
kubectl rollout restart -n upm-beacon deploy/upm-beacon-agent-k8s
```

The script finds Beacon ClusterSPIFFEID objects, creates a federatesWith list if missing, and appends the peer trust domain if not present.
Validate wiring
On the Primary, list the managed Secondary location:

```shell
kubectl get location
```

Validate that SPIRE federation is present on each cluster:

```shell
kubectl -n spire-system exec svc/spire-server -c spire-server -- \
  /opt/spire/bin/spire-server federation list
```
Expected:
- kubectl get location shows managed-<SECONDARY_LOCATION_NAME> with a recent LASTHEARTBEAT.
- federation list shows one relationship (the peer trust domain) with bundle endpoint profile https_spiffe and the peer’s :8444 URL.
Create the cross-DC Postgres topology
At this point HM can provision into the Secondary location. You still choose and create the actual DB topology.
Typical flow
1. From the Primary HM, create the Postgres primary in the Primary DC.
2. From the Primary HM, create replica cluster(s) in the Secondary DC (select the managed Secondary location).
3. Confirm the replication mode (sync/async) and monitor that replication lag meets your SLOs (see the lag-check sketch after this list).
4. Ensure backups are writing to the shared object store from both DCs, and test a restore.
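A minimal lag check, run against the Postgres primary (connection details are placeholders; the query itself is standard pg_stat_replication):

```shell
# Replication lag per standby, in bytes of WAL not yet replayed.
psql -h <pg-primary-host> -U <admin-user> -d postgres -c "
  SELECT application_name,
         state,
         sync_state,
         pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag_bytes
  FROM pg_stat_replication;"
```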
Operational notes
- DB TLS is separate from SPIRE/Beacon (platform identity). Configure PG TLS per your policy.
- Verify StorageClasses in each DC meet PG IOPS/latency.
- Open replication ports between sites.
Validation (end-to-end)
Validate federation relationships
```shell
kubectl -n spire-system exec svc/spire-server -c spire-server -- \
  /opt/spire/bin/spire-server federation list
```
Validate that the Secondary location is registered (Primary)
```shell
kubectl get location
```
Validate provisioning to the Secondary:
- From the Primary HM, deploy a small test workload to the Secondary location.
- Telemetry (optional): Thanos stores show the federated peer; Loki queries return logs tagged from the Secondary.
- Object storage: both clusters can read/write the bucket; secrets identical.
Manual failover runbook
Manual failover procedure from Primary to Secondary:
1. Quiesce writes to the Primary (maintenance mode / LB cutover).
2. Promote replicas in the Secondary to primary (per your HM workflow / scripts).
3. Redirect clients (DNS/LB) to the Secondary.
4. Observe: confirm writes succeed and the replication role is updated (see the verification sketch after this list).
5. When the original Primary returns: re-seed it as a replica of the new primary; optionally plan a later cutback.
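For step 4, a minimal verification sketch (host names are placeholders; the checks are standard Postgres):

```shell
# The promoted node should no longer be in recovery...
psql -h <new-primary-host> -U <admin-user> -d postgres \
  -c "SELECT pg_is_in_recovery();"          # expect: f

# ...and should accept writes.
psql -h <new-primary-host> -U <admin-user> -d postgres \
  -c "CREATE TABLE IF NOT EXISTS failover_probe(ts timestamptz);
      INSERT INTO failover_probe VALUES (now());
      SELECT count(*) FROM failover_probe;"
```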
Operator tips
- Keep DNS TTL low enough for cutovers.
- Track downtime to measure RTO.
- Validate backups post-promotion.
Troubleshooting
Problem: No federation relationships
- Re-generate and cross-apply the ClusterFederatedTrustDomain CRs.
- Confirm 8444/TCP reachability.
Problem: Secondary not listed in kubectl get location
- Recheck the Beacon values on both sides; restart the Beacon server/agent.
- Confirm 9445/TCP reachability to the Primary portal; check that the trust domains are correct.
Problem: Object store access fails on Secondary
- Re-sync edb-object-storage.
- For EKS/IRSA: ensure the Secondary OIDC provider is in the role’s trust policy.
Problem: Telemetry federation missing
- Reinstall with the correct -l primary|secondary flags and unique prefixes.
- Check the Thanos /api/v1/stores endpoint and the Loki read API.
Problem: Replica lag / connectivity
- Verify network ACLs/SGs, TLS certs, and storage performance.
Appendix A — SPIRE federation via CLI (optional)
You can manage federation with spire-server federation (create, list, update, delete, show, refresh) instead of CRDs. Use this if you prefer direct server control or for debugging.
Appendix B — Quick daily checks
- kubectl get location on the Primary shows the Secondary Ready.
- Thanos/Loki federation healthy (if enabled).
- Object store writes succeed from both DCs.
- Replication lag within SLOs.
A scripted version of the first three checks appears below.
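A sketch wiring those checks into one script. It assumes the environment variables from the discovery section are still exported; the lag check needs DB credentials, so it is left to the pg_stat_replication query shown earlier:

```shell
#!/usr/bin/env bash
# Quick daily multi-DC health check (sketch).
set -u

echo "== Managed locations (Primary) =="
kubectl --context "$KUBE_CONFIG_PRIMARY_CONTEXT" get location

echo "== SPIRE federation (both clusters) =="
for ctx in "$KUBE_CONFIG_PRIMARY_CONTEXT" "$KUBE_CONFIG_SECONDARY_CONTEXT"; do
  echo "-- $ctx"
  kubectl --context "$ctx" -n spire-system exec svc/spire-server -c spire-server -- \
    /opt/spire/bin/spire-server federation list
done

echo "== edb-object-storage secrets match =="
diff \
  <(kubectl --context "$KUBE_CONFIG_PRIMARY_CONTEXT"   -n default get secret edb-object-storage -o json | jq -S '.data') \
  <(kubectl --context "$KUBE_CONFIG_SECONDARY_CONTEXT" -n default get secret edb-object-storage -o json | jq -S '.data') \
  && echo "OK: identical"
```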