Hybrid Manager HA/DR recovery

A disaster can render Hybrid Manager (HM) unusable, for example, when the CSP region hosting an EKS-based appliance becomes unavailable or a datacenter outage takes a hardware appliance offline. A disaster recovery (DR) option allows you to restore your databases to a point in time from your available backups.

HM backups are handled with Velero.

There are two possible scenarios for recovering HM:

  • Restore HM to original location: You have two data centers (DC1, DC2), and HM runs in DC1. You need to restore HM from object storage to DC1.

  • Restore HM to alternative location: You have two data centers (DC1, DC2), and HM runs in DC1. You need to restore HM from object storage to DC2.

DR scope

The DR procedures address the following:

  • The Postgres clusters that you created in the appliance.
  • The custom managed storage locations defined internally in the appliance's associated S3-compatible storage area.
Note

The DR procedures don't cover the migration components, although you can use the same procedures to restore the original appliance's transporter-db and migration-db databases.

RTO and RPO

The ability to do any restore, the associated recovery time objective (RTO), and recovery point objective (RPO) depend on the frequency and size of the backups.

Because those factors vary significantly with the criticality assigned to the environment and the nature of your data, RTO and RPO values can't be stated in advance. We recommend preparing the environment properly and performing periodic disaster recovery exercises to ensure your RTO and RPO requirements can be met.

Backup readiness

Each appliance has linked S3-compatible storage that stores:

  • Internal backups (HM appliance data)
  • Postgres backups (Postgres database backups)

You can also define custom storage locations in the same bucket for use in the platform.

All of this data needs to be available after a disaster. Depending on the criticality of the data and the level of disaster that you want to be able to recover from, you’ll need to replicate this data outside of the CSP region or physical datacenter where the appliance resides.

Tip

When using an AWS S3 bucket, you can replicate the data with S3 cross-region replication.
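As a sketch, assuming the AWS CLI, placeholder bucket names, and a placeholder IAM role, cross-region replication can be configured like this (both buckets must have versioning enabled first):

# Illustrative only: bucket names, role ARN, and rule ID are placeholders.
aws s3api put-bucket-versioning --bucket my-hm-bucket \
  --versioning-configuration Status=Enabled

cat > replication.json <<'EOF'
{
  "Role": "arn:aws:iam::123456789012:role/s3-replication-role",
  "Rules": [
    {
      "ID": "hm-dr-replication",
      "Status": "Enabled",
      "Priority": 1,
      "Filter": {},
      "DeleteMarkerReplication": { "Status": "Disabled" },
      "Destination": { "Bucket": "arn:aws:s3:::my-hm-bucket-replica" }
    }
  ]
}
EOF

aws s3api put-bucket-replication --bucket my-hm-bucket \
  --replication-configuration file://replication.json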

Postgres databases use continuous backup by default, so you can restore them to any point in time, limited only by backup lifecycle policies.

Critical appliance data, such as the definition of the Postgres clusters, is stored as Kubernetes objects and included in the Velero backup. By default, this backup happens daily at 23:00, as defined by the default schedule velero-backup-kube-state.

If your RPO requires more frequent backups, you can define a new backup schedule.

Danger

Do not modify the default schedule, as it may be overwritten by an appliance software update.

The following example shows a custom schedule to back up the needed resources each hour:

apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: custom-velero-backup-kube-state
  namespace: velero
spec:
  schedule: "0 * * * *"
  skipImmediately: false
  template:
    includedNamespaces:
    - '*'
    includedResources:
    - storagelocations.biganimal.enterprisedb.com
    - clusterwrappers.beacon.enterprisedb.com
    - backupwrappers.beacon.enterprisedb.com
    snapshotVolumes: false
    ttl: 168h
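After applying the schedule manifest (saved here under a placeholder file name), you can confirm it exists and, if needed, trigger an immediate backup from it:

kubectl apply -f custom-velero-backup-kube-state.yaml
velero get schedules
# Optional: create a one-off backup from the new schedule without waiting for the next run
velero backup create --from-schedule custom-velero-backup-kube-state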

DR procedure

The DR procedure is the series of manual steps you take from deploying a new appliance to the point where you can restore your Postgres clusters using the normal restore procedure.

Warning

The procedure is based on the 1.0 release of the appliance and is subject to change as the feature set evolves. Test and update it regularly for it to remain valid.

1. Confirm availability of backups

The first step is to ensure that the backups of the unavailable appliance (the “old backups”) are reachable from the new appliance.

You can achieve this in multiple ways:

  • Using a replicated bucket as the S3-compatible linked bucket for the new appliance, so the old backups are directly available to the new appliance.
  • Copying the backups of the damaged appliance to the linked storage of the new appliance (see the example after the note below). You must copy the following items:
    • The internal EDB backups folder, with the format edb-internal-backups/<random-string>
    • The Postgres clusters backups folder, customer-pg-backups
    • Any folder corresponding to a defined custom storage location
Note

The internal backups folder defined for the new appliance will be different from the old one, as it will have a different <random-string>.
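If you copy the data rather than relying on replication, a minimal sketch with the AWS CLI might look like this (bucket names are placeholders, and <old-random-string> is the value from the damaged appliance):

# Copy the internal EDB backups folder, preserving its prefix
aws s3 sync s3://<old-appliance-bucket>/edb-internal-backups/<old-random-string> \
  s3://<new-appliance-bucket>/edb-internal-backups/<old-random-string>
# Copy the Postgres clusters backups folder
aws s3 sync s3://<old-appliance-bucket>/customer-pg-backups \
  s3://<new-appliance-bucket>/customer-pg-backups
# Repeat for any folders corresponding to custom storage locations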

2. Preparation steps

Define a recovery backup storage location for Velero

Once you have backups available, you can define a new storage location for Velero so you can restore resources from the damaged appliance's backups. Define it as a read-only location to prevent overwriting or removing those backups.

To define a new storage location, use the following Kubernetes manifest:

apiVersion: velero.io/v1
kind: BackupStorageLocation
metadata:
  annotations:
    appliance.enterprisedb.com/s3-prefixes: edb-internal-backups/<old-backups-random-string>/velero
  labels:
    appliance.enterprisedb.com/s3-credentials: bound
  name: recovery
  namespace: velero
spec:
  accessMode: ReadOnly
  config:
    insecureSkipTLSVerify: "false"
    region: <region-of-attached-bucket>
    s3ForcePathStyle: "true"
  default: false
  objectStorage:
    bucket: <linked-bucket-name>
    prefix: edb-internal-backups/<old-backups-random-string>/velero
  provider: aws

Confirm the location using the velero get backup-locations command; it must show as Available. If it doesn't, check the Velero pod logs for permission errors on the S3 bucket.
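For example, assuming the manifest is saved as recovery-bsl.yaml (a placeholder name), applying and confirming it looks like this; the output below is illustrative and abbreviated:

kubectl apply -f recovery-bsl.yaml
velero get backup-locations

NAME       PROVIDER   BUCKET/PREFIX                                                                  PHASE       ACCESS MODE   DEFAULT
default    aws        <linked-bucket-name>/edb-internal-backups/<new-backups-random-string>/velero   Available   ReadWrite     true
recovery   aws        <linked-bucket-name>/edb-internal-backups/<old-backups-random-string>/velero   Available   ReadOnly      false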

Choosing a Velero backup for recovery

Once the old internal Velero backups are available in the recovery storage location, you can list them with the following command:

velero get backups --selector velero.io/storage-location=recovery

Typically, you choose the latest available completed backup to recover from. Note the Velero backup name, as well as the date and time (UTC), as both are required for a restore.

Example:

NAME                                      STATUS      ERRORS   WARNINGS   CREATED                         EXPIRES   STORAGE LOCATION   SELECTOR
velero-backup-kube-state-20241216154403   Completed   0        0          2024-12-16 16:44:03 +0100 CET   5d        recovery           <none>
Note

The timestamp value is referred to as the recovery date in the instructions that follow.

Additional requirements

The following requirements apply to the recovery procedure:

  • The new appliance must be running the same version of the Postgres AI software deployment as the old one.
  • The same locations (locations.beacon.enterprisedb.com custom resource) used in the old appliance must be available in the new one. Locations is currently an internal resource created during install and isn't available in the console; managed-devspatcher is the default value. See the check after this list.
  • The container images used to build the clusters in the old appliance must be available to the new one.
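As a quick check, you can list the locations custom resource on both Kubernetes clusters and compare the results:

# Run against both the old (if reachable) and new appliance contexts
kubectl get locations.beacon.enterprisedb.com -A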

3. Recovery steps

Restore EDB internal databases (app-db and beacon-db)

Once the old backups are available, you can restore the EDB internal databases. For each internal database:

  1. Save the cluster manifest to a YAML file: kubectl get cluster <cluster-name> -o yaml > <cluster-name>.yaml.
  2. Edit the cluster spec in the YAML file so the cluster is created from the backups:
  • Replace the init section in bootstrap with a recovery section:
    recovery:
      database: <database name as in the init section>
      owner: <owner name as in the init section>
      source: <pg-cluster-name>
      secret:
        name: <secret name as in the init section>
      recoveryTarget:
        targetTime: "<recovery date in YYYY-MM-DD HH:MM:SS+00 format>"
  • Add the following section:
    externalClusters:
    - barmanObjectStore:
        destinationPath: s3://<linked-bucket-name>/edb-internal-backups/<old-backups-random-string>/databases
        s3Credentials:
          inheritFromIAMRole: true
        wal:
          maxParallel: 8
      name: <pg-cluster-name>
  • Add the following prefix to the appliance.enterprisedb.com/s3-prefixes annotation of the inheritedMetadata section (the list is comma-separated):
     edb-internal-backups/<old-backups-random-string>/databases/<db-name>
  3. Delete the cluster:

    kubectl delete cluster <cluster-name>
  4. Clean the backup area for the cluster:

    aws s3 rm s3://<linked-bucket-name>/edb-internal-backups/<new-backups-random-string>/databases/<pg-cluster-name> --recursive
  5. Apply the YAML file for the cluster to be re-created: kubectl apply -f <cluster-name>.yaml

  6. After the cluster is successfully restored and in a healthy state, restart the accm-server in the namespace upm-beaco-ff-base.
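For illustration, the following hypothetical spec fragment shows the result of the step 2 edits for an internal cluster named p-appdb. The cluster name, database, owner, and secret name are placeholders; the target time is carried over from the earlier example. Your actual values come from the original init section:

spec:
  bootstrap:
    recovery:
      database: app
      owner: app
      source: p-appdb
      secret:
        name: p-appdb-app
      recoveryTarget:
        targetTime: "2024-12-16 15:44:03+00"
  externalClusters:
  - barmanObjectStore:
      destinationPath: s3://<linked-bucket-name>/edb-internal-backups/<old-backups-random-string>/databases
      s3Credentials:
        inheritFromIAMRole: true
      wal:
        maxParallel: 8
    name: p-appdb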

At this point, the portal on the new appliance is available again.

Configure the Velero plugin

The plugin helps restore the Kubernetes resources in the correct state: only the custom managed storage locations are restored as-is, while the Postgres cluster resources are restored in a deleted state so that you can later restore their data as desired.

The plugin is configured through a ConfigMap, so you must apply this manifest:

apiVersion: v1
kind: ConfigMap
metadata:
  name: velero-plugin-for-edbpgai
  namespace: velero
  labels:
    velero.io/plugin-config: ""
    enterprisedb.io/edbpgai-plugin: RestoreItemAction
data:
  # configure disaster recovery mode, so restored items are transformed as needed
  drMode: "true"
  # configure a date corresponding to the Velero backup date. Note the format!
  drDate: "<recovery date in YYYY-MM-DDTHH:MM:SSZ format>"
  # old and new buckets for internal custom storage locations
  oldBucket: <old-appliance-bucket-name>
  newBucket: <new-appliance-bucket-name>
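For example, assuming the manifest is saved as velero-plugin-for-edbpgai.yaml (a placeholder name), apply and verify it with:

kubectl apply -f velero-plugin-for-edbpgai.yaml
kubectl get configmap velero-plugin-for-edbpgai -n velero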

Restore the custom managed storage locations

Configure and apply the following Velero restore resource manifest:

apiVersion: velero.io/v1
kind: Restore
metadata:
  name: restore-1-storagelocations
  namespace: velero
spec:
  # Change the backup name to a custom backup name as required
  backupName: <velero-backup-name>
  includedResources:
  - storagelocations.biganimal.enterprisedb.com
  includeClusterResources: true
  labelSelector:
    matchLabels:
      biganimal.enterprisedb.io/reserved-by-biganimal: "false"
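After applying this manifest (and each of the restore manifests that follow), you can monitor progress with the Velero CLI. For example, assuming the manifest is saved as restore-1-storagelocations.yaml (a placeholder name):

kubectl apply -f restore-1-storagelocations.yaml
velero get restores
velero restore describe restore-1-storagelocations --details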

Restore the cluster wrappers

Configure and apply the following Velero restore resource manifest:

apiVersion: velero.io/v1
kind: Restore
metadata:
  name: restore-2-clusterwrappers
  namespace: velero
spec:
  # Change the backup name to a custom backup name as required
  backupName: <velero-backup-name>
  includedResources:
  - clusterwrappers.beacon.enterprisedb.com
  restoreStatus:
    includedResources:
    - clusterwrappers.beacon.enterprisedb.com

Restore the backup wrappers

Configure and apply the following Velero restore resource manifest:

apiVersion: velero.io/v1
kind: Restore
metadata:
  name: restore-3-backupwrappers
  namespace: velero
spec:
  # Change the backup name to a custom backup name as required
  backupName: <velero-backup-name>
  includedResources:
  - backupwrappers.beacon.enterprisedb.com
  restoreStatus:
    includedResources:
    - backupwrappers.beacon.enterprisedb.com
