Known issues v1.3

These are the currently known issues and limitations identified in the Hybrid Manager 1.3 release. Where applicable, we have included workarounds to help you mitigate the impact of these issues. These issues are actively tracked and are planned for resolution in a future release.

Multi-DC

Multi-DC configuration loss after upgrade

Description: Configurations applied by the multi-DC setup scripts (for cross-cluster communication) do not persist after a Hybrid Manager platform upgrade or operator reconciliation.

Workaround: After every Hybrid Manager upgrade or component reversion, the multi-DC setup scripts must be run again to re-apply the necessary configurations.

Location dropdown list is empty for multi-DC setups

Description: In multi-data center environments, the API call that retrieves available locations fails with a gRPC message size error (429, 4.3 MB limit exceeded). The failure is caused by the large amount of image set information included in the API response and results in an empty location list in the console.

Workaround: This advanced workaround requires cluster administrator privileges to limit the amount of image set information being returned by the API. It involves modifying the image discovery tag rules in the upm-image-library and upm-beacon ConfigMaps, followed by restarting the related pods.

Workaround details

The workaround modifies the regular expressions (tag rules) used by the image library and beacon components to temporarily limit the number of image tags being indexed. This reduces the API response size, allowing the locations to load.

  1. Find the upm-image-library ConfigMap:

    kubectl get configmaps -n upm-image-library | grep upm-image-library
    # Example Output: upm-image-library-ttkt29fmf7 1 5d3h
  2. Edit the ConfigMap found in the previous step and modify the tags rule under each image discovery rule (edb-postgres-advanced, edb-postgres-extended, postgresql). Replace the existing regex with the limiting regex:

    # Snippet of the YAML you will edit in the ConfigMap
    "imageDiscovery": {
      "rules": {
        "(^|.*/)edb-postgres-advanced$": {
          "readme": "EDB postgres advanced server",
          "tags": [
            "^(?P<major>\\d+)\\.(?P<minor>\\d+)-2509(?P<day>\\d{2})(?P<hour>\\d{2})(?P<minute>\\d{2})    (?:-(?P<pgdFlavor>pgdx|pgds))?(?:-(?P<suffix>full))?$"
          ]
        },
        # ... repeat for edb-postgres-extended and postgresql ...
      }
    }
    Note

    This step must be performed on the primary Hybrid Manager cluster if you are running a multi-DC setup.

  3. Restart the Image Library Pod:

    kubectl rollout restart deployment upm-image-library -n upm-image-library
  4. Get the upm-beacon ConfigMap to modify the Agent configuration:

    kubectl get configmaps -n upm-beacon beacon-agent-k8s-config
  5. Edit the ConfigMap (beacon-agent-k8s-config) and modify the tag_regex rule under each postgres_repositories entry (edb-postgres-advanced, edb-postgres-extended, postgresql).

    # Snippet of the YAML you will edit in the ConfigMap
    postgres_repositories:
      - name: edb-postgres-advanced
        description: EDB postgres advanced server
        tag_regex: "^(?P<major>\\d+)\\.(?P<minor>\\d+)-2509(?P<day>\\d{2})(?P<hour>\\d{2})(?    P<minute>\\d{2})(?:-(?P<pgdFlavor>pgdx|pgds))?(?:-(?P<suffix>full))?$"
      # ... repeat for edb-postgres-extended and postgresql ...
  6. Restart the Agent pod:

    kubectl rollout restart deployment -n upm-beacon upm-beacon-agent-k8s

After completing these steps, the reduced image data size should allow the location API call to succeed, and the locations should appear correctly in the Hybrid Manager console.


Core platform and resources

upm-beacon-agent memory limits are insufficient in complex environments

Description: In environments with many databases and backups, the default 1GB memory allocation for the upm-beacon-agent pod is insufficient, which can lead to frequent OOMKill or crashloop issues. This resource limit is currently not configurable via the standard Helm values or HybridControlPlane CR.

Workaround: The user must manually patch the Kubernetes deployment to increase the memory resource limits for the upm-beacon-agent pod.
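
As a sketch, assuming the agent runs as the upm-beacon-agent-k8s deployment in the upm-beacon namespace (the names used elsewhere in this document) and the container is named beacon-agent (an assumption; verify it first), a strategic merge patch could look like this, with 2Gi as an illustrative limit:

    # Verify the container name before patching (illustrative commands):
    kubectl get deployment upm-beacon-agent-k8s -n upm-beacon \
      -o jsonpath='{.spec.template.spec.containers[*].name}'

    # Raise the memory limit; adjust the container name and value as needed.
    kubectl patch deployment upm-beacon-agent-k8s -n upm-beacon --patch \
      '{"spec":{"template":{"spec":{"containers":[{"name":"beacon-agent","resources":{"limits":{"memory":"2Gi"}}}]}}}}'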

Database cluster engine

Incorrect database name displayed for EDB Postgres Distributed (PGD) clusters

Description: The Hybrid Manager console's Connect tab and connection string incorrectly show the default database name for PGD clusters as edb_admin. PGD clusters require connection to the bdrdb database.

Workaround: For PGD cluster connection information, use one of the following reliable sources: the .PGPASS BLOB, the .PG_SERVICE.CONF file, or the full connection string from the cluster details page.
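
For example, if you connect manually with psql, make sure the database name is bdrdb rather than edb_admin; the host, port, and user placeholders below come from the cluster details page:

    psql "host=<cluster-host> port=<port> user=<username> dbname=bdrdb"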

PGD-X cluster creation stuck in the "PGD - Reconcile application user" phase

Description: PGD-X cluster creation, particularly when a witness-only region is involved, may stall because:

  • global RAFT leadership is unexpectedly held by the witness-only node, or
  • a subgroup's enable_routing option is disabled.

Workaround for RAFT leadership issue: Manually transfer global RAFT leadership to a node within a data group. Connect to the PGD cluster's bdrdb database and execute:

SELECT bdr.raft_leadership_transfer(node_name := '<target node>', wait_for_completion := true, node_group_name := 'world');

Workaround for enable_routing issue: Manually enable routing for the subgroup. Connect to bdrdb and execute:

SELECT bdr.alter_node_group_option('<subgroup name>','enable_routing','true');

Failure to create 3-node PGD cluster when max_connections is non-default

Description: Creating a 3-data node EDB Postgres Distributed (PGD) cluster fails if the configuration parameter max_connections is set to a non-default value during initial cluster provisioning.

Workaround: Create the PGD 3-data node cluster using the default max_connections value, and then update the value after the cluster has been successfully provisioned.

PGD database settings are not duplicated when creating or duplicating a second data group

Description: When creating or duplicating a second data group in a PGD cluster, Postgres settings (like max_connections, max_worker_processes, etc.) are not automatically copied from the first data group. This can lead to inconsistent settings and cluster health issues, as the replica group settings cannot be lower than the primary group.

Workaround: Manually edit the configuration for the second PGD group to ensure the database settings are identical to the first data group before provisioning.

AHA Witness node resources are over-provisioned

Description: For Advanced High Availability (AHA) clusters with witness nodes, the witness node incorrectly inherits the CPU, memory, and disk configuration of the larger data nodes, leading to unnecessary resource over-provisioning.

Workaround: Manually update the pgdgroup YAML configuration to specify and configure the minimal resources needed by the witness node.
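
A rough sketch of the kind of edit involved is shown below. The exact location of the witness resources block within the PGDGroup spec depends on the CRD version shipped with your release, so treat the fragment as illustrative only:

    # Open the PGDGroup resource for editing (names are placeholders):
    kubectl edit pgdgroup <pgd-group-name> -n <cluster-namespace>

    # Illustrative resources block for the witness node; confirm the correct
    # field path in your PGDGroup CRD before applying values like these.
    resources:
      requests:
        cpu: 250m
        memory: 512Mi
      limits:
        cpu: 500m
        memory: 1Gi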

HA clusters use verify-ca instead of verify-full for streaming replication certificate authentication

Description: Replica clusters use the less strict verify-ca setting for streaming replication authentication instead of the recommended, most secure verify-full. This is currently necessary because the underlying CloudNativePG (CNP) clusters do not support IP Subject Alternative Names (IP SANs), which are required for verify-full in certain environments (like GKE Load Balancers).

Workaround: None. A fix is dependent on the underlying CNP component supporting IP SANs.

Second node is too slow to join large HA clusters

Description: For large clusters, the pg_basebackup process used by a second node (standby) to join an HA cluster is too slow. This can cause the standby node to fail to join, which prevents scaling a single node to HA and also causes issues when restoring a cluster directly into an HA configuration.

Workaround: Rather than loading data into a single node and then scaling to HA (normally the recommended practice), load data directly into an HA cluster from the start. There is no workaround for restoring a large cluster directly into an HA configuration.

Backup and recovery

Replica cluster creation fails when using volume snapshot recovery across regions

Description: Creating a replica cluster in a second location that is in a different region fails with an InvalidSnapshot.NotFound error because volume snapshot recovery does not support cross-region restoration.

Workaround: Manually trigger a Barman backup from the primary cluster first, and then use that Barman backup (instead of the volume snapshot) to provision the cross-region replica cluster.
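
If you trigger the backup from Kubernetes rather than the console, a CloudNativePG Backup resource of roughly the following shape can be used; the names below are placeholders and the method value is an assumption to verify against your environment:

    apiVersion: postgresql.cnpg.io/v1
    kind: Backup
    metadata:
      name: pre-replica-barman-backup   # placeholder name
      namespace: <primary-cluster-namespace>
    spec:
      method: barmanObjectStore
      cluster:
        name: <primary-cluster-name>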

WAL archiving is slow due to default parallel configuration

Description: The default setting for wal.maxParallel is too restrictive, which slows down WAL archiving during heavy data loads. This can cause a backlog of ready-to-archive WAL files, potentially leading to disk full conditions. This parameter is not yet configurable via the HM console.

Workaround: Manually edit the objectstores.barmancloud.cnpg.io Kubernetes resource for the specific backup object store and increase the wal.maxParallel value (e.g., to 20) to accelerate archiving.
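
For illustration, the change looks roughly like the fragment below; the exact path to the wal section inside the ObjectStore spec can vary between releases, so confirm it in your resource before editing:

    # Open the object store resource for the affected backup location (names are placeholders):
    kubectl edit objectstores.barmancloud.cnpg.io <object-store-name> -n <cluster-namespace>

    # Then raise maxParallel, for example:
    spec:
      configuration:
        wal:
          maxParallel: 20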

AI Factory and model management

Failure to deploy nim-nvidia-nvclip model with profile cache

Description: Model creation for the nim-nvidia-nvclip model fails within AI Factory when the profile cache is used during deployment.

Workaround: An administrator must manually download the required model profile from the NVIDIA registry to a local machine, upload the profile files directly to the Hybrid Manager's object storage path, and then patch the Kubernetes InferenceService YAML with an environment variable that makes the deployment use the pre-cached files instead of attempting the failing network download.

Workaround details
  1. Log in to the NVIDIA Container Registry (nvcr.io) using your NGC API key:

    docker login nvcr.io -u '$oauthtoken' -p $NGC_API_KEY
  2. Pull the Docker image to your local machine:

    docker pull nvcr.io/nim/nvidia/nvclip:latest
  3. Prepare a local directory for the downloaded profiles:

    mkdir -p ./model-cache
    chmod -R a+w ./model-cache
  4. Select the profile for your target GPU.

    For example, A100 GPU profile: 9367a7048d21c405768203724f863e116d9aeb71d4847fca004930b9b9584bb6

  5. Run the container to download the profile. The container is run in CPU-only mode (NIM_CPU_ONLY=1) to prevent GPU-specific initialization issues on the download machine:

    export NIM_MANIFEST_PROFILE=9367a7048d21c405768203724f863e116d9aeb71d4847fca004930b9b9584bb6
    export NIM_CPU_ONLY=1
    
    docker run -v ./model-cache:/opt/nim/.cache -u $(id -u) -e NGC_API_KEY -e NIM_CPU_ONLY -e NIM_MANIFEST_PROFILE --rm nvcr.io/nim/nvidia/nvclip:latest

    This container will not exit. You must manually stop the run (Ctrl+C) after you see the line "Health method called" in the logs, which confirms the profile download is complete.

  6. Upload the profiles from your local machine to the object storage bucket used by your Hybrid Manager deployment:

    gcloud storage cp -r ./model-cache gs://uat-gke-edb-object-storage/model-cache/nim-nvidia-nvclip
    Note

    Adjust the gs:// path to match your deployment's configured object storage location.

  7. Create the nim-nvidia-nvclip model in the HM console, setting the Model Profiles Path field to the location you uploaded to in the previous step (e.g., /model-cache/nim-nvidia-nvclip). The deployment will initially fail or become stuck.

  8. Export the InferenceService YAML from the Hybrid Manager Kubernetes cluster.

  9. Add the necessary environment variable, NIM_IGNORE_MODEL_DOWNLOAD_FAIL, to the env section of the spec.predictor.model block in the exported YAML. This flag tells the NIM container to use the locally available cache (the files you uploaded) and ignore the network download failure.

    # --- Snippet of the modified InferenceService YAML ---
    spec:
      predictor:
        minReplicas: 1
        model:
          modelFormat:
            name: nim-nvidia-nvclip
          name: ""
          env:
          - name: NIM_IGNORE_MODEL_DOWNLOAD_FAIL  # <-- ADD THIS LINE
            value: "1"                         # <-- ADD THIS LINE
          resources:
            # ... resource requests/limits ...
          runtime: nim-nvidia-nvclip
          storageUri: gs://uat-gke-edb-object-storage/model-cache/nim-nvidia-nvclip
    # ---------------------------------------------------
  10. Apply the modified YAML using kubectl to force the deployment to use the pre-downloaded profiles:

    kubectl apply -f <modified-inference-service-file.yaml> -n <model-cluster-namespace>

The pods should now proceed to start successfully, using the model profiles you manually provided via object storage.


AI Model Cluster deployment stalls if object storage path for model profiles is empty

Description: Creating an AI Model cluster with an object storage path specified in the Model Profiles Path field causes the deployment to stall at the pending stage if the specified path contains no content (that is, the model profile does not yet exist).

Workaround: Ensure that the object storage path specified in the Model Profiles Path field contains a valid profile before initiating the model cluster deployment.
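
One quick check is to list the path and confirm it contains the profile files before creating the model cluster. The command below assumes Google Cloud Storage, as in the example workaround above; the bucket and path are placeholders:

    gcloud storage ls gs://<your-object-storage-bucket>/model-cache/<model-name>/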

Incorrect model name causes 404 error when calling LLM remotely

Description: When calling a deployed NVIDIA NIM model via the API endpoint, the model name displayed on the model card may not be the name required by the API. Using the displayed name results in a 404 Not Found error.

Workaround: To find the exact model name required for the API call (e.g., nvidia/llama-3.3-nemotron-super-49b-v1), query the /v1/models API endpoint first.
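
For example (the endpoint URL and authorization header are placeholders; adjust them to how your deployment exposes the model API):

    curl -s https://<model-endpoint>/v1/models \
      -H "Authorization: Bearer <api-key>"

The id values returned in the response are the model names to use in subsequent API calls.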

HM console and observability

Tags for active model clusters are not displayed on the model details screen

Description: On the Model Details screen, the table of active model clusters that use the model shows an empty Tags field.

Workaround: Model cluster tags can still be viewed correctly on the dedicated Model Cluster Details page.

Chat model cluster metrics are missing from the Grafana model overview dashboard

Description: Metrics for deployed chat model clusters are not displayed on the Grafana model overview dashboard, impacting observability for these specific AI components.

User-created Grafana dashboards do not persist after platform redeployment/upgrade

Description: Dashboards created by users directly within the Grafana application are not stored in persistent storage. They disappear when the Grafana pods are updated, redeployed, or restarted (e.g., during an EKS auto-update or a Hybrid Manager upgrade).

Workaround: Any custom dashboards must be backed up externally by exporting the dashboard as JSON. After an upgrade, they must be manually imported back into Grafana.
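
For scripted backups, the Grafana HTTP API can export and re-import dashboards; the host, token, and dashboard UID below are placeholders:

    # Export a dashboard to JSON (replace the UID and host):
    curl -s -H "Authorization: Bearer <grafana-api-token>" \
      https://<grafana-host>/api/dashboards/uid/<dashboard-uid> > dashboard-backup.json

    # After the upgrade, re-import it. Depending on how it was exported, the JSON
    # may need to be wrapped as {"dashboard": ..., "overwrite": true} first.
    curl -s -X POST -H "Authorization: Bearer <grafana-api-token>" \
      -H "Content-Type: application/json" \
      -d @dashboard-backup.json \
      https://<grafana-host>/api/dashboards/db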