Known issues v1.3
These are the currently known issues and limitations identified in the Hybrid Manager 1.3 release. Where applicable, we have included workarounds to help you mitigate the impact of these issues. These issues are actively tracked and are planned for resolution in a future release.
Multi-DC
Multi-DC configuration loss after upgrade
Description: Configurations applied by the multi-DC setup scripts (for cross-cluster communication) do not persist after a Hybrid Manager platform upgrade or operator reconciliation.
Workaround: After every Hybrid Manager upgrade or component reversion, the multi-DC setup scripts must be run again to re-apply the necessary configurations.
Location dropdown list is empty for multi-DC setups
Description: In multi-data center environments, the API call to retrieve available locations fails with a gRPC message size error (429/4.3MB limit exceeded). This is due to the large amount of image set information included in the API response, resulting in an empty location list in the console.
Workaround: This advanced workaround requires cluster administrator privileges to limit the amount of image set information being returned by the API. It involves modifying the image discovery tag rules in the upm-image-library and upm-beacon ConfigMaps, followed by restarting the related pods.
Workaround details
The workaround modifies the regular expressions (tag rules) used by the image library and beacon components to temporarily limit the number of image tags being indexed. This reduces the API response size, allowing the locations to load.
1. Find the upm-image-library ConfigMap:
kubectl get configmaps -n upm-image-library | grep upm-image-library
# Example output: upm-image-library-ttkt29fmf7   1   5d3h
2. Edit the ConfigMap found in the previous step and modify the tags rule under each image discovery rule (edb-postgres-advanced, edb-postgres-extended, postgresql). Replace the existing regex with the limiting regex:
# Snippet of the YAML you will edit in the ConfigMap
"imageDiscovery": {
  "rules": {
    "(^|.*/)edb-postgres-advanced$": {
      "readme": "EDB postgres advanced server",
      "tags": [
        "^(?P<major>\\d+)\\.(?P<minor>\\d+)-2509(?P<day>\\d{2})(?P<hour>\\d{2})(?P<minute>\\d{2})(?:-(?P<pgdFlavor>pgdx|pgds))?(?:-(?P<suffix>full))?$"
      ]
    },
    # ... repeat for edb-postgres-extended and postgresql ...
  }
}
Note
This step must be performed on the primary Hybrid Manager cluster if you are running a multi-DC setup.
3. Restart the Image Library Pod:
kubectl rollout restart deployment upm-image-library -n upm-image-library
4. Get the upm-beacon ConfigMap to modify the Agent configuration:
kubectl get configmaps -n upm-beacon beacon-agent-k8s-config
5. Edit the ConfigMap (beacon-agent-k8s-config) and modify the tag_regex rule under each postgres_repositories entry (edb-postgres-advanced, edb-postgres-extended, postgresql):
# Snippet of the YAML you will edit in the ConfigMap
postgres_repositories:
  - name: edb-postgres-advanced
    description: EDB postgres advanced server
    tag_regex: "^(?P<major>\\d+)\\.(?P<minor>\\d+)-2509(?P<day>\\d{2})(?P<hour>\\d{2})(?P<minute>\\d{2})(?:-(?P<pgdFlavor>pgdx|pgds))?(?:-(?P<suffix>full))?$"
  # ... repeat for edb-postgres-extended and postgresql ...
6. Restart the Agent pod:
kubectl rollout restart deployment -n upm-beacon upm-beacon-agent-k8s
After completing these steps, the reduced image data size should allow the location API call to succeed, and the locations should appear correctly in the Hybrid Manager console.
Core platform and resources
upm-beacon-agent memory limits are insufficient in complex environments
Description: In environments with many databases and backups, the default 1GB memory allocation for the upm-beacon-agent pod is insufficient, which can lead to frequent OOMKill or crashloop issues. This resource limit is currently not configurable via the standard Helm values or the HybridControlPlane CR.
Workaround: Manually patch the Kubernetes deployment to increase the memory resource limits for the upm-beacon-agent pod, as in the sketch below.
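For example, one way to raise the limit is with kubectl set resources; the deployment name (upm-beacon-agent-k8s, as used elsewhere on this page), namespace, container name, and the 2Gi value are assumptions to adapt to your environment, and the change may need to be reapplied if the deployment is later reconciled or upgraded:
# List containers in the deployment to find the agent container name
kubectl -n upm-beacon get deployment upm-beacon-agent-k8s -o jsonpath='{.spec.template.spec.containers[*].name}'
# Raise the memory limit for that container (example value: 2Gi)
kubectl -n upm-beacon set resources deployment upm-beacon-agent-k8s -c <container-name> --limits=memory=2Gi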
Database cluster engine
Incorrect database name displayed for EDB Postgres Distributed (PGD) clusters
Description: The Hybrid Manager console's Connect tab and connection string incorrectly show the default database name for PGD clusters as edb_admin. PGD clusters require connection to the bdrdb database.
Workaround: For PGD cluster connection information, use one of the following reliable sources: the .PGPASS BLOB, the .PG_SERVICE.CONF file, or the full connection string from the cluster details page.
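For example, when connecting manually, make sure the database name is bdrdb rather than edb_admin; the host and user below are placeholders:
psql "host=<cluster-host> port=5432 user=<user> dbname=bdrdb sslmode=require"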
PGD-X cluster creation stuck in the "PGD - Reconcile application user" phase
Description: PGD-X cluster creation, particularly when involving a witness-only region, may stall due to:
- global RAFT leadership being unexpectedly held by the witness-only node, or
- a subgroup's enable_routing option being disabled.
Workaround for RAFT leadership issue: Manually transfer the global RAFT leadership to a node within a data group. Connect to the PGD cluster's bdrdb database and execute:
SELECT bdr.raft_leadership_transfer(node_name := '<target node>', wait_for_completion := true, node_group_name := 'world');
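To confirm which node currently holds the global RAFT leadership (and to verify it afterwards), you can query the PGD RAFT status view; this assumes PGD 5's bdr.group_raft_details view is available in your version:
SELECT * FROM bdr.group_raft_details;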
Workaround for enable_routing issue: Manually enable routing for the subgroup. Connect to bdrdb and execute:
SELECT bdr.alter_node_group_option('<subgroup name>','enable_routing','true');
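To verify that routing is now enabled for the subgroup, you can inspect the group's options; this assumes PGD 5's bdr.node_group_summary view is available in your version:
SELECT * FROM bdr.node_group_summary WHERE node_group_name = '<subgroup name>';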
Failure to create 3-node PGD cluster when max_connections is non-default
Description: Creating a 3-data node EDB Postgres Distributed (PGD) cluster fails if the configuration parameter max_connections is set to a non-default value during initial cluster provisioning.
Workaround: Create the PGD 3-data node cluster using the default max_connections value, and then update the value after the cluster has been successfully provisioned.
PGD database settings are not duplicated when creating or duplicating a second data group
Description: When creating or duplicating a second data group in a PGD cluster, Postgres settings (such as max_connections, max_worker_processes, and so on) are not automatically copied from the first data group. This can lead to inconsistent settings and cluster health issues, because the replica group's settings cannot be lower than those of the primary group.
Workaround: Manually edit the configuration for the second PGD group to ensure the database settings are identical to the first data group before provisioning.
AHA Witness node resources are over-provisioned
Description: For Advanced High Availability (AHA) clusters with witness nodes, the witness node incorrectly inherits the CPU, memory, and disk configuration of the larger data nodes, leading to unnecessary resource over-provisioning.
Workaround: Manually update the pgdgroup YAML configuration to specify the minimal resources needed by the witness node, as in the sketch below.
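A minimal sketch of the manual edit, assuming the cluster's pgdgroup resource lives in the cluster's namespace; the exact field layout for witness resources depends on your PGDGroup CRD version, so confirm it against the resource you retrieve:
# Find the pgdgroup resource for the cluster
kubectl -n <cluster-namespace> get pgdgroups
# Edit it and reduce the CPU, memory, and storage requested for the witness node
kubectl -n <cluster-namespace> edit pgdgroup <pgdgroup-name>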
HA clusters use verify-ca instead of verify-full for streaming replication certificate authentication
Description: Replica clusters use the less strict verify-ca setting for streaming replication authentication instead of the recommended, most secure verify-full. This is currently necessary because the underlying CloudNativePG (CNP) clusters do not support IP Subject Alternative Names (IP SANs), which are required for verify-full in certain environments (such as GKE Load Balancers).
Workaround: None. A fix is dependent on the underlying CNP component supporting IP SANs.
Second node is too slow to join large HA clusters
Description: For large clusters, the pg_basebackup process used by a second node (standby) to join an HA cluster is too slow. This can cause the standby node to fail to join, which prevents scaling a single node to HA and also causes issues when restoring a cluster directly into an HA configuration.
Workaround: Rather than loading data into a single node and then scaling to HA, load data directly into an HA cluster from the start. There is no workaround for restoring a large cluster into an HA configuration.
Backup and recovery
Replica cluster creation fails when using volume snapshot recovery across regions
Description: Creating a replica cluster in a second location that is in a different region fails with an InvalidSnapshot.NotFound error, because volume snapshot recovery does not support cross-region restoration.
Workaround: Manually trigger a Barman backup from the primary cluster first, and then use that Barman backup (instead of the volume snapshot) to provision the cross-region replica cluster.
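A minimal sketch of triggering an on-demand Barman backup with a Kubernetes Backup resource, assuming the primary cluster is managed through the CloudNativePG Backup API; the apiVersion, backup method, and names below are assumptions to check against your operator configuration (the backup can also be triggered from the HM console):
kubectl apply -n <primary-cluster-namespace> -f - <<EOF
apiVersion: postgresql.cnpg.io/v1
kind: Backup
metadata:
  name: pre-replica-barman-backup
spec:
  method: barmanObjectStore
  cluster:
    name: <primary-cluster-name>
EOF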
WAL archiving is slow due to default parallel configuration
Description: The default setting for wal.maxParallel is too restrictive, which slows down WAL archiving during heavy data loads. This can cause a backlog of ready-to-archive WAL files, potentially leading to disk-full conditions. This parameter is not yet configurable via the HM console.
Workaround: Manually edit the objectstores.barmancloud.cnpg.io Kubernetes resource for the specific backup object store and increase the wal.maxParallel value (for example, to 20) to accelerate archiving.
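A minimal sketch of the edit using kubectl patch; the namespace, object store name, and the spec.configuration.wal.maxParallel field path are assumptions to verify against the resource in your cluster:
# Find the object store resource used by the cluster's backups
kubectl get objectstores.barmancloud.cnpg.io -A
# Raise the number of parallel WAL archiving jobs (example value: 20)
kubectl -n <namespace> patch objectstores.barmancloud.cnpg.io <object-store-name> --type merge -p '{"spec":{"configuration":{"wal":{"maxParallel":20}}}}'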
AI Factory and model management
Failure to deploy nim-nvidia-nvclip model with profile cache
Description: Model creation for the nim-nvidia-nvclip model fails within the AI Factory when the profile cache is utilized during the deployment process.
Workaround: An administrator must manually download the necessary model profile from the NVIDIA registry to a local machine, upload the profile files directly to the Hybrid Manager's object storage path, and then deploy the model by patching the Kubernetes InferenceService YAML with an environment variable that forces it to use the pre-cached files instead of attempting the failed network download.
Workaround details
1. Log in to the NVIDIA Container Registry (nvcr.io) using your NGC API key:
docker login nvcr.io -u '$oauthtoken' -p $NGC_API_KEY
2. Pull the Docker image to your local machine:
docker pull nvcr.io/nim/nvidia/nvclip:latest
3. Prepare a local directory for the downloaded profiles:
mkdir -p ./model-cache
chmod -R a+w ./model-cache
4. Select the profile for your target GPU. For example, the A100 GPU profile:
9367a7048d21c405768203724f863e116d9aeb71d4847fca004930b9b9584bb6
5. Run the container to download the profile. The container is run in CPU-only mode (NIM_CPU_ONLY=1) to prevent GPU-specific initialization issues on the download machine:
export NIM_MANIFEST_PROFILE=9367a7048d21c405768203724f863e116d9aeb71d4847fca004930b9b9584bb6
export NIM_CPU_ONLY=1
docker run -v ./model-cache:/opt/nim/.cache -u $(id -u) -e NGC_API_KEY -e NIM_CPU_ONLY -e NIM_MANIFEST_PROFILE --rm nvcr.io/nim/nvidia/nvclip:latest
This container will not exit. You must manually stop the run (Ctrl+C) after you see the line "Health method called" in the logs, which confirms the profile download is complete.
6. Upload the profiles from your local machine to the object storage bucket used by your Hybrid Manager deployment:
gcloud storage cp -r ./model-cache gs://uat-gke-edb-object-storage/model-cache/nim-nvidia-nvclip
Note
Adjust the gs:// path to match your deployment's configured object storage location.
7. Create the model nim-nvidia-nvclip using the HM console, specifying the Model Profiles Path field as the previous location (for example, /model-cache/nim-nvidia-nvclip). The deployment will initially fail or become stuck.
8. Export the InferenceService YAML from the Hybrid Manager Kubernetes cluster.
9. Add the necessary environment variable, NIM_IGNORE_MODEL_DOWNLOAD_FAIL, to the env section of the spec.predictor.model block in the exported YAML. This flag tells the NIM container to use the locally available cache (the files you uploaded) and ignore the network download failure.
# --- Snippet of the modified InferenceService YAML ---
spec:
  predictor:
    minReplicas: 1
    model:
      modelFormat:
        name: nim-nvidia-nvclip
      name: ""
      env:
        - name: NIM_IGNORE_MODEL_DOWNLOAD_FAIL # <-- ADD THIS LINE
          value: "1" # <-- ADD THIS LINE
      resources:
        # ... resource requests/limits ...
      runtime: nim-nvidia-nvclip
      storageUri: gs://uat-gke-edb-object-storage/model-cache/nim-nvidia-nvclip
# ---------------------------------------------------
10. Apply the modified YAML using kubectl to force the deployment to use the pre-downloaded profiles:
kubectl apply -f <modified-inference-service-file.yaml> -n <model-cluster-namespace>
The pods should now proceed to start successfully, using the model profiles you manually provided via object storage.
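To confirm the rollout, you can watch the pods in the model cluster's namespace and follow the predictor logs; the namespace and pod name below are placeholders:
kubectl get pods -n <model-cluster-namespace> -w
kubectl logs -n <model-cluster-namespace> <predictor-pod-name> -f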
AI Model Cluster deployment stalls if object storage path for model profiles is empty
Description: Creating an AI Model cluster and specifying an object storage path in the Model Profiles Path field will cause the deployment to stall at the pending stage if the specified path contains no content (that is, the model profile does not yet exist).
Workaround: Ensure that the object storage path specified in the Model Profiles Path field contains a correct, valid profile before initiating the model cluster deployment.
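For example, you can confirm the path is not empty before creating the model cluster; the bucket and path below are placeholders, and equivalent tooling applies for non-GCS object storage:
gcloud storage ls gs://<bucket>/<model-profiles-path>/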
Incorrect model name causes 404 error when calling LLM remotely
Description: When calling a deployed NVIDIA NIM model via the API endpoint, the model name displayed on the model card may not be the correct name required by the API. This results in a "404 Not Found" error.
Workaround: To find the exact model name required for the API call (for example, nvidia/llama-3.3-nemotron-super-49b-v1), query the /v1/models API endpoint first.
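For example, assuming an OpenAI-compatible NIM endpoint and a bearer token (both placeholders below), the id values returned are the model names to use in API calls:
curl -s -H "Authorization: Bearer <api-key>" "https://<model-endpoint>/v1/models"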
HM console and observability
Tags for active model clusters are not displayed on the model details screen
Description: In the table showing active model clusters utilizing a specific model on the Model Details screen, the Tags field is empty.
Workaround: Model cluster tags can still be viewed correctly on the dedicated Model Cluster Details page.
Chat model cluster metrics are missing from the Grafana model overview dashboard
Description: Metrics for deployed chat model clusters are not displayed on the Grafana model overview dashboard, impacting observability for these specific AI components.
User-created Grafana dashboards do not persist after platform redeployment/upgrade
Description: Dashboards created by users directly within the Grafana application are not stored in persistent storage. They disappear when the Grafana pods are updated, redeployed, or restarted (e.g., during an EKS auto-update or a Hybrid Manager upgrade).
Workaround: Any custom dashboards must be backed up externally by exporting the dashboard as JSON. After an upgrade, they must be manually imported back into Grafana.
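For example, dashboards can be exported through Grafana's HTTP API as well as from the UI; the Grafana URL, service account token, and dashboard UID below are placeholders:
# List dashboards to find their UIDs
curl -s -H "Authorization: Bearer <grafana-token>" "https://<grafana-host>/api/search?type=dash-db"
# Export a dashboard definition as JSON and save it for re-import after the upgrade
curl -s -H "Authorization: Bearer <grafana-token>" "https://<grafana-host>/api/dashboards/uid/<dashboard-uid>" > dashboard-backup.json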