Deploy AI Factory with Hybrid Manager UI v1.3
Overview
This guide provides deployment procedures for AI Factory components through the Hybrid Manager (HCP) web interface. The deployment process covers GPU infrastructure setup, NVIDIA NIM model deployment, and Gen AI application configuration.
Prerequisites
Infrastructure Requirements
Before deploying AI Factory components:
- Kubernetes cluster with GPU nodes configured
- NVIDIA GPU operator installed
- Access to NVIDIA NGC registry or private registry
- Object storage for model profiles (required for air-gapped deployments)
GPU Node Configuration
Verify GPU nodes meet NVIDIA NIM requirements:
| Model Type | NIM Model | GPU Requirements | Container Image |
|---|---|---|---|
| Text Completion | llama-3.3-nemotron-super-49b-v1 | 4 x L40S | nvcr.io/nim/nvidia/llama-3.3-nemotron-super-49b-v1:1.8.5 |
| Text Embeddings | llama-3.2-nemoretriever-300m-embed-v1 | 1 x L40S | nvcr.io/nim/nvidia/llama-3.2-nemoretriever-300m-embed-v1:latest |
| Image Embeddings | nvclip | 1 x L40S | nvcr.io/nim/nvidia/nvclip:latest |
| OCR | paddleocr | 1 x L40S | nvcr.io/nim/baidu/paddleocr:latest |
| Text Reranking | llama-3.2-nv-rerankqa-1b-v2 | 1 x L40S | nvcr.io/nim/nvidia/llama-3.2-nv-rerankqa-1b-v2:latest |
Ensure GPU nodes include:
- Label: `nvidia.com/gpu=true`
- Taint: `nvidia.com/gpu`
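The label and taint above can be applied with kubectl. In this sketch, `<gpu-node-name>` is a placeholder for your node, and the `true:NoSchedule` value and effect are illustrative; confirm the exact taint your GPU operator and scheduler configuration expect:

```shell
# Label the GPU node so NIM workloads can be scheduled onto it
kubectl label node <gpu-node-name> nvidia.com/gpu=true

# Taint the node so non-GPU workloads are kept off it
# (NoSchedule is a common effect; verify against your cluster's policy)
kubectl taint node <gpu-node-name> nvidia.com/gpu=true:NoSchedule
```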
Step 1: Configure Registry Authentication
Internet-Connected Deployments
For clusters with internet access, configure NVIDIA NGC authentication:
- Obtain an NGC API key from the NVIDIA NGC Portal
- Create authentication secrets via the HCP UI:
  - Navigate to Project Settings → Secrets
  - Create `nvidia-nim-secrets` with the `NGC_API_KEY` value
  - Create the `ngc-cred` Docker registry secret
Alternatively, use kubectl:
```shell
NGC_API_KEY=<your-ngc-api-key>

# Create runtime secret
kubectl -n default create secret generic nvidia-nim-secrets \
  --from-literal=NGC_API_KEY=${NGC_API_KEY}
kubectl -n default annotate secret nvidia-nim-secrets \
  replicator.v1.mittwald.de/replicate-to='m-.*'

# Create image pull secret
kubectl -n default create secret docker-registry ngc-cred \
  --docker-server=nvcr.io \
  --docker-username='$oauthtoken' \
  --docker-password=${NGC_API_KEY}
kubectl -n default annotate secret ngc-cred \
  replicator.v1.mittwald.de/replicate-to='m-.*'
```
Air-Gapped Deployments
For environments without internet access:
- Mirror NIM Images: Copy required images to private registry using skopeo
- Update Model URLs: Configure HCP to reference private registry locations
- Cache Profiles: Download and store model profiles in object storage
- Configure Storage Path: Reference cached profiles during model deployment
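The image-mirroring step can be sketched with `skopeo copy`. The private registry host and the image list below are placeholders; substitute your own registry and the full set of images from the GPU node configuration table:

```shell
# Hypothetical private registry; replace with your own
PRIVATE_REGISTRY=registry.example.internal

# Mirror each required NIM image from nvcr.io into the private registry
for IMAGE in \
  nim/nvidia/llama-3.3-nemotron-super-49b-v1:1.8.5 \
  nim/nvidia/llama-3.2-nemoretriever-300m-embed-v1:latest \
  nim/baidu/paddleocr:latest
do
  skopeo copy \
    --src-creds "\$oauthtoken:${NGC_API_KEY}" \
    docker://nvcr.io/${IMAGE} \
    docker://${PRIVATE_REGISTRY}/${IMAGE}
done
```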
See Air-Gapped Configuration for detailed procedures.
Step 2: Access Model Library
Navigate to the Model Library interface:
- Log into Hybrid Manager console
- Select your project from the project selector
- Navigate to AI Factory → Model Library
The Model Library displays available NVIDIA NIM models:
- llama-3.3-nemotron-super-49b-v1: Advanced reasoning and chat capabilities (128K context)
- llama-3.2-nemoretriever-300m-embed-v1: High-quality text embeddings
- llama-3.2-nv-rerankqa-1b-v2: Multilingual query-document reranking
- paddleocr: Ultra-lightweight OCR system
- nvclip: Multimodal embeddings for image and text
Step 3: Deploy Model Server Cluster
Select Model for Deployment
- Click Deploy Model in Model Library
- Select target model from available options
- Review model requirements and documentation links
Configure Deployment Parameters
Configure the model server cluster settings:
Instance Configuration
- Server Instances: Default 1 (increase for high availability)
- Minimum Instances: 1 (0 for scale-to-zero when supported)
- Maximum Instances: Based on load requirements
Resource Allocation
- Memory: Configure based on model requirements
  - Text completion models: 64-128 GB
  - Embedding models: 32-64 GB
- CPU: Default values typically sufficient
- GPU: Match documented requirements
  - llama-3.3-nemotron: 4 GPUs
  - Other models: 1 GPU typical
Scaling Configuration
- Concurrent Request Threshold: Configure autoscaling trigger
- Scale to Zero: Enable when supported (may be unavailable in initial release)
Deploy Model
- Review configuration summary
- Click Deploy to initiate deployment
- Monitor deployment status in Model Server Clusters view
Deployment creates:
- KServe InferenceService resource
- Predictor pods with configured resources
- Service endpoints for model access
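For reference, the InferenceService that the UI creates has roughly the following shape. The name, replica counts, and resource values here are illustrative, not the exact manifest HCP generates:

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: llama-3-2-nemoretriever-embed   # illustrative name
spec:
  predictor:
    minReplicas: 1
    maxReplicas: 2
    imagePullSecrets:
      - name: ngc-cred
    containers:
      - name: kserve-container
        image: nvcr.io/nim/nvidia/llama-3.2-nemoretriever-300m-embed-v1:latest
        resources:
          limits:
            nvidia.com/gpu: "1"
            memory: 64Gi
```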
Step 4: Monitor Deployment Status
View Model Server Clusters
Navigate to AI Factory → Model Server Clusters to view:
- Cluster display name
- Load balancer ingress URI
- Deployment status (Active/Healthy, Pending, Failed)
- Model details and tags
- Resource utilization metrics
Access Cluster Details
Click on a cluster name to view:
- Detailed configuration parameters
- Real-time health metrics
- Grafana dashboard integration
- Inference latency charts
- System health indicators
Step 5: Configure API Access
Generate API Tokens
For external access to deployed models:
- Navigate to AI Factory → API Token Management
- Click Create Token
- Provide token reference name
- Store generated token securely
API tokens enable:
- RAG application authentication
- External service integration
- Programmatic model access
Access Endpoints
Models expose OpenAI-compatible endpoints.

Internal access (within the cluster):

```
http://<service-name>.<namespace>.svc.cluster.local/v1/chat/completions
```

External access (via ingress):

```
https://<ingress-url>/v1/chat/completions
```
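The external endpoint can be exercised with any OpenAI-compatible client. The sketch below builds the request by hand using only the standard library; the base URL, token, and prompt are placeholders you substitute from your own deployment:

```python
import json
import urllib.request

def build_chat_request(base_url: str, token: str, model: str, prompt: str):
    """Assemble an OpenAI-compatible chat completion request."""
    url = f"{base_url}/v1/chat/completions"
    headers = {
        "Authorization": f"Bearer {token}",
        "Content-Type": "application/json",
    }
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return url, headers, body

# Placeholders: substitute your ingress URL, API token, and deployed model
url, headers, body = build_chat_request(
    "https://<ingress-url>", "<api-token>",
    "llama-3.3-nemotron-super-49b-v1", "Hello",
)

# To send the request against a live deployment:
# req = urllib.request.Request(url, data=json.dumps(body).encode(),
#                              headers=headers, method="POST")
# print(urllib.request.urlopen(req).read().decode())
print(url)
```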
Step 6: Build Gen AI Applications
Create Knowledge Base
- Navigate to AI Factory → Gen AI Builder → Knowledge Bases
- Configure data sources:
- PostgreSQL databases
- Document repositories
- API connections
- Select embedding model from deployed services
- Process content to generate embeddings
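Since deployed embedding models expose the same OpenAI-compatible API, generated embeddings can be spot-checked directly. The URL, token, and input below are placeholders, and `/v1/embeddings` is the standard OpenAI-style path (verify it against your deployed service):

```shell
# Placeholder endpoint and token; substitute values from your deployment
curl -s https://<ingress-url>/v1/embeddings \
  -H "Authorization: Bearer <api-token>" \
  -H "Content-Type: application/json" \
  -d '{
        "model": "llama-3.2-nemoretriever-300m-embed-v1",
        "input": ["sample text to embed"]
      }'
```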
Configure Assistant
- Navigate to AI Factory → Assistants
- Create new assistant with:
- Name and directive
- Model endpoint selection
- Knowledge base integration
- Tool configuration
- Deploy assistant application
See Create Assistant Guide for detailed procedures.
Step 7: Update and Manage Models
Edit Model Server Cluster
To modify running clusters:
- Navigate to Model Server Clusters
- Select cluster to edit
- Adjust parameters:
- Instance counts
- Resource allocations
- Scaling thresholds
- Apply changes (triggers rolling update)
Rolling Updates
Updates maintain availability through:
- Connection draining before restart
- Health verification between instances
- Automatic rollback on failure
Troubleshooting
Common Deployment Issues
Model Fails to Start
- Verify GPU availability matches requirements
- Check registry authentication secrets
- Review pod events and logs
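The checks above map to standard kubectl commands; the namespace and pod names here are illustrative:

```shell
# Confirm schedulable GPUs on each node
kubectl get nodes -o custom-columns='NAME:.metadata.name,GPUS:.status.allocatable.nvidia\.com/gpu'

# Check that the registry secrets exist in the model's namespace
kubectl -n <model-namespace> get secret nvidia-nim-secrets ngc-cred

# Inspect events and logs for a failing predictor pod
kubectl -n <model-namespace> describe pod <predictor-pod>
kubectl -n <model-namespace> logs <predictor-pod> --previous
```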
High Inference Latency
- Adjust batch size parameters
- Increase GPU allocation
- Scale replicas for load distribution
API Token Issues
- Verify token hasn't expired
- Check network policies
- Confirm ingress configuration
For detailed diagnostics, see Troubleshooting Guide.
Additional Resources
Reference Documentation
NVIDIA Model Documentation