Model Management on Hybrid Manager v1.3

Purpose and Benefits

Model management within Hybrid Manager provides centralized governance and deployment capabilities for AI models running on your Kubernetes infrastructure. This system enables organizations to maintain complete control over their AI capabilities while leveraging enterprise-grade Model Serving infrastructure.

The integration addresses critical requirements for organizations deploying AI at scale: model governance through approved registries, scalable inference serving with GPU acceleration, and unified management through Hybrid Manager's control plane. By running models within your controlled infrastructure, you maintain data sovereignty while accessing state-of-the-art AI capabilities.

Core Concepts

Model Library

The Model Library serves as your centralized governance system for AI model images. Operating within Hybrid Manager's Asset Library infrastructure, it provides a curated view of validated models ready for production deployment.

The library implements multi-stage governance:

  • Automated synchronization from trusted container registries
  • Security scanning and vulnerability assessment
  • Approval workflows based on organizational policies
  • Metadata management for versioning and documentation

Models in the library power all AI Factory capabilities including Gen AI assistants, Knowledge Base pipelines, and custom inference applications. Only models validated through the library's governance framework can reach production environments.

Model Serving

Model Serving transforms approved models into scalable inference endpoints using KServe within your Kubernetes clusters. This infrastructure provides production-grade model deployment with automatic scaling, health management, and resource optimization.

Key serving capabilities include:

  • Automatic scaling of inference endpoints based on demand
  • GPU-accelerated ServingRuntimes such as vLLM and TensorRT-LLM, plus custom runtimes
  • Health monitoring and traffic routing handled by KServe
  • Internal cluster access or external API endpoints with authentication

Management Interface

Hybrid Manager provides unified management through its web console, abstracting Kubernetes complexity while maintaining full configurability. The interface enables:

  • Visual workflows for model deployment from library to serving
  • Resource allocation and scaling configuration
  • Monitoring dashboards for inference metrics and GPU utilization
  • Access control and endpoint management

Implementation Workflow

Model Registration

Organizations begin by configuring repository connections to trusted model sources. The Model Library synchronizes with external registries based on defined rules, automatically discovering and validating new model versions.

External Registry → Repository Rules → Security Scanning → Model Library

Repository rules determine which models enter your environment, implementing organizational policies at the point of ingestion. This automated approach reduces manual overhead while maintaining governance standards.
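
Repository rules are configured through Hybrid Manager itself; the sketch below is purely illustrative of what such a rule captures (a trusted source, admission patterns, and required checks). The field names are hypothetical and do not reflect the actual Hybrid Manager schema.

  # Hypothetical sketch of what a repository rule captures. Field names are
  # illustrative only, not the real Hybrid Manager configuration format.
  repository_rule = {
      "registry": "registry.example.com/ai-models",       # trusted source to synchronize
      "match": {
          "repositories": ["llama-3*", "bge-embedding*"],  # image name patterns to admit
          "tags": ["v*"],                                  # only versioned tags
      },
      "require": {
          "security_scan": "passed",   # block images with unresolved vulnerabilities
          "signature": True,           # require a verified digital signature
      },
  }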

Model Deployment

Validated models deploy through guided workflows that configure serving infrastructure:

  1. Model Selection: Browse available models in the library with metadata including version, performance characteristics, and resource requirements
  2. Runtime Configuration: Select or create ServingRuntimes optimized for the model framework (vLLM, TensorRT-LLM, custom)
  3. Resource Allocation: Define GPU, memory, and CPU requirements based on expected workload
  4. Endpoint Configuration: Set up internal cluster access or external API endpoints with authentication (see Access KServe endpoints)

The system creates InferenceService resources that KServe manages, handling pod scheduling, health monitoring, and traffic routing automatically.
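
Under the hood this is a standard KServe custom resource. As a minimal sketch only (the name, namespace, runtime, storage URI, and resource sizes are placeholders), an equivalent InferenceService could be created with the Kubernetes Python client; in practice the guided workflow generates a comparable resource for you.

  # Minimal sketch: create a KServe InferenceService with the Kubernetes Python
  # client. All names, the runtime, and the storageUri are placeholders.
  from kubernetes import client, config

  config.load_kube_config()  # or config.load_incluster_config() inside the cluster

  inference_service = {
      "apiVersion": "serving.kserve.io/v1beta1",
      "kind": "InferenceService",
      "metadata": {"name": "llama3-chat", "namespace": "models"},
      "spec": {
          "predictor": {
              "model": {
                  "modelFormat": {"name": "huggingface"},
                  "runtime": "vllm-runtime",  # a ServingRuntime available in the cluster
                  "storageUri": "oci://registry.example.com/models/llama3:v1",
                  "resources": {
                      "requests": {"nvidia.com/gpu": "1", "memory": "24Gi", "cpu": "4"},
                      "limits": {"nvidia.com/gpu": "1", "memory": "24Gi", "cpu": "4"},
                  },
              }
          }
      },
  }

  client.CustomObjectsApi().create_namespaced_custom_object(
      group="serving.kserve.io",
      version="v1beta1",
      namespace="models",
      plural="inferenceservices",
      body=inference_service,
  )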

Operational Management

Deployed models operate under continuous monitoring with automatic scaling based on demand. Hybrid Manager provides visibility through:

  • Real-time inference metrics including latency and throughput
  • GPU utilization tracking for resource optimization
  • Error rates and health status for proactive maintenance
  • Cost analysis based on resource consumption

GPU Infrastructure

Resource Types

Hybrid Manager supports various GPU configurations to match workload requirements:

  • Development GPUs: Lower-tier GPUs (T4, RTX series) for testing and experimentation, with time-slicing support for resource sharing
  • Production GPUs: High-performance GPUs (A100, H100) with dedicated allocation for latency-sensitive inference workloads
  • Multi-Instance GPUs: Partitioned GPUs enabling multiple smaller models to share hardware resources efficiently

Allocation Strategies

GPU scheduling occurs through Kubernetes device plugins that advertise available resources and enforce allocation policies. The system supports:

  • Exclusive GPU allocation for production workloads
  • Time-sliced sharing for development and testing
  • MIG partitions for efficient multi-model serving
  • Node affinity rules for workload placement

Resource quotas at the project level prevent GPU monopolization while ensuring fair access across teams. Automatic scaling adjusts GPU allocation based on inference demand, optimizing costs while meeting service requirements.
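
As an illustration, a project-level GPU cap maps naturally onto a Kubernetes ResourceQuota on the extended GPU resource. The namespace name and the limit below are placeholders.

  # Illustrative sketch: cap total GPU requests in a project namespace with a
  # ResourceQuota. The namespace name and the limit of 4 GPUs are placeholders.
  from kubernetes import client, config

  config.load_kube_config()

  quota = client.V1ResourceQuota(
      metadata=client.V1ObjectMeta(name="gpu-quota", namespace="team-a"),
      spec=client.V1ResourceQuotaSpec(hard={"requests.nvidia.com/gpu": "4"}),
  )
  client.CoreV1Api().create_namespaced_resource_quota(namespace="team-a", body=quota)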

Integration Patterns

Gen AI Builder Applications

Gen AI Builder leverages model endpoints for assistants and agents. LLMs deployed through Model Serving provide text generation capabilities while embedding models support RAG pipelines.

Applications access models through cluster-local DNS:

http://model-name.namespace.svc.cluster.local/v1/chat/completions
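
A minimal sketch of calling such an endpoint from inside the cluster, assuming the model is served through an OpenAI-compatible runtime such as vLLM (the service name, namespace, and model identifier are placeholders):

  # In-cluster call to an OpenAI-compatible chat endpoint; the service name,
  # namespace, and model identifier are placeholders.
  import requests

  response = requests.post(
      "http://model-name.namespace.svc.cluster.local/v1/chat/completions",
      json={
          "model": "model-name",
          "messages": [{"role": "user", "content": "Summarize our return policy."}],
          "max_tokens": 256,
      },
      timeout=60,
  )
  print(response.json()["choices"][0]["message"]["content"])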

Knowledge Base Pipelines

Knowledge bases utilize embedding models for document vectorization and semantic search. The Model Library ensures consistent model versions across ingestion and retrieval operations.

Pipeline integration benefits from:

  • Validated embedding models with known performance characteristics
  • Automatic model updates through library synchronization
  • Consistent vector dimensions across knowledge base operations
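
As a sketch, assuming the embedding model exposes an OpenAI-compatible /v1/embeddings route (the endpoint and model identifier are placeholders), ingestion and retrieval code can share a single helper so vector dimensions stay consistent:

  # Shared embedding helper for ingestion and retrieval, assuming an
  # OpenAI-compatible /v1/embeddings route. Endpoint and model are placeholders.
  import requests

  EMBEDDING_ENDPOINT = "http://embedding-model.namespace.svc.cluster.local/v1/embeddings"

  def embed(texts):
      response = requests.post(
          EMBEDDING_ENDPOINT,
          json={"model": "embedding-model", "input": texts},
          timeout=30,
      )
      return [item["embedding"] for item in response.json()["data"]]

  vectors = embed(["What is the warranty period?"])
  print(len(vectors[0]))  # vector dimension; must match the knowledge base schema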

Custom Applications

Organizations deploy specialized models for business-specific requirements. The serving infrastructure supports various model types including vision models, reranking models, and custom fine-tuned variants.

External applications access models through secured endpoints with API key authentication, enabling integration with existing business systems while maintaining security boundaries.
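
A hedged sketch of such an external call, assuming the endpoint is published with bearer-style API key authentication (the URL, header scheme, and environment variable are placeholders that depend on how the endpoint was configured):

  # External client call; the URL and authentication header are placeholders
  # that depend on how the endpoint was published and secured.
  import os
  import requests

  response = requests.post(
      "https://models.example.com/v1/chat/completions",
      headers={"Authorization": f"Bearer {os.environ['MODEL_API_KEY']}"},
      json={
          "model": "model-name",
          "messages": [{"role": "user", "content": "Classify this support ticket."}],
      },
      timeout=60,
  )
  response.raise_for_status()
  print(response.json()["choices"][0]["message"]["content"])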

Security and Governance

Access Control

Role-based access control governs model operations at multiple levels:

  • Library Access: Controls who can register and approve models
  • Deployment Permissions: Manages who can deploy models to serving
  • Endpoint Access: Determines who can invoke model endpoints

Integration with enterprise identity providers enables single sign-on while maintaining detailed audit trails for compliance requirements.

Model Governance

The governance framework ensures only validated models reach production:

  • Security scanning identifies vulnerabilities before deployment
  • Digital signatures verify model authenticity and integrity
  • Approval workflows enforce multi-stakeholder validation
  • Retention policies manage model lifecycle and deprecation

Data Protection

All model operations maintain data sovereignty within your infrastructure:

  • Models run exclusively on your Kubernetes clusters
  • Inference data never leaves your controlled environment
  • Network policies enforce traffic isolation
  • Encryption protects data in transit and at rest
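
As one illustration of traffic isolation, a standard Kubernetes NetworkPolicy can limit which namespaces may reach a model's serving pods. The namespace names, labels, and InferenceService name below are placeholders.

  # Illustrative sketch: allow ingress to a model's serving pods only from a
  # designated application namespace. Names and labels are placeholders.
  from kubernetes import client, config

  config.load_kube_config()

  policy = client.V1NetworkPolicy(
      metadata=client.V1ObjectMeta(name="restrict-model-ingress", namespace="models"),
      spec=client.V1NetworkPolicySpec(
          pod_selector=client.V1LabelSelector(
              match_labels={"serving.kserve.io/inferenceservice": "llama3-chat"}
          ),
          policy_types=["Ingress"],
          ingress=[client.V1NetworkPolicyIngressRule(
              _from=[client.V1NetworkPolicyPeer(
                  namespace_selector=client.V1LabelSelector(match_labels={"team": "genai-apps"})
              )]
          )],
      ),
  )
  client.NetworkingV1Api().create_namespaced_network_policy(namespace="models", body=policy)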

Monitoring and Optimization

Performance Metrics

Comprehensive monitoring tracks model performance across multiple dimensions:

  • Inference Latency: End-to-end request processing time
  • Token Throughput: Generation rate for language models
  • Batch Efficiency: Utilization of batching capabilities
  • Cache Hit Rates: Effectiveness of response caching

These metrics inform optimization decisions including batch size tuning, resource allocation adjustments, and caching strategy refinements.
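
Dashboards cover these metrics server-side, but a quick client-side measurement is often useful when tuning batch sizes or resource requests. A minimal sketch that measures end-to-end latency and approximate token throughput against a chat endpoint (the URL and model name are placeholders):

  # Measure end-to-end latency and approximate token throughput from the client
  # side. Endpoint and model name are placeholders.
  import time
  import requests

  start = time.perf_counter()
  response = requests.post(
      "http://model-name.namespace.svc.cluster.local/v1/chat/completions",
      json={
          "model": "model-name",
          "messages": [{"role": "user", "content": "Explain vector search briefly."}],
          "max_tokens": 200,
      },
      timeout=120,
  )
  elapsed = time.perf_counter() - start
  completion_tokens = response.json()["usage"]["completion_tokens"]
  print(f"latency: {elapsed:.2f}s, throughput: {completion_tokens / elapsed:.1f} tokens/s")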

Resource Utilization

GPU monitoring provides visibility into hardware efficiency:

  • Memory usage patterns identifying optimization opportunities
  • Compute utilization indicating scaling requirements
  • Power consumption for cost analysis
  • Temperature monitoring for hardware health

Cost Management

The system tracks resource consumption for cost attribution:

  • GPU hours per model and project
  • Storage costs for model artifacts
  • Network transfer for inference traffic
  • Scaling patterns affecting resource costs

Best Practices

Model Selection

Choose models based on clear requirements analysis:

  • Match model capabilities to use case requirements
  • Consider resource constraints and cost implications
  • Validate performance characteristics through testing
  • Document model selection rationale for future reference

Deployment Strategy

Implement systematic deployment approaches:

  • Start with conservative resource allocation
  • Monitor actual usage before optimization
  • Use canary deployments for model updates (see the sketch after this list)
  • Maintain rollback capabilities for problematic deployments
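
For KServe-backed deployments, a canary rollout can be expressed by setting canaryTrafficPercent on the predictor, which routes a fraction of traffic to the new revision while the previous revision continues serving the rest. A minimal sketch using the Kubernetes Python client (the InferenceService name, namespace, and storage URI are placeholders):

  # Shift 10% of traffic to a new model revision via KServe's canaryTrafficPercent
  # field. Name, namespace, and storageUri are placeholders.
  from kubernetes import client, config

  config.load_kube_config()

  client.CustomObjectsApi().patch_namespaced_custom_object(
      group="serving.kserve.io",
      version="v1beta1",
      namespace="models",
      plural="inferenceservices",
      name="llama3-chat",
      body={"spec": {"predictor": {
          "canaryTrafficPercent": 10,
          "model": {"storageUri": "oci://registry.example.com/models/llama3:v2"},
      }}},
  )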

Operational Excellence

Maintain high operational standards:

  • Establish baseline performance metrics
  • Implement comprehensive monitoring before production
  • Document configuration decisions and procedures
  • Review resource utilization and costs regularly

Getting Started

Begin your model management journey with these steps:

  1. Enable Model Capabilities in your Hybrid Manager project through Settings → Features
  2. Configure Repository Rules to connect trusted model sources
  3. Deploy Your First Model using the guided deployment workflow
  4. Monitor Performance through integrated dashboards

Model management within Hybrid Manager transforms AI deployment from experimental prototypes to production-grade services, maintaining sovereignty while delivering enterprise capabilities.