Model Serving Reference Manual v1.3
Model Serving transforms validated AI models into production-ready inference endpoints that integrate seamlessly with EDB Postgres AI capabilities. The system deploys containerized models as scalable services within Kubernetes environments managed by Hybrid Manager.
Prerequisites: Model Serving requires Hybrid Manager installation and Asset Library configuration for model image management. All inference services operate within your controlled Kubernetes environment.
System Architecture
KServe Foundation
Model Serving utilizes KServe as the underlying orchestration engine, providing enterprise-grade capabilities for model deployment, scaling, and lifecycle management. KServe abstracts the complexity of containerized model deployment while maintaining comprehensive control over inference operations.
Core Components:
- ServingRuntime: Framework-specific execution environments optimized for different model types
- InferenceService: Production endpoints with automatic scaling and health management
- Resource Management: GPU allocation and compute optimization across deployed models
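How these components fit together can be sketched as a Kubernetes manifest. The following is an illustrative, minimal InferenceService expressed as a Python dictionary; the runtime name, namespace, and storageUri are placeholders, not values shipped with Hybrid Manager.

```python
# Illustrative only: a minimal KServe InferenceService that references a
# ServingRuntime by name. Names, namespace, and storageUri are placeholders.
import yaml  # pip install pyyaml

inference_service = {
    "apiVersion": "serving.kserve.io/v1beta1",
    "kind": "InferenceService",
    "metadata": {"name": "example-llm", "namespace": "models"},
    "spec": {
        "predictor": {
            "model": {
                "runtime": "example-serving-runtime",    # existing ServingRuntime
                "modelFormat": {"name": "huggingface"},   # format the runtime accepts
                "storageUri": "oci://registry.example.com/models/example-llm:1.0",
                "resources": {
                    "requests": {"cpu": "2", "memory": "8Gi", "nvidia.com/gpu": "1"},
                    "limits": {"cpu": "4", "memory": "16Gi", "nvidia.com/gpu": "1"},
                },
            }
        }
    },
}

# Render as YAML for review or for applying with your usual tooling.
print(yaml.safe_dump(inference_service, sort_keys=False))
```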
Integration Framework
The serving infrastructure operates as a bridge between the Model Library's governed model catalog and EDB PG AI's operational capabilities. Models approved through governance workflows become available for deployment as inference services that support various AI workloads.
Supported Model Types:
- Large Language Models (LLMs) for text generation and completion
- Embedding models for vector search and retrieval augmented generation
- Vision models including CLIP, OCR, and image embedding capabilities
- Reranking models for search result optimization
Operational Characteristics
Deployment Patterns
InferenceServices support multiple deployment strategies based on organizational requirements and model characteristics. The system accommodates both high-availability production deployments and development environments with appropriate resource allocation and scaling behaviors.
Deployment Configurations:
- Single-model deployments for dedicated inference workloads
- Multi-model serving for resource efficiency with compatible models
- Canary deployments for safe production updates with gradual traffic shifting
- A/B testing configurations for model performance comparison
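As a sketch of the canary pattern, KServe's v1beta1 API exposes a canaryTrafficPercent field on the predictor. The snippet below assumes an existing InferenceService named example-llm and only illustrates the shape of the change; verify field support against the KServe version bundled with your release.

```python
# Illustrative canary update: route a small share of traffic to a new model
# revision by patching canaryTrafficPercent on the predictor. The service
# name, namespace, and storageUri are placeholders.
canary_patch = {
    "spec": {
        "predictor": {
            "canaryTrafficPercent": 10,  # ~10% of traffic to the new revision
            "model": {
                "modelFormat": {"name": "huggingface"},
                "storageUri": "oci://registry.example.com/models/example-llm:1.1",
            },
        }
    }
}

# Apply with your preferred tooling, e.g. the Kubernetes Python client:
#   CustomObjectsApi().patch_namespaced_custom_object(
#       group="serving.kserve.io", version="v1beta1", namespace="models",
#       plural="inferenceservices", name="example-llm", body=canary_patch)
```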
Scaling Mechanisms
Automatic scaling adjusts compute resources based on inference demand, optimizing cost while maintaining service level objectives. The system supports both horizontal pod scaling and vertical resource adjustment based on workload characteristics.
Scaling Factors:
- Request concurrency and queue depth
- Model-specific latency requirements
- GPU memory utilization patterns
- Custom metrics for specialized workloads
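To make the concurrency-driven case concrete, the sketch below shows scaling fields on an InferenceService predictor. Field names follow KServe's v1beta1 conventions; the values are arbitrary examples, not recommendations.

```python
# Illustrative scaling settings on an InferenceService predictor.
scaling_spec = {
    "spec": {
        "predictor": {
            "minReplicas": 1,              # keep one replica warm
            "maxReplicas": 4,              # cap horizontal scale-out
            "scaleMetric": "concurrency",  # scale on in-flight requests
            "scaleTarget": 8,              # target concurrent requests per replica
        }
    }
}
```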
Resource Management
GPU resources are managed through centralized allocation strategies that optimize utilization across multiple models. The system supports various GPU configurations and provides isolation mechanisms to prevent resource contention.
Resource Considerations:
- GPU memory requirements for model loading and inference
- CPU allocation for preprocessing and postprocessing operations
- Network bandwidth for high-throughput inference workloads
- Storage requirements for model artifacts and temporary data
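A rough way to reason about the first consideration: model weights alone need roughly parameter count times bytes per parameter, plus headroom for the KV cache, activations, and runtime overhead. The sketch below is a back-of-envelope estimate, not a sizing guarantee.

```python
def estimate_gpu_memory_gib(params_billions: float,
                            bytes_per_param: float = 2.0,
                            overhead_fraction: float = 0.3) -> float:
    """Back-of-envelope GPU memory estimate for serving a model.

    bytes_per_param: 2.0 for fp16/bf16 weights, ~1.0 for int8, ~0.5 for 4-bit.
    overhead_fraction: rough allowance for KV cache, activations, and runtime.
    """
    weights_gib = params_billions * 1e9 * bytes_per_param / (1024 ** 3)
    return weights_gib * (1 + overhead_fraction)

# Example: a 7B-parameter model served in fp16 needs on the order of ~17 GiB.
print(f"{estimate_gpu_memory_gib(7):.1f} GiB")
```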
Access and Integration
Endpoint Configuration
InferenceServices expose standardized API endpoints that provide compatibility with common AI frameworks and applications. The system supports both internal cluster communication and external access through secured gateway configurations.
Internal Access Patterns:
- Cluster-local DNS resolution for service-to-service communication
- Network policy enforcement for security isolation
- Service mesh integration for advanced traffic management
External Access Configuration:
- Portal-based endpoints with authentication and authorization
- API key management for secure external integration
- Rate limiting and quota enforcement capabilities
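The two access paths differ mainly in the hostname and in where authentication is enforced. The sketch below contrasts them; both URLs are placeholders, and the internal hostname pattern in particular depends on how the predictor Service is named in your cluster.

```python
import os
import requests

payload = {"model": "example-embedder", "input": ["hello world"]}

# Internal, cluster-local call: resolved by cluster DNS, no gateway in the path.
# Hostname pattern is illustrative; check the actual Service created for your model.
internal = requests.post(
    "http://example-embedder-predictor.models.svc.cluster.local/v1/embeddings",
    json=payload, timeout=30)

# External, portal-based call: goes through the secured gateway with an API key.
external = requests.post(
    "https://ai-portal.example.com/inference/example-embedder/v1/embeddings",
    headers={"Authorization": f"Bearer {os.environ['MODEL_API_KEY']}"},
    json=payload, timeout=30)

print(internal.status_code, external.status_code)
```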
API Compatibility
Standard endpoint paths provide compatibility with OpenAI-compatible clients and custom applications, enabling seamless integration with existing AI workflows and toolchains.
Endpoint Specifications:
- Chat completions (LLMs): <base>/v1/chat/completions
- Embeddings: <base>/v1/embeddings
- Reranking: <base>/v1/ranking
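Because the chat and embedding paths follow the OpenAI wire format, any OpenAI-compatible client can target them by overriding the base URL; reranking sits outside that client and is called as a plain HTTP endpoint. The base URL, model names, and the ranking request body below are placeholders; check your reranker runtime's schema before relying on it.

```python
import requests
from openai import OpenAI  # pip install openai

# Placeholder base URL and model names; use your deployed endpoint instead.
base = "https://ai-portal.example.com/inference"
client = OpenAI(base_url=f"{base}/v1", api_key="YOUR_API_KEY")

# Chat completion against an LLM InferenceService.
chat = client.chat.completions.create(
    model="example-llm",
    messages=[{"role": "user", "content": "Summarize KServe in one sentence."}],
)
print(chat.choices[0].message.content)

# Embeddings for vector search or RAG pipelines.
emb = client.embeddings.create(model="example-embedder", input=["postgres", "kserve"])
print(len(emb.data[0].embedding))

# Reranking is not part of the OpenAI client, so call it directly.
# The payload shape is an assumption; consult your reranker runtime's API.
rank = requests.post(
    f"{base}/v1/ranking",
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    json={"model": "example-reranker", "query": "vector search",
          "passages": [{"text": "KServe serves models."},
                       {"text": "Postgres stores data."}]},
    timeout=30,
)
print(rank.json())
```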
EDB PG AI Integration
Model Serving provides inference capabilities for all EDB PG AI components through consistent interfaces that abstract deployment complexity while maintaining operational control.
Integration Points:
- AI Accelerator Pipelines leverage embedding and preprocessing models for data transformation workflows
- Knowledge Bases utilize embedding models for vector search and semantic retrieval operations
- Gen AI Builder applications access LLMs for conversational AI and text generation capabilities
Configuration Framework
ServingRuntime Management
ServingRuntime configurations define the execution environment for different model frameworks, providing optimized performance characteristics while abstracting framework-specific complexity from deployment operations.
Runtime Optimization Features:
- Framework-specific optimizations for model loading and inference
- GPU kernel optimization for improved throughput
- Memory management strategies for large model deployment
- Batch processing capabilities for improved resource utilization
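As an illustration of what a ServingRuntime encapsulates, the sketch below outlines a KServe ServingRuntime that wraps an LLM serving image. The image, arguments, and supported format are placeholders rather than a configuration shipped with the product.

```python
# Illustrative ServingRuntime: pairs a model format with a serving container.
# Image, args, and format name are placeholders.
serving_runtime = {
    "apiVersion": "serving.kserve.io/v1alpha1",
    "kind": "ServingRuntime",
    "metadata": {"name": "example-serving-runtime", "namespace": "models"},
    "spec": {
        "supportedModelFormats": [{"name": "huggingface", "autoSelect": True}],
        "containers": [{
            "name": "kserve-container",
            "image": "registry.example.com/serving/llm-runtime:latest",
            "args": ["--max-batch-size=32"],  # runtime-specific tuning flag (example)
            "resources": {"limits": {"nvidia.com/gpu": "1"}},
        }],
    },
}
```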
Resource Allocation
Proper resource configuration ensures optimal performance while avoiding resource contention and out-of-memory conditions. Organizations must balance performance requirements with infrastructure costs when defining resource specifications.
Critical Configuration Areas:
- CPU and memory requests/limits for container scheduling
- GPU allocation and sharing strategies
- Network and storage performance requirements
- Health probe and timeout configurations
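One practical way to catch requests/limits mistakes before scheduling failures occur is a local sanity check like the sketch below. It is an illustrative helper, not a Hybrid Manager feature, and it intentionally ignores unit conversion between quantity strings.

```python
def check_resources(resources: dict) -> list[str]:
    """Return warnings for common requests/limits misconfigurations."""
    warnings = []
    requests_, limits = resources.get("requests", {}), resources.get("limits", {})
    for key in requests_:
        if key not in limits:
            warnings.append(f"no limit set for {key}")
    # Extended resources such as GPUs cannot be overcommitted: request must equal limit.
    if "nvidia.com/gpu" in requests_ and requests_["nvidia.com/gpu"] != limits.get("nvidia.com/gpu"):
        warnings.append("GPU request and limit should match")
    return warnings

print(check_resources({
    "requests": {"cpu": "2", "memory": "8Gi", "nvidia.com/gpu": "1"},
    "limits": {"memory": "16Gi", "nvidia.com/gpu": "1"},
}))  # -> ['no limit set for cpu']
```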
Observability Configuration
Comprehensive monitoring enables proactive management of inference services while providing visibility into performance characteristics and resource utilization patterns.
Monitoring Capabilities:
- Request latency and throughput metrics
- Resource utilization tracking across CPU, memory, and GPU
- Error rate monitoring with detailed failure analysis
- Custom metrics for application-specific requirements
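If request metrics are scraped into Prometheus (an assumption; your monitoring stack may differ), a latency objective can be spot-checked with a range query such as the one sketched below. The metric name is a common Knative queue-proxy convention, not a guarantee; substitute whatever your deployment actually records.

```python
import requests

# Assumes a reachable Prometheus server; URL and metric name are placeholders.
PROM_URL = "http://prometheus.monitoring.svc:9090"
query = (
    'histogram_quantile(0.95, sum(rate('
    'revision_app_request_latencies_bucket{namespace_name="models"}[5m])) by (le))'
)
resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": query}, timeout=10)
resp.raise_for_status()
result = resp.json()["data"]["result"]
print("p95 latency (ms):", result[0]["value"][1] if result else "no data")
```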
Operational Considerations
Performance Optimization
Model Serving performance depends on multiple factors including model characteristics, hardware configuration, and traffic patterns. Organizations should establish baseline performance metrics and continuously optimize based on production usage patterns.
Optimization Strategies:
- Batch size tuning for optimal throughput versus latency trade-offs
- Quantization techniques for memory efficiency without significant accuracy loss
- Caching strategies for frequently accessed models and predictions
- Request routing optimization for multi-model deployments
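A simple way to explore the throughput-versus-latency trade-off is to replay a fixed request at increasing concurrency and watch where latency starts to climb faster than throughput. The endpoint, key, and payload below are placeholders for your own deployment.

```python
import statistics
import time
from concurrent.futures import ThreadPoolExecutor

import requests

URL = "https://ai-portal.example.com/inference/example-llm/v1/chat/completions"  # placeholder
HEADERS = {"Authorization": "Bearer YOUR_API_KEY"}
PAYLOAD = {"model": "example-llm",
           "messages": [{"role": "user", "content": "ping"}], "max_tokens": 16}

def one_request() -> float:
    """Send a single request and return its wall-clock latency in seconds."""
    start = time.perf_counter()
    requests.post(URL, headers=HEADERS, json=PAYLOAD, timeout=60).raise_for_status()
    return time.perf_counter() - start

for concurrency in (1, 2, 4, 8):
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        t0 = time.perf_counter()
        latencies = list(pool.map(lambda _: one_request(), range(concurrency * 8)))
        elapsed = time.perf_counter() - t0
    print(f"concurrency={concurrency} "
          f"p50={statistics.median(latencies):.2f}s "
          f"throughput={len(latencies) / elapsed:.1f} req/s")
```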
Security Framework
Production model serving requires comprehensive security controls that protect both model assets and inference data while maintaining operational efficiency.
Security Controls:
- Network isolation through Kubernetes network policies
- Authentication and authorization for API access
- Model image security scanning and vulnerability management
- Audit logging for compliance and forensic analysis
Scalability Limitations
Understanding system limitations enables appropriate capacity planning and architectural decisions for large-scale deployments.
Scaling Constraints:
- GPU memory limits model size and batch processing capabilities
- Network bandwidth affects high-throughput inference workloads
- Storage I/O performance impacts model loading and checkpoint operations
- Kubernetes cluster limits affect maximum concurrent deployments
Troubleshooting Framework
Common Issues
Model Serving deployments may encounter various operational issues that require systematic troubleshooting approaches.
Typical Problems:
- HTTP 404 errors often indicate incorrect endpoint paths or service misconfiguration
- Resource allocation failures typically result from insufficient GPU or memory availability
- Model loading timeouts may require adjusted health probe configurations
- Performance degradation often indicates resource contention or suboptimal batch sizing
Diagnostic Procedures
Systematic diagnostic approaches enable efficient issue resolution and minimize service disruption during troubleshooting operations.
Diagnostic Steps:
- Verify InferenceService status and event logs
- Check resource allocation and utilization metrics
- Validate network connectivity and endpoint accessibility
- Review model loading logs and container status
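The first two steps can be scripted with the Kubernetes Python client, as sketched below; kubectl is equally suitable. The namespace, service name, and pod label selector are assumptions to adjust for your environment.

```python
from kubernetes import client, config  # pip install kubernetes

config.load_kube_config()  # or config.load_incluster_config() inside the cluster
namespace, name = "models", "example-llm"  # placeholders

# 1. InferenceService status conditions (Ready, PredictorReady, ...).
isvc = client.CustomObjectsApi().get_namespaced_custom_object(
    group="serving.kserve.io", version="v1beta1",
    namespace=namespace, plural="inferenceservices", name=name)
for cond in isvc.get("status", {}).get("conditions", []):
    print(cond.get("type"), cond.get("status"), cond.get("message", ""))

# 2. Pod phase for the predictor pods (assumed KServe label convention).
core = client.CoreV1Api()
for pod in core.list_namespaced_pod(
        namespace,
        label_selector=f"serving.kserve.io/inferenceservice={name}").items:
    print(pod.metadata.name, pod.status.phase)
```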
Implementation Patterns
Development Environments
Development and testing environments benefit from simplified configurations that prioritize rapid iteration over production-grade reliability and performance characteristics.
Development Configuration:
- Reduced resource allocation for cost efficiency
- Simplified networking without external access requirements
- Relaxed health check and timeout configurations
- Shared GPU resources across multiple development models
Production Deployments
Production environments require comprehensive configuration that ensures reliability, performance, and security while supporting operational requirements.
Production Requirements:
- High-availability configurations with appropriate redundancy
- Comprehensive monitoring and alerting integration
- Security controls aligned with organizational policies
- Capacity planning based on anticipated workload characteristics
Hybrid Architectures
Organizations may deploy models across multiple infrastructure tiers based on performance, security, or compliance requirements while maintaining consistent operational procedures.
Hybrid Considerations:
- Consistent model versions across different deployment environments
- Network connectivity and latency optimization between environments
- Security boundary management for cross-environment communication
- Unified monitoring and management across hybrid deployments
Getting Started
Initial Deployment
Organizations beginning with Model Serving should start with simple configurations and gradually add complexity based on operational experience and requirements.
Setup Sequence:
- Verify Hybrid Manager and Asset Library installation
- Configure basic ServingRuntime for target model framework
- Deploy initial InferenceService with conservative resource allocation
- Validate functionality through endpoint testing and monitoring
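A minimal validation pass for the last step might look like the following: confirm the endpoint answers and returns a well-formed completion. URL, key, and model name are placeholders.

```python
import requests

BASE = "https://ai-portal.example.com/inference/example-llm"  # placeholder
HEADERS = {"Authorization": "Bearer YOUR_API_KEY"}

resp = requests.post(f"{BASE}/v1/chat/completions", headers=HEADERS, json={
    "model": "example-llm",
    "messages": [{"role": "user", "content": "Reply with the single word: ready"}],
    "max_tokens": 5,
}, timeout=60)
resp.raise_for_status()
assert resp.json()["choices"], "endpoint responded but returned no choices"
print("InferenceService smoke test passed")
```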
Configuration Resources
Best Practices
- Start with proven ServingRuntime configurations before customization
- Implement comprehensive monitoring before production deployment
- Establish resource allocation guidelines based on model characteristics
- Document configuration decisions and operational procedures
Model Serving transforms governed AI models into production-ready inference capabilities that scale with organizational requirements while maintaining comprehensive control over performance, security, and operational characteristics.