Model Serving Concepts v1.2
Model Serving is a core capability of EDB Postgres® AI (EDB PG AI), enabling scalable, flexible, and high-performance serving of AI/ML models on Kubernetes.
It powers:
- Gen AI applications
- Intelligent retrieval systems
- Advanced data pipelines
 
Model Serving is implemented using KServe, an open-source, Kubernetes-native platform for standardized model inference. AI Factory integrates KServe with Hybrid Manager to provide enterprise-grade lifecycle management, security, and observability.
Key to Sovereign AI: Models run on your Kubernetes clusters, under your control, with full observability and governance.
Before you start
Prerequisites for understanding Model Serving:
- Familiarity with Kubernetes basics (pods, services, deployments)
- Understanding of InferenceService as a Kubernetes CRD
- Awareness of Model Library and how models are registered and deployed
- Understanding of Sovereign AI principles in EDB PG AI
 
Why it matters
Model Serving enables your AI models and Postgres data to work together seamlessly — securely and scalably — under Sovereign AI principles:
- Deploy open-source or commercial models to your Kubernetes cluster.
- Serve Gen AI models for Assistants, Knowledge Bases, and RAG pipelines.
- Support multi-modal retrieval (text, image, hybrid search).
- Optimize performance with GPU acceleration and server-side batching.
- Maintain full observability, auditing, and governance over model usage.
 
See also: Hybrid Manager Model Serving integration in production environments.
Core concepts
InferenceService (via KServe)
At the core of Model Serving is the InferenceService — a Kubernetes-native resource that represents a deployed model.
It defines the end-to-end serving pipeline (a minimal example follows this list):
- Predictor — Runs the model server and handles inference.
- Transformer (optional) — Applies pre-processing or post-processing.
- Explainer (optional) — Provides model explainability outputs.
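
To make this concrete, here is a minimal sketch of an InferenceService created with the Kubernetes Python client. The namespace, service name, model format, and storage URI are illustrative placeholders; in EDB PG AI you would normally create this resource through the Model Library → Model Serving flow rather than applying it by hand.

```python
# Minimal sketch: create a KServe InferenceService with only a Predictor.
# Namespace, name, and storageUri are illustrative placeholders.
from kubernetes import client, config

config.load_kube_config()  # use load_incluster_config() when running inside the cluster

inference_service = {
    "apiVersion": "serving.kserve.io/v1beta1",
    "kind": "InferenceService",
    "metadata": {"name": "sentiment-demo", "namespace": "models"},
    "spec": {
        "predictor": {
            "model": {
                "modelFormat": {"name": "sklearn"},
                "storageUri": "s3://my-bucket/models/sentiment/",  # placeholder
            }
        }
        # Optional "transformer" and "explainer" sections add the
        # pre/post-processing and explainability containers described above.
    },
}

client.CustomObjectsApi().create_namespaced_custom_object(
    group="serving.kserve.io",
    version="v1beta1",
    namespace="models",
    plural="inferenceservices",
    body=inference_service,
)
```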
 
Predictor
The Predictor defines:
- Model format — PyTorch, TensorFlow, ONNX, Triton, etc.
- Model location — S3-compatible storage, OCI registry, PVC.
- Resources — CPU, memory, GPU.
- Autoscaling — Policies for elastic scaling, including scale-to-zero (see the sketch after this list).
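
The bullets above map onto fields of the predictor spec roughly as shown below. All concrete values (framework, bucket path, GPU count, replica bounds) are assumptions rather than recommendations, and scale-to-zero assumes the Knative-backed serverless deployment mode.

```python
# Illustrative predictor spec covering format, location, resources, and autoscaling.
# Every concrete value here is a placeholder.
predictor_spec = {
    "minReplicas": 0,   # scale-to-zero when idle (serverless/Knative mode)
    "maxReplicas": 4,   # upper bound for elastic scaling
    "model": {
        "modelFormat": {"name": "pytorch"},                 # model format
        "storageUri": "s3://my-bucket/models/classifier/",  # model location (placeholder)
        "resources": {                                      # CPU, memory, GPU
            "requests": {"cpu": "2", "memory": "8Gi", "nvidia.com/gpu": "1"},
            "limits": {"cpu": "4", "memory": "16Gi", "nvidia.com/gpu": "1"},
        },
    },
}
```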
 
ServingRuntime / ClusterServingRuntime
Reusable runtime definitions for serving:
- ServingRuntime — Namespace-scoped.
- ClusterServingRuntime — Cluster-wide reusable runtimes.
 
Benefits:
- Tailor runtime settings to model type and hardware.
- Standardize runtime configurations across teams and projects (see the runtime sketch after this list).
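
As a rough illustration, the sketch below defines a namespace-scoped ServingRuntime and a predictor that selects it by name. The runtime name, container image, arguments, and supported model format are assumptions for illustration only.

```python
# Sketch: a namespace-scoped ServingRuntime and a predictor that references it.
# Names, image tag, and arguments are illustrative placeholders.
serving_runtime = {
    "apiVersion": "serving.kserve.io/v1alpha1",
    "kind": "ServingRuntime",  # "ClusterServingRuntime" for a cluster-wide runtime
    "metadata": {"name": "team-triton-runtime", "namespace": "models"},
    "spec": {
        "supportedModelFormats": [{"name": "onnx", "version": "1", "autoSelect": True}],
        "containers": [
            {
                "name": "kserve-container",
                "image": "nvcr.io/nvidia/tritonserver:23.10-py3",  # placeholder tag
                "args": ["tritonserver", "--model-repository=/mnt/models"],
            }
        ],
    },
}

# An InferenceService predictor can then select the runtime by name:
predictor = {
    "model": {
        "modelFormat": {"name": "onnx"},
        "runtime": "team-triton-runtime",
        "storageUri": "s3://my-bucket/models/onnx-encoder/",  # placeholder
    }
}
```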
 
How it works
Lifecycle flow
1. Register the model image in the Model Library.
2. Deploy the model via the Model Serving console or CLI, which creates an InferenceService.
3. AI Factory and KServe provision the Kubernetes resources (pods, services).
4. The model is loaded into the runtime container.
5. A Kubernetes service endpoint is exposed.
6. Clients send HTTP/gRPC inference requests (see the example after this list).
7. Requests may pass through Transformers and Explainers.
8. The inference response is returned.
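
Step 6 is an ordinary HTTP call. The sketch below uses the KServe V1 REST protocol; the service URL (normally taken from the InferenceService status) and the input shape are placeholders that depend on your model and ingress setup.

```python
# Sketch: send a V1-protocol inference request to a deployed InferenceService.
# URL, model name, and payload shape are placeholders.
import requests

service_url = "https://sentiment-demo.models.example.com"  # e.g. from the InferenceService status.url
payload = {"instances": [[0.1, 0.4, 0.7, 0.2]]}            # input shape depends on the model

resp = requests.post(f"{service_url}/v1/models/sentiment-demo:predict", json=payload, timeout=30)
resp.raise_for_status()
print(resp.json()["predictions"])
```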
 
Key features
- Multi-framework support — PyTorch, TensorFlow, ONNX, XGBoost, Triton, and more.
- GPU acceleration — Native NVIDIA GPU support.
- Autoscaling — Including scale-to-zero via Knative.
- Observability — Prometheus metrics, Kubernetes logging, and Hybrid Manager dashboards (see the status-check sketch after this list).
- Batching — Server-side batching for improved throughput.
- Explainability — Support for model explainability tooling.
- Security and Sovereign AI — Models run on your Kubernetes clusters under your control.
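
One simple observability check is reading the status KServe maintains on the InferenceService itself: the endpoint URL and readiness conditions. A minimal sketch with the Kubernetes Python client, reusing the placeholder name and namespace from the earlier example:

```python
# Sketch: read a deployed InferenceService's endpoint URL and Ready condition.
from kubernetes import client, config

config.load_kube_config()

isvc = client.CustomObjectsApi().get_namespaced_custom_object(
    group="serving.kserve.io",
    version="v1beta1",
    namespace="models",
    plural="inferenceservices",
    name="sentiment-demo",
)

status = isvc.get("status", {})
ready = next((c["status"] for c in status.get("conditions", []) if c["type"] == "Ready"), "Unknown")
print("URL:  ", status.get("url"))
print("Ready:", ready)
```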
 
Patterns of use
Gen AI Builder
- Serve LLMs and multi-modal models powering Assistants and Agents.
 
Knowledge Bases
- Serve embedding and retrieval models used in:
  - Knowledge Base indexing
  - RAG pipelines (see the indexing sketch after this list)
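
As a sketch of the indexing side, the snippet below calls an embedding InferenceService over HTTP and writes the returned vector to a Postgres table with a pgvector column. The endpoint URL, response shape, table, and column names are assumptions for illustration, not a prescribed schema.

```python
# Sketch: call an embedding model endpoint and store the vector in Postgres (pgvector).
# Endpoint, response shape, table, and column names are illustrative placeholders.
import requests
import psycopg2

EMBED_URL = "https://embedder.models.example.com/v1/models/embedder:predict"

def embed(text):
    """Return the embedding vector for a text chunk (V1 protocol assumed)."""
    resp = requests.post(EMBED_URL, json={"instances": [text]}, timeout=30)
    resp.raise_for_status()
    return resp.json()["predictions"][0]

chunk = "Model Serving runs KServe InferenceServices on Kubernetes."
vector = embed(chunk)

conn = psycopg2.connect("dbname=app user=app")  # placeholder connection string
with conn, conn.cursor() as cur:
    cur.execute(
        "INSERT INTO kb_chunks (content, embedding) VALUES (%s, %s::vector)",
        (chunk, "[" + ",".join(str(x) for x in vector) + "]"),
    )
```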
 
Custom applications
- Expose InferenceService endpoints to:
  - Business applications
  - Microservices
  - ETL/ELT pipelines
 
Hybrid + Sovereign AI alignment
- All models run on your infrastructure via the Hybrid Manager KServe layer.
- You control:
  - Which models are deployed
  - Resource allocations
  - Deployment topology
  - Observability and auditing
 
Best practices
- Always deploy models through the Model Library → Model Serving flow to ensure governance.
- Monitor resource consumption — especially GPU utilization.
- Test scale-to-zero policies carefully before production use.
- Use ServingRuntime and ClusterServingRuntime templates for consistency.
- Tag and document production models clearly in the Model Library.
- Audit InferenceService deployments regularly — critical for Sovereign AI (see the audit sketch after this list).
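
A lightweight way to audit deployments is to list InferenceServices across the cluster and record what is serving which model. A minimal sketch with the Kubernetes Python client; the fields read here assume the standard KServe v1beta1 schema:

```python
# Sketch: list all InferenceServices cluster-wide for a periodic audit.
from kubernetes import client, config

config.load_kube_config()

isvcs = client.CustomObjectsApi().list_cluster_custom_object(
    group="serving.kserve.io",
    version="v1beta1",
    plural="inferenceservices",
)

for item in isvcs.get("items", []):
    meta = item["metadata"]
    model = item["spec"].get("predictor", {}).get("model", {})
    conditions = item.get("status", {}).get("conditions", [])
    ready = next((c["status"] for c in conditions if c["type"] == "Ready"), "Unknown")
    fmt = model.get("modelFormat", {}).get("name", "n/a")
    print(f"{meta['namespace']}/{meta['name']}: format={fmt} ready={ready}")
```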
 
In AI Factory
Model Serving powers multiple components in EDB PG AI:
| Component | How it uses Model Serving | 
|---|---|
| Gen AI Builder | Runs LLMs and specialized models for Assistants | 
| Knowledge Bases | Serves embedding and retrieval models for RAG | 
| Custom AI apps | Exposes InferenceService endpoints for business use cases | 
AI Factory manages Model Serving through:
- Integrated Model Library for image management.
- GPU resource management and scheduling.
- Centralized observability and logging.
- Seamless Hybrid Manager integration for lifecycle control.
 
Related topics
- Model Library Explained
- Deploy AI Models
- Deploy an InferenceService
- Verify Model Deployments
- KServe Official Docs
- KServe GitHub Repository
 
Next steps
- Explore available models in your Model Library.
- Deploy your first model using Model Serving.
- Monitor deployed models through Hybrid Manager observability dashboards.
- Explore advanced Model Serving capabilities:
  - Multi-modal pipelines
  - Transformers and Explainers
  - Scale-to-zero policies
 
Model Serving gives you a powerful foundation for building intelligent, governed AI applications — securely and scalably — as part of your Sovereign AI strategy with EDB PG AI.