Model Serving Explained v1.2
Model Serving in AI Factory lets you deploy AI models as scalable, production-grade inference services running on your own Kubernetes infrastructure.
It provides a Kubernetes-native architecture based on KServe, so your models can serve predictions and embeddings over network-accessible REST and gRPC APIs.
AI Factory Model Serving is optimized to support enterprise-class AI workloads with:
- GPU-accelerated infrastructure
- Flexible scaling
- Integrated observability
- Sovereign AI alignment: models run under your governance
- Seamless integration with Gen AI Builder, Knowledge Bases, and other AI Factory pipelines
 
Before you start
Prerequisites for understanding Model Serving:
- Familiarity with Kubernetes basics
- Understanding of KServe and its InferenceService resource
- Awareness of the Model Library → Model Serving workflow in AI Factory
- Understanding of Sovereign AI principles: models running under your governance
 
How Model Serving works
Core stack
| Layer | Purpose | 
|---|---|
| AI Factory | Provides infrastructure and Model Serving APIs | 
| Hybrid Manager Kubernetes Cluster | Hosts model-serving workloads | 
| KServe | Manages model serving lifecycle and APIs | 
| InferenceService | Deployed model resource | 
| Model Library | Manages model image versions | 
| GPU Nodes | Run high-performance model serving pods | 
| User Applications | Call model endpoints via REST/gRPC | 
Key components
- InferenceService: the Kubernetes CRD representing a deployed model.
- ServingRuntime / ClusterServingRuntime: reusable runtime configurations for model containers.
- Model containers: currently focused on NVIDIA NIM containers in AI Factory 1.2.
- Observability: integrated Prometheus-compatible metrics and Kubernetes logging.
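
For orientation, the sketch below shows the general shape of a KServe InferenceService manifest. The deployment name, runtime, model format, storage URI, and annotation are illustrative assumptions, not AI Factory defaults; consult your Model Library entry for real values.

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: llm-chat-demo                  # hypothetical deployment name
  annotations:
    # KServe-documented annotation for Prometheus metric scraping;
    # verify against your KServe version.
    serving.kserve.io/enable-prometheus-scraping: "true"
spec:
  predictor:
    model:
      runtime: nvidia-nim-runtime      # assumed ClusterServingRuntime name (sketched under Best practices)
      modelFormat:
        name: nvidia-nim-llm           # assumed model format label
      storageUri: oci://registry.example.com/models/llm-chat:1.2   # hypothetical Model Library image reference
      resources:
        limits:
          nvidia.com/gpu: "1"          # schedule the serving pod onto a GPU node
```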
 
Supported models
AI Factory Model Serving currently supports NVIDIA NIM containers for:
| Model Type | Example Usage | 
|---|---|
| Text Completion | LLM agents, Assistants | 
| Text Embeddings | Knowledge Bases, RAG | 
| Text Reranking | RAG pipelines | 
| Image Embeddings | Multi-modal search | 
| Image OCR | Document extraction | 
See: Supported Models
Deployment architecture
Applications → Model Endpoints (REST/gRPC) → KServe → GPU-enabled Kubernetes → Model Containers
- Each model is isolated in its own InferenceService.
- KServe manages:
  - Model lifecycle (start, stop, update)
  - Scaling (including scale-to-zero)
  - Endpoint routing (REST/gRPC)
- GPU resources are provisioned and scheduled via Hybrid Manager integration.
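
As a sketch of how those scaling behaviors are expressed, KServe's predictor spec accepts replica bounds and an autoscaling target; setting `minReplicas: 0` enables scale-to-zero in serverless (Knative-backed) deployments. The names and values here are illustrative:

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: embeddings-demo                # hypothetical deployment name
spec:
  predictor:
    minReplicas: 0                     # allow scale-to-zero when the endpoint is idle
    maxReplicas: 4                     # cap replicas to bound GPU consumption
    scaleMetric: concurrency           # autoscale on in-flight requests
    scaleTarget: 10                    # assumed target concurrency per replica
    model:
      runtime: nvidia-nim-runtime      # assumed runtime name, as in the earlier sketch
      modelFormat:
        name: nvidia-nim-embedding     # assumed format label
```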
 
Patterns of use
Gen AI Builder
- LLM endpoints power Assistants and Agents.
- Embedding models support hybrid RAG pipelines.
 
Knowledge Bases
- Embedding models serve vectorization needs.
- Retrieval and reranking models power semantic search pipelines.
 
Custom applications
- Business applications can consume InferenceService endpoints for:
  - Real-time predictions
  - Image analysis
  - Text processing
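
As an illustration of such a call: NVIDIA NIM containers expose OpenAI-compatible HTTP APIs, so an embedding request is a small JSON document POSTed to the service endpoint. The host, path, and model name below are assumptions for illustration; the body is shown in YAML-compatible form but travels as JSON.

```yaml
# POST https://<inference-service-host>/v1/embeddings    (hypothetical endpoint)
# Content-Type: application/json
model: nvidia/nv-embedqa-e5-v5        # hypothetical embedding model identifier
input:
  - "How do I connect a Knowledge Base to Model Serving?"
encoding_format: float                # request raw float vectors
```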
 
Best practices
- Deploy models via the Model Library → Model Serving flow to ensure governance.
- Use ClusterServingRuntime for reusable runtime configs.
- Monitor GPU utilization and model latency closely.
- Test scale-to-zero configurations before relying on them in production.
- Ensure Model Library tags are versioned and documented.
- Regularly audit deployed InferenceServices as part of Sovereign AI governance.
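
To make the runtime-reuse practice concrete, the sketch below shows the general shape of a KServe ClusterServingRuntime. The name, container image, and supported model format are placeholders rather than values shipped with AI Factory:

```yaml
apiVersion: serving.kserve.io/v1alpha1
kind: ClusterServingRuntime
metadata:
  name: nvidia-nim-runtime             # assumed name referenced by the InferenceService sketches above
spec:
  supportedModelFormats:
    - name: nvidia-nim-llm             # assumed format label; must match the InferenceService
      autoSelect: true
  containers:
    - name: kserve-container
      image: nvcr.io/nim/example/llm:1.2   # hypothetical NIM container image
      ports:
        - containerPort: 8000
          protocol: TCP
      resources:
        limits:
          nvidia.com/gpu: "1"
```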
 
Summary
Model Serving in AI Factory provides a robust, scalable architecture for serving production AI models:
- Kubernetes-native serving with KServe
- GPU acceleration and optimized serving runtimes
- Integrated observability and governance
- Tight integration with AI Factory components: Gen AI Builder, Knowledge Bases, and custom AI pipelines
 
Model Serving helps you implement Sovereign AI — with your models, on your infrastructure, under your control.
Next steps
- Deploy your first InferenceService
- Verify deployed models
- Deploy NVIDIA NIM Microservices
- Explore Model Library
 
Model Serving gives you a powerful foundation for building intelligent applications and data products — securely, scalably, and under your governance — as part of EDB Postgres® AI.