Model Serving Explained v1.2
Model Serving in AI Factory lets you deploy AI models as scalable, production-grade inference services running on your own Kubernetes infrastructure.
It provides a Kubernetes-native architecture based on KServe, so your models can serve predictions and embeddings over network-accessible REST and gRPC APIs.
AI Factory Model Serving is optimized to support enterprise-class AI workloads with:
- GPU-accelerated infrastructure
- Flexible scaling
- Integrated observability
- Sovereign AI alignment: models run under your governance
- Seamless integration with Gen AI Builder, Knowledge Bases, and other AI Factory pipelines
 
Before you start
Prerequisites for understanding Model Serving:
- Familiarity with Kubernetes basics
- Understanding of KServe and its InferenceService resource
- Awareness of the Model Library → Model Serving workflow in AI Factory
- Understanding of Sovereign AI principles: models running under your governance
 
How Model Serving works
Core stack
| Layer | Purpose | 
|---|---|
| AI Factory | Provides infrastructure and Model Serving APIs | 
| Hybrid Manager Kubernetes Cluster | Hosts model-serving workloads | 
| KServe | Manages model serving lifecycle and APIs | 
| InferenceService | Deployed model resource | 
| Model Library | Manages model image versions | 
| GPU Nodes | Run high-performance model serving pods | 
| User Applications | Call model endpoints via REST/gRPC | 
Key components
- InferenceService: the Kubernetes CRD representing a deployed model.
- ServingRuntime / ClusterServingRuntime: reusable runtime configurations for model containers.
- Model containers: currently focused on NVIDIA NIM containers in AI Factory 1.2.
- Observability: integrated Prometheus-compatible metrics and Kubernetes logging.
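
For orientation, the sketch below shows the general shape of a KServe InferenceService manifest. The deployment name, runtime, model format, storage URI, and annotation are illustrative assumptions, not AI Factory defaults; consult your Model Library entry for real values.

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: llm-chat-demo                  # hypothetical deployment name
  annotations:
    # KServe-documented annotation for Prometheus metric scraping;
    # verify against your KServe version.
    serving.kserve.io/enable-prometheus-scraping: "true"
spec:
  predictor:
    model:
      runtime: nvidia-nim-runtime      # assumed ClusterServingRuntime name (sketched under Best practices)
      modelFormat:
        name: nvidia-nim-llm           # assumed model format label
      storageUri: oci://registry.example.com/models/llm-chat:1.2   # hypothetical Model Library image reference
      resources:
        limits:
          nvidia.com/gpu: "1"          # schedule the serving pod onto a GPU node
```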
 
Supported models
AI Factory Model Serving currently supports NVIDIA NIM containers for:
| Model Type | Example Usage | 
|---|---|
| Text Completion | LLM agents, Assistants | 
| Text Embeddings | Knowledge Bases, RAG | 
| Text Reranking | RAG pipelines | 
| Image Embeddings | Multi-modal search | 
| Image OCR | Document extraction | 
See: Supported Models
Deployment architecture
Applications → Model Endpoints (REST/gRPC) → KServe → GPU-enabled Kubernetes → Model Containers
- Each model is isolated in its own InferenceService.
- KServe manages:
  - Model lifecycle (start, stop, update)
  - Scaling (including scale-to-zero)
  - Endpoint routing (REST/gRPC)
- GPU resources are provisioned and scheduled via Hybrid Manager integration.
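
As a sketch of how those scaling behaviors are expressed, KServe's predictor spec accepts replica bounds and an autoscaling target; setting `minReplicas: 0` enables scale-to-zero in serverless (Knative-backed) deployments. The names and values here are illustrative:

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: embeddings-demo                # hypothetical deployment name
spec:
  predictor:
    minReplicas: 0                     # allow scale-to-zero when the endpoint is idle
    maxReplicas: 4                     # cap replicas to bound GPU consumption
    scaleMetric: concurrency           # autoscale on in-flight requests
    scaleTarget: 10                    # assumed target concurrency per replica
    model:
      runtime: nvidia-nim-runtime      # assumed runtime name, as in the earlier sketch
      modelFormat:
        name: nvidia-nim-embedding     # assumed format label
```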
 
Patterns of use
Gen AI Builder
- LLM endpoints power Assistants and Agents.
- Embedding models support hybrid RAG pipelines.
 
Knowledge Bases
- Embedding models serve vectorization needs.
- Retrieval and reranking models power semantic search pipelines.
 
Custom applications
- Business applications can consume InferenceService endpoints for:
  - Real-time predictions
  - Image analysis
  - Text processing
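
As an illustration of such a call: NVIDIA NIM containers expose OpenAI-compatible HTTP APIs, so an embedding request is a small JSON document POSTed to the service endpoint. The host, path, and model name below are assumptions for illustration; the body is shown in YAML-compatible form but travels as JSON.

```yaml
# POST https://<inference-service-host>/v1/embeddings    (hypothetical endpoint)
# Content-Type: application/json
model: nvidia/nv-embedqa-e5-v5        # hypothetical embedding model identifier
input:
  - "How do I connect a Knowledge Base to Model Serving?"
encoding_format: float                # request raw float vectors
```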
 
Best practices
- Deploy models via the Model Library → Model Serving flow to ensure governance.
- Use ClusterServingRuntime for reusable runtime configs.
- Monitor GPU utilization and model latency closely.
- Test scale-to-zero configurations before relying on them in production.
- Ensure Model Library tags are versioned and documented.
- Regularly audit deployed InferenceServices as part of Sovereign AI governance.
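
To make the runtime-reuse practice concrete, the sketch below shows the general shape of a KServe ClusterServingRuntime. The name, container image, and supported model format are placeholders rather than values shipped with AI Factory:

```yaml
apiVersion: serving.kserve.io/v1alpha1
kind: ClusterServingRuntime
metadata:
  name: nvidia-nim-runtime             # assumed name referenced by the InferenceService sketches above
spec:
  supportedModelFormats:
    - name: nvidia-nim-llm             # assumed format label; must match the InferenceService
      autoSelect: true
  containers:
    - name: kserve-container
      image: nvcr.io/nim/example/llm:1.2   # hypothetical NIM container image
      ports:
        - containerPort: 8000
          protocol: TCP
      resources:
        limits:
          nvidia.com/gpu: "1"
```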
 
Summary
Model Serving in AI Factory provides a robust, scalable architecture for serving production AI models:
- Kubernetes-native serving with KServe
- GPU acceleration and optimized serving runtimes
- Integrated observability and governance
- Tight integration with AI Factory components: Gen AI Builder, Knowledge Bases, and custom AI pipelines
 
Model Serving helps you implement Sovereign AI — with your models, on your infrastructure, under your control.
Next steps
- Deploy your first InferenceService
- Verify deployed models
- Deploy NVIDIA NIM Microservices
- Explore Model Library
 
Model Serving gives you a powerful foundation for building intelligent applications and data products — securely, scalably, and under your governance — as part of EDB Postgres® AI.