Model Serving Concepts v1.2
Model Serving is a core capability of EDB Postgres® AI (EDB PG AI), enabling scalable, flexible, and high-performance serving of AI/ML models on Kubernetes.
It powers:
- Gen AI applications
- Intelligent retrieval systems
- Advanced data pipelines
 
Model Serving is implemented using KServe, an open-source, Kubernetes-native platform for standardized model inference. AI Factory integrates KServe with Hybrid Manager to provide enterprise-grade lifecycle management, security, and observability.
Key to Sovereign AI: Models run on your Kubernetes clusters, under your control, with full observability and governance.
Before you start
Prerequisites for understanding Model Serving:
- Familiarity with Kubernetes basics (pods, services, deployments)
- Understanding of InferenceService as a Kubernetes CRD
- Awareness of Model Library and how models are registered and deployed
- Understanding of Sovereign AI principles in EDB PG AI
 
Why it matters
Model Serving enables your AI models and Postgres data to work together seamlessly — securely and scalably — under Sovereign AI principles:
- Deploy open-source or commercial models to your Kubernetes cluster.
- Serve Gen AI models for Assistants, Knowledge Bases, and RAG pipelines.
- Support multi-modal retrieval (text, image, hybrid search).
- Optimize performance with GPU acceleration and server-side batching.
- Maintain full observability, auditing, and governance over model usage.
 
See also: Hybrid Manager Model Serving integration in production environments.
Core concepts
InferenceService (via KServe)
At the core of Model Serving is the InferenceService — a Kubernetes-native resource that represents a deployed model.
It defines the end-to-end serving pipeline (a minimal example follows this list):
- Predictor — Runs the model server and handles inference.
- Transformer (optional) — Applies pre-processing or post-processing.
- Explainer (optional) — Provides model explainability outputs.
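
To make this concrete, here is a minimal sketch of an InferenceService created with the Kubernetes Python client. The namespace, service name, model format, and storage URI are illustrative placeholders; in EDB PG AI you would normally create this resource through the Model Library → Model Serving flow rather than applying it by hand.

```python
# Minimal sketch: create a KServe InferenceService with only a Predictor.
# Namespace, name, and storageUri are illustrative placeholders.
from kubernetes import client, config

config.load_kube_config()  # use load_incluster_config() when running inside the cluster

inference_service = {
    "apiVersion": "serving.kserve.io/v1beta1",
    "kind": "InferenceService",
    "metadata": {"name": "sentiment-demo", "namespace": "models"},
    "spec": {
        "predictor": {
            "model": {
                "modelFormat": {"name": "sklearn"},
                "storageUri": "s3://my-bucket/models/sentiment/",  # placeholder
            }
        }
        # Optional "transformer" and "explainer" sections add the
        # pre/post-processing and explainability containers described above.
    },
}

client.CustomObjectsApi().create_namespaced_custom_object(
    group="serving.kserve.io",
    version="v1beta1",
    namespace="models",
    plural="inferenceservices",
    body=inference_service,
)
```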
 
Predictor
The Predictor defines:
- Model format — PyTorch, TensorFlow, ONNX, Triton, etc.
- Model location — S3-compatible storage, OCI registry, PVC.
- Resources — CPU, memory, GPU.
- Autoscaling — Policies for elastic scaling, including scale-to-zero (see the sketch after this list).
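
The bullets above map onto fields of the predictor spec roughly as shown below. All concrete values (framework, bucket path, GPU count, replica bounds) are assumptions rather than recommendations, and scale-to-zero assumes the Knative-backed serverless deployment mode.

```python
# Illustrative predictor spec covering format, location, resources, and autoscaling.
# Every concrete value here is a placeholder.
predictor_spec = {
    "minReplicas": 0,   # scale-to-zero when idle (serverless/Knative mode)
    "maxReplicas": 4,   # upper bound for elastic scaling
    "model": {
        "modelFormat": {"name": "pytorch"},                 # model format
        "storageUri": "s3://my-bucket/models/classifier/",  # model location (placeholder)
        "resources": {                                      # CPU, memory, GPU
            "requests": {"cpu": "2", "memory": "8Gi", "nvidia.com/gpu": "1"},
            "limits": {"cpu": "4", "memory": "16Gi", "nvidia.com/gpu": "1"},
        },
    },
}
```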
 
ServingRuntime / ClusterServingRuntime
Reusable runtime definitions for serving:
- ServingRuntime — Namespace-scoped.
- ClusterServingRuntime — Cluster-wide reusable runtimes.
 
Benefits:
- Tailor runtime settings to model type and hardware.
- Standardize runtime configurations across teams and projects (see the runtime sketch after this list).
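
As a rough illustration, the sketch below defines a namespace-scoped ServingRuntime and a predictor that selects it by name. The runtime name, container image, arguments, and supported model format are assumptions for illustration only.

```python
# Sketch: a namespace-scoped ServingRuntime and a predictor that references it.
# Names, image tag, and arguments are illustrative placeholders.
serving_runtime = {
    "apiVersion": "serving.kserve.io/v1alpha1",
    "kind": "ServingRuntime",  # "ClusterServingRuntime" for a cluster-wide runtime
    "metadata": {"name": "team-triton-runtime", "namespace": "models"},
    "spec": {
        "supportedModelFormats": [{"name": "onnx", "version": "1", "autoSelect": True}],
        "containers": [
            {
                "name": "kserve-container",
                "image": "nvcr.io/nvidia/tritonserver:23.10-py3",  # placeholder tag
                "args": ["tritonserver", "--model-repository=/mnt/models"],
            }
        ],
    },
}

# An InferenceService predictor can then select the runtime by name:
predictor = {
    "model": {
        "modelFormat": {"name": "onnx"},
        "runtime": "team-triton-runtime",
        "storageUri": "s3://my-bucket/models/onnx-encoder/",  # placeholder
    }
}
```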
 
How it works
Lifecycle flow
1. Register the model image in the Model Library.
2. Deploy the model via the Model Serving console or CLI, which creates an InferenceService.
3. AI Factory and KServe provision the Kubernetes resources (pods, services).
4. The model is loaded into the runtime container.
5. A Kubernetes service endpoint is exposed.
6. Clients send HTTP/gRPC inference requests (see the example after this list).
7. Requests may pass through Transformers and Explainers.
8. The inference response is returned.
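
Step 6 is an ordinary HTTP call. The sketch below uses the KServe V1 REST protocol; the service URL (normally taken from the InferenceService status) and the input shape are placeholders that depend on your model and ingress setup.

```python
# Sketch: send a V1-protocol inference request to a deployed InferenceService.
# URL, model name, and payload shape are placeholders.
import requests

service_url = "https://sentiment-demo.models.example.com"  # e.g. from the InferenceService status.url
payload = {"instances": [[0.1, 0.4, 0.7, 0.2]]}            # input shape depends on the model

resp = requests.post(f"{service_url}/v1/models/sentiment-demo:predict", json=payload, timeout=30)
resp.raise_for_status()
print(resp.json()["predictions"])
```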
 
Key features
- Multi-framework support — PyTorch, TensorFlow, ONNX, XGBoost, Triton, and more.
- GPU acceleration — Native NVIDIA GPU support.
- Autoscaling — Including scale-to-zero via Knative.
- Observability — Prometheus metrics, Kubernetes logging, and Hybrid Manager dashboards (see the status-check sketch after this list).
- Batching — Server-side batching for improved throughput.
- Explainability — Support for model explainability tooling.
- Security and Sovereign AI — Models run on your Kubernetes clusters under your control.
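
One simple observability check is reading the status KServe maintains on the InferenceService itself: the endpoint URL and readiness conditions. A minimal sketch with the Kubernetes Python client, reusing the placeholder name and namespace from the earlier example:

```python
# Sketch: read a deployed InferenceService's endpoint URL and Ready condition.
from kubernetes import client, config

config.load_kube_config()

isvc = client.CustomObjectsApi().get_namespaced_custom_object(
    group="serving.kserve.io",
    version="v1beta1",
    namespace="models",
    plural="inferenceservices",
    name="sentiment-demo",
)

status = isvc.get("status", {})
ready = next((c["status"] for c in status.get("conditions", []) if c["type"] == "Ready"), "Unknown")
print("URL:  ", status.get("url"))
print("Ready:", ready)
```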
 
Patterns of use
Gen AI Builder
- Serve LLMs and multi-modal models powering Assistants and Agents.
 
Knowledge Bases
- Serve embedding and retrieval models used in:
  - Knowledge Base indexing
  - RAG pipelines (see the indexing sketch after this list)
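
As a sketch of the indexing side, the snippet below calls an embedding InferenceService over HTTP and writes the returned vector to a Postgres table with a pgvector column. The endpoint URL, response shape, table, and column names are assumptions for illustration, not a prescribed schema.

```python
# Sketch: call an embedding model endpoint and store the vector in Postgres (pgvector).
# Endpoint, response shape, table, and column names are illustrative placeholders.
import requests
import psycopg2

EMBED_URL = "https://embedder.models.example.com/v1/models/embedder:predict"

def embed(text):
    """Return the embedding vector for a text chunk (V1 protocol assumed)."""
    resp = requests.post(EMBED_URL, json={"instances": [text]}, timeout=30)
    resp.raise_for_status()
    return resp.json()["predictions"][0]

chunk = "Model Serving runs KServe InferenceServices on Kubernetes."
vector = embed(chunk)

conn = psycopg2.connect("dbname=app user=app")  # placeholder connection string
with conn, conn.cursor() as cur:
    cur.execute(
        "INSERT INTO kb_chunks (content, embedding) VALUES (%s, %s::vector)",
        (chunk, "[" + ",".join(str(x) for x in vector) + "]"),
    )
```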
 
Custom applications
- Expose InferenceService endpoints to:
  - Business applications
  - Microservices
  - ETL/ELT pipelines
 
Hybrid + Sovereign AI alignment
- All models run on your infrastructure via the Hybrid Manager KServe layer.
- You control:
  - Which models are deployed
  - Resource allocations
  - Deployment topology
  - Observability and auditing
 
Best practices
- Always deploy models through the Model Library → Model Serving flow to ensure governance.
- Monitor resource consumption — especially GPU utilization.
- Test scale-to-zero policies carefully before production use.
- Use ServingRuntime and ClusterServingRuntime templates for consistency.
- Tag and document production models clearly in the Model Library.
- Audit InferenceService deployments regularly — critical for Sovereign AI (see the audit sketch after this list).
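
A lightweight way to audit deployments is to list InferenceServices across the cluster and record what is serving which model. A minimal sketch with the Kubernetes Python client; the fields read here assume the standard KServe v1beta1 schema:

```python
# Sketch: list all InferenceServices cluster-wide for a periodic audit.
from kubernetes import client, config

config.load_kube_config()

isvcs = client.CustomObjectsApi().list_cluster_custom_object(
    group="serving.kserve.io",
    version="v1beta1",
    plural="inferenceservices",
)

for item in isvcs.get("items", []):
    meta = item["metadata"]
    model = item["spec"].get("predictor", {}).get("model", {})
    conditions = item.get("status", {}).get("conditions", [])
    ready = next((c["status"] for c in conditions if c["type"] == "Ready"), "Unknown")
    fmt = model.get("modelFormat", {}).get("name", "n/a")
    print(f"{meta['namespace']}/{meta['name']}: format={fmt} ready={ready}")
```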
 
In AI Factory
Model Serving powers multiple components in EDB PG AI:
| Component | How it uses Model Serving | 
|---|---|
| Gen AI Builder | Runs LLMs and specialized models for Assistants | 
| Knowledge Bases | Serves embedding and retrieval models for RAG | 
| Custom AI apps | Exposes InferenceService endpoints for business use cases | 
AI Factory manages Model Serving through:
- Integrated Model Library for image management.
- GPU resource management and scheduling.
- Centralized observability and logging.
- Seamless Hybrid Manager integration for lifecycle control.
 
Related topics
- Model Library Explained
- Deploy AI Models
- Deploy an InferenceService
- Verify Model Deployments
- KServe Official Docs
- KServe GitHub Repository
 
Next steps
- Explore available models in your Model Library.
- Deploy your first model using Model Serving.
- Monitor deployed models through Hybrid Manager observability dashboards.
- Explore advanced Model Serving capabilities:
  - Multi-modal pipelines
  - Transformers and Explainers
  - Scale-to-zero policies
 
Model Serving gives you a powerful foundation for building intelligent, governed AI applications — securely and scalably — as part of your Sovereign AI strategy with EDB PG AI.