Gen AI Factory LLM Architecture (Library and Serving Overview) v1.3
Understanding EDB AI Factory's Model Library and LLM Serving Components
This document provides a conceptual framework for understanding how Model Library and Model Serving components work together within AI Factory to deliver enterprise-grade Large Language Model (LLM) inference capabilities. Organizations implementing AI Factory require this architectural understanding to make informed decisions about LLM deployment strategies, resource allocation, and governance frameworks that align with their operational requirements and compliance obligations.
Purpose and Scope
AI Factory's LLM serving architecture enables organizations to deploy and manage Large Language Models at scale while maintaining governance, security, and operational control. The system addresses the critical gap between experimental LLM development and production-ready inference services, where traditional approaches often fail to meet enterprise requirements for reliability, security, and compliance. Organizations benefit from reduced operational complexity, improved GPU resource utilization, and consistent policy enforcement across their entire LLM lifecycle. The architecture does this through two primary components: the Model Library for curation and the Model Serving infrastructure for runtime operations.
Architectural Principles
Separation of Concerns
The architecture distinguishes between LLM storage/curation (Model Library) and LLM execution (Model Serving), enabling independent scaling and management of each concern. This separation allows platform teams to optimize storage strategies without impacting running inference services, while AI teams can focus on LLM performance without worrying about underlying infrastructure complexity. Organizations experience improved system stability because failures in one component don't cascade to others, and teams can implement changes to curation policies or serving configurations independently based on their specific operational cycles.
Governance Integration
Security and compliance controls are embedded throughout the LLM pipeline rather than added as afterthoughts, ensuring consistent policy enforcement from image ingestion to production inference. This approach eliminates the common problem of governance gaps that occur when security measures are applied retroactively to existing systems. Platform administrators benefit from centralized policy management that automatically applies to all LLMs, while compliance teams gain comprehensive audit trails and enforcement mechanisms that meet regulatory requirements without manual intervention.
Resource Optimization
GPU resources and compute infrastructure are abstracted from individual LLM deployments, allowing efficient resource sharing and dynamic allocation based on workload demands. This abstraction prevents the common scenario where expensive GPU resources remain underutilized because they're statically allocated to specific LLMs. Organizations achieve significant cost savings through improved resource utilization while maintaining performance guarantees, and development teams benefit from simplified deployment processes that don't require detailed infrastructure knowledge.
Control and Data Flow
Container Registry → Image & Model Library → Model Library (curated)
        ↓
KServe (ServingRuntime)
        ↓
InferenceService endpoint (internal & external access)
Flow Components
- LLM Image Synchronization: LLM container images are synchronized from registries into the Image & Model Library based on repository rules that define which repositories, tags, and LLM types to ingest
- Curation Process: Model Library exposes a curated subset of those LLM images for serving through governed selection processes
- Deployment Configuration: Operators select a ServingRuntime optimized for LLM frameworks and deploy an InferenceService with the chosen LLM image
- Access Management: Applications invoke LLMs via internal DNS for cluster communication or external portal endpoints with access keys, depending on network and security requirements
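
As a rough illustration of the final step in this flow, the sketch below looks up the endpoint that KServe publishes for a deployed InferenceService. It assumes a cluster running KServe and uses placeholder names (an `llm-serving` namespace and a `llama-chat` service); it is one possible way to discover the endpoint, not a prescribed procedure.

```python
# Minimal sketch: look up the endpoint URL that KServe publishes for an
# InferenceService. Namespace and service name below are placeholders.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside the cluster
api = client.CustomObjectsApi()

isvc = api.get_namespaced_custom_object(
    group="serving.kserve.io",
    version="v1beta1",
    namespace="llm-serving",        # placeholder namespace
    plural="inferenceservices",
    name="llama-chat",              # placeholder InferenceService name
)

# KServe reports the routable URL and readiness conditions under .status
status = isvc.get("status", {})
print("URL:", status.get("url"))
print("Conditions:", [f"{c.get('type')}={c.get('status')}"
                      for c in status.get("conditions", [])])
```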
Core Components
Image & Model Library
The foundational registry system that synchronizes LLM container images from external sources and maintains a centralized catalog of available language models. This component serves as the authoritative source for all LLM images within the organization, eliminating the chaos that typically results from teams maintaining separate, disconnected model repositories. Platform teams gain complete visibility into LLM inventory across the organization, while AI teams benefit from streamlined access to approved models without navigating complex approval processes for each deployment.
The Image & Model Library continuously monitors configured external registries and automatically ingests new LLM versions based on defined criteria, ensuring teams always have access to the latest approved language models without manual intervention. This automation reduces the operational burden on platform teams while ensuring consistent application of organizational policies across all ingested content.
Key Functions:
- Automated synchronization from configured container registries reduces manual overhead while ensuring consistency across LLM versions and sources
- Repository rule enforcement for selective LLM ingestion prevents unauthorized or non-compliant models from entering the organization's deployment pipeline
- Metadata management and version tracking provides complete lineage information that supports both operational troubleshooting and compliance reporting requirements
- Security scanning and vulnerability assessment automatically identifies potential risks before LLMs reach production environments
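
To make the repository rules mentioned above concrete, here is a hypothetical sketch of a rule filter. The `RepoRule` shape and the `matches` logic are illustrative assumptions only and do not reflect AI Factory's actual rule configuration format.

```python
# Hypothetical sketch of a repository rule filter. The RepoRule shape and the
# matches() logic are illustrative only, not AI Factory's real rule schema.
from dataclasses import dataclass
from fnmatch import fnmatch

@dataclass
class RepoRule:
    registry: str        # e.g. "nvcr.io"
    repository: str      # glob over repository paths, e.g. "nim/meta/*"
    tag_pattern: str     # glob over tags, e.g. "1.*"

def matches(rule: RepoRule, registry: str, repository: str, tag: str) -> bool:
    """Decide whether an image reference should be ingested under this rule."""
    return (
        registry == rule.registry
        and fnmatch(repository, rule.repository)
        and fnmatch(tag, rule.tag_pattern)
    )

rules = [RepoRule("nvcr.io", "nim/meta/*", "1.*")]
print(matches(rules[0], "nvcr.io", "nim/meta/llama-3.1-8b-instruct", "1.3.0"))  # True
```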
Model Library (Curated View)
A governed subset of the Image & Model Library that presents only approved LLMs for production deployment, implementing organizational policies and compliance requirements. This curated view transforms the comprehensive LLM catalog into a focused collection of language models that meet specific organizational standards for security, performance, and compliance. Development teams benefit from simplified LLM selection because any model they choose from the curated library is already known to meet organizational requirements.
The curation process acts as a quality gate that prevents problematic LLMs from reaching production while maintaining development velocity for teams working with approved models. Platform administrators can implement sophisticated approval workflows that automatically evaluate LLMs against multiple criteria, reducing manual review burden while maintaining strict quality standards.
Governance Controls:
- Approval workflows for LLM promotion ensure that only models meeting organizational standards reach production environments, reducing risk while maintaining development agility
- Security policy enforcement automatically applies consistent security standards across all LLMs without requiring individual team expertise in security best practices
- Performance benchmarking requirements establish minimum performance thresholds that ensure production LLMs meet operational requirements before deployment
- Compliance validation checks automatically verify that LLMs meet industry-specific regulatory requirements, reducing compliance risk and audit preparation overhead
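
A hypothetical promotion gate combining checks like those listed above might look like the sketch below; the criteria, field names, and thresholds are illustrative assumptions rather than the platform's built-in workflow.

```python
# Hypothetical promotion gate: evaluates a candidate LLM image against
# governance criteria like those above. Fields and thresholds are assumptions.
from dataclasses import dataclass

@dataclass
class Candidate:
    name: str
    critical_cves: int          # from the security scan
    p95_latency_ms: float       # from the performance benchmark
    license_approved: bool      # from compliance review

def approve_for_serving(c: Candidate, max_p95_ms: float = 2000.0) -> bool:
    """Return True only if every governance check passes."""
    return (c.critical_cves == 0
            and c.p95_latency_ms <= max_p95_ms
            and c.license_approved)

candidate = Candidate("llama-3.1-8b-instruct", critical_cves=0,
                      p95_latency_ms=850.0, license_approved=True)
print(approve_for_serving(candidate))  # True -> eligible for the curated library
```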
LLM Serving Infrastructure
KServe-based deployment system that transforms approved LLM images into scalable inference endpoints with enterprise-grade operational characteristics. This infrastructure handles the complex orchestration required to deploy Large Language Models reliably at scale, including health monitoring, auto-scaling, and GPU resource management. Development teams benefit from simplified deployment processes that abstract away infrastructure complexity, while operations teams gain comprehensive monitoring and management capabilities for all deployed LLMs.
The serving infrastructure provides consistent operational characteristics across different LLM types and frameworks, ensuring that teams can apply standard operational procedures regardless of the underlying model architecture. This consistency reduces operational complexity while improving reliability through standardized monitoring, logging, and recovery procedures.
Core Resources:
- ServingRuntime: LLM-specific execution environments (vLLM, NIM, TensorRT-LLM) provide optimized runtime characteristics for different language model types while abstracting framework-specific configuration complexity from deployment teams
- InferenceService: Production LLM endpoints with auto-scaling capabilities automatically adjust resource allocation based on demand patterns, ensuring optimal performance while minimizing costs
- Resource Management: GPU allocation and compute optimization systems ensure efficient utilization of expensive hardware resources across multiple LLMs and workloads
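
A minimal deployment sketch is shown below. The field names follow KServe's v1beta1 InferenceService schema, but the namespace, runtime name, model format, storage URI, and resource values are placeholders; consult the platform's implementation guides for the exact manifest your environment expects.

```python
# Minimal sketch: create a KServe InferenceService for an approved LLM.
# Field names follow KServe's v1beta1 schema; runtime name, storageUri, and
# resource values are placeholders to adapt to your environment.
from kubernetes import client, config

config.load_kube_config()
api = client.CustomObjectsApi()

inference_service = {
    "apiVersion": "serving.kserve.io/v1beta1",
    "kind": "InferenceService",
    "metadata": {"name": "llama-chat", "namespace": "llm-serving"},
    "spec": {
        "predictor": {
            "minReplicas": 1,
            "maxReplicas": 3,
            "model": {
                "modelFormat": {"name": "huggingface"},   # placeholder format
                "runtime": "vllm-runtime",                # a ServingRuntime from the library
                "storageUri": "oci://registry.example.com/llm/llama-3.1-8b:1.0",  # placeholder
                "resources": {
                    "limits": {"nvidia.com/gpu": "1"},
                    "requests": {"cpu": "4", "memory": "24Gi"},
                },
            },
        }
    },
}

api.create_namespaced_custom_object(
    group="serving.kserve.io",
    version="v1beta1",
    namespace="llm-serving",
    plural="inferenceservices",
    body=inference_service,
)
```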
Access Management
Multi-layered access control system that provides both internal cluster communication and external API access while maintaining security boundaries. This system addresses the complex security requirements of production LLM systems where models may need to serve both internal microservices and external applications with different security profiles. Platform administrators benefit from centralized access control that consistently applies security policies across all access patterns, while development teams gain flexible access options that support diverse integration requirements.
The access management system provides granular control over who can access which LLMs under what circumstances, supporting complex organizational requirements while maintaining operational simplicity. This approach eliminates common security gaps that occur when access controls are applied inconsistently across different access methods.
Access Patterns:
- Internal Access: Cluster-local DNS for service-to-service communication provides high-performance, low-latency access for applications running within the same infrastructure while maintaining network-level security isolation
- External Access: Portal-based endpoints with authentication and authorization enable secure integration with external applications and third-party systems while maintaining comprehensive audit trails and access control
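
The two access patterns can be compared side by side in the sketch below. It assumes the served runtime exposes an OpenAI-compatible chat completions path (as vLLM-based runtimes typically do); the cluster DNS name, portal URL, model name, and access-key header are placeholders for your environment.

```python
# Sketch of the two access patterns. URLs, model name, and the access-key
# header are placeholders; the OpenAI-compatible path is an assumption that
# holds for vLLM-style runtimes.
import requests

payload = {
    "model": "llama-3.1-8b-instruct",
    "messages": [{"role": "user", "content": "Summarize our returns policy."}],
}

# 1) Internal access: cluster-local DNS, no traversal of the external gateway.
internal = requests.post(
    "http://llama-chat.llm-serving.svc.cluster.local/v1/chat/completions",
    json=payload, timeout=60,
)

# 2) External access: portal endpoint secured with an access key.
external = requests.post(
    "https://ai-portal.example.com/inference/llama-chat/v1/chat/completions",
    json=payload,
    headers={"Authorization": "Bearer <access-key>"},   # key issued by the portal
    timeout=60,
)

print(internal.status_code, external.status_code)
```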
System Flow
LLM Ingestion Pipeline
LLM container images flow from external registries into the Image & Model Library through automated synchronization processes governed by repository rules. These rules determine which repositories, namespaces, and tags to monitor and ingest, providing organizations with precise control over their LLM pipeline while automating routine synchronization tasks. Platform teams benefit from reduced manual overhead in managing LLM updates, while maintaining strict control over which models enter their environment. AI teams gain access to approved language models more quickly because the automated pipeline eliminates manual approval bottlenecks for models that meet predefined criteria.
The ingestion pipeline provides comprehensive logging and monitoring that enables platform teams to track LLM provenance and identify potential issues before they impact production systems. This visibility supports both operational troubleshooting and compliance reporting requirements.
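
As a hypothetical illustration of the provenance tracking described above, the sketch below builds a minimal lineage record for each synchronized image; the field set is an assumption for illustration, not the Image & Model Library's actual metadata schema.

```python
# Hypothetical lineage record written for each synchronized LLM image.
# The field set is illustrative, not the Image & Model Library's real schema.
import json
from datetime import datetime, timezone

def lineage_entry(source_registry: str, repository: str, tag: str,
                  digest: str, rule_name: str) -> str:
    """Serialize a minimal provenance record for one ingested image."""
    return json.dumps({
        "source_registry": source_registry,
        "repository": repository,
        "tag": tag,
        "digest": digest,                  # immutable content address of the image
        "ingested_by_rule": rule_name,     # which repository rule matched
        "synced_at": datetime.now(timezone.utc).isoformat(),
    })

print(lineage_entry("nvcr.io", "nim/meta/llama-3.1-8b-instruct", "1.3.0",
                    "sha256:0123abcd...", "nim-llm-sync"))
```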
Curation Process
The Model Library applies organizational governance policies to filter the complete LLM catalog, presenting only approved models for deployment. This curation process includes security scanning, performance validation, and compliance verification that automatically evaluates LLMs against organizational standards without requiring manual review for every model version. The process reduces time-to-production for compliant LLMs while maintaining strict quality gates that prevent problematic models from reaching production environments.
Organizations benefit from consistent application of governance policies across all LLMs, regardless of their source or development team. This consistency reduces compliance risk while enabling teams to move quickly with models that meet organizational standards.
Deployment Workflow
Approved LLMs are deployed through the Model Serving infrastructure, where ServingRuntime configurations define the execution environment and InferenceService specifications determine scaling behavior and resource requirements. This workflow abstracts the complexity of production deployment while providing comprehensive control over operational characteristics. Development teams benefit from simplified deployment processes that don't require deep infrastructure expertise, while platform teams maintain control over resource allocation and operational policies.
The deployment workflow provides automated validation and rollback capabilities that reduce deployment risk while maintaining development velocity. Teams can deploy LLMs with confidence because the system automatically verifies operational readiness before directing production traffic to new deployments.
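
One way such a gradual rollout can be expressed, assuming KServe's `canaryTrafficPercent` field and placeholder resource names, is sketched below.

```python
# Sketch: shift a fraction of traffic to the newest InferenceService revision
# before fully promoting it. Uses KServe's canaryTrafficPercent field;
# namespace and name are placeholders.
from kubernetes import client, config

config.load_kube_config()
api = client.CustomObjectsApi()

patch = {"spec": {"predictor": {"canaryTrafficPercent": 10}}}  # 10% to the new revision

api.patch_namespaced_custom_object(
    group="serving.kserve.io",
    version="v1beta1",
    namespace="llm-serving",
    plural="inferenceservices",
    name="llama-chat",
    body=patch,
)
# Once validation succeeds, raise the percentage (or remove the field) to
# promote the new revision; setting it to 0 routes traffic back to the
# previous revision.
```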
Runtime Operations
Production LLM inference endpoints operate under continuous monitoring with automated scaling, health checking, and performance optimization. The system maintains operational metrics and provides observability into model performance and resource utilization, enabling proactive management of production systems. Operations teams benefit from comprehensive visibility into system health and performance, while development teams gain insights into LLM behavior under production conditions.
The runtime operations framework provides automated recovery capabilities that minimize service disruptions while maintaining detailed audit trails for troubleshooting and compliance purposes. This automation reduces operational burden while improving system reliability through consistent application of operational best practices.
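
A minimal health-and-metrics probe is sketched below. It assumes the runtime exposes a liveness route and a Prometheus-style `/metrics` endpoint (vLLM-based runtimes do); the cluster DNS name is a placeholder, and the exact metric names depend on the runtime.

```python
# Sketch: poll a served LLM endpoint for liveness and scrape its
# Prometheus-style metrics. The URL is a placeholder; route availability
# depends on the runtime.
import requests

base = "http://llama-chat.llm-serving.svc.cluster.local"

health = requests.get(f"{base}/health", timeout=5)     # vLLM-style liveness probe
metrics = requests.get(f"{base}/metrics", timeout=5)   # Prometheus exposition format

print("healthy:", health.ok)
for line in metrics.text.splitlines():
    if line.startswith("vllm:"):
        print(line)   # e.g. running/waiting request gauges, token throughput
```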
Resource Management
GPU Infrastructure for LLMs
The architecture abstracts GPU resources from individual LLM deployments, enabling efficient resource sharing and dynamic allocation based on demand patterns. This abstraction addresses the common challenge of GPU underutilization that occurs when resources are statically allocated to specific LLMs, regardless of actual demand. Organizations achieve significant cost savings through improved resource utilization while maintaining performance guarantees for production workloads.
Platform teams benefit from centralized resource management that optimizes utilization across all deployed LLMs, while development teams gain access to GPU resources without needing to understand complex allocation strategies. The system automatically handles resource scheduling and allocation based on workload characteristics and organizational priorities.
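
As a small illustration, the sketch below surveys the GPU capacity the scheduler can draw from, assuming nodes advertise GPUs through the NVIDIA device plugin's `nvidia.com/gpu` resource.

```python
# Sketch: list schedulable GPU capacity across cluster nodes. Assumes the
# NVIDIA device plugin exposes GPUs as the nvidia.com/gpu resource.
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

for node in v1.list_node().items:
    allocatable = node.status.allocatable or {}
    gpus = allocatable.get("nvidia.com/gpu", "0")
    if gpus != "0":
        print(f"{node.metadata.name}: {gpus} allocatable GPU(s)")
```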
LLM Scaling Strategies
Auto-scaling capabilities adjust compute resources based on traffic patterns and performance requirements, ensuring cost-effective operations while maintaining service levels. The scaling system responds to observed traffic patterns and adjusts resources to meet demand, reducing both costs and latency for end users. Organizations benefit from optimized resource costs without compromising performance, while development teams can focus on LLM development without worrying about operational scaling concerns.
The scaling strategies support both predictable workload patterns and sudden traffic spikes, ensuring consistent performance across diverse usage scenarios. This flexibility enables organizations to serve LLMs efficiently regardless of demand variability.
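
A sketch of adjusting scaling bounds for a served LLM is shown below. The field names follow KServe's v1beta1 component spec, the concurrency target assumes a Knative-backed autoscaler, and the concrete values are placeholders to tune against your own traffic.

```python
# Sketch: tune autoscaling bounds for a served LLM. Field names follow KServe's
# v1beta1 component spec; the concrete values are placeholders to benchmark.
from kubernetes import client, config

config.load_kube_config()
api = client.CustomObjectsApi()

patch = {
    "spec": {
        "predictor": {
            "minReplicas": 1,        # keep one warm replica to avoid cold starts
            "maxReplicas": 6,        # cap GPU spend during traffic spikes
            "scaleMetric": "concurrency",
            "scaleTarget": 4,        # target in-flight requests per replica
        }
    }
}

api.patch_namespaced_custom_object(
    group="serving.kserve.io", version="v1beta1",
    namespace="llm-serving", plural="inferenceservices",
    name="llama-chat", body=patch,
)
```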
Performance Optimization for LLMs
The system incorporates LLM-specific optimizations including batch processing, quantization, and tensor parallelism to maximize throughput and minimize latency. These optimizations are applied automatically based on model characteristics and workload patterns, ensuring optimal performance without requiring specialized expertise from development teams. Organizations benefit from improved LLM performance and reduced infrastructure costs through efficient resource utilization.
Performance optimizations are continuously evaluated and adjusted based on actual workload characteristics, ensuring that optimization strategies evolve with changing requirements and model characteristics. This adaptive approach maintains optimal performance as workloads and LLMs evolve over time.
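
To show how these optimizations surface in a runtime configuration, the list below uses vLLM-style engine arguments; the flags exist in current vLLM releases, but the values are placeholders to tune per model and GPU, and other runtimes expose equivalent knobs under different names.

```python
# Sketch: vLLM-style engine arguments that a ServingRuntime might pass to apply
# the optimizations above. Values are placeholders to tune per model and GPU.
vllm_args = [
    "--model", "/mnt/models",                 # model weights mounted by the platform
    "--tensor-parallel-size", "2",            # shard the model across 2 GPUs
    "--quantization", "awq",                  # serve quantized weights if available
    "--max-num-batched-tokens", "8192",       # continuous-batching budget per step
    "--gpu-memory-utilization", "0.90",       # leave headroom for activation memory
]
print(" ".join(vllm_args))
```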
Security and Governance
Access Control Framework
Multi-tiered security model that enforces authentication and authorization at multiple levels, from registry access through production inference endpoints. This framework addresses the complex security requirements of production LLM systems where different users and applications require different levels of access to various models and capabilities. Platform administrators benefit from centralized policy management that consistently applies security controls across all system components, while maintaining flexibility to support diverse access requirements.
The access control framework provides comprehensive audit trails that support compliance reporting and security monitoring, enabling organizations to maintain security visibility across their entire LLM infrastructure. This visibility supports both operational security monitoring and compliance reporting requirements.
Compliance Integration
Built-in compliance controls ensure that LLM deployments meet organizational and regulatory requirements throughout the entire lifecycle. These controls are integrated into every stage of the model pipeline, from ingestion through production deployment, ensuring consistent compliance without impacting development velocity. Organizations benefit from automated compliance verification that reduces manual audit overhead while maintaining comprehensive compliance coverage.
The compliance integration provides detailed documentation and audit trails that support regulatory reporting requirements while enabling organizations to demonstrate compliance to auditors and regulators. This documentation reduces compliance burden while providing comprehensive evidence of control effectiveness.
Audit and Monitoring
Comprehensive logging and monitoring capabilities provide visibility into system operations, LLM performance, and security events. This monitoring framework enables proactive identification of issues before they impact production systems while providing detailed forensic capabilities for troubleshooting and compliance investigations. Operations teams benefit from comprehensive system visibility that supports both day-to-day operations and incident response, while compliance teams gain detailed audit trails that support regulatory requirements.
The monitoring system provides both real-time alerting for immediate issues and historical analysis capabilities that support capacity planning and performance optimization. This dual approach ensures both operational responsiveness and strategic planning capabilities.
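
A hypothetical audit record for a single inference call is sketched below; the field names are illustrative and not the platform's actual log schema.

```python
# Hypothetical audit record emitted per inference request. Field names are
# illustrative, not the platform's actual log schema.
import json
import time
import uuid

def audit_record(caller: str, model: str, status_code: int, latency_ms: float) -> str:
    """Serialize one audit event in a structured, queryable form."""
    return json.dumps({
        "event_id": str(uuid.uuid4()),
        "timestamp": time.time(),
        "caller": caller,              # service account or portal key identity
        "model": model,
        "status_code": status_code,
        "latency_ms": latency_ms,
    })

print(audit_record("svc:order-assistant", "llama-chat", 200, 412.5))
```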
Integration Patterns
Development Workflow
The system integrates with existing development processes through SDK support and API compatibility, enabling seamless integration with application development lifecycles. This integration reduces friction for development teams while maintaining the benefits of centralized LLM management and governance. Development teams can incorporate LLM capabilities into their applications using familiar tools and patterns, while benefiting from enterprise-grade operational capabilities provided by the platform.
Integration with development workflows supports both experimental development and production deployment through consistent interfaces that scale from prototype to production without requiring significant changes to application code or deployment processes.
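
For example, when the served runtime exposes an OpenAI-compatible API (as vLLM-based runtimes typically do), an existing client library can target it by changing only the base URL; the URL, access key, and model name below are placeholders.

```python
# Sketch: point an existing OpenAI-compatible client at a served LLM endpoint.
# Works when the runtime exposes the OpenAI API surface (e.g. vLLM); the
# base_url, api_key, and model name are placeholders.
from openai import OpenAI

client = OpenAI(
    base_url="https://ai-portal.example.com/inference/llama-chat/v1",
    api_key="<access-key>",
)

response = client.chat.completions.create(
    model="llama-3.1-8b-instruct",
    messages=[{"role": "user", "content": "Draft a one-line release note."}],
)
print(response.choices[0].message.content)
```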
API Compatibility
Standard API interfaces ensure compatibility with existing LLM toolchains while providing enterprise-grade capabilities for production workloads. This compatibility enables organizations to leverage existing investments in tools and processes while gaining the benefits of centralized LLM management and governance. Development teams can use familiar APIs and tools while benefiting from improved reliability, security, and operational capabilities provided by the platform.
API compatibility extends to both LLM training workflows and application integration patterns, ensuring that teams can adopt the platform without significant changes to existing processes and tools.
Advanced Capabilities
Multi-LLM Framework Support
The architecture accommodates various LLM frameworks and runtime environments, enabling organizations to deploy diverse language model types within a unified infrastructure. This flexibility prevents vendor lock-in while enabling teams to choose optimal frameworks for specific use cases without fragmenting operational processes. Organizations benefit from consistent operational procedures across different LLM types while maintaining flexibility to adopt new frameworks and technologies as they mature.
Multi-framework support includes optimized runtime configurations for different LLM types that ensure optimal performance characteristics regardless of underlying framework technology. This optimization reduces the expertise required from development teams while ensuring optimal resource utilization and performance.
Custom Runtime Configuration
Organizations can define custom ServingRuntime configurations to support specialized LLM requirements or optimization strategies. This customization capability enables support for unique organizational requirements while maintaining the benefits of standardized operational processes. Platform teams can create specialized runtime configurations that optimize for specific LLM characteristics or business requirements while ensuring consistent operational behavior across all deployments.
Custom runtime configurations support both standard optimization patterns and organization-specific requirements that may not be addressed by default configurations. This flexibility ensures that the platform can adapt to diverse organizational needs while maintaining operational consistency.
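
A sketch of a custom runtime definition is shown below. The field names follow KServe's v1alpha1 ServingRuntime schema; the runtime image, arguments, and supported model format are placeholders for an organization's own runtime build.

```python
# Sketch: a custom ServingRuntime definition, created via the Kubernetes API.
# Field names follow KServe's v1alpha1 schema; image and args are placeholders.
from kubernetes import client, config

config.load_kube_config()
api = client.CustomObjectsApi()

serving_runtime = {
    "apiVersion": "serving.kserve.io/v1alpha1",
    "kind": "ServingRuntime",
    "metadata": {"name": "vllm-custom", "namespace": "llm-serving"},
    "spec": {
        "supportedModelFormats": [{"name": "huggingface", "autoSelect": True}],
        "containers": [{
            "name": "kserve-container",
            "image": "registry.example.com/runtimes/vllm-custom:1.0",  # placeholder
            "args": ["--model", "/mnt/models", "--tensor-parallel-size", "2"],
            "resources": {"limits": {"nvidia.com/gpu": "2"}},
        }],
    },
}

api.create_namespaced_custom_object(
    group="serving.kserve.io", version="v1alpha1",
    namespace="llm-serving", plural="servingruntimes",
    body=serving_runtime,
)
```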
Next Steps
Understanding this architectural overview provides the foundation for implementing LLM serving capabilities within your organization. The referenced implementation guides provide detailed instructions for specific configuration tasks and operational procedures that translate these architectural concepts into working systems. Organizations should begin with the foundational components and progress through advanced capabilities based on their specific operational maturity and business requirements.
For hands-on experience with the system, begin with the quickstart guides and progress through the detailed configuration documentation based on your specific deployment requirements. This progression ensures that implementation efforts align with architectural principles while building practical expertise with system capabilities and operational procedures.