Model Management on Hybrid Manager v1.3
Purpose and Benefits
Model management within Hybrid Manager provides centralized governance and deployment capabilities for AI models running on your Kubernetes infrastructure. This system enables organizations to maintain complete control over their AI capabilities while leveraging enterprise-grade Model Serving infrastructure.
The integration addresses critical requirements for organizations deploying AI at scale: model governance through approved registries, scalable inference serving with GPU acceleration, and unified management through Hybrid Manager's control plane. By running models within your controlled infrastructure, you maintain data sovereignty while accessing state-of-the-art AI capabilities.
Core Concepts
Model Library
The Model Library serves as your centralized governance system for AI model images. Operating within Hybrid Manager's Asset Library infrastructure, it provides a curated view of validated models ready for production deployment.
The library implements multi-stage governance:
- Automated synchronization from trusted container registries
- Security scanning and vulnerability assessment
- Approval workflows based on organizational policies
- Metadata management for versioning and documentation
Models in the library power all AI Factory capabilities including Gen AI assistants, Knowledge Base pipelines, and custom inference applications. Only models validated through the library's governance framework can reach production environments.
Model Serving
Model Serving transforms approved models into scalable inference endpoints using KServe within your Kubernetes clusters. This infrastructure provides production-grade model deployment with automatic scaling, health management, and resource optimization.
Key serving capabilities include:
- InferenceService resources that define deployed model endpoints
- ServingRuntime configurations optimized for different model frameworks
- GPU allocation and scheduling for high-performance inference (see Setup GPU and Update GPU resources)
- Internal and external endpoint access with authentication
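To make these resources concrete, the sketch below shows roughly what a minimal InferenceService definition looks like, written as a Python dictionary that maps one-to-one to the YAML manifest KServe accepts. The names, namespace, runtime, and storage URI are hypothetical placeholders; the exact fields depend on the model format and on how the Model Library publishes artifacts.

```python
# Illustrative only: a minimal KServe InferenceService definition as a Python dict.
# All names, the namespace, the runtime, and the storage URI are placeholders.
inference_service = {
    "apiVersion": "serving.kserve.io/v1beta1",
    "kind": "InferenceService",
    "metadata": {"name": "llama-chat", "namespace": "ai-factory"},
    "spec": {
        "predictor": {
            "model": {
                "modelFormat": {"name": "huggingface"},
                "runtime": "vllm-runtime",  # a ServingRuntime tuned for the framework
                "storageUri": "oci://registry.example.com/models/llama-chat:1.0",
                "resources": {
                    "requests": {"cpu": "4", "memory": "24Gi", "nvidia.com/gpu": "1"},
                    "limits": {"nvidia.com/gpu": "1"},
                },
            }
        }
    },
}
```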
Management Interface
Hybrid Manager provides unified management through its web console, abstracting Kubernetes complexity while maintaining full configurability. The interface enables:
- Visual workflows for model deployment from library to serving
- Resource allocation and scaling configuration
- Monitoring dashboards for inference metrics and GPU utilization
- Access control and endpoint management
Implementation Workflow
Model Registration
Organizations begin by configuring repository connections to trusted model sources. The Model Library synchronizes with external registries based on defined rules, automatically discovering and validating new model versions.
External Registry → Repository Rules → Security Scanning → Model Library
Repository rules determine which models enter your environment, implementing organizational policies at the point of ingestion. This automated approach reduces manual overhead while maintaining governance standards.
Model Deployment
Validated models deploy through guided workflows that configure serving infrastructure:
- Model Selection: Browse available models in the library with metadata including version, performance characteristics, and resource requirements
- Runtime Configuration: Select or create ServingRuntimes optimized for the model framework (vLLM, TensorRT-LLM, custom)
- Resource Allocation: Define GPU, memory, and CPU requirements based on expected workload
- Endpoint Configuration: Set up internal cluster access or external API endpoints with authentication (see Access KServe endpoints)
The system creates InferenceService resources that KServe manages, handling pod scheduling, health monitoring, and traffic routing automatically.
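Once the workflow completes, you can confirm that KServe has reconciled the InferenceService and published its endpoint. The sketch below uses the official Kubernetes Python client; the namespace and resource name are hypothetical placeholders.

```python
# Illustrative sketch: read back the InferenceService created by the deployment
# workflow and report its serving URL and readiness from the .status block.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside the cluster
api = client.CustomObjectsApi()

isvc = api.get_namespaced_custom_object(
    group="serving.kserve.io",
    version="v1beta1",
    namespace="ai-factory",
    plural="inferenceservices",
    name="llama-chat",
)

ready = any(
    c.get("type") == "Ready" and c.get("status") == "True"
    for c in isvc.get("status", {}).get("conditions", [])
)
print("endpoint:", isvc.get("status", {}).get("url"), "ready:", ready)
```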
Operational Management
Deployed models operate under continuous monitoring with automatic scaling based on demand. Hybrid Manager provides visibility through:
- Real-time inference metrics including latency and throughput
- GPU utilization tracking for resource optimization
- Error rates and health status for proactive maintenance
- Cost analysis based on resource consumption
GPU Infrastructure
Resource Types
Hybrid Manager supports various GPU configurations to match workload requirements:
- Development GPUs: Lower-tier GPUs (T4, RTX series) for testing and experimentation, with time-slicing support for resource sharing
- Production GPUs: High-performance GPUs (A100, H100) with dedicated allocation for latency-sensitive inference workloads
- Multi-Instance GPUs: Partitioned GPUs that let multiple smaller models share hardware resources efficiently
Allocation Strategies
GPU scheduling occurs through Kubernetes device plugins that advertise available resources and enforce allocation policies. The system supports:
- Exclusive GPU allocation for production workloads
- Time-sliced sharing for development and testing
- MIG partitions for efficient multi-model serving
- Node affinity rules for workload placement
Resource quotas at the project level prevent GPU monopolization while ensuring fair access across teams. Automatic scaling adjusts GPU allocation based on inference demand, optimizing costs while meeting service requirements.
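At the Kubernetes layer, project-level limits correspond to standard ResourceQuota objects. The sketch below creates one with the Kubernetes Python client; the namespace name and GPU limits are hypothetical, and Hybrid Manager applies equivalent quotas through its control plane.

```python
# Hedged sketch: a namespace-level ResourceQuota capping concurrent GPU requests.
from kubernetes import client, config

config.load_kube_config()
core = client.CoreV1Api()

quota = client.V1ResourceQuota(
    metadata=client.V1ObjectMeta(name="team-gpu-quota"),
    spec=client.V1ResourceQuotaSpec(
        hard={
            "requests.nvidia.com/gpu": "4",  # at most 4 GPUs requested at once
            "limits.nvidia.com/gpu": "4",
        }
    ),
)
core.create_namespaced_resource_quota(namespace="ai-factory", body=quota)
```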
Integration Patterns
Gen AI Builder Applications
Gen AI Builder leverages model endpoints for assistants and agents. LLMs deployed through Model Serving provide text generation capabilities while embedding models support RAG pipelines.
Applications access models through cluster-local DNS:
http://model-name.namespace.svc.cluster.local/v1/chat/completions
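For example, an in-cluster application can call the endpoint like this, assuming an OpenAI-compatible runtime such as vLLM behind the InferenceService. The service name, namespace, and model identifier are hypothetical placeholders.

```python
# Minimal in-cluster chat completion request against a served model endpoint.
import requests

url = "http://llama-chat.ai-factory.svc.cluster.local/v1/chat/completions"
payload = {
    "model": "llama-chat",
    "messages": [{"role": "user", "content": "Summarize our return policy."}],
    "max_tokens": 256,
}

response = requests.post(url, json=payload, timeout=60)
response.raise_for_status()
print(response.json()["choices"][0]["message"]["content"])
```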
Knowledge Base Pipelines
Knowledge bases utilize embedding models for document vectorization and semantic search. The Model Library ensures consistent model versions across ingestion and retrieval operations.
Pipeline integration benefits from:
- Validated embedding models with known performance characteristics
- Automatic model updates through library synchronization
- Consistent vector dimensions across knowledge base operations
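The sketch below shows the kind of embedding call an ingestion or retrieval step might make, assuming the embedding model is served behind an OpenAI-compatible /v1/embeddings route. The service name, namespace, and model identifier are hypothetical.

```python
# Hedged sketch of an embedding request used for document vectorization or search.
import requests

url = "http://bge-embeddings.ai-factory.svc.cluster.local/v1/embeddings"
payload = {"model": "bge-embeddings", "input": ["How do I reset my password?"]}

resp = requests.post(url, json=payload, timeout=30)
resp.raise_for_status()
vector = resp.json()["data"][0]["embedding"]
print(f"embedding dimension: {len(vector)}")  # must match the knowledge base index
```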
Custom Applications
Organizations deploy specialized models for business-specific requirements. The serving infrastructure supports various model types including vision models, reranking models, and custom fine-tuned variants.
External applications access models through secured endpoints with API key authentication, enabling integration with existing business systems while maintaining security boundaries.
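An external client call might look like the sketch below. The hostname, path, and authentication header scheme are assumptions; use the endpoint URL and credentials that Hybrid Manager issues for your deployment.

```python
# Illustrative external call to a secured model endpoint using an API key.
import os
import requests

url = "https://models.example.com/llama-chat/v1/chat/completions"
headers = {"Authorization": f"Bearer {os.environ['MODEL_API_KEY']}"}
payload = {
    "model": "llama-chat",
    "messages": [{"role": "user", "content": "Draft a status update for the team."}],
}

resp = requests.post(url, json=payload, headers=headers, timeout=60)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```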
Security and Governance
Access Control
Role-based access control governs model operations at multiple levels:
- Library Access: Controls who can register and approve models
- Deployment Permissions: Manages who can deploy models to serving
- Endpoint Access: Determines who can invoke model endpoints
Integration with enterprise identity providers enables single sign-on while maintaining detailed audit trails for compliance requirements.
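At the Kubernetes layer, deployment permissions ultimately map to standard RBAC objects scoped to InferenceService resources. The sketch below shows that underlying primitive only; Hybrid Manager manages these permissions through its console and project roles, and the names here are hypothetical.

```python
# Hedged sketch: a namespaced Role allowing InferenceService deployment without
# granting cluster-wide rights.
from kubernetes import client, config

config.load_kube_config()
rbac = client.RbacAuthorizationV1Api()

role = client.V1Role(
    metadata=client.V1ObjectMeta(name="model-deployer", namespace="ai-factory"),
    rules=[
        client.V1PolicyRule(
            api_groups=["serving.kserve.io"],
            resources=["inferenceservices"],
            verbs=["get", "list", "create", "update", "delete"],
        )
    ],
)
rbac.create_namespaced_role(namespace="ai-factory", body=role)
```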
Model Governance
The governance framework ensures only validated models reach production:
- Security scanning identifies vulnerabilities before deployment
- Digital signatures verify model authenticity and integrity
- Approval workflows enforce multi-stakeholder validation
- Retention policies manage model lifecycle and deprecation
Data Protection
All model operations maintain data sovereignty within your infrastructure:
- Models run exclusively on your Kubernetes clusters
- Inference data never leaves your controlled environment
- Network policies enforce traffic isolation (see the sketch after this list)
- Encryption protects data in transit and at rest
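As a hedged illustration of traffic isolation, the sketch below defines a NetworkPolicy that only allows designated application pods to reach a model endpoint. The labels and namespace are hypothetical; adapt the selectors to your own workloads.

```python
# Illustrative NetworkPolicy restricting ingress to a served model's pods.
from kubernetes import client, config

config.load_kube_config()
net = client.NetworkingV1Api()

policy = client.V1NetworkPolicy(
    metadata=client.V1ObjectMeta(name="allow-genai-builder-only", namespace="ai-factory"),
    spec=client.V1NetworkPolicySpec(
        pod_selector=client.V1LabelSelector(
            match_labels={"serving.kserve.io/inferenceservice": "llama-chat"}
        ),
        ingress=[
            client.V1NetworkPolicyIngressRule(
                _from=[
                    client.V1NetworkPolicyPeer(
                        pod_selector=client.V1LabelSelector(
                            match_labels={"app": "genai-builder"}
                        )
                    )
                ]
            )
        ],
        policy_types=["Ingress"],
    ),
)
net.create_namespaced_network_policy(namespace="ai-factory", body=policy)
```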
Monitoring and Optimization
Performance Metrics
Comprehensive monitoring tracks model performance across multiple dimensions:
- Inference Latency: End-to-end request processing time
- Token Throughput: Generation rate for language models
- Batch Efficiency: Utilization of batching capabilities
- Cache Hit Rates: Effectiveness of response caching
These metrics inform optimization decisions including batch size tuning, resource allocation adjustments, and caching strategy refinements.
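A simple way to establish client-side baselines for latency and token throughput is shown below. The URL and model name are hypothetical, and the token count assumes an OpenAI-compatible response with a usage block.

```python
# Measure end-to-end latency and generation throughput for one chat request.
import time
import requests

url = "http://llama-chat.ai-factory.svc.cluster.local/v1/chat/completions"
payload = {
    "model": "llama-chat",
    "messages": [{"role": "user", "content": "Explain vector indexes briefly."}],
    "max_tokens": 200,
}

start = time.perf_counter()
resp = requests.post(url, json=payload, timeout=120)
elapsed = time.perf_counter() - start
resp.raise_for_status()

usage = resp.json().get("usage", {})
completion_tokens = usage.get("completion_tokens", 0)
print(f"latency: {elapsed:.2f}s, throughput: {completion_tokens / elapsed:.1f} tokens/s")
```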
Resource Utilization
GPU monitoring provides visibility into hardware efficiency:
- Memory usage patterns identifying optimization opportunities
- Compute utilization indicating scaling requirements
- Power consumption for cost analysis
- Temperature monitoring for hardware health
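If your clusters scrape the NVIDIA DCGM exporter into Prometheus, utilization can also be queried programmatically, as in the hedged sketch below. The Prometheus URL is an assumption, and DCGM_FI_DEV_GPU_UTIL is the exporter's standard utilization metric; adjust if your metric names differ.

```python
# Query average GPU utilization per device from the Prometheus HTTP API.
import requests

PROM_URL = "http://prometheus.monitoring.svc.cluster.local:9090"
query = "avg by (gpu) (DCGM_FI_DEV_GPU_UTIL)"

resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": query}, timeout=10)
resp.raise_for_status()
for result in resp.json()["data"]["result"]:
    print(f"GPU {result['metric'].get('gpu')}: {result['value'][1]}% utilized")
```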
Cost Management
The system tracks resource consumption for cost attribution:
- GPU hours per model and project
- Storage costs for model artifacts
- Network transfer for inference traffic
- Scaling patterns affecting resource costs
Best Practices
Model Selection
Choose models based on clear requirements analysis:
- Match model capabilities to use case requirements
- Consider resource constraints and cost implications
- Validate performance characteristics through testing
- Document model selection rationale for future reference
Deployment Strategy
Implement systematic deployment approaches:
- Start with conservative resource allocation
- Monitor actual usage before optimization
- Use canary deployments for model updates (see the sketch after this list)
- Maintain rollback capabilities for problematic deployments
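For canary rollouts, KServe exposes a canaryTrafficPercent field on the InferenceService that routes a fraction of traffic to the newly promoted revision. The hedged sketch below patches that field with the Kubernetes Python client; names and the namespace are hypothetical, and raising the value to 100 completes the rollout.

```python
# Illustrative canary rollout: send 10% of traffic to the latest model revision.
from kubernetes import client, config

config.load_kube_config()
api = client.CustomObjectsApi()

patch = {"spec": {"predictor": {"canaryTrafficPercent": 10}}}
api.patch_namespaced_custom_object(
    group="serving.kserve.io",
    version="v1beta1",
    namespace="ai-factory",
    plural="inferenceservices",
    name="llama-chat",
    body=patch,
)
```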
Operational Excellence
Maintain high operational standards:
- Establish baseline performance metrics
- Implement comprehensive monitoring before production
- Document configuration decisions and procedures
- Review resource utilization and costs regularly
Getting Started
Begin your model management journey with these steps:
- Enable Model Capabilities in your Hybrid Manager project through Settings → Features
- Configure Repository Rules to connect trusted model sources
- Deploy Your First Model using the guided deployment workflow
- Monitor Performance through integrated dashboards
For detailed implementation guidance, see the linked how-to topics referenced throughout this page, such as Setup GPU, Update GPU resources, and Access KServe endpoints.
Model management within Hybrid Manager transforms AI deployment from experimental prototypes to production-grade services, maintaining sovereignty while delivering enterprise capabilities.