Model Management on Hybrid Manager v1.3

Purpose and Benefits

Model management within Hybrid Manager provides centralized governance and deployment capabilities for AI models running on your Kubernetes infrastructure. This system enables organizations to maintain complete control over their AI capabilities while leveraging enterprise-grade Model Serving infrastructure.

The integration addresses critical requirements for organizations deploying AI at scale: model governance through approved registries, scalable inference serving with GPU acceleration, and unified management through Hybrid Manager's control plane. By running models within your controlled infrastructure, you maintain data sovereignty while accessing state-of-the-art AI capabilities.

Core Concepts

Model Library

The Model Library serves as your centralized governance system for AI model images. Operating within Hybrid Manager's Asset Library infrastructure, it provides a curated view of validated models ready for production deployment.

The library implements multi-stage governance:

  • Automated synchronization from trusted container registries
  • Security scanning and vulnerability assessment
  • Approval workflows based on organizational policies
  • Metadata management for versioning and documentation

Models in the library power all AI Factory capabilities including Gen AI assistants, Knowledge Base pipelines, and custom inference applications. Only models validated through the library's governance framework can reach production environments.

Model Serving

Model Serving transforms approved models into scalable inference endpoints using KServe within your Kubernetes clusters. This infrastructure provides production-grade model deployment with automatic scaling, health management, and resource optimization.

Key serving capabilities include:

  • Automatic scaling of inference endpoints based on demand
  • GPU-accelerated ServingRuntimes such as vLLM and TensorRT-LLM, plus custom runtimes
  • Health monitoring and traffic routing handled by KServe
  • Internal cluster access or external API endpoints with authentication

Management Interface

Hybrid Manager provides unified management through its web console, abstracting Kubernetes complexity while maintaining full configurability. The interface enables:

  • Visual workflows for model deployment from library to serving
  • Resource allocation and scaling configuration
  • Monitoring dashboards for inference metrics and GPU utilization
  • Access control and endpoint management

Implementation Workflow

Model Registration

Organizations begin by configuring repository connections to trusted model sources. The Model Library synchronizes with external registries based on defined rules, automatically discovering and validating new model versions.

External Registry → Repository Rules → Security Scanning → Model Library

Repository rules determine which models enter your environment, implementing organizational policies at the point of ingestion. This automated approach reduces manual overhead while maintaining governance standards.
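
Repository rules are configured through Hybrid Manager itself; the sketch below is purely illustrative of what such a rule captures (a trusted source, admission patterns, and required checks). The field names are hypothetical and do not reflect the actual Hybrid Manager schema.

  # Hypothetical sketch of what a repository rule captures. Field names are
  # illustrative only, not the real Hybrid Manager configuration format.
  repository_rule = {
      "registry": "registry.example.com/ai-models",       # trusted source to synchronize
      "match": {
          "repositories": ["llama-3*", "bge-embedding*"],  # image name patterns to admit
          "tags": ["v*"],                                  # only versioned tags
      },
      "require": {
          "security_scan": "passed",   # block images with unresolved vulnerabilities
          "signature": True,           # require a verified digital signature
      },
  }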

Model Deployment

Validated models deploy through guided workflows that configure serving infrastructure:

  1. Model Selection: Browse available models in the library with metadata including version, performance characteristics, and resource requirements
  2. Runtime Configuration: Select or create ServingRuntimes optimized for the model framework (vLLM, TensorRT-LLM, custom)
  3. Resource Allocation: Define GPU, memory, and CPU requirements based on expected workload
  4. Endpoint Configuration: Set up internal cluster access or external API endpoints with authentication (see Access KServe endpoints)

The system creates InferenceService resources that KServe manages, handling pod scheduling, health monitoring, and traffic routing automatically.
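
Under the hood this is a standard KServe custom resource. As a minimal sketch only (the name, namespace, runtime, storage URI, and resource sizes are placeholders), an equivalent InferenceService could be created with the Kubernetes Python client; in practice the guided workflow generates a comparable resource for you.

  # Minimal sketch: create a KServe InferenceService with the Kubernetes Python
  # client. All names, the runtime, and the storageUri are placeholders.
  from kubernetes import client, config

  config.load_kube_config()  # or config.load_incluster_config() inside the cluster

  inference_service = {
      "apiVersion": "serving.kserve.io/v1beta1",
      "kind": "InferenceService",
      "metadata": {"name": "llama3-chat", "namespace": "models"},
      "spec": {
          "predictor": {
              "model": {
                  "modelFormat": {"name": "huggingface"},
                  "runtime": "vllm-runtime",  # a ServingRuntime available in the cluster
                  "storageUri": "oci://registry.example.com/models/llama3:v1",
                  "resources": {
                      "requests": {"nvidia.com/gpu": "1", "memory": "24Gi", "cpu": "4"},
                      "limits": {"nvidia.com/gpu": "1", "memory": "24Gi", "cpu": "4"},
                  },
              }
          }
      },
  }

  client.CustomObjectsApi().create_namespaced_custom_object(
      group="serving.kserve.io",
      version="v1beta1",
      namespace="models",
      plural="inferenceservices",
      body=inference_service,
  )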

Operational Management

Deployed models operate under continuous monitoring with automatic scaling based on demand. Hybrid Manager provides visibility through:

  • Real-time inference metrics including latency and throughput
  • GPU utilization tracking for resource optimization
  • Error rates and health status for proactive maintenance
  • Cost analysis based on resource consumption

GPU Infrastructure

Resource Types

Hybrid Manager supports various GPU configurations to match workload requirements:

  • Development GPUs: Lower-tier GPUs (T4, RTX series) for testing and experimentation, with time-slicing support for resource sharing
  • Production GPUs: High-performance GPUs (A100, H100) with dedicated allocation for latency-sensitive inference workloads
  • Multi-Instance GPUs: Partitioned GPUs enabling multiple smaller models to share hardware resources efficiently

Allocation Strategies

GPU scheduling occurs through Kubernetes device plugins that advertise available resources and enforce allocation policies. The system supports:

  • Exclusive GPU allocation for production workloads
  • Time-sliced sharing for development and testing
  • MIG partitions for efficient multi-model serving
  • Node affinity rules for workload placement

Resource quotas at the project level prevent GPU monopolization while ensuring fair access across teams. Automatic scaling adjusts GPU allocation based on inference demand, optimizing costs while meeting service requirements.
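
As an illustration, a project-level GPU cap maps naturally onto a Kubernetes ResourceQuota on the extended GPU resource. The namespace name and the limit below are placeholders.

  # Illustrative sketch: cap total GPU requests in a project namespace with a
  # ResourceQuota. The namespace name and the limit of 4 GPUs are placeholders.
  from kubernetes import client, config

  config.load_kube_config()

  quota = client.V1ResourceQuota(
      metadata=client.V1ObjectMeta(name="gpu-quota", namespace="team-a"),
      spec=client.V1ResourceQuotaSpec(hard={"requests.nvidia.com/gpu": "4"}),
  )
  client.CoreV1Api().create_namespaced_resource_quota(namespace="team-a", body=quota)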

Integration Patterns

Gen AI Builder Applications

Gen AI Builder leverages model endpoints for assistants and agents. LLMs deployed through Model Serving provide text generation capabilities while embedding models support RAG pipelines.

Applications access models through cluster-local DNS:

http://model-name.namespace.svc.cluster.local/v1/chat/completions
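
A minimal sketch of calling such an endpoint from inside the cluster, assuming the model is served through an OpenAI-compatible runtime such as vLLM (the service name, namespace, and model identifier are placeholders):

  # In-cluster call to an OpenAI-compatible chat endpoint; the service name,
  # namespace, and model identifier are placeholders.
  import requests

  response = requests.post(
      "http://model-name.namespace.svc.cluster.local/v1/chat/completions",
      json={
          "model": "model-name",
          "messages": [{"role": "user", "content": "Summarize our return policy."}],
          "max_tokens": 256,
      },
      timeout=60,
  )
  print(response.json()["choices"][0]["message"]["content"])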

Knowledge Base Pipelines

Knowledge bases utilize embedding models for document vectorization and semantic search. The Model Library ensures consistent model versions across ingestion and retrieval operations.

Pipeline integration benefits from:

  • Validated embedding models with known performance characteristics
  • Automatic model updates through library synchronization
  • Consistent vector dimensions across knowledge base operations
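
As a sketch, assuming the embedding model exposes an OpenAI-compatible /v1/embeddings route (the endpoint and model identifier are placeholders), ingestion and retrieval code can share a single helper so vector dimensions stay consistent:

  # Shared embedding helper for ingestion and retrieval, assuming an
  # OpenAI-compatible /v1/embeddings route. Endpoint and model are placeholders.
  import requests

  EMBEDDING_ENDPOINT = "http://embedding-model.namespace.svc.cluster.local/v1/embeddings"

  def embed(texts):
      response = requests.post(
          EMBEDDING_ENDPOINT,
          json={"model": "embedding-model", "input": texts},
          timeout=30,
      )
      return [item["embedding"] for item in response.json()["data"]]

  vectors = embed(["What is the warranty period?"])
  print(len(vectors[0]))  # vector dimension; must match the knowledge base schema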

Custom Applications

Organizations deploy specialized models for business-specific requirements. The serving infrastructure supports various model types including vision models, reranking models, and custom fine-tuned variants.

External applications access models through secured endpoints with API key authentication, enabling integration with existing business systems while maintaining security boundaries.
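
A hedged sketch of such an external call, assuming the endpoint is published with bearer-style API key authentication (the URL, header scheme, and environment variable are placeholders that depend on how the endpoint was configured):

  # External client call; the URL and authentication header are placeholders
  # that depend on how the endpoint was published and secured.
  import os
  import requests

  response = requests.post(
      "https://models.example.com/v1/chat/completions",
      headers={"Authorization": f"Bearer {os.environ['MODEL_API_KEY']}"},
      json={
          "model": "model-name",
          "messages": [{"role": "user", "content": "Classify this support ticket."}],
      },
      timeout=60,
  )
  response.raise_for_status()
  print(response.json()["choices"][0]["message"]["content"])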

Security and Governance

Access Control

Role-based access control governs model operations at multiple levels:

  • Library Access: Controls who can register and approve models
  • Deployment Permissions: Manages who can deploy models to serving
  • Endpoint Access: Determines who can invoke model endpoints

Integration with enterprise identity providers enables single sign-on while maintaining detailed audit trails for compliance requirements.

Model Governance

The governance framework ensures only validated models reach production:

  • Security scanning identifies vulnerabilities before deployment
  • Digital signatures verify model authenticity and integrity
  • Approval workflows enforce multi-stakeholder validation
  • Retention policies manage model lifecycle and deprecation

Data Protection

All model operations maintain data sovereignty within your infrastructure:

  • Models run exclusively on your Kubernetes clusters
  • Inference data never leaves your controlled environment
  • Network policies enforce traffic isolation
  • Encryption protects data in transit and at rest
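
As one illustration of traffic isolation, a standard Kubernetes NetworkPolicy can limit which namespaces may reach a model's serving pods. The namespace names, labels, and InferenceService name below are placeholders.

  # Illustrative sketch: allow ingress to a model's serving pods only from a
  # designated application namespace. Names and labels are placeholders.
  from kubernetes import client, config

  config.load_kube_config()

  policy = client.V1NetworkPolicy(
      metadata=client.V1ObjectMeta(name="restrict-model-ingress", namespace="models"),
      spec=client.V1NetworkPolicySpec(
          pod_selector=client.V1LabelSelector(
              match_labels={"serving.kserve.io/inferenceservice": "llama3-chat"}
          ),
          policy_types=["Ingress"],
          ingress=[client.V1NetworkPolicyIngressRule(
              _from=[client.V1NetworkPolicyPeer(
                  namespace_selector=client.V1LabelSelector(match_labels={"team": "genai-apps"})
              )]
          )],
      ),
  )
  client.NetworkingV1Api().create_namespaced_network_policy(namespace="models", body=policy)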

Monitoring and Optimization

Performance Metrics

Comprehensive monitoring tracks model performance across multiple dimensions:

  • Inference Latency: End-to-end request processing time
  • Token Throughput: Generation rate for language models
  • Batch Efficiency: Utilization of batching capabilities
  • Cache Hit Rates: Effectiveness of response caching

These metrics inform optimization decisions including batch size tuning, resource allocation adjustments, and caching strategy refinements.
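
Dashboards cover these metrics server-side, but a quick client-side measurement is often useful when tuning batch sizes or resource requests. A minimal sketch that measures end-to-end latency and approximate token throughput against a chat endpoint (the URL and model name are placeholders):

  # Measure end-to-end latency and approximate token throughput from the client
  # side. Endpoint and model name are placeholders.
  import time
  import requests

  start = time.perf_counter()
  response = requests.post(
      "http://model-name.namespace.svc.cluster.local/v1/chat/completions",
      json={
          "model": "model-name",
          "messages": [{"role": "user", "content": "Explain vector search briefly."}],
          "max_tokens": 200,
      },
      timeout=120,
  )
  elapsed = time.perf_counter() - start
  completion_tokens = response.json()["usage"]["completion_tokens"]
  print(f"latency: {elapsed:.2f}s, throughput: {completion_tokens / elapsed:.1f} tokens/s")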

Resource Utilization

GPU monitoring provides visibility into hardware efficiency:

  • Memory usage patterns identifying optimization opportunities
  • Compute utilization indicating scaling requirements
  • Power consumption for cost analysis
  • Temperature monitoring for hardware health

Cost Management

The system tracks resource consumption for cost attribution:

  • GPU hours per model and project
  • Storage costs for model artifacts
  • Network transfer for inference traffic
  • Scaling patterns affecting resource costs

Best Practices

Model Selection

Choose models based on clear requirements analysis:

  • Match model capabilities to use case requirements
  • Consider resource constraints and cost implications
  • Validate performance characteristics through testing
  • Document model selection rationale for future reference

Deployment Strategy

Implement systematic deployment approaches:

  • Start with conservative resource allocation
  • Monitor actual usage before optimization
  • Use canary deployments for model updates (see the sketch after this list)
  • Maintain rollback capabilities for problematic deployments
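
For KServe-backed deployments, a canary rollout can be expressed by setting canaryTrafficPercent on the predictor, which routes a fraction of traffic to the new revision while the previous revision continues serving the rest. A minimal sketch using the Kubernetes Python client (the InferenceService name, namespace, and storage URI are placeholders):

  # Shift 10% of traffic to a new model revision via KServe's canaryTrafficPercent
  # field. Name, namespace, and storageUri are placeholders.
  from kubernetes import client, config

  config.load_kube_config()

  client.CustomObjectsApi().patch_namespaced_custom_object(
      group="serving.kserve.io",
      version="v1beta1",
      namespace="models",
      plural="inferenceservices",
      name="llama3-chat",
      body={"spec": {"predictor": {
          "canaryTrafficPercent": 10,
          "model": {"storageUri": "oci://registry.example.com/models/llama3:v2"},
      }}},
  )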

Operational Excellence

Maintain high operational standards:

  • Establish baseline performance metrics
  • Implement comprehensive monitoring before production
  • Document configuration decisions and procedures
  • Review resource utilization and costs regularly

Getting Started

Begin your model management journey with these steps:

  1. Enable Model Capabilities in your Hybrid Manager project through Settings → Features
  2. Configure Repository Rules to connect trusted model sources
  3. Deploy Your First Model using the guided deployment workflow
  4. Monitor Performance through integrated dashboards

Model management within Hybrid Manager transforms AI deployment from experimental prototypes to production-grade services, maintaining sovereignty while delivering enterprise capabilities.