Frequently Asked Questions - AI Factory on Hybrid Manager v1.3

Table of Contents

  • Platform Capabilities
  • Installation and Setup
  • Model Management
  • Gen AI Builder
  • Operations and Maintenance
  • Troubleshooting
  • Security and Compliance
  • Performance and Scaling
  • Additional Resources

Platform Capabilities

What types of models does HM 1.3 support?

Hybrid Manager 1.3 supports Large Language Model (LLM) deployments exclusively through NVIDIA NIM containers. Traditional machine learning models (classification, regression, time-series forecasting) are not supported in this release.

Supported NVIDIA NIM model categories:

  • Text Generation: Large language models for chat and completion tasks
  • Text Embeddings: Models for semantic search and RAG applications
  • Text Reranking: Models for search result optimization
  • Multimodal Models: Vision models including CLIP and OCR capabilities

Can I deploy custom models?

Custom models must be packaged as NVIDIA NIM containers to be compatible with HM 1.3. Standard machine learning frameworks (scikit-learn, XGBoost, TensorFlow for traditional ML) are not supported. Custom LLMs can be deployed if they conform to NIM container specifications and API standards.

See Private Registry Integration for custom NIM deployment procedures.

What distinguishes AI Factory from cloud AI services?

AI Factory provides complete sovereignty over AI operations:

  • Models execute within your Kubernetes infrastructure
  • Data remains within organizational boundaries
  • No external API dependencies for inference
  • Complete audit trails for regulatory compliance

Installation and Setup

What are the minimum infrastructure requirements?

Core Requirements

  • Kubernetes 1.27+ with NVIDIA GPU operator
  • NVIDIA GPUs compatible with NIM containers (L40S, A100, H100)
  • 100GB+ object storage for model artifacts
  • Network connectivity to NVIDIA NGC registry (or air-gapped configuration)

GPU Requirements by NIM Model Type

  • Text completion (Llama 3.3 70B): 4 x L40S GPUs
  • Text embeddings: 1 x L40S GPU
  • Text reranking: 1 x L40S GPU
  • Vision models: 1 x L40S GPU
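
Before deploying, it can help to confirm how many GPUs each node actually advertises against the figures above. One quick check, assuming the NVIDIA GPU operator is already installed and exposing the nvidia.com/gpu resource:

kubectl describe nodes | grep -i "nvidia.com/gpu"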

Consult Prerequisites Guide for comprehensive specifications.

How do I configure GPU nodes for NIM models?

GPU node preparation involves:

  1. Install NVIDIA GPU operator on the cluster
  2. Label GPU nodes with nvidia.com/gpu=true
  3. Apply GPU taints for dedicated scheduling
  4. Verify CUDA compatibility for NIM requirements
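
A minimal command sketch for steps 2 and 3; the node name is a placeholder, and your cluster may use different label and taint conventions:

# Label the node so NIM workloads can target it
kubectl label node <gpu-node-name> nvidia.com/gpu=true
# Taint the node so only workloads that tolerate the taint are scheduled there
kubectl taint node <gpu-node-name> nvidia.com/gpu=true:NoSchedule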

Detailed instructions available in GPU Setup Documentation.

Can AI Factory operate in air-gapped environments?

Yes. Air-gapped deployments require advance preparation:

  1. Mirror NVIDIA NIM images to private registry
  2. Download and cache model profiles
  3. Upload profiles to object storage
  4. Configure Model Library for private registry access
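
As an illustrative sketch of step 1, the commands below mirror a NIM image from the NVIDIA NGC registry (nvcr.io) into a private registry. The image path and registry host are placeholders; your organization may use skopeo or another mirroring tool instead of the Docker CLI.

# Authenticate to NGC with your API key, then copy the image into the private registry
docker login nvcr.io
docker pull nvcr.io/nim/<publisher>/<model>:<tag>
docker tag nvcr.io/nim/<publisher>/<model>:<tag> registry.example.com/nim/<model>:<tag>
docker push registry.example.com/nim/<model>:<tag>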

Complete procedures documented in Air-Gap Configuration.

Model Management

How do I deploy NVIDIA NIM models?

NIM model deployment workflow:

  1. Access Model Library in HM console
  2. Select NVIDIA NIM model from catalog
  3. Configure resources (GPU allocation, memory, replicas)
  4. Deploy InferenceService to project namespace
  5. Access through generated endpoints
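
For reference, a manifest roughly equivalent to what the console generates in step 4 is sketched below. It assumes a KServe-style InferenceService resource; the name, namespace, image path, and resource figures are placeholders, and the console-generated manifest is authoritative.

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: llama-33-nemotron-super-49b          # placeholder service name
  namespace: my-ai-project                   # target project namespace
spec:
  predictor:
    minReplicas: 1
    containers:
      - name: kserve-container
        image: registry.example.com/nim/llama-3.3-nemotron-super-49b:latest   # placeholder image
        resources:
          limits:
            nvidia.com/gpu: "4"              # GPU allocation per replica
            memory: 64Gi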

Step-by-step guide: Create InferenceService.

Which NVIDIA NIM models are available by default?

Default NIM models in HM 1.3:

  • llama-3.3-nemotron-super-49b: Advanced reasoning and chat
  • llama-3.2-nemoretriever-300m-embed: Text embeddings
  • llama-3.2-nv-rerankqa-1b: Query-document reranking
  • nvclip: Multimodal embeddings
  • paddleocr: Optical character recognition

How do I manage NIM model versions?

Version management strategies:

  • Model Library maintains version tags for each NIM image
  • Blue-green deployments enable zero-downtime updates
  • Canary deployments allow gradual traffic shifting
  • Rollback through InferenceService configuration updates
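
As an illustrative fragment, and assuming the underlying InferenceService follows KServe conventions, a canary rollout can be driven by updating the model image and setting canaryTrafficPercent on the predictor. Raising the percentage (or removing the field) completes the rollout; reverting the image rolls back.

spec:
  predictor:
    canaryTrafficPercent: 10                 # route 10% of traffic to the updated revision
    containers:
      - name: kserve-container
        image: registry.example.com/nim/llama-3.3-nemotron-super-49b:1.1.0   # new version tag (placeholder)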

Gen AI Builder

Where do I find Gen AI documentation?

Primary Gen AI resources, including Gen AI Builder guides, are collected in the AI Factory Hub (see Additional Resources at the end of this document).

How do I create knowledge bases for RAG?

Knowledge base creation process:

  1. Configure data sources (databases, documents, APIs)
  2. Process content through NIM embedding models
  3. Store vectors in PostgreSQL with pgvector
  4. Configure retrieval strategies for search
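
As a rough illustration of steps 2 and 3, embeddings produced by a NIM embedding model can be stored and queried with pgvector as sketched below. The table name, embedding dimension, and distance operator are assumptions that depend on the chosen model and retrieval strategy.

-- Enable pgvector and create a table for document chunks (dimension is model-dependent)
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE kb_chunks (
    id        bigserial PRIMARY KEY,
    content   text NOT NULL,
    embedding vector(1024)
);

-- Retrieve the five chunks closest to a query embedding ($1) by cosine distance
SELECT id, content
FROM kb_chunks
ORDER BY embedding <=> $1::vector
LIMIT 5;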

Implementation guide: Knowledge Base Creation.

What are assistants and how do they work?

Assistants orchestrate interactions between users, knowledge bases, and external systems. They leverage NIM models for generation while maintaining conversation context through threads. Assistants differ from simple chatbots by incorporating retrieval, tool use, and structured reasoning capabilities.

Operations and Maintenance

How do I monitor NIM model performance?

Monitoring encompasses:

Metrics Collection

  • Prometheus metrics for inference latency
  • GPU utilization and memory consumption
  • Token generation throughput
  • Request success rates

Visualization

  • Grafana dashboards integrated in HM console
  • Custom panels for model-specific metrics
  • Alert configuration for SLA breaches
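
For example, a latency panel or SLA alert can be built from a request-latency histogram exposed by the serving layer. The metric and label names below are placeholders; the exact series available depend on the NIM version and scrape configuration.

# Placeholder PromQL: p95 inference latency over the last 5 minutes for one service
histogram_quantile(0.95,
  sum(rate(request_latency_seconds_bucket{service="llama-33-nemotron"}[5m])) by (le))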

Reference: Model Observability.

How should I handle NIM model updates?

Update procedure for production deployments:

  1. Validation: Deploy new version in development namespace
  2. Testing: Execute performance and accuracy tests
  3. Deployment: Implement canary or blue-green strategy
  4. Monitoring: Track metrics during transition
  5. Decision: Complete rollout or rollback based on metrics

Troubleshooting

What are the diagnostic steps when a NIM model fails to start?

Common initialization failures:

  1. GPU unavailability: Verify GPU resources match model requirements
  2. Image pull failures: Check NGC credentials and network connectivity
  3. Profile cache missing: Ensure profiles available in air-gapped setups
  4. Insufficient memory: Validate memory allocation for model size

Diagnostic commands:

# Inspect the InferenceService status, conditions, and recent changes
kubectl describe inferenceservice <name> -n <namespace>
# Review container logs from the model pod for startup errors
kubectl logs <pod-name> -n <namespace>
# List recent scheduling, image-pull, and out-of-memory events in the namespace
kubectl get events -n <namespace>

What optimization strategies reduce high inference latency?

Performance optimization approaches:

  • Batch processing: Increase batch size for throughput optimization
  • Model quantization: Use INT8 quantization where supported
  • Response caching: Cache frequent queries at application layer
  • Horizontal scaling: Deploy additional replicas for load distribution
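
For the horizontal scaling item, one way to add replicas is sketched below. It assumes the InferenceService exposes KServe-style minReplicas and maxReplicas fields; the name, namespace, and replica counts are placeholders.

kubectl patch inferenceservice <name> -n <namespace> --type merge \
  -p '{"spec":{"predictor":{"minReplicas":2,"maxReplicas":6}}}'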

How do I improve poor retrieval quality in RAG applications?

Retrieval troubleshooting:

  1. Embedding quality: Verify appropriate NIM embedding model selection
  2. Document chunking: Adjust chunk size and overlap parameters
  3. Search parameters: Tune top-k and similarity thresholds
  4. Index completeness: Confirm all documents processed successfully
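
Continuing the pgvector sketch from the knowledge base section, top-k and the similarity threshold map directly onto the retrieval query. The distance cutoff below is purely illustrative and should be tuned against your own relevance checks.

-- Return up to 10 chunks, discarding anything beyond an illustrative distance cutoff
SELECT id, content, embedding <=> $1::vector AS distance
FROM kb_chunks
WHERE embedding <=> $1::vector < 0.35
ORDER BY distance
LIMIT 10;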

Security and Compliance

How do I implement access control?

Role-based access control for AI resources involves:

  • Kubernetes RBAC for namespace and resource permissions
  • Model Library access controls for deployment authorization
  • API key management for external endpoint access
  • Network policies for inter-service communication
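
As a hedged sketch of the Kubernetes RBAC piece, the manifests below grant a hypothetical ai-team group permission to manage InferenceServices in a single project namespace. The names, the group, and the serving.kserve.io API group are assumptions to adapt to your environment.

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: inferenceservice-editor              # hypothetical role name
  namespace: my-ai-project
rules:
  - apiGroups: ["serving.kserve.io"]
    resources: ["inferenceservices"]
    verbs: ["get", "list", "watch", "create", "update", "patch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: ai-team-inferenceservice-editor
  namespace: my-ai-project
subjects:
  - kind: Group
    name: ai-team                            # hypothetical group from your identity provider
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: inferenceservice-editor
  apiGroup: rbac.authorization.k8s.io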

What encryption is implemented?

Encryption coverage:

  • At rest: Kubernetes secrets encryption, database encryption
  • In transit: TLS for API calls, mTLS within service mesh
  • Model artifacts: Encrypted object storage
  • Knowledge bases: Encrypted vector storage in PostgreSQL

Which operations are audited?

Audit logging captures:

  • NIM model deployment and configuration changes
  • Inference requests (configurable detail level)
  • Knowledge base queries and updates
  • Assistant conversations via thread tracking
  • Administrative operations on AI resources

Performance and Scaling

How do I configure resource quotas?

Resource quotas prevent a single project from exhausting shared capacity at the namespace level. Configure GPU quotas, memory limits, and storage constraints based on project requirements and available infrastructure capacity.
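
A minimal sketch of such a quota follows; the figures and names are illustrative, not recommendations.

apiVersion: v1
kind: ResourceQuota
metadata:
  name: ai-project-quota
  namespace: my-ai-project
spec:
  hard:
    requests.nvidia.com/gpu: "4"             # cap total GPUs requested in the namespace
    limits.memory: 256Gi                     # cap aggregate memory limits
    requests.storage: 500Gi                  # cap total persistent volume claims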

When should I scale horizontally versus vertically?

Horizontal Scaling (additional replicas):

  • High concurrent request volume
  • Stateless inference workloads
  • Load distribution requirements

Vertical Scaling (increased resources per instance):

  • Large model memory requirements
  • Batch processing optimization
  • Single-request latency minimization

What model sizes can HM 1.3 support?

Model size constraints:

  • Single GPU: Models up to 13B parameters
  • Multi-GPU: Models up to 70B+ parameters using tensor parallelism
  • Memory limits: 80GB (A100), 48GB (L40S) per GPU

NVIDIA NIM handles model sharding and parallelism automatically based on available resources.

Additional Resources

For issues not addressed here, contact EDB support or consult the AI Factory Hub.