Troubleshooting AI Factory on Hybrid Manager v1.3
Diagnostic Overview
This guide provides systematic troubleshooting procedures for AI Factory components within Hybrid Manager environments. Issues typically fall into three categories: infrastructure problems, model serving failures, and application-level errors.
Infrastructure Issues
GPU Resource Problems
Symptoms
- InferenceService pods remain pending
- "Insufficient nvidia.com/gpu" events
- Model initialization timeouts
Diagnostic Steps
```shell
# Check GPU node availability
kubectl get nodes -l nvidia.com/gpu=true

# Verify GPU resource allocation
kubectl describe node <gpu-node-name> | grep -A 5 "Allocated resources"

# Check GPU driver status
kubectl logs -n gpu-operator nvidia-driver-daemonset-<pod>
```
Common Resolutions
**Missing GPU Labels.** Nodes with GPUs must be properly labeled:
```shell
kubectl label nodes <node-name> nvidia.com/gpu=true
```
**GPU Taint Issues.** Verify the taint configuration for dedicated GPU scheduling (a taint takes the form key=value:effect):
```shell
kubectl taint nodes <node-name> nvidia.com/gpu=true:NoSchedule
```
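If GPU nodes carry this taint, the serving pods need a matching toleration. A minimal sketch of the pod-level fields, which can be merged into the InferenceService predictor spec (or the ServingRuntime pod template, depending on how your deployment is structured):
```yaml
# Toleration matching the nvidia.com/gpu=true:NoSchedule taint shown above
tolerations:
  - key: nvidia.com/gpu
    operator: Equal
    value: "true"
    effect: NoSchedule
```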
**Driver Compatibility.** Ensure the NVIDIA driver version matches CUDA requirements for NIM containers. Check driver logs in the gpu-operator namespace for initialization errors.
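Two quick checks, assuming the standard gpu-operator namespace and driver daemonset naming used earlier in this guide:
```shell
# Confirm all GPU operator components are running
kubectl get pods -n gpu-operator

# Report the installed driver and CUDA versions on a GPU node
kubectl exec -n gpu-operator nvidia-driver-daemonset-<pod> -- nvidia-smi
```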
Storage Access Failures
Symptoms
- Model download failures
- "ImagePullBackOff" status
- Profile cache errors in air-gapped environments
Diagnostic Procedures
```shell
# Check secret configuration
kubectl get secret nvidia-nim-secrets -n default -o yaml

# Verify image pull secret
kubectl get secret ngc-cred -n <namespace> -o yaml

# Validate the image reference (client-side dry run; no image is pulled)
kubectl run test-pull --image=nvcr.io/nim/nvidia/nvclip:latest --dry-run=client
```
Resolution Strategies
**Registry Authentication.** Recreate NGC credentials if authentication fails:
```shell
kubectl delete secret ngc-cred -n default

kubectl create secret docker-registry ngc-cred \
  --docker-server=nvcr.io \
  --docker-username='$oauthtoken' \
  --docker-password=<NGC_API_KEY> \
  -n default
```
**Air-Gapped Environments.** Verify profile cache availability in object storage and correct path configuration in model deployment specifications.
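For example, if the profile cache sits in S3-compatible object storage, a quick listing confirms that the expected profile artifacts exist at the configured path (the bucket and prefix below are placeholders for your environment):
```shell
# List cached NIM profiles in object storage (S3-compatible example)
aws s3 ls s3://<profile-cache-bucket>/<profile-cache-prefix>/ --recursive
```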
Model Serving Failures
InferenceService Not Ready
Symptoms
- InferenceService shows "NotReady" status
- Predictor pods crash or restart
- Health check failures
Investigation Commands
```shell
# Check InferenceService status
kubectl get inferenceservice <name> -n <namespace>

# Examine detailed conditions
kubectl describe inferenceservice <name> -n <namespace>

# Review pod logs
kubectl logs <predictor-pod> -n <namespace> -c kserve-container
```
Common Causes and Fixes
**Insufficient Memory.** NIM models require substantial memory. Check pod resource requests:
```shell
kubectl describe pod <predictor-pod> -n <namespace> | grep -A 3 "Requests"
```
Increase the memory allocation in the InferenceService specification if necessary.
**Model Loading Timeout.** Large models may exceed default initialization timeouts. Adjust the readiness probe settings in the InferenceService configuration.
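A minimal sketch covering both adjustments, assuming a KServe v1beta1 InferenceService with a NIM predictor; only the relevant fields are shown (merge them into your existing specification), and the values, health path, and port are illustrative:
```yaml
spec:
  predictor:
    model:
      resources:
        requests:
          memory: 32Gi               # size to the model's actual footprint
          nvidia.com/gpu: "1"
        limits:
          memory: 32Gi
          nvidia.com/gpu: "1"
      readinessProbe:                # give large models longer to initialize
        httpGet:
          path: /v1/health/ready     # common NIM health endpoint; adjust for other runtimes
          port: 8000                 # typical NIM serving port; adjust as needed
        initialDelaySeconds: 300
        periodSeconds: 30
        failureThreshold: 20
```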
**Profile Mismatch.** Ensure cached profiles match the GPU architecture. List compatible profiles and verify cache contents.
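Many NIM containers bundle a profile-listing utility that can be run inside the predictor pod; the utility name and default cache path below are typical for NIM LLM images but vary by model, so treat them as assumptions:
```shell
# List profiles compatible with the GPUs detected in the pod
kubectl exec <predictor-pod> -n <namespace> -c kserve-container -- list-model-profiles

# Inspect what the local profile cache actually contains (default NIM cache path)
kubectl exec <predictor-pod> -n <namespace> -c kserve-container -- ls -R /opt/nim/.cache
```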
High Inference Latency
Symptoms
- Response times exceed SLA requirements
- Token generation rates below expectations
- GPU underutilization
Performance Analysis
```shell
# Monitor GPU utilization
kubectl exec <predictor-pod> -n <namespace> -- nvidia-smi

# Check request metrics
kubectl port-forward -n <namespace> svc/<service-name> 9090:9090
# Access metrics at localhost:9090/metrics
```
Optimization Approaches
**Batch Size Tuning.** Adjust batch processing parameters in the ServingRuntime configuration for improved throughput.
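The tunable itself is model specific, so the environment variable below is only a placeholder; the sketch shows where such a setting lives in a KServe ServingRuntime:
```yaml
apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
  name: nvidia-nim-runtime            # illustrative name
spec:
  containers:
    - name: kserve-container
      image: nvcr.io/nim/<model-image>:<tag>
      env:
        # Replace with the batching/concurrency variable documented
        # for your specific NIM model; the name and value are placeholders.
        - name: <BATCH_TUNING_VARIABLE>
          value: "32"
```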
**Replica Scaling.** Add InferenceService replicas to distribute load, for example by raising the minimum replica count on the predictor:
```shell
kubectl patch inferenceservice <name> -n <namespace> --type merge \
  -p '{"spec":{"predictor":{"minReplicas":3}}}'
```
**Resource Allocation.** Verify that GPU memory allocation matches model requirements. Insufficient GPU memory forces CPU fallback, severely impacting performance.
Gen AI Application Issues
Knowledge Base Retrieval Failures
Symptoms
- Empty or irrelevant search results
- Embedding generation errors
- Vector database connection failures
Diagnostic Process
```shell
# Check embedding model status
kubectl get inferenceservice <embedding-model> -n <namespace>

# Verify PostgreSQL connectivity
kubectl exec -it <app-pod> -n <namespace> -- pg_isready -h <postgres-host>

# Review application logs
kubectl logs <gen-ai-app-pod> -n <namespace>
```
Resolution Steps
**Embedding Model Issues.** Verify embedding model deployment and endpoint accessibility. Ensure consistent embedding dimensions between indexing and retrieval.
**Database Connectivity.** Check PostgreSQL service availability and pgvector extension installation. Verify the connection credentials in the application configuration.
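Both checks can be run directly against the database; the pod, database, user, and table names below are placeholders for your deployment:
```shell
# Confirm the pgvector extension is installed in the application database
kubectl exec -it <postgres-pod> -n <namespace> -- \
  psql -U <db-user> -d <db-name> -c "SELECT extname, extversion FROM pg_extension WHERE extname = 'vector';"

# Spot-check stored embedding dimensions against the embedding model's output size
kubectl exec -it <postgres-pod> -n <namespace> -- \
  psql -U <db-user> -d <db-name> -c "SELECT vector_dims(embedding) FROM <documents_table> LIMIT 1;"
```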
Assistant Response Errors
Symptoms
- Assistant fails to generate responses
- Tool invocation failures
- Thread state inconsistencies
Investigation Methods
Access thread logs for detailed execution traces:
```shell
# Query thread history (application-specific)
kubectl exec <app-pod> -n <namespace> -- cat /app/logs/threads.log
```
Common Resolutions
**Model Endpoint Unavailable.** Verify the LLM InferenceService status and the endpoint configuration in the assistant settings.
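One way to confirm the endpoint, assuming an OpenAI-compatible NIM served through KServe (adjust the path for other runtimes):
```shell
# Resolve the InferenceService URL reported by KServe
kubectl get inferenceservice <llm-name> -n <namespace> -o jsonpath='{.status.url}'

# Probe the OpenAI-compatible models endpoint from inside the cluster
kubectl exec <app-pod> -n <namespace> -- curl -s <inference-url>/v1/models
```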
**Tool Integration Failures.** Check external API connectivity and authentication. Verify that tool definitions match the API specifications.
**Context Window Exceeded.** Reduce the retrieval chunk size or implement context management strategies for large documents.
Log Analysis
Log Locations
AI Factory components generate logs at multiple levels:
System Logs
- Kubernetes events: `kubectl get events -n <namespace>`
- Node logs: `/var/log/messages` or `journalctl` on GPU nodes
Application Logs
- InferenceService: Predictor pod container logs
- Gen AI applications: Application pod logs
- Model Library: HM control plane logs
Metrics and Monitoring
- Prometheus metrics: Available through HM monitoring stack
- Custom dashboards: Accessible via Grafana in HM console
Log Aggregation
Configure centralized logging for comprehensive analysis:
```shell
# Stream logs from multiple components
kubectl logs -f -l app=inference-service -n <namespace> --all-containers

# Export logs for analysis
kubectl logs <pod> -n <namespace> --since=1h > diagnostic.log
```
Alert Configuration
Critical Alerts
Configure monitoring alerts for critical conditions:
**GPU Availability.** Alert when GPU nodes become unavailable or GPU allocation fails.
**Model Health.** Monitor InferenceService readiness and restart frequency.
**Performance Degradation.** Track inference latency percentiles and token generation rates.
**Resource Exhaustion.** Alert on memory pressure, GPU memory saturation, or exhausted storage capacity.
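As an illustration of the GPU memory saturation case, a PrometheusRule sketch, assuming dcgm-exporter metrics (such as DCGM_FI_DEV_FB_FREE, reported in MiB) are scraped by the HM monitoring stack; the rule name, namespace, and threshold are illustrative:
```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: ai-factory-gpu-alerts
  namespace: monitoring                  # adjust to the HM monitoring namespace
spec:
  groups:
    - name: gpu-memory
      rules:
        - alert: GPUMemoryNearlySaturated
          expr: DCGM_FI_DEV_FB_FREE < 1024   # less than ~1 GiB of framebuffer memory free
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "GPU framebuffer memory nearly exhausted on {{ $labels.instance }}"
```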
Alert Integration
Integrate alerts with organizational notification systems through HM alert manager configuration.
Escalation Procedures
Support Resources
When internal troubleshooting proves insufficient:
- Documentation Review
  - Consult AI Factory Hub
  - Review model-specific documentation
- Community Resources
  - NVIDIA NIM documentation for model-specific issues
  - KServe community for serving infrastructure problems
- EDB Support
  - Collect diagnostic bundles using HM support tools
  - Include relevant logs, configurations, and error messages
  - Reference specific component versions and deployment specifications
Diagnostic Information Collection
Prepare comprehensive diagnostic information:
```shell
# Generate support bundle
kubectl cluster-info dump --output-directory=./cluster-dump

# Collect AI Factory specifics
kubectl get inferenceservice -A -o yaml > inferenceservices.yaml
kubectl get pods -A -l serving.kserve.io/inferenceservice -o wide > serving-pods.txt
kubectl describe nodes -l nvidia.com/gpu=true > gpu-nodes.txt
```
Preventive Measures
Regular Health Checks
Implement proactive monitoring:
- Weekly GPU driver and operator status verification
- Daily InferenceService health assessment
- Continuous performance baseline tracking
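The first two checks lend themselves to a simple scripted sweep; a minimal sketch (column positions in the kubectl output can vary across KServe versions, so verify them in your environment):
```shell
# List InferenceServices whose READY column is not "True"
# (with -A the columns are NAMESPACE NAME URL READY ..., so READY is field 4)
kubectl get inferenceservice -A --no-headers | awk '$4 != "True"'

# Flag any gpu-operator pods that are not Running
kubectl get pods -n gpu-operator --field-selector=status.phase!=Running
```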
Capacity Planning
Monitor resource trends:
- GPU utilization patterns
- Memory consumption growth
- Storage usage projections
- Request volume trends
Update Management
Keep components up to date:
- Track NVIDIA NIM model updates
- Monitor security advisories
- Plan maintenance windows for updates
- Test updates in non-production environments