Deploy AI Factory with Hybrid Manager UI v1.3

Overview

This guide provides deployment procedures for AI Factory components through the Hybrid Manager (HCP) web interface. The deployment process covers GPU infrastructure setup, NVIDIA NIM model deployment, and Gen AI application configuration.

Prerequisites

Infrastructure Requirements

Before deploying AI Factory components:

  • Kubernetes cluster with GPU nodes configured
  • NVIDIA GPU operator installed
  • Access to NVIDIA NGC registry or private registry
  • Object storage for model profiles (required for air-gapped deployments)

GPU Node Configuration

Verify GPU nodes meet NVIDIA NIM requirements:

| Model Type | NIM Model | GPU Requirements | Container Image |
| --- | --- | --- | --- |
| Text Completion | llama-3.3-nemotron-super-49b-v1 | 4 x L40S | nvcr.io/nim/nvidia/llama-3.3-nemotron-super-49b-v1:1.8.5 |
| Text Embeddings | llama-3.2-nemoretriever-300m-embed-v1 | 1 x L40S | nvcr.io/nim/nvidia/llama-3.2-nemoretriever-300m-embed-v1:latest |
| Image Embeddings | nvclip | 1 x L40S | nvcr.io/nim/nvidia/nvclip:latest |
| OCR | paddleocr | 1 x L40S | nvcr.io/nim/baidu/paddleocr:latest |
| Text Reranking | llama-3.2-nv-rerankqa-1b-v2 | 1 x L40S | nvcr.io/nim/nvidia/llama-3.2-nv-rerankqa-1b-v2:latest |

Ensure GPU nodes include:

  • Label: nvidia.com/gpu=true
  • Taint: nvidia.com/gpu
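
As a quick check, you can confirm the label, the taint, and the allocatable GPU count with kubectl. The taint value and NoSchedule effect shown when applying the taint are common defaults, not values mandated by this guide:

# List nodes carrying the GPU scheduling label
kubectl get nodes -l nvidia.com/gpu=true

# Confirm the taint and allocatable GPU count on a specific node
kubectl describe node <gpu-node-name> | grep -E 'Taints|nvidia.com/gpu'

# If missing, apply the label and taint (value and effect assumed here)
kubectl label node <gpu-node-name> nvidia.com/gpu=true
kubectl taint node <gpu-node-name> nvidia.com/gpu=true:NoSchedule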

Step 1: Configure Registry Authentication

Internet-Connected Deployments

For clusters with internet access, configure NVIDIA NGC authentication:

  1. Obtain NGC API key from NVIDIA NGC Portal

  2. Create authentication secrets via HCP UI:

  • Navigate to Project Settings → Secrets
  • Create nvidia-nim-secrets with NGC_API_KEY
  • Create ngc-cred Docker registry secret

Alternatively, use kubectl:

NGC_API_KEY=<your-ngc-api-key>

# Create runtime secret with the NGC API key
kubectl -n default create secret generic nvidia-nim-secrets \
    --from-literal=NGC_API_KEY=${NGC_API_KEY}

# Replicate the secret to model namespaces (m-*) via kubernetes-replicator
kubectl -n default annotate secret nvidia-nim-secrets \
    replicator.v1.mittwald.de/replicate-to='m-.*'

# Create image pull secret for nvcr.io
kubectl -n default create secret docker-registry ngc-cred \
    --docker-server=nvcr.io \
    --docker-username='$oauthtoken' \
    --docker-password=${NGC_API_KEY}

# Replicate the pull secret to model namespaces as well
kubectl -n default annotate secret ngc-cred \
    replicator.v1.mittwald.de/replicate-to='m-.*'
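
To confirm that both secrets exist and carry the replication annotation:

# Verify the secrets and inspect their annotations
kubectl -n default get secret nvidia-nim-secrets ngc-cred
kubectl -n default get secret nvidia-nim-secrets -o jsonpath='{.metadata.annotations}'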

Air-Gapped Deployments

For environments without internet access:

  1. Mirror NIM Images: Copy required images to the private registry using skopeo (see the example below)
  2. Update Model URLs: Configure HCP to reference private registry locations
  3. Cache Profiles: Download and store model profiles in object storage
  4. Configure Storage Path: Reference cached profiles during model deployment

See Air-Gapped Configuration for detailed procedures.
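
As a sketch of the image mirroring step, assuming skopeo is installed and <private-registry> stands in for your registry hostname; repeat the copy for each image listed in the GPU node configuration table:

# Authenticate to the source and destination registries
skopeo login nvcr.io -u '$oauthtoken' -p "${NGC_API_KEY}"
skopeo login <private-registry> -u <registry-user> -p <registry-password>

# Mirror one NIM image to the private registry
skopeo copy \
    docker://nvcr.io/nim/nvidia/llama-3.3-nemotron-super-49b-v1:1.8.5 \
    docker://<private-registry>/nim/nvidia/llama-3.3-nemotron-super-49b-v1:1.8.5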

Step 2: Access Model Library

Navigate to the Model Library interface:

  1. Log into Hybrid Manager console
  2. Select your project from the project selector
  3. Navigate to AI Factory → Model Library

The Model Library displays available NVIDIA NIM models:

  • llama-3.3-nemotron-super-49b-v1: Advanced reasoning and chat capabilities (128K context)
  • llama-3.2-nemoretriever-300m-embed-v1: High-quality text embeddings
  • llama-3.2-nv-rerankqa-1b-v2: Multilingual query-document reranking
  • paddleocr: Ultra-lightweight OCR system
  • nvclip: Multimodal embeddings for image and text

Step 3: Deploy Model Server Cluster

Select Model for Deployment

  1. Click Deploy Model in Model Library
  2. Select target model from available options
  3. Review model requirements and documentation links

Configure Deployment Parameters

Configure the model server cluster settings:

Instance Configuration

  • Server Instances: Default 1 (increase for high availability)
  • Minimum Instances: 1 (0 for scale-to-zero when supported)
  • Maximum Instances: Based on load requirements

Resource Allocation

  • Memory: Configure based on model requirements
  • Text completion models: 64-128 GB
  • Embedding models: 32-64 GB
  • CPU: Default values typically sufficient
  • GPU: Match documented requirements
  • llama-3.3-nemotron: 4 GPUs
  • Other models: 1 GPU typical

Scaling Configuration

  • Concurrent Request Threshold: Configure autoscaling trigger
  • Scale to Zero: Enable when supported (may be unavailable in initial release)

Deploy Model

  1. Review configuration summary
  2. Click Deploy to initiate deployment
  3. Monitor deployment status in Model Server Clusters view

Deployment creates:

  • KServe InferenceService resource
  • Predictor pods with configured resources
  • Service endpoints for model access
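
These resources can also be inspected from the command line; <model-namespace> is a placeholder for the namespace Hybrid Manager provisions for the model server cluster:

# InferenceService status and endpoint URL
kubectl -n <model-namespace> get inferenceservice

# Predictor pods and their readiness
kubectl -n <model-namespace> get pods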

Step 4: Monitor Deployment Status

View Model Server Clusters

Navigate to AI Factory → Model Server Clusters to view:

  • Cluster display name
  • Load balancer ingress URI
  • Deployment status (Active/Healthy, Pending, Failed)
  • Model details and tags
  • Resource utilization metrics

Access Cluster Details

Click on a cluster name to view:

  • Detailed configuration parameters
  • Real-time health metrics
  • Grafana dashboard integration
  • Inference latency charts
  • System health indicators

Step 5: Configure API Access

Generate API Tokens

For external access to deployed models:

  1. Navigate to AI Factory → API Token Management
  2. Click Create Token
  3. Provide token reference name
  4. Store generated token securely

API tokens enable:

  • RAG application authentication
  • External service integration
  • Programmatic model access

Access Endpoints

Models expose OpenAI-compatible endpoints:

Internal Access (within cluster):

http://<service-name>.<namespace>.svc.cluster.local/v1/chat/completions

External Access (with ingress):

https://<ingress-url>/v1/chat/completions
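
A minimal example of calling the chat completions endpoint with an API token; the Authorization header and request body follow the usual OpenAI-compatible conventions and may differ slightly in your deployment:

curl -s https://<ingress-url>/v1/chat/completions \
    -H "Authorization: Bearer <api-token>" \
    -H "Content-Type: application/json" \
    -d '{
          "model": "llama-3.3-nemotron-super-49b-v1",
          "messages": [{"role": "user", "content": "Summarize what NVIDIA NIM provides."}]
        }'

Embedding and reranking models expose their own OpenAI-compatible paths (typically /v1/embeddings and a ranking endpoint) rather than /v1/chat/completions; consult each model's documentation for the exact path.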

Step 6: Build Gen AI Applications

Create Knowledge Base

  1. Navigate to AI Factory → Gen AI Builder → Knowledge Bases
  2. Configure data sources:
     • PostgreSQL databases
     • Document repositories
     • API connections
  3. Select embedding model from deployed services
  4. Process content to generate embeddings

Configure Assistant

  1. Navigate to AI Factory → Assistants
  2. Create new assistant with:
     • Name and directive
     • Model endpoint selection
     • Knowledge base integration
     • Tool configuration
  3. Deploy assistant application

See Create Assistant Guide for detailed procedures.

Step 7: Update and Manage Models

Edit Model Server Cluster

To modify running clusters:

  1. Navigate to Model Server Clusters
  2. Select cluster to edit
  3. Adjust parameters:
     • Instance counts
     • Resource allocations
     • Scaling thresholds
  4. Apply changes (triggers rolling update)

Rolling Updates

Updates maintain availability through:

  • Connection draining before restart
  • Health verification between instances
  • Automatic rollback on failure
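
To observe a rolling update from the command line (namespace is a placeholder):

# Watch predictor pods cycle as the update progresses
kubectl -n <model-namespace> get pods -w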

Troubleshooting

Common Deployment Issues

Model Fails to Start

  • Verify GPU availability matches requirements
  • Check registry authentication secrets
  • Review pod events and logs
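
For example, to review events and logs for a failing predictor (namespace and pod name are placeholders):

# Recent events in the model namespace, most recent last
kubectl -n <model-namespace> get events --sort-by=.lastTimestamp

# Scheduling, image pull, and GPU allocation errors appear in the pod description
kubectl -n <model-namespace> describe pod <predictor-pod-name>

# NIM server logs from the predictor container
kubectl -n <model-namespace> logs <predictor-pod-name>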

High Inference Latency

  • Adjust batch size parameters
  • Increase GPU allocation
  • Scale replicas for load distribution

API Token Issues

  • Verify token hasn't expired
  • Check network policies
  • Confirm ingress configuration

For detailed diagnostics, see Troubleshooting Guide.

Additional Resources

Reference Documentation

NVIDIA Model Documentation