Deploy an NVIDIA NIM container with KServe

This guide explains how to deploy an NVIDIA NIM container using KServe on a Kubernetes cluster.

While the steps here apply to general Kubernetes environments, Hybrid Manager AI Factory adds lifecycle management, observability, and simplified integration on top of them. Learn more in the Model Serving in Hybrid Manager section.

Goal

Deploy an NVIDIA NIM container using KServe to create a network-accessible inference service that can be consumed by applications.

Estimated time

15–30 minutes depending on cluster setup.

What you will accomplish

  • Define and deploy a ClusterServingRuntime for an NVIDIA NIM container.
  • Deploy an InferenceService that uses this runtime.
  • Validate your deployment and retrieve the model endpoint.

What this unlocks

  • Serve NVIDIA NIM models via standard inference protocols (HTTP/gRPC).
  • Integrate these models with applications or tools such as Griptape (Gen AI Builder) or AIDB Knowledge Bases.
  • Build a foundation for using Hybrid Manager AI Factory model-serving capabilities.

Prerequisites

  • Kubernetes cluster with KServe installed.
  • GPU node pool configured (with NVIDIA device plugin).
  • NVIDIA NIM container image available in a private registry or NGC.
  • Kubernetes secret containing your NGC API key (an example of creating the required secrets follows this list).
  • kubectl configured for your cluster.
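
For example, the secrets referenced by the manifests in this guide (nvidia-nim-secrets for the NGC API key and edb-cred for pulling the container image) could be created as follows. The key value and registry credentials are placeholders for your own values:

kubectl create secret generic nvidia-nim-secrets \
  --from-literal=NGC_API_KEY=<your-ngc-api-key> \
  --namespace default

kubectl create secret docker-registry edb-cred \
  --docker-server=<your-registry> \
  --docker-username=<username> \
  --docker-password=<password> \
  --namespace default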

For background concepts, see the Model Serving in Hybrid Manager section.

Steps

1. Create ClusterServingRuntime

Define ClusterServingRuntime.yaml:

apiVersion: serving.kserve.io/v1alpha1
kind: ClusterServingRuntime
metadata:
  name: nvidia-nim-llama-3.1-8b-instruct-1.3.3
spec:
  containers:
    - env:
        - name: NIM_CACHE_PATH
          value: /tmp
        - name: NGC_API_KEY
          valueFrom:
            secretKeyRef:
              name: nvidia-nim-secrets
              key: NGC_API_KEY
      image: your-registry/nim/meta/llama-3.1-8b-instruct:1.3.3
      name: kserve-container
      ports:
        - containerPort: 8000
          protocol: TCP
      resources:
        limits:
          cpu: "12"
          memory: 64Gi
        requests:
          cpu: "12"
          memory: 64Gi
      volumeMounts:
        - mountPath: /dev/shm
          name: dshm
  imagePullSecrets:
    - name: edb-cred
  protocolVersions:
    - v2
    - grpc-v2
  supportedModelFormats:
    - autoSelect: true
      name: nvidia-nim-llama-3.1-8b-instruct
      priority: 1
      version: "1.3.3"
  volumes:
    - emptyDir:
        medium: Memory
        sizeLimit: 16Gi
      name: dshm

Apply the runtime:

kubectl apply -f ClusterServingRuntime.yaml
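
To confirm the runtime was registered, you can list it by name (the name matches the manifest above):

kubectl get clusterservingruntime nvidia-nim-llama-3.1-8b-instruct-1.3.3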

2. Create InferenceService

Define InferenceService.yaml:

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  annotations:
    serving.kserve.io/enable-prometheus-scraping: "true"
    prometheus.kserve.io/port: "8000"
    prometheus.kserve.io/path: "/v1/metrics"
  name: llama-3-1-8b-instruct-1xgpu
  namespace: default
spec:
  predictor:
    tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
    nodeSelector:
      nvidia.com/gpu: "true"
    imagePullSecrets:
      - name: edb-cred
    model:
      modelFormat:
        name: nvidia-nim-llama-3.1-8b-instruct
      resources:
        limits:
          nvidia.com/gpu: "1"
        requests:
          nvidia.com/gpu: "1"
      runtime: nvidia-nim-llama-3.1-8b-instruct-1.3.3

Deploy the InferenceService:

kubectl apply -f InferenceService.yaml
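
The NIM container image is large and can take several minutes to pull and start. One way to wait until the service reports ready (the name matches the manifest above) is:

kubectl wait --for=condition=Ready \
  inferenceservice/llama-3-1-8b-instruct-1xgpu \
  --namespace default \
  --timeout=15m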

3. Verify deployed models

List active InferenceServices:

kubectl get InferenceService \
-o custom-columns=NAME:.metadata.name,MODEL:.spec.predictor.model.modelFormat.name,URL:.status.address.url,RUNTIME:.spec.predictor.model.runtime,GPUs:.spec.predictor.model.resources.limits.nvidia\\.com/gpu \
--namespace=default
Output
NAME                           MODEL                              URL                                                                         RUNTIME                                  GPUs
llama-3-1-8b-instruct-1xgpu    nvidia-nim-llama-3.1-8b-instruct   http://llama-3-1-8b-instruct-1xgpu.default.svc.cluster.local                 nvidia-nim-llama-3.1-8b-instruct-1.3.3   1
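
Once the service reports a URL, you can send a test request from inside the cluster. NIM exposes an OpenAI-compatible API; this example assumes the chat completions endpoint and the model name meta/llama-3.1-8b-instruct, which may differ for your image:

curl -s http://llama-3-1-8b-instruct-1xgpu.default.svc.cluster.local/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta/llama-3.1-8b-instruct",
    "messages": [{"role": "user", "content": "Write a haiku about Kubernetes."}],
    "max_tokens": 64
  }'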

4. Retrieve runtime details

Check the runtime's port and resources. ClusterServingRuntime is cluster-scoped, so no namespace flag is needed:

kubectl get ClusterServingRuntimes \
-o custom-columns=NAME:.metadata.name,IMAGE:.spec.containers[0].image,PORT:.spec.containers[0].ports[0].containerPort,CPUs:.spec.containers[0].resources.limits.cpu,MEMORY:.spec.containers[0].resources.limits.memory
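
Based on the runtime defined earlier, the output should resemble:

NAME                                     IMAGE                                                PORT   CPUs   MEMORY
nvidia-nim-llama-3.1-8b-instruct-1.3.3   your-registry/nim/meta/llama-3.1-8b-instruct:1.3.3   8000   12     64Gi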

Next steps

