Configure a ClusterServingRuntime

This guide explains how to configure a ClusterServingRuntime in KServe. A ClusterServingRuntime defines the environment used to serve your AI models — specifying container image, resource settings, environment variables, and supported model formats.

For Hybrid Manager users, configuring runtimes is a core step toward enabling Model Serving — see Model Serving in Hybrid Manager.

Goal

Configure a ClusterServingRuntime so it can be used by InferenceServices to deploy models.

Estimated time

5–10 minutes.

What you will accomplish

  • Define a ClusterServingRuntime YAML manifest.
  • Apply it to your Kubernetes cluster.
  • Enable reusable serving configuration for one or more models.

What this unlocks

  • Supports consistent deployment of models using a standard runtime definition.
  • Allows for centralized control over serving images and resource profiles.
  • Required step for deploying NVIDIA NIM containers with KServe.

Prerequisites

  • Kubernetes cluster with KServe installed.
  • Access to a container image registry with the desired model server image.
  • NVIDIA GPU node pool configured (if using GPU-based models).
  • (If required) Kubernetes secret configured for API keys (for example, NVIDIA NGC); an example command follows this list.
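
If your model image requires an NGC API key, the secret referenced by the example manifest below can be created with a command like the following. The secret name nvidia-nim-secrets and the key NGC_API_KEY match the example manifest; substitute your own key value:

kubectl create secret generic nvidia-nim-secrets \
  --from-literal=NGC_API_KEY=<your-ngc-api-key>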

For background concepts, see Model Serving in Hybrid Manager.

Steps

1. Create ClusterServingRuntime YAML

Create a file named ClusterServingRuntime.yaml.

Example:

apiVersion: serving.kserve.io/v1alpha1
kind: ClusterServingRuntime
metadata:
  name: nvidia-nim-llama-3.1-8b-instruct-1.3.3
spec:
  containers:
    - env:
        - name: NIM_CACHE_PATH
          value: /tmp
        - name: NGC_API_KEY
          valueFrom:
            secretKeyRef:
              name: nvidia-nim-secrets
              key: NGC_API_KEY
      image: upmdev.azurecr.io/nim/meta/llama-3.1-8b-instruct:1.3.3
      name: kserve-container
      ports:
        - containerPort: 8000
          protocol: TCP
      resources:
        limits:
          cpu: "12"
          memory: 64Gi
        requests:
          cpu: "12"
          memory: 64Gi
      volumeMounts:
        - mountPath: /dev/shm
          name: dshm
  imagePullSecrets:
    - name: edb-cred
  protocolVersions:
    - v2
    - grpc-v2
  supportedModelFormats:
    - autoSelect: true
      name: nvidia-nim-llama-3.1-8b-instruct
      priority: 1
      version: "1.3.3"
  volumes:
    - emptyDir:
        medium: Memory
        sizeLimit: 16Gi
      name: dshm

Key fields explained:

  • containers.image: The model server container image (for example, an NVIDIA NIM image).
  • resources: CPU and memory requests and limits for the serving container; add GPU resources if the model requires them (an example follows this list).
  • NGC_API_KEY: Environment variable populated from a Kubernetes secret; used by NVIDIA NIM images to authenticate with NGC.
  • supportedModelFormats: The model format names this runtime can serve. An InferenceService's modelFormat is matched against these entries; autoSelect and priority control automatic runtime selection.
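
For GPU-backed models, the resources block would typically also request GPUs. A minimal sketch, assuming the NVIDIA device plugin exposes GPUs as nvidia.com/gpu (the GPU count shown is illustrative):

resources:
  limits:
    cpu: "12"
    memory: 64Gi
    nvidia.com/gpu: "1"   # illustrative GPU count
  requests:
    cpu: "12"
    memory: 64Gi
    nvidia.com/gpu: "1"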

2. Apply the ClusterServingRuntime

Run:

kubectl apply -f ClusterServingRuntime.yaml
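
If the manifest is accepted, kubectl prints a confirmation similar to:

clusterservingruntime.serving.kserve.io/nvidia-nim-llama-3.1-8b-instruct-1.3.3 created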

3. Verify deployed ClusterServingRuntime

Run:

kubectl get ClusterServingRuntime

Output:

NAME                                     AGE
nvidia-nim-llama-3.1-8b-instruct-1.3.3   1m

You can inspect full details with:

kubectl get ClusterServingRuntime <name> -o yaml
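
For example, to print only the model formats the runtime advertises, you can use a jsonpath query such as:

kubectl get ClusterServingRuntime nvidia-nim-llama-3.1-8b-instruct-1.3.3 \
  -o jsonpath='{.spec.supportedModelFormats[*].name}'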

4. Reference runtime in InferenceService

When you create your InferenceService, reference this runtime under spec.predictor.model:

runtime: nvidia-nim-llama-3.1-8b-instruct-1.3.3
modelFormat:
  name: nvidia-nim-llama-3.1-8b-instruct

See Deploy an NVIDIA NIM container with KServe.
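
For orientation, a minimal InferenceService sketch that targets this runtime might look like the following. The metadata values are illustrative; the deployment guide covers the full manifest:

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: llama-3-1-8b-instruct   # illustrative name
  namespace: default            # illustrative namespace
spec:
  predictor:
    model:
      modelFormat:
        name: nvidia-nim-llama-3.1-8b-instruct
      runtime: nvidia-nim-llama-3.1-8b-instruct-1.3.3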

Notes

  • Runtimes are reusable — you can deploy multiple models referencing the same ClusterServingRuntime.
  • Use meaningful names and version fields in supportedModelFormats for traceability.
  • You can update a runtime by editing the YAML and re-applying it, as shown in the example after this list.
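
For example, after changing the manifest you can re-apply it, or edit the live object directly for a quick change (keeping the YAML file as the source of truth is generally preferable):

kubectl apply -f ClusterServingRuntime.yaml
kubectl edit clusterservingruntime nvidia-nim-llama-3.1-8b-instruct-1.3.3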

Next steps

  • Deploy an NVIDIA NIM container with KServe.
  • Create an InferenceService that references this runtime.

