Configure a ClusterServingRuntime

This guide explains how to configure a ClusterServingRuntime in KServe. A ClusterServingRuntime defines the environment used to serve your AI models — specifying container image, resource settings, environment variables, and supported model formats.

For Hybrid Manager users, configuring runtimes is a core step toward enabling Model Serving — see Model Serving in Hybrid Manager.

Goal

Configure a ClusterServingRuntime so it can be used by InferenceServices to deploy models.

Estimated time

5–10 minutes.

What you will accomplish

  • Define a ClusterServingRuntime YAML manifest.
  • Apply it to your Kubernetes cluster.
  • Enable reusable serving configuration for one or more models.

What this unlocks

  • Supports consistent deployment of models using a standard runtime definition.
  • Allows for centralized control over serving images and resource profiles.
  • Required step for deploying NVIDIA NIM containers with KServe.

Prerequisites

  • Kubernetes cluster with KServe installed.
  • Access to a container image registry with the desired model server image.
  • NVIDIA GPU node pool configured (if using GPU-based models).
  • (If required) Kubernetes secret configured for API keys (for example, NVIDIA NGC); an example command follows this list.
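
If your model image requires an NGC API key, the secret referenced by the example manifest below can be created with a command like the following. The secret name nvidia-nim-secrets and the key NGC_API_KEY match the example manifest; substitute your own key value:

kubectl create secret generic nvidia-nim-secrets \
  --from-literal=NGC_API_KEY=<your-ngc-api-key>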

For background concepts, see Model Serving in Hybrid Manager.

Steps

1. Create ClusterServingRuntime YAML

Create a file named ClusterServingRuntime.yaml.

Example:

apiVersion: serving.kserve.io/v1alpha1
kind: ClusterServingRuntime
metadata:
  name: nvidia-nim-llama-3.1-8b-instruct-1.3.3
spec:
  containers:
    - env:
        - name: NIM_CACHE_PATH
          value: /tmp
        - name: NGC_API_KEY
          valueFrom:
            secretKeyRef:
              name: nvidia-nim-secrets
              key: NGC_API_KEY
      image: upmdev.azurecr.io/nim/meta/llama-3.1-8b-instruct:1.3.3
      name: kserve-container
      ports:
        - containerPort: 8000
          protocol: TCP
      resources:
        limits:
          cpu: "12"
          memory: 64Gi
        requests:
          cpu: "12"
          memory: 64Gi
      volumeMounts:
        - mountPath: /dev/shm
          name: dshm
  imagePullSecrets:
    - name: edb-cred
  protocolVersions:
    - v2
    - grpc-v2
  supportedModelFormats:
    - autoSelect: true
      name: nvidia-nim-llama-3.1-8b-instruct
      priority: 1
      version: "1.3.3"
  volumes:
    - emptyDir:
        medium: Memory
        sizeLimit: 16Gi
      name: dshm

Key fields explained:

  • containers.image: The model server container image (for example, an NVIDIA NIM image).
  • resources: CPU and memory requests and limits for the serving container; add GPU resources if the model requires them (an example follows this list).
  • NGC_API_KEY: Environment variable populated from a Kubernetes secret; used by NVIDIA NIM images to authenticate with NGC.
  • supportedModelFormats: The model format names this runtime can serve. An InferenceService's modelFormat is matched against these entries; autoSelect and priority control automatic runtime selection.
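
For GPU-backed models, the resources block would typically also request GPUs. A minimal sketch, assuming the NVIDIA device plugin exposes GPUs as nvidia.com/gpu (the GPU count shown is illustrative):

resources:
  limits:
    cpu: "12"
    memory: 64Gi
    nvidia.com/gpu: "1"   # illustrative GPU count
  requests:
    cpu: "12"
    memory: 64Gi
    nvidia.com/gpu: "1"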

2. Apply the ClusterServingRuntime

Run:

kubectl apply -f ClusterServingRuntime.yaml
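
If the manifest is accepted, kubectl prints a confirmation similar to:

clusterservingruntime.serving.kserve.io/nvidia-nim-llama-3.1-8b-instruct-1.3.3 created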

3. Verify deployed ClusterServingRuntime

Run:

kubectl get ClusterServingRuntime

Output:

NAME                                     AGE
nvidia-nim-llama-3.1-8b-instruct-1.3.3   1m

You can inspect full details with:

kubectl get ClusterServingRuntime <name> -o yaml
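
For example, to print only the model formats the runtime advertises, you can use a jsonpath query such as:

kubectl get ClusterServingRuntime nvidia-nim-llama-3.1-8b-instruct-1.3.3 \
  -o jsonpath='{.spec.supportedModelFormats[*].name}'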

4. Reference runtime in InferenceService

When you create your InferenceService, reference this runtime under spec.predictor.model:

runtime: nvidia-nim-llama-3.1-8b-instruct-1.3.3
modelFormat:
  name: nvidia-nim-llama-3.1-8b-instruct

See Deploy an NVIDIA NIM container with KServe.
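
For orientation, a minimal InferenceService sketch that targets this runtime might look like the following. The metadata values are illustrative; the deployment guide covers the full manifest:

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: llama-3-1-8b-instruct   # illustrative name
  namespace: default            # illustrative namespace
spec:
  predictor:
    model:
      modelFormat:
        name: nvidia-nim-llama-3.1-8b-instruct
      runtime: nvidia-nim-llama-3.1-8b-instruct-1.3.3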

Notes

  • Runtimes are reusable — you can deploy multiple models referencing the same ClusterServingRuntime.
  • Use meaningful names and version fields in supportedModelFormats for traceability.
  • You can update a runtime by editing the YAML and re-applying it, as shown in the example after this list.
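
For example, after changing the manifest you can re-apply it, or edit the live object directly for a quick change (keeping the YAML file as the source of truth is generally preferable):

kubectl apply -f ClusterServingRuntime.yaml
kubectl edit clusterservingruntime nvidia-nim-llama-3.1-8b-instruct-1.3.3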

Next steps

  • Deploy an NVIDIA NIM container with KServe.
  • Create an InferenceService that references this runtime.

