Monitor deployed models with KServe
This guide explains how to monitor deployed AI models using KServe on Kubernetes.
Monitoring your models helps ensure reliability, performance, and efficient use of resources, whether you run KServe on a general Kubernetes cluster or through Hybrid Manager AI Factory.
For AI Factory users, Hybrid Manager will provide additional monitoring and observability features; see Model Serving in Hybrid Manager.
Goal
Monitor deployed models, check model status and serving endpoints, and retrieve resource usage information.
Estimated time
5–10 minutes.
What you will accomplish
- List deployed InferenceServices (models).
- Retrieve model endpoint and runtime details.
- Understand how to observe model performance.
- Prepare to integrate model metrics into observability pipelines.
What this unlocks
- Confidence that models are correctly deployed and serving.
- Ability to troubleshoot or scale model deployments.
- Foundation for using Hybrid Manager AI Factory observability for model serving.
Prerequisites
- A deployed InferenceService on KServe.
- A ClusterServingRuntime defined.
- kubectl configured for your Kubernetes cluster.
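As an optional sanity check before you start, you can confirm that kubectl can reach the cluster and that the KServe custom resource definitions are installed:

```shell
# Optional sanity check: confirm kubectl connectivity and that the KServe
# CRDs for InferenceService and ClusterServingRuntime are installed.
kubectl get crd inferenceservices.serving.kserve.io clusterservingruntimes.serving.kserve.io
```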
For background concepts, see:
Steps
1. List deployed InferenceServices
To list deployed models and see key details:
```shell
kubectl get InferenceService \
  -o custom-columns=NAME:.metadata.name,MODEL:.spec.predictor.model.modelFormat.name,URL:.status.address.url,RUNTIME:.spec.predictor.model.runtime,GPUs:.spec.predictor.model.resources.limits.nvidia\\.com/gpu \
  --namespace=default
```
Key columns:
- NAME: Name of the InferenceService.
- MODEL: Model format name (from ClusterServingRuntime).
- URL: Service endpoint for inference requests.
- RUNTIME: ClusterServingRuntime used.
- GPUs: Number of GPUs allocated.
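To confirm that a specific model is ready to serve, you can also query its Ready condition directly. The name demo-model below is a placeholder; substitute the name of your InferenceService.

```shell
# Print the Ready condition of a single InferenceService
# (demo-model is a placeholder name).
kubectl get inferenceservice demo-model --namespace=default \
  -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}'
```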
2. Retrieve runtime details
To view ClusterServingRuntime details, including serving port and resource allocations:
```shell
kubectl get ClusterServingRuntimes \
  -o custom-columns=NAME:.metadata.name,IMAGE:.spec.containers[0].image,PORT:.spec.containers[0].ports[0].containerPort,CPUs:.spec.containers[0].resources.limits.cpu,MEMORY:.spec.containers[0].resources.limits.memory
```

ClusterServingRuntime is a cluster-scoped resource, so no namespace flag is needed.
Key columns:
- NAME: Name of the runtime.
- IMAGE: Model server image used.
- PORT: Inference port (commonly 8000 for NIM).
- CPUs: CPU resources allocated.
- MEMORY: Memory allocated.
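To dig deeper into a single runtime, including its container arguments, environment variables, and supported model formats, you can dump the full resource. The runtime name below is a placeholder; use a name from the listing above.

```shell
# Show the full definition of one runtime
# (nvidia-nim-runtime is a placeholder name).
kubectl get clusterservingruntime nvidia-nim-runtime -o yaml
```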
3. Observe model metrics
If you enabled Prometheus scraping via InferenceService annotations:
- serving.kserve.io/enable-prometheus-scraping: "true"
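The annotation is set in the InferenceService metadata. As a minimal, illustrative sketch (the name, model format, runtime, and GPU count are placeholders and should match your actual deployment), an annotated InferenceService could be applied like this:

```shell
# Illustrative only: a minimal InferenceService with Prometheus scraping
# enabled. Name, modelFormat, runtime, and GPU count are placeholders.
cat <<'EOF' | kubectl apply -f -
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: demo-model
  namespace: default
  annotations:
    serving.kserve.io/enable-prometheus-scraping: "true"
spec:
  predictor:
    model:
      modelFormat:
        name: nvidia-nim
      runtime: nvidia-nim-runtime
      resources:
        limits:
          nvidia.com/gpu: "1"
EOF
```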
Then Prometheus can scrape metrics at:
/v1/metrics on port 8000 of the model service.
Metrics typically include:
- Request latency
- Throughput (requests per second)
- Error rates
- GPU utilization (if GPUs used)
You can visualize these metrics in tools such as Grafana.
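To verify that the metrics endpoint responds before wiring up Prometheus or Grafana, you can port-forward to a predictor pod and fetch the metrics directly. The pod name below is illustrative; use a name from kubectl get pods, and adjust the port if your runtime exposes metrics elsewhere.

```shell
# Port-forward to a predictor pod (placeholder name) and fetch metrics
# locally. Assumes the server exposes metrics on port 8000 at /v1/metrics.
kubectl port-forward pod/demo-model-predictor-abc123 8000:8000 --namespace=default
# In a second terminal:
curl -s http://localhost:8000/v1/metrics | head -n 20
```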
4. Check pod status
For debugging, you can also view model pods directly:
```shell
kubectl get pods --namespace=default
```
Look for pods with names matching:
<inference-service-name>-predictor-*
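If you prefer not to rely on name patterns, KServe also labels predictor pods with the owning InferenceService, so you can filter by label. The label key below is the one KServe typically applies; verify it with kubectl get pods --show-labels if in doubt.

```shell
# List pods belonging to one InferenceService by label
# (demo-model is a placeholder name).
kubectl get pods -l serving.kserve.io/inferenceservice=demo-model --namespace=default
```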
Check pod status and logs if needed:
```shell
kubectl logs <pod-name> --namespace=default
```
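When logs alone do not explain a problem, describing the pod surfaces scheduling events, image pull errors, and failed probes. The pod name below is a placeholder.

```shell
# Show pod events and status details (placeholder pod name).
kubectl describe pod demo-model-predictor-abc123 --namespace=default
```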
Next steps
- Update GPU resources for a deployed model (Coming soon)
- Deploy additional NVIDIA NIM models
- Model Serving in Hybrid Manager (Coming soon)
Related reading