Monitor deployed models with KServe

This guide explains how to monitor deployed AI models using KServe on Kubernetes.

Monitoring your models helps ensure reliability, performance, and efficient use of resources, whether you are running KServe on a general Kubernetes cluster or through Hybrid Manager AI Factory.

For AI Factory users, Hybrid Manager provides additional monitoring and observability features. See Model Serving in Hybrid Manager.

Goal

Monitor deployed models, check model status and serving endpoints, and retrieve resource usage information.

Estimated time

5–10 minutes.

What you will accomplish

  • List deployed InferenceServices (models).
  • Retrieve model endpoint and runtime details.
  • Understand how to observe model performance.
  • Prepare to integrate model metrics into observability pipelines.

What this unlocks

  • Confidence that models are correctly deployed and serving.
  • Ability to troubleshoot or scale model deployments.
  • Foundation for using Hybrid Manager AI Factory observability for model serving.

Prerequisites

  • Deployed InferenceService on KServe.
  • ClusterServingRuntime defined.
  • kubectl configured for your Kubernetes cluster.
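
To confirm that kubectl can reach your cluster and that the KServe resource types are available, you can list the resources in the serving.kserve.io API group. This is a quick sanity check, assuming a standard KServe installation:

kubectl api-resources --api-group=serving.kserve.io

The output should include inferenceservices and clusterservingruntimes, among others.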

Steps

1. List deployed InferenceServices

To list deployed models and see key details:

kubectl get inferenceservices \
-o custom-columns='NAME:.metadata.name,MODEL:.spec.predictor.model.modelFormat.name,URL:.status.address.url,RUNTIME:.spec.predictor.model.runtime,GPUs:.spec.predictor.model.resources.limits.nvidia\.com/gpu' \
--namespace=default

Key columns:

  • NAME: Name of the InferenceService.
  • MODEL: Model format name (from ClusterServingRuntime).
  • URL: Service endpoint for inference requests.
  • RUNTIME: ClusterServingRuntime used.
  • GPUs: Number of GPUs allocated.
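
For a quicker readiness check, the default output already shows the READY state and public URL; isvc is the short name KServe registers for InferenceService (assuming standard KServe CRDs):

kubectl get isvc --namespace=default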

2. Retrieve runtime details

To view ClusterServingRuntime details, including the serving port and resource allocations (ClusterServingRuntimes are cluster-scoped, so no namespace flag is needed):

kubectl get clusterservingruntimes \
-o custom-columns='NAME:.metadata.name,IMAGE:.spec.containers[0].image,PORT:.spec.containers[0].ports[0].containerPort,CPUs:.spec.containers[0].resources.limits.cpu,MEMORY:.spec.containers[0].resources.limits.memory'

Key columns:

  • NAME: Name of the runtime.
  • IMAGE: Model server image used.
  • PORT: Inference port (commonly 8000 for NIM).
  • CPUs: CPU resources allocated.
  • MEMORY: Memory allocated.
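
To inspect a single runtime in full, including container arguments and environment variables, you can describe it; the runtime name below is a placeholder:

kubectl describe clusterservingruntime <runtime-name>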

3. Observe model metrics

If you enabled Prometheus scraping by adding the following annotation to your InferenceService:

  • serving.kserve.io/enable-prometheus-scraping: "true"

Prometheus can scrape metrics from /v1/metrics on port 8000 of the model service.
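
If the annotation is not already set, one way to add it to an existing InferenceService is with kubectl annotate. This is a sketch: my-model is a placeholder name, and existing predictor pods may need to be recreated before the change takes effect.

kubectl annotate inferenceservice my-model \
serving.kserve.io/enable-prometheus-scraping="true" \
--namespace=default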

Metrics typically include:

  • Request latency
  • Throughput (requests per second)
  • Error rates
  • GPU utilization (if GPUs used)

You can visualize these metrics in tools such as Grafana.
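
You can also spot-check the endpoint without Prometheus by port-forwarding to the predictor service and fetching the metrics directly. A minimal sketch, assuming the predictor service is named my-model-predictor and serves on port 8000 (the exact service name depends on your deployment mode):

kubectl port-forward service/my-model-predictor 8000:8000 --namespace=default
# In a second terminal:
curl http://localhost:8000/v1/metrics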

4. Check pod status

For debugging, you can also view model pods directly:

kubectl get pods --namespace=default

Look for pods with names matching:

<inference-service-name>-predictor-*
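
If the namespace contains many pods, you can filter by label instead of matching names; this assumes the standard KServe pod label serving.kserve.io/inferenceservice:

kubectl get pods \
-l serving.kserve.io/inferenceservice=<inference-service-name> \
--namespace=default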

Check pod status and logs if needed:

kubectl logs <pod-name> --namespace=default
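
If the logs look normal but the pod is not becoming ready, describing the pod and reviewing its Events section often surfaces scheduling, GPU allocation, or image pull problems:

kubectl describe pod <pod-name> --namespace=default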

Next steps

For deeper monitoring and observability of served models in AI Factory, see Model Serving in Hybrid Manager.