Monitor deployed models with KServe
This guide explains how to monitor deployed AI models using KServe on Kubernetes.
Monitoring your models helps ensure reliability, performance, and efficient use of resources, whether you run KServe on a general Kubernetes cluster or through Hybrid Manager AI Factory.
For AI Factory users, Hybrid Manager will provide additional monitoring and observability features; see Model Serving in Hybrid Manager.
Goal
Monitor deployed models, check model status and serving endpoints, and retrieve resource usage information.
Estimated time
5–10 minutes.
What you will accomplish
- List deployed InferenceServices (models).
- Retrieve model endpoint and runtime details.
- Understand how to observe model performance.
- Prepare to integrate model metrics into observability pipelines.
What this unlocks
- Confidence that models are correctly deployed and serving.
- Ability to troubleshoot or scale model deployments.
- Foundation for using Hybrid Manager AI Factory observability for model serving.
Prerequisites
- A deployed InferenceService on KServe.
- A ClusterServingRuntime defined.
- kubectl configured for your Kubernetes cluster.
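As an optional sanity check before you start, you can confirm that kubectl can reach the cluster and that the KServe custom resource definitions are installed:

```shell
# Optional sanity check: confirm kubectl connectivity and that the KServe
# CRDs for InferenceService and ClusterServingRuntime are installed.
kubectl get crd inferenceservices.serving.kserve.io clusterservingruntimes.serving.kserve.io
```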
For background concepts, see:
Steps
1. List deployed InferenceServices
To list deployed models and see key details:
```shell
kubectl get InferenceService \
  -o custom-columns=NAME:.metadata.name,MODEL:.spec.predictor.model.modelFormat.name,URL:.status.address.url,RUNTIME:.spec.predictor.model.runtime,GPUs:.spec.predictor.model.resources.limits.nvidia\\.com/gpu \
  --namespace=default
```
Key columns:
- NAME: Name of the InferenceService.
- MODEL: Model format name (from ClusterServingRuntime).
- URL: Service endpoint for inference requests.
- RUNTIME: ClusterServingRuntime used.
- GPUs: Number of GPUs allocated.
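To confirm that a specific model is ready to serve, you can also query its Ready condition directly. The name demo-model below is a placeholder; substitute the name of your InferenceService.

```shell
# Print the Ready condition of a single InferenceService
# (demo-model is a placeholder name).
kubectl get inferenceservice demo-model --namespace=default \
  -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}'
```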
2. Retrieve runtime details
To view ClusterServingRuntime details, including serving port and resource allocations:
```shell
kubectl get ClusterServingRuntimes \
  -o custom-columns=NAME:.metadata.name,IMAGE:.spec.containers[0].image,PORT:.spec.containers[0].ports[0].containerPort,CPUs:.spec.containers[0].resources.limits.cpu,MEMORY:.spec.containers[0].resources.limits.memory
```

ClusterServingRuntime is a cluster-scoped resource, so no namespace flag is needed.
Key columns:
- NAME: Name of the runtime.
- IMAGE: Model server image used.
- PORT: Inference port (commonly 8000 for NIM).
- CPUs: CPU resources allocated.
- MEMORY: Memory allocated.
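To dig deeper into a single runtime, including its container arguments, environment variables, and supported model formats, you can dump the full resource. The runtime name below is a placeholder; use a name from the listing above.

```shell
# Show the full definition of one runtime
# (nvidia-nim-runtime is a placeholder name).
kubectl get clusterservingruntime nvidia-nim-runtime -o yaml
```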
3. Observe model metrics
If you enabled Prometheus scraping via InferenceService annotations:
- serving.kserve.io/enable-prometheus-scraping: "true"
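The annotation is set in the InferenceService metadata. As a minimal, illustrative sketch (the name, model format, runtime, and GPU count are placeholders and should match your actual deployment), an annotated InferenceService could be applied like this:

```shell
# Illustrative only: a minimal InferenceService with Prometheus scraping
# enabled. Name, modelFormat, runtime, and GPU count are placeholders.
cat <<'EOF' | kubectl apply -f -
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: demo-model
  namespace: default
  annotations:
    serving.kserve.io/enable-prometheus-scraping: "true"
spec:
  predictor:
    model:
      modelFormat:
        name: nvidia-nim
      runtime: nvidia-nim-runtime
      resources:
        limits:
          nvidia.com/gpu: "1"
EOF
```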
Then Prometheus can scrape metrics at:
/v1/metrics on port 8000 of the model service.
Metrics typically include:
- Request latency
- Throughput (requests per second)
- Error rates
- GPU utilization (if GPUs used)
You can visualize these metrics in tools such as Grafana.
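To verify that the metrics endpoint responds before wiring up Prometheus or Grafana, you can port-forward to a predictor pod and fetch the metrics directly. The pod name below is illustrative; use a name from kubectl get pods, and adjust the port if your runtime exposes metrics elsewhere.

```shell
# Port-forward to a predictor pod (placeholder name) and fetch metrics
# locally. Assumes the server exposes metrics on port 8000 at /v1/metrics.
kubectl port-forward pod/demo-model-predictor-abc123 8000:8000 --namespace=default
# In a second terminal:
curl -s http://localhost:8000/v1/metrics | head -n 20
```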
4. Check pod status
For debugging, you can also view model pods directly:
```shell
kubectl get pods --namespace=default
```
Look for pods with names matching:
<inference-service-name>-predictor-*
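If you prefer not to rely on name patterns, KServe also labels predictor pods with the owning InferenceService, so you can filter by label. The label key below is the one KServe typically applies; verify it with kubectl get pods --show-labels if in doubt.

```shell
# List pods belonging to one InferenceService by label
# (demo-model is a placeholder name).
kubectl get pods -l serving.kserve.io/inferenceservice=demo-model --namespace=default
```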
Check pod status and logs if needed:
```shell
kubectl logs <pod-name> --namespace=default
```
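When logs alone do not explain a problem, describing the pod surfaces scheduling events, image pull errors, and failed probes. The pod name below is a placeholder.

```shell
# Show pod events and status details (placeholder pod name).
kubectl describe pod demo-model-predictor-abc123 --namespace=default
```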
Next steps
- Update GPU resources for a deployed model (Coming soon)
- Deploy additional NVIDIA NIM models
- Model Serving in Hybrid Manager (Coming soon)
Related reading