Accessing KServe endpoints (internal and external) v1.3

Prerequisite: Access to the Hybrid Manager UI with AI Factory enabled. See /edb-postgres-ai/1.3/hybrid-manager/ai-factory/.

Use these steps to invoke model endpoints deployed with KServe.

Why and when to use this

Accessing KServe endpoints lets your applications consume private, governed model inference without depending on public AI APIs. You choose the access path that matches your security posture and network topology:

  • Internal access: Use when callers run inside the same Kubernetes environment (for example, Gen AI Builder Assistants, Pipelines jobs). This path avoids external exposure, skips a gateway hop, and thus reduces latency.
  • External access: Use when callers live outside the cluster (for example, customer‑facing apps, partner services, or shared enterprise APIs). The portal with access keys provides a controlled boundary while keeping inference sovereign.

Typical reasons to use KServe access:

  • Keep model inference in your environment for data protection, compliance, and cost control.
  • Serve OpenAI‑compatible endpoints to simplify client integration while retaining governance.
  • Support air‑gapped or restricted network environments where public AI services are not allowed.

Prerequisites

  • A deployed InferenceService. See Create an InferenceService.
  • For external access: a portal endpoint and a user access key from your Hybrid Manager environment.
  • For internal access: network access to the service inside the Kubernetes cluster.

Internal access (inside the cluster)

Call the in‑cluster address that fronts KServe (cluster‑local DNS). Replace placeholders with your values.

Example (proxy service path style):

curl -X POST \
  'http://upm-kserve-model-proxy.upm-ai-model-server.svc.cluster.local:80/inferenceservices/<inferenceservice-id>/<endpoint-path>' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{ ... }'
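
A complete internal call to an embeddings model might look like the following sketch; the InferenceService ID and model name are illustrative placeholders, and the /v1/embeddings suffix assumes an embedding model (see Endpoint paths by operation below).

curl -X POST \
  'http://upm-kserve-model-proxy.upm-ai-model-server.svc.cluster.local:80/inferenceservices/<inferenceservice-id>/v1/embeddings' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "<model-name>",
    "input": "sample text to embed"
  }'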

Example (direct KServe service URL, when available in the resource status):

status:
  address:
    url: http://<inferenceservice-name>-predictor.<namespace>.svc.cluster.local

Use that URL in your HTTP client to invoke the model.
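
For example, a chat completions request against that direct in‑cluster address could look like the sketch below; the service name, namespace, and model name are placeholders, and the /v1/chat/completions suffix assumes an LLM deployment.

curl -X POST \
  'http://<inferenceservice-name>-predictor.<namespace>.svc.cluster.local/v1/chat/completions' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "<model-name>",
    "messages": [{"role": "user", "content": "Hello"}],
    "max_tokens": 64
  }'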

External access (outside the cluster)

Call the portal endpoint and include your Hybrid Manager user access key as a Bearer token. Replace placeholders with your values.

curl -X POST \
  'https://<portal_domain>:<portal_port>/inferenceservices/<inferenceservice-id>/<endpoint-path>' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -H "Authorization: Bearer a" \
  -d '{ ... }'

Example (OpenAI‑compatible chat completions):

curl -X POST \
  'https://portal-HM-pm.edbHM.com/inferenceservices/m-po7d0fta/v1/chat/completions' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -H "Authorization: Bearer a" \
  -d '{
    "model": "meta/llama-3.1-8b-instruct",
    "messages": [{"role":"user","content":"Write a limerick about the wonders of GPU computing."}],
    "max_tokens": 64
  }'

Internal vs. external URLs — where to find them

Use internal URLs for in‑cluster callers and the external portal URL for callers outside Kubernetes.

  • Internal URL (cluster‑local):
    • From SQL (AIDB): run SELECT * FROM aidb.list_HM_models(); to see internal service URLs for known models within the database context.
    • From Kubernetes: check the status.address.url on the InferenceService resource in the namespace (for example, http://<name>-predictor.<ns>.svc.cluster.local); see the kubectl sketch after this list.
    • Note: Some tasks require an operation suffix (for example, embeddings require /v1/embeddings). If a listed URL returns 404, append the proper suffix.
  • External URL (portal):
    • In Hybrid Manager UI: open AI Factory → Model Serving → select your InferenceService → copy the portal endpoint (base) for the service. External calls use https://<portal>/inferenceservices/<id>/... with an HM user access key.
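
For the Kubernetes lookup above, a minimal kubectl sketch (assuming you have cluster access and know the InferenceService name and namespace) is:

kubectl get inferenceservice <inferenceservice-name> -n <namespace> \
  -o jsonpath='{.status.address.url}'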

Endpoint paths by operation

KServe exposes different paths per operation. Use the path that matches your model/task:

  • Chat completions (LLMs): <base>/v1/chat/completions
  • Embeddings (NIM embedding models): <base>/v1/embeddings
  • Rerank (ranking models): <base>/v1/ranking

If you see HTTP 404 Not Found when calling a model, verify that you’re using the correct path for the operation.

Payloads by operation

Different tasks use different payload shapes. The examples below show common OpenAI‑compatible schemas; a complete curl invocation that combines a base URL with one of these payloads follows the list.

  • Chat completions

    • Endpoint: <base>/v1/chat/completions
    • Payload:
      {
        "model": "<model-name>",
        "messages": [
          {"role": "user", "content": "Hello"}
        ],
        "max_tokens": 256
      }
  • Embeddings

    • Endpoint: <base>/v1/embeddings
    • Payload:
      {
        "model": "<model-name>",
        "input": "I am the Yonk, I work in product"
      }
  • Rerank

    • Endpoint: <base>/v1/ranking
    • Payload (example shape; check your model’s schema):
      {
        "model": "<model-name>",
        "query": "search phrase",
        "documents": ["doc one", "doc two"],
        "top_n": 5
      }
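
As a complete example that combines a base URL with one of the payloads above, an external embeddings call might look like the following sketch; the portal domain, InferenceService ID, and model name are placeholders, and the access key is passed as a Bearer token exactly as in the external access example.

curl -X POST \
  'https://<portal_domain>:<portal_port>/inferenceservices/<inferenceservice-id>/v1/embeddings' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -H "Authorization: Bearer <access_key>" \
  -d '{
    "model": "<model-name>",
    "input": "I am the Yonk, I work in product"
  }'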

Notes:

  • For AIDB SQL helpers (for example, aidb.encode_text()), the database will construct the HTTP call for you. If you configure a model manually with an internal URL, set the correct operation path (for example, /v1/embeddings) to avoid 404s.
  • Some models expose additional operation paths; consult the model reference or your InferenceService documentation.

Security considerations

  • Prefer internal access for services that do not need to be exposed externally.
  • Rotate access keys regularly and scope them by user; do not embed keys in client code repositories (see the sketch after this list).
  • Enforce TLS for external calls; use certificates managed by your platform team.
  • Limit egress from clients where possible; monitor usage via observability.
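
One way to keep keys out of client code is to read the access key from an environment variable at call time. This is a sketch, not a mandated pattern; the variable name HM_ACCESS_KEY is illustrative.

export HM_ACCESS_KEY='<access_key>'
curl -X POST \
  'https://<portal_domain>:<portal_port>/inferenceservices/<inferenceservice-id>/v1/chat/completions' \
  -H 'Content-Type: application/json' \
  -H "Authorization: Bearer ${HM_ACCESS_KEY}" \
  -d '{"model": "<model-name>", "messages": [{"role": "user", "content": "Hello"}], "max_tokens": 64}'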

Troubleshooting

  • 404 Not Found: Verify that you are calling the correct endpoint (internal or external), that the InferenceService ID matches the model you intend to call, and that the path includes the correct operation suffix.
  • 401/403 on external calls: Verify the Authorization header and that the user has access to the InferenceService.
  • Timeouts or 503: Check the InferenceService status, health probes, and backend logs (see the kubectl commands after this list); confirm that the endpoint path is correct.
  • High latency: Right‑size resources or adjust concurrency. See Update GPU resources.
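
For status and log checks, the following kubectl sketch can help; it assumes you have cluster access, and the serving.kserve.io/inferenceservice pod label used to select predictor logs is the standard KServe label.

kubectl get inferenceservice <inferenceservice-name> -n <namespace>
kubectl describe inferenceservice <inferenceservice-name> -n <namespace>
kubectl logs -n <namespace> -l serving.kserve.io/inferenceservice=<inferenceservice-name>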

Known issues in 1.3

  • External access requires a valid Hybrid Manager user access key with sufficient role permissions (for example, Gen AI Builder User). Invalid or malformed keys return HTTP 401 Unauthorized.