Quickstart — Use an Inference Endpoint with Python (v1.3)

Use this quickstart to chat with a model deployed behind your private InferenceService (KServe). You provide the endpoint URL, an access key, and the model name. The example uses a production‑ready, single‑file Python script.

Time to complete: 5–10 minutes

Goals

  • Call a private, governed model endpoint (OpenAI‑compatible) over HTTPS
  • Understand required headers and payload shape for chat completions
  • Run a minimal demo script you can reuse in apps and CI/CD

Prerequisites

Environment

Set these environment variables before running the script:

export EDB_BASE_URL="https://<portal>/inferenceservices/<inferenceservice-id>"
export EDB_API_KEY="<hm-user-access-key>"
export MODEL_NAME="meta/llama-3.1-8b-instruct"  # or the model you serve

Notes:

  • For internal callers, set EDB_BASE_URL to your cluster‑local path (proxy or direct KServe URL).
  • The endpoint path for chat is /v1/chat/completions appended to EDB_BASE_URL.

Run the demo script

Download the production‑ready single file, hm_kserve_quickstart.py, and run it.

Example (custom prompt):

python hm_kserve_quickstart.py chat --prompt "Write a haiku about Postgres and GPUs."
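
If you want to see what the script does under the hood, the sketch below makes the same chat call directly with httpx. The file name and chat helper are illustrative, not part of the shipped script; the sketch assumes httpx is installed (pip install httpx) and the environment variables above are set.

# minimal_chat.py: illustrative sketch, not the shipped script
import os

import httpx

BASE_URL = os.environ["EDB_BASE_URL"]   # https://<portal>/inferenceservices/<id>
API_KEY = os.environ["EDB_API_KEY"]     # Hybrid Manager user access key
MODEL = os.environ.get("MODEL_NAME", "meta/llama-3.1-8b-instruct")

def chat(prompt: str) -> str:
    """Send a single-turn chat completion and return the assistant's reply."""
    response = httpx.post(
        f"{BASE_URL}/v1/chat/completions",
        headers={
            "Authorization": f"Bearer {API_KEY}",
            "Accept": "application/json",
            "Content-Type": "application/json",
        },
        json={
            "model": MODEL,
            "messages": [{"role": "user", "content": prompt}],
            "max_tokens": 256,
        },
        timeout=60.0,
    )
    response.raise_for_status()  # surface 4xx/5xx instead of parsing an error body
    return response.json()["choices"][0]["message"]["content"]

if __name__ == "__main__":
    print(chat("Write a haiku about Postgres and GPUs."))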

Optional (use Hybrid Manager to list clusters and summarize):

pip install httpx typer
export EDB_API_URL="https://<hm-host>"; export EDB_API_TOKEN="<hm-access-token>"; export PROJECT_ID="<uuid>"
python hm_kserve_quickstart.py summarize-clusters --project-id "$PROJECT_ID"

Request and headers (reference)

  • Endpoint: POST ${EDB_BASE_URL}/v1/chat/completions
  • Headers: Authorization: Bearer ${EDB_API_KEY}, Accept: application/json, Content-Type: application/json
  • Body (simplified):
{
  "model": "${MODEL_NAME}",
  "messages": [
    {"role": "user", "content": "Hello"}
  ],
  "max_tokens": 256
}
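
A successful response follows the OpenAI chat completions shape, with the reply text at choices[0].message.content. Trimmed example (all values illustrative):

{
  "id": "chatcmpl-...",
  "model": "meta/llama-3.1-8b-instruct",
  "choices": [
    {
      "index": 0,
      "message": {"role": "assistant", "content": "Hello! How can I help?"},
      "finish_reason": "stop"
    }
  ],
  "usage": {"prompt_tokens": 9, "completion_tokens": 7, "total_tokens": 16}
}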

Embeddings and rerank models use different paths:

  • Embeddings: ${EDB_BASE_URL}/v1/embeddings
  • Rerank: ${EDB_BASE_URL}/v1/ranking

If you see HTTP 404 Not Found, verify the operation‑specific path.
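
As an illustration, a minimal embeddings request looks like the following; the body fields follow the OpenAI‑compatible convention, and the exact schema depends on the model you serve:

# Illustrative embeddings request; exact body fields depend on the served model.
import os

import httpx

response = httpx.post(
    f"{os.environ['EDB_BASE_URL']}/v1/embeddings",  # note the operation-specific suffix
    headers={
        "Authorization": f"Bearer {os.environ['EDB_API_KEY']}",
        "Content-Type": "application/json",
    },
    json={"model": os.environ["MODEL_NAME"], "input": ["Postgres and GPUs"]},
    timeout=60.0,
)
response.raise_for_status()
print(len(response.json()["data"][0]["embedding"]))  # embedding dimensionality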

Best practices

  • Keep inference sovereign: prefer internal paths when the caller runs in‑cluster.
  • Rotate access keys regularly; never commit them to source control.
  • Enforce TLS and limit egress from clients.
  • Monitor latency and error rates; scale resources or tune concurrency as needed.

Troubleshooting

  • 401/403: Verify EDB_API_KEY and user permissions to the InferenceService.
  • 404: Confirm the InferenceService ID and that the service is ready.
  • Timeouts/5xx: Check KServe pod status, health probes, and logs; validate endpoint path.
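
For transient timeouts and 5xx responses, a simple client-side retry with backoff often helps. The helper below is a minimal sketch, not part of the demo script:

# Illustrative retry helper for transient timeouts and 5xx responses.
import time

import httpx

def post_with_retries(url: str, headers: dict, body: dict, attempts: int = 3) -> httpx.Response:
    """Retry with exponential backoff; client errors (4xx) are returned immediately."""
    for attempt in range(attempts):
        try:
            response = httpx.post(url, headers=headers, json=body, timeout=30.0)
            if response.status_code < 500:
                return response  # success or a 4xx worth inspecting as-is
        except httpx.TimeoutException:
            pass  # treat like a transient failure and retry
        time.sleep(2 ** attempt)  # back off: 1s, 2s, 4s, ...
    raise RuntimeError(f"{url} still failing after {attempts} attempts")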

Known issues in 1.3

  • Some internal model URLs shown in listings may omit the operation suffix (for example, embeddings require /v1/embeddings). If a call returns 404, append the appropriate suffix. This will be addressed in a future release.
  • External access requires a valid Hybrid Manager user access key with the right role (for example, Gen AI Builder User). A malformed key or insufficient permissions return HTTP 401.