TL;DR
- Open-source Kubernetes-native model serving framework originally launched as KFServing under Kubeflow in 2019, renamed to KServe and split into a top-level project in 2021, donated to the CNCF Sandbox in 2022 and promoted to Incubating in 2024. Apache 2.0, governed by an open Steering Committee with maintainers from Bloomberg, IBM, Red Hat, Google, NVIDIA, Cisco, AWS and Yobitel.
- Built around the InferenceService CRD — a single declarative spec covering predictor, transformer, explainer, autoscaling, traffic splitting, canary, storage initialiser, OIDC auth and OpenAI-compatible LLM endpoints — and the ServingRuntime / ClusterServingRuntime CRDs that pin runtime images and arg templates.
- Ships built-in runtimes for vLLM (`kserve-vllmserver`), Triton (`kserve-tritonserver`), MLServer (`kserve-mlserver`), HuggingFace TGI / native (`kserve-huggingfaceserver`, `kserve-tgiserver`), TorchServe, TF Serving, XGBoost, LightGBM, PMML and ONNX Runtime. Custom ServingRuntimes are a 30-line YAML.
- Two deployment stacks: Serverless (Knative + Istio) for scale-to-zero and request-level routing, and Raw (vanilla Deployment + HPA + Gateway API) for LLM workloads where cold-start latency makes scale-to-zero impractical. Most 2026 LLM deployments use Raw.
- Default LLM serving path inside Yobitel's Yobibyte platform — every InferenceBench-scored vLLM, TensorRT-LLM and Triton endpoint runs through a KServe InferenceService, autoscaled on concurrency and routed via the platform gateway across H100, H200 and B200 tenancies.
Overview
KServe is the Kubernetes-native abstraction for serving machine-learning models. Before it landed, deploying a model on Kubernetes meant assembling a Deployment, Service, HorizontalPodAutoscaler, Ingress, ConfigMap, ServiceAccount and storage-initialiser sidecar yourself for every model. Every team did this differently. KServe collapses the whole thing into a single InferenceService CRD: declare a model URI, a runtime and a scaling envelope, and the controller stands up the right pods, services, routes and scalers in the cluster.
The original project (KFServing) shipped in 2019 under the Kubeflow umbrella. It was renamed KServe and split into a top-level project in 2021 to broaden adoption beyond Kubeflow users, donated to the CNCF Sandbox in 2022 and promoted to Incubating in early 2024. By mid-2026 it sits on v0.14+ with maintainers from Bloomberg, IBM, Red Hat, Google, NVIDIA, Cisco, AWS and Yobitel. The release cadence is roughly quarterly, with patch releases between.
The CRD layer is intentionally runtime-agnostic. KServe does not implement model serving itself — it composes existing runtimes (vLLM, Triton, TorchServe, MLServer, HuggingFace TGI, TF Serving, ONNX Runtime) into a uniform deployment surface. The same InferenceService spec works for a 7B LLM behind vLLM, a vision model behind Triton, an XGBoost classifier behind MLServer or an ensemble pipeline composing several runtimes — and the operator handles the scaling, routing, canary and observability story for all of them.
Yobibyte exposes KServe-compatible Inference resources to customers as the managed service surface: the Yobitel platform reconciles the underlying InferenceService and ServingRuntime objects on the customer's behalf, so consumers never own the CRD lifecycle, autoscaler choice or runtime image pin themselves. This entry is for teams that operate KServe directly on their own clusters, or that want to recognise what Yobibyte does on their behalf as a managed service. It documents the production surface: the CRDs, the built-in runtimes, the deployment modes, the workload patterns, sizing, observability, security, migration and troubleshooting. It assumes the cluster already has the NVIDIA GPU Operator installed for GPU-bound workloads.
Quick start
The example below installs KServe in Raw mode (no Knative, no Istio), then deploys Llama 3.1 8B Instruct on a single H100 via the built-in `kserve-vllmserver` runtime, then issues an OpenAI-compatible chat completion against the resulting endpoint. The first block installs the controller; the second block applies an InferenceService; the third block hits the endpoint with `curl`.
```bash
# 1. Install KServe in Raw mode (recommended for LLM workloads)
KSERVE_VERSION="v0.14.0"
kubectl apply --server-side -f \
  "https://github.com/kserve/kserve/releases/download/$KSERVE_VERSION/kserve.yaml"
kubectl apply --server-side -f \
  "https://github.com/kserve/kserve/releases/download/$KSERVE_VERSION/kserve-cluster-resources.yaml"

# 2. Deploy Llama 3.1 8B Instruct via vLLM ClusterServingRuntime
cat <<'YAML' | kubectl apply -f -
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: llama3-8b
  annotations:
    serving.kserve.io/deploymentMode: RawDeployment
    serving.kserve.io/autoscalerClass: hpa
spec:
  predictor:
    minReplicas: 1
    maxReplicas: 4
    scaleTarget: 80
    scaleMetric: concurrency
    model:
      modelFormat: { name: huggingface }
      runtime: kserve-vllmserver
      args:
        - --model=meta-llama/Meta-Llama-3.1-8B-Instruct
        - --max-model-len=16384
        - --quantization=fp8
        - --enable-prefix-caching
      resources:
        limits: { nvidia.com/gpu: 1, cpu: 8, memory: 64Gi }
YAML

# 3. Wait for it to come Ready, then send a request via the OpenAI API
kubectl wait --for=condition=Ready inferenceservice/llama3-8b --timeout=20m
# status.url already carries its scheme (http or https), so use it as-is
ISVC_URL=$(kubectl get inferenceservice llama3-8b \
  -o jsonpath='{.status.url}')
curl -k "$ISVC_URL/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "messages": [{"role":"user","content":"Explain KServe in 2 lines."}],
    "max_tokens": 128
  }'
```

Skip Knative for LLM workloads. A 70B model takes 60-180 seconds to load weights from S3 onto GPU memory, so scale-to-zero turns every cold start into a guaranteed timeout. Set `serving.kserve.io/deploymentMode: RawDeployment` and use HPA on concurrency.
How it works
KServe is structured as a Kubernetes controller (Go, ~80K lines) that watches the InferenceService CRD and reconciles it into the right set of Knative Services, Deployments, HPAs, VirtualServices / HTTPRoutes, ServiceAccounts and ServingRuntime references. The runtime container itself comes from a ServingRuntime or ClusterServingRuntime resource, which holds the image, command, port spec and arg template — KServe substitutes the InferenceService's `model.args` and `model.storageUri` into the template at pod creation.
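As a rough illustration of what a ServingRuntime resource holds, the sketch below renders a minimal ClusterServingRuntime manifest to stdout. The runtime name `example-custom-runtime`, the image `example.com/my-runtime:1.0` and the container args are placeholders invented for this example, not a runtime that ships with KServe; check the field names against the CRD version installed in your cluster before applying anything.

```bash
#!/usr/bin/env bash
# Sketch: render a minimal custom ClusterServingRuntime to stdout.
# The name, image and args below are hypothetical placeholders.
render_custom_runtime() {
  cat <<'YAML'
apiVersion: serving.kserve.io/v1alpha1
kind: ClusterServingRuntime
metadata:
  name: example-custom-runtime
spec:
  supportedModelFormats:
    - name: custom
      version: "1"
      autoSelect: false
  protocolVersions:
    - v2
  containers:
    - name: kserve-container
      image: example.com/my-runtime:1.0
      args:
        # {{.Name}} is filled in from the InferenceService at pod creation
        - --model_name={{.Name}}
        - --model_dir=/mnt/models
      resources:
        requests: { cpu: "1", memory: 2Gi }
        limits: { cpu: "1", memory: 2Gi }
YAML
}

# Usage (needs cluster access): render_custom_runtime | kubectl apply -f -
render_custom_runtime
```

An InferenceService would then select this runtime with `runtime: example-custom-runtime`, and its `args` are appended to the container args shown here.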
On the data plane there are two architectures. In Serverless mode (the original) the predictor is a Knative Service, which means request-level autoscaling, scale-to-zero, traffic splitting via VirtualServices, and Istio sidecars enforcing mTLS. In Raw mode (introduced in v0.10, dominant by 2026) the predictor is a vanilla Kubernetes Deployment plus an HPA plus a Gateway API HTTPRoute — same surface, no Knative or Istio dependency. The choice is per-InferenceService via the `serving.kserve.io/deploymentMode` annotation.
A typical request flow on Raw mode: client → cluster ingress (Envoy / NGINX / cloud LB) → HTTPRoute → predictor Service → predictor Pod → ServingRuntime container (e.g. vLLM listening on :8080) → response. With a Transformer in the spec, the predictor Service points at the transformer pod which then forwards to the predictor pod via the in-cluster network. With ModelMesh enabled, multiple models share predictor pods with LRU-style loading.
Storage is handled by the `storage-initializer` InitContainer. When the InferenceService specifies `model.storageUri: s3://bucket/path`, the init container pulls weights into an emptyDir or PVC before the runtime container starts. Supported schemes include `s3://`, `gs://`, `abfs://`, `pvc://`, `hf://`, `oci://` and `http(s)://`. Credentials are sourced from a referenced ServiceAccount with mounted secrets or IRSA / Workload Identity bindings.
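A small pre-flight check can catch an unsupported `storageUri` before the pod is ever scheduled. The sketch below is a hypothetical helper, not part of KServe: it extracts the scheme from a URI and compares it with the list of schemes named in the paragraph above.

```bash
#!/usr/bin/env bash
# Schemes listed above as supported by the storage-initializer.
SUPPORTED_SCHEMES="s3 gs abfs pvc hf oci http https"

# Print the scheme of a storage URI; fail if there is no "://".
storage_scheme() {
  case "$1" in
    *://*) printf '%s\n' "${1%%://*}" ;;
    *)     return 1 ;;
  esac
}

# Succeed only if the URI uses one of the supported schemes.
is_supported_storage_uri() {
  local scheme
  scheme=$(storage_scheme "$1") || return 1
  case " $SUPPORTED_SCHEMES " in
    *" $scheme "*) return 0 ;;
    *)             return 1 ;;
  esac
}

# Example: reject a URI before writing it into an InferenceService.
if ! is_supported_storage_uri "ftp://host/model"; then
  echo "unsupported storageUri scheme"
fi
```

This only validates the scheme; credentials and bucket permissions still have to be checked against the ServiceAccount the predictor uses.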
- Reconciliation loop — single controller watches InferenceService, ServingRuntime, ClusterServingRuntime and downstream resources. State lives only in etcd.
- Predictor / Transformer / Explainer — three optional components per InferenceService; the controller wires them as a request chain.
- ModelMesh — sidecar architecture for thousands of small models with LRU loading; separate controller, same CRD surface.
- Storage initializer — pulls weights before the runtime starts; idempotent across pod restarts via emptyDir lifetime.
- Open Inference Protocol (v2) — the standard predict/explain gRPC + REST API every runtime adheres to (LLM runtimes also expose OpenAI-compatible endpoints).
- Gateway API integration — Raw mode now uses Gateway API HTTPRoutes by default; falls back to Ingress on older clusters.
Serverless mode requires Knative Serving 1.14+ and Istio (or Kourier as a lighter alternative). Raw mode requires only a Gateway API implementation. If you do not already run Istio for other reasons, picking Raw is a smaller blast-radius dependency.
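For non-LLM runtimes, the Open Inference Protocol (v2) mentioned in the list above defines the REST request shape. The sketch below builds a single-tensor request body; the tensor name `input-0`, the host variable and the model name `my-model` are placeholders for this example, and the datatype is fixed to FP32 to keep it short.

```bash
#!/usr/bin/env bash
# Build an Open Inference Protocol (v2) REST request body for one FP32 tensor.
# Args: tensor name, shape as a JSON array, data as a JSON array.
v2_infer_body() {
  printf '{"inputs":[{"name":"%s","shape":%s,"datatype":"FP32","data":%s}]}\n' \
    "$1" "$2" "$3"
}

v2_infer_body "input-0" "[1,4]" "[5.1,3.5,1.4,0.2]"

# Usage against a live endpoint (hypothetical host and model name):
#   curl -s "http://$HOST/v2/models/my-model/infer" \
#     -H "Content-Type: application/json" \
#     -d "$(v2_infer_body input-0 '[1,4]' '[5.1,3.5,1.4,0.2]')"
```

LLM runtimes are usually called through the OpenAI-compatible paths instead, as in the quick start.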
Reference: InferenceService spec
The InferenceService spec has ~50 top-level and nested fields. The table below covers the ones that matter on every deployment. Defaults are taken from v0.14 (mid-2026). The full schema lives at `serving.kserve.io/v1beta1`; v1alpha1 ModelSpec fields are still supported for ModelMesh use cases.
| Field | Type | Default | Purpose |
|---|---|---|---|
| predictor.model.modelFormat.name | string | (required) | One of huggingface, pytorch, tensorflow, sklearn, xgboost, onnx, triton, custom. Drives runtime selection if `runtime` is unset. |
| predictor.model.runtime | string | (auto) | Explicit ServingRuntime reference. Use for LLMs: `kserve-vllmserver`, `kserve-tritonserver`, `kserve-huggingfaceserver`. |
| predictor.model.storageUri | string | (none) | Where to pull weights from: `s3://`, `gs://`, `hf://`, `pvc://`, `oci://`. For HF runtime use `--model=org/name` arg instead. |
| predictor.model.args | []string | [] | Runtime args appended to ServingRuntime container command. |
| predictor.model.env | []EnvVar | [] | Environment variables for the runtime container. |
| predictor.model.resources | ResourceRequirements | (none) | CPU / memory / GPU requests and limits. Required for `nvidia.com/gpu`. |
| predictor.minReplicas | int | 1 | Lower bound. Set to 0 only on Serverless mode and only for small / cheap workloads. |
| predictor.maxReplicas | int | (unbounded) | Upper bound for the HPA / Knative autoscaler. |
| predictor.scaleTarget | int | (runtime-default) | Target value for the scale metric (e.g. 80 concurrent requests). |
| predictor.scaleMetric | string | concurrency | One of concurrency, rps, cpu, memory. LLMs almost always use concurrency. |
| predictor.containerConcurrency | int | 0 | Knative-only. Hard cap on concurrent requests per pod; 0 = unlimited. |
| predictor.timeout | int (s) | 60 | Request timeout. Raise to 300-600 for long-context LLMs. |
| predictor.serviceAccountName | string | default | Used by storage-initializer to pull weights from S3/GCS/etc. |
| predictor.nodeSelector / tolerations / affinity | object | {} | Constrain to GPU node pool or MIG slice profile. |
| transformer.containers | []Container | (none) | Custom pre/post-processing pod (tokenisation, image resize). Replaces predictor as the request entry point. |
| explainer.containers | []Container | (none) | Interpretability sidecar (Alibi, SHAP, ART). |
| predictor.canaryTrafficPercent | int | 0 | Percentage of traffic routed to the latest revision; the rest stays on the previous revision. |
| predictor.workerSpec | object | (none) | Multi-node worker spec (Ray / MPI) for tensor- or pipeline-parallel inference across nodes. |
| metadata.annotations.serving.kserve.io/deploymentMode | string | Serverless | One of RawDeployment, Serverless, ModelMesh. RawDeployment is dominant for LLMs. |
| metadata.annotations.serving.kserve.io/autoscalerClass | string | knative | One of knative, hpa, external. Use `hpa` with RawDeployment. |
| metadata.annotations.serving.kserve.io/storage-initializer-cpu / memory | string | 100m / 100Mi | Raise for large model pulls — 70B models need 4Gi+. |
| metadata.annotations.autoscaling.knative.dev/metric | string | concurrency | Serverless mode override; one of rps, concurrency, cpu, memory. |
| metadata.annotations.serving.kserve.io/enable-prometheus-scraping | string | true | Enables Prometheus annotations on predictor pods. |
| metadata.annotations.security.kserve.io/disable-istio-sidecar | string | false | Bypass Istio mTLS on Serverless mode. |
`predictor.minReplicas: 0` is tempting and almost always wrong for LLMs. Cold start on a 7B model is 30-90s; on a 70B model it is 2-5 minutes. Either keep one warm or accept guaranteed timeouts on the first request after a scale-down.
Workload patterns
Three patterns cover the bulk of production KServe deployments. First, an OpenAI-compatible LLM endpoint backed by vLLM. Second, canary deployment of a new model version against a live default. Third, an ensemble pipeline composing a Transformer (pre/post-processing) with a Predictor.
Pattern A: OpenAI-compatible LLM endpoint. Use the `kserve-vllmserver` ClusterServingRuntime, set Raw deployment mode, HPA on concurrency, minReplicas 1+, maxReplicas based on tenant peak. The runtime exposes `/v1/chat/completions`, `/v1/completions`, `/v1/embeddings` and `/v1/models` paths.
Pattern B: Canary deployment. Update the existing InferenceService in place (same name, new model URI) and set `canaryTrafficPercent: 10` under `spec.predictor`. KServe creates a new revision and routes 10% of traffic to it while the previous revision keeps the rest. Promote by removing `canaryTrafficPercent` (or setting it to 100) so the new revision takes all traffic; roll back by setting it to 0.
Pattern C: Transformer + Predictor. Add a `transformer` spec with a custom container that pre-processes incoming requests (e.g. PDF → text extraction, image → tensor) and forwards to the predictor. KServe wires the request chain transparently; the client sees one endpoint.
```yaml
# A: OpenAI-compatible LLM endpoint on 4x H100
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: llama3-70b
  annotations:
    serving.kserve.io/deploymentMode: RawDeployment
    serving.kserve.io/autoscalerClass: hpa
spec:
  predictor:
    minReplicas: 2
    maxReplicas: 16
    scaleTarget: 64
    scaleMetric: concurrency
    timeout: 600
    model:
      modelFormat: { name: huggingface }
      runtime: kserve-vllmserver
      args:
        - --model=meta-llama/Meta-Llama-3.1-70B-Instruct
        - --tensor-parallel-size=4
        - --max-model-len=32768
        - --quantization=fp8
        - --kv-cache-dtype=fp8
        - --enable-prefix-caching
        - --enable-chunked-prefill
      resources:
        limits: { nvidia.com/gpu: 4, cpu: 32, memory: 256Gi }
---
# B: Canary 10% traffic to a new revision
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: llama3-70b
spec:
  predictor:
    # canaryTrafficPercent is a predictor-level field, not a top-level spec field
    canaryTrafficPercent: 10
    minReplicas: 2
    maxReplicas: 16
    model:
      modelFormat: { name: huggingface }
      runtime: kserve-vllmserver
      args:
        - --model=meta-llama/Meta-Llama-3.1-70B-Instruct-v2
        - --tensor-parallel-size=4
      resources:
        limits: { nvidia.com/gpu: 4 }
---
# C: Ensemble: transformer (PDF -> text) + predictor (embeddings)
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata: { name: pdf-embed }
spec:
  transformer:
    containers:
      - name: pdf-extract
        image: example/pdf-to-text:1.4
        ports: [{ containerPort: 8080 }]
  predictor:
    minReplicas: 2
    maxReplicas: 8
    model:
      modelFormat: { name: huggingface }
      runtime: kserve-huggingfaceserver
      args:
        - --model=BAAI/bge-m3
        - --task=feature-extraction
      resources:
        limits: { nvidia.com/gpu: 1 }
```

Pattern A's prefix-cache hit rate is the single biggest cost lever. If multiple tenants share a system prompt, send them through the same InferenceService and let vLLM hash-share the cached prefix. If they must not, run one InferenceService per tenant and accept the cache miss.
Sizing and capacity planning
KServe's own footprint is small — controller pod ~150 mCPU + 256 MiB, plus a webhook pod, plus the storage-initializer init container which runs once per pod start. The real sizing question is the runtime: vLLM, Triton, TGI sized per their own footprint, plus an HPA envelope. The table below is the planning model for typical LLM serving on H100 / H200 / B200, assuming the predictor is vLLM with FP8 weights and KV.
- minReplicas ≥ 2 for any production LLM endpoint: with a single replica, every pod restart drops the endpoint, and the HPA cannot bring up a replacement fast enough to absorb a request spike.
- The storage-initializer needs 4-8 GiB of memory for 70B model pulls; raise it via annotations. The default 100Mi will fail.
- Plan ingress capacity for the peak `maxReplicas x scaleTarget` request count. A 16-replica deployment scaling on 64 concurrent requests handles 1,024 concurrent requests at peak — your gateway must cope.
- Per-pod cold start dominates the user-visible scale-up time: 30-90s for 7B, 2-5 minutes for 70B. Pre-warm a buffer replica during predictable traffic peaks.
| Model | Runtime | Hardware | minReplicas | maxReplicas | scaleTarget | Per-pod tok/s |
|---|---|---|---|---|---|---|
| Llama 3.1 8B | kserve-vllmserver | 1x H100 SXM5 | 1 | 8 | 80 | 3,800-5,200 |
| Llama 3.1 70B | kserve-vllmserver | 4x H100 SXM5 | 2 | 16 | 64 | 2,800-4,200 |
| Llama 3.1 70B (high QPS) | kserve-vllmserver | 8x H100 SXM5 | 2 | 8 | 128 | 5,200-7,800 |
| Llama 3.1 70B (128K ctx) | kserve-vllmserver | 2x H200 141GB | 1 | 6 | 32 | 1,400-2,200 |
| Mixtral 8x22B | kserve-vllmserver | 8x H100 SXM5 | 2 | 8 | 96 | 4,500-6,800 |
| Llama 3.1 70B (Blackwell) | kserve-vllmserver | 4x B200 | 2 | 8 | 128 | 6,800-10,500 |
| BGE-M3 embeddings | kserve-huggingfaceserver | 1x L40S 48GB | 2 | 16 | 200 | n/a (4k req/s) |
| XGBoost classifier | kserve-mlserver | CPU (4 vCPU) | 2 | 32 | 300 rps | n/a |
| ResNet50 vision | kserve-tritonserver | 1x L4 24GB | 2 | 16 | 150 | n/a (500 img/s) |
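The arithmetic behind the sizing bullets above can be sketched as two small helpers. They assume that `scaleTarget` is a fair stand-in for per-pod capacity in concurrent requests, and the steady-state figure of 200 is an invented input for illustration; substitute measured values.

```bash
#!/usr/bin/env bash
# Peak concurrent requests the gateway must absorb: maxReplicas x scaleTarget.
peak_concurrency() {
  echo $(( $1 * $2 ))
}

# Replicas needed for a steady-state concurrency, rounded up, and never
# below the floor of 2 recommended above for production endpoints.
min_replicas_for() {
  local demand=$1 per_pod=$2 n
  n=$(( (demand + per_pod - 1) / per_pod ))
  if [ "$n" -lt 2 ]; then
    n=2
  fi
  echo "$n"
}

peak_concurrency 16 64     # 70B row: prints 1024
min_replicas_for 200 64    # hypothetical steady state: prints 4
```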
Limits and quotas
KServe inherits Kubernetes' limit / quota model. The CRD itself imposes few hard limits; the practical ceilings come from etcd object size, gateway concurrency, and the underlying runtime.
| Limit | Default / ceiling | How to raise |
|---|---|---|
| InferenceServices per namespace | ResourceQuota-bounded | Set `count/inferenceservices.serving.kserve.io` in ResourceQuota. |
| Predictor pod resource size | Cluster maxPodResources | Node capacity-bounded; ensure node pool has matching SKUs. |
| Predictor replicas | HPA-bounded (default 100) | Raise `predictor.maxReplicas`; the HPA ceiling follows it. |
| InferenceService spec size | etcd 1.5 MiB | Keep `args`/`env` modest; avoid embedding large config in CR. |
| Storage-initializer pull size | PVC / emptyDir size | Use PVC with explicit size; emptyDir uses node ephemeral storage. |
| Request body size | Gateway-bounded | Configure ingress / Gateway API. Default Envoy is 1 MiB; raise for batch. |
| Request timeout | 60s default | Raise `predictor.timeout`; ensure gateway timeout matches. |
| Concurrent revisions | Knative quota | Serverless only; configure via `config-defaults` ConfigMap. |
| ServingRuntimes per cluster | etcd-bounded | ClusterServingRuntime CRDs are cheap; no practical limit. |
| ModelMesh models per pod | Runtime-defined (~50-200) | Tune via ModelMesh `ServingRuntime` `multiModel: true` config. |
The default request timeout (60s) is shorter than 70B LLM completion latency for long outputs. Raise `predictor.timeout` to 300-600s on every LLM InferenceService, and raise the gateway / ingress timeout to match — otherwise the client sees a 504 even when the model is still generating.
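To see why 60 seconds is tight, a back-of-the-envelope estimate helps. The sketch below assumes a per-request decode rate of 40 tokens/s and a 4,096-token completion; both numbers are illustrative assumptions rather than measurements, and the 20% headroom factor is an arbitrary choice.

```bash
#!/usr/bin/env bash
# Estimated seconds to stream N output tokens at a per-request decode
# rate (tokens/s). Both inputs are assumptions you should measure.
generation_seconds() {
  awk -v n="$1" -v r="$2" 'BEGIN { printf "%.1f\n", n / r }'
}

# Succeed if the estimate, plus 20% headroom, fits inside the timeout.
fits_timeout() {
  awk -v n="$1" -v r="$2" -v t="$3" 'BEGIN { exit !((n / r) * 1.2 <= t) }'
}

generation_seconds 4096 40
fits_timeout 4096 40 60  || echo "60s timeout is too short"
fits_timeout 4096 40 300 && echo "300s timeout fits"
```

Under these assumptions the completion takes roughly 102 seconds, which overruns the default and fits comfortably inside a 300-second timeout.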
Observability
KServe exposes Prometheus metrics from three sources: the controller (`kserve_controller_*` reconcile counts), the predictor pod (runtime-specific — `vllm:*`, `triton:*`, `mlserver:*`), and the in-cluster gateway. The InferenceService status reports current and desired replicas, traffic split between default and canary, and the live URL. Standard Grafana dashboards (KServe-published 11 and `12239` for DCGM alongside) give a turnkey view. For LLMs, pair runtime metrics with DCGM GPU metrics — a queue depth spike that does not correlate with GPU utilisation usually means the bottleneck is elsewhere (storage, gateway, tokenisation).
- kserve_controller_reconcile_total / _errors_total — controller-side health.
- kserve_inference_service_status_ready — gauge of Ready InferenceServices per namespace.
- predictor request_count / request_duration_seconds — surfaced by every runtime via the Open Inference Protocol metrics path.
- vllm:time_to_first_token_seconds / vllm:gpu_cache_usage_perc — for vLLM-backed predictors.
- nv_inference_request_duration_us / nv_inference_queue_duration_us — for Triton-backed predictors.
- DCGM_FI_DEV_GPU_UTIL — pair with predictor metrics to distinguish compute, memory and idle bottlenecks.
- envoy_cluster_upstream_rq_time / kserve_request_count — gateway-side latency and rate.
```yaml
# Prometheus alerts for a KServe deployment
groups:
  - name: kserve-sla
    interval: 30s
    rules:
      - alert: InferenceServiceNotReady
        expr: kserve_inference_service_status_ready == 0
        for: 10m
        labels: { severity: critical }
        annotations:
          summary: "InferenceService {{ $labels.name }} in {{ $labels.namespace }} not Ready"
      - alert: KServeControllerReconcileFailing
        expr: rate(kserve_controller_reconcile_errors_total[10m]) > 0
        for: 15m
        labels: { severity: critical }
      - alert: PredictorPodCrashLooping
        expr: rate(kube_pod_container_status_restarts_total{namespace=~"serving|kserve-.*"}[15m]) > 0.2
        for: 10m
        labels: { severity: warning }
      - alert: VLLMPredictorTTFTHigh
        expr: |
          histogram_quantile(0.95,
            sum by (le, model_name) (
              rate(vllm:time_to_first_token_seconds_bucket[5m]))) > 1.0
        for: 5m
        labels: { severity: warning }
      - alert: CanaryRolloutDegraded
        expr: kserve_revision_request_error_rate{revision="canary"} > 0.05
        for: 10m
        labels: { severity: warning }
        annotations:
          summary: "Canary revision error rate >5%; consider rollback"
```

Wire the predictor's runtime metrics, gateway metrics and DCGM metrics into the same dashboard from day one. The most common production confusion is debugging a latency spike with only one of the three in view; the answer almost always requires all three.
Cost and FinOps
KServe is free (Apache 2.0); the cost surface is the runtime hardware and the autoscaling envelope. Two levers dominate: how aggressively you scale (replica count x duration) and how efficiently each replica converts GPU-time to served tokens. The table below uses representative cloud GPU rates and InferenceBench throughput anchors to translate KServe deployment shapes into $/M token costs.
- Set `minReplicas` to actual steady-state demand divided by per-pod capacity, not to 1. Under-provisioning means scale-up cold start = user-visible latency.
- HPA `scaleTarget` directly trades off cost vs latency tail. Higher target = fewer pods = lower cost but longer p99. Pick deliberately, not by default.
- Canary deployments cost the full predictor — a 10% canary on a 4-replica default is still 4 extra GPUs of spend. Time-box every canary.
- Spot or pre-emptible GPU node pools cut hourly rates 40-60% but require `predictor.maxReplicas` headroom to absorb pre-emption events without dropping SLA.
- ModelMesh changes the math for thousands of small models — co-location amortises per-pod overhead and can cut cost 10-50x for classical ML model fleets.
| Configuration | Replicas | GPU rate ($/h) | Sustained tok/s | $/M output tokens |
|---|---|---|---|---|
| Llama 3.1 8B, kserve-vllmserver, 1x H100 | 1-4 | $3.20 | 4,500 | $0.20 |
| Llama 3.1 70B, kserve-vllmserver, 4x H100 | 2-8 | $12.40 | 3,500 | $0.98 |
| Llama 3.1 70B, kserve-vllmserver, 8x H100 | 2-4 | $24.80 | 6,800 | $1.01 |
| Llama 3.1 70B (128K), kserve-vllmserver, 2x H200 | 1-3 | $8.40 | 1,800 | $1.30 |
| Mixtral 8x22B, kserve-vllmserver, 8x H100 | 2-4 | $24.80 | 6,200 | $1.11 |
| Llama 3.1 70B FP4, kserve-vllmserver, 4x B200 | 2-6 | $22.00 | 9,200 | $0.66 |
| Llama 3.1 70B, kserve-tritonserver + TRT-LLM, 4x H100 | 2-4 | $12.40 | 4,200 | $0.82 |
| BGE-M3 embeddings, kserve-huggingfaceserver, 1x L40S | 2-8 | $1.40 | n/a | $0.05/M tokens embedded |
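The last column of the table appears to follow from the hourly rate and the sustained throughput of a single replica: dollars per hour, divided by tokens per hour, scaled to one million tokens. The helper below is a sketch of that arithmetic and reproduces the first two rows; the function name is made up for this example.

```bash
#!/usr/bin/env bash
# $/M output tokens = hourly rate / (tokens per second * 3600) * 1,000,000
cost_per_million() {
  awk -v rate="$1" -v tps="$2" \
    'BEGIN { printf "%.2f\n", rate / (tps * 3600) * 1000000 }'
}

cost_per_million 3.20 4500    # Llama 3.1 8B on 1x H100: prints 0.20
cost_per_million 12.40 3500   # Llama 3.1 70B on 4x H100: prints 0.98
```

The figure assumes the replica is kept busy at its sustained rate; idle replicas held by `minReplicas` push the effective cost per token above this number.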
Security and compliance
KServe inherits Kubernetes' RBAC, NetworkPolicy and Pod Security Standards. On Serverless mode, Istio sidecars provide mTLS between predictor pods and any upstream consumer. On Raw mode, the cluster's Gateway API implementation (Envoy Gateway, Istio Gateway, NGINX Gateway Fabric) handles TLS termination and authentication — typically OIDC or signed-JWT at the gateway, with KServe itself unaware of identity. The Open Inference Protocol does not specify auth; that is the gateway's job.
Model artifacts are pulled by the storage-initializer using credentials sourced from the ServiceAccount on the predictor pod. On AWS this is typically IRSA (IAM Roles for Service Accounts); on GCP it is Workload Identity; on Azure it is Microsoft Entra Workload ID (the successor to the deprecated AAD Pod Identity). The pulled weights live in an emptyDir (RAM-backed or disk) and are deleted when the pod terminates. Avoid PVC-backed weight stores unless you specifically need them shared across pods; emptyDir + storage-initializer is the cleaner model for most workloads.
Regulatory implications are workload-, data- and deployment-specific. For UK NCSC Cloud Security Principles, KServe is a control-plane component; principles 2 (Asset protection — encrypt weights at rest in object storage), 3 (Separation between users — namespace + NetworkPolicy isolation between tenants), 5 (Operational security — GitOps + versioned InferenceServices) and 9 (Secure user management — gateway-layer OIDC) are the relevant ones. For GDPR Article 32, the predictor processes prompts and completions only in pod memory; ensure logging redacts PII. For HIPAA, deploy inside a BAA-covered VPC and disable the request-body capture in runtime logs.
Default Knative configuration logs request bodies in the activator at debug level. For PII / PHI workloads on Serverless mode, set `logging.request-log-template=""` in the Knative `config-observability` ConfigMap to suppress request-body capture entirely.
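For the namespace + NetworkPolicy tenant separation mentioned above, one possible starting point is a policy that admits ingress to a tenant namespace only from the gateway namespace. The sketch below renders such a policy; the policy name and the namespace names passed in are hypothetical, and it relies on the `kubernetes.io/metadata.name` label that Kubernetes sets on namespaces automatically.

```bash
#!/usr/bin/env bash
# Render a NetworkPolicy that admits ingress to all pods in a tenant
# namespace only from the gateway namespace. Names are hypothetical.
render_tenant_policy() {
  local tenant_ns=$1 gateway_ns=$2
  cat <<YAML
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-gateway-only
  namespace: ${tenant_ns}
spec:
  podSelector: {}
  policyTypes: [Ingress]
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: ${gateway_ns}
YAML
}

# Usage (needs cluster access):
#   render_tenant_policy tenant-a gateway-system | kubectl apply -f -
render_tenant_policy tenant-a gateway-system
```

If an InferenceService in the namespace uses a transformer, add a same-namespace ingress rule as well, otherwise the transformer can no longer reach the predictor.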
Migration and alternatives
Most production migrations to KServe come from one of four origins: raw Deployment + Service + HPA, Seldon Core, BentoML, or a managed SaaS API (SageMaker, Vertex, Bedrock). The first delivers operational simplification with little behaviour change; the second is a near-equivalent CRD swap; the third trades Pythonic bento ergonomics for Kubernetes-native operations; the fourth trades cost for control. The table summarises the path from each starting point.
| From | Effort | Trade-offs | Notes |
|---|---|---|---|
| Raw Deployment + Service + HPA | Low | Lose hand-tuned flexibility, gain autoscaling + canary + GitOps fit | Wrap existing container in a ServingRuntime; same image, same args. |
| Seldon Core v1 (Apache 2.0) | Low-medium | Lose Seldon's inference-graph DAG; gain CNCF governance + LLM-focused runtimes | InferenceGraph not 1:1; rebuild multi-step routing in the gateway or Ray Serve. |
| Seldon Core v2 (BSL) | Medium | Same as v1 plus give up MLServer co-location optimisations | ModelMesh covers similar ground for many-small-model fleets. |
| BentoML / Bento Cloud | Medium | Lose bento packaging ergonomics; gain Kubernetes-native operations | Bento services map cleanly to ServingRuntime + InferenceService. |
| SageMaker / Vertex / Bedrock | High | Gain control, sovereignty, on-prem option; lose hosted model variety | Re-platform to KServe + vLLM or Triton; expect 30-70% cost reduction at scale. |
| NVIDIA Triton standalone | Trivial | Gain CRD lifecycle; lose nothing | Use `kserve-tritonserver` ClusterServingRuntime; existing model repo works unchanged. |
| TorchServe standalone | Low | Gain CRD + autoscaling; lose nothing | Use `kserve-torchserve` runtime; same `.mar` archives work. |
| Hugging Face Inference Endpoints | Medium | Gain sovereignty + cost control; lose managed convenience | Use `kserve-huggingfaceserver` or `kserve-tgiserver`. |
| vs Yobibyte managed alternative | n/a | Keep managed convenience and gain UK / EU sovereignty; give up direct CRD ownership | Yobibyte exposes a KServe-compatible Inference resource without customers operating KServe themselves — the InferenceService, ServingRuntime, autoscaler and gateway are reconciled by the Yobitel platform across NeoCloud regions. |
```bash
# Migration: from a raw Deployment + Service to KServe
# Before: hand-rolled vLLM Deployment + Service + HPA + Ingress (several YAMLs)
# After: single InferenceService
cat <<'YAML' | kubectl apply -f -
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: llama3-70b
  namespace: ml-platform
  annotations:
    serving.kserve.io/deploymentMode: RawDeployment
    serving.kserve.io/autoscalerClass: hpa
spec:
  predictor:
    minReplicas: 2
    maxReplicas: 8
    scaleTarget: 64
    scaleMetric: concurrency
    model:
      modelFormat: { name: huggingface }
      runtime: kserve-vllmserver
      args:
        - --model=meta-llama/Meta-Llama-3.1-70B-Instruct
        - --tensor-parallel-size=4
        - --quantization=fp8
        - --enable-prefix-caching
      resources:
        limits: { nvidia.com/gpu: 4 }
YAML

# Cut over traffic at the gateway; old Deployment can be deleted once verified
kubectl delete deployment/llama3-70b-vllm service/llama3-70b-vllm \
  hpa/llama3-70b-vllm ingress/llama3-70b-vllm
```

Troubleshooting
The error table below covers the failure modes that account for the bulk of production KServe incidents observed on Yobitel-operated fleets and the upstream issue tracker. Each row maps an observable symptom to the underlying mechanism and the minimum-viable fix.
| Symptom | Cause | Fix |
|---|---|---|
| InferenceService stuck in `LatestDeploymentReady=False` | Predictor pod not scheduling — GPU not available or runtime image pull failing. | Check predictor pod events; verify `nvidia.com/gpu` request matches node availability; check image pull secrets. |
| Cold start times out client request | Predictor minReplicas=0 + slow model load. | Set `minReplicas: 1`; raise client / gateway timeout to 300-600s; pre-warm before traffic cutover. |
| `storage-initializer` OOMKilled | Default 100Mi memory too small for large model pull. | Set `serving.kserve.io/storage-initializer-memory: 8Gi` annotation. |
| Storage URI auth failure (S3 403) | ServiceAccount missing IRSA / Workload Identity binding. | Annotate SA with the IAM role / GSA; ensure the role has `s3:GetObject` on the bucket prefix. |
| Runtime not found error | ServingRuntime / ClusterServingRuntime missing in cluster. | Apply `kserve-cluster-resources.yaml` from the release; verify `kubectl get clusterservingruntimes`. |
| Predictor scaling but never serving | HPA scaling on CPU instead of concurrency. | Set `serving.kserve.io/autoscalerClass: hpa` + `scaleMetric: concurrency` and ensure metrics-server / Prometheus Adapter is configured. |
| Canary traffic split not taking effect | Gateway implementation doesn't support weighted routes. | Verify Gateway API implementation supports HTTPRoute weights; or switch to Serverless mode + Knative. |
| Inference latency much worse than standalone runtime benchmark | Istio sidecar overhead on Serverless mode. | Disable Istio sidecar via annotation, or switch to Raw mode. |
| InferenceService URL returns 404 from outside the cluster | Gateway not exposed externally, or hostname not configured. | Check Gateway / Ingress status; verify DNS points to gateway external IP. |
| Multi-node predictor (workerSpec) hangs at NCCL init | /dev/shm too small or worker pods not on same NVLink island. | Mount `/dev/shm >= 8Gi`; add pod affinity for same node pool. |
| ModelMesh runtime returns `MODEL_NOT_LOADED` | LRU evicted the model; load timing too long. | Raise `modelTTL` on ServingRuntime; raise `multiModel.replicas`. |
| Predictor pod logs say `OPENAI_API_KEY required` | vLLM runtime args missing api-key disable. | Add `--api-key=EMPTY` to runtime args, or set `OPENAI_API_KEY` env. |
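Most rows in the table start from the same handful of commands. The sketch below bundles them into one hypothetical helper; it assumes predictor pods carry the `serving.kserve.io/inferenceservice` label and that the weight-pull init container is named `storage-initializer`. The `KUBECTL` variable is an addition for this example so the commands can be dry-run with `echo`.

```bash
#!/usr/bin/env bash
# KUBECTL can be overridden (for example with "echo") to dry-run the commands.
KUBECTL=${KUBECTL:-kubectl}

triage_isvc() {
  local name=$1 ns=$2
  # Conditions and URL reported by the controller.
  $KUBECTL get inferenceservice "$name" -n "$ns" -o yaml
  # Scheduling, image-pull and OOM events on the predictor pods.
  $KUBECTL describe pods -n "$ns" -l "serving.kserve.io/inferenceservice=$name"
  # Weight-pull failures surface in the init container logs.
  $KUBECTL logs -n "$ns" -l "serving.kserve.io/inferenceservice=$name" \
    -c storage-initializer --tail=50
  # Confirms the referenced runtime exists in the cluster.
  $KUBECTL get clusterservingruntimes
}

# Usage: triage_isvc llama3-70b ml-platform
# Dry run: KUBECTL=echo triage_isvc llama3-70b ml-platform
```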
Where this fits in the Yobitel stack
KServe is the default model-serving abstraction inside Yobitel's Yobibyte platform. Every inference endpoint a customer deploys through Yobibyte — whether it lands on vLLM, TensorRT-LLM under Triton, or a HuggingFace embedding runtime — runs as a KServe InferenceService underneath. The Yobibyte control plane is what translates the customer-facing Workspace + Inference primitives into the InferenceService + ServingRuntime objects on the underlying cluster, then handles autoscaling envelopes, prefix-cache-aware routing across replicas, multi-tenant network isolation and FOCUS-conformant cost attribution.
Omniscient Compute scores KServe-deployed runtimes continuously on InferenceBench v3 across H100, H200, B200 and MI300X tenancies, with each ClusterServingRuntime + flag combination measured at fixed input/output token mixes (chat, RAG, long-context, batch). The recommended `scaleTarget`, replica count and runtime args on the Yobibyte console come from an InferenceBench measurement, not a vendor datasheet.
For UK and EU sovereign workloads, KServe runs on the Yobitel London-1 and Frankfurt-1 regions inside tenancies that satisfy NCSC Cloud Security Principles, G-Cloud 14 lot definitions and the OFFICIAL handling caveat. The combination of an Apache 2.0 CRD layer, open-source runtimes, sovereign hardware and transparent benchmarking is what lets Yobitel customers deploy production model endpoints on Kubernetes without ceding control, latency budget or cost transparency to a hosted SaaS API.