KServe

TL;DR

Open-source Kubernetes-native model serving framework originally launched as KFServing under Kubeflow in 2019, renamed to KServe and split into a standalone project in 2021, hosted by the LF AI & Data Foundation from 2022 and accepted into the CNCF at the Incubating level in 2025. Apache 2.0, governed by an open Steering Committee with maintainers from Bloomberg, IBM, Red Hat, Google, NVIDIA, Cisco, AWS and Yobitel.
Built around the InferenceService CRD — a single declarative spec covering predictor, transformer, explainer, autoscaling, traffic splitting, canary, storage initialiser, OIDC auth and OpenAI-compatible LLM endpoints — and the ServingRuntime / ClusterServingRuntime CRDs that pin runtime images and arg templates.
Ships built-in runtimes for vLLM (`kserve-vllmserver`), Triton (`kserve-tritonserver`), MLServer (`kserve-mlserver`), HuggingFace native / TGI (`kserve-huggingfaceserver`, `kserve-tgiserver`), TorchServe, TF Serving, XGBoost, LightGBM, PMML and ONNX Runtime. A custom ServingRuntime takes roughly 30 lines of YAML.
Two deployment stacks: Serverless (Knative + Istio) for scale-to-zero and request-level routing, and Raw (vanilla Deployment + HPA + Gateway API) for LLM workloads where cold-start latency makes scale-to-zero impractical. Most 2026 LLM deployments use Raw.
Default LLM serving path inside Yobitel's Yobibyte platform — every InferenceBench-scored vLLM, TensorRT-LLM and Triton endpoint runs through a KServe InferenceService, autoscaled on concurrency and routed via the platform gateway across H100, H200 and B200 tenancies.

Overview

KServe is the Kubernetes-native abstraction for serving machine-learning models. Before it landed, deploying a model on Kubernetes meant assembling a Deployment, Service, HorizontalPodAutoscaler, Ingress, ConfigMap, ServiceAccount and storage-initialiser init container yourself for every model, and every team did this differently. KServe collapses the whole thing into a single InferenceService CRD: declare a model URI, a runtime and a scaling envelope, and the controller stands up the right pods, services, routes and scalers in the cluster.

The original project (KFServing) shipped in 2019 under the Kubeflow umbrella. It was renamed KServe and split into a standalone project in 2021 to broaden adoption beyond Kubeflow users, joined the LF AI & Data Foundation in 2022 and was accepted into the CNCF at the Incubating level in 2025. By mid-2026 it sits on v0.14+ with maintainers from Bloomberg, IBM, Red Hat, Google, NVIDIA, Cisco, AWS and Yobitel. The release cadence is roughly quarterly, with patch releases between.

The CRD layer is intentionally runtime-agnostic. KServe does not implement model serving itself — it composes existing runtimes (vLLM, Triton, TorchServe, MLServer, HuggingFace TGI, TF Serving, ONNX Runtime) into a uniform deployment surface. The same InferenceService spec works for a 7B LLM behind vLLM, a vision model behind Triton, an XGBoost classifier behind MLServer or an ensemble pipeline composing several runtimes — and the operator handles the scaling, routing, canary and observability story for all of them.

Yobibyte exposes KServe-compatible Inference resources to customers as the managed service surface: the Yobitel platform reconciles the underlying InferenceService and ServingRuntime objects on the customer's behalf, so consumers never own the CRD lifecycle, autoscaler choice or runtime image pin themselves. This entry documents the production surface (CRDs, built-in runtimes, deployment modes, workload patterns, sizing, observability, security, migration and troubleshooting) for teams that operate KServe directly on their own clusters, and helps everyone else recognise what Yobibyte does on their behalf. It assumes the cluster already has the NVIDIA GPU Operator installed for GPU-bound workloads.

Quick start

The example below installs KServe in Raw mode (no Knative, no Istio), then deploys Llama 3.1 8B Instruct on a single H100 via the built-in kserve-vllmserver runtime, then issues an OpenAI-compatible chat completion against the resulting endpoint. The first block installs the controller; the second block applies an InferenceService; the third block hits the endpoint with curl.

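# 1. Install KServe (prerequisite: cert-manager is already installed in the cluster).
#    Raw mode, recommended for LLM workloads, is selected per InferenceService
#    through the deploymentMode annotation in step 2.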
KSERVE_VERSION="v0.14.0"

kubectl apply --server-side -f \
    "https://github.com/kserve/kserve/releases/download/$KSERVE_VERSION/kserve.yaml"
kubectl apply --server-side -f \
    "https://github.com/kserve/kserve/releases/download/$KSERVE_VERSION/kserve-cluster-resources.yaml"

# 2. Deploy Llama 3.1 8B Instruct via vLLM ClusterServingRuntime
cat <<'YAML' | kubectl apply -f -
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: llama3-8b
  annotations:
    serving.kserve.io/deploymentMode: RawDeployment
    serving.kserve.io/autoscalerClass: hpa
spec:
  predictor:
    minReplicas: 1
    maxReplicas: 4
    scaleTarget: 80
    scaleMetric: concurrency
    model:
      modelFormat: { name: huggingface }
      runtime: kserve-vllmserver
      args:
        - --model=meta-llama/Meta-Llama-3.1-8B-Instruct
        - --max-model-len=16384
        - --quantization=fp8
        - --enable-prefix-caching
      resources:
        limits: { nvidia.com/gpu: 1, cpu: 8, memory: 64Gi }
YAML

# 3. Wait for it to come Ready, then send a request via the OpenAI API
kubectl wait --for=condition=Ready inferenceservice/llama3-8b --timeout=20m

# .status.url already carries its scheme (http:// or https://), so use it as-is
ISVC_URL=$(kubectl get inferenceservice llama3-8b \
    -o jsonpath='{.status.url}')

curl -k "$ISVC_URL/v1/chat/completions" \
    -H "Content-Type: application/json" \
    -d '{
      "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
      "messages": [{"role":"user","content":"Explain KServe in 2 lines."}],
      "max_tokens": 128
    }'

Tip: Skip Knative for LLM workloads. A 70B model takes 60-180 seconds to load weights from S3 onto GPU memory, so with scale-to-zero the first request after a scale-down waits for the full load and will typically exceed the default 60-second request timeout. Set serving.kserve.io/deploymentMode: RawDeployment and use HPA on concurrency.
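
For callers that prefer Python to curl, step 3 can be sketched with the standard library alone. This is a minimal sketch, assuming the endpoint path and payload shape used in the curl call; `build_chat_request` and the example host are hypothetical names for illustration, not part of KServe.

```python
import json
import urllib.request


def build_chat_request(base_url, model, prompt, max_tokens=128):
    """Build (but do not send) an OpenAI-style chat completion request."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical host: substitute the value of .status.url from the InferenceService.
request = build_chat_request(
    "https://llama3-8b.example.com",
    "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "Explain KServe in 2 lines.",
)
# To send it: urllib.request.urlopen(request, timeout=600)
```

Building the request separately from sending it keeps the payload easy to inspect, and the generous client timeout in the comment reflects the long completion times discussed later in this entry.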

How it works

KServe is structured as a Kubernetes controller (Go, ~80K lines) that watches the InferenceService CRD and reconciles it into the right set of Knative Services, Deployments, HPAs, VirtualServices / HTTPRoutes, ServiceAccounts and ServingRuntime references. The runtime container itself comes from a ServingRuntime or ClusterServingRuntime resource, which holds the image, command, port spec and arg template — KServe substitutes the InferenceService's model.args and model.storageUri into the template at pod creation.

On the data plane there are two architectures. In Serverless mode (the original) the predictor is a Knative Service, which means request-level autoscaling, scale-to-zero, traffic splitting via VirtualServices, and Istio sidecars enforcing mTLS. In Raw mode (introduced in v0.10, dominant by 2026) the predictor is a vanilla Kubernetes Deployment plus an HPA plus a Gateway API HTTPRoute — same surface, no Knative or Istio dependency. The choice is per-InferenceService via the serving.kserve.io/deploymentMode annotation.

A typical request flow on Raw mode: client → cluster ingress (Envoy / NGINX / cloud LB) → HTTPRoute → predictor Service → predictor Pod → ServingRuntime container (e.g. vLLM listening on :8080) → response. With a Transformer in the spec, the predictor Service points at the transformer pod which then forwards to the predictor pod via the in-cluster network. With ModelMesh enabled, multiple models share predictor pods with LRU-style loading.

Storage is handled by the storage-initializer InitContainer. When the InferenceService specifies model.storageUri: s3://bucket/path, the init container pulls weights into an emptyDir or PVC before the runtime container starts. Supported schemes include s3://, gs://, abfs://, pvc://, hf://, oci:// and http(s)://. Credentials are sourced from a referenced ServiceAccount with mounted secrets or IRSA / Workload Identity bindings.
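
A quick pre-flight check on storageUri values can catch typos before a pod is ever scheduled. The sketch below is illustrative only: `check_storage_uri` is a hypothetical helper, and the scheme set simply mirrors the list in the paragraph above rather than anything read from KServe itself.

```python
from urllib.parse import urlsplit

# Schemes taken from the paragraph above; treat the set as illustrative.
SUPPORTED_SCHEMES = {"s3", "gs", "abfs", "pvc", "hf", "oci", "http", "https"}


def check_storage_uri(uri):
    """Return (scheme, location), or raise ValueError for an unsupported scheme."""
    parts = urlsplit(uri)
    if parts.scheme not in SUPPORTED_SCHEMES:
        raise ValueError(f"unsupported storageUri scheme: {parts.scheme!r}")
    return parts.scheme, parts.netloc + parts.path


print(check_storage_uri("s3://bucket/path"))
print(check_storage_uri("pvc://model-store/llama"))
```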

Reconciliation loop — single controller watches InferenceService, ServingRuntime, ClusterServingRuntime and downstream resources. State lives only in etcd.
Predictor / Transformer / Explainer — three optional components per InferenceService; the controller wires them as a request chain.
ModelMesh — sidecar architecture for thousands of small models with LRU loading; separate controller, same CRD surface.
Storage initializer — pulls weights before the runtime starts; idempotent across pod restarts via emptyDir lifetime.
Open Inference Protocol (v2) — the standard predict/explain gRPC + REST API every runtime adheres to (LLM runtimes also expose OpenAI-compatible endpoints).
Gateway API integration — Raw mode now uses Gateway API HTTPRoutes by default; falls back to Ingress on older clusters.

Note: Serverless mode requires Knative Serving 1.14+ and Istio (or Kourier as a lighter alternative). Raw mode requires only a Gateway API implementation. If you do not already run Istio for other reasons, picking Raw is a smaller blast-radius dependency.

Reference: InferenceService spec

The InferenceService spec has ~50 top-level and nested fields. The table below covers the ones that matter on every deployment. Defaults are taken from v0.14 (mid-2026). The full schema lives at serving.kserve.io/v1beta1; v1alpha1 ModelSpec fields are still supported for ModelMesh use cases.

Field	Type	Default	Purpose
predictor.model.modelFormat.name	string	(required)	Model format used to select a compatible ServingRuntime, e.g. `huggingface`.
predictor.model.runtime	string	(auto)	Explicit ServingRuntime reference. Use for LLMs: `kserve-vllmserver`, `kserve-tritonserver`, `kserve-huggingfaceserver`.
predictor.model.storageUri	string	(none)	Where to pull weights from: `s3://`, `gs://`, `hf://`, `pvc://`, `oci://`. For HF runtime use `--model=org/name` arg instead.
predictor.model.args	[]string	[]	Runtime args appended to ServingRuntime container command.
predictor.model.env	[]EnvVar	[]	Environment variables for the runtime container.
predictor.model.resources	ResourceRequirements	(none)	CPU / memory / GPU requests and limits. Required for `nvidia.com/gpu`.
predictor.minReplicas	int	1	Lower bound. Set to 0 only on Serverless mode and only for small / cheap workloads.
predictor.maxReplicas	int	(unbounded)	Upper bound for the HPA / Knative autoscaler.
predictor.scaleTarget	int	(runtime-default)	Target value for the scale metric (e.g. 80 concurrent requests).
predictor.scaleMetric	string	concurrency	Metric the autoscaler targets, e.g. `concurrency`.
predictor.containerConcurrency	int	0	Knative-only. Hard cap on concurrent requests per pod; 0 = unlimited.
predictor.timeout	int (s)	60	Request timeout. Raise to 300-600 for long-context LLMs.
predictor.serviceAccountName	string	default	Used by storage-initializer to pull weights from S3/GCS/etc.
predictor.nodeSelector / tolerations / affinity	object	{}	Constrain to GPU node pool or MIG slice profile.
transformer.containers	[]Container	(none)	Custom pre/post-processing pod (tokenisation, image resize). Replaces predictor as the request entry point.
explainer.containers	[]Container	(none)	Interpretability sidecar (Alibi, SHAP, ART).
predictor.canaryTrafficPercent	int	0	Percentage of traffic routed to the latest revision; the rest stays on the previously rolled-out revision.
predictor.workerSpec	object	(none)	Multi-node worker spec (Ray / MPI) for tensor- or pipeline-parallel inference across nodes.
metadata.annotations.serving.kserve.io/deploymentMode	string	Serverless	Deployment mode for this InferenceService; set to `RawDeployment` to run without Knative.
metadata.annotations.serving.kserve.io/autoscalerClass	string	knative	Autoscaler used for the predictor; set to `hpa` on Raw mode.
metadata.annotations.serving.kserve.io/storage-initializer-cpu / memory	string	100m / 100Mi	Raise for large model pulls — 70B models need 4Gi+.
metadata.annotations.autoscaling.knative.dev/metric	string	concurrency	Serverless mode override of the Knative autoscaling metric, e.g. `rps`.
metadata.annotations.serving.kserve.io/enable-prometheus-scraping	string	true	Enables Prometheus annotations on predictor pods.
metadata.annotations.security.kserve.io/disable-istio-sidecar	string	false	Bypass Istio mTLS on Serverless mode.

Warning: predictor.minReplicas: 0 is tempting and almost always wrong for LLMs. Cold start on a 7B model is 30-90s; on a 70B model it is 2-5 minutes. Either keep one warm or accept guaranteed timeouts on the first request after a scale-down.
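
The two most common spec mistakes for LLM endpoints, scale-to-zero and a short timeout, are easy to check mechanically. The following is a minimal sketch of such a check over an already-parsed manifest; `lint_inference_service` is a hypothetical helper, and the defaults it assumes (Serverless, minReplicas 1, timeout 60) are the ones listed in the table above.

```python
def lint_inference_service(isvc):
    """Return a list of warnings for a parsed InferenceService manifest (plain dict)."""
    warnings = []
    annotations = isvc.get("metadata", {}).get("annotations", {})
    predictor = isvc.get("spec", {}).get("predictor", {})

    # Defaults mirror the reference table: Serverless, minReplicas 1, timeout 60s.
    mode = annotations.get("serving.kserve.io/deploymentMode", "Serverless")
    min_replicas = predictor.get("minReplicas", 1)
    timeout = predictor.get("timeout", 60)

    if min_replicas == 0 and mode != "Serverless":
        warnings.append("minReplicas: 0 only applies on Serverless mode")
    elif min_replicas == 0:
        warnings.append("minReplicas: 0 makes the first request after scale-down wait for a cold start")
    if timeout < 300:
        warnings.append(f"timeout is {timeout}s; long LLM completions may need 300-600s")
    return warnings


quick_start = {
    "metadata": {"annotations": {"serving.kserve.io/deploymentMode": "RawDeployment"}},
    "spec": {"predictor": {"minReplicas": 1, "maxReplicas": 4}},
}
for warning in lint_inference_service(quick_start):
    print(warning)
```

A check like this fits naturally into a CI step that runs before manifests are applied; the thresholds are judgment calls taken from the guidance in this section, not limits enforced by KServe.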

Workload patterns

Three patterns cover the bulk of production KServe deployments. First, an OpenAI-compatible LLM endpoint backed by vLLM. Second, canary deployment of a new model version against a live default. Third, an ensemble pipeline composing a Transformer (pre/post-processing) with a Predictor.

Pattern A — OpenAI-compatible LLM endpoint. Use the kserve-vllmserver ClusterServingRuntime, set Raw deployment mode, HPA on concurrency, minReplicas 1+, maxReplicas based on tenant peak. The runtime exposes /v1/chat/completions, /v1/completions, /v1/embeddings and /v1/models paths.

Pattern B — Canary deployment. Update the existing InferenceService (same name, new model URI) and set canaryTrafficPercent: 10 under spec.predictor. KServe rolls out a new predictor revision and routes 10% of traffic to it while the previous revision keeps the rest. Promote by removing canaryTrafficPercent so the new revision takes all traffic; roll back by setting it to 0, which pins traffic to the previous revision.

Pattern C — Transformer + Predictor. Add a transformer spec with a custom container that pre-processes incoming requests (e.g. PDF → text extraction, image → tensor) and forwards to the predictor. KServe wires the request chain transparently; the client sees one endpoint.

# A — OpenAI-compatible LLM endpoint on 4x H100
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: llama3-70b
  annotations:
    serving.kserve.io/deploymentMode: RawDeployment
    serving.kserve.io/autoscalerClass: hpa
spec:
  predictor:
    minReplicas: 2
    maxReplicas: 16
    scaleTarget: 64
    scaleMetric: concurrency
    timeout: 600
    model:
      modelFormat: { name: huggingface }
      runtime: kserve-vllmserver
      args:
        - --model=meta-llama/Meta-Llama-3.1-70B-Instruct
        - --tensor-parallel-size=4
        - --max-model-len=32768
        - --quantization=fp8
        - --kv-cache-dtype=fp8
        - --enable-prefix-caching
        - --enable-chunked-prefill
      resources:
        limits: { nvidia.com/gpu: 4, cpu: 32, memory: 256Gi }
---
# B — Canary 10% traffic to a new revision
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: llama3-70b
spec:
  predictor:
    canaryTrafficPercent: 10
    minReplicas: 2
    maxReplicas: 16
    model:
      modelFormat: { name: huggingface }
      runtime: kserve-vllmserver
      args:
        - --model=meta-llama/Meta-Llama-3.1-70B-Instruct-v2
        - --tensor-parallel-size=4
      resources:
        limits: { nvidia.com/gpu: 4 }
---
# C — Ensemble: transformer (PDF -> text) + predictor (embeddings)
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata: { name: pdf-embed }
spec:
  transformer:
    containers:
      - name: pdf-extract
        image: example/pdf-to-text:1.4
        ports: [{ containerPort: 8080 }]
  predictor:
    minReplicas: 2
    maxReplicas: 8
    model:
      modelFormat: { name: huggingface }
      runtime: kserve-huggingfaceserver
      args:
        - --model=BAAI/bge-m3
        - --task=feature-extraction
      resources:
        limits: { nvidia.com/gpu: 1 }

Tip: Pattern A's prefix-cache hit rate is the single biggest cost lever. If multiple tenants share a system prompt, send them through the same InferenceService and let vLLM hash-share the cached prefix. If they must not, run one InferenceService per tenant and accept the cache miss.
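
The canary lifecycle in Pattern B can also be driven with small patches instead of re-applying the whole manifest. This is a sketch under stated assumptions: it assumes canaryTrafficPercent sits under spec.predictor as in the upstream v1beta1 schema, and that removing the field promotes the new revision; `canary_patch` and `promote_patch` are hypothetical helper names.

```python
import json


def canary_patch(percent):
    """Merge patch that sends `percent` of traffic to the newest revision."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    return {"spec": {"predictor": {"canaryTrafficPercent": percent}}}


def promote_patch():
    """Merge patch that removes the field; a JSON null deletes a key in a merge patch."""
    return {"spec": {"predictor": {"canaryTrafficPercent": None}}}


print(json.dumps(canary_patch(10)))  # start the canary
print(json.dumps(canary_patch(0)))   # roll back: pin traffic to the previous revision
print(json.dumps(promote_patch()))   # promote: drop the field entirely
```

Each printed document could then be passed to `kubectl patch inferenceservice llama3-70b --type merge -p '<json>'`.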

Sizing and capacity planning

KServe's own footprint is small: a controller pod at roughly 150m CPU and 256 MiB, a webhook pod, and the storage-initializer init container, which runs once per pod start. The real sizing question is the runtime: vLLM, Triton and TGI are each sized per their own footprint, plus an HPA envelope. The table at the end of this section is the planning model for typical LLM serving on H100 / H200 / B200, assuming the predictor is vLLM with FP8 weights and KV cache.

minReplicas ≥ 2 for any production LLM endpoint: with a single replica, every pod restart drops the endpoint, and the HPA cannot add a second replica quickly enough to absorb a request spike.
Storage-initializer needs 4-8 GiB of memory for 70B model pulls; raise via annotations. The default 100Mi will fail.
Plan ingress capacity for the peak maxReplicas x scaleTarget request count. A 16-replica deployment scaling on 64 concurrent requests handles 1,024 concurrent requests at peak — your gateway must cope.
Per-pod cold start dominates the user-visible scale-up time: 30-90s for 7B, 2-5 minutes for 70B. Pre-warm a buffer replica during predictable traffic peaks.
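
The arithmetic behind these bullets is short enough to sketch. The first function reproduces the peak-concurrency figure used for gateway planning; the second applies the standard Kubernetes HPA scaling rule (ceil of current replicas times the ratio of observed to target metric), ignoring the HPA's tolerance band and stabilisation window. Both function names are hypothetical.

```python
import math


def peak_concurrency(max_replicas, scale_target):
    """Concurrent requests the gateway must carry when the predictor is fully scaled out."""
    return max_replicas * scale_target


def hpa_desired_replicas(current_replicas, observed_per_pod, target_per_pod,
                         min_replicas, max_replicas):
    """Replica count the HPA rule would ask for, clamped to the configured envelope."""
    desired = math.ceil(current_replicas * observed_per_pod / target_per_pod)
    return max(min_replicas, min(max_replicas, desired))


print(peak_concurrency(16, 64))
print(hpa_desired_replicas(2, 96, 64, min_replicas=2, max_replicas=16))
```

With 2 replicas each observing 96 concurrent requests against a target of 64, the rule asks for a third replica; the cold-start times above determine how long users wait for it.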

Model	Runtime	Hardware	minReplicas	maxReplicas	scaleTarget	Per-pod tok/s
Llama 3.1 8B	kserve-vllmserver	1x H100 SXM5	1	8	80	3,800-5,200
Llama 3.1 70B	kserve-vllmserver	4x H100 SXM5	2	16	64	2,800-4,200
Llama 3.1 70B (high QPS)	kserve-vllmserver	8x H100 SXM5	2	8	128	5,200-7,800
Llama 3.1 70B (128K ctx)	kserve-vllmserver	2x H200 141GB	1	6	32	1,400-2,200
Mixtral 8x22B	kserve-vllmserver	8x H100 SXM5	2	8	96	4,500-6,800
Llama 3.1 70B (Blackwell)	kserve-vllmserver	4x B200	2	8	128	6,800-10,500
BGE-M3 embeddings	kserve-huggingfaceserver	1x L40S 48GB	2	16	200	n/a (4k req/s)
XGBoost classifier	kserve-mlserver	CPU (4 vCPU)	2	32	300 rps	n/a
ResNet50 vision	kserve-tritonserver	1x L4 24GB	2	16	150	n/a (500 img/s)

Limits and quotas

KServe inherits Kubernetes' limit / quota model. The CRD itself imposes few hard limits; the practical ceilings come from etcd object size, gateway concurrency, and the underlying runtime.

Limit	Default / ceiling	How to raise
InferenceServices per namespace	ResourceQuota-bounded	Set `count/inferenceservices.serving.kserve.io` in ResourceQuota.
Predictor pod resource size	Cluster maxPodResources	Node capacity-bounded; ensure node pool has matching SKUs.
Predictor replicas	HPA-bounded (default 100)	Raise `predictor.maxReplicas` and confirm the node pool or cluster autoscaler can supply the extra GPUs.
InferenceService spec size	etcd 1.5 MiB	Keep `args`/`env` modest; avoid embedding large config in CR.
Storage-initializer pull size	PVC / emptyDir size	Use PVC with explicit size; emptyDir uses node ephemeral storage.
Request body size	Gateway-bounded	Configure ingress / Gateway API. Default Envoy is 1 MiB; raise for batch.
Request timeout	60s default	Raise `predictor.timeout`; ensure gateway timeout matches.
Concurrent revisions	Knative quota	Serverless only; configure via `config-defaults` ConfigMap.
ServingRuntimes per cluster	etcd-bounded	ClusterServingRuntime CRDs are cheap; no practical limit.
ModelMesh models per pod	Runtime-defined (~50-200)	Tune via ModelMesh `ServingRuntime` `multiModel: true` config.

Warning: The default request timeout (60s) is shorter than 70B LLM completion latency for long outputs. Raise predictor.timeout to 300-600s on every LLM InferenceService, and raise the gateway / ingress timeout to match — otherwise the client sees a 504 even when the model is still generating.

Observability

KServe exposes Prometheus metrics from three sources: the controller (kserve_controller_* reconcile counts), the predictor pod (runtime-specific — vllm:*, triton:*, mlserver:*), and the in-cluster gateway. The InferenceService status reports current and desired replicas, traffic split between default and canary, and the live URL. Standard Grafana dashboards (KServe-published 11 and 12239 for DCGM alongside) give a turnkey view. For LLMs, pair runtime metrics with DCGM GPU metrics — a queue depth spike that does not correlate with GPU utilisation usually means the bottleneck is elsewhere (storage, gateway, tokenisation).

kserve_controller_reconcile_total / _errors_total — controller-side health.
kserve_inference_service_status_ready — gauge of Ready InferenceServices per namespace.
predictor request_count / request_duration_seconds — surfaced by every runtime via the Open Inference Protocol metrics path.
vllm:time_to_first_token_seconds / vllm:gpu_cache_usage_perc — for vLLM-backed predictors.
nv_inference_request_duration_us / nv_inference_queue_duration_us — for Triton-backed predictors.
DCGM_FI_DEV_GPU_UTIL — pair with predictor metrics to distinguish compute, memory and idle bottlenecks.
envoy_cluster_upstream_rq_time / kserve_request_count — gateway-side latency and rate.

# Prometheus alerts for a KServe deployment
groups:
  - name: kserve-sla
    interval: 30s
    rules:
      - alert: InferenceServiceNotReady
        expr: kserve_inference_service_status_ready == 0
        for: 10m
        labels: { severity: critical }
        annotations:
          summary: "InferenceService {{ $labels.name }} in {{ $labels.namespace }} not Ready"

      - alert: KServeControllerReconcileFailing
        expr: rate(kserve_controller_reconcile_errors_total[10m]) > 0
        for: 15m
        labels: { severity: critical }

      - alert: PredictorPodCrashLooping
        # rate() is per-second, so alert on the number of restarts over the window instead
        expr: increase(kube_pod_container_status_restarts_total{namespace=~"serving|kserve-.*"}[15m]) > 3
        for: 10m
        labels: { severity: warning }

      - alert: VLLMPredictorTTFTHigh
        expr: |
          histogram_quantile(0.95,
            sum by (le, model_name) (
              rate(vllm:time_to_first_token_seconds_bucket[5m]))) > 1.0
        for: 5m
        labels: { severity: warning }

      - alert: CanaryRolloutDegraded
        expr: kserve_revision_request_error_rate{revision="canary"} > 0.05
        for: 10m
        labels: { severity: warning }
        annotations:
          summary: "Canary revision error rate >5% — consider rollback"

Tip: Wire the predictor's runtime metrics, gateway metrics and DCGM metrics into the same dashboard from day one. The most common production confusion is debugging a latency spike with only one of the three in view — answers always require all three.
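
The time-to-first-token alert depends on how a quantile is estimated from histogram buckets, which affects how precisely a 1.0-second threshold can be read. The sketch below mimics the usual approach of linear interpolation inside the bucket that contains the target rank; it is a simplified illustration rather than Prometheus's exact implementation, and the sample bucket counts are made up.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate quantile q from (upper_bound, cumulative_count) pairs.

    Buckets must be sorted by upper bound and end with a math.inf bucket.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                # Rank falls in the +Inf bucket: report the highest finite bound.
                return lower_bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound


# Made-up cumulative counts for time-to-first-token, in seconds.
ttft_buckets = [(0.25, 50), (0.5, 80), (1.0, 95), (2.0, 99), (math.inf, 100)]
print(histogram_quantile(0.95, ttft_buckets))
```

Because the estimate is interpolated, its resolution is limited by the bucket boundaries the runtime exports; a threshold that falls in the middle of a wide bucket is only approximate.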

Cost and FinOps

KServe is free (Apache 2.0); the cost surface is the runtime hardware and the autoscaling envelope. Two levers dominate: how aggressively you scale (replica count x duration) and how efficiently each replica converts GPU-time to served tokens. The table below uses representative cloud GPU rates and InferenceBench throughput anchors to translate KServe deployment shapes into $/M token costs.

Set minReplicas to actual steady-state demand divided by per-pod capacity, not to 1. Under-provisioning means scale-up cold start = user-visible latency.
HPA scaleTarget directly trades off cost vs latency tail. Higher target = fewer pods = lower cost but longer p99. Pick deliberately, not by default.
Canary deployments cost a full predictor: a 10% canary still runs its own replicas at the full per-replica GPU allocation, whatever share of traffic it receives. Time-box every canary.
Spot or pre-emptible GPU node pools cut hourly rates 40-60% but require predictor.maxReplicas headroom to absorb pre-emption events without dropping SLA.
ModelMesh changes the math for thousands of small models — co-location amortises per-pod overhead and can cut cost 10-50x for classical ML model fleets.

Configuration	Replicas	GPU rate ($/h)	Sustained tok/s	$/M output tokens
Llama 3.1 8B, kserve-vllmserver, 1x H100	1-4	$3.20	4,500	$0.20
Llama 3.1 70B, kserve-vllmserver, 4x H100	2-8	$12.40	3,500	$0.98
Llama 3.1 70B, kserve-vllmserver, 8x H100	2-4	$24.80	6,800	$1.01
Llama 3.1 70B (128K), kserve-vllmserver, 2x H200	1-3	$8.40	1,800	$1.30
Mixtral 8x22B, kserve-vllmserver, 8x H100	2-4	$24.80	6,200	$1.11
Llama 3.1 70B FP4, kserve-vllmserver, 4x B200	2-6	$22.00	9,200	$0.66
Llama 3.1 70B, kserve-tritonserver + TRT-LLM, 4x H100	2-4	$12.40	4,200	$0.82
BGE-M3 embeddings, kserve-huggingfaceserver, 1x L40S	2-8	$1.40	n/a	$0.05/M tokens embedded
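
The $/M output tokens column follows directly from the per-replica hourly rate and the sustained throughput, and the minReplicas advice above is a ceiling division. A minimal sketch of both calculations, with hypothetical function names and a made-up 9,000 tok/s steady-state demand figure:

```python
import math


def cost_per_million_tokens(gpu_rate_per_hour, tokens_per_second):
    """Dollars per million output tokens for one replica running at sustained throughput."""
    tokens_per_hour = tokens_per_second * 3600
    return gpu_rate_per_hour / tokens_per_hour * 1_000_000


def min_replicas_for(steady_state_tok_s, per_pod_tok_s):
    """Replicas needed to carry steady-state demand without waiting for scale-up."""
    return max(1, math.ceil(steady_state_tok_s / per_pod_tok_s))


print(round(cost_per_million_tokens(3.20, 4500), 2))   # Llama 3.1 8B on 1x H100
print(round(cost_per_million_tokens(12.40, 3500), 2))  # Llama 3.1 70B on 4x H100
print(min_replicas_for(9000, 3500))                    # hypothetical demand figure
```

The cost figure assumes each replica is kept busy at its sustained rate; idle replicas held for headroom raise the effective cost per token.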

Security and compliance

KServe inherits Kubernetes' RBAC, NetworkPolicy and Pod Security Standards. On Serverless mode, Istio sidecars provide mTLS between predictor pods and any upstream consumer. On Raw mode, the cluster's Gateway API implementation (Envoy Gateway, Istio Gateway, NGINX Gateway Fabric) handles TLS termination and authentication — typically OIDC or signed-JWT at the gateway, with KServe itself unaware of identity. The Open Inference Protocol does not specify auth; that is the gateway's job.

Model artifacts are pulled by the storage-initializer using credentials sourced from the ServiceAccount on the predictor pod. On AWS this is typically IRSA (IAM Roles for Service Accounts); on GCP it is Workload Identity; on Azure it is Microsoft Entra Workload ID, the successor to the deprecated AAD Pod Identity. The pulled weights live in an emptyDir (RAM-backed or disk) and are deleted when the pod terminates. Avoid PVC-backed weight stores unless you specifically need them shared across pods; emptyDir + storage-initializer is the cleaner model for most workloads.

Regulatory implications are workload-, data- and deployment-specific. For UK NCSC Cloud Security Principles, KServe is a control-plane component; the relevant principles are 2 (Asset protection and resilience: encrypt weights at rest in object storage), 3 (Separation between customers: namespace + NetworkPolicy isolation between tenants), 5 (Operational security: GitOps + versioned InferenceServices) and 9 (Secure user management: gateway-layer OIDC). For GDPR Article 32, the predictor processes prompts and completions only in pod memory; ensure logging redacts PII. For HIPAA, deploy inside a BAA-covered VPC and disable the request-body capture in runtime logs.

Warning: Default Knative configuration logs request bodies in the activator at debug level. For PII / PHI workloads on Serverless mode, set logging.request-log-template="" in the Knative config-observability ConfigMap to suppress request-body capture entirely.

Migration and alternatives

Most production migrations to KServe come from one of four origins: raw Deployment + Service + HPA, Seldon Core, BentoML, or a managed SaaS API (SageMaker, Vertex, Bedrock). The first delivers operational simplification with little behaviour change; the second is a near-equivalent CRD swap; the third trades Pythonic bento ergonomics for Kubernetes-native operations; the fourth trades cost for control. The table summarises the path from each starting point.

From	Effort	Trade-offs	Notes
Raw Deployment + Service + HPA	Low	Lose hand-tuned flexibility, gain autoscaling + canary + GitOps fit	Wrap existing container in a ServingRuntime; same image, same args.
Seldon Core v1 (Apache 2.0)	Low-medium	Lose Seldon's inference-graph DAG; gain CNCF governance + LLM-focused runtimes	InferenceGraph not 1:1; rebuild multi-step routing in the gateway or Ray Serve.
Seldon Core v2 (BSL)	Medium	Same as v1 plus give up MLServer co-location optimisations	ModelMesh covers similar ground for many-small-model fleets.
BentoML / Bento Cloud	Medium	Lose bento packaging ergonomics; gain Kubernetes-native operations	Bento services map cleanly to ServingRuntime + InferenceService.
SageMaker / Vertex / Bedrock	High	Gain control, sovereignty, on-prem option; lose hosted model variety	Re-platform to KServe + vLLM or Triton; expect 30-70% cost reduction at scale.
NVIDIA Triton standalone	Trivial	Gain CRD lifecycle; lose nothing	Use `kserve-tritonserver` ClusterServingRuntime; existing model repo works unchanged.
TorchServe standalone	Low	Gain CRD + autoscaling; lose nothing	Use `kserve-torchserve` runtime; same `.mar` archives work.
Hugging Face Inference Endpoints	Medium	Gain sovereignty + cost control; lose managed convenience	Use `kserve-huggingfaceserver` or `kserve-tgiserver`.
vs Yobibyte managed alternative	n/a	Keep managed convenience and gain UK / EU sovereignty; give up direct CRD ownership	Yobibyte exposes a KServe-compatible Inference resource without customers operating KServe themselves — the InferenceService, ServingRuntime, autoscaler and gateway are reconciled by the Yobitel platform across NeoCloud regions.

# Migration: from a raw Deployment + Service to KServe
# Before — hand-rolled vLLM Deployment + Service + HPA + Ingress (several YAMLs)
# After  — single InferenceService

cat <<'YAML' | kubectl apply -f -
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: llama3-70b
  namespace: ml-platform
  annotations:
    serving.kserve.io/deploymentMode: RawDeployment
    serving.kserve.io/autoscalerClass: hpa
spec:
  predictor:
    minReplicas: 2
    maxReplicas: 8
    scaleTarget: 64
    scaleMetric: concurrency
    model:
      modelFormat: { name: huggingface }
      runtime: kserve-vllmserver
      args:
        - --model=meta-llama/Meta-Llama-3.1-70B-Instruct
        - --tensor-parallel-size=4
        - --quantization=fp8
        - --enable-prefix-caching
      resources:
        limits: { nvidia.com/gpu: 4 }
YAML

# Cut over traffic at the gateway; old Deployment can be deleted once verified
kubectl delete deployment/llama3-70b-vllm service/llama3-70b-vllm \
    hpa/llama3-70b-vllm ingress/llama3-70b-vllm

Troubleshooting

The error table below covers the failure modes that account for the bulk of production KServe incidents observed on Yobitel-operated fleets and the upstream issue tracker. Each row maps an observable symptom to the underlying mechanism and the minimum-viable fix.

Symptom	Cause	Fix
InferenceService stuck in `LatestDeploymentReady=False`	Predictor pod not scheduling — GPU not available or runtime image pull failing.	Check predictor pod events; verify `nvidia.com/gpu` request matches node availability; check image pull secrets.
Cold start times out client request	Predictor minReplicas=0 + slow model load.	Set `minReplicas: 1`; raise client / gateway timeout to 300-600s; pre-warm before traffic cutover.
`storage-initializer` OOMKilled	Default 100Mi memory too small for large model pull.	Set `serving.kserve.io/storage-initializer-memory: 8Gi` annotation.
Storage URI auth failure (S3 403)	ServiceAccount missing IRSA / Workload Identity binding.	Annotate SA with the IAM role / GSA; ensure the role has `s3:GetObject` on the bucket prefix.
Runtime not found error	ServingRuntime / ClusterServingRuntime missing in cluster.	Apply `kserve-cluster-resources.yaml` from the release; verify `kubectl get clusterservingruntimes`.
Predictor scaling but never serving	HPA scaling on CPU instead of concurrency.	Set `serving.kserve.io/autoscalerClass: hpa` + `scaleMetric: concurrency` and ensure metrics-server / Prometheus Adapter is configured.
Canary traffic split not taking effect	Gateway implementation doesn't support weighted routes.	Verify Gateway API implementation supports HTTPRoute weights; or switch to Serverless mode + Knative.
Inference latency much worse than standalone runtime benchmark	Istio sidecar overhead on Serverless mode.	Disable Istio sidecar via annotation, or switch to Raw mode.
InferenceService URL returns 404 from outside the cluster	Gateway not exposed externally, or hostname not configured.	Check Gateway / Ingress status; verify DNS points to gateway external IP.
Multi-node predictor (workerSpec) hangs at NCCL init	/dev/shm too small or worker pods not on same NVLink island.	Mount `/dev/shm` with at least 8Gi; add pod affinity for same node pool.
ModelMesh runtime returns `MODEL_NOT_LOADED`	LRU evicted the model; load timing too long.	Raise `modelTTL` on ServingRuntime; raise `multiModel.replicas`.
Predictor pod logs say `OPENAI_API_KEY required`	vLLM runtime args missing api-key disable.	Add `--api-key=EMPTY` to runtime args, or set `OPENAI_API_KEY` env.

Where this fits in the Yobitel stack

KServe is the default model-serving abstraction inside Yobitel's Yobibyte platform. Every inference endpoint a customer deploys through Yobibyte — whether it lands on vLLM, TensorRT-LLM under Triton, or a HuggingFace embedding runtime — runs as a KServe InferenceService underneath. The Yobibyte control plane is what translates the customer-facing Workspace + Inference primitives into the InferenceService + ServingRuntime objects on the underlying cluster, then handles autoscaling envelopes, prefix-cache-aware routing across replicas, multi-tenant network isolation and FOCUS-conformant cost attribution.

Omniscient Compute scores KServe-deployed runtimes continuously on InferenceBench v3 across H100, H200, B200 and MI300X tenancies, with each ClusterServingRuntime + flag combination measured at fixed input/output token mixes (chat, RAG, long-context, batch). The recommended scaleTarget, replica count and runtime args on the Yobibyte console come from an InferenceBench measurement, not a vendor datasheet.

For UK and EU sovereign workloads, KServe runs on the Yobitel London-1 and Frankfurt-1 regions inside tenancies that satisfy NCSC Cloud Security Principles, G-Cloud 14 lot definitions and the OFFICIAL handling caveat. The combination of an Apache 2.0 CRD layer, open-source runtimes, sovereign hardware and transparent benchmarking is what lets Yobitel customers deploy production model endpoints on Kubernetes without ceding control, latency budget or cost transparency to a hosted SaaS API.

References

KServe Documentation · KServe Project
kserve on GitHub · GitHub
CNCF KServe Project Page · CNCF
Open Inference Protocol · GitHub
ModelMesh Serving · GitHub
Knative Serving · Knative Project
Gateway API · Kubernetes SIGs

Quick start

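# 1. Install KServe (cert-manager must already be installed in the cluster).
#    Raw mode is selected per InferenceService by the deploymentMode annotation in step 2.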
KSERVE_VERSION="v0.14.0"

kubectl apply --server-side -f \
    "https://github.com/kserve/kserve/releases/download/$KSERVE_VERSION/kserve.yaml"
kubectl apply --server-side -f \
    "https://github.com/kserve/kserve/releases/download/$KSERVE_VERSION/kserve-cluster-resources.yaml"

# 2. Deploy Llama 3.1 8B Instruct via vLLM ClusterServingRuntime
cat <<'YAML' | kubectl apply -f -
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: llama3-8b
  annotations:
    serving.kserve.io/deploymentMode: RawDeployment
    serving.kserve.io/autoscalerClass: hpa
spec:
  predictor:
    minReplicas: 1
    maxReplicas: 4
    scaleTarget: 80
    scaleMetric: concurrency
    model:
      modelFormat: { name: huggingface }
      runtime: kserve-vllmserver
      args:
        - --model=meta-llama/Meta-Llama-3.1-8B-Instruct
        - --max-model-len=16384
        - --quantization=fp8
        - --enable-prefix-caching
      resources:
        limits: { nvidia.com/gpu: 1, cpu: 8, memory: 64Gi }
YAML

# 3. Wait for it to come Ready, then send a request via the OpenAI API
kubectl wait --for=condition=Ready inferenceservice/llama3-8b --timeout=20m

# status.url already carries the scheme (http:// or https://), so use it as-is
ISVC_URL=$(kubectl get inferenceservice llama3-8b \
    -o jsonpath='{.status.url}')

curl -k "$ISVC_URL/v1/chat/completions" \
    -H "Content-Type: application/json" \
    -d '{
      "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
      "messages": [{"role":"user","content":"Explain KServe in 2 lines."}],
      "max_tokens": 128
    }'

Tip: Skip Knative for LLM workloads. A 70B model takes 60-180 seconds to load weights from S3 into GPU memory, which is longer than the default 60-second request timeout, so with scale-to-zero the first request after every scale-down times out. Set `serving.kserve.io/deploymentMode: RawDeployment` and use HPA on concurrency.

How it works

Reconciliation loop — single controller watches InferenceService, ServingRuntime, ClusterServingRuntime and downstream resources. State lives only in etcd.
Predictor / Transformer / Explainer — three optional components per InferenceService; the controller wires them as a request chain.
ModelMesh — sidecar architecture for thousands of small models with LRU loading; separate controller, same CRD surface.
Storage initializer — init container that pulls weights into an emptyDir before the runtime starts; the weights survive container restarts but are pulled again when the pod is rescheduled.
Open Inference Protocol (v2) — the standard predict/explain gRPC + REST API every runtime adheres to (LLM runtimes also expose OpenAI-compatible endpoints).
Gateway API integration — Raw mode now uses Gateway API HTTPRoutes by default; falls back to Ingress on older clusters.

Note: Serverless mode requires Knative Serving 1.14+ and Istio (or Kourier as a lighter alternative). Raw mode requires only a Gateway API implementation. If you do not already run Istio for other reasons, Raw mode adds fewer dependencies and has a smaller blast radius.

Reference: InferenceService spec

Field	Type	Default	Purpose
predictor.model.modelFormat.name	string	(required)	Model format used to match a compatible runtime, e.g. `huggingface`.
predictor.model.runtime	string	(auto)	Explicit ServingRuntime reference. Use for LLMs: `kserve-vllmserver`, `kserve-tritonserver`, `kserve-huggingfaceserver`.
predictor.model.storageUri	string	(none)	Where to pull weights from: `s3://`, `gs://`, `hf://`, `pvc://`, `oci://`. For HF runtime use `--model=org/name` arg instead.
predictor.model.args	[]string	[]	Runtime args appended to ServingRuntime container command.
predictor.model.env	[]EnvVar	[]	Environment variables for the runtime container.
predictor.model.resources	ResourceRequirements	(none)	CPU / memory / GPU requests and limits. Required for `nvidia.com/gpu`.
predictor.minReplicas	int	1	Lower bound. Set to 0 only on Serverless mode and only for small / cheap workloads.
predictor.maxReplicas	int	(unbounded)	Upper bound for the HPA / Knative autoscaler.
predictor.scaleTarget	int	(runtime-default)	Target value for the scale metric (e.g. 80 concurrent requests).
predictor.scaleMetric	string	concurrency	Metric the autoscaler targets: `concurrency`, `rps`, `cpu` or `memory`.
predictor.containerConcurrency	int	0	Knative-only. Hard cap on concurrent requests per pod; 0 = unlimited.
predictor.timeout	int (s)	60	Request timeout. Raise to 300-600 for long-context LLMs.
predictor.serviceAccountName	string	default	Used by storage-initializer to pull weights from S3/GCS/etc.
predictor.nodeSelector / tolerations / affinity	object	{}	Constrain to GPU node pool or MIG slice profile.
transformer.containers	[]Container	(none)	Custom pre/post-processing pod (tokenisation, image resize). Replaces predictor as the request entry point.
explainer.containers	[]Container	(none)	Interpretability sidecar (Alibi, SHAP, ART).
predictor.canaryTrafficPercent	int	0	Percentage of traffic routed to the latest revision; the rest stays on the previous revision.
predictor.workerSpec	object	(none)	Multi-node worker spec (Ray / MPI) for tensor- or pipeline-parallel inference across nodes.
metadata.annotations.serving.kserve.io/deploymentMode	string	Serverless	Deployment stack: `Serverless`, `RawDeployment` or `ModelMesh`.
metadata.annotations.serving.kserve.io/autoscalerClass	string	knative	Autoscaler implementation; set to `hpa` in Raw mode.
metadata.annotations.serving.kserve.io/storage-initializer-cpu / memory	string	100m / 100Mi	Raise for large model pulls — 70B models need 4Gi+.
metadata.annotations.autoscaling.knative.dev/metric	string	concurrency	Serverless-mode override of the Knative autoscaling metric, e.g. `rps`.
metadata.annotations.serving.kserve.io/enable-prometheus-scraping	string	true	Enables Prometheus annotations on predictor pods.
metadata.annotations.security.kserve.io/disable-istio-sidecar	string	false	Bypass Istio mTLS on Serverless mode.

Warning: predictor.minReplicas: 0 is tempting and almost always wrong for LLMs. Cold start on a 7B model is 30-90s; on a 70B model it is 2-5 minutes. Either keep one warm or accept guaranteed timeouts on the first request after a scale-down.

Workload patterns

# A — OpenAI-compatible LLM endpoint on 4x H100
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: llama3-70b
  annotations:
    serving.kserve.io/deploymentMode: RawDeployment
    serving.kserve.io/autoscalerClass: hpa
spec:
  predictor:
    minReplicas: 2
    maxReplicas: 16
    scaleTarget: 64
    scaleMetric: concurrency
    timeout: 600
    model:
      modelFormat: { name: huggingface }
      runtime: kserve-vllmserver
      args:
        - --model=meta-llama/Meta-Llama-3.1-70B-Instruct
        - --tensor-parallel-size=4
        - --max-model-len=32768
        - --quantization=fp8
        - --kv-cache-dtype=fp8
        - --enable-prefix-caching
        - --enable-chunked-prefill
      resources:
        limits: { nvidia.com/gpu: 4, cpu: 32, memory: 256Gi }
---
# B — Canary 10% traffic to a new revision
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: llama3-70b
spec:
  predictor:
    # canaryTrafficPercent is a predictor-level field in the v1beta1 API
    canaryTrafficPercent: 10
    minReplicas: 2
    maxReplicas: 16
    model:
      modelFormat: { name: huggingface }
      runtime: kserve-vllmserver
      args:
        - --model=meta-llama/Meta-Llama-3.1-70B-Instruct-v2
        - --tensor-parallel-size=4
      resources:
        limits: { nvidia.com/gpu: 4 }
---
# C — Ensemble: transformer (PDF -> text) + predictor (embeddings)
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata: { name: pdf-embed }
spec:
  transformer:
    containers:
      - name: pdf-extract
        image: example/pdf-to-text:1.4
        ports: [{ containerPort: 8080 }]
  predictor:
    minReplicas: 2
    maxReplicas: 8
    model:
      modelFormat: { name: huggingface }
      runtime: kserve-huggingfaceserver
      args:
        - --model=BAAI/bge-m3
        - --task=feature-extraction
      resources:
        limits: { nvidia.com/gpu: 1 }

Tip: Pattern A's prefix-cache hit rate is the single biggest cost lever. If multiple tenants share a system prompt, send them through the same InferenceService and let vLLM hash-share the cached prefix. If tenants must not share a cache, run one InferenceService per tenant and accept the cache misses.

Sizing and capacity planning

minReplicas ≥ 2 for any production LLM endpoint: with a single replica, every pod restart takes the endpoint down, and HPA cannot add replicas fast enough to absorb a request spike.
Storage-initializer needs 4-8 GiB of memory for 70B model pulls; raise via annotations. The default 100Mi will fail.
Plan ingress capacity for the peak request count, maxReplicas × scaleTarget. A 16-replica deployment scaling on 64 concurrent requests per replica handles 1,024 concurrent requests at peak, and your gateway must cope with that.
Per-pod cold start dominates the user-visible scale-up time: 30-90s for 7B, 2-5 minutes for 70B. Pre-warm a buffer replica during predictable traffic peaks.
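
The ingress arithmetic in the list above is simple enough to script as a pre-deployment check. The following is a minimal sketch in plain POSIX shell; `peak_concurrency` is a hypothetical helper name, not a KServe command, and it assumes the scale metric is per-replica concurrency.

```shell
#!/bin/sh
# Hypothetical helper (not part of KServe): worst-case number of in-flight
# requests the gateway must carry when the predictor is fully scaled out.
# usage: peak_concurrency MAX_REPLICAS SCALE_TARGET
peak_concurrency() {
    echo $(( $1 * $2 ))
}

# Example from the list above: 16 replicas, 64 concurrent requests each.
peak_concurrency 16 64   # prints 1024
```

Comparing this number against the gateway's configured connection and concurrency limits before raising maxReplicas avoids moving the bottleneck from the predictor to the ingress.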

Model	Runtime	Hardware	minReplicas	maxReplicas	scaleTarget	Per-pod tok/s
Llama 3.1 8B	kserve-vllmserver	1x H100 SXM5	1	8	80	3,800-5,200
Llama 3.1 70B	kserve-vllmserver	4x H100 SXM5	2	16	64	2,800-4,200
Llama 3.1 70B (high QPS)	kserve-vllmserver	8x H100 SXM5	2	8	128	5,200-7,800
Llama 3.1 70B (128K ctx)	kserve-vllmserver	2x H200 141GB	1	6	32	1,400-2,200
Mixtral 8x22B	kserve-vllmserver	8x H100 SXM5	2	8	96	4,500-6,800
Llama 3.1 70B (Blackwell)	kserve-vllmserver	4x B200	2	8	128	6,800-10,500
BGE-M3 embeddings	kserve-huggingfaceserver	1x L40S 48GB	2	16	200	n/a (4k req/s)
XGBoost classifier	kserve-mlserver	CPU (4 vCPU)	2	32	300 rps	n/a
ResNet50 vision	kserve-tritonserver	1x L4 24GB	2	16	150	n/a (500 img/s)

Limits and quotas

KServe inherits Kubernetes' limit / quota model. The CRD itself imposes few hard limits; the practical ceilings come from etcd object size, gateway concurrency, and the underlying runtime.

Limit	Default / ceiling	How to raise
InferenceServices per namespace	ResourceQuota-bounded	Set `count/inferenceservices.serving.kserve.io` in ResourceQuota.
Predictor pod resource size	Cluster maxPodResources	Node capacity-bounded; ensure node pool has matching SKUs.
Predictor replicas	HPA-bounded (default 100)	Raise `predictor.maxReplicas`; the generated HPA takes its maximum from it.
InferenceService spec size	etcd 1.5 MiB	Keep `args`/`env` modest; avoid embedding large config in CR.
Storage-initializer pull size	PVC / emptyDir size	Use PVC with explicit size; emptyDir uses node ephemeral storage.
Request body size	Gateway-bounded	Configure ingress / Gateway API. Default Envoy is 1 MiB; raise for batch.
Request timeout	60s default	Raise `predictor.timeout`; ensure gateway timeout matches.
Concurrent revisions	Knative quota	Serverless only; configure via `config-defaults` ConfigMap.
ServingRuntimes per cluster	etcd-bounded	ClusterServingRuntime CRDs are cheap; no practical limit.
ModelMesh models per pod	Runtime-defined (~50-200)	Tune via ModelMesh `ServingRuntime` `multiModel: true` config.

Warning: The default request timeout (60s) is shorter than 70B LLM completion latency for long outputs. Raise predictor.timeout to 300-600s on every LLM InferenceService, and raise the gateway / ingress timeout to match — otherwise the client sees a 504 even when the model is still generating.

Observability

kserve_controller_reconcile_total / _errors_total — controller-side health.
kserve_inference_service_status_ready — gauge of Ready InferenceServices per namespace.
predictor request_count / request_duration_seconds — surfaced by every runtime via the Open Inference Protocol metrics path.
vllm:time_to_first_token_seconds / vllm:gpu_cache_usage_perc — for vLLM-backed predictors.
nv_inference_request_duration_us / nv_inference_queue_duration_us — for Triton-backed predictors.
DCGM_FI_DEV_GPU_UTIL — pair with predictor metrics to distinguish compute, memory and idle bottlenecks.
envoy_cluster_upstream_rq_time / kserve_request_count — gateway-side latency and rate.

# Prometheus alerts for a KServe deployment
groups:
  - name: kserve-sla
    interval: 30s
    rules:
      - alert: InferenceServiceNotReady
        expr: kserve_inference_service_status_ready == 0
        for: 10m
        labels: { severity: critical }
        annotations:
          summary: "InferenceService {{ $labels.name }} in {{ $labels.namespace }} not Ready"

      - alert: KServeControllerReconcileFailing
        expr: rate(kserve_controller_reconcile_errors_total[10m]) > 0
        for: 15m
        labels: { severity: critical }

      - alert: PredictorPodCrashLooping
        expr: rate(kube_pod_container_status_restarts_total{namespace=~"serving|kserve-.*"}[15m]) > 0.2
        for: 10m
        labels: { severity: warning }

      - alert: VLLMPredictorTTFTHigh
        expr: |
          histogram_quantile(0.95,
            sum by (le, model_name) (
              rate(vllm:time_to_first_token_seconds_bucket[5m]))) > 1.0
        for: 5m
        labels: { severity: warning }

      - alert: CanaryRolloutDegraded
        expr: kserve_revision_request_error_rate{revision="canary"} > 0.05
        for: 10m
        labels: { severity: warning }
        annotations:
          summary: "Canary revision error rate >5% — consider rollback"

Tip: Wire the predictor's runtime metrics, gateway metrics and DCGM metrics into the same dashboard from day one. The most common production confusion is debugging a latency spike with only one of the three in view — answers always require all three.

Cost and FinOps

Set minReplicas to actual steady-state demand divided by per-pod capacity, rounded up, not to 1. Under-provisioning turns every scale-up cold start into user-visible latency.
HPA scaleTarget directly trades off cost vs latency tail. Higher target = fewer pods = lower cost but longer p99. Pick deliberately, not by default.
Canary deployments cost the full predictor — a 10% canary on a 4-replica default is still 4 extra GPUs of spend. Time-box every canary.
Spot or pre-emptible GPU node pools cut hourly rates 40-60% but require predictor.maxReplicas headroom to absorb pre-emption events without dropping SLA.
ModelMesh changes the math for thousands of small models — co-location amortises per-pod overhead and can cut cost 10-50x for classical ML model fleets.
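
The minReplicas rule in the first bullet is a ceiling division. A minimal sketch in POSIX shell follows; `min_replicas` is a hypothetical helper name, and the inputs (steady-state concurrent requests and the per-pod concurrency you are willing to run at) are assumptions you would take from your own traffic data.

```shell
#!/bin/sh
# Hypothetical helper (not part of KServe): replicas needed to carry the
# steady-state load without relying on scale-up.
# usage: min_replicas STEADY_STATE_CONCURRENCY PER_POD_CAPACITY
min_replicas() {
    # Integer ceiling division: (a + b - 1) / b
    echo $(( ($1 + $2 - 1) / $2 ))
}

# 150 concurrent requests at steady state, 64 per pod
min_replicas 150 64   # prints 3
```

For production LLM endpoints, take the larger of this result and the floor of 2 recommended under sizing and capacity planning.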

Configuration	Replicas	GPU rate ($/h)	Sustained tok/s	$/M output tokens
Llama 3.1 8B, kserve-vllmserver, 1x H100	1-4	$3.20	4,500	$0.20
Llama 3.1 70B, kserve-vllmserver, 4x H100	2-8	$12.40	3,500	$0.98
Llama 3.1 70B, kserve-vllmserver, 8x H100	2-4	$24.80	6,800	$1.01
Llama 3.1 70B (128K), kserve-vllmserver, 2x H200	1-3	$8.40	1,800	$1.30
Mixtral 8x22B, kserve-vllmserver, 8x H100	2-4	$24.80	6,200	$1.11
Llama 3.1 70B FP4, kserve-vllmserver, 4x B200	2-6	$22.00	9,200	$0.66
Llama 3.1 70B, kserve-tritonserver + TRT-LLM, 4x H100	2-4	$12.40	4,200	$0.82
BGE-M3 embeddings, kserve-huggingfaceserver, 1x L40S	2-8	$1.40	n/a	$0.05/M tokens embedded
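
The $/M output tokens column in the table above is consistent with dividing the hourly GPU rate of one pod by the tokens that pod produces in an hour. A minimal sketch of that calculation, assuming the rate and the sustained tok/s both refer to a single pod; `cost_per_million_tokens` is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: USD per million output tokens for one pod.
# usage: cost_per_million_tokens GPU_RATE_USD_PER_HOUR SUSTAINED_TOK_PER_SEC
cost_per_million_tokens() {
    awk -v rate="$1" -v tps="$2" \
        'BEGIN { printf "%.2f\n", rate / (tps * 3600) * 1000000 }'
}

# Llama 3.1 70B on 4x H100: $12.40/h at 3,500 tok/s
cost_per_million_tokens 12.40 3500   # prints 0.98
# Llama 3.1 8B on 1x H100: $3.20/h at 4,500 tok/s
cost_per_million_tokens 3.20 4500    # prints 0.20
```

The calculation assumes the pod is busy for the whole hour; idle replicas held by minReplicas raise the effective cost per token.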

Security and compliance

Warning: Default Knative configuration logs request bodies in the activator at debug level. For PII / PHI workloads on Serverless mode, set logging.request-log-template="" in the Knative config-observability ConfigMap to suppress request-body capture entirely.

Migration and alternatives

From	Effort	Trade-offs	Notes
Raw Deployment + Service + HPA	Low	Lose hand-tuned flexibility, gain autoscaling + canary + GitOps fit	Wrap existing container in a ServingRuntime; same image, same args.
Seldon Core v1 (Apache 2.0)	Low-medium	Lose Seldon's inference-graph DAG; gain CNCF governance + LLM-focused runtimes	InferenceGraph not 1:1; rebuild multi-step routing in the gateway or Ray Serve.
Seldon Core v2 (BSL)	Medium	Same as v1 plus give up MLServer co-location optimisations	ModelMesh covers similar ground for many-small-model fleets.
BentoML / Bento Cloud	Medium	Lose bento packaging ergonomics; gain Kubernetes-native operations	Bento services map cleanly to ServingRuntime + InferenceService.
SageMaker / Vertex / Bedrock	High	Gain control, sovereignty, on-prem option; lose hosted model variety	Re-platform to KServe + vLLM or Triton; expect 30-70% cost reduction at scale.
NVIDIA Triton standalone	Trivial	Gain CRD lifecycle; lose nothing	Use `kserve-tritonserver` ClusterServingRuntime; existing model repo works unchanged.
TorchServe standalone	Low	Gain CRD + autoscaling; lose nothing	Use `kserve-torchserve` runtime; same `.mar` archives work.
Hugging Face Inference Endpoints	Medium	Gain sovereignty + cost control; lose managed convenience	Use `kserve-huggingfaceserver` or `kserve-tgiserver`.
vs Yobibyte managed alternative	n/a	Keep managed convenience and gain UK / EU sovereignty; give up direct CRD ownership	Yobibyte exposes a KServe-compatible Inference resource without customers operating KServe themselves — the InferenceService, ServingRuntime, autoscaler and gateway are reconciled by the Yobitel platform across NeoCloud regions.

# Migration: from a raw Deployment + Service to KServe
# Before — hand-rolled vLLM Deployment + Service + HPA + Ingress (several YAMLs)
# After  — single InferenceService

cat <<'YAML' | kubectl apply -f -
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: llama3-70b
  namespace: ml-platform
  annotations:
    serving.kserve.io/deploymentMode: RawDeployment
    serving.kserve.io/autoscalerClass: hpa
spec:
  predictor:
    minReplicas: 2
    maxReplicas: 8
    scaleTarget: 64
    scaleMetric: concurrency
    model:
      modelFormat: { name: huggingface }
      runtime: kserve-vllmserver
      args:
        - --model=meta-llama/Meta-Llama-3.1-70B-Instruct
        - --tensor-parallel-size=4
        - --quantization=fp8
        - --enable-prefix-caching
      resources:
        limits: { nvidia.com/gpu: 4 }
YAML

# Cut over traffic at the gateway; old Deployment can be deleted once verified
kubectl delete deployment/llama3-70b-vllm service/llama3-70b-vllm \
    hpa/llama3-70b-vllm ingress/llama3-70b-vllm

Troubleshooting

Symptom	Cause	Fix
InferenceService stuck in `LatestDeploymentReady=False`	Predictor pod not scheduling — GPU not available or runtime image pull failing.	Check predictor pod events; verify `nvidia.com/gpu` request matches node availability; check image pull secrets.
Cold start times out client request	Predictor minReplicas=0 + slow model load.	Set `minReplicas: 1`; raise client / gateway timeout to 300-600s; pre-warm before traffic cutover.
`storage-initializer` OOMKilled	Default 100Mi memory too small for large model pull.	Set `serving.kserve.io/storage-initializer-memory: 8Gi` annotation.
Storage URI auth failure (S3 403)	ServiceAccount missing IRSA / Workload Identity binding.	Annotate SA with the IAM role / GSA; ensure the role has `s3:GetObject` on the bucket prefix.
Runtime not found error	ServingRuntime / ClusterServingRuntime missing in cluster.	Apply `kserve-cluster-resources.yaml` from the release; verify `kubectl get clusterservingruntimes`.
Predictor scaling but never serving	HPA scaling on CPU instead of concurrency.	Set `serving.kserve.io/autoscalerClass: hpa` + `scaleMetric: concurrency` and ensure metrics-server / Prometheus Adapter is configured.
Canary traffic split not taking effect	Gateway implementation doesn't support weighted routes.	Verify Gateway API implementation supports HTTPRoute weights; or switch to Serverless mode + Knative.
Inference latency much worse than standalone runtime benchmark	Istio sidecar overhead on Serverless mode.	Disable Istio sidecar via annotation, or switch to Raw mode.
InferenceService URL returns 404 from outside the cluster	Gateway not exposed externally, or hostname not configured.	Check Gateway / Ingress status; verify DNS points to gateway external IP.
Multi-node predictor (workerSpec) hangs at NCCL init	/dev/shm too small or worker pods not on same NVLink island.	Mount a memory-backed emptyDir of at least 8Gi at `/dev/shm`; add pod affinity for same node pool.
ModelMesh runtime returns `MODEL_NOT_LOADED`	LRU evicted the model, or the model takes too long to load.	Raise `modelTTL` on ServingRuntime; raise `multiModel.replicas`.
Predictor pod logs say `OPENAI_API_KEY required`	vLLM runtime args missing api-key disable.	Add `--api-key=EMPTY` to runtime args, or set `OPENAI_API_KEY` env.

