TL;DR
- Kubeflow is a CNCF Incubating umbrella project covering the ML lifecycle on Kubernetes — pipelines, distributed training, hyperparameter tuning, notebooks, and model serving.
- Started at Google in 2017 and donated to the CNCF in 2023; now governed by the Kubeflow Steering Committee with maintainers from Google, IBM, Apple, NVIDIA, Red Hat, and Bloomberg.
- The sub-projects (Pipelines, Training Operator, Katib, Notebooks, KServe, Spark Operator) are usable independently — most clusters adopt one or two pieces rather than the full distribution.
- Training Operator hosts the canonical PyTorchJob, TFJob, MPIJob, XGBoostJob, and JAXJob CRDs, the usual basis for distributed training on Kubernetes outside Ray-centric stacks.
What Kubeflow Actually Is#
Kubeflow is best understood not as a single product but as a federation of related projects under one governance umbrella. The constituent projects are independently versioned, independently installable, and increasingly used outside the Kubeflow distribution. Knowing which sub-project you care about is more useful than knowing "Kubeflow" as a monolith.
- Kubeflow Pipelines — DAG-based ML workflow engine, originally Argo-Workflows-based, now also supports a Tekton backend.
- Training Operator — CRDs and reconciliation for distributed PyTorch, TensorFlow, MPI, XGBoost, and JAX jobs.
- Katib — hyperparameter tuning and neural architecture search.
- Notebooks — multi-user JupyterLab, VS Code Server, and RStudio environments on Kubernetes, with per-namespace resource quotas (including GPUs).
- KServe — model serving abstraction (now its own CNCF Incubating project, see separate entry).
- Spark Operator — Apache Spark on Kubernetes.
- Central Dashboard + Profile Controller — multi-tenant UI and namespace provisioning.
Training Operator#
The Training Operator is the most widely deployed Kubeflow component. It provides PyTorchJob, TFJob, MPIJob, XGBoostJob, and JAXJob: Kubernetes CRDs that describe a distributed training run declaratively. The operator reconciles pods, handles rendezvous (for PyTorchJob, by injecting environment variables such as `MASTER_ADDR` and `WORLD_SIZE` into each pod), surfaces job status, and integrates with Volcano and Kueue for scheduling.
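To make the declarative shape concrete, here is a minimal Python sketch that assembles a PyTorchJob manifest as a plain dictionary and adds up the GPUs it requests. The helper names `build_pytorchjob` and `total_gpus` are hypothetical and are not part of any Kubeflow SDK; the sketch only mirrors the `kubeflow.org/v1` field layout and assumes one container per replica.

```python
def build_pytorchjob(name, image, workers, gpus_per_pod):
    """Return a PyTorchJob manifest as a plain dict (hypothetical helper)."""

    def replica(count):
        # One replica spec: `count` pods, each running a single training container.
        return {
            "replicas": count,
            "template": {
                "spec": {
                    "containers": [
                        {
                            "name": "pytorch",
                            "image": image,
                            "resources": {"limits": {"nvidia.com/gpu": gpus_per_pod}},
                        }
                    ]
                }
            },
        }

    return {
        "apiVersion": "kubeflow.org/v1",
        "kind": "PyTorchJob",
        "metadata": {"name": name},
        "spec": {
            "pytorchReplicaSpecs": {
                "Master": replica(1),        # exactly one master pod
                "Worker": replica(workers),  # plus N worker pods
            }
        },
    }


def total_gpus(manifest):
    """Sum GPU limits over all replica types (assumes one container per pod)."""
    total = 0
    for spec in manifest["spec"]["pytorchReplicaSpecs"].values():
        container = spec["template"]["spec"]["containers"][0]
        total += spec["replicas"] * container["resources"]["limits"]["nvidia.com/gpu"]
    return total


job = build_pytorchjob(
    "bert-pretrain", "yobitel/torch-bert:2.4-cu124", workers=7, gpus_per_pod=8
)
print(total_gpus(job))  # prints 64: (1 master + 7 workers) x 8 GPUs
```

The hand-written manifest that follows has the same shape: one master and seven workers with eight GPUs each, 64 GPUs in total.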
```yaml
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: bert-pretrain
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      template:
        spec:
          containers:
            - name: pytorch
              image: yobitel/torch-bert:2.4-cu124
              resources: { limits: { nvidia.com/gpu: 8 } }
    Worker:
      replicas: 7
      template:
        spec:
          containers:
            - name: pytorch
              image: yobitel/torch-bert:2.4-cu124
              resources: { limits: { nvidia.com/gpu: 8 } }
```
Kubeflow Pipelines#
Kubeflow Pipelines (KFP) lets you author ML workflows in Python with `@dsl.component` and `@dsl.pipeline` decorators, compile them to a YAML spec, and run them on a Kubernetes cluster. Each step is a containerised pod with declared inputs and outputs; the engine handles caching, artifact lineage, and parallelism.
KFP v2 (current) is metadata-first — artifacts and lineage are stored in ML Metadata (MLMD) and the SDK is portable across backends (Argo Workflows on Kubeflow, Vertex AI Pipelines on GCP).
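The step caching mentioned above can be illustrated with a small standard-library sketch. This is not how KFP is implemented and it does not use the KFP SDK; it is a toy model that assumes a step may be skipped when its name and inputs match a previous execution. The names `cache_key` and `run_pipeline` are hypothetical.

```python
import hashlib
import json
from graphlib import TopologicalSorter


def cache_key(step_name, inputs):
    """Derive a deterministic key from the step name and its JSON-serialisable inputs."""
    payload = json.dumps({"step": step_name, "inputs": inputs}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()


def run_pipeline(steps, deps, params, cache):
    """Run steps in dependency order, reusing cached outputs when inputs match.

    steps: step name -> function taking an inputs dict and returning an output
    deps:  step name -> list of upstream step names
    """
    outputs, executed = {}, []
    for name in TopologicalSorter(deps).static_order():
        inputs = {"params": params, "upstream": {d: outputs[d] for d in deps[name]}}
        key = cache_key(name, inputs)
        if key not in cache:
            cache[key] = steps[name](inputs)  # cache miss: actually run the step
            executed.append(name)
        outputs[name] = cache[key]
    return outputs, executed


steps = {
    "preprocess": lambda i: i["params"]["rows"] * 2,
    "train": lambda i: i["upstream"]["preprocess"] + 1,
    "evaluate": lambda i: i["upstream"]["train"] * 10,
}
deps = {"preprocess": [], "train": ["preprocess"], "evaluate": ["train"]}

cache = {}
first_outputs, first_run = run_pipeline(steps, deps, {"rows": 5}, cache)
second_outputs, second_run = run_pipeline(steps, deps, {"rows": 5}, cache)
print(first_run)   # all three steps execute on the first run
print(second_run)  # nothing executes on the second run: every step is a cache hit
```

Changing `rows` changes every cache key in this toy model, so all three steps would run again. A real engine tracks inputs per step, so only the affected steps are re-executed.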
Katib#
Katib is the hyperparameter tuning and NAS sub-project. It supports Bayesian optimisation, TPE, random search, grid search, Hyperband, and several NAS algorithms (ENAS, DARTS). Trials are launched as Kubernetes Jobs (or PyTorchJobs/TFJobs for distributed trials), and Katib's metrics collector, typically a sidecar injected into each trial pod, gathers metrics from the trial's stdout or from a metrics file.
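As a rough illustration of the simplest of these algorithms, the sketch below runs random search over a two-parameter space in plain Python. It does not use the Katib API: each "trial" is an in-process function call instead of a Kubernetes Job, and the objective function, parameter names, and ranges are made up for the example.

```python
import random


def random_search(objective, space, max_trials, seed=0):
    """Sample each parameter uniformly from its range and keep the lowest objective."""
    rng = random.Random(seed)  # seeded so repeated runs sample the same points
    trials = []
    for _ in range(max_trials):
        params = {name: rng.uniform(lo, hi) for name, (lo, hi) in space.items()}
        trials.append((objective(params), params))
    best = min(trials, key=lambda t: t[0])
    return best, trials


def toy_objective(params):
    # Hypothetical stand-in for a validation loss; minimised at lr=0.01, dropout=0.2.
    return (params["lr"] - 0.01) ** 2 + (params["dropout"] - 0.2) ** 2


space = {"lr": (0.001, 0.1), "dropout": (0.0, 0.5)}
best, trials = random_search(toy_objective, space, max_trials=20)
print(best)
```

In Katib the same loop is distributed: the suggestion service proposes parameter sets, each one becomes a trial workload on the cluster, and the reported metric plays the role of the return value of `toy_objective`.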
Notebooks and Profiles#
Kubeflow Notebooks provides multi-user JupyterLab, VS Code Server, and RStudio on Kubernetes with per-namespace resource quotas. Profiles are Kubeflow's tenancy abstraction — each user (or team) gets a Profile that maps to a namespace, NetworkPolicies, and resource quotas. Combined with the Central Dashboard, this gives a turnkey research environment for shared GPU clusters.
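The Profile-to-namespace mapping can be sketched as a manifest built in Python. The field names below (`owner`, `resourceQuotaSpec`) follow the `kubeflow.org/v1` Profile resource as commonly documented, but treat them as an assumption and check them against the CRD installed in your cluster. The helper `build_profile`, the profile name, and the email address are hypothetical.

```python
import json


def build_profile(name, owner_email, gpu_limit):
    """Return a Profile manifest as a plain dict (hypothetical helper)."""
    return {
        "apiVersion": "kubeflow.org/v1",
        "kind": "Profile",
        "metadata": {"name": name},  # also used as the namespace name
        "spec": {
            "owner": {"kind": "User", "name": owner_email},
            # Becomes a ResourceQuota in the profile's namespace (assumed field layout).
            "resourceQuotaSpec": {
                "hard": {"requests.nvidia.com/gpu": str(gpu_limit)}
            },
        },
    }


profile = build_profile("team-nlp", "alice@example.com", gpu_limit=4)
print(json.dumps(profile, indent=2))
```

The printed JSON can be saved to a file and applied with `kubectl apply -f`, since Kubernetes accepts JSON manifests as well as YAML.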
Adoption Patterns#
Few teams adopt the full Kubeflow distribution. Common subsets in 2026:
- Training Operator only — use the CRDs for PyTorch/MPI/TF distributed training, scheduled by Volcano or Kueue.
- Pipelines + Training Operator — declarative ML workflows with distributed training steps.
- Notebooks + Profiles — multi-tenant research environment, often standalone.
- Full distribution — typically only via vendor distributions (Charmed Kubeflow, DKP, Red Hat OpenShift AI).
Don't install "Kubeflow" — install the specific sub-projects you need. Each ships its own Helm chart or kustomize overlay and has lower operational cost than the full distribution.