Kubeflow

TL;DR

Kubeflow is a CNCF Incubating umbrella project covering the ML lifecycle on Kubernetes — pipelines, distributed training, hyperparameter tuning, notebooks, and model serving.
Started at Google in 2017 and donated to the CNCF in 2023; now governed by the Kubeflow Steering Committee with maintainers from Google, IBM, Apple, NVIDIA, Red Hat, and Bloomberg.
The sub-projects (Pipelines, Training Operator, Katib, Notebooks, KServe, Spark Operator) are usable independently — most clusters adopt one or two pieces rather than the full distribution.
Training Operator hosts the canonical PyTorchJob, TFJob, MPIJob, XGBoostJob, and JAXJob CRDs, a common basis for distributed training on Kubernetes outside Ray-centric stacks.

What Kubeflow Actually Is

Kubeflow is best understood not as a single product but as a federation of related projects under one governance umbrella. The constituent projects are independently versioned, independently installable, and increasingly used outside the Kubeflow distribution. Knowing which sub-project you care about is more useful than knowing "Kubeflow" as a monolith.

Kubeflow Pipelines: DAG-based ML workflow engine, originally built on Argo Workflows, with a Tekton backend also available.
Training Operator: CRDs and reconciliation for distributed PyTorch, TensorFlow, MPI, XGBoost, and JAX jobs.
Katib: hyperparameter tuning and neural architecture search.
Notebooks: multi-user JupyterLab, VS Code Server, and RStudio servers on Kubernetes with per-namespace resource quotas, GPUs included.
KServe: model serving abstraction (now its own CNCF Incubating project, see separate entry).
Spark Operator: Apache Spark on Kubernetes.
Central Dashboard + Profile Controller: multi-tenant UI and namespace provisioning.

Training Operator

The Training Operator is the most widely deployed Kubeflow component. It provides PyTorchJob, TFJob, MPIJob, XGBoostJob, and JAXJob: Kubernetes CRDs that describe a distributed training run declaratively. The operator creates and reconciles the pods for each replica type, wires up rendezvous between them, surfaces job status, and integrates with Volcano and Kueue for scheduling.

yaml

apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: bert-pretrain
spec:
  pytorchReplicaSpecs:
    # One Master pod; workers rendezvous with it at startup.
    Master:
      replicas: 1
      template:
        spec:
          containers:
            - name: pytorch
              image: yobitel/torch-bert:2.4-cu124
              resources: { limits: { nvidia.com/gpu: 8 } }
    # Seven Worker pods, so the job spans 8 pods x 8 GPUs = 64 GPUs.
    Worker:
      replicas: 7
      template:
        spec:
          containers:
            - name: pytorch
              image: yobitel/torch-bert:2.4-cu124
              resources: { limits: { nvidia.com/gpu: 8 } }
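
A manifest like the one above can also be generated programmatically, which helps when replica counts or images vary between runs. The following is a minimal sketch using only the Python standard library. The helper names `replica_spec`, `pytorch_job`, and `total_gpus` are hypothetical, and the result would still need to be submitted with `kubectl apply` or a Kubernetes client.

```python
import json


def replica_spec(image: str, replicas: int, gpus: int) -> dict:
    """Build one entry of pytorchReplicaSpecs (hypothetical helper)."""
    return {
        "replicas": replicas,
        "template": {
            "spec": {
                "containers": [
                    {
                        "name": "pytorch",
                        "image": image,
                        "resources": {"limits": {"nvidia.com/gpu": gpus}},
                    }
                ]
            }
        },
    }


def pytorch_job(name: str, image: str, workers: int, gpus_per_pod: int) -> dict:
    """Return a PyTorchJob manifest with one Master and `workers` Worker pods."""
    return {
        "apiVersion": "kubeflow.org/v1",
        "kind": "PyTorchJob",
        "metadata": {"name": name},
        "spec": {
            "pytorchReplicaSpecs": {
                "Master": replica_spec(image, 1, gpus_per_pod),
                "Worker": replica_spec(image, workers, gpus_per_pod),
            }
        },
    }


def total_gpus(job: dict) -> int:
    """Sum GPU limits across every replica of every replica type."""
    total = 0
    for spec in job["spec"]["pytorchReplicaSpecs"].values():
        for container in spec["template"]["spec"]["containers"]:
            limits = container["resources"]["limits"]
            total += spec["replicas"] * limits.get("nvidia.com/gpu", 0)
    return total


job = pytorch_job("bert-pretrain", "yobitel/torch-bert:2.4-cu124", workers=7, gpus_per_pod=8)
print(json.dumps(job, indent=2))  # JSON output can be applied like a YAML manifest
print("GPUs requested:", total_gpus(job))
```

Computing the GPU total up front is a cheap sanity check against the queue or quota the job will be admitted under.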

Kubeflow Pipelines

Kubeflow Pipelines (KFP) lets you author ML workflows in Python with `@dsl.component` and `@dsl.pipeline` decorators, compile them to a YAML spec, and run them on a Kubernetes cluster. Each step is a containerised pod with declared inputs and outputs; the engine handles caching, artifact lineage, and parallelism.

KFP v2 (current) is metadata-first — artifacts and lineage are stored in ML Metadata (MLMD) and the SDK is portable across backends (Argo Workflows on Kubeflow, Vertex AI Pipelines on GCP).
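
Step caching is easiest to see in miniature. The sketch below is a toy model in plain Python, not KFP's implementation and not the KFP SDK: it assumes a step's cache key is a hash of the step name and its inputs, and it skips execution when that key has already been seen. The names `run_step`, `preprocess`, `train`, and `pipeline` are hypothetical.

```python
import hashlib
import json
from typing import Any, Callable

_cache: dict[str, Any] = {}   # cache key -> stored step output
executions: list[str] = []    # names of steps that actually ran


def run_step(name: str, fn: Callable[..., Any], **inputs: Any) -> Any:
    """Run fn(**inputs) unless an identical (name, inputs) pair already ran."""
    key = hashlib.sha256(
        json.dumps({"step": name, "inputs": inputs}, sort_keys=True).encode()
    ).hexdigest()
    if key not in _cache:
        executions.append(name)
        _cache[key] = fn(**inputs)
    return _cache[key]


def preprocess(rows: int) -> int:
    """Stand-in for a data preparation step."""
    return rows * 2


def train(samples: int, lr: float) -> float:
    """Stand-in for a training step."""
    return round(samples * lr, 4)


def pipeline(rows: int, lr: float) -> float:
    """A two-step DAG: the output of preprocess feeds train."""
    samples = run_step("preprocess", preprocess, rows=rows)
    return run_step("train", train, samples=samples, lr=lr)


pipeline(rows=100, lr=0.01)   # both steps execute
pipeline(rows=100, lr=0.02)   # preprocess is a cache hit; only train re-runs
print(executions)
```

Changing only the learning rate leaves the preprocessing key untouched, which is why the second run reuses that step's output.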

Katib

Katib is the hyperparameter tuning and neural architecture search (NAS) sub-project. It supports Bayesian optimisation, TPE, random search, grid search, Hyperband, and several NAS algorithms (ENAS, DARTS). Trials are launched as Kubernetes Jobs (or PyTorchJobs/TFJobs for distributed trials), and Katib collects metrics from each trial's stdout or from a sidecar.
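
As a rough illustration of the loop Katib automates, the following sketch runs a random search in-process: it samples hyperparameters, runs a stand-in trial that prints a `name=value` metric line, and parses that line back out. The synthetic objective, the search space, and the `accuracy=` line format are assumptions made for the example; real Katib trials run as separate Kubernetes workloads.

```python
import random
import re

METRIC_LINE = re.compile(r"^accuracy=([0-9.]+)$", re.MULTILINE)

SEARCH_SPACE = {"lr": (1e-4, 1e-1), "batch_size": [16, 32, 64, 128]}


def sample(rng: random.Random) -> dict:
    """Draw one hyperparameter assignment uniformly from the search space."""
    low, high = SEARCH_SPACE["lr"]
    return {
        "lr": rng.uniform(low, high),
        "batch_size": rng.choice(SEARCH_SPACE["batch_size"]),
    }


def run_trial(params: dict) -> str:
    """Stand-in for a training job: returns what the trial would print to stdout."""
    # Synthetic objective that peaks near lr=0.01 and batch_size=64.
    score = 1.0 - abs(params["lr"] - 0.01) * 5 - abs(params["batch_size"] - 64) / 500
    return f"epoch done\naccuracy={max(score, 0.0):.4f}\n"


def parse_metric(stdout: str) -> float:
    """Extract the reported objective value from trial output."""
    match = METRIC_LINE.search(stdout)
    if match is None:
        raise ValueError("trial did not report accuracy")
    return float(match.group(1))


def random_search(max_trials: int, seed: int = 0) -> tuple[dict, float, list[float]]:
    """Run max_trials random trials and return the best one plus all scores."""
    rng = random.Random(seed)
    results = []
    for _ in range(max_trials):
        params = sample(rng)
        results.append((params, parse_metric(run_trial(params))))
    best_params, best_score = max(results, key=lambda r: r[1])
    return best_params, best_score, [score for _, score in results]


best_params, best_score, scores = random_search(max_trials=20)
print(best_params, best_score)
```

Swapping `sample` for a smarter suggestion routine is the only change needed to move from random search to a model-based method, which mirrors how Katib separates the search algorithm from trial execution.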

Notebooks and Profiles

Kubeflow Notebooks provides multi-user JupyterLab, VS Code Server, and RStudio on Kubernetes with per-namespace resource quotas. Profiles are Kubeflow's tenancy abstraction — each user (or team) gets a Profile that maps to a namespace, NetworkPolicies, and resource quotas. Combined with the Central Dashboard, this gives a turnkey research environment for shared GPU clusters.
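
A Profile is itself a small custom resource. The sketch below builds one as a Python dictionary using only the standard library. The field layout (`spec.owner` and `spec.resourceQuotaSpec`) is assumed to match the Profile CRD of the installed Kubeflow version and should be checked against it; the profile name, owner email, and quota values are placeholders, and the `profile` helper is hypothetical.

```python
import json


def profile(name: str, owner_email: str, gpu_limit: int, cpu: str, memory: str) -> dict:
    """Build a Kubeflow Profile manifest; the Profile name is used as the namespace name."""
    return {
        "apiVersion": "kubeflow.org/v1",
        "kind": "Profile",
        "metadata": {"name": name},
        "spec": {
            "owner": {"kind": "User", "name": owner_email},
            # Applied as a ResourceQuota in the Profile's namespace.
            "resourceQuotaSpec": {
                "hard": {
                    "cpu": cpu,
                    "memory": memory,
                    "requests.nvidia.com/gpu": str(gpu_limit),
                }
            },
        },
    }


team = profile("team-nlp", "lead@example.com", gpu_limit=8, cpu="64", memory="512Gi")
print(json.dumps(team, indent=2))
```

Generating Profiles from a team roster keeps namespace quotas in version control rather than in ad hoc dashboard edits.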

Adoption Patterns

Few teams adopt the full Kubeflow distribution. Common subsets in 2026:

Training Operator only — use the CRDs for PyTorch/MPI/TF distributed training, scheduled by Volcano or Kueue.
Pipelines + Training Operator — declarative ML workflows with distributed training steps.
Notebooks + Profiles — multi-tenant research environment, often standalone.
Full distribution — typically only via vendor distributions (Charmed Kubeflow, DKP, Red Hat OpenShift AI).

Don't install "Kubeflow" — install the specific sub-projects you need. Each ships its own Helm chart or kustomize overlay and has lower operational cost than the full distribution.
