KubeRay

TL;DR

KubeRay is the open-source Kubernetes operator for Ray — the distributed Python framework behind Ray Train, Ray Tune, Ray Data, and Ray Serve.
Provides three CRDs: RayCluster (long-running cluster), RayJob (batch run that creates an ephemeral cluster), RayService (long-running serving cluster with zero-downtime upgrades).
Developed in the open under the Ray project (`ray-project/kuberay`), with Anyscale among its maintainers. Ray itself is used in production by companies such as OpenAI, Pinterest, Uber, and Shopify, and KubeRay is a common way to run multi-node PyTorch jobs on Kubernetes.
Pairs naturally with Volcano (gang scheduling), Kueue (queueing), KServe (serving), and Karpenter (node autoscaling) for a complete distributed-training-on-Kubernetes stack.

What Ray Is

Ray is a Python framework for distributed computing that originated at UC Berkeley's RISELab, the successor to the AMPLab that produced Spark, Mesos, and Alluxio. Ray's core abstraction is the task-and-actor model: any Python function can be turned into a remote task with `@ray.remote`, and any class into a stateful actor. Underneath, Ray runs a head node (scheduler, Global Control Service (GCS), dashboard) and any number of worker nodes that execute tasks.

On top of the core, Ray ships libraries: Ray Train for distributed PyTorch/TF/JAX training, Ray Tune for hyperparameter search, Ray Data for streaming dataset transformations, RLlib for reinforcement learning, and Ray Serve for online serving. KubeRay is how all of this runs on Kubernetes.

CRDs and Lifecycle

RayCluster: a declarative spec of a head pod plus one or more worker group templates. Scales by editing replica counts; KubeRay reconciles pod creation, head-worker join, and rolling restarts.
RayJob: creates a RayCluster, submits a Python entrypoint to it, waits for completion, and records the job status. The cluster is deleted afterwards only when `shutdownAfterJobFinishes` is set to true. This is the Kubernetes-idiomatic way to run a batch training job.
RayService: a long-running cluster behind a service endpoint, with zero-downtime upgrades (KubeRay stands up a new cluster, switches traffic to it, and drains the old one).
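The declarative model can be sketched by building a RayCluster manifest as a plain Python dictionary and scaling it by changing one replica count. This is an illustration, not KubeRay tooling: the helpers `ray_cluster_manifest` and `scale_worker_group`, the cluster name, and the image tag are assumptions, while the field names follow the `ray.io/v1` RayCluster schema.

```python
import copy
import json


def ray_cluster_manifest(name, image, worker_replicas):
    """Build a minimal RayCluster manifest (hypothetical helper, for illustration)."""

    def pod_template(container_name):
        return {"spec": {"containers": [{"name": container_name, "image": image}]}}

    return {
        "apiVersion": "ray.io/v1",
        "kind": "RayCluster",
        "metadata": {"name": name},
        "spec": {
            "headGroupSpec": {
                "rayStartParams": {},
                "template": pod_template("ray-head"),
            },
            "workerGroupSpecs": [
                {
                    "groupName": "workers",
                    "replicas": worker_replicas,
                    "rayStartParams": {},
                    "template": pod_template("ray-worker"),
                }
            ],
        },
    }


def scale_worker_group(manifest, group_name, replicas):
    """Return a copy of the manifest with one worker group's replica count changed."""
    scaled = copy.deepcopy(manifest)
    for group in scaled["spec"]["workerGroupSpecs"]:
        if group["groupName"] == group_name:
            group["replicas"] = replicas
            return scaled
    raise KeyError(f"no worker group named {group_name!r}")


cluster = ray_cluster_manifest("demo-cluster", "rayproject/ray:2.34.0", worker_replicas=2)
scaled = scale_worker_group(cluster, "workers", replicas=4)

# kubectl accepts JSON manifests, so this output could be applied as-is.
print(json.dumps(scaled, indent=2))
```

Applying the scaled manifest changes only the desired state; the operator is what creates the two extra worker pods and joins them to the head.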

RayJob Example

```yaml
apiVersion: ray.io/v1
kind: RayJob
metadata:
  name: llama-finetune
spec:
  # train.py must be available in the image or supplied through a runtime environment.
  entrypoint: python train.py --model llama-3-8b
  shutdownAfterJobFinishes: true  # delete the cluster once the job finishes
  rayClusterSpec:
    headGroupSpec:
      rayStartParams: {}  # some KubeRay releases require this field; an empty map keeps defaults
      template:
        spec:
          containers:
            - name: ray-head
              image: rayproject/ray:2.34.0-py311-gpu
              resources:
                limits: { cpu: 8, memory: 32Gi, nvidia.com/gpu: 1 }
    workerGroupSpecs:
      - replicas: 4
        groupName: gpu-workers
        rayStartParams: {}
        template:
          spec:
            containers:
              - name: ray-worker
                image: rayproject/ray:2.34.0-py311-gpu
                resources:
                  limits: { cpu: 16, memory: 128Gi, nvidia.com/gpu: 4 }
```

Distributed Training with Ray Train

Ray Train wraps PyTorch's DistributedDataParallel and FSDP, TensorFlow MultiWorkerMirroredStrategy, and HuggingFace Accelerate into a single launcher. KubeRay places the worker pods; Ray Train handles rendezvous, NCCL initialisation, and checkpoint coordination. The advantage over running torchrun directly is that Ray Train integrates with Ray Tune (HPO sweeps) and Ray Data (streaming preprocessing), all sharing the same cluster.
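Ray Train is told how many training workers to start, commonly one per GPU. As a rough sketch of how that number relates to the pods KubeRay places, the snippet below sums GPUs across worker groups shaped like the `workerGroupSpecs` in the RayJob example. The helper `gpu_worker_count` is hypothetical; it assumes one training worker per GPU and ignores any GPU on the head pod.

```python
def gpu_worker_count(worker_groups):
    """Sum GPUs over worker groups: replicas times GPUs requested per pod."""
    total = 0
    for group in worker_groups:
        gpus_per_pod = 0
        for container in group["template"]["spec"]["containers"]:
            limits = container.get("resources", {}).get("limits", {})
            gpus_per_pod += int(limits.get("nvidia.com/gpu", 0))
        total += group["replicas"] * gpus_per_pod
    return total


# Mirrors the worker group from the RayJob example: 4 pods with 4 GPUs each.
worker_groups = [
    {
        "groupName": "gpu-workers",
        "replicas": 4,
        "template": {
            "spec": {
                "containers": [
                    {"name": "ray-worker", "resources": {"limits": {"nvidia.com/gpu": 4}}}
                ]
            }
        },
    }
]

num_workers = gpu_worker_count(worker_groups)
print(num_workers)  # 16
```

Inside `train.py`, a value like this would typically be passed to Ray Train as `ScalingConfig(num_workers=16, use_gpu=True)`.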

For tensor-parallel training on multi-node H100s, Ray Train works but is rarely the fastest path — NeMo, Megatron-LM, or torchtitan launched via MPIJob are usually preferred. Ray Train wins when training is one part of a broader workload that also includes tuning, RL, or streaming data pipelines.

Ray Serve and KServe

Ray Serve provides Python-native online serving with composition primitives (deployment graphs) that are useful for multi-model pipelines and agentic workloads. RayService runs Ray Serve on Kubernetes via KubeRay. For LLM-only serving with the OpenAI-compatible API, vLLM behind KServe is usually a better fit; for compositional pipelines (RAG with retrieval + rerank + LLM + post-processing in one graph) Ray Serve is the cleaner abstraction.

Integration with the Wider Stack

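Volcano: RayJob can request gang scheduling so all Ray workers start together.
Kueue: RayJob integrates as a Workload, gated by quota and cohort borrowing; the job is assigned to a queue through the `kueue.x-k8s.io/queue-name` label.
Karpenter: pending Ray worker pods trigger node provisioning; set the `karpenter.sh/do-not-disrupt: "true"` annotation on the pods of long training runs so that consolidation does not evict them.
NVIDIA GPU Operator: installs the drivers, container toolkit, and device plugin that expose `nvidia.com/gpu`; it (or an equivalent manual setup) is a prerequisite for GPU pods.
KServe: KServe's Python model server can optionally use Ray Serve to run model replicas as separate workers.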
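Two of these integrations come down to metadata on the RayJob, which the sketch below adds to a manifest held as a Python dictionary. The helper `with_stack_integrations` and the queue name `team-a-queue` are hypothetical; the sketch assumes Kueue's `kueue.x-k8s.io/queue-name` label on the RayJob and Karpenter's `karpenter.sh/do-not-disrupt` annotation on each pod template, and it leaves Volcano out because its settings depend on how the operator is configured.

```python
import copy


def with_stack_integrations(ray_job, kueue_queue=None, do_not_disrupt=False):
    """Return a copy of a RayJob manifest with queueing and disruption metadata added."""
    job = copy.deepcopy(ray_job)

    if kueue_queue is not None:
        # Kueue reads the target LocalQueue from a label on the job object.
        labels = job.setdefault("metadata", {}).setdefault("labels", {})
        labels["kueue.x-k8s.io/queue-name"] = kueue_queue

    if do_not_disrupt:
        # Karpenter checks the annotation on pods, so it goes on every pod template.
        cluster_spec = job["spec"]["rayClusterSpec"]
        groups = [cluster_spec["headGroupSpec"], *cluster_spec.get("workerGroupSpecs", [])]
        for group in groups:
            pod_meta = group["template"].setdefault("metadata", {})
            pod_meta.setdefault("annotations", {})["karpenter.sh/do-not-disrupt"] = "true"

    return job


ray_job = {
    "apiVersion": "ray.io/v1",
    "kind": "RayJob",
    "metadata": {"name": "llama-finetune"},
    "spec": {
        "rayClusterSpec": {
            "headGroupSpec": {"template": {"spec": {"containers": []}}},
            "workerGroupSpecs": [
                {"groupName": "gpu-workers", "template": {"spec": {"containers": []}}}
            ],
        }
    },
}

patched = with_stack_integrations(ray_job, kueue_queue="team-a-queue", do_not_disrupt=True)
print(patched["metadata"]["labels"])
```

Returning a copy keeps the base manifest reusable, so the same job definition can be submitted to different queues.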

