<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="4.1.1">Jekyll</generator><link href="https://blog.kubeflow.org/feed.xml" rel="self" type="application/atom+xml" /><link href="https://blog.kubeflow.org/" rel="alternate" type="text/html" /><updated>2026-04-12T01:08:42+00:00</updated><id>https://blog.kubeflow.org/feed.xml</id><title type="html">Kubeflow</title><subtitle>The Machine Learning Toolkit for Kubernetes.</subtitle><author><name>{&quot;name&quot;=&gt;&quot;&quot;, &quot;email&quot;=&gt;&quot;&quot;}</name><email></email></author><entry><title type="html">Kubeflow AI Reference Platform 26.03 Release Announcement</title><link href="https://blog.kubeflow.org/kubeflow-26.03-release/" rel="alternate" type="text/html" title="Kubeflow AI Reference Platform 26.03 Release Announcement" /><published>2026-04-11T00:00:00+00:00</published><updated>2026-04-11T00:00:00+00:00</updated><id>https://blog.kubeflow.org/kubeflow-26.03-release</id><content type="html" xml:base="https://blog.kubeflow.org/kubeflow-26.03-release/"><![CDATA[<p>Kubeflow AI Reference Platform 26.03 delivers key improvements in scalability, security, and operational efficiency. It reduces per-namespace overhead, enhances multi-tenant configurations, and increases reliability for large-scale Kubernetes deployments.</p>

<p>This release adopts a calendar-based versioning model (Year.Month.Patch), with two primary releases annually and optional patches. Community support is best-effort for approximately six months, with additional commercial support options available. Regular upgrades are recommended to take advantage of continuous security and performance enhancements.</p>

<h2 id="highlight-features">Highlight features</h2>

<ul>
  <li>Kubernetes 1.34+</li>
  <li>Kubeflow Pipelines 2.16.0, Spark Operator 2.5.0, Model Registry v0.3.5, KServe Web Application v0.16.1</li>
  <li>Compatibility of Kubeflow Pipelines v1 and v2 with PSS restricted</li>
  <li>Extended KServe tests with authentication and authorization from inside and outside the cluster, as well as non-Knative / raw deployments</li>
  <li>Simplified installation, including automatic installation of the correct Kustomize and kubectl versions</li>
  <li>Installation steps tested in and based on our CI, plus easier in-place updates (optimized PDBs)</li>
  <li>Cleanup of all synchronization steps for faster releases and dependency updates</li>
  <li>Knative 1.20, cert-manager 1.19.4, OAuth2-proxy v7.14.3, Dex 2.45.0</li>
  <li>Fixed NetworkPolicies for cert-manager, knative-serving, istio-system, dex, and oauth2-proxy</li>
</ul>

<h2 id="kubeflow-platform-manifests--security">Kubeflow Platform (Manifests &amp; Security)</h2>

<p>The Kubeflow Platform Working Group focuses on simplifying Kubeflow installation, operations, and security. See details below.</p>

<h3 id="manifests">Manifests:</h3>

<ul>
  <li><a href="https://github.com/kubeflow/manifests/blob/master/README.md">Documentation updates</a> that make it easier to install,
extend and upgrade Kubeflow</li>
  <li>For more details and future plans, please check the <a href="link">26.06</a> roadmap.</li>
</ul>

<table>
  <thead>
    <tr>
      <th style="text-align: center">Notebooks</th>
      <th style="text-align: center">Dashboard</th>
      <th style="text-align: center">Pipelines</th>
      <th style="text-align: center">Katib</th>
      <th style="text-align: center">Trainer</th>
      <th style="text-align: center">KServe</th>
      <th style="text-align: center">Model Registry</th>
      <th style="text-align: center">Spark</th>
      <th style="text-align: center">SDK</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td style="text-align: center"><a href="">1.10.0</a></td>
      <td style="text-align: center"><a href="https://github.com/kubeflow/kubeflow/releases/tag/v1.10.0">1.10.0</a></td>
      <td style="text-align: center"><a href="https://github.com/kubeflow/pipelines/releases/tag/2.16.0">2.16.0</a></td>
      <td style="text-align: center"><a href="https://github.com/kubeflow/katib/releases/tag/v0.19.0">0.19</a></td>
      <td style="text-align: center"><a href="https://github.com/kubeflow/trainer/releases/tag/v2.1.0">2.1.0</a></td>
      <td style="text-align: center"><a href="https://github.com/kserve/kserve/releases/tag/v0.17.0">0.17.0</a></td>
      <td style="text-align: center"><a href="https://github.com/kubeflow/model-registry/releases/tag/v0.3.7">0.3.7</a></td>
      <td style="text-align: center"><a href="https://github.com/kubeflow/spark-operator/releases/tag/v2.5.0">2.5.0</a></td>
      <td><a href="https://github.com/kubeflow/sdk/releases/tag/0.4.0"> 0.4.0 </a></td>
    </tr>
  </tbody>
</table>

<table>
  <thead>
    <tr>
      <th style="text-align: center">Kubernetes</th>
      <th style="text-align: center">Kind</th>
      <th style="text-align: center">Kustomize</th>
      <th style="text-align: center">Cert Manager</th>
      <th style="text-align: center">Knative</th>
      <th style="text-align: center">Istio</th>
      <th style="text-align: center">Dex</th>
      <th style="text-align: center">OAuth2-proxy</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td style="text-align: center">[1.35+] (https://github.com/kubernetes/kubernetes/releases/tag/v1.35.3)</td>
      <td style="text-align: center"><a href="https://github.com/kubernetes-sigs/kind/releases/tag/v0.31.0">0.30.1</a></td>
      <td style="text-align: center"><a href="https://github.com/kubernetes-sigs/kustomize/releases/tag/kustomize%2Fv5.8.1">5.8.1</a></td>
      <td style="text-align: center"><a href="https://github.com/cert-manager/cert-manager/releases/tag/v1.20.2">1.20.2</a></td>
      <td style="text-align: center"><a href="https://knative.dev/blog/releases/announcing-knative-v1-21-release/">1.21.0</a></td>
      <td style="text-align: center"><a href="https://github.com/istio/istio/releases/tag/1.29.1">1.29.1</a></td>
      <td style="text-align: center"><a href="https://github.com/dexidp/dex/releases/tag/v2.45.1">2.45.1</a></td>
      <td style="text-align: center"><a href="https://github.com/oauth2-proxy/oauth2-proxy/releases/tag/v7.15.1">7.15.1</a></td>
    </tr>
  </tbody>
</table>

<h3 id="security">Security:</h3>

<h2 id="pipelines">Pipelines</h2>

<h2 id="model-registry">Model Registry</h2>

<h2 id="training-operator-trainer--katib">Training Operator (Trainer) &amp; Katib</h2>

<h2 id="spark-operator">Spark Operator</h2>

<h2 id="kserve">KServe</h2>

<h2 id="kubeflow-sdk">Kubeflow SDK</h2>

<h2 id="dashboard-and-notebooks">Dashboard and Notebooks</h2>

<h2 id="how-to-get-started-with-2603">How to get started with 26.03</h2>

<p>Visit the Kubeflow AI Reference Platform 26.03 <a href="https://github.com/kubeflow/manifests/releases">release page</a> or head over to the Getting Started and Support pages.</p>

<h2 id="join-the-community">Join the Community</h2>

<p>We would like to thank everyone who contributed to Kubeflow 26.03, and especially Tarek Abouzeid for his work as the v26.03 Release Manager. We also extend our thanks to the entire release team and the working group leads, who continuously and generously dedicate their time and expertise to Kubeflow.</p>

<p>Release team members: Tarek Abouzeid, Anya Kramar, Andy Stoneberg, Humair Khan, Matteo Mortari, Adysen Rothman, Jon Burdo, Milos Grubjesic, Vraj Bhatt, Dhanisha Phadate, Alok Dangre</p>

<p>Working Group leads: Andrey Velichkevich, Julius von Kohout, Mathew Wicks, Matteo Mortari</p>

<p>Kubeflow Steering Committee: Andrey Velichkevich, Julius von Kohout, Yuan Tang, Johnu George, Francisco Javier Araceo</p>

<p>You can find more details about Kubeflow distributions
<a href="https://www.kubeflow.org/docs/started/installing-kubeflow/#packaged-distributions">here</a>.</p>

<h2 id="want-to-help">Want to help?</h2>

<p>The Kubeflow community Working Groups hold open meetings and are always looking for more volunteers and users to unlock
the potential of machine learning. If you’re interested in becoming a Kubeflow contributor, please feel free to check
out the resources below. We look forward to working with you!</p>

<ul>
  <li>Visit our <a href="https://www.kubeflow.org/docs/about/community/">Kubeflow website</a> or Kubeflow GitHub Page.</li>
  <li>Join the <a href="https://www.kubeflow.org/docs/about/community/">Kubeflow Slack channel</a>.</li>
  <li>Join the <a href="https://groups.google.com/g/kubeflow-discuss">kubeflow-discuss</a> mailing list.</li>
  <li>Attend our weekly <a href="https://www.kubeflow.org/docs/about/community/#kubeflow-community-call">community meeting</a>.</li>
</ul>]]></content><author><name>Kubeflow 26.03 Release Team</name></author><category term="release" /><summary type="html"><![CDATA[Kubeflow AI Reference Platform 26.03 delivers key improvements in scalability, security, and operational efficiency. It reduces per-namespace overhead, enhances multi-tenant configurations, and increases reliability for large-scale Kubernetes deployments.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://blog.kubeflow.org/images/logo.png" /><media:content medium="image" url="https://blog.kubeflow.org/images/logo.png" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">Modernizing Kubeflow Pipelines UI</title><link href="https://blog.kubeflow.org/modernizing-kubeflow-pipelines-ui/" rel="alternate" type="text/html" title="Modernizing Kubeflow Pipelines UI" /><published>2026-03-31T00:00:00+00:00</published><updated>2026-03-31T00:00:00+00:00</updated><id>https://blog.kubeflow.org/modernizing-kubeflow-pipelines-ui</id><content type="html" xml:base="https://blog.kubeflow.org/modernizing-kubeflow-pipelines-ui/"><![CDATA[<p>The Kubeflow Pipelines web interface has been upgraded from React 16 to React 19 — a modernization effort that touches every layer of the frontend stack. Whether you use the UI to manage pipelines day-to-day or contribute to the codebase, here is what this means for you.</p>

<h2 id="whats-changing-for-users">What’s changing for users</h2>

<p>You do not need to do anything differently. Your bookmarks, workflows, and browser all work exactly as before. But under the hood, the UI is now built on a modern foundation that delivers tangible improvements:</p>

<h3 id="a-faster-more-responsive-interface">A faster, more responsive interface</h3>

<p>React 18 introduced automatic batching, which reduces unnecessary re-renders across the UI. In practice, this means pages like Run Details, Experiment Details, and the pipeline creation flow respond faster to your interactions. Forms validate without flicker, and multi-step workflows feel snappier. The production bundle size stayed exactly the same — 0% increase — so page load times are unchanged.</p>

<h3 id="smoother-pipeline-graph-navigation">Smoother pipeline graph navigation</h3>

<p>The pipeline DAG visualization (the graph you see when inspecting a pipeline’s structure) has been migrated from the deprecated react-flow-renderer to @xyflow/react. This brings improved pan, zoom, and drag performance, especially on larger or more complex pipeline graphs. If you’ve ever experienced sluggishness when navigating a deeply nested pipeline, this upgrade directly addresses that.</p>

<h3 id="improved-charts-and-metrics-display">Improved charts and metrics display</h3>

<p>Run metrics and comparison charts now use Recharts instead of the deprecated react-vis library. The new charting library renders more efficiently, handles edge cases better, and provides cleaner visual output when comparing run results side by side.</p>

<h3 id="better-accessibility">Better accessibility</h3>

<p>The component library migration from Material-UI v3 to MUI v5 brings improved keyboard navigation, better ARIA attribute coverage, and more consistent focus management across dialogs, tables, and form elements. These improvements make the UI more usable with screen readers and keyboard-only workflows.</p>

<h3 id="no-breaking-changes">No breaking changes</h3>

<p>Every user-facing feature works the same way it did before. The API contracts are unchanged. If you use the KFP Python SDK or REST API to interact with the platform, nothing changes on your end. This upgrade was purely a frontend modernization — zero impact on backend behavior, pipeline execution, or artifact storage.</p>

<h2 id="why-we-made-this-change">Why we made this change</h2>

<p>The KFP frontend had been running on React 16 (released in 2017) with Material-UI v3, create-react-app, and Jest/Enzyme for testing. This created compounding issues:</p>

<ul>
  <li><strong>Security exposure.</strong> React 16 and 17 no longer receive security patches, and dozens of transitive dependencies were locked to outdated versions because of React peer constraints.</li>
  <li><strong>Stalled ecosystem.</strong> Modern libraries — including improved data-fetching, visualization, and accessibility tools — dropped support for React 16/17. Staying behind meant the UI could not benefit from upstream improvements.</li>
  <li><strong>Contributor friction.</strong> The legacy CRA + Jest + Enzyme toolchain was slow to build, brittle to test, and increasingly difficult for new contributors to set up. Modernizing the stack lowers the barrier to contribution.</li>
</ul>

<h2 id="how-we-got-here">How we got here</h2>

<p>Rather than attempting a single risky version jump, we followed a deps-first, bump-last strategy: upgrade every dependency to be forward-compatible before touching React itself. A custom React peer compatibility gate in CI prevented regressions at every step. The work was executed across <strong>20+</strong> pull requests in strict dependency order.</p>

<h3 id="react-16--17-rebuilding-the-foundation">React 16 → 17: Rebuilding the foundation</h3>

<p>Before React could move forward, the entire build and test toolchain had to be replaced. create-react-app was swapped for Vite, Jest + Enzyme gave way to Vitest + Testing Library, and Material-UI was upgraded from v3 to v4 to unblock the React 17 peer range. The deprecated react-vis charting library was replaced with Recharts. With those blockers cleared, the React 17 bump itself was a small, low-risk change.</p>

<h3 id="react-17--18-the-biggest-leap">React 17 → 18: The biggest leap</h3>

<p>This phase required the most dependency work. Storybook jumped from v6 straight to v10 on the Vite builder. Material-UI v4 was migrated to MUI v5 with Emotion. react-query moved to @tanstack/react-query v4. react-flow-renderer was replaced with @xyflow/react. After all ecosystem deps cleared the peer gate, the React 18 core bump landed — followed by careful stabilization of automatic batching behavior in class components that were reading stale state.</p>

<h3 id="react-18--19-the-final-stretch">React 18 → 19: The final stretch</h3>

<p>A deprecation audit at React 18.3 found zero React-specific warnings. A final dependency sweep cleared the last peer blockers (react-ace, transitive react-redux). The React 19 bump resolved the final allowlist entry and handled a small set of API changes like the removal of forwardRef in test mocks.</p>

<h2 id="the-full-stack-transformation">The full stack transformation</h2>

<p>Over the course of this effort, virtually every layer of the frontend stack was modernized:</p>

<table>
  <thead>
    <tr>
      <th>Layer</th>
      <th>Before</th>
      <th>After</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>React</td>
      <td>16</td>
      <td>19</td>
    </tr>
    <tr>
      <td>Build system</td>
      <td>Create React App + Craco</td>
      <td>Vite</td>
    </tr>
    <tr>
      <td>Test framework</td>
      <td>Jest + Enzyme</td>
      <td>Vitest + Testing Library</td>
    </tr>
    <tr>
      <td>UI component library</td>
      <td>Material-UI v3</td>
      <td>MUI v5 + Emotion</td>
    </tr>
    <tr>
      <td>Data fetching</td>
      <td>react-query v3</td>
      <td>@tanstack/react-query v4</td>
    </tr>
    <tr>
      <td>Pipeline graph</td>
      <td>react-flow-renderer v9</td>
      <td>@xyflow/react</td>
    </tr>
    <tr>
      <td>Charts</td>
      <td>react-vis</td>
      <td>Recharts</td>
    </tr>
    <tr>
      <td>Storybook</td>
      <td>6 (Webpack)</td>
      <td>10 (Vite)</td>
    </tr>
  </tbody>
</table>

<h2 id="by-the-numbers">By the numbers</h2>

<ul>
  <li><strong>20+ PRs</strong> merged across the entire React 16-to-19 effort</li>
  <li><strong>15 tracked milestones</strong> executed in strict dependency order</li>
  <li><strong>0% bundle size increase</strong> — page load times unchanged</li>
  <li><strong>0 React deprecation warnings</strong> at the 18.3 checkpoint audit</li>
  <li><strong>0 breaking changes</strong> to user-facing features or APIs</li>
</ul>

<h2 id="want-to-contribute">Want to contribute?</h2>

<p>The full execution plan with every PR, issue, and dependency graph is tracked in the <a href="https://github.com/kubeflow/pipelines/blob/master/frontend/docs/react-18-19-upgrade-checklist.md">react-18-19-upgrade-checklist.md</a>. Pick up miscellaneous bugs, report new ones, help with reviews, and help improve our documentation.</p>

<p>Huge thanks to <a href="https://github.com/jeffspahr">@jeffspahr</a>, <a href="https://github.com/kanishka-commits">@kanishka-commits</a>, <a href="https://github.com/PR3MM">@PR3MM</a>, <a href="https://github.com/jsonmp-k8">@jsonmp-k8</a>, <a href="https://github.com/dpanshug">@dpanshug</a>, and <a href="https://github.com/rishi-jat">@rishi-jat</a> for contributing to this effort and reviewing all the contributions leading up to this milestone!</p>]]></content><author><name>Manaswini Das</name></author><category term="pipelines" /><summary type="html"><![CDATA[The Kubeflow Pipelines web interface has been upgraded from React 16 to React 19 — a modernization effort that touches every layer of the frontend stack. Whether you use the UI to manage pipelines day-to-day or contribute to the codebase, here is what this means for you.]]></summary></entry><entry><title type="html">Kubeflow Trainer v2.2: JAX &amp;amp; XGBoost Runtimes, Flux for HPC Support, and TrainJob progress and metrics observability</title><link href="https://blog.kubeflow.org/kubeflow-trainer-v2.2-release/" rel="alternate" type="text/html" title="Kubeflow Trainer v2.2: JAX &amp;amp; XGBoost Runtimes, Flux for HPC Support, and TrainJob progress and metrics observability" /><published>2026-03-20T00:00:00+00:00</published><updated>2026-03-20T00:00:00+00:00</updated><id>https://blog.kubeflow.org/introducing-kubeflow-trainer-v2.2</id><content type="html" xml:base="https://blog.kubeflow.org/kubeflow-trainer-v2.2-release/"><![CDATA[<p>Just a little over one week ahead of KubeCon + CloudNativeCon EU 2026, the Kubeflow team is excited to ship Trainer v2.2. The v2.2 release reinforces our commitment to expanding the Kubeflow Trainer ecosystem – meeting developers where they are by adding native support for JAX, XGBoost, and Flux, while also delivering deeper observability into training jobs.</p>

<p>Key highlights of the v2.2 release include:</p>

<ul>
  <li><strong>First-class support for Training Runtimes</strong> for <a href="https://www.kubeflow.org/docs/components/trainer/user-guides/jax/">JAX</a> and <a href="https://www.kubeflow.org/docs/components/trainer/user-guides/xgboost/">XGBoost</a>, enabling native distributed training on Kubernetes. This marks a major milestone for the Trainer project, achieving full compatibility with Training Operator v1 CRDs: PyTorchJob, MPIJob, JAXJob, and XGBoostJob – now unified under a single TrainJob abstraction.</li>
  <li><a href="https://github.com/kubeflow/trainer/tree/master/docs/proposals/2779-trainjob-progress"><strong>Enhanced training observability</strong></a>, allowing progress and metrics to be propagated directly from training scripts to the TrainJob status. <a href="https://github.com/huggingface/transformers/pull/44487">Hugging Face Transformers</a> already integrate with the <em>KubeflowTrainerCallback</em> to automate this capability.</li>
  <li><a href="https://www.kubeflow.org/docs/components/trainer/user-guides/flux/"><strong>Flux runtime support</strong></a>, bringing HPC workloads to Kubernetes and improving MPI bootstrapping within TrainJob.</li>
  <li><a href="https://github.com/kubeflow/trainer/tree/master/docs/proposals/2899-resource-timeouts"><strong>TrainJob activeDeadlineSeconds API</strong></a>, enabling explicit timeout policies for training jobs.</li>
  <li><a href="https://www.kubeflow.org/docs/components/trainer/operator-guides/runtime-patches/"><strong>RuntimePatches API</strong></a>, introducing a more flexible and scalable way to customize runtime configurations from the TrainJobs.</li>
</ul>

<p>You can now install the Kubeflow Trainer control plane and its training runtimes with a single command:</p>

<div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code>helm <span class="nb">install </span>kubeflow-trainer oci://ghcr.io/kubeflow/charts/kubeflow-trainer <span class="se">\</span>
    <span class="nt">--namespace</span> kubeflow-system <span class="se">\</span>
    <span class="nt">--create-namespace</span> <span class="se">\</span>
    <span class="nt">--version</span> 2.2.0 <span class="se">\</span>
    <span class="nt">--set</span> runtimes.defaultEnabled<span class="o">=</span><span class="nb">true</span>
</code></pre></div></div>

<h2 id="bringing-jax-to-kubernetes-with-trainer">Bringing JAX to Kubernetes with Trainer</h2>

<p>Kubeflow Trainer supports running JAX workloads on Kubernetes through the <code class="language-plaintext highlighter-rouge">jax-distributed</code> runtime. It is designed for distributed and parallel JAX computation using jax.distributed and SPMD primitives like pmap, pjit, and shard_map. The runtime maps one Kubernetes Pod to one JAX process and injects the required distributed environment variables so training or fine-tuning can run consistently across multiple nodes and devices.</p>

<ul>
  <li>Multi-process CPU training</li>
  <li>Multi-GPU training using CUDA enabled JAX</li>
  <li>Data-parallel and model-parallel JAX workloads</li>
  <li>Massive-scale <a href="https://github.com/kubeflow/website/pull/4343">TPU distributed training</a> with ComputeClasses</li>
</ul>

<p>Start by following the Getting Started guide for Kubeflow Trainer basics, and make sure you have the Kubeflow SDK installed on your machine:</p>

<div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code>pip <span class="nb">install </span>kubeflow 
</code></pre></div></div>

<p>Use the <code class="language-plaintext highlighter-rouge">jax-distributed</code> runtime and initialize JAX distributed explicitly in your training script before any JAX computation:</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
<span class="kn">from</span> <span class="nn">kubeflow.trainer</span> <span class="kn">import</span> <span class="n">TrainerClient</span><span class="p">,</span> <span class="n">CustomTrainer</span>

<span class="k">def</span> <span class="nf">get_jax_dist</span><span class="p">():</span>
    <span class="kn">import</span> <span class="nn">os</span>
    <span class="kn">import</span> <span class="nn">jax</span>
    <span class="kn">import</span> <span class="nn">jax.distributed</span> <span class="k">as</span> <span class="n">dist</span>

    <span class="n">dist</span><span class="p">.</span><span class="n">initialize</span><span class="p">(</span>
        <span class="n">coordinator_address</span><span class="o">=</span><span class="n">os</span><span class="p">.</span><span class="n">environ</span><span class="p">[</span><span class="s">"JAX_COORDINATOR_ADDRESS"</span><span class="p">],</span>
        <span class="n">num_processes</span><span class="o">=</span><span class="nb">int</span><span class="p">(</span><span class="n">os</span><span class="p">.</span><span class="n">environ</span><span class="p">[</span><span class="s">"JAX_NUM_PROCESSES"</span><span class="p">]),</span>
        <span class="n">process_id</span><span class="o">=</span><span class="nb">int</span><span class="p">(</span><span class="n">os</span><span class="p">.</span><span class="n">environ</span><span class="p">[</span><span class="s">"JAX_PROCESS_ID"</span><span class="p">]),</span>
    <span class="p">)</span>

    <span class="k">print</span><span class="p">(</span><span class="s">"JAX Distributed Environment"</span><span class="p">)</span>
    <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"Local devices: </span><span class="si">{</span><span class="n">jax</span><span class="p">.</span><span class="n">local_devices</span><span class="p">()</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
    <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"Global device count: </span><span class="si">{</span><span class="n">jax</span><span class="p">.</span><span class="n">device_count</span><span class="p">()</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>

    <span class="kn">import</span> <span class="nn">jax.numpy</span> <span class="k">as</span> <span class="n">jnp</span>
    <span class="n">x</span> <span class="o">=</span> <span class="n">jnp</span><span class="p">.</span><span class="n">ones</span><span class="p">((</span><span class="mi">4</span><span class="p">,))</span>
    <span class="n">y</span> <span class="o">=</span> <span class="n">jax</span><span class="p">.</span><span class="n">pmap</span><span class="p">(</span><span class="k">lambda</span> <span class="n">v</span><span class="p">:</span> <span class="n">v</span> <span class="o">*</span> <span class="n">jax</span><span class="p">.</span><span class="n">process_index</span><span class="p">())(</span><span class="n">x</span><span class="p">)</span>
    <span class="k">print</span><span class="p">(</span><span class="s">"PMAP result:"</span><span class="p">,</span> <span class="n">y</span><span class="p">)</span>

<span class="n">client</span> <span class="o">=</span> <span class="n">TrainerClient</span><span class="p">()</span>
<span class="n">job_id</span> <span class="o">=</span> <span class="n">client</span><span class="p">.</span><span class="n">train</span><span class="p">(</span>
    <span class="n">runtime</span><span class="o">=</span><span class="s">"jax-distributed"</span><span class="p">,</span>
    <span class="n">trainer</span><span class="o">=</span><span class="n">CustomTrainer</span><span class="p">(</span><span class="n">func</span><span class="o">=</span><span class="n">get_jax_dist</span><span class="p">),</span>
<span class="p">)</span>
<span class="n">client</span><span class="p">.</span><span class="n">wait_for_job_status</span><span class="p">(</span><span class="n">job_id</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">"</span><span class="se">\n</span><span class="s">"</span><span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="n">client</span><span class="p">.</span><span class="n">get_job_logs</span><span class="p">(</span><span class="n">name</span><span class="o">=</span><span class="n">job_id</span><span class="p">)))</span>
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">jax-distributed</code> runtime injects <code class="language-plaintext highlighter-rouge">JAX_NUM_PROCESSES</code>, <code class="language-plaintext highlighter-rouge">JAX_PROCESS_ID</code>, and <code class="language-plaintext highlighter-rouge">JAX_COORDINATOR_ADDRESS</code> into the environment, and all processes must call <code class="language-plaintext highlighter-rouge">jax.distributed.initialize()</code> exactly once before any JAX computation.</p>

<p>For more details, refer to the <a href="https://www.kubeflow.org/docs/components/trainer/user-guides/jax/">Kubeflow Trainer JAX guide</a> for jax.distributed and SPMD primitives.</p>

<h2 id="bringing-xgboost-to-kubernetes-with-trainer">Bringing XGBoost to Kubernetes with Trainer</h2>

<p>Running distributed XGBoost workloads on Kubernetes has traditionally required manual setup of communication layers, environment variables, and cluster coordination. With this release, Kubeflow Trainer introduces built-in support for XGBoost, enabling seamless distributed training with minimal configuration.</p>

<p>The new <code class="language-plaintext highlighter-rouge">xgboost-distributed</code> runtime abstracts away the complexity of setting up XGBoost’s collective communication (Rabit). Trainer automatically provisions worker pods using JobSet and injects the required DMLC environment variables, allowing workers to coordinate and synchronize during training. The rank 0 pod is automatically configured to act as the tracker, simplifying cluster setup even further.</p>

<p>This integration supports both CPU and GPU workloads out of the box. For CPU training, each node runs a single worker leveraging OpenMP for intra-node parallelism. For GPU workloads, each GPU is mapped to an individual worker, enabling efficient scaling across nodes.</p>
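
<p>As a quick illustration, here is a minimal sketch that mirrors the JAX example above, submitting a job on the <code class="language-plaintext highlighter-rouge">xgboost-distributed</code> runtime through the SDK. The training function body is a placeholder – it only prints the injected DMLC variables – and stands in for your actual distributed XGBoost code:</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from kubeflow.trainer import TrainerClient, CustomTrainer

def train_xgboost():
    import os

    # The runtime injects the DMLC_* variables used by XGBoost's collective
    # communication (Rabit); print them to verify worker coordination.
    for key, value in sorted(os.environ.items()):
        if key.startswith("DMLC_"):
            print(f"{key}={value}")
    # ... your distributed XGBoost training code goes here ...

client = TrainerClient()
job_id = client.train(
    runtime="xgboost-distributed",
    trainer=CustomTrainer(func=train_xgboost),
)
client.wait_for_job_status(job_id)
print("\n".join(client.get_job_logs(name=job_id)))
</code></pre></div></div>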

<p>For more information, please see this <a href="https://github.com/kubeflow/trainer/blob/master/examples/xgboost/distributed-training/xgboost-distributed.ipynb">Notebook example</a> and <a href="https://www.kubeflow.org/docs/components/trainer/user-guides/xgboost/">documentation guide</a>.</p>

<h2 id="track-trainjob-progress-and-expose-metrics">Track TrainJob Progress and Expose Metrics</h2>

<p>In this release, Kubeflow Trainer introduces a powerful new capability to automatically update TrainJob status with real-time training progress and metrics generated directly from your ML code. This enables key insights – such as percentage completion, estimated time remaining (ETA), and training metrics – to be surfaced through the TrainJob API, eliminating the need to manually inspect training logs.</p>

<h3 id="how-it-works">How it works</h3>

<p>When this feature is enabled (feature flag <code class="language-plaintext highlighter-rouge">TrainJobStatus</code> is required), Kubeflow Trainer starts an HTTP server that exposes endpoints for reporting training progress and metrics. Client applications can send updates to these endpoints, and the TrainJob controller will automatically reflect this information in the job status. Users can then easily access these insights through the Kubeflow SDK without needing to inspect logs.</p>
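
<p>As a rough sketch of what consuming this looks like from the SDK side – assuming a TrainJob previously submitted as <code class="language-plaintext highlighter-rouge">job_id</code>, and treating the printed status as an illustrative stand-in for the progress and metrics surfaced by this feature:</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from kubeflow.trainer import TrainerClient

client = TrainerClient()

# Fetch the TrainJob and read the information reported by the training
# code; the status now carries progress and metrics instead of requiring
# you to grep the training logs.
job = client.get_job(name=job_id)
print(job.status)
</code></pre></div></div>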

<p>To simplify adoption, we are collaborating with popular ML frameworks to integrate Kubeflow Trainer callbacks that automate this process. With these integrations, users don’t need to change anything to make it work!</p>

<p>For example, this functionality is already available in <a href="https://github.com/huggingface/transformers/issues/44486">Hugging Face Transformers</a>, where metrics are automatically reported when using the Trainer:</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">transformers</span> <span class="kn">import</span> <span class="n">Trainer</span><span class="p">,</span> <span class="n">TrainingArguments</span>

<span class="n">trainer</span> <span class="o">=</span> <span class="n">Trainer</span><span class="p">(</span><span class="n">model</span><span class="o">=</span><span class="n">model</span><span class="p">,</span> <span class="n">args</span><span class="o">=</span><span class="n">TrainingArguments</span><span class="p">(...),</span> <span class="n">train_dataset</span><span class="o">=</span><span class="n">ds</span><span class="p">)</span>
<span class="n">trainer</span><span class="p">.</span><span class="n">train</span><span class="p">()</span>  <span class="c1"># Progress automatically reported when running in Kubeflow
</span></code></pre></div></div>

<h3 id="future-plans">Future Plans</h3>

<p>We have an exciting roadmap for this feature, including support for periodic, transparent checkpointing based on ETA, as well as integration with OptimizationJob for hyperparameter tuning jobs.</p>

<p>To learn more about this feature please see <a href="https://github.com/kubeflow/trainer/tree/master/docs/proposals/2779-trainjob-progress">this proposal.</a></p>

<h2 id="bringing-flux-framework-for-hpc-and-mpi-bootstrapping">Bringing Flux Framework for HPC and MPI Bootstrapping</h2>

<p>Setting up distributed ML training jobs using MPI can be very time-consuming: from stitching together launcher-worker topologies to configuring SSH-based bootstrapping, there are a lot of moving parts that require code on top of your training code. In v2.2, Kubeflow Trainer brings the Flux Framework – a workload manager that combines hierarchical job management with graph-based scheduling – to handle your HPC-style scheduling needs without the overhead that typically comes with it.</p>

<p>Flux uses ZeroMQ to bootstrap MPI, an improvement over traditional SSH, and also brings PMIx and support for more MPI variants. When a training job is submitted, an init container automatically handles Flux’s installation, meaning that you do not need to install Flux into your application container. The plugin also handles cluster discovery, broker configuration, and CURVE certificate generation to provide cryptographic security for the overlay network.</p>

<p>For teams whose workloads sit at the intersection of ML and HPC, Flux serves as a portability layer that enables running simulations alongside AI/ML workloads. Scheduling through Flux bypasses potential etcd bottlenecks and the limitations of the Kubernetes scheduler, which require tricks to batch-schedule to an underlying single-pod queue. Flux enables fine-grained control over where pods land, and is ideal when you are running simulation pipelines that feed into model training. This integration also enables the use of the Process Management Interface Exascale (PMIx) to manage and coordinate large-scale MPI workloads on Kubernetes using TrainJobs, something that was previously not possible.</p>

<p>Apply the Flux runtime and a TrainJob manifest. For example:</p>

<div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code>kubectl apply <span class="nt">--server-side</span> <span class="nt">-f</span> https://raw.githubusercontent.com/kubeflow/trainer/refs/heads/master/examples/flux/flux-runtime.yaml
kubectl apply <span class="nt">-f</span> https://raw.githubusercontent.com/kubeflow/trainer/refs/heads/master/examples/flux/lammps-train-job.yaml
</code></pre></div></div>

<p>After that, monitor the pods with <code class="language-plaintext highlighter-rouge">kubectl get pods --watch</code>, and inspect the lead broker logs with <code class="language-plaintext highlighter-rouge">kubectl logs &lt;pod-name&gt; -c node -f</code>. The example also shows how to run the Flux cluster in interactive mode with <code class="language-plaintext highlighter-rouge">flux-interactive.yaml</code>, and then use <code class="language-plaintext highlighter-rouge">kubectl exec</code> and <code class="language-plaintext highlighter-rouge">flux proxy</code> to connect to the lead broker Flux instance and manually run LAMMPS inside the cluster.</p>

<p>The Flux runtime depends on the <code class="language-plaintext highlighter-rouge">mlPolicy: flux</code> trigger in flux-runtime.yaml, and you can customize the setup through environment variables such as <code class="language-plaintext highlighter-rouge">FLUX_VIEW_IMAGE</code> and <code class="language-plaintext highlighter-rouge">FLUX_NETWORK_DEVICE</code>. Binaries are installed under <code class="language-plaintext highlighter-rouge">/mnt/flux</code>, software is copied to <code class="language-plaintext highlighter-rouge">/opt/software</code>, and configurations are stored in <code class="language-plaintext highlighter-rouge">/etc/flux-config</code>. Related documentation includes the Kubeflow Trainer Getting Started guide, the Flux example manifests, and the Flux Framework HPSF project resources. This first iteration is intentionally simple, and users are encouraged to submit feedback to request additional features. A demo video will be showcased at the KubeCon + CloudNativeCon 2026 EU booth for those who can attend.</p>

<p>You can learn more about this in our <a href="https://www.kubeflow.org/docs/components/trainer/user-guides/flux/">Flux Guide</a>.</p>

<h2 id="resource-timeout-for-trainjobs">Resource Timeout for TrainJobs</h2>

<p>Previously, TrainJob resources persisted in the cluster indefinitely after completion unless manually removed, which led to etcd bloat and resource contention, with no automatic garbage collection. A job could also get stuck or run indefinitely, wasting CPU/GPU capacity and reducing cluster efficiency. In v2.2, Kubeflow Trainer adds support for the activeDeadlineSeconds API in TrainJob. This field lets users set a hard timeout (in seconds) on a TrainJob’s active execution. When the deadline is exceeded, Trainer marks the TrainJob as Failed (reason: <code class="language-plaintext highlighter-rouge">DeadlineExceeded</code>), terminates the running workload, and deletes the underlying JobSet.</p>

<p>There are a couple of ways to specify the timeout limit of a job; the first is to modify the TrainJob manifest directly:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>apiVersion: trainer.kubeflow.org/v1alpha1
kind: TrainJob
metadata:
  name: quick-experiment
spec:
  activeDeadlineSeconds: 28800 #Max runtime 8 hours
runtimeRef:
  name: torch-distributed-gpu
trainer:
  image: my-training:latest
  numNodes: 2
</code></pre></div></div>

<p>More information about how to configure lifecycle policies for TrainJobs can be found in our <a href="https://www.kubeflow.org/docs/components/trainer/user-guides/trainjob-lifecycle/">TrainJob Lifecycle Guide</a>.</p>

<h2 id="runtimepatches-api-to-override-trainjob-defaults">RuntimePatches API to override TrainJob defaults</h2>

<p>In many distributed learning environments, multiple controllers can interact with the same TrainJob manifest, making ownership boundaries important to preserve. The new RuntimePatches API replaces PodTemplateOverrides with a manager-keyed structure that makes it explicit who applied what, and when.</p>

<p>Each patch is scoped to a named manager and can target specific jobs or pods within the runtime, with both job-level and pod-level overrides supported. This means Kueue can inject node selectors and tolerations into the trainer pod without conflicting with another controller managing job-level metadata, and the full history of what was applied is preserved directly in the spec.</p>

<p>In the new TrainJob manifest, every manager owns its own entry, and pod- and job-level overrides are separate fields under that manager. Note that the manager field is <strong>immutable</strong> after creation:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>apiVersion: trainer.kubeflow.org/v2alpha1
kind: TrainJob
metadata:
  name: pytorch-distributed
spec:
  runtimeRef:
    name: pytorch-distributed-gpu
  trainer:
    image: docker.io/custom-training
  runtimePatches:
    - manager: trainer.kubeflow.org/kubeflow-sdk # who owns this entry (immutable)
      trainingRuntimeSpec:
        template:
          spec:
            replicatedJobs:
              - name: node
                template:
                  spec:
                    template:
                      spec:
                        nodeSelector:
                          accelerator: nvidia-tesla-v100
</code></pre></div></div>

<p>Note that the RuntimePatches API cannot be used to set environment variables for the node, dataset-initializer, or model-initializer containers, nor to override command, args, image, or resources on the trainer container.</p>

<p>For a complete description of the API’s structure, restrictions and use cases, check out the <a href="https://www.kubeflow.org/docs/components/trainer/operator-guides/runtime-patches/#runtimepatches-overview">RuntimePatches Operator Guide</a>.</p>

<p>⚠️ <strong>This API introduces breaking changes!</strong></p>

<p>PodTemplateOverrides has been removed in v2.2. If you’re currently using it in your TrainJob manifests, you’ll need to migrate to the RuntimePatches API.</p>

<h2 id="breaking-changes">Breaking Changes</h2>

<p>This release introduces a set of architectural improvements and breaking changes that lay the foundations for a more scalable and modularized Trainer. Please review the following when upgrading to Trainer v2.2:</p>

<h3 id="replace-podtemplateoverrides-with-runtimepatches-api">Replace PodTemplateOverrides with RuntimePatches API</h3>

<p>As mentioned above, PodTemplateOverrides has been replaced with RuntimePatches API to support manager-scoped customization and prevent conflicts when multiple controllers are patching the same TrainJob.</p>

<p>If you are using PodTemplateOverrides in your TrainJob manifests or SDK code, you will need to migrate to the manager-keyed RuntimePatches structure. See the  <a href="https://www.kubeflow.org/docs/components/trainer/operator-guides/runtime-patches/#runtimepatches-overview">RuntimePatches Operator Guide</a>, and <a href="https://sdk.kubeflow.org/en/latest/train/options.html">Options Reference</a> for more information.</p>

<h3 id="remove-numprocpernode-from-the-torch-mlpolicy-api">Remove numProcPerNode from the Torch MLPolicy API</h3>

<p>The <code class="language-plaintext highlighter-rouge">numProcPerNode</code> field has been removed from the Torch MLPolicy. Process-per-node configuration is now handled directly through the container resources, so any TrainJob manifests or SDK calls that set <code class="language-plaintext highlighter-rouge">numProcPerNode</code> explicitly will need to be updated before upgrading to v2.2.</p>
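
<p>As a hedged sketch of the SDK-side equivalent – <code class="language-plaintext highlighter-rouge">my_train_func</code>, the runtime name, and the resource values are illustrative – the process count per node now follows the container resources:</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from kubeflow.trainer import TrainerClient, CustomTrainer

def my_train_func():
    ...  # your PyTorch training code

# Process-per-node is derived from the per-node container resources:
# requesting two GPUs per node implies two training processes per node.
job_id = TrainerClient().train(
    runtime="torch-distributed",
    trainer=CustomTrainer(
        func=my_train_func,
        num_nodes=2,
        resources_per_node={"nvidia.com/gpu": 2},
    ),
)
</code></pre></div></div>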

<h3 id="remove-elasticpolicy-api">Remove ElasticPolicy API</h3>

<p>The ElasticPolicy API has been removed from MLPolicy in Trainer v2.2. Elastic training is not yet available in this release; we are actively working on a <a href="https://github.com/kubeflow/trainer/issues/2903">redesigned implementation</a> for a future release. If your TrainJobs rely on elastic training configuration, please hold off on upgrading until that work lands.</p>

<h3 id="some-trainjob-api-fields-are-now-immutable">Some TrainJob API fields are now immutable</h3>

<p>Several TrainJob spec fields are now properly enforced as immutable after job creation. Modifications to fields such as .spec.trainer.image on a running TrainJob are rejected upfront, instead of silently failing at the JobSet controller level. If your workflows rely on updating these fields on a running TrainJob, those updates will now be rejected by the admission webhook. Please review your TrainJob update logic to ensure compatibility with the immutability policies in v2.2.</p>
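
<p>As a quick way to see the new behavior – a sketch using the Kubernetes Python client against a hypothetical running TrainJob named <code class="language-plaintext highlighter-rouge">pytorch-distributed</code> – an attempt to patch an immutable field should now fail at admission:</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from kubernetes import client, config

config.load_kube_config()
api = client.CustomObjectsApi()

try:
    # Attempt to mutate .spec.trainer.image on a running TrainJob; the
    # admission webhook now rejects this because the field is immutable.
    api.patch_namespaced_custom_object(
        group="trainer.kubeflow.org",
        version="v1alpha1",
        namespace="default",
        plural="trainjobs",
        name="pytorch-distributed",
        body={"spec": {"trainer": {"image": "my-training:v2"}}},
    )
except client.exceptions.ApiException as e:
    print(f"Update rejected: {e.reason}")
</code></pre></div></div>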

<h2 id="release-notes">Release Notes</h2>

<p>For the complete list of all pull requests, visit the <a href="https://github.com/kubeflow/trainer/releases/tag/v2.2.0">GitHub release page</a>.</p>

<h2 id="roadmap-moving-forward">Roadmap Moving Forward</h2>

<p>We are excited to continue pushing Kubeflow as a state-of-the-art platform for distributed ML training by making TrainJobs more observable and more performant across a wide range of hardware.</p>

<p>One area we’re particularly excited about is bringing Multi-Node NVLink (MNNVL) support for TrainJobs, 
enabling them to treat GPUs across multiple machines as a single unified memory domain. For 
large-scale training, this means significantly faster node-to-node communication compared to 
standard network-based primitives and brings forth a new era of configurations that simply 
weren’t practical before on Kubernetes. We are working closely with the Kubernetes community to introduce first-class support for Dynamic Resource Allocation (DRA) in TrainJobs.</p>

<p>We look forward to introducing automatic configuration of GPU requests for TrainJobs, which will
take the guesswork out of choosing the right resources. With intelligent methods guiding the
process, Trainer will choose appropriate resources automatically based on the TrainJob configuration.
This gives teams the power to plan experiments with confidence and trust that jobs use just the right
amount of compute.</p>

<p>Workload-Aware Scheduling (WAS) is also actively being integrated with the native Kubernetes Workload API for TrainJob to bring robust gang-scheduling support for distributed training without third-party plugins. The integration will be available after Kubernetes v1.36, and we plan to extend it further to support Topology-Aware Scheduling and Dynamic Resource Allocation (DRA) as those APIs mature.</p>

<p>A full list of our 2026 roadmap can be found <a href="https://github.com/kubeflow/trainer/pull/3242">here</a>.</p>

<h2 id="join-the-community">Join the Community</h2>

<p>The Kubeflow Trainer is built by and for the community. We welcome contributions, feedback, and participation from everyone! We want to thank the community for their contributions to this release. We invite you to:</p>

<h3 id="contribute">Contribute:</h3>

<ul>
  <li>Read the <a href="https://github.com/kubeflow/trainer/blob/master/CONTRIBUTING.md">Contributing Guide</a>.</li>
  <li>Browse the <a href="https://github.com/kubeflow/trainer/issues?q=is%3Aissue%20state%3Aopen%20good%20first%20issues">good first issues</a></li>
  <li>Explore the <a href="https://github.com/kubeflow/trainer">GitHub Repository</a></li>
</ul>

<h3 id="connect-with-the-community">Connect with the Community:</h3>

<ul>
  <li>Join <a href="https://cloud-native.slack.com/archives/C0742LDFZ4K">#kubeflow-trainer</a> on <a href="https://www.kubeflow.org/docs/about/community/#kubeflow-slack-channels">CNCF Slack</a></li>
  <li>Attend our biweekly <a href="https://docs.google.com/document/d/1MChKfzrKAeFRtYqypFbMXL6ZIc_OgijjkvbqmwRV-64/edit?tab=t.0">Kubeflow Trainer and Katib meetings</a></li>
</ul>

<h3 id="learn-more">Learn More:</h3>

<ul>
  <li>View the <a href="https://github.com/kubeflow/trainer/releases/tag/v2.2.0">GitHub Release</a></li>
  <li>Explore the <a href="https://www.kubeflow.org/docs/components/trainer/">Kubeflow Trainer docs</a></li>
</ul>

<p><strong>Headed to <a href="https://events.linuxfoundation.org/kubecon-cloudnativecon-europe/">KubeCon + CloudNativeCon 2026 EU</a>?</strong> Stop by the Kubeflow booth to see these features in action 😸🧊!!</p>]]></content><author><name>Kubeflow Trainer Team</name></author><category term="release" /><category term="trainer" /><summary type="html"><![CDATA[Just a little over one week ahead of KubeCon + CloudNativeCon EU 2026, the Kubeflow team is excited to ship Trainer v2.2. The v2.2 release reinforces our commitment to expanding the Kubeflow Trainer ecosystem – meeting developers where they are by adding native support for JAX, XGBoost, and Flux, while also delivering deeper observability into training jobs.]]></summary></entry><entry><title type="html">Kubeflow SDK v0.4.0: Model Registry, SparkConnect, and Enhanced Developer Experience</title><link href="https://blog.kubeflow.org/kubeflow-sdk-0.4.0-release/" rel="alternate" type="text/html" title="Kubeflow SDK v0.4.0: Model Registry, SparkConnect, and Enhanced Developer Experience" /><published>2026-03-19T00:00:00+00:00</published><updated>2026-03-19T00:00:00+00:00</updated><id>https://blog.kubeflow.org/kubeflow-sdk-0.4.0-release</id><content type="html" xml:base="https://blog.kubeflow.org/kubeflow-sdk-0.4.0-release/"><![CDATA[<blockquote>
  <p><strong>Explore the full documentation at <a href="https://sdk.kubeflow.org">sdk.kubeflow.org</a></strong></p>
</blockquote>

<p>With KubeCon just around the corner, we are pleased to announce the release of Kubeflow SDK v0.4.0. This release continues the work toward providing a unified, Pythonic interface for all AI workloads on Kubernetes.</p>

<p>The v0.4.0 release focuses on bridging the gap between data engineering, model management, and production-ready ML pipelines. The Kubeflow SDK now covers most of the MLOps lifecycle – from data processing and hyperparameter optimization to model training and registration:</p>

<p><img src="/images/2026-03-19-kubeflow-sdk-0.4.0-release/kubeflow-sdk.png" alt="Kubeflow SDK Diagram" /></p>

<p>Highlights in Kubeflow SDK v0.4.0 include:</p>

<ul>
  <li><a href="https://sdk.kubeflow.org/en/latest/hub/index.html">Model Registry Client</a> for managing model artifacts, versions, and metadata directly from the SDK.</li>
  <li><a href="https://sdk.kubeflow.org/en/latest/spark/index.html">SparkClient API</a> with SparkConnect support for interactive data processing</li>
  <li><a href="#better-isolation-with-namespaced-trainingruntimes">Namespaced TrainingRuntimes</a> for improved isolation and multi-tenant platform management</li>
  <li><a href="#furthering-parity-between-local-and-remote-execution">Dataset and Model Initializers</a> enabling better parity between local and Kubernetes execution</li>
  <li><a href="#a-new-home-for-documentation">A new Kubeflow SDK documentation website</a> with examples, and API reference</li>
  <li><a href="#required-upgrading-to-python-310">Minimum Python version updated</a> to Python 3.10 for improved security, typing, and runtime performance</li>
</ul>

<h2 id="unified-model-management-the-model-registry-client">Unified Model Management: The Model Registry Client</h2>

<p>Managing model artifacts, versions, and metadata across experiments has historically required stitching together multiple tools outside of your training code. In v0.4.0, the SDK introduces <code class="language-plaintext highlighter-rouge">ModelRegistryClient</code> – a Pythonic interface to the Kubeflow Model Registry, available under the new <code class="language-plaintext highlighter-rouge">kubeflow.hub</code> submodule.</p>

<p>The client exposes a minimal, curated API: register models, retrieve them by name and version, update their metadata, and iterate over what’s in your registry – all without leaving the SDK. It integrates directly with the Model Registry server and supports token auth and custom CA configuration for production clusters. To install the Model Registry server, see the <a href="https://www.kubeflow.org/docs/components/model-registry/installation/">installation guide</a>.</p>

<p>Install the hub extra to get started:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>pip <span class="nb">install</span> <span class="s1">'kubeflow[hub]'</span>
</code></pre></div></div>

<h3 id="usage-example">Usage Example</h3>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">kubeflow.hub</span> <span class="kn">import</span> <span class="n">ModelRegistryClient</span>

<span class="n">client</span> <span class="o">=</span> <span class="n">ModelRegistryClient</span><span class="p">(</span>
    <span class="s">"https://model-registry.kubeflow.svc.cluster.local"</span><span class="p">,</span>
    <span class="n">author</span><span class="o">=</span><span class="s">"Your Name"</span><span class="p">,</span>
<span class="p">)</span>

<span class="c1"># Register a model
</span><span class="n">model</span> <span class="o">=</span> <span class="n">client</span><span class="p">.</span><span class="n">register_model</span><span class="p">(</span>
    <span class="n">name</span><span class="o">=</span><span class="s">"my-model"</span><span class="p">,</span>
    <span class="n">uri</span><span class="o">=</span><span class="s">"s3://bucket/path/to/model"</span><span class="p">,</span>
    <span class="n">version</span><span class="o">=</span><span class="s">"1.0.0"</span><span class="p">,</span>
    <span class="n">model_format_name</span><span class="o">=</span><span class="s">"pytorch"</span><span class="p">,</span>
<span class="p">)</span>

<span class="c1"># List all models
</span><span class="k">for</span> <span class="n">model</span> <span class="ow">in</span> <span class="n">client</span><span class="p">.</span><span class="n">list_models</span><span class="p">():</span>
    <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"Model: </span><span class="si">{</span><span class="n">model</span><span class="p">.</span><span class="n">name</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>

<span class="c1"># Get a specific version and artifact
</span><span class="n">version</span> <span class="o">=</span> <span class="n">client</span><span class="p">.</span><span class="n">get_model_version</span><span class="p">(</span><span class="s">"my-model"</span><span class="p">,</span> <span class="s">"1.0.0"</span><span class="p">)</span>
<span class="n">artifact</span> <span class="o">=</span> <span class="n">client</span><span class="p">.</span><span class="n">get_model_artifact</span><span class="p">(</span><span class="s">"my-model"</span><span class="p">,</span> <span class="s">"1.0.0"</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"Model URI: </span><span class="si">{</span><span class="n">artifact</span><span class="p">.</span><span class="n">uri</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
</code></pre></div></div>

<blockquote>
  <p><strong>Note:</strong> <code class="language-plaintext highlighter-rouge">list_models()</code> and <code class="language-plaintext highlighter-rouge">list_model_versions()</code> return lazy iterators backed by pagination, so only the data you consume results in API calls – making it efficient to work with large registries.</p>
</blockquote>

<h2 id="distributed-ai-data-at-scale-sparkclient--sparkconnect">Distributed AI Data at Scale: SparkClient &amp; SparkConnect</h2>

<p>Data is a fundamental piece to every AI workload, and Apache Spark has become a cornerstone technology for large-scale data processing. However, deploying and managing Spark workloads on Kubernetes has traditionally required users to work directly with Kubernetes manifests and YAML configurations – a process that can be operationally complex. In v0.4.0, the SDK introduces <code class="language-plaintext highlighter-rouge">SparkClient</code> – a high-level, Pythonic API that eliminates this complexity, allowing data engineers and ML practitioners to manage interactive and batch Spark workloads on Kubernetes without writing a single line of YAML. Backed by the Kubeflow Spark Operator (<a href="https://github.com/kubeflow/sdk/blob/main/docs/proposals/107-spark-client/README.md">KEP-107</a>), the initial version of SparkClient introduces support for interactive sessions through the SparkConnect custom resource. In future releases of the Kubeflow SDK, we will expand this support to include batch workloads as well.</p>

<p><code class="language-plaintext highlighter-rouge">SparkClient</code> supports two operational modes. In <strong>create mode</strong>, the SDK provisions a new SparkConnect interactive session on Kubernetes for you – handling CRD creation, pod scheduling, networking, and cleanup automatically. In <strong>connect mode</strong>, you point it at an existing Spark Connect server, useful for shared clusters or cross-namespace access. Either way, you get back a standard <code class="language-plaintext highlighter-rouge">SparkSession</code> and can write the same PySpark code you already know.</p>

<p>Install Kubeflow Spark support:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>pip <span class="nb">install</span> <span class="s1">'kubeflow[spark]'</span>
</code></pre></div></div>

<p>To install the Spark Operator, see the <a href="https://www.kubeflow.org/docs/components/spark-operator/getting-started/">installation guide</a>.</p>

<h3 id="usage-example-1">Usage Example</h3>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">kubeflow.spark</span> <span class="kn">import</span> <span class="n">SparkClient</span><span class="p">,</span> <span class="n">Name</span>
<span class="kn">from</span> <span class="nn">kubeflow.common.types</span> <span class="kn">import</span> <span class="n">KubernetesBackendConfig</span>

<span class="n">client</span> <span class="o">=</span> <span class="n">SparkClient</span><span class="p">(</span>
    <span class="n">backend_config</span><span class="o">=</span><span class="n">KubernetesBackendConfig</span><span class="p">(</span><span class="n">namespace</span><span class="o">=</span><span class="s">"spark-test"</span><span class="p">)</span>
<span class="p">)</span>

<span class="c1"># Level 1: Minimal - use all defaults
</span><span class="n">spark</span> <span class="o">=</span> <span class="n">client</span><span class="p">.</span><span class="n">connect</span><span class="p">(</span><span class="n">options</span><span class="o">=</span><span class="p">[</span><span class="n">Name</span><span class="p">(</span><span class="s">"my-session"</span><span class="p">)])</span>
<span class="n">df</span> <span class="o">=</span> <span class="n">spark</span><span class="p">.</span><span class="nb">range</span><span class="p">(</span><span class="mi">5</span><span class="p">)</span>
<span class="n">df</span><span class="p">.</span><span class="n">show</span><span class="p">()</span>
<span class="n">client</span><span class="p">.</span><span class="n">delete_session</span><span class="p">(</span><span class="s">"my-session"</span><span class="p">)</span>

<span class="c1"># Level 2: Simple -- configure executors and resources
</span><span class="n">spark</span> <span class="o">=</span> <span class="n">client</span><span class="p">.</span><span class="n">connect</span><span class="p">(</span>
    <span class="n">num_executors</span><span class="o">=</span><span class="mi">5</span><span class="p">,</span>
    <span class="n">resources_per_executor</span><span class="o">=</span><span class="p">{</span><span class="s">"cpu"</span><span class="p">:</span> <span class="s">"5"</span><span class="p">,</span> <span class="s">"memory"</span><span class="p">:</span> <span class="s">"1Gi"</span><span class="p">},</span>
    <span class="n">spark_conf</span><span class="o">=</span><span class="p">{</span><span class="s">"spark.sql.adaptive.enabled"</span><span class="p">:</span> <span class="s">"true"</span><span class="p">},</span>
    <span class="n">options</span><span class="o">=</span><span class="p">[</span><span class="n">Name</span><span class="p">(</span><span class="s">"my-session-2"</span><span class="p">)],</span>
<span class="p">)</span>
<span class="n">df</span> <span class="o">=</span> <span class="n">spark</span><span class="p">.</span><span class="nb">range</span><span class="p">(</span><span class="mi">5</span><span class="p">)</span>
<span class="n">df</span><span class="p">.</span><span class="n">show</span><span class="p">()</span>
<span class="n">client</span><span class="p">.</span><span class="n">delete_session</span><span class="p">(</span><span class="s">"my-session-2"</span><span class="p">)</span>

<span class="c1"># Connect mode -- attach to an existing Spark Connect server
</span><span class="n">spark</span> <span class="o">=</span> <span class="n">client</span><span class="p">.</span><span class="n">connect</span><span class="p">(</span><span class="n">base_url</span><span class="o">=</span><span class="s">"sc://spark-server:15002"</span><span class="p">)</span>
<span class="n">df</span> <span class="o">=</span> <span class="n">spark</span><span class="p">.</span><span class="n">sql</span><span class="p">(</span><span class="s">"SELECT * FROM my_table"</span><span class="p">)</span>
<span class="n">df</span><span class="p">.</span><span class="n">show</span><span class="p">()</span>
</code></pre></div></div>

<p>Default specifications: Spark 4.0.1, 1 executor, 512Mi memory and 1 CPU per pod, and a 300-second session timeout.</p>

<blockquote>
  <p><strong>Note:</strong> v0.4.0 focuses on SparkConnect session management. Batch job support via SparkApplication CR (<code class="language-plaintext highlighter-rouge">submit_job</code>, <code class="language-plaintext highlighter-rouge">get_job</code>, <code class="language-plaintext highlighter-rouge">list_jobs</code>) is planned for a future release.</p>
</blockquote>

<h2 id="a-new-home-for-documentation">A New Home for Documentation</h2>

<p>To support Kubeflow SDK users and contributors, we’ve introduced a dedicated <a href="https://sdk.kubeflow.org">Kubeflow SDK Website</a>. This site includes:</p>

<ul>
  <li><strong><a href="https://sdk.kubeflow.org/en/latest/getting-started/quickstart.html">Quickstart</a>:</strong> Train your first model with Kubeflow SDK</li>
  <li><strong><a href="https://sdk.kubeflow.org/en/latest/train/api.html">API Reference</a>:</strong> Automatically updated documentation for all SDK modules.</li>
  <li><strong><a href="https://sdk.kubeflow.org/en/latest/examples.html">Examples</a>:</strong> Step-by-step guides from local prototyping to remote training.</li>
</ul>

<h2 id="infrastructure--breaking-changes">Infrastructure &amp; Breaking Changes</h2>

<p>This release includes several architectural updates to ensure the SDK remains secure, scalable, and easy to use. Please note the following requirements when upgrading to v0.4.0.</p>

<h3 id="better-isolation-with-namespaced-trainingruntimes">Better Isolation with Namespaced TrainingRuntimes</h3>

<p>Security and multi-tenancy are core to Kubeflow. In v0.4.0, we’ve introduced support for <a href="https://www.kubeflow.org/docs/components/trainer/operator-guides/runtime/#what-is-trainingruntime">Namespaced TrainingRuntimes</a>. This allows platform teams to provide curated training environments at the namespace level, ensuring that one team’s custom training configuration doesn’t interfere with another’s.</p>

<p><strong>Upgrade Note:</strong> The SDK now prioritizes namespaced runtimes over cluster-wide ones. If you have runtimes with duplicate names in different scopes, verify your <code class="language-plaintext highlighter-rouge">TrainerClient</code> calls are targeting the intended resources.</p>
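
<p>For example, you can inspect which runtimes your client resolves before submitting a job – a minimal sketch using the <code class="language-plaintext highlighter-rouge">TrainerClient</code> runtime helpers; the runtime name is illustrative:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from kubeflow.trainer import TrainerClient

client = TrainerClient()

# Namespaced runtimes now take precedence over cluster-wide ones with the same name.
for runtime in client.list_runtimes():
    print(runtime.name)

# Resolve the runtime explicitly before submitting a TrainJob.
runtime = client.get_runtime("torch-distributed")
</code></pre></div></div>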

<h3 id="furthering-parity-between-local-and-remote-execution">Furthering Parity Between Local and Remote Execution</h3>

<p>One of the biggest hurdles in MLOps is the “it worked on my machine” syndrome. With the addition of Dataset and Model Initializers for the <code class="language-plaintext highlighter-rouge">ContainerBackend</code>, the SDK now emulates how Kubernetes handles data dependencies.</p>

<p>Whether you are running locally on Docker or at scale on a cluster, the SDK now automatically manages the “plumbing” of mounting and initializing your data. This ensures your local development environment mirrors the data-loading behavior of your production training jobs.</p>

<h3 id="required-upgrading-to-python-310">Required: Upgrading to Python 3.10+</h3>

<p>To maintain a secure and performant codebase, Kubeflow SDK v0.4.0 is officially moving its minimum requirement to <a href="https://peps.python.org/pep-0619/">Python 3.10</a>.</p>

<p>This change ensures that all SDK users benefit from better security patches, improved type-hinting, and more efficient asynchronous networking for our API clients.</p>

<p><strong>To Upgrade:</strong> Ensure your local environment, Notebook images, and CI/CD pipelines are running Python 3.10 or higher before running <code class="language-plaintext highlighter-rouge">pip install --upgrade kubeflow</code>.</p>
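
<p>A quick, illustrative way to verify an environment before upgrading:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import sys

# Kubeflow SDK v0.4.0 requires Python 3.10 or newer.
assert sys.version_info &gt;= (3, 10), f"Found Python {sys.version.split()[0]}, need 3.10+"
</code></pre></div></div>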

<h2 id="whats-next-for-kubeflow-sdk">What’s Next for Kubeflow SDK</h2>

<p>Looking ahead, the Kubeflow SDK <a href="https://github.com/kubeflow/sdk/pull/326">2026 Roadmap</a> outlines several exciting initiatives:</p>

<ul>
  <li><strong>Kubeflow MCP Server</strong> to enable AI-assisted interactions with Kubeflow resources</li>
  <li><strong>OpenTelemetry integration</strong> for improved observability across SDK operations</li>
  <li><strong>MLflow support</strong> for experiment tracking and metrics</li>
  <li><strong>First class support for Kubeflow Pipelines</strong> to bring KFP into the unified SDK</li>
  <li><strong>TrainJob checkpointing and dynamic LLM Trainers</strong> for more flexible and resilient training workflows</li>
  <li><strong>End-to-end AI pipelines</strong> orchestrating data processing, training, and optimization using SparkClient, TrainerClient, and OptimizerClient</li>
  <li><strong>Multi-cluster job submission</strong> leveraging Kueue and Multi-Kueue capabilities for Spark and training workloads</li>
  <li><strong>Batch Spark job support</strong> via SparkApplication CR for submit, get, and list operations</li>
</ul>

<p>We encourage the community to review and contribute to the roadmap.</p>

<h2 id="get-involved">Get Involved!</h2>

<p>The Kubeflow SDK is built by and for the community, and we thank everyone who contributed to this release. We welcome contributions, feedback, and participation from everyone! We invite you to:</p>

<ul>
  <li><strong>Try it out:</strong> <code class="language-plaintext highlighter-rouge">pip install kubeflow==0.4.0</code></li>
  <li><strong>Contribute:</strong>
    <ul>
      <li>Read the <a href="https://github.com/kubeflow/sdk/blob/main/CONTRIBUTING.md">Contributing Guide</a>.</li>
      <li>Browse the <a href="https://github.com/kubeflow/sdk/issues?q=is%3Aissue+is%3Aopen+label%3A%22good+first+issue%22">good first issues</a></li>
      <li>Explore the <a href="https://github.com/kubeflow/sdk">GitHub Repository</a></li>
    </ul>
  </li>
</ul>

<p><strong>Connect with the Community:</strong></p>
<ul>
  <li>Join <a href="https://cloud-native.slack.com/archives/C08KJBVDH5H">#kubeflow-sdk</a> on <a href="https://www.kubeflow.org/docs/about/community/#kubeflow-slack-channels">CNCF Slack</a></li>
  <li>Attend the <a href="https://www.kubeflow.org/docs/about/community/#kubeflow-community-calendars">Kubeflow SDK and ML Experience WG meetings</a></li>
</ul>

<p><strong>Learn More</strong></p>
<ul>
  <li>Visit the <a href="https://sdk.kubeflow.org">Kubeflow SDK Website</a></li>
  <li>View the full <a href="https://github.com/kubeflow/sdk/releases/tag/0.4.0">Changelog</a>.</li>
</ul>

<p><strong>Headed to <a href="https://events.linuxfoundation.org/kubecon-cloudnativecon-europe/">KubeCon + CloudNativeCon 2026 EU</a>?</strong> Stop by the Kubeflow booth to see these features in action!</p>]]></content><author><name>Kubeflow SDK Team</name></author><category term="release" /><category term="sdk" /><summary type="html"><![CDATA[Explore the full documentation at sdk.kubeflow.org]]></summary></entry><entry><title type="html">Introducing the Metaflow-Kubeflow Integration</title><link href="https://blog.kubeflow.org/metaflow/" rel="alternate" type="text/html" title="Introducing the Metaflow-Kubeflow Integration" /><published>2026-02-04T00:00:00+00:00</published><updated>2026-02-04T00:00:00+00:00</updated><id>https://blog.kubeflow.org/introducing-metaflow-kubeflow-integration</id><content type="html" xml:base="https://blog.kubeflow.org/metaflow/"><![CDATA[<h1 id="a-tale-of-two-flows-metaflow-and-kubeflow">A tale of two flows: Metaflow and Kubeflow</h1>

<p>Metaflow is a Python framework for building and operating ML and AI projects, originally developed and open-sourced by Netflix in 2019. In many ways, Kubeflow and Metaflow are cousins: closely related in spirit, but designed with distinct goals and priorities.</p>

<p><a href="https://docs.metaflow.org/">Metaflow</a> emerged from Netflix’s need to empower data scientists and ML/AI developers with developer-friendly, Python-native tooling, so that they could iterate quickly on ideas, compare modeling approaches, and ship the best solutions to production without heavy engineering or DevOps involvement. On the infrastructure side, Metaflow started with AWS-native services like AWS Batch and Step Functions, later expanding to provide first-class support for the Kubernetes ecosystem and other hyperscaler clouds.</p>

<p>In contrast, Kubeflow began as a set of Kubernetes operators for distributed TensorFlow and Jupyter Notebook management. Over time, it has evolved into a comprehensive Cloud Native AI ecosystem, offering a broad set of tools out of the box. These include Trainer, Katib, Spark Operator for orchestrating distributed AI workloads, Workspaces for interactive development environments, Hub for AI catalog and artifacts management, KServe for model serving, and Pipelines to deploy end-to-end ML workflows and stitch Kubeflow components together.</p>

<p>Over the years, Metaflow has delighted end users with its intuitive APIs, while Kubeflow has delivered tons of value to infrastructure teams through its robust platform components. This complementary nature of the tools motivated us to build a bridge between the two: <a href="https://docs.metaflow.org/production/scheduling-metaflow-flows/scheduling-with-kubeflow">you can now author projects in Metaflow and deploy them as Kubeflow Pipelines</a>, side by side with your existing Kubeflow workloads.</p>

<h1 id="why-metaflow--kubeflow">Why Metaflow → Kubeflow</h1>

<p>In <a href="https://www.cncf.io/wp-content/uploads/2025/11/cncf_report_techradar_111025a.pdf">the most recent CNCF Technology Radar survey</a> from October 2025, Metaflow got the highest positive scores in the “<em>likelihood to recommend</em>” and “<em>usefulness</em>” categories, reflecting its success in providing a set of stable, productivity-boosting APIs for ML/AI developers.</p>

<p>Metaflow spans the entire development lifecycle—from early experimentation to production deployment and ongoing operations. To give you an idea, the core features below illustrate the breadth of its API surface, grouped by project stage:</p>

<h2 id="development">Development</h2>

<ul>
  <li>
    <p>Straightforward APIs for <a href="https://docs.metaflow.org/metaflow/basics">creating and composing workflows</a>.</p>
  </li>
  <li>
    <p>Automated state transfer and management through <a href="https://docs.metaflow.org/metaflow/basics#artifacts">artifacts</a>, allowing you to <a href="https://docs.metaflow.org/metaflow/authoring-flows/introduction">build flows incrementally</a> and resume them freely (see <a href="https://netflixtechblog.com/supercharging-the-ml-and-ai-development-experience-at-netflix-b2d5b95c63eb">a recent article by Netflix</a> about the topic).</p>
  </li>
  <li>
    <p>Interactive, <a href="https://docs.metaflow.org/metaflow/basics">real-time visual outputs</a> from tasks through cards - a perfect substrate for <a href="https://outerbounds.com/blog/visualize-everything-with-ai">custom observability solutions, created quickly with AI copilots</a>.</p>
  </li>
  <li>
    <p>Choose the right balance between code and configuration through <a href="https://docs.metaflow.org/metaflow/configuring-flows/introduction">built-in configuration management</a>.</p>
  </li>
  <li>
    <p>Create domain-specific abstractions and project-level policies through <a href="https://docs.metaflow.org/metaflow/composing-flows/introduction">custom decorators</a>.</p>
  </li>
</ul>

<h2 id="scaling">Scaling</h2>

<ul>
  <li>
    <p><a href="https://docs.metaflow.org/scaling/remote-tasks/introduction">Scale flows horizontally and vertically</a>: Both task and data parallelism are supported.</p>
  </li>
  <li>
    <p><a href="https://docs.metaflow.org/scaling/failures">Handle failures gracefully</a>.</p>
  </li>
  <li>
    <p><a href="https://docs.metaflow.org/scaling/dependencies">Package dependencies automatically</a> with support for Conda, PyPI, and uv.</p>
  </li>
  <li>
    <p>Leverage <a href="https://docs.metaflow.org/scaling/remote-tasks/distributed-computing">distributed computing paradigms</a> such as Ray, MPI, and Torch Distributed.</p>
  </li>
  <li>
    <p><a href="https://docs.metaflow.org/scaling/checkpoint/introduction">Checkpoint long-running tasks</a> and manage checkpoints consistently.</p>
  </li>
</ul>

<h2 id="deployment">Deployment</h2>

<ul>
  <li>
    <p>Maintain a clear separation between experimentation, production, and individual developers through <a href="https://docs.metaflow.org/scaling/tagging">namespaces</a>.</p>
  </li>
  <li>
    <p>Adopt CI/CD and GitOps best practices through <a href="https://docs.metaflow.org/production/coordinating-larger-metaflow-projects">branching</a>.</p>
  </li>
  <li>
    <p><a href="https://docs.metaflow.org/production/event-triggering">Compose large, reactive systems</a> through isolated sub-flows with event triggering.</p>
  </li>
</ul>

<p>These features provide a unified, user-facing API for the capabilities required by real-world ML and AI systems. Behind the scenes, Metaflow is built on integrations with production-quality infrastructure, effectively acting as a user-interface layer over platforms like Kubernetes - and now, Kubeflow. The diagram below illustrates the division of responsibilities:
<img style="max-width: 100%; height: auto; display: block;" alt="kubeflow-metaflow-arch" src="https://github.com/user-attachments/assets/88f4af4e-7e27-4287-b275-88e4b1b87449" /></p>

<p>The key benefit of the Metaflow–Kubeflow integration is that it allows organizations to <strong>keep their existing Kubernetes and Kubeflow infrastructure intact, while upgrading the developer experience with higher-level abstractions and additional functionality, provided by Metaflow.</strong></p>

<p>Currently, the integration supports deploying Metaflow flows as Kubeflow Pipelines. Once you have Metaflow tasks running on Kubernetes, you can access other components such as Katib and Trainer from Metaflow tasks through their Python clients as usual.</p>

<h1 id="metaflow--kubeflow-in-practice">Metaflow → Kubeflow in practice</h1>

<p>As the integration requires no changes in your existing Kubeflow infrastructure, it is straightforward to get started. You can <a href="https://docs.metaflow.org/getting-started/infrastructure">deploy Metaflow in an existing cloud account</a> (GCP, Azure, or AWS) or you can <a href="https://docs.metaflow.org/getting-started/devstack">install the dev stack on your laptop</a> with a single command.</p>

<p>Once you have Metaflow and Kubeflow running independently, you can install the extension providing the integration (you can <a href="https://docs.metaflow.org/production/scheduling-metaflow-flows/scheduling-with-kubeflow">follow instructions in the documentation</a>):</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>pip install metaflow-kubeflow
</code></pre></div></div>

<p>The only configuration needed is to point Metaflow at your Kubeflow Pipelines service, either by adding the following line in the Metaflow config or by setting it as an environment variable:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>METAFLOW_KUBEFLOW_PIPELINES_URL = "http://my-kubeflow"
</code></pre></div></div>
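
<p>Or, equivalently, as an environment variable (shell example; the URL is a placeholder for your Kubeflow Pipelines endpoint):</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>export METAFLOW_KUBEFLOW_PIPELINES_URL="http://my-kubeflow"
</code></pre></div></div>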

<p>After this, you can author a Metaflow flow as usual and test it locally:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>python flow.py run
</code></pre></div></div>

<p>which runs the flow quickly as local processes. If everything looks good, you can deploy the flow as a Kubeflow pipeline:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>python flow.py kubeflow-pipelines create
</code></pre></div></div>

<p>This will package all the source code and dependencies of the flow automatically, compile the Metaflow flow into a Kubeflow Pipelines YAML and deploy it to Kubeflow, which you can see alongside your existing pipelines in the Kubeflow UI. The following screencast shows the process in action:</p>

<p><a href="https://www.youtube.com/watch?v=ALg0A9SzRG8"><img src="https://i.ytimg.com/vi/ALg0A9SzRG8/maxresdefault.jpg" alt="" /></a></p>
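
<p>For reference, the <code class="language-plaintext highlighter-rouge">flow.py</code> used above can be any ordinary Metaflow flow. A minimal illustrative example using Metaflow’s standard <code class="language-plaintext highlighter-rouge">FlowSpec</code> API might look like this:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from metaflow import FlowSpec, step

class HelloFlow(FlowSpec):

    @step
    def start(self):
        # Artifacts assigned to self are persisted and passed between steps.
        self.message = "hello from Metaflow on Kubeflow"
        self.next(self.end)

    @step
    def end(self):
        print(self.message)

if __name__ == "__main__":
    HelloFlow()
</code></pre></div></div>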

<p>The integration doesn’t have 100% feature coverage yet: some Metaflow features, such as <a href="https://docs.metaflow.org/metaflow/basics#conditionals">conditional</a> and <a href="https://docs.metaflow.org/metaflow/basics#recursion">recursive</a> steps, are not yet supported. In future versions, we may also provide additional convenience APIs for other Kubeflow components, such as KServe - or you can easily implement them yourself as <a href="https://docs.metaflow.org/metaflow/composing-flows/custom-decorators">custom decorators</a> with the <a href="https://sdk.kubeflow.org/en/latest/">Kubeflow SDK</a>!</p>

<p>If you want to learn more about the integration, you can watch <a href="https://www.youtube.com/watch?v=YDKRIiQNMU0">an announcement webinar</a> on Youtube.</p>

<h1 id="feedback-welcome">Feedback welcome!</h1>

<p>Like Kubeflow, Metaflow is an open-source project actively developed by multiple organizations — including Netflix, which maintains a dedicated team working on Metaflow, and <a href="https://outerbounds.com">Outerbounds, which provides a managed Metaflow platform</a> deployed in customers’ own cloud environments.</p>

<p>The Metaflow community convenes at <a href="http://slack.outerbounds.co">the Metaflow Slack</a>. We welcome you to join, ask questions, and give feedback about the Kubeflow integration, and share your wishlist items for the roadmap. We are looking forward to a fruitful collaboration between the two communities!</p>]]></content><author><name>{&quot;name&quot;=&gt;&quot;&quot;, &quot;email&quot;=&gt;&quot;&quot;}</name></author><category term="community" /><summary type="html"><![CDATA[A tale of two flows: Metaflow and Kubeflow]]></summary></entry><entry><title type="html">Kubeflow AI Reference Platform 1.11 Release Announcement</title><link href="https://blog.kubeflow.org/kubeflow-1.11-release/" rel="alternate" type="text/html" title="Kubeflow AI Reference Platform 1.11 Release Announcement" /><published>2025-12-22T00:00:00+00:00</published><updated>2025-12-22T00:00:00+00:00</updated><id>https://blog.kubeflow.org/kubeflow-1.11-release</id><content type="html" xml:base="https://blog.kubeflow.org/kubeflow-1.11-release/"><![CDATA[<p>Kubeflow AI Reference Platform 1.11 delivers substantial platform improvements focused on scalability, security, and operational efficiency. The release reduces per namespace overhead, strengthens multi-tenant defaults, and improves overall reliability for running Kubeflow at scale on Kubernetes.</p>

<h2 id="highlight-features">Highlight features</h2>

<ul>
  <li>Trainer v2.1.0 with unified TrainJob API, Python-first workflows, and built-in LLM fine-tuning support</li>
  <li>Multi-tenant S3 storage with per-namespace credentials, with SeaweedFS replacing MinIO as the default backend</li>
  <li>Massive scalability improvements enabling Kubeflow deployments to scale to 1,000+ users, profiles, and namespaces</li>
  <li>Zero pod overhead by default for namespaces and profiles, significantly reducing baseline resource consumption</li>
  <li>Optimized Istio service mesh configuration to dramatically reduce sidecar memory usage and network traffic in large clusters</li>
  <li>Stronger security defaults with Pod Security Standards (restricted for system namespaces, baseline for user namespaces)</li>
  <li>Improved authentication and exposure patterns for KServe inference services, with automated tests and documentation</li>
  <li>Expanded Helm chart support (experimental) to improve modularity and deployment flexibility</li>
  <li>Updates across core components, including Kubeflow Pipelines, Katib, KServe, Model Registry, Istio, and Spark Operator</li>
</ul>

<h2 id="kubeflow-platform-manifests--security">Kubeflow Platform (Manifests &amp; Security)</h2>

<p>The Kubeflow Platform Working Group focuses on simplifying Kubeflow installation, operations, and security. See details below.</p>

<h3 id="manifests">Manifests:</h3>

<ul>
  <li><a href="https://github.com/kubeflow/manifests/blob/master/README.md">Documentation updates</a> that make it easier to install,
extend and upgrade Kubeflow</li>
  <li>For more details and future plans please check <a href="https://github.com/kubeflow/manifests/issues/3038">1.12.0</a> roadmap.</li>
</ul>

<table>
  <thead>
    <tr>
      <th style="text-align: center">Notebooks</th>
      <th style="text-align: center">Dashboard</th>
      <th style="text-align: center">Pipelines</th>
      <th style="text-align: center">Katib</th>
      <th style="text-align: center">Trainer</th>
      <th style="text-align: center">KServe</th>
      <th style="text-align: center">Model Registry</th>
      <th style="text-align: center">Spark</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td style="text-align: center"><a href="https://github.com/kubeflow/kubeflow/issues/7459">1.10</a></td>
      <td style="text-align: center"><a href="https://github.com/kubeflow/kubeflow/releases/tag/v1.10.0">1.10</a></td>
      <td style="text-align: center"><a href="https://github.com/kubeflow/pipelines/releases/tag/2.15.2">2.15.2</a></td>
      <td style="text-align: center"><a href="https://github.com/kubeflow/katib/releases/tag/v0.19.0">0.19.0</a></td>
      <td style="text-align: center"><a href="https://github.com/kubeflow/trainer/releases/tag/v2.1.0">2.1.0</a></td>
      <td style="text-align: center"><a href="https://github.com/kserve/kserve/releases/tag/v0.15.2">0.15.2</a></td>
      <td style="text-align: center"><a href="https://github.com/kubeflow/model-registry/releases/tag/v0.3.4">0.3.4</a></td>
      <td style="text-align: center"><a href="https://github.com/kubeflow/spark-operator/releases/tag/v2.4.0">2.4.0</a></td>
    </tr>
  </tbody>
</table>

<table>
  <thead>
    <tr>
      <th style="text-align: center">Kubernetes</th>
      <th style="text-align: center">Kind</th>
      <th style="text-align: center">Kustomize</th>
      <th style="text-align: center">Cert Manager</th>
      <th style="text-align: center">Knative</th>
      <th style="text-align: center">Istio</th>
      <th style="text-align: center">Dex</th>
      <th style="text-align: center">OAuth2-proxy</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td style="text-align: center">1.33+</td>
      <td style="text-align: center">0.30.0</td>
      <td style="text-align: center">5.7.1</td>
      <td style="text-align: center">1.16.1</td>
      <td style="text-align: center">1.20</td>
      <td style="text-align: center">1.28</td>
      <td style="text-align: center">2.43</td>
      <td style="text-align: center">7.10</td>
    </tr>
  </tbody>
</table>

<h3 id="security">Security:</h3>

<ul>
  <li><strong>Pod Security Standards enforced by default</strong> (see the example namespace labels after this list):
    <ul>
      <li><code class="language-plaintext highlighter-rouge">restricted</code> for all Kubeflow system namespaces<br />
(<a href="https://github.com/kubeflow/manifests/pull/3190">#3190</a>, <a href="https://github.com/kubeflow/manifests/pull/3050">#3050</a>)</li>
      <li><code class="language-plaintext highlighter-rouge">baseline</code> for user namespaces<br />
(<a href="https://github.com/kubeflow/manifests/pull/3204">#3204</a>, <a href="https://github.com/kubeflow/manifests/pull/3220">#3220</a>)</li>
    </ul>
  </li>
  <li><strong>Network policies enabled by default</strong> for critical system namespaces<br />
(<code class="language-plaintext highlighter-rouge">knative-serving</code>, <code class="language-plaintext highlighter-rouge">oauth2-proxy</code>, <code class="language-plaintext highlighter-rouge">cert-manager</code>, <code class="language-plaintext highlighter-rouge">istio-system</code>, <code class="language-plaintext highlighter-rouge">auth</code>)<br />
(<a href="https://github.com/kubeflow/manifests/pull/3228">#3228</a>)</li>
  <li><strong>Improved multi-tenant isolation for object storage</strong>, with per-namespace S3 credentials<br />
(<a href="https://github.com/kubeflow/manifests/pull/3240">#3240</a>)</li>
  <li><strong>Authentication enforcement for KServe inference services</strong><br />
(<a href="https://github.com/kubeflow/manifests/pull/3180">#3180</a>)</li>
</ul>
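
<p>Pod Security Standards are applied through the standard Kubernetes namespace labels; for example, a user namespace under this release’s defaults would carry labels along these lines (illustrative):</p>

<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code>apiVersion: v1
kind: Namespace
metadata:
  name: my-profile            # illustrative user/profile namespace
  labels:
    pod-security.kubernetes.io/enforce: baseline
    pod-security.kubernetes.io/enforce-version: latest
</code></pre></div></div>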

<p>Trivy CVE scans as of December 15, 2025:</p>

<table>
  <thead>
    <tr>
      <th style="text-align: center">Working Group</th>
      <th style="text-align: center">Images</th>
      <th style="text-align: center">Critical CVE</th>
      <th style="text-align: center">High CVE</th>
      <th style="text-align: center">Medium CVE</th>
      <th style="text-align: center">Low CVE</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td style="text-align: center">Katib</td>
      <td style="text-align: center">18</td>
      <td style="text-align: center">1</td>
      <td style="text-align: center">35</td>
      <td style="text-align: center">158</td>
      <td style="text-align: center">562</td>
    </tr>
    <tr>
      <td style="text-align: center">Pipelines</td>
      <td style="text-align: center">15</td>
      <td style="text-align: center">12</td>
      <td style="text-align: center">432</td>
      <td style="text-align: center">1051</td>
      <td style="text-align: center">1558</td>
    </tr>
    <tr>
      <td style="text-align: center">Workbenches(Notebooks)</td>
      <td style="text-align: center">12</td>
      <td style="text-align: center">39</td>
      <td style="text-align: center">312</td>
      <td style="text-align: center">525</td>
      <td style="text-align: center">267</td>
    </tr>
    <tr>
      <td style="text-align: center">Kserve</td>
      <td style="text-align: center">16</td>
      <td style="text-align: center">35</td>
      <td style="text-align: center">535</td>
      <td style="text-align: center">11929</td>
      <td style="text-align: center">1745</td>
    </tr>
    <tr>
      <td style="text-align: center">Manifests</td>
      <td style="text-align: center">15</td>
      <td style="text-align: center">6</td>
      <td style="text-align: center">105</td>
      <td style="text-align: center">256</td>
      <td style="text-align: center">55</td>
    </tr>
    <tr>
      <td style="text-align: center">Trainer</td>
      <td style="text-align: center">9</td>
      <td style="text-align: center">4</td>
      <td style="text-align: center">157</td>
      <td style="text-align: center">9012</td>
      <td style="text-align: center">728</td>
    </tr>
    <tr>
      <td style="text-align: center">Model Registry</td>
      <td style="text-align: center">3</td>
      <td style="text-align: center">3</td>
      <td style="text-align: center">75</td>
      <td style="text-align: center">132</td>
      <td style="text-align: center">36</td>
    </tr>
    <tr>
      <td style="text-align: center">Spark</td>
      <td style="text-align: center">1</td>
      <td style="text-align: center">4</td>
      <td style="text-align: center">22</td>
      <td style="text-align: center">1688</td>
      <td style="text-align: center">151</td>
    </tr>
    <tr>
      <td style="text-align: center">All Images</td>
      <td style="text-align: center">89</td>
      <td style="text-align: center">104</td>
      <td style="text-align: center">1673</td>
      <td style="text-align: center">24751</td>
      <td style="text-align: center">5102</td>
    </tr>
  </tbody>
</table>

<h2 id="pipelines">Pipelines</h2>

<p>This release of KFP introduces several notable changes that users should consider prior to upgrading. Comprehensive upgrade and documentation notes will follow shortly. In the interim, please note the following key modifications:</p>

<h3 id="default-object-store-update">Default object store update</h3>

<p>Kubeflow Pipelines now defaults to SeaweedFS for the object store deployment, replacing the previous default of MinIO.
MinIO remains fully supported, as does any S3-compatible object storage backend; only the default deployment configuration has changed.</p>

<p>Existing MinIO manifests are still available for users who wish to continue using MinIO, though these legacy manifests may be removed in future releases. Users with existing data are advised to back up and restore as needed when switching object store backends.</p>

<h3 id="database-backend-upgrade">Database backend upgrade</h3>

<p>This release includes a major upgrade to the Gorm database backend, which introduces an automated database index migration for users upgrading from versions prior to 2.15.0.
Because this migration does not support rollback, it is strongly recommended that production databases be backed up before performing the upgrade.</p>
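
<p>As an illustration only, a MySQL-backed installation could be backed up with something along these lines before upgrading (the deployment, credentials, and database name below are assumptions – adjust them to your environment):</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Back up the Kubeflow Pipelines database (resource and database names are examples).
kubectl exec -n kubeflow deploy/mysql -- \
  mysqldump -u root mlpipeline &gt; kfp-mlpipeline-backup.sql
</code></pre></div></div>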

<h2 id="model-registry">Model Registry</h2>

<p>Model Registry continues to mature with new capabilities for model discovery, governance, and deeper integration with the Kubeflow ecosystem.</p>

<h3 id="model-registry-ui">Model Registry UI</h3>

<p>The user-friendly web interface for centralized model metadata, version tracking, and artifact management now supports filtering, sorting, archiving, custom metadata, and metadata editing, making it easier for teams to organize and govern their model lifecycle.</p>

<h3 id="model-catalog">Model Catalog</h3>

<p>A new Model Catalog feature enables model discovery and sharing with governance controls.
A <a href="https://github.com/kubeflow/community/blob/master/proposals/907-model-registry-renaming/README.md#model-catalog-cluster-scoped-company-scoped">Model Catalog</a> is a pattern in which an organisation defines its validated and approved models, enabling discovery and sharing across teams while ensuring model governance and compliance.
Administrators can define a number of catalog sources, including Hugging Face, and control filtering and model visibility.
Teams can discover and use approved models from the organisation’s catalog.
The catalog UI and backend are under active development.</p>

<h3 id="kserve-integration">KServe Integration</h3>

<ul>
  <li><strong>Custom Storage Initializer (CSI)</strong>: Enables model download and deployment using model metadata directly from the Registry.</li>
  <li><strong>Reconciliation loop</strong>: A deployable Kubernetes controller which observes KServe InferenceServices to automatically populate Model Registry logical-model records, keeping registry audit records of live deployments.</li>
</ul>

<h3 id="storage-integrations">Storage Integrations</h3>

<ul>
  <li><strong>Python client workflows</strong>: Data scientists can leverage convenience functions in the Python client to <a href="https://model-registry.readthedocs.io/en/latest/#uploading-local-models-to-external-storage-and-registering-them">package, store, and register models and their metadata</a> in a single playbook (see the sketch after this list).</li>
  <li><strong>Async Upload Job</strong>: A Kubernetes Job for transferring and packaging models (including the KServe ModelCar OCI image format), simplifying model storage operations in production environments while leveraging the scaling and orchestration capabilities of Kubernetes without additional dependencies.</li>
</ul>
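
<p>As a rough sketch of the Python client flow (the server address, author, and parameter values below are placeholders, and the exact convenience helpers are described in the linked documentation):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from model_registry import ModelRegistry

# Constructor arguments are illustrative; see the Model Registry client docs.
registry = ModelRegistry("https://model-registry.example.com", author="data-scientist")

# Register a model that has already been uploaded to external storage.
registry.register_model(
    "my-model",
    "s3://my-bucket/models/my-model/1",
    version="1.0.0",
    model_format_name="onnx",
    model_format_version="1",
)
</code></pre></div></div>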

<h3 id="additional-improvements">Additional Improvements</h3>

<ul>
  <li>Removal of the legacy Google MLMD dependency.</li>
  <li>PostgreSQL support alongside MySQL.</li>
  <li>Multi-architecture container builds (amd64/arm64).</li>
  <li>SBOM generation for container builds and OpenSSF Scorecard CI integration.</li>
</ul>

<h2 id="training-operator-trainer--katib">Training Operator (Trainer) &amp; Katib</h2>

<p>Kubeflow 1.11 includes Trainer v2.1.0, a major architectural evolution that simplifies distributed training on Kubernetes with a unified API, Python-first workflows, and enhanced LLM fine-tuning capabilities.</p>

<h3 id="new-api-architecture">New API Architecture</h3>

<p>Kubeflow Trainer v2 introduces <strong>TrainJob</strong>, a unified training job API that replaces framework-specific CRDs (PyTorchJob, TFJob, etc.). Infrastructure configuration is now separated into <strong>TrainingRuntime</strong> and <strong>ClusterTrainingRuntime</strong> resources, creating a clean boundary between platform engineering (runtime setup) and data science (job submission).</p>
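
<p>To make that split concrete, a TrainJob only references a runtime published by the platform team. A hedged sketch of the resource (field names follow the Trainer v2 API; values are illustrative):</p>

<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code>apiVersion: trainer.kubeflow.org/v1alpha1
kind: TrainJob
metadata:
  name: pytorch-example
  namespace: team-a
spec:
  runtimeRef:
    name: torch-distributed   # TrainingRuntime or ClusterTrainingRuntime managed by the platform team
  trainer:
    numNodes: 2
</code></pre></div></div>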

<h3 id="python-first-experience">Python-First Experience</h3>

<ul>
  <li><strong>No YAML required</strong>: Install with <code class="language-plaintext highlighter-rouge">pip install kubeflow</code> and submit jobs directly from Python notebooks or scripts.</li>
  <li><strong>Local execution mode</strong>: Develop and test training code locally without a Kubernetes cluster before scaling to production.</li>
  <li><strong>Helm Charts</strong>: Deploy with <code class="language-plaintext highlighter-rouge">helm install kubeflow-trainer oci://ghcr.io/kubeflow/charts/kubeflow-trainer --version 2.1.0</code>.</li>
</ul>

<h3 id="llm-fine-tuning">LLM Fine-Tuning</h3>

<p>Built-in support for large language model fine-tuning workflows:</p>
<ul>
  <li>TorchTune trainer with pre-configured runtimes for Llama 3.2, Qwen 2.5, and more.</li>
  <li>LoRA, QLoRA, and DoRA for parameter-efficient fine-tuning.</li>
  <li>Dataset and model initializers for HuggingFace and S3 storage.</li>
</ul>

<h3 id="distributed-ai-data-cache">Distributed AI Data Cache</h3>

<p>Optional in-memory cache cluster (powered by <a href="https://arrow.apache.org/">Apache Arrow</a> and <a href="https://datafusion.apache.org/">Apache DataFusion</a>) streams datasets directly to GPU nodes with zero-copy transfers, maximizing GPU utilization and minimizing I/O wait times for large-scale training workloads. More details can be found <a href="https://www.kubeflow.org/docs/components/trainer/user-guides/data-cache/">here</a>.</p>

<h3 id="scheduler-integrations">Scheduler Integrations</h3>

<ul>
  <li><strong>Kueue</strong>: Topology-aware scheduling and multi-cluster job dispatching for TrainJobs, enabling optimal placement for distributed training across node groups.</li>
  <li><strong>Volcano</strong>: Gang-scheduling support with PodGroup integration.</li>
  <li><strong>MPI</strong>: First-class support for MPI-based distributed training workloads on Kubernetes.</li>
</ul>

<h3 id="katib">Katib</h3>

<p>Katib hyperparameter tuning remains compatible with Trainer v2, allowing users to optimize model hyperparameters alongside the new training workflow.</p>

<p>A major addition is the integration with Kubeflow SDK (<a href="https://github.com/kubeflow/sdk/tree/main/docs/proposals/46-hyperparameter-optimization">KEP-46</a>, <a href="https://github.com/kubeflow/sdk/pull/124">PR #124</a>). The new <code class="language-plaintext highlighter-rouge">OptimizerClient</code> allows users to define and run hyperparameter experiments directly from Python notebooks without writing YAML. You can configure search spaces, objectives, and algorithms using <code class="language-plaintext highlighter-rouge">OptimizerClient().optimize()</code>. Each trial runs as a TrainJob with different hyperparameter values, and training code can report metrics using simple Python functions. The client includes standard methods for managing jobs: <code class="language-plaintext highlighter-rouge">create_job()</code>, <code class="language-plaintext highlighter-rouge">get_job()</code>, <code class="language-plaintext highlighter-rouge">list_jobs()</code>, and <code class="language-plaintext highlighter-rouge">delete_job()</code>.</p>
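
<p>The sketch below shows the general shape of that workflow. The exact <code class="language-plaintext highlighter-rouge">optimize()</code> parameters are defined in KEP-46, so treat the argument names here as illustrative placeholders rather than the final API:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from kubeflow.optimizer import OptimizerClient

def train_model(lr: float = 0.01):
    # Runs as a TrainJob for each trial; a real objective would train and report metrics.
    loss = (lr - 0.005) ** 2
    print(f"loss={loss}")

client = OptimizerClient()

# Argument names below are illustrative placeholders for the search space and objective.
job_name = client.optimize(
    objective_fn=train_model,
    search_space={"lr": (1e-4, 1e-1)},
    objective_metric="loss",
)

print(client.get_job(job_name))
</code></pre></div></div>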

<h2 id="spark-operator">Spark Operator</h2>

<p>The Spark Operator has received broad improvements in Kubeflow 1.11, spanning Spark version support, workload management, scheduling, and operational simplicity.</p>

<h3 id="broader-spark-support">Broader Spark Support</h3>

<p>The operator now supports Apache Spark 4 and introduces Spark Connect, enabling modern client–server Spark interactions. This allows users to connect to Spark sessions remotely and improves compatibility with the evolving Spark ecosystem.</p>

<h3 id="workload-management--scheduling">Workload Management &amp; Scheduling</h3>

<ul>
  <li><strong>Suspend / Resume SparkApplications</strong>: Users can now suspend and resume jobs, giving greater control over workload lifecycle.</li>
  <li><strong>Kueue integration</strong>: Integration with <a href="https://kueue.sigs.k8s.io/">Kueue</a> enables queue-based workload management and fair sharing of cluster resources across teams.</li>
  <li><strong>Enhanced dynamic allocation</strong>: Improved shuffle tracking and dynamic allocation controls for more efficient resource usage.</li>
</ul>

<h3 id="operations--security">Operations &amp; Security</h3>

<ul>
  <li><strong>Automatic CRD upgrades</strong>: Helm hooks now handle CRD upgrades automatically, reducing manual steps during upgrades.</li>
  <li><strong>Deprecation of sparkctl</strong>: Legacy <code class="language-plaintext highlighter-rouge">sparkctl</code> has been deprecated in favor of kubectl-native workflows.</li>
  <li><strong>Flexible Ingress &amp; cert-manager support</strong>: More configurable Ingress (TLS, annotations, URL patterns) and simplified certificate handling via cert-manager.</li>
</ul>

<h3 id="observability">Observability</h3>

<ul>
  <li><strong>Structured logging</strong>: Configurable JSON and console log output formats.</li>
  <li><strong>Better validation</strong>: Stricter validation of SparkApplication names and specs, catching misconfigurations earlier.</li>
</ul>

<h2 id="kserve">KServe</h2>

<p>KServe in Kubeflow 1.11 delivers major improvements across model serving, inference capabilities, and operational maturity.</p>

<h3 id="multi-node-inference">Multi-Node Inference</h3>

<p>KServe now supports multi-node inference, enabling large models to be distributed across multiple nodes using Ray-based serving runtimes. This is critical for deploying very large language models that exceed single-node GPU capacity.</p>

<h3 id="model-cache-improvements">Model Cache Improvements</h3>

<p>The Model Cache feature, introduced in v0.14, has been significantly hardened. Fixes include correct URI matching, protection against cache mismatches, support for multiple node groups, and PVC/PV retention after InferenceService deletion, making model caching more reliable for production use.</p>

<h3 id="keda-autoscaling-integration">KEDA Autoscaling Integration</h3>

<p>KServe introduces integration with <a href="https://keda.sh/">KEDA</a> for event-driven autoscaling, including an external scaler implementation. This gives users more flexible scaling options beyond the built-in Knative and HPA-based autoscalers.</p>

<h3 id="gateway-api-support">Gateway API Support</h3>

<p>Raw deployment mode now supports the Kubernetes Gateway API, providing a modern, standardized alternative to Ingress for routing inference traffic.</p>

<h3 id="vllm--hugging-face-runtime-updates">vLLM &amp; Hugging Face Runtime Updates</h3>

<ul>
  <li>Upgraded vLLM to v0.8.1+ with support for reasoning models, tool calling, embeddings, reranking, and Llama 4 / Qwen 3.</li>
  <li>vLLM V1 engine support and CPU inference via Intel Extension for PyTorch.</li>
  <li>LMCache integration with vLLM for improved KV cache reuse.</li>
  <li>Hugging Face runtime updates include 4-bit quantization support (bitsandbytes), speculative decoding, and deprecation of OpenVINO support.</li>
</ul>

<h3 id="inference-graph-enhancements">Inference Graph Enhancements</h3>

<ul>
  <li>InferenceGraphs now support pod spec fields (affinity, tolerations, resources) and well-known labels.</li>
  <li>Improved Istio mesh compatibility and fixed response codes for conditional routing steps.</li>
</ul>

<h3 id="operational--security-improvements">Operational &amp; Security Improvements</h3>

<ul>
  <li>ModelCar (OCI-based model loading) enabled by default.</li>
  <li>Collocation of transformer and predictor containers in a single pod.</li>
  <li>Stop-and-resume model serving via annotations (serverless mode).</li>
  <li>Configurable label and annotation propagation to serving pods.</li>
  <li>SBOM generation and third-party license inclusion for all images.</li>
  <li>Multiple CVE fixes including <code class="language-plaintext highlighter-rouge">CVE-2025-43859</code> and <code class="language-plaintext highlighter-rouge">CVE-2025-24357</code>.</li>
</ul>

<h2 id="kubeflow-sdk">Kubeflow SDK</h2>

<p>Kubeflow 1.11 is the first AI Reference Platform release where users can simply <code class="language-plaintext highlighter-rouge">pip install kubeflow</code> to start working with AI workloads, no Kubernetes expertise required. The <a href="https://sdk.kubeflow.org/en/latest/">Kubeflow SDK</a> provides a unified Python interface to train models, run hyperparameter tuning, and manage model artifacts across the Kubeflow ecosystem. It also enables local development without a Kubernetes cluster, so users can iterate on their training code locally before scaling to production. For documentation and examples, visit <a href="https://sdk.kubeflow.org/en/latest/">sdk.kubeflow.org</a>.</p>

<h2 id="dashboard-and-notebooks">Dashboard and Notebooks</h2>

<p>The Kubeflow Central Dashboard and Notebooks remain at version 1.10 in this release, providing stable and reliable experiences. Stay tuned for interesting updates in upcoming Kubeflow AI Reference Platform releases.</p>

<h2 id="how-to-get-started-with-111">How to get started with 1.11</h2>

<p>Visit the Kubeflow AI Reference Platform 1.11 <a href="https://github.com/kubeflow/manifests/releases">release page</a> or head over to the Getting Started and Support pages.</p>

<h2 id="join-the-community">Join the Community</h2>

<p>We would like to thank everyone who contributed to Kubeflow 1.11, and especially Valentina Rodriguez Sosa for her work as the v1.11 Release Manager. We also extend our thanks to the entire release team and the working group leads, who continuously and generously dedicate their time and expertise to Kubeflow.</p>

<p>Release team members: Valentina Rodriguez Sosa, Anya Kramar, Tarek Abouzeid, Andy Stoneberg, Humair Khan, Matteo Mortari, Adysen Rothman, Jon Burdo, Milos Grubjesic, Vraj Bhatt, Dhanisha Phadate, Alok Dangre</p>

<p>Working Group leads: Andrey Velichkevich, Julius von Kohout, Mathew Wicks, Matteo Mortari</p>

<p>Kubeflow Steering Committee: Andrey Velichkevich, Julius von Kohout, Yuan Tang, Johnu George, Francisco Javier Araceo</p>

<p>You can find more details about Kubeflow distributions
<a href="https://www.kubeflow.org/docs/started/installing-kubeflow/#packaged-distributions">here</a>.</p>

<h2 id="want-to-help">Want to help?</h2>

<p>The Kubeflow community Working Groups hold open meetings and are always looking for more volunteers and users to unlock
the potential of machine learning. If you’re interested in becoming a Kubeflow contributor, please feel free to check
out the resources below. We look forward to working with you!</p>

<ul>
  <li>Visit our <a href="https://www.kubeflow.org/docs/about/community/">Kubeflow website</a> or Kubeflow GitHub Page.</li>
  <li>Join the <a href="https://www.kubeflow.org/docs/about/community/">Kubeflow Slack channel</a>.</li>
  <li>Join the <a href="https://groups.google.com/g/kubeflow-discuss">kubeflow-discuss</a> mailing list.</li>
  <li>Attend our weekly <a href="https://www.kubeflow.org/docs/about/community/#kubeflow-community-call">community meeting</a>.</li>
</ul>]]></content><author><name>Kubeflow 1.11 Release Team</name></author><category term="release" /><summary type="html"><![CDATA[Kubeflow AI Reference Platform 1.11 delivers substantial platform improvements focused on scalability, security, and operational efficiency. The release reduces per namespace overhead, strengthens multi-tenant defaults, and improves overall reliability for running Kubeflow at scale on Kubernetes.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://blog.kubeflow.org/images/logo.png" /><media:content medium="image" url="https://blog.kubeflow.org/images/logo.png" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">Introducing the Kubeflow SDK: A Pythonic API to Run AI Workloads at Scale</title><link href="https://blog.kubeflow.org/sdk/intro/" rel="alternate" type="text/html" title="Introducing the Kubeflow SDK: A Pythonic API to Run AI Workloads at Scale" /><published>2025-11-07T00:00:00+00:00</published><updated>2025-11-07T00:00:00+00:00</updated><id>https://blog.kubeflow.org/sdk/introducing-kubeflow-sdk</id><content type="html" xml:base="https://blog.kubeflow.org/sdk/intro/"><![CDATA[<blockquote>
  <p><strong>⚡ We want your feedback!</strong> Help shape the future of Kubeflow SDK by taking our <a href="https://docs.google.com/forms/d/e/1FAIpQLSet_IAFQzMMDWolzFt5LI9lhzqOOStjIGHxgYqKBnVcRtDfrw/viewform?usp=dialog">quick survey</a>.</p>
</blockquote>

<h1 id="unified-sdk-concept">Unified SDK Concept</h1>

<p>Scaling AI workloads shouldn’t require deep expertise in distributed systems and container orchestration. Whether you are prototyping on local hardware or deploying to a production Kubernetes cluster, you need a unified API that abstracts infrastructure complexity while preserving flexibility. That’s exactly what the Kubeflow Python SDK delivers.</p>

<p>As an AI Practitioner, you’ve probably experienced this frustrating journey: you start by prototyping locally, training your model on your laptop. When you need more compute power, you have to rewrite everything for distributed training. You containerize your code, rebuild images for every small change, write Kubernetes YAMLs, wrestle with kubectl, and juggle multiple SDKs — one for training, another for hyperparameter tuning, and yet another for pipelines. Each step demands different tools, APIs, and mental models.</p>

<p>All this complexity slows down productivity, drains focus, and ultimately holds back AI innovation. What if there was a better way?</p>

<p>The Kubeflow community started the <strong>Kubeflow SDK &amp; ML Experience Working Group</strong> (WG) in order to address these challenges. You can find more information about this WG on our <a href="https://youtu.be/VkbVVk2OGUI?list=PLmzRWLV1CK_wSO2IMPnzChxESmaoXNfrY">YouTube playlist</a>.</p>

<h1 id="introducing-kubeflow-sdk">Introducing Kubeflow SDK</h1>

<p>The SDK sits on top of the Kubeflow ecosystem as a unified interface layer. When you write Python code, the SDK translates it into the appropriate Kubernetes resources — generating CRs, handling orchestration, and managing distributed communication. You get all the power of Kubeflow and distributed AI compute without needing to understand Kubernetes.</p>

<p><img src="/images/2025-11-07-introducing-kubeflow-sdk/kubeflow-sdk.drawio.svg" alt="kubeflow ecosystem" /></p>

<p>Getting started is simple:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">pip</span> <span class="n">install</span> <span class="n">kubeflow</span>
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">kubeflow.trainer</span> <span class="kn">import</span> <span class="n">TrainerClient</span>

<span class="k">def</span> <span class="nf">train_model</span><span class="p">():</span>
    <span class="kn">import</span> <span class="nn">torch</span>

    <span class="n">model</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">nn</span><span class="p">.</span><span class="n">Linear</span><span class="p">(</span><span class="mi">10</span><span class="p">,</span> <span class="mi">1</span><span class="p">)</span>
    <span class="n">optimizer</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">optim</span><span class="p">.</span><span class="n">Adam</span><span class="p">(</span><span class="n">model</span><span class="p">.</span><span class="n">parameters</span><span class="p">())</span>

    <span class="c1"># Training loop
</span>    <span class="k">for</span> <span class="n">epoch</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">10</span><span class="p">):</span>
        <span class="c1"># Your training logic
</span>        <span class="k">pass</span>

    <span class="n">torch</span><span class="p">.</span><span class="n">save</span><span class="p">(</span><span class="n">model</span><span class="p">.</span><span class="n">state_dict</span><span class="p">(),</span> <span class="s">"model.pt"</span><span class="p">)</span>

<span class="c1"># Create a client and train
</span><span class="n">client</span> <span class="o">=</span> <span class="n">TrainerClient</span><span class="p">()</span>
<span class="n">client</span><span class="p">.</span><span class="n">train</span><span class="p">(</span><span class="n">train_func</span><span class="o">=</span><span class="n">train_model</span><span class="p">)</span>
</code></pre></div></div>

<p>The following principles are the foundation that guide the design and implementation of the SDK:</p>

<ul>
  <li><strong>Unified Experience</strong>: Single SDK to interact with multiple Kubeflow projects through consistent Python APIs</li>
  <li><strong>Simplified AI Workloads</strong>: Abstract away Kubernetes complexity and work effortlessly across all Kubeflow projects using familiar Python APIs</li>
  <li><strong>Built for Scale</strong>: Seamlessly scale any AI workload — from local laptop to large-scale production cluster with thousands of GPUs using the same APIs.</li>
  <li><strong>Rapid Iteration</strong>: Reduced friction between development and production environments</li>
  <li><strong>Local Development</strong>: First-class support for local development without a Kubernetes cluster, requiring only a pip installation</li>
</ul>

<h2 id="role-in-the-kubeflow-ecosystem">Role in the Kubeflow Ecosystem</h2>

<p>The SDK doesn’t replace any Kubeflow projects — it provides a unified way to use them. Kubeflow Trainer, Katib, Spark Operator, Pipelines, etc still handle the actual workload execution. The SDK makes them easier to interact with through consistent Python APIs, letting you work entirely in the language you already use for ML development.</p>

<p>This creates a clear separation:</p>
<ul>
  <li><strong>AI Practitioners</strong> use the SDK to submit jobs and manage workflows through Python, without touching YAML or Kubernetes directly</li>
  <li><strong>Platform Administrators</strong> continue managing infrastructure — installing components, configuring runtimes, setting resource quotas. Nothing changes on the infrastructure side.</li>
</ul>

<p><img src="/images/2025-11-07-introducing-kubeflow-sdk/user-personas.drawio.svg" alt="kubeflow user personas" /></p>

<p>The Kubeflow SDK works with your existing Kubeflow deployment. If you already have Kubeflow Trainer and Katib installed, just <code class="language-plaintext highlighter-rouge">pip install kubeflow</code> and start using them through the unified interface. As Kubeflow evolves with new components and features, the SDK provides a stable Python layer that adapts alongside the ecosystem.</p>

<table>
  <thead>
    <tr>
      <th>Project</th>
      <th>Status</th>
      <th>Description</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Kubeflow Trainer</td>
      <td>Available ✅</td>
      <td>Train and fine-tune AI models with various frameworks</td>
    </tr>
    <tr>
      <td>Kubeflow Optimizer</td>
      <td>Available ✅</td>
      <td>Hyperparameter optimization</td>
    </tr>
    <tr>
      <td>Kubeflow Pipelines</td>
      <td>Planned 🚧</td>
      <td>Build, run, and track AI workflows</td>
    </tr>
    <tr>
      <td>Kubeflow Model Registry</td>
      <td>Planned 🚧</td>
      <td>Manage model artifacts, versions and ML artifacts metadata</td>
    </tr>
    <tr>
      <td>Kubeflow Spark Operator</td>
      <td>Planned 🚧</td>
      <td>Manage Spark applications for data processing and feature engineering</td>
    </tr>
  </tbody>
</table>

<h1 id="key-features">Key Features</h1>

<h2 id="unified-python-interface">Unified Python Interface</h2>

<p>The SDK provides a consistent experience across all Kubeflow components. Whether you’re training models or optimizing hyperparameters, the APIs follow the same patterns:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">kubeflow.trainer</span> <span class="kn">import</span> <span class="n">TrainerClient</span>
<span class="kn">from</span> <span class="nn">kubeflow.optimizer</span> <span class="kn">import</span> <span class="n">OptimizerClient</span>

<span class="c1"># Initialize clients
</span><span class="n">trainer</span> <span class="o">=</span> <span class="n">TrainerClient</span><span class="p">()</span>
<span class="n">optimizer</span> <span class="o">=</span> <span class="n">OptimizerClient</span><span class="p">()</span>

<span class="c1"># List jobs
</span><span class="n">TrainerClient</span><span class="p">().</span><span class="n">list_jobs</span><span class="p">()</span>
<span class="n">OptimizerClient</span><span class="p">().</span><span class="n">list_jobs</span><span class="p">()</span>
</code></pre></div></div>

<h2 id="trainer-client">Trainer Client</h2>

<p>The TrainerClient provides the easiest way to run distributed training on Kubernetes, built on top of <a href="https://blog.kubeflow.org/trainer/intro/">Kubeflow Trainer v2</a>. Whether you’re training custom models with PyTorch, or fine-tuning LLMs, the client provides a Python API for submitting and monitoring training jobs at scale.</p>

<p>The client works with pre-configured runtimes that Platform Administrators set up. These runtimes define the container images, resource policies, and infrastructure settings. As an AI Practitioner, you reference these runtimes and focus on your training code:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">kubeflow.trainer</span> <span class="kn">import</span> <span class="n">TrainerClient</span><span class="p">,</span> <span class="n">CustomTrainer</span>

<span class="k">def</span> <span class="nf">get_torch_dist</span><span class="p">():</span>
    <span class="s">"""Your PyTorch training code runs on each node."""</span>
    <span class="kn">import</span> <span class="nn">os</span>
    <span class="kn">import</span> <span class="nn">torch</span>
    <span class="kn">import</span> <span class="nn">torch.distributed</span> <span class="k">as</span> <span class="n">dist</span>

    <span class="n">dist</span><span class="p">.</span><span class="n">init_process_group</span><span class="p">(</span><span class="n">backend</span><span class="o">=</span><span class="s">"gloo"</span><span class="p">)</span>
    <span class="k">print</span><span class="p">(</span><span class="s">"PyTorch Distributed Environment"</span><span class="p">)</span>
    <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"WORLD_SIZE: </span><span class="si">{</span><span class="n">dist</span><span class="p">.</span><span class="n">get_world_size</span><span class="p">()</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
    <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"RANK: </span><span class="si">{</span><span class="n">dist</span><span class="p">.</span><span class="n">get_rank</span><span class="p">()</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
    <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"LOCAL_RANK: </span><span class="si">{</span><span class="n">os</span><span class="p">.</span><span class="n">environ</span><span class="p">[</span><span class="s">'LOCAL_RANK'</span><span class="p">]</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>

<span class="c1"># Create the TrainJob
</span><span class="n">job_id</span> <span class="o">=</span> <span class="n">TrainerClient</span><span class="p">().</span><span class="n">train</span><span class="p">(</span>
    <span class="n">runtime</span><span class="o">=</span><span class="n">TrainerClient</span><span class="p">().</span><span class="n">get_runtime</span><span class="p">(</span><span class="s">"torch-distributed"</span><span class="p">),</span>
    <span class="n">trainer</span><span class="o">=</span><span class="n">CustomTrainer</span><span class="p">(</span>
        <span class="n">func</span><span class="o">=</span><span class="n">get_torch_dist</span><span class="p">,</span>
        <span class="n">num_nodes</span><span class="o">=</span><span class="mi">3</span><span class="p">,</span>
        <span class="n">resources_per_node</span><span class="o">=</span><span class="p">{</span>
            <span class="s">"cpu"</span><span class="p">:</span> <span class="mi">2</span><span class="p">,</span>
        <span class="p">},</span>
    <span class="p">),</span>
<span class="p">)</span>

<span class="c1"># Wait for TrainJob to complete
</span><span class="n">TrainerClient</span><span class="p">().</span><span class="n">wait_for_job_status</span><span class="p">(</span><span class="n">job_id</span><span class="p">)</span>

<span class="c1"># Print TrainJob logs
</span><span class="k">print</span><span class="p">(</span><span class="s">"</span><span class="se">\n</span><span class="s">"</span><span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="n">TrainerClient</span><span class="p">().</span><span class="n">get_job_logs</span><span class="p">(</span><span class="n">name</span><span class="o">=</span><span class="n">job_id</span><span class="p">)))</span>
</code></pre></div></div>

<p>The TrainerClient supports <code class="language-plaintext highlighter-rouge">CustomTrainer</code> for your own training logic and <a href="https://www.kubeflow.org/docs/components/trainer/user-guides/builtin-trainer/torchtune/"><code class="language-plaintext highlighter-rouge">BuiltinTrainer</code></a> for pre-packaged training patterns like LLM fine-tuning.</p>

<p>Getting started with LLM fine-tuning is as simple as a single line. The default model, dataset, and training configurations are pre-baked into the runtime:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">TrainerClient</span><span class="p">().</span><span class="n">train</span><span class="p">(</span>
    <span class="n">runtime</span><span class="o">=</span><span class="n">TrainerClient</span><span class="p">().</span><span class="n">get_runtime</span><span class="p">(</span><span class="s">"torchtune-qwen2.5-1.5b"</span><span class="p">),</span>
<span class="p">)</span>
</code></pre></div></div>

<p>You can also customize every aspect of the fine-tuning process — specify your own dataset, model, LoRA configuration, and training hyperparameters:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">kubeflow.trainer</span> <span class="kn">import</span> <span class="n">TrainerClient</span><span class="p">,</span> <span class="n">BuiltinTrainer</span><span class="p">,</span> <span class="n">TorchTuneConfig</span>
<span class="kn">from</span> <span class="nn">kubeflow.trainer</span> <span class="kn">import</span> <span class="n">Initializer</span><span class="p">,</span> <span class="n">HuggingFaceDatasetInitializer</span><span class="p">,</span> <span class="n">HuggingFaceModelInitializer</span>
<span class="kn">from</span> <span class="nn">kubeflow.trainer</span> <span class="kn">import</span> <span class="n">TorchTuneInstructDataset</span><span class="p">,</span> <span class="n">LoraConfig</span><span class="p">,</span> <span class="n">DataFormat</span>

<span class="n">client</span> <span class="o">=</span> <span class="n">TrainerClient</span><span class="p">()</span>

<span class="n">client</span><span class="p">.</span><span class="n">train</span><span class="p">(</span>
    <span class="n">runtime</span><span class="o">=</span><span class="n">client</span><span class="p">.</span><span class="n">get_runtime</span><span class="p">(</span><span class="n">name</span><span class="o">=</span><span class="s">"torchtune-llama3.2-1b"</span><span class="p">),</span>
    <span class="n">initializer</span><span class="o">=</span><span class="n">Initializer</span><span class="p">(</span>
        <span class="n">dataset</span><span class="o">=</span><span class="n">HuggingFaceDatasetInitializer</span><span class="p">(</span>
            <span class="n">storage_uri</span><span class="o">=</span><span class="s">"hf://tatsu-lab/alpaca/data"</span>
        <span class="p">),</span>
        <span class="n">model</span><span class="o">=</span><span class="n">HuggingFaceModelInitializer</span><span class="p">(</span>
            <span class="n">storage_uri</span><span class="o">=</span><span class="s">"hf://meta-llama/Llama-3.2-1B-Instruct"</span><span class="p">,</span>
            <span class="n">access_token</span><span class="o">=</span><span class="s">"hf_..."</span><span class="p">,</span>
        <span class="p">)</span>
    <span class="p">),</span>
    <span class="n">trainer</span><span class="o">=</span><span class="n">BuiltinTrainer</span><span class="p">(</span>
        <span class="n">config</span><span class="o">=</span><span class="n">TorchTuneConfig</span><span class="p">(</span>
            <span class="n">dataset_preprocess_config</span><span class="o">=</span><span class="n">TorchTuneInstructDataset</span><span class="p">(</span>
                <span class="n">source</span><span class="o">=</span><span class="n">DataFormat</span><span class="p">.</span><span class="n">PARQUET</span><span class="p">,</span>
            <span class="p">),</span>
            <span class="n">peft_config</span><span class="o">=</span><span class="n">LoraConfig</span><span class="p">(</span>
                <span class="n">apply_lora_to_mlp</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span>
                <span class="n">lora_attn_modules</span><span class="o">=</span><span class="p">[</span><span class="s">"q_proj"</span><span class="p">,</span> <span class="s">"k_proj"</span><span class="p">,</span> <span class="s">"v_proj"</span><span class="p">,</span> <span class="s">"output_proj"</span><span class="p">],</span>
                <span class="n">quantize_base</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span>
            <span class="p">),</span>
            <span class="n">resources_per_node</span><span class="o">=</span><span class="p">{</span>
                <span class="s">"gpu"</span><span class="p">:</span> <span class="mi">1</span><span class="p">,</span>
            <span class="p">}</span>
        <span class="p">)</span>
    <span class="p">)</span>
<span class="p">)</span>
</code></pre></div></div>

<p>You can mix and match — use the runtime’s default model but specify your own dataset, or keep the default dataset but customize the LoRA parameters. The Initializers download datasets and models once to shared storage, then all training pods access the data from there — reducing startup time and network usage.</p>
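
<p>For example, here is a minimal, hypothetical sketch of that mix-and-match approach (it assumes the <code class="language-plaintext highlighter-rouge">Initializer</code> fields are optional and reuses the runtime and dataset names from the example above): keep the runtime’s default model and fine-tuning configuration, and only point the dataset initializer at your own data:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from kubeflow.trainer import TrainerClient, Initializer, HuggingFaceDatasetInitializer

client = TrainerClient()

# Override only the dataset; the runtime keeps its default model and config.
client.train(
    runtime=client.get_runtime(name="torchtune-llama3.2-1b"),
    initializer=Initializer(
        dataset=HuggingFaceDatasetInitializer(
            storage_uri="hf://tatsu-lab/alpaca/data"
        ),
    ),
)
</code></pre></div></div>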

<p>For more details about Kubeflow Trainer capabilities, including gang-scheduling, fault tolerance, and MPI support, check out the <a href="https://blog.kubeflow.org/trainer/intro/">Kubeflow Trainer v2 blog post</a>.</p>

<h2 id="optimizer-client">Optimizer Client</h2>

<p>The OptimizerClient manages hyperparameter optimization on Kubernetes for models of any size. With consistent APIs across TrainerClient and OptimizerClient, you can easily transition from training to optimization — define your training job template once, specify which parameters to optimize, and the client orchestrates multiple trials to find the best hyperparameter configuration. This consistency significantly enhances the user experience during AI development.</p>

<p>The client launches trials in parallel according to your resource constraints, tracks metrics across experiments, and identifies optimal parameters.</p>

<p>First, define your training job template:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">kubeflow.trainer</span> <span class="kn">import</span> <span class="n">TrainerClient</span><span class="p">,</span> <span class="n">CustomTrainer</span>
<span class="kn">from</span> <span class="nn">kubeflow.optimizer</span> <span class="kn">import</span> <span class="n">OptimizerClient</span><span class="p">,</span> <span class="n">TrainJobTemplate</span><span class="p">,</span> <span class="n">Search</span><span class="p">,</span> <span class="n">Objective</span><span class="p">,</span> <span class="n">TrialConfig</span>

<span class="k">def</span> <span class="nf">train_func</span><span class="p">(</span><span class="n">learning_rate</span><span class="p">:</span> <span class="nb">float</span><span class="p">,</span> <span class="n">batch_size</span><span class="p">:</span> <span class="nb">int</span><span class="p">):</span>
    <span class="s">"""Training function with hyperparameters."""</span>
    <span class="c1"># Your training code here
</span>    <span class="kn">import</span> <span class="nn">time</span>
    <span class="kn">import</span> <span class="nn">random</span>

    <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">10</span><span class="p">):</span>
        <span class="n">time</span><span class="p">.</span><span class="n">sleep</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span>
        <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"Training </span><span class="si">{</span><span class="n">i</span><span class="si">}</span><span class="s">, lr: </span><span class="si">{</span><span class="n">learning_rate</span><span class="si">}</span><span class="s">, batch_size: </span><span class="si">{</span><span class="n">batch_size</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>

    <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"loss=</span><span class="si">{</span><span class="nb">round</span><span class="p">(</span><span class="n">random</span><span class="p">.</span><span class="n">uniform</span><span class="p">(</span><span class="mf">0.77</span><span class="p">,</span> <span class="mf">0.99</span><span class="p">),</span> <span class="mi">2</span><span class="p">)</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>


<span class="c1"># Create a reusable template
</span><span class="n">template</span> <span class="o">=</span> <span class="n">TrainJobTemplate</span><span class="p">(</span>
    <span class="n">trainer</span><span class="o">=</span><span class="n">CustomTrainer</span><span class="p">(</span>
        <span class="n">func</span><span class="o">=</span><span class="n">train_func</span><span class="p">,</span>
        <span class="n">func_args</span><span class="o">=</span><span class="p">{</span><span class="s">"learning_rate"</span><span class="p">:</span> <span class="s">"0.01"</span><span class="p">,</span> <span class="s">"batch_size"</span><span class="p">:</span> <span class="s">"16"</span><span class="p">},</span>
        <span class="n">num_nodes</span><span class="o">=</span><span class="mi">2</span><span class="p">,</span>
        <span class="n">resources_per_node</span><span class="o">=</span><span class="p">{</span><span class="s">"gpu"</span><span class="p">:</span> <span class="mi">1</span><span class="p">},</span>
    <span class="p">),</span>
    <span class="n">runtime</span><span class="o">=</span><span class="n">TrainerClient</span><span class="p">().</span><span class="n">get_runtime</span><span class="p">(</span><span class="s">"torch-distributed"</span><span class="p">),</span>
<span class="p">)</span>

<span class="c1"># Verify that your TrainJob is working with test hyperparameters.
</span><span class="n">TrainerClient</span><span class="p">().</span><span class="n">train</span><span class="p">(</span><span class="o">**</span><span class="n">template</span><span class="p">)</span>
</code></pre></div></div>

<p>Then optimize hyperparameters with a single call:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">optimizer</span> <span class="o">=</span> <span class="n">OptimizerClient</span><span class="p">()</span>

<span class="n">job_name</span> <span class="o">=</span> <span class="n">optimizer</span><span class="p">.</span><span class="n">optimize</span><span class="p">(</span>
    <span class="c1"># The same template can be used for Hyperparameter Optimisation
</span>    <span class="n">trial_template</span><span class="o">=</span><span class="n">template</span><span class="p">,</span>
    <span class="n">search_space</span><span class="o">=</span><span class="p">{</span>
        <span class="s">"learning_rate"</span><span class="p">:</span> <span class="n">Search</span><span class="p">.</span><span class="n">loguniform</span><span class="p">(</span><span class="mf">0.001</span><span class="p">,</span> <span class="mf">0.1</span><span class="p">),</span>
        <span class="s">"batch_size"</span><span class="p">:</span> <span class="n">Search</span><span class="p">.</span><span class="n">choice</span><span class="p">([</span><span class="mi">16</span><span class="p">,</span> <span class="mi">32</span><span class="p">,</span> <span class="mi">64</span><span class="p">,</span> <span class="mi">128</span><span class="p">]),</span>
    <span class="p">},</span>
    <span class="n">trial_config</span><span class="o">=</span><span class="n">TrialConfig</span><span class="p">(</span>
        <span class="n">num_trials</span><span class="o">=</span><span class="mi">20</span><span class="p">,</span>
        <span class="n">parallel_trials</span><span class="o">=</span><span class="mi">4</span><span class="p">,</span>
        <span class="n">max_failed_trials</span><span class="o">=</span><span class="mi">5</span><span class="p">,</span>
    <span class="p">),</span>
<span class="p">)</span>

<span class="c1"># Verify OptimizationJob was created
</span><span class="n">optimizer</span><span class="p">.</span><span class="n">get_job</span><span class="p">(</span><span class="n">job_name</span><span class="p">)</span>

<span class="c1"># Wait for OptimizationJob to complete
</span><span class="n">optimizer</span><span class="p">.</span><span class="n">wait_for_job_status</span><span class="p">(</span><span class="n">job_name</span><span class="p">)</span>

<span class="c1"># Get the best hyperparameters and metrics from an OptimizationJob
</span><span class="n">best_results</span> <span class="o">=</span> <span class="n">optimizer</span><span class="p">.</span><span class="n">get_best_results</span><span class="p">(</span><span class="n">job_name</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="n">best_results</span><span class="p">)</span>
<span class="c1"># Output:
# Result(
#     parameters={'learning_rate': '0.0234', 'batch_size': '64'},
#     metrics=[Metric(name='loss', min='0.78', max='0.78', latest='0.78')]
# )
</span>
<span class="c1"># See all the trials (TrainJobs) created during optimization
</span><span class="n">job</span> <span class="o">=</span> <span class="n">optimizer</span><span class="p">.</span><span class="n">get_job</span><span class="p">(</span><span class="n">job_name</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="n">job</span><span class="p">.</span><span class="n">trials</span><span class="p">)</span>
</code></pre></div></div>

<p>This creates multiple TrainJob instances (trials) with different hyperparameter combinations, executes them in parallel based on available resources, and tracks which parameters produce the best results. Each trial is a full training job managed by Kubeflow Trainer. Using <a href="https://www.kubeflow.org/docs/components/katib/user-guides/katib-ui/">Katib UI</a>, you can visualize your optimization with an interactive graph that shows metric performance against hyperparameter values across all trials.</p>

<p><img src="/images/2025-11-07-introducing-kubeflow-sdk/katib-ui.png" alt="Katib UI example" /></p>

<p>For more details about hyperparameter optimization, check out the <a href="https://github.com/kubeflow/sdk/tree/main/docs/proposals/46-hyperparameter-optimization">OptimizerClient KEP</a>.</p>

<h2 id="local-execution-mode">Local Execution Mode</h2>

<p>Local Execution Mode provides backend flexibility while maintaining full API compatibility with the Kubernetes backend, substantially reducing friction for AI practitioners when developing and iterating.</p>

<p>Choose the right execution environment for your stage of development:</p>

<h3 id="local-process-backend-fastest-iteration">Local Process Backend: Fastest Iteration</h3>

<p>The Local Process Backend is your starting point for ML development - offering the fastest possible iteration cycle with zero infrastructure overhead. This backend executes your training code directly as a Python subprocess on your local machine, bypassing containers, orchestration, and network complexity entirely.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">kubeflow.trainer.backends.localprocess</span> <span class="kn">import</span> <span class="n">LocalProcessBackendConfig</span>

<span class="n">config</span> <span class="o">=</span> <span class="n">LocalProcessBackendConfig</span><span class="p">()</span>
<span class="n">client</span> <span class="o">=</span> <span class="n">TrainerClient</span><span class="p">(</span><span class="n">config</span><span class="p">)</span>

<span class="c1"># Runs directly on your machine - no containers, no cluster
</span><span class="n">client</span><span class="p">.</span><span class="n">train</span><span class="p">(</span><span class="n">train_func</span><span class="o">=</span><span class="n">train_model</span><span class="p">)</span>
</code></pre></div></div>

<h3 id="container-backend-production-like-environment">Container Backend: Production-Like Environment</h3>

<p>The Container Backend bridges the gap between local development and production deployment by bringing production parity to your laptop. This backend executes your training code inside containers (using Docker or Podman), ensuring that your development environment matches your production environment byte-for-byte - same dependencies, same Python version, same system libraries, same everything.</p>

<p>Docker Example:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">kubeflow.trainer.backends.container</span> <span class="kn">import</span> <span class="n">ContainerBackendConfig</span>

<span class="n">config</span> <span class="o">=</span> <span class="n">ContainerBackendConfig</span><span class="p">(</span>
    <span class="n">container_runtime</span><span class="o">=</span><span class="s">"docker"</span><span class="p">,</span>
    <span class="n">auto_remove</span><span class="o">=</span><span class="bp">True</span>  <span class="c1"># Clean up containers after completion
</span><span class="p">)</span>

<span class="n">client</span> <span class="o">=</span> <span class="n">TrainerClient</span><span class="p">(</span><span class="n">config</span><span class="p">)</span>

<span class="c1"># Launch 2-node distributed training locally
</span><span class="n">client</span><span class="p">.</span><span class="n">train</span><span class="p">(</span><span class="n">train_func</span><span class="o">=</span><span class="n">train_model</span><span class="p">,</span> <span class="n">num_nodes</span><span class="o">=</span><span class="mi">2</span><span class="p">)</span>
</code></pre></div></div>

<p>Podman Example:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">kubeflow.trainer.backends.container</span> <span class="kn">import</span> <span class="n">ContainerBackendConfig</span>

<span class="n">config</span> <span class="o">=</span> <span class="n">ContainerBackendConfig</span><span class="p">(</span>
    <span class="n">container_runtime</span><span class="o">=</span><span class="s">"podman"</span><span class="p">,</span>
    <span class="n">auto_remove</span><span class="o">=</span><span class="bp">True</span>
<span class="p">)</span>

<span class="n">client</span> <span class="o">=</span> <span class="n">TrainerClient</span><span class="p">(</span><span class="n">config</span><span class="p">)</span>
<span class="n">client</span><span class="p">.</span><span class="n">train</span><span class="p">(</span><span class="n">train_func</span><span class="o">=</span><span class="n">train_model</span><span class="p">,</span> <span class="n">num_nodes</span><span class="o">=</span><span class="mi">2</span><span class="p">)</span>
</code></pre></div></div>

<h3 id="kubernetes-backend-production-scale">Kubernetes Backend: Production Scale</h3>

<p>The Kubernetes Backend takes the Kubeflow SDK to production scale - letting you deploy the exact same training code you developed locally to a production Kubernetes cluster with massive computational resources. This backend transforms your simple <code class="language-plaintext highlighter-rouge">client.train()</code> call into a full-fledged distributed training job managed by Kubeflow’s Trainer, complete with fault tolerance, resource scheduling, and cluster-wide orchestration.</p>

<p>Kubernetes Example:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">kubeflow.trainer.backends.kubernetes</span> <span class="kn">import</span> <span class="n">KubernetesBackendConfig</span>

<span class="n">config</span> <span class="o">=</span> <span class="n">KubernetesBackendConfig</span><span class="p">(</span>
    <span class="n">namespace</span><span class="o">=</span><span class="s">"ml-training"</span><span class="p">,</span>
<span class="p">)</span>

<span class="n">client</span> <span class="o">=</span> <span class="n">TrainerClient</span><span class="p">(</span><span class="n">config</span><span class="p">)</span>

<span class="c1"># Scales to hundreds of nodes - the same code you tested locally
</span><span class="n">client</span><span class="p">.</span><span class="n">train</span><span class="p">(</span>
    <span class="n">train_func</span><span class="o">=</span><span class="n">train_model</span><span class="p">,</span>
    <span class="n">num_nodes</span><span class="o">=</span><span class="mi">100</span><span class="p">,</span>
    <span class="n">packages_to_install</span><span class="o">=</span><span class="p">[</span><span class="s">"torch"</span><span class="p">,</span> <span class="s">"transformers"</span><span class="p">]</span>
<span class="p">)</span>
</code></pre></div></div>

<h1 id="whats-next">What’s Next?</h1>

<p>We’re just getting started. The Kubeflow SDK currently supports Trainer and Optimizer, but the vision is much bigger — a unified Python interface for the entire <a href="https://www.kubeflow.org/docs/started/architecture/#kubeflow-projects-in-the-ai-lifecycle">Cloud Native AI Lifecycle</a>.</p>

<p>Here’s what’s on the horizon:</p>

<ul>
  <li><a href="https://github.com/kubeflow/sdk/issues/125"><strong>Pipelines Integration</strong></a>: A PipelinesClient to build end-to-end ML workflows. Pipelines will reuse the core Kubeflow SDK primitives for training, optimization, and deployment in a single pipeline. The Kubeflow SDK will also power <a href="https://github.com/kubeflow/pipelines-components">KFP core components</a></li>
  <li><a href="https://github.com/kubeflow/sdk/issues/59"><strong>Model Registry Integration</strong></a>: Seamlessly manage model artifacts and versions across the training and serving lifecycle</li>
  <li><a href="https://github.com/kubeflow/sdk/issues/107"><strong>Spark Operator Integration</strong></a>: Data processing and feature engineering through a SparkClient interface</li>
  <li><a href="https://github.com/kubeflow/sdk/issues/50"><strong>Documentation</strong></a>: Full Kubeflow SDK documentation with guides, examples, and API references</li>
  <li><a href="https://github.com/kubeflow/sdk/issues/153"><strong>Local Execution for Optimizer</strong></a>: Run hyperparameter optimization experiments locally before scaling to Kubernetes</li>
  <li><a href="https://github.com/kubeflow/sdk/issues/48"><strong>Workspace Snapshots</strong></a>: Capture your entire development environment and reproduce it in distributed training jobs</li>
  <li><a href="https://github.com/kubeflow/sdk/issues/23"><strong>Multi-Cluster Support</strong></a>: Manage training jobs across multiple Kubernetes clusters from a single SDK interface</li>
  <li><a href="https://github.com/kubeflow/trainer/issues/2655"><strong>Distributed Data Cache</strong></a>: In-memory caching for large datasets via initializer SDK configuration</li>
  <li><a href="https://github.com/kubeflow/trainer/issues/2752"><strong>Additional Built-in Trainers</strong></a>: Support for more fine-tuning frameworks beyond TorchTune — <a href="https://github.com/unslothai/unsloth">Unsloth</a>, <a href="https://github.com/meta-pytorch/torchforge">torchforge</a>, <a href="https://github.com/axolotl-ai-cloud/axolotl">Axolotl</a>, <a href="https://github.com/hiyouga/LLaMA-Factory">LLaMA-Factory</a>, and others</li>
</ul>

<p>The community is driving these features forward. If you have ideas, feedback, or want to contribute, we’d love to hear from you!</p>

<h1 id="get-involved">Get Involved</h1>

<p>The Kubeflow SDK is built by and for the community. We welcome contributions, feedback, and participation from everyone!</p>

<p><strong>🔔 Help Shape the Future of Kubeflow SDK</strong></p>

<p>We want to hear from you! Take our <a href="https://docs.google.com/forms/d/e/1FAIpQLSet_IAFQzMMDWolzFt5LI9lhzqOOStjIGHxgYqKBnVcRtDfrw/viewform?usp=dialog">Kubeflow Unified SDK Survey</a> 
to help us understand your biggest pain points and identify which new features will provide the most value to you and 
your team. Your feedback directly influences our roadmap and priorities.</p>

<p><strong>Resources</strong>:</p>
<ul>
  <li><a href="https://github.com/kubeflow/sdk">GitHub Repo</a></li>
  <li><a href="https://docs.google.com/document/d/1rX7ELAHRb_lvh0Y7BK1HBYAbA0zi9enB0F_358ZC58w/edit?tab=t.0#heading=h.e0573r7wwkgl">Kubeflow SDK design document</a></li>
</ul>

<p><strong>Connect with the Community</strong>:</p>
<ul>
  <li>Join <a href="https://cloud-native.slack.com/archives/C08KJBVDH5H">#kubeflow-ml-experience</a> on <a href="https://www.kubeflow.org/docs/about/community/#kubeflow-slack-channels">CNCF Slack</a></li>
  <li>Attend the <a href="https://bit.ly/kf-ml-experience">Kubeflow SDK and ML Experience WG</a> meetings</li>
  <li>Check out <a href="https://github.com/kubeflow/sdk/issues?q=is%3Aissue+is%3Aopen+label%3A%22good+first+issue%22">good first issues</a> to get started</li>
</ul>]]></content><author><name>Kubeflow SDK Team</name></author><category term="sdk" /><category term="trainer" /><category term="optimizer" /><summary type="html"><![CDATA[⚡ We want your feedback! Help shape the future of Kubeflow SDK by taking our quick survey.]]></summary></entry><entry><title type="html">GSoC 2025: Meet Our Projects and Contributors 🚀</title><link href="https://blog.kubeflow.org/gsoc/community/kubeflow/2025/09/06/kubeflow-and-gsoc2025.html" rel="alternate" type="text/html" title="GSoC 2025: Meet Our Projects and Contributors 🚀" /><published>2025-09-06T00:00:00+00:00</published><updated>2025-09-06T00:00:00+00:00</updated><id>https://blog.kubeflow.org/gsoc/community/kubeflow/2025/09/06/kubeflow-and-gsoc2025</id><content type="html" xml:base="https://blog.kubeflow.org/gsoc/community/kubeflow/2025/09/06/kubeflow-and-gsoc2025.html"><![CDATA[<h2 id="introduction">Introduction</h2>

<p>Google Summer of Code (GSoC) 2025 has been an exciting journey for the Kubeflow community! We are very grateful to Google and to the open source community members for their dedication and effort. 🎉<br />
This year, 9 contributors from around the world collaborated with mentors to improve different parts of the Kubeflow ecosystem — from infrastructure and CI/CD, to notebooks, ML workflows, and beyond.</p>

<p>In this blog, we are highlighting all the projects that were part of <strong>GSoC 2025</strong>, their goals, the impact they’ve created, and the amazing contributors behind them.</p>

<p>👉 You can explore the full list on our <a href="https://www.kubeflow.org/events/gsoc-2025/">GSoC 2025 page</a>.</p>

<hr />

<h2 id="-project-highlights">📚 Project Highlights</h2>

<p>Below are the projects from this year’s GSoC. Each section includes a short summary, contributor details, and links to project resources.</p>

<hr />

<h3 id="project-1-kubeflow-platform-enhancements">Project 1: Kubeflow Platform Enhancements</h3>
<p><strong>Contributor:</strong> Harshvir Potpose (<a href="https://github.com/akagami-harsh">@akagami-harsh</a>)<br />
<strong>Mentors:</strong> Julius von Kohout (<a href="https://github.com/juliusvonkohout">@juliusvonkohout</a>)</p>

<p><strong>Overview:</strong><br />
Kubeflow needs up-to-date S3-compatible storage with hard multi-tenancy and must run its containers under the restricted PodSecurityStandards profile. MinIO transitioned to the AGPLv3 license in 2021, creating significant compliance challenges for the project.</p>

<p>This project addressed this critical blocker by implementing SeaweedFS as a production-ready replacement for MinIO. SeaweedFS offers a more permissive Apache 2.0 license while providing superior performance characteristics and enterprise-grade security and reliability.</p>

<p><strong>Key Outcomes:</strong></p>
<ul>
  <li>Provided S3 storage with hard multi-tenancy</li>
  <li>Successfully migrated to SeaweedFS as a secure replacement for MinIO and integrated it into Kubeflow Pipelines</li>
  <li>Eliminated MinIO’s licensing constraints by adopting SeaweedFS’s more permissive license model</li>
  <li>Implemented comprehensive CI tests for SeaweedFS deployment and namespace isolation functionality</li>
  <li>Strengthened the manifests repository’s CI pipeline and contributed to the dashboard migration efforts</li>
  <li>Enforced the PodSecurityStandards baseline/restricted profiles</li>
</ul>

<p><strong>Resources:</strong></p>
<ul>
  <li>📄 <a href="https://summerofcode.withgoogle.com/programs/2025/projects/PWDq4Zvt">Project Page</a></li>
  <li>✍️ <a href="https://medium.com/@hpotpose26/kubeflow-pipelines-embraces-seaweedfs-9a7e022d5571">Personal Blog: Kubeflow Pipelines Embraces SeaweedFS</a></li>
</ul>

<hr />

<h3 id="project-2-kserve-models-web-application-modernization">Project 2: KServe Models Web Application Modernization</h3>
<p><strong>Contributor:</strong> (GitHub: <a href="https://github.com/LogicalGuy77">@LogicalGuy77</a>)<br />
<strong>Mentors:</strong> Griffin Sullivan (<a href="https://github.com/Griffin-Sullivan">@Griffin-Sullivan</a>), Julius von Kohout (<a href="https://github.com/juliusvonkohout">@juliusvonkohout</a>)</p>

<p><strong>Overview:</strong><br />
This project revived and modernized the KServe Models Web Application (Angular + Flask), the UI used to manage machine learning inference services in Kubeflow via KServe. What began as a small Node.js update evolved into a comprehensive upgrade of the frontend stack, CI/CD, testing, and feature set—bringing the app up to modern standards and making it easier for both users and contributors to work with.</p>

<p><strong>Key Outcomes:</strong></p>
<ul>
  <li>Modernized core stack: upgraded Node.js (v16 → v23) and Angular (v12 → v14), resolving security issues and improving performance</li>
  <li>Migrated container images from Docker Hub to GitHub Container Registry (GHCR) to avoid rate limits and improve reliability</li>
  <li>Overhauled CI/CD with GitHub Actions: updated actions, added intelligent caching for pip, Docker layers, and node_modules for significantly faster builds</li>
  <li>Introduced Jest unit tests for core utilities (e.g., parsing Kubernetes object statuses and KServe predictor configs)</li>
  <li>Added Cypress end-to-end tests for critical user journeys (deploy, edit, delete) including failure handling and input validation</li>
  <li>Wrote comprehensive documentation to help contributors run and extend the test suites</li>
  <li>Shipped “Edit InferenceService YAML” directly in the UI via an integrated Monaco editor—no kubectl required</li>
  <li>Fixed RawDeployment-mode crash and added ModelMesh support so resources and statuses render correctly</li>
  <li>Added support for the latest KServe predictor runtimes, including HuggingFace</li>
  <li>Simplified contributor onboarding with a Makefile that automates full frontend setup in a single command</li>
  <li>Implemented runtime-configurable settings via a new <code class="language-plaintext highlighter-rouge">/api/config</code> endpoint (e.g., Grafana DB names, URL prefixes)</li>
  <li>Cut the v0.15.0 release of the Models Web App, consolidating months of modernization and feature work</li>
</ul>

<p><strong>By the Numbers:</strong></p>
<ul>
  <li>PRs merged: 19</li>
  <li>Issues closed: 8</li>
  <li>Lines of code changed: +22,309 / −11,628</li>
  <li>Frontend: Angular, TypeScript, SCSS</li>
  <li>Backend: Flask (Python)</li>
  <li>CI/CD: GitHub Actions, Docker</li>
  <li>Local cluster: Kubernetes (Kind) + Istio + Kubeflow</li>
</ul>

<p><strong>Resources:</strong></p>
<ul>
  <li><a href="https://github.com/kserve/models-web-app">Project Repo: kserve/models-web-app</a></li>
  <li><a href="https://github.com/kserve/models-web-app/commits?author=LogicalGuy77">All commits by @LogicalGuy77</a></li>
  <li><a href="https://medium.com/@harshitweb3/my-gsoc-2025-journey-reviving-kserves-models-web-application-2f18ef16fb51">Blog Post</a></li>
</ul>

<hr />

<h3 id="project-3-istio-cni-and-ambient-mesh">Project 3: Istio CNI and Ambient Mesh</h3>
<p><strong>Contributor:</strong> Ayush Gupta (GitHub: <a href="https://github.com/madmecodes">@madmecodes</a>)<br />
<strong>Mentors:</strong> Julius von Kohout (<a href="https://github.com/juliusvonkohout">@juliusvonkohout</a>), Kimonas Sotirchos (<a href="https://github.com/kimwnasptd">@kimwnasptd</a>)</p>

<p><strong>Overview:</strong><br />
This GSoC 2025 project modernized Kubeflow’s service mesh infrastructure by making Istio CNI the default configuration and pioneering Istio Ambient Mesh support. The 175-hour medium-difficulty project involved 25+ pull requests across multiple Kubeflow repositories: transitioning from the traditional sidecar-based architecture to ambient mesh with ztunnel and waypoint proxies, migrating to the Gateway API (HTTPRoute), implementing path-based routing for KServe model serving endpoints, and using the Kustomize overlay method for easy installation and configuration management.</p>

<p><strong>Key Outcomes:</strong></p>
<ul>
  <li>Implemented Istio CNI by default with Kustomize overlay method enabling easy switching between traditional Istio and CNI configurations</li>
  <li>Created path-based routing for KServe multi-model serving and Gateway API (HTTPRoute) migration</li>
  <li>Pioneered Ambient Mesh support with ztunnel/waypoint proxies and coordinating cross-repository compatibility</li>
</ul>

<p><strong>Resources:</strong></p>
<ul>
  <li>📄 <a href="https://summerofcode.withgoogle.com/programs/2025/projects/WAHCCi8V">Project Page</a></li>
  <li>✍️ <a href="https://medium.com/@ayushguptadev1/gsoc25-kubeflow-securing-and-optimizing-ml-infrastructure-with-istio-31f535c77fd6">Blog Post</a></li>
</ul>

<hr />

<h3 id="project-4-deploying-kubeflow-with-helm-charts">Project 4: Deploying Kubeflow with Helm Charts</h3>

<p><strong>Contributor:</strong> Kunal Dugar (<a href="https://github.com/kunal-511">@kunal-511</a>)<br />
<strong>Mentors:</strong> Julius von Kohout (<a href="https://github.com/juliusvonkohout">@juliusvonkohout</a>), Valentina Rodriguez Sosa (<a href="https://github.com/varodrig">@varodrig</a>), Chase Cadet (<a href="https://github.com/Chasecadet">@Chasecadet</a>)</p>

<p><strong>Overview:</strong><br />
This project focused on creating component-based Helm charts for Kubeflow, enabling flexible and incremental deployment of ML infrastructure. Instead of requiring a full platform installation, users can now deploy specific components like Katib, Pipelines, Model Registry, and others independently with customized configurations.</p>

<p><strong>Key Outcomes:</strong></p>
<ul>
  <li>End-to-end testing of the Kubeflow AI Reference Platform</li>
  <li>Created production-ready Helm charts for Katib, Model Registry, KServe Web App, Notebook Controller, and Kubeflow Pipelines—enabling one-command deployment of individual components</li>
  <li>Built automated testing infrastructure with diff tools to validate Helm charts against Kustomize manifests, ensuring accuracy and catching regressions quickly</li>
  <li>Enabled incremental Kubeflow adoption, reducing deployment complexity from days to hours for organizations building production ML platforms</li>
</ul>

<p><strong>Resources:</strong></p>
<ul>
  <li>📄 <a href="https://summerofcode.withgoogle.com/programs/2025/projects/">Project Page</a></li>
  <li>🧩 <a href="https://github.com/kubeflow/community/pull/832">Kubeflow Enhancement Proposal (KEP)-831-Kubeflow-Helm-Support: Support Helm as an Alternative for Kustomize</a></li>
  <li>✍️ <a href="https://medium.com/@kunalD02/my-gsoc-journey-deploying-kubeflow-with-helm-charts-e7f9dea7b56e">Blog: My GSoC Journey: Deploying Kubeflow with Helm Charts</a></li>
</ul>

<hr />

<h3 id="project-5-jupyterlab-plugin-for-kubeflow">Project 5: JupyterLab Plugin for Kubeflow</h3>

<p><strong>Contributor:</strong> Amrit Kumar (<a href="https://github.com/Amrit27k">@Amrit27k</a>)<br />
<strong>Mentors:</strong> Eder Ignatowicz (<a href="https://github.com/ederign">@ederign</a>), Stefano Fioravanzo (<a href="https://github.com/StefanoFioravanzo">@StefanoFioravanzo</a>)</p>

<p><strong>Overview:</strong>
The project fully modernized Kubeflow Kale’s architecture, migrating the backend from KFPv1 to KFPv2 with a new Jinja2 templating system for notebook-to-pipeline conversion. The initiative also featured a complete overhaul of the JupyterLab frontend (TypeScript v5.9.2, MUI v7) and comprehensive updates to GitHub workflows, documentation, and dependencies to meet modern community standards.</p>

<p><strong>Key Outcomes:</strong></p>
<ul>
  <li>Rebuilt the Kale backend to support the modern, future-proof Kubeflow Pipelines v2 (KFPv2) architecture, moving away from the deprecated KFPv1.</li>
  <li>Implemented a new Jinja2 templating system that intelligently converts annotated Jupyter notebook cells into valid KFPv2 Python DSL scripts.</li>
  <li>Updated the JupyterLab frontend extension using current standards (TypeScript v5.9.2, JupyterLab v4, and MUI v7), resolving hundreds of legacy compatibility issues.</li>
  <li>Integrated KFPv2’s robust system for better type-safe artifact handling and automated ML Metadata registration, ensuring rich lineage tracking for pipeline steps.</li>
  <li>Standardized the project structure, updated GitHub workflows, and implemented UI test scripts to align with community standards and ensure maintainability for future contributors.</li>
</ul>

<p><strong>Resources:</strong></p>
<ul>
  <li>📄 <a href="https://github.com/kubeflow-kale/kale">Project Repo - Kubeflow Kale</a></li>
  <li>🧩 <a href="https://github.com/kubeflow-kale/kale/issues/457">Kubeflow Kale 2.0- Project Roadmap</a></li>
  <li>✍️ <a href="https://medium.com/@amritkmr4272/from-notebooks-to-pipelines-my-gsoc25-journey-modernizing-kubeflow-kale-with-kfpv2-and-e098f194208c">Blog: From Notebooks to Pipelines: My GSoC’25 Journey Modernizing Kubeflow Kale with KFPv2 and Jupyterlabv4</a></li>
</ul>

<hr />

<h3 id="project-6-spark-operator-with-kubeflow-notebooks">Project 6: Spark Operator with Kubeflow Notebooks</h3>

<p><strong>Contributor:</strong> Fellipe Resende (<a href="https://github.com/fresende">@fresende</a>)<br />
<strong>Mentors:</strong> Shekhar Rajak (<a href="https://github.com/Shekharrajak">@Shekharrajak</a>),
Luciano Resende (<a href="https://github.com/lresende">@lresende</a>),
Chaoran Yu (<a href="https://github.com/yuchaoran2011">@yuchaoran2011</a>),
Andrey Velichkevich (<a href="https://github.com/andreyvelich">@andreyvelich</a>)</p>

<p><img src="/images/2025-09-06-kubeflow-and-gsoc2025/project6.png" alt="Diagram" /></p>

<p><strong>Overview:</strong>
This project enables seamless PySpark execution within Kubeflow Notebooks by integrating the Spark Operator and Jupyter Enterprise Gateway. It allows data scientists to run distributed machine learning and big data workloads directly from their notebooks on Kubernetes, simplifying workflows, eliminating Spark infrastructure overhead, and improving both usability and scalability within the Kubeflow ecosystem.</p>

<p><strong>Key Outcomes:</strong></p>

<ul>
  <li>
    <p>Extended Kubeflow Notebooks to integrate seamlessly with Spark via the Spark Operator, leveraging Jupyter Enterprise Gateway to manage the Spark application lifecycle.</p>
  </li>
  <li>
    <p>Enabled data scientists and ML engineers to run distributed big data workloads on Spark directly from inside Kubeflow Notebooks, without manual cluster setup.</p>
  </li>
  <li>
    <p>Provided documentation and guidance for setting up, configuring, and customizing Kubeflow Notebook environments integrated with the Spark Operator, enabling users to run scalable distributed Spark workloads directly from Jupyter-based workflows.</p>
  </li>
</ul>

<p><strong>Resources:</strong></p>

<ul>
  <li>📘 <a href="https://www.kubeflow.org/docs/components/spark-operator/user-guide/notebooks-spark-operator/">Main Documentation Page</a></li>
  <li>🎥 <a href="https://youtu.be/g7tctdeitvc">Setup Demo Video</a></li>
  <li>🐞 <a href="https://www.youtube.com/watch?v=p6K6PdlkmeU">Debugging Demo Video</a></li>
  <li>📄 <a href="https://summerofcode.withgoogle.com/programs/2025/projects/zRPtxGBI">Project Page</a></li>
  <li>💻 <a href="https://github.com/kubeflow/website/pull/4141">Implementation Pull Request</a></li>
</ul>

<hr />

<h3 id="project-7-gpu-testing-for-llm-blueprints">Project 7: GPU Testing for LLM Blueprints</h3>

<p><strong>Contributor:</strong> Akash Jaiswal (<a href="https://github.com/jaiakash">@jaiakash</a>)<br />
<strong>Mentors:</strong> Andrey Velichkevich (<a href="https://github.com/andreyvelich">@andreyvelich</a>), Valentina Rodriguez Sosa(<a href="https://github.com/varodrig">@varodrig</a>)</p>

<p><img src="/images/2025-09-06-kubeflow-and-gsoc2025/project7.png" alt="Diagram" /></p>

<p><strong>Overview:</strong><br />
We had a few examples in the repository that we wanted to include in our end-to-end (E2E) tests, but all of them were CPU-based. Projects like Torchtune and Qwen 2.5, for instance, require GPU resources to run — yet our existing CI setup couldn’t validate them at all because it was entirely CPU-focused.</p>

<p>This created a major gap: whenever someone contributed a new LLM example or modified the trainer logic, we had no automated way to verify if those changes would work in a GPU environment — the same environment where these workloads are actually deployed in production.</p>

<p>The goal of this project was to add GPU-backed testing directly to our CI/CD workflow.</p>

<p><strong>Key Outcomes:</strong></p>

<ul>
  <li>
    <p>Integrating GPU runners into GitHub Actions so that any pull request could automatically trigger GPU-backed E2E tests.</p>
  </li>
  <li>
    <p>Making the setup scalable and cost-efficient — instead of maintaining expensive GPU machines 24/7, we needed an on-demand system that provisions GPU resources only when a test is triggered.</p>
  </li>
</ul>

<p><strong>Resources:</strong></p>

<ul>
  <li>📄 <a href="https://summerofcode.withgoogle.com/programs/2025/projects/fwZkvPr0">Project Page</a></li>
  <li>🧩 <a href="https://github.com/kubeflow/trainer/pull/2689">Kubeflow Enhancement Proposal (KEP)</a></li>
  <li>✍️ <a href="https://my-experience-with-kubeflow-for-gsoc.hashnode.dev/gsoc-2025-with-kubeflow-scaling-gpu-testing-for-llm-blueprints">Personal Blog: Scaling GPU Testing for LLM Blueprints</a></li>
</ul>

<hr />

<h3 id="project-10-support-volcano-scheduler-in-kubeflow-trainer">Project 10: Support Volcano Scheduler in Kubeflow Trainer</h3>
<p><strong>Contributor:</strong> Xinmin Du (GitHub: <a href="https://github.com/Doris-xm">@Doris-xm</a>)<br />
<strong>Mentors:</strong> Shao Wang (<a href="https://github.com/Electronic-Waste">@Electronic-Waste</a>), Yuchen Cheng(<a href="https://github.com/rudeigerc">@rudeigerc</a>)</p>

<p><strong>Overview:</strong><br />
The project aims to integrate the <strong>Volcano scheduler</strong> into Kubeflow Trainer as a <strong>runtime plugin</strong>.
This will allow users to take advantage of advanced AI-specific scheduling features, such as Gang Scheduling and priority scheduling, supported by Volcano.</p>

<p><strong>Key Outcomes:</strong></p>
<ul>
  <li>Integrated the <strong>Volcano</strong> scheduler into Trainer as a runtime plugin to support Gang Scheduling and resource management for distributed training jobs.</li>
  <li>Enabled AI-specific features such as priority scheduling, queue-based management, and network topology–aware scheduling.</li>
</ul>

<p><strong>Resources:</strong></p>

<ul>
  <li>📄 <a href="https://summerofcode.withgoogle.com/programs/2025/projects/ZWbY1Rfj">Project Page</a></li>
  <li>🧩 <a href="https://github.com/kubeflow/trainer/pull/2672">Kubeflow Enhancement Proposal (KEP)</a></li>
</ul>

<hr />

<h3 id="project-12-empowering-kubeflow-documentation-with-llms-">Project 12: Empowering Kubeflow Documentation with LLMs 🤖</h3>
<p><strong>Contributor:</strong> Santhosh Toorpu (GitHub: <a href="https://github.com/SanthoshToorpu">@SanthoshToorpu</a>)<br />
<strong>Mentors:</strong> Francisco Javier Arceo (<a href="https://github.com/franciscojavierarceo">@franciscojavierarceo</a>), Chase Cadet (<a href="https://github.com/Chasecadet">@Chasecadet</a>)</p>

<p><strong>Overview:</strong><br />
This project introduced an intelligent documentation assistant that uses <strong>Retrieval-Augmented Generation (RAG)</strong> and <strong>KServe-hosted LLMs</strong> to enhance the Kubeflow documentation experience. The goal was to help users find relevant, accurate answers drawn from Kubeflow docs, GitHub issues, and community discussions — all through a conversational interface on the Kubeflow website.</p>

<p>The system leverages <strong>Kubeflow Pipelines</strong> to automate documentation ingestion and indexing, <strong>Milvus</strong> for semantic vector search, and <strong>FastAPI with WebSockets</strong> for real-time interactions. Built on Kubernetes, the architecture follows Kubeflow’s MLOps principles end-to-end — from automated retraining and indexing to monitored LLM inference served via KServe.</p>

<p><strong>Key Outcomes:</strong></p>
<ul>
  <li>Designed and deployed an <strong>LLM-powered Documentation Assistant</strong> using Kubeflow-native tools (KFP, KServe, Feast, Milvus).</li>
  <li>Implemented <strong>automated documentation indexing pipelines</strong> triggered by GitHub Actions to keep vector embeddings up-to-date.</li>
  <li>Developed an <strong>interactive chat interface</strong> integrated into the Kubeflow website for natural-language documentation search.</li>
  <li>Introduced a <strong>RAG agentic workflow</strong> with tool-calling to decide when to retrieve external documentation or use model knowledge.</li>
  <li>Implemented <strong>RBAC-based access control</strong> for pipelines and KServe endpoints to align with Kubeflow’s multi-user isolation standards.</li>
  <li>Developed a <strong>feedback loop system</strong> (“👍 / 👎”) to improve the model’s performance and documentation quality.</li>
  <li>Delivered a functional prototype hosted on Kubernetes, showcasing real-time semantic search across Kubeflow repositories.</li>
</ul>

<p><strong>Resources:</strong></p>
<ul>
  <li>📄 <a href="https://summerofcode.withgoogle.com/programs/2025/projects/a9JPxfEh">Project Page</a></li>
  <li>🧠 <a href="https://github.com/kubeflow/docs-agent">Demo Repo</a></li>
  <li>✍️ <a href="https://medium.com/@toorpusanthosh/empowering-kubeflow-documentation-with-llms-my-gsoc-journey-58eb946ba2af">Blog Post: Empowering Kubeflow Documentation with LLMs</a></li>
</ul>

<hr />

<h2 id="-wrapping-up">🎉 Wrapping Up</h2>

<p>We are proud of what our GSoC 2025 contributors achieved and the impact they have made on the Kubeflow ecosystem. Their work not only strengthens existing components but also lays the foundation for future innovation in MLOps and AI infrastructure.</p>

<p>A huge <strong>thank you</strong> 🙏 to all contributors, mentors, and community members who made this program a success.</p>

<hr />

<h2 id="-want-to-get-involved">👩‍💻 Want to Get Involved?</h2>

<p>The Kubeflow community is open to contributors of all backgrounds and skill levels. Whether you are passionate about ML infrastructure, frontend, DevOps, or documentation — there’s a place for you here.</p>

<ul>
  <li>💻 Visit our <a href="https://www.kubeflow.org/docs/about/community/">website</a> and <a href="https://github.com/kubeflow">GitHub</a></li>
  <li>💬 Join our <a href="https://www.kubeflow.org/docs/about/community/">Slack</a></li>
  <li>🗓️ Attend the <a href="https://www.kubeflow.org/docs/about/community/#kubeflow-community-call">community calls</a></li>
  <li>📩 Subscribe to the <a href="https://groups.google.com/g/kubeflow-discuss">kubeflow-discuss</a> mailing list</li>
</ul>

<p>Let’s continue building the future of MLOps together 🚀</p>]]></content><author><name>Kubeflow Outreach Team</name></author><category term="gsoc" /><category term="community" /><category term="kubeflow" /><summary type="html"><![CDATA[Introduction]]></summary></entry><entry><title type="html">KubeCon India 2025 with Kubeflow: Our Community Experience</title><link href="https://blog.kubeflow.org/kubecon/community/2025/08/23/kubecon-2025-india-kubeflow.html" rel="alternate" type="text/html" title="KubeCon India 2025 with Kubeflow: Our Community Experience" /><published>2025-08-23T00:00:00+00:00</published><updated>2025-08-23T00:00:00+00:00</updated><id>https://blog.kubeflow.org/kubecon/community/2025/08/23/kubecon-2025-india-kubeflow</id><content type="html" xml:base="https://blog.kubeflow.org/kubecon/community/2025/08/23/kubecon-2025-india-kubeflow.html"><![CDATA[<h2 id="introduction">Introduction</h2>

<p><img src="/images/2025-08-23-kubecon-2025-india-kubeflow/KubeConIndiaKeynote.png" alt="KubeCon India 2025" /></p>

<p>KubeCon + CloudNativeCon India 2025 in Hyderabad was an absolute blast! As a second-time attendee (<a href="https://github.com/jaiakash">Akash Jaiswal</a>) and a first-time attendee (<a href="https://github.com/yashpal2104">Yash Pal</a>), we couldn’t help but be blown away by the incredible energy at one of the world’s biggest cloud native gatherings. We were super excited to see Kubeflow get a special shoutout during the opening keynote for its role in cloud native AI/ML and MLOps - it definitely made us proud to be part of the community! (The image above shows the keynote moment.)</p>

<p>We also got super lucky with the chance to volunteer at the Kubeflow booth this year, and we met <a href="https://github.com/johnugeorge">Johnu George</a> in person, who delivered two amazing talks on Kubeflow’s latest capabilities. It was really exciting to finally meet community members face-to-face whom we’d previously only seen in community calls and on Slack!</p>

<p>This blog shares all the exciting bits from our packed two days at KubeCon - from awesome booth conversations to technical deep-dives. We hope this motivates more community members not just to contribute but also to attend and help Kubeflow at events like KubeCon. Trust us, you won’t want to miss the next one! 😊</p>

<h2 id="featured-talks">Featured Talks</h2>

<ul>
  <li><strong>Cloud Native GenAI using KServe and OPEA</strong>
<strong>Speakers:</strong> <a href="https://github.com/johnugeorge">Johnu George</a>, Gavrish Prabhu (Nutanix)
<strong>Sched Link:</strong> <a href="https://kccncind2025.sched.com/event/23EtS/cloud-native-genai-using-kserve-and-opea-johnu-george-gavrish-prabhu-nutanix">View on Sched</a></li>
</ul>

<iframe width="100%" height="400" src="https://www.youtube.com/embed/0o8Ng0E1rrA?list=PLj6h78yzYM2MEQTMX_LIOK1hrePHxLD6U" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen=""></iframe>

<ul>
  <li><strong>Bridging Big Data and Machine Learning Ecosystems</strong>
<strong>Speakers:</strong> <a href="https://github.com/johnugeorge">Johnu George</a>, Shiv Jha (Nutanix)
<strong>Sched Link:</strong> <a href="https://kccncind2025.sched.com/event/23Eur/bridging-big-data-and-machine-learning-ecosystems-a-cloud-native-approach-using-kubeflow-johnu-george-shiv-jha-nutanix">View on Sched</a></li>
</ul>

<iframe width="100%" height="400" src="https://www.youtube.com/embed/3NWFCKUhB3A?list=PLj6h78yzYM2MEQTMX_LIOK1hrePHxLD6U" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen=""></iframe>

<h2 id="kubeflow-booth-highlights">Kubeflow Booth Highlights</h2>

<p><img src="/images/2025-08-23-kubecon-2025-india-kubeflow/KubeflowBoothPic.png" alt="Kubeflow Booth" /></p>

<p>Here’s a picture of our Kubeflow booth volunteer team. It was really great to meet and interact with audiences who had dozens of questions about Kubeflow, contributors who wanted to help, and developers who were already using it and shared their experiences.</p>

<p>Here are some key highlights from our booth conversations:</p>

<ul>
  <li><strong>Community Engagement:</strong>
    <ul>
      <li>Discussions on real-world use cases and deployment strategies. A few users shared their experience of running Kubeflow at their companies and how it’s benefiting them.</li>
      <li>Many attendees wanted to learn how to explore and contribute to Kubeflow. (Our answer: join the community calls and check out GitHub for open issues.)</li>
      <li>Several companies expressed interest in adopting projects like Kubeflow. A few senior engineers were already using it for some of their workloads and now want to move it into production.</li>
    </ul>
  </li>
  <li><strong>Popular Questions from Audience:</strong>
    <ul>
      <li>How does Kubeflow simplify ML workflows using Kubernetes? Can you clarify why Kubeflow is not multicluster agnostic?
Answer: You can submit jobs to five different, independent Kubeflow clusters if you want to, so we do not think built-in multicluster support is needed. We offer APIs for external access (KFP, and everything you can also do in the UI), so a single Kubeflow deployment does not need to span multiple clusters. If you want to span multiple regions, either use the APIs of independent Kubeflow clusters in each region and submit your jobs there, or use a Kubernetes layer that transparently handles clusters spanning multiple regions. Adding this complexity burden to Kubeflow itself would not offer much benefit.</li>
      <li>How does Kubeflow integrate with other cloud-native tools? How is Kubeflow different from other tools in the industry?</li>
      <li>What are the security considerations for running ML pipelines? How can Kubeflow help optimize costs when working with LLMs, especially in terms of minimizing GPU usage to stay within quota limits while still delivering performance?</li>
      <li>How mature is Kubeflow today, and how well does it align with the workflows of different MLOps teams? What is the timeline for Kubeflow’s graduation? What does the Kubeflow roadmap look like?</li>
      <li>Why has Kubeflow chosen to integrate with ArgoCD rather than Tekton CD? (This question came from a maintainer of the Tekton project.)</li>
    </ul>
  </li>
</ul>

<h2 id="our-experience">Our experience</h2>

<p>What an incredible journey these past two days have been! Beyond the technical talks and booth duties, what really stood out was the genuine excitement around Kubeflow in the community. Seeing users’ faces light up when sharing their success stories, or watching newcomers get that “aha!” moment during demos - these are the moments that make community events special.</p>

<p>The technical discussions were mind-blowing too. From hearing how startups are using Kubeflow to train their LLMs, to learning how enterprises are scaling it across thousands of models - each conversation taught us something new. We even got into some heated (but friendly!) debates about MLOps architectures and the future of AI on Kubernetes.</p>

<p>But the best part? The people. From meeting community members we’d only known through Slack emojis and GitHub comments to sharing chai and biryani with fellow contributors - these personal connections are what make the open source community truly special. Can’t wait for the next one! 🚀</p>

<h2 id="want-to-help">Want to help?</h2>

<p>The Kubeflow community holds open meetings and is always looking for more volunteers and users to unlock the potential of machine learning. If you’re interested in becoming a Kubeflow contributor, please feel free to check out the resources below. We look forward to working with you!</p>

<ul>
  <li>Visit our <a href="https://www.kubeflow.org/docs/about/community/">website</a> or <a href="https://github.com/kubeflow">GitHub</a> page.</li>
  <li>Join the <a href="https://www.kubeflow.org/docs/about/community/">Kubeflow Slack channels</a>.</li>
  <li>Join the <a href="https://groups.google.com/g/kubeflow-discuss">kubeflow-discuss</a> mailing list.</li>
  <li>Want to volunteer at events like this? Join the <a href="https://cloud-native.slack.com/archives/C078ZMRQPB6">kubeflow-outreach</a> channel on CNCF Slack.</li>
  <li>Attend our weekly <a href="https://www.kubeflow.org/docs/about/community/#kubeflow-community-call">community meeting</a>.</li>
</ul>

<p>Feel free to share your thoughts or questions in the comments!</p>]]></content><author><name>Akash Jaiswal, Yash Pal</name></author><category term="kubecon" /><category term="community" /><summary type="html"><![CDATA[Introduction]]></summary></entry><entry><title type="html">Democratizing AI Model Training on Kubernetes: Introducing Kubeflow Trainer V2</title><link href="https://blog.kubeflow.org/trainer/intro/" rel="alternate" type="text/html" title="Democratizing AI Model Training on Kubernetes: Introducing Kubeflow Trainer V2" /><published>2025-07-21T00:00:00+00:00</published><updated>2025-07-21T00:00:00+00:00</updated><id>https://blog.kubeflow.org/trainer/introducing-trainer-v2</id><content type="html" xml:base="https://blog.kubeflow.org/trainer/intro/"><![CDATA[<p>Running machine learning workloads on Kubernetes can be challenging.
Distributed training and LLM fine-tuning, in particular, involve managing multiple nodes, GPUs, large datasets, and fault tolerance, which often requires deep Kubernetes knowledge.
The <strong>Kubeflow Trainer v2 (KF Trainer)</strong> was created to hide this complexity by abstracting Kubernetes away from AI Practitioners and providing the easiest, most scalable way to run distributed PyTorch jobs.</p>

<p><strong>The main goals of Kubeflow Trainer v2 include:</strong></p>
<ul>
  <li>Make AI/ML workloads easier to manage at scale</li>
  <li>Provide a Pythonic interface to train models</li>
  <li>Deliver the easiest and most scalable PyTorch distributed training on Kubernetes</li>
  <li>Add built-in support for fine-tuning large language models</li>
  <li>Abstract Kubernetes complexity from AI Practitioners</li>
  <li>Consolidate efforts between Kubernetes Batch WG and Kubeflow community</li>
</ul>

<p>We’re deeply grateful to all contributors and community members who made the <strong>Trainer v2</strong> possible with their hard work and valuable feedback.
We’d like to give special recognition to <a href="https://github.com/andreyvelich">andreyvelich</a>, <a href="https://github.com/tenzen-y">tenzen-y</a>, <a href="https://github.com/electronic-waste">electronic-waste</a>, <a href="https://github.com/astefanutti">astefanutti</a>, <a href="https://github.com/ironicbo">ironicbo</a>, <a href="https://github.com/mahdikhashan">mahdikhashan</a>, <a href="https://github.com/kramaranya">kramaranya</a>, <a href="https://github.com/harshal292004">harshal292004</a>, <a href="https://github.com/akshaychitneni">akshaychitneni</a>, <a href="https://github.com/chenyi015">chenyi015</a> and the rest of the contributors.
We would also like to highlight <a href="https://github.com/ahg-g">ahg-g</a>, <a href="https://github.com/kannon92">kannon92</a>, and <a href="https://github.com/vsoch">vsoch</a> whose feedback was essential while we designed the Kubeflow Trainer architecture together with the Batch WG.
See the full <a href="https://kubeflow.devstats.cncf.io/d/66/developer-activity-counts-by-companies?orgId=1&amp;var-period_name=Last%206%20months&amp;var-metric=commits&amp;var-repogroup_name=kubeflow%2Ftrainer&amp;var-country_name=All&amp;var-companies=All">contributor list</a> for everyone who helped make this release possible.</p>

<h1 id="background-and-evolution">Background and Evolution</h1>

<p><strong>Kubeflow Trainer v2</strong> represents the next evolution of the <strong>Kubeflow Training Operator</strong>, building on over seven years of experience running ML workloads on Kubernetes.
The journey began in 2017 when the <strong>Kubeflow</strong> project introduced <strong>TFJob</strong> to orchestrate TensorFlow training on Kubernetes.
At that time, Kubernetes lacked many of the advanced batch processing features needed for distributed ML training, so the community had to implement these capabilities from scratch.</p>

<p>Over the years, the project expanded to support multiple ML frameworks including <strong>PyTorch</strong>, <strong>MXNet</strong>, <strong>MPI</strong>, and <strong>XGBoost</strong> through various specialized operators.
In 2021, these were consolidated into the unified <strong><a href="https://docs.google.com/document/d/1x1JPDQfDMIbnoQRftDH1IzGU0qvHGSU4W6Jl4rJLPhI/edit?tab=t.0#heading=h.e33ufidnl8z6">Training Operator v1</a></strong>.
Meanwhile, the Kubernetes community introduced the <strong>Batch Working Group</strong>, developing important APIs like JobSet, Kueue, Indexed Jobs, and PodFailurePolicy that improved HPC and AI workload management.</p>

<p><strong>Trainer v2</strong> leverages these Kubernetes-native improvements, reusing existing functionality rather than reinventing the wheel.
This collaboration between the Kubernetes and Kubeflow communities delivers a more standardized approach to ML training on Kubernetes.</p>

<h1 id="user-personas">User Personas</h1>

<p>One of the main challenges with ML training on Kubernetes is that it often requires <strong>AI Practitioners</strong> to have an understanding of <strong>Kubernetes concepts</strong> and the <strong>infrastructure</strong> being used for training. This distracts AI Practitioners from their primary focus.</p>

<p><strong>The KF Trainer v2</strong> addresses this by <strong>separating the infrastructure configuration from the training job definition</strong>.
This separation is built around three new custom resource definitions (CRDs):</p>
<ul>
  <li><code class="language-plaintext highlighter-rouge">TrainingRuntime</code> - a namespace-scoped resource that contains the infrastructure details that are required for a training job, such as the training image to use, failure policy, and gang-scheduling configuration.</li>
  <li><code class="language-plaintext highlighter-rouge">ClusterTrainingRuntime</code> - similar to <code class="language-plaintext highlighter-rouge">TrainingRuntime</code>, but cluster scoped.</li>
  <li><code class="language-plaintext highlighter-rouge">TrainJob</code> - specifies the training job configuration, including the training code to run, config for pulling the training dataset &amp; model, and a reference to the training runtime.</li>
</ul>

<p>The diagram below shows how different personas interact with these custom resources:</p>

<p><img src="/images/2025-07-21-introducing-trainer-v2/user-personas.drawio.svg" alt="user_personas" /></p>

<ul>
  <li><strong>Platform Administrators</strong> define and manage <strong>the infrastructure configurations</strong> required for training jobs using <code class="language-plaintext highlighter-rouge">TrainingRuntimes</code> or <code class="language-plaintext highlighter-rouge">ClusterTrainingRuntimes</code>.</li>
  <li><strong>AI Practitioners</strong> focus on model development using the simplified <code class="language-plaintext highlighter-rouge">TrainJob</code> resource or <strong>Python SDK</strong> wrapper, providing a reference to <strong>the training runtime</strong> created by <strong>Platform Administrators</strong>.</li>
</ul>

<h1 id="python-sdk">Python SDK</h1>

<p><strong>The KF Trainer v2</strong> introduces a <strong>redesigned Python SDK</strong>, which is intended to be the <strong>primary interface for AI Practitioners</strong>.
The SDK provides a unified interface across multiple ML frameworks and cloud environments, abstracting away the underlying Kubernetes complexity.</p>

<p>The diagram below illustrates how Kubeflow Trainer provides a consistent experience for running ML jobs across different ML frameworks, Kubernetes infrastructures, and cloud providers:</p>

<p><img src="/images/2025-07-21-introducing-trainer-v2/trainerv2.png" alt="trainerv2" /></p>

<p><strong>Kubeflow Trainer v2</strong> supports multiple ML frameworks through <strong>pre-configured runtimes</strong>. The table below shows the current framework support:</p>

<p><img src="/images/2025-07-21-introducing-trainer-v2/runtimes.png" alt="runtimes" /></p>

<p>The SDK makes it easier for users familiar with Python to <strong>create, manage, and monitor training jobs</strong>, without requiring them to deal with any YAML definitions:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from kubeflow.trainer import TrainerClient

client = TrainerClient()

def my_train_func():
    """User defined function that runs on each distributed node process"""
    import os
    import torch
    import torch.distributed as dist
    from torch.utils.data import DataLoader, DistributedSampler
    
    # Setup PyTorch distributed
    backend = "nccl" if torch.cuda.is_available() else "gloo"
    local_rank = int(os.getenv("LOCAL_RANK", 0))
    dist.init_process_group(backend=backend)
    
    # Define your model, dataset, and training loop
    model = YourModel()
    dataset = YourDataset()
    train_loader = DataLoader(dataset, sampler=DistributedSampler(dataset))
    
    # Your training logic here
    for epoch in range(num_epochs):
        for batch in train_loader:
            # Forward pass, backward pass, optimizer step
            ...
            
    # Wait for the distributed training to complete
    dist.barrier()
    if dist.get_rank() == 0:
        print("Training is finished")

    # Clean up PyTorch distributed
    dist.destroy_process_group()

job_name = client.train(
  runtime=client.get_runtime("torch-distributed"),
  trainer=CustomTrainer(
    func=my_train_func,
    num_nodes=5,
    resources_per_node={
      "gpu": 2,
     },
  ),
)

job = client.get_job(name=job_name)

for step in job.steps:
   print(f"Step: {step.name}, Status: {step.status}")

client.get_job_logs(job_name, follow=True)
</code></pre></div></div>
<p>The SDK handles all Kubernetes API interactions. This eliminates the need for AI Practitioners to directly interact with the Kubernetes API.</p>

<h1 id="simplified-api">Simplified API</h1>

<p>Previously, in the <strong>Kubeflow Training Operator</strong>, users worked with a different custom resource for each ML framework, each with its own framework-specific configuration.
The <strong>KF Trainer v2</strong> replaces these multiple CRDs with a <strong>unified TrainJob API</strong> that works with <strong>multiple ML frameworks</strong>.</p>

<p>For example, here’s what a <strong>PyTorch training job</strong> looks like using <strong>KF Trainer v1</strong>:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: pytorch-simple
  namespace: kubeflow
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: docker.io/kubeflowkatib/pytorch-mnist:v1beta1-45c5727
              imagePullPolicy: Always
              command:
                - "python3"
                - "/opt/pytorch-mnist/mnist.py"
                - "--epochs=1"
    Worker:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: docker.io/kubeflowkatib/pytorch-mnist:v1beta1-45c5727
              imagePullPolicy: Always
              command:
                - "python3"
                - "/opt/pytorch-mnist/mnist.py"
                - "--epochs=1"
</code></pre></div></div>

<p>In the <strong>KF Trainer v2</strong>, creating an equivalent job becomes much simpler:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>apiVersion: trainer.kubeflow.org/v1alpha1
kind: TrainJob
metadata:
  name: pytorch-simple
  namespace: kubeflow
spec:
  trainer:
    numNodes: 2
    image: docker.io/kubeflowkatib/pytorch-mnist:v1beta1-45c5727
    command:
      - "python3"
      - "/opt/pytorch-mnist/mnist.py"
      - "--epochs=1"
  runtimeRef:
    name: torch-distributed
    apiGroup: trainer.kubeflow.org
    kind: ClusterTrainingRuntime
</code></pre></div></div>

<p>Additional <strong>infrastructure</strong> and <strong>Kubernetes-specific</strong> details are provided in the referenced <strong>runtime</strong> definition, and managed separately by <strong>Platform Administrators</strong>.
In the future, we might support additional runtime kinds beyond <code class="language-plaintext highlighter-rouge">TrainingRuntime</code> and <code class="language-plaintext highlighter-rouge">ClusterTrainingRuntime</code>, for example <a href="https://github.com/kubeflow/trainer/issues/2249">SlurmRuntime</a>.</p>
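
<p>To make this split concrete, below is a trimmed, illustrative sketch of what the referenced <code class="language-plaintext highlighter-rouge">torch-distributed</code> runtime could look like. It reuses only fields shown elsewhere in this post; the JobSet template a Platform Administrator would normally fill in (container images, volumes, pod specs) is elided, so treat the values as assumptions rather than the shipped manifest.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>apiVersion: trainer.kubeflow.org/v1alpha1
kind: ClusterTrainingRuntime
metadata:
  name: torch-distributed
spec:
  mlPolicy:
    numNodes: 2          # default node count; a TrainJob can override it
    torch:
      numProcPerNode: 1  # illustrative value
  # template: the JobSet definition (images, volumes, pod specs)
  # that Platform Administrators manage; omitted here for brevity
</code></pre></div></div>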

<h1 id="extensibility-and-pipeline-framework">Extensibility and Pipeline Framework</h1>

<p>One of the challenges in <strong>KF Trainer v1</strong> was supporting additional ML frameworks, especially closed-source ones.
The v2 architecture addresses this by introducing a <strong>Pipeline Framework</strong> that allows Platform Administrators to <strong>extend the Plugins</strong> and <strong>support orchestration</strong> for their custom in-house ML frameworks.</p>

<p>The diagram below shows an overview of the Kubeflow Trainer Pipeline Framework:</p>

<p><img src="/images/2025-07-21-introducing-trainer-v2/trainer-pipeline-framework.drawio.svg" alt="trainer_pipeline_framework" /></p>

<p>The framework works through a series of phases - <strong>Startup</strong>, <strong>PreExecution</strong>, <strong>Build</strong>, and <strong>PostExecution</strong> - each with <strong>extension points</strong> where custom Plugins can hook in.
This approach allows adding support for new frameworks, custom validation logic, or specialized training orchestration without changing the underlying system.</p>

<h1 id="llms-fine-tuning-support">LLMs Fine-Tuning Support</h1>

<p>Another improvement in <strong>Trainer v2</strong> is its <strong>built-in support for fine-tuning large language models</strong>, for which we provide two types of trainers:</p>
<ul>
  <li><code class="language-plaintext highlighter-rouge">BuiltinTrainer</code> - already includes the fine-tuning logic and allows AI Practitioners to quickly start fine-tuning requiring only parameter adjustments.</li>
  <li><code class="language-plaintext highlighter-rouge">CustomTrainer</code> - allows users to provide their own training function that encapsulates the entire LLMs fine-tuning.</li>
</ul>

<p>In the first release, we support <strong>TorchTune LLM Trainer</strong> as the initial option for <code class="language-plaintext highlighter-rouge">BuiltinTrainer</code>.
For TorchTune, we provide pre-configured runtimes (<code class="language-plaintext highlighter-rouge">ClusterTrainingRuntime</code>) that currently support <code class="language-plaintext highlighter-rouge">Llama-3.2-1B-Instruct</code> and <code class="language-plaintext highlighter-rouge">Llama-3.2-3B-Instruct</code> in the <a href="https://github.com/kubeflow/trainer/tree/master/manifests/base/runtimes/torchtune/llama3_2">manifest</a>.
This approach means that in the future, we can add more frameworks, such as <a href="https://github.com/unslothai/unsloth">unsloth</a>, as additional <code class="language-plaintext highlighter-rouge">BuiltinTrainer</code> options.
Here’s an example using the <code class="language-plaintext highlighter-rouge">BuiltinTrainer</code> with <strong>TorchTune</strong>:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>job_name = client.train(
    runtime=Runtime(
        name="torchtune-llama3.2-1b"
    ),
    initializer=Initializer(
        dataset=HuggingFaceDatasetInitializer(
            storage_uri="hf://tatsu-lab/alpaca/data"
        ),
        model=HuggingFaceModelInitializer(
            storage_uri="hf://meta-llama/Llama-3.2-1B-Instruct",
            access_token="&lt;YOUR_HF_TOKEN&gt;"  # Replace with your Hugging Face token,
        )
    ),
    trainer=BuiltinTrainer(
        config=TorchTuneConfig(
            dataset_preprocess_config=TorchTuneInstructDataset(
                source=DataFormat.PARQUET,
            ),
            resources_per_node={
                "gpu": 1,
            }
        )
    )
)
</code></pre></div></div>

<p>This example uses a <strong>builtin runtime image</strong> with a foundation Llama model and fine-tunes it on a dataset pulled from Hugging Face, using the TorchTune configuration provided by the AI Practitioner.
For more details, please refer to <a href="https://github.com/kubeflow/trainer/blob/master/examples/torchtune/llama3_2/alpaca-trainjob-yaml.ipynb">this example</a>.</p>

<h1 id="dataset-and-model-initializers">Dataset and Model Initializers</h1>

<p><strong>Trainer v2</strong> provides <strong>dedicated initializers</strong> for datasets and models, which significantly simplify the setup process.
Instead of each training pod independently downloading large models and datasets, <strong>initializers handle this once</strong> and <strong>share the data</strong> across all training nodes through a <strong>shared volume</strong>.</p>

<p>This approach saves both <strong>time and resources</strong>: it prevents network slowdowns and <strong>reduces GPU waiting time</strong> during setup by offloading data loading to CPU-based initializers, preserving expensive GPU resources for the actual training.</p>

<h1 id="use-of-jobset-api">Use of JobSet API</h1>

<p>Under the hood, the <strong>KF Trainer v2</strong> uses <strong><a href="https://jobset.sigs.k8s.io/docs/overview/">JobSet</a></strong>, a <strong>Kubernetes-native API</strong> for managing groups of jobs.
This integration allows the KF Trainer v2 to better utilize standard Kubernetes features instead of trying to recreate them.</p>
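
<p>For readers unfamiliar with JobSet, the sketch below shows the general shape of the API: a group of replicated Kubernetes Jobs managed as a single unit. This is a generic illustration of JobSet itself, with assumed names and replica counts, not the exact object the KF Trainer generates for a given TrainJob.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSet
metadata:
  name: pytorch-simple        # hypothetical name
spec:
  replicatedJobs:
    - name: node              # one replicated Job for the training nodes
      replicas: 1
      template:               # a standard batch/v1 Job template
        spec:
          parallelism: 2
          completions: 2
          template:
            spec:
              restartPolicy: Never
              containers:
                - name: trainer
                  image: docker.io/kubeflowkatib/pytorch-mnist:v1beta1-45c5727
</code></pre></div></div>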

<h1 id="kueue-integration">Kueue Integration</h1>

<p>Resource management is improved through integration with <strong><a href="https://kueue.sigs.k8s.io/">Kueue</a></strong>, a <strong>Kubernetes-native queueing system</strong>.
The KF Trainer v2 includes initial support for Kueue through Pod Integration, which allows individual training pods to be queued when resources are busy.
We are working on <strong><a href="https://github.com/kubernetes-sigs/kueue/issues/3884">native Kueue support</a></strong> for <code class="language-plaintext highlighter-rouge">TrainJob</code> to provide richer queueing features in future releases.</p>
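
<p>As a rough illustration of how Pod Integration works, assuming Kueue’s pod integration is enabled for the training namespace: a <code class="language-plaintext highlighter-rouge">LocalQueue</code> is created in that namespace, and training pods are queued by carrying the standard Kueue queue label. The queue names below are hypothetical, and how the label is attached to TrainJob pods depends on your runtime configuration.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># A LocalQueue in the training namespace, backed by a ClusterQueue
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  name: team-queue                  # hypothetical queue name
  namespace: training               # hypothetical namespace
spec:
  clusterQueue: gpu-cluster-queue   # hypothetical ClusterQueue name

# Training pods are queued when their metadata carries the Kueue label:
#   labels:
#     kueue.x-k8s.io/queue-name: team-queue
</code></pre></div></div>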

<h1 id="mpi-support">MPI Support</h1>

<p>The <strong>KF Trainer v2</strong> also provides <strong>MPI v2 support</strong>, which includes <strong>automatic generation of SSH keys</strong> for secure inter-node communication and improves MPI performance on Kubernetes.</p>

<p><img src="/images/2025-07-21-introducing-trainer-v2/MPI-support.drawio.svg" alt="MPI_support" /></p>

<p>The diagram above shows how this works in practice - the <strong>KF Trainer</strong> automatically <strong>handles the SSH key generation</strong> and <strong>MPI communication</strong> between training pods, which allows frameworks like <a href="https://www.deepspeed.ai/">DeepSpeed</a> to coordinate training across multiple GPU nodes without requiring manual configuration of inter-node communication.</p>

<h1 id="gang-scheduling">Gang-Scheduling</h1>

<p><strong>Gang-scheduling</strong> is an important feature for distributed training that ensures <strong>all pods in a training job are scheduled together</strong> or not at all.
This prevents scenarios where only some pods are scheduled while others remain pending due to resource constraints, which would waste GPU resources and prevent training from starting.</p>

<p><strong>The KF Trainer v2</strong> provides <strong>built-in gang-scheduling support</strong> through the <strong>PodGroupPolicy API</strong>.
This creates <strong>PodGroup resources</strong> that ensure all required pods can be scheduled simultaneously before the training job starts.</p>

<p><strong>Platform Administrators</strong> can configure gang-scheduling in their <strong>TrainingRuntime</strong> or <strong>ClusterTrainingRuntime</strong> definitions. Here’s an example:</p>

<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="na">apiVersion</span><span class="pi">:</span> <span class="s">trainer.kubeflow.org/v1alpha1</span>
<span class="na">kind</span><span class="pi">:</span> <span class="s">ClusterTrainingRuntime</span>
<span class="na">metadata</span><span class="pi">:</span>
  <span class="na">name</span><span class="pi">:</span> <span class="s">torch-distributed-gang-scheduling</span>
<span class="na">spec</span><span class="pi">:</span>
  <span class="na">mlPolicy</span><span class="pi">:</span>
    <span class="na">numNodes</span><span class="pi">:</span> <span class="m">3</span>
    <span class="na">torch</span><span class="pi">:</span>
      <span class="na">numProcPerNode</span><span class="pi">:</span> <span class="m">2</span>
  <span class="na">podGroupPolicy</span><span class="pi">:</span>
    <span class="na">coscheduling</span><span class="pi">:</span>
      <span class="na">scheduleTimeoutSeconds</span><span class="pi">:</span> <span class="m">120</span>
  <span class="c1"># ... rest of runtime configuration</span>
</code></pre></div></div>

<p>Currently, <strong>KF Trainer v2</strong> supports the <strong>Co-Scheduling plugin</strong> from the <a href="https://github.com/kubernetes-sigs/scheduler-plugins">Kubernetes scheduler-plugins</a> project.
<strong><a href="https://github.com/kubeflow/trainer/pull/2672">Volcano</a></strong> and <strong><a href="https://github.com/kubeflow/trainer/pull/2663">KAI</a></strong> scheduler support is coming in future releases to provide more advanced scheduling capabilities.</p>

<h1 id="fault-tolerance-improvements">Fault Tolerance Improvements</h1>

<p>Training jobs can sometimes fail due to node issues or other problems. The <strong>KF Trainer v2</strong> improves the handling of these faults by supporting the <strong>Kubernetes PodFailurePolicy</strong>, which allows users to <strong>define specific rules</strong> for handling different types of failures, such as restarting the job after temporary node issues or terminating it after critical errors.</p>
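
<p>To give a feel for what such rules look like, here is a minimal sketch of a <code class="language-plaintext highlighter-rouge">podFailurePolicy</code> as it appears on a standard <code class="language-plaintext highlighter-rouge">batch/v1</code> Job, the kind of Job a training runtime launches under the hood. The container name and exit code are assumptions for illustration, and where exactly a runtime exposes this policy depends on its JobSet template.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Excerpt of a batch/v1 Job spec (requires restartPolicy: Never on the pod template)
spec:
  backoffLimit: 6
  podFailurePolicy:
    rules:
      # Fail the whole Job immediately on a non-retriable application error
      - action: FailJob
        onExitCodes:
          containerName: trainer   # hypothetical container name
          operator: In
          values: [42]             # illustrative exit code
      # Do not count pod disruptions (e.g. node drains) against backoffLimit
      - action: Ignore
        onPodConditions:
          - type: DisruptionTarget
            status: "True"
</code></pre></div></div>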

<h1 id="whats-next">What’s Next?</h1>

<p>Future enhancements will continue to improve the user experience, integrate deeper with other Kubeflow components, and support more training frameworks.
<strong>Upcoming features</strong> include:</p>
<ul>
  <li><strong><a href="https://github.com/kubeflow/sdk/issues/22">Local Execution</a></strong> - run training jobs locally without Kubernetes</li>
  <li><strong><a href="https://docs.google.com/document/d/1rX7ELAHRb_lvh0Y7BK1HBYAbA0zi9enB0F_358ZC58w/edit?tab=t.0#heading=h.e0573r7wwkgl">Unified Kubeflow SDK</a></strong> - a single SDK for all Kubeflow projects</li>
  <li><strong><a href="https://github.com/kubeflow/trainer/issues/2648">Trainer UI</a></strong> - a user interface to expose high level metrics for training jobs and monitor training logs</li>
  <li><strong><a href="https://github.com/kubernetes-sigs/kueue/issues/3884">Native Kueue integration</a></strong> - improve resource management and scheduling capabilities for TrainJob resources</li>
  <li><strong><a href="https://github.com/kubeflow/trainer/issues/2245">Model Registry integrations</a></strong> - export trained models directly to Model Registry</li>
  <li><strong><a href="https://github.com/kubeflow/community/pull/864">Distributed Data Cache</a></strong> - in-memory Apache Arrow caching for tabular datasets</li>
  <li><strong><a href="https://github.com/kubeflow/trainer/pull/2672">Volcano support</a></strong> - advanced AI-specific scheduling with gang scheduling, priority queues, and resource management capabilities</li>
  <li><strong><a href="https://github.com/kubeflow/trainer/pull/2643">JAX runtime support</a></strong> - ClusterTrainingRuntime for JAX distributed training</li>
  <li><strong><a href="https://github.com/kubeflow/trainer/pull/2663">KAI Scheduler support</a></strong> - NVIDIA’s GPU-optimized scheduler for AI workloads</li>
</ul>

<h1 id="migration-from-training-operator-v1">Migration from Training Operator v1</h1>

<p>For users migrating from <strong>Kubeflow Training Operator v1</strong>, check out a <a href="https://www.kubeflow.org/docs/components/trainer/operator-guides/migration/"><strong>Migration Guide</strong></a>.</p>

<h1 id="resources-and-community">Resources and Community</h1>

<p>For more information about <strong>Trainer V2</strong>, check out the <a href="https://www.kubeflow.org/docs/components/trainer/">Kubeflow Trainer documentation</a> and the <a href="https://github.com/kubeflow/trainer/tree/master/docs/proposals/2170-kubeflow-trainer-v2">design proposal</a> for technical implementation details.</p>

<p>For more details about Kubeflow Trainer, you can also watch our KubeCon presentations:</p>
<ul>
  <li><a href="https://youtu.be/Lgy4ir1AhYw">Democratizing AI Model Training on Kubernetes with Kubeflow TrainJob and JobSet</a></li>
  <li><a href="https://youtu.be/Fnb1a5Kaxgo">From High Performance Computing To AI Workloads on Kubernetes: MPI Runtime in Kubeflow TrainJob</a></li>
</ul>

<p>Join the community via the <a href="https://www.kubeflow.org/docs/about/community/#kubeflow-slack-channels">#kubeflow-trainer</a> channel on CNCF Slack, or attend the <a href="https://docs.google.com/document/d/1MChKfzrKAeFRtYqypFbMXL6ZIc_OgijjkvbqmwRV-64/edit?tab=t.0#heading=h.o8oe6e5kry87">AutoML and Training Working Group</a> meetings to contribute or ask questions.
Your feedback, contributions, and questions are always welcome!</p>]]></content><author><name>Kubeflow Trainer Team</name></author><category term="trainer" /><summary type="html"><![CDATA[Running machine learning workloads on Kubernetes can be challenging. Distributed training and LLMs fine-tuning, in particular, involves managing multiple nodes, GPUs, large datasets, and fault tolerance, which often requires deep Kubernetes knowledge. The Kubeflow Trainer v2 (KF Trainer) was created to hide this complexity, by abstracting Kubernetes from AI Practitioners and providing the easiest, most scalable way to run distributed PyTorch jobs.]]></summary></entry></feed>