DIVERSE — Disagreement-Inducing Vector Evolution
Same accuracy, different answers
A fast way to find the many equally-good models hiding inside one trained network — no retraining, no gradients.

Train a classifier and you get one model. But many other models reach the very same accuracy by reasoning differently — and they can disagree on individual cases. That hidden family is a network's Rashomon set. DIVERSE maps it by freezing the trained model, adding small bounded "knobs", and evolving a tiny latent vector to surface accurate-but-different variants in minutes.

ICLR 2026 conference paper | MNIST · CIFAR-10 · PneumoniaMNIST (+ a ViT) | Gradient-free · weights stay frozen
The gap

Equally good models can quietly disagree

Picture a model that is 95% accurate at spotting a disease. There can be hundreds of other models, all 95% accurate, that decide differently on specific patients — one leans on shape, another on texture. Knowing this whole set matters for trust, fairness, and uncertainty. The catch is that finding it has been slow and hard to control.

The usual options

Expensive or uncontrolled

  • Retrain from scratch with new seeds: hours of compute, and equal accuracy isn't guaranteed.
  • Adversarial weight perturbation: scales poorly; costly for every sample-and-class pair.
  • Dropout sampling: fast, but diversity is uncontrolled and set membership is judged on the test set.
INSTEAD
DIVERSE

Modulate, don't retrain

  • Freeze the trained model and add bounded FiLM "knobs".
  • Evolve a tiny latent vector with gradient-free CMA-ES.
  • Tune how different the variants are — explicit diversity, in minutes.

The key reframe: instead of searching the network's enormous weight space, DIVERSE searches a small latent modulation space wrapped around the frozen model.

Contributions

What the paper adds

C1

A retraining-free Rashomon explorer

FiLM modulation plus CMA-ES finds diverse, accurate variants of a frozen model — with no retraining and no gradient access.

C2

Explicit, tunable diversity

A disagreement-aware fitness steers how different the variants are, while a soft penalty keeps them inside the accuracy tolerance.

C3

Validated across architectures

On an MLP, VGG-16 and ResNet-50 (plus a preliminary Vision Transformer), DIVERSE matches or beats dropout's diversity at far less cost than retraining.

Try it · interactive

Sweep the latent vector, watch the Rashomon set

Every model below is equally accurate, but each draws the boundary a little differently. Drag the latent vector to morph one variant, or reveal the whole set to see exactly which points the models can't agree on — that disagreement is predictive multiplicity.

[Interactive demo: toy 2-class problem. A slider sweeps the latent vector z from −1 to +1 and redraws this variant's decision boundary against the reference boundary; its accuracy stays within the set. A toggle shades multiplicity across the whole set.]

Illustrative synthetic data — a 2-D stand-in for the paper's real networks and datasets. It conveys the mechanism (accurate variants that disagree on hard cases), not the paper's actual figures. The violet ring marks the single most-contested point, echoing the paper's highest-disagreement examples.

The trade-off · interactive

Different, but still in the set

DIVERSE scores each candidate by how much it disagrees with the reference, times a soft penalty for losing accuracy. Their product peaks at a sweet spot. Move the Rashomon tolerance ε and watch where "as diverse as possible, still accurate" lands.

[Interactive plot: diversity, accuracy penalty, and fitness (their product) against modulation strength, with a tolerance slider from ε = 0.01 to ε = 0.05.]

Illustrative curves following the paper's fitness design F(z) = Div(z) · exp(−Δ² / (2ε²)); the shapes are schematic, not measured.
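The fitness formula is easy to evaluate by hand. A minimal sketch, assuming Δ is the candidate's accuracy drop relative to the reference (this page does not define Δ) and that the diversity score is already computed; the function and argument names are illustrative, not the paper's.

```python
import math

def fitness(diversity: float, acc_drop: float, eps: float) -> float:
    """F(z) = Div(z) * exp(-Δ² / (2 ε²)).

    Assumption: acc_drop (Δ) is reference accuracy minus candidate accuracy.
    """
    penalty = math.exp(-(acc_drop ** 2) / (2.0 * eps ** 2))
    return diversity * penalty

# Same diversity and the same 0.02 accuracy drop under two tolerances.
for eps in (0.01, 0.05):
    print(f"eps={eps}: fitness={fitness(0.30, 0.02, eps):.4f}")
```

At ε = 0.01 a 0.02 drop sits two tolerances out, so the penalty is exp(−2) ≈ 0.14. At ε = 0.05 it is exp(−0.08) ≈ 0.92, and the same variant keeps most of its score.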

How it works

Three ingredients

DIVERSE never touches the trained weights. It adds a controllable layer of variation on top and searches it without gradients.

01

FiLM — knobs on a frozen net

Feature-wise Linear Modulation applies a bounded affine tweak to a layer's activations: scale by γ and shift by β. With γ = 1 + tanh(zW) and β = tanh(zW), the changes stay small and the original weights never move.

FiLM · frozen weights
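A minimal sketch of one bounded FiLM step on a single activation vector. It assumes separate projection matrices for the scale and the shift (named `W_gamma` and `W_beta` here; the page writes both as W) and uses made-up numbers throughout.

```python
import math

def film(h, z, W_gamma, W_beta):
    """Apply bounded FiLM to one activation vector h.

    gamma = 1 + tanh(z W_gamma), beta = tanh(z W_beta).
    Assumption: W_gamma and W_beta are separate d-by-len(h) matrices.
    """
    def project(W):
        # Row vector z (length d) times W (d rows, len(h) columns).
        return [sum(z[i] * W[i][j] for i in range(len(z))) for j in range(len(h))]

    gamma = [1.0 + math.tanh(v) for v in project(W_gamma)]  # each in [0, 2]
    beta = [math.tanh(v) for v in project(W_beta)]          # each in [-1, 1]
    return [g * x + b for g, x, b in zip(gamma, h, beta)]

h = [0.5, -1.2, 2.0]                             # activations from a frozen layer
W_gamma = [[0.3, -0.1, 0.2], [0.0, 0.4, -0.5]]   # hypothetical projections
W_beta = [[0.1, 0.0, -0.2], [0.2, -0.3, 0.1]]

print(film(h, [0.0, 0.0], W_gamma, W_beta))      # z = 0 leaves h unchanged
print(film(h, [0.42, -0.1], W_gamma, W_beta))    # nonzero z nudges it
```

Because tanh saturates, no choice of z can push the scale outside [0, 2] or the shift outside [−1, 1], which is what keeps the knobs bounded.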
02

One shared latent vector z

A single low-dimensional vector drives every FiLM layer at once. At z = 0 you get the reference model exactly; any nonzero z is a new, coordinated, network-wide variant.

latent space · z = 0 anchor
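To see the z = 0 anchor concretely, here is a small sketch in which one latent vector drives FiLM after both layers of a tiny frozen network. The weights, the ReLU activations, and the per-layer projection matrices are all hypothetical stand-ins, not the paper's architecture.

```python
import math

def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def film(h, z, W_gamma, W_beta):
    # z W with W stored as d rows by len(h) columns.
    zg = [sum(zi * row[j] for zi, row in zip(z, W_gamma)) for j in range(len(h))]
    zb = [sum(zi * row[j] for zi, row in zip(z, W_beta)) for j in range(len(h))]
    return [(1.0 + math.tanh(g)) * x + math.tanh(b) for g, x, b in zip(zg, h, zb)]

# Frozen weights (made up) for two layers: 2 -> 3 -> 2.
W1 = [[0.8, -0.5], [0.3, 0.9], [-0.6, 0.4]]
W2 = [[0.5, -0.7, 0.2], [-0.3, 0.6, 0.9]]
# One fixed (W_gamma, W_beta) pair per layer; every pair reads the same z.
FILM = [
    ([[0.4, 0.1, -0.5], [-0.2, 0.3, 0.2]], [[0.2, -0.3, 0.1], [0.1, 0.4, -0.1]]),
    ([[0.3, -0.4], [0.5, 0.1]], [[-0.2, 0.5], [0.3, -0.1]]),
]

def reference(x):
    h = x
    for W in (W1, W2):
        h = [max(0.0, v) for v in matvec(W, h)]   # frozen layer + ReLU
    return h

def modulated(x, z):
    h = x
    for W, (Wg, Wb) in zip((W1, W2), FILM):
        h = [max(0.0, v) for v in matvec(W, h)]   # same frozen layer
        h = film(h, z, Wg, Wb)                    # knob driven by the shared z
    return h

x = [1.0, 2.0]
print(reference(x), modulated(x, [0.0, 0.0]), modulated(x, [0.42, -0.1]))
```

The first two outputs coincide because tanh(0) = 0 gives γ = 1 and β = 0 in every layer; the third differs in both layers at once, from a change to just two numbers.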
03

CMA-ES — evolution, no gradients

Covariance Matrix Adaptation Evolution Strategy samples candidate z's, scores them, and adapts its search distribution toward better ones — no backprop, no gradient access required.

CMA-ES · gradient-free
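The sample, score, adapt loop can be sketched with a deliberately simplified evolution strategy. This is not CMA-ES: it keeps an isotropic Gaussian and a crude step-size rule, and it omits covariance adaptation entirely. `toy_fitness` is a hypothetical stand-in for the real fitness, and all defaults are illustrative.

```python
import math
import random

def toy_fitness(z):
    # Hypothetical stand-in for F(z): the best score sits away from z = 0.
    target = (0.6, -0.3)
    return math.exp(-sum((a - b) ** 2 for a, b in zip(z, target)))

def simple_es(fitness, dim, sigma0=0.3, popsize=12, generations=40, seed=0):
    rng = random.Random(seed)
    mean = [0.0] * dim                      # start at the reference, z = 0
    sigma = sigma0
    best_z, best_f = list(mean), fitness(mean)
    for _ in range(generations):
        # Sample candidate z's around the current mean.
        pop = [[m + sigma * rng.gauss(0.0, 1.0) for m in mean]
               for _ in range(popsize)]
        # Score them and keep the best quarter.
        pop.sort(key=fitness, reverse=True)
        elite = pop[: max(1, popsize // 4)]
        # Shift the search distribution toward the better candidates.
        mean = [sum(z[i] for z in elite) / len(elite) for i in range(dim)]
        top_f = fitness(pop[0])
        if top_f > best_f:
            best_z, best_f = pop[0], top_f
            sigma *= 1.1                    # progress: widen the search
        else:
            sigma *= 0.9                    # no progress: narrow it
    return best_z, best_f

best_z, best_f = simple_es(toy_fitness, dim=2)
print(best_z, best_f)
```

Only fitness values are ever requested, never gradients, so the frozen network can be treated as a black box. Real CMA-ES additionally learns a full covariance matrix over z, which is where its cost in larger latent spaces comes from.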
How they tested it

Setup at a glance

Three datasets spanning a complexity range, two baselines, and metrics for both the size and the internal structure of each Rashomon set.

DATA

3 datasets, 3 nets

MNIST (a 3-layer MLP), CIFAR-10 (VGG-16) and PneumoniaMNIST (ResNet-50), with each reference model's accuracy fixed as the Rashomon threshold.

SEARCH

Latent & CMA-ES knobs

Latent sizes d ∈ {2…64}, 10 initializations, initial step sizes σ₀ from 0.1 to 0.5, and tolerances ε from 0.01 to 0.05.

BASE

Two baselines

Retraining (the costly gold standard) and dropout sampling (fast, but test-set-defined). Adversarial weight perturbation was left out as impractical.

METRIC

Size + structure

Rashomon Ratio for set size; ambiguity, discrepancy, Viable Prediction Range and Rashomon Capacity for how the variants differ.
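Two of these metrics are simple enough to sketch. The version below follows the definitions common in the predictive-multiplicity literature (ambiguity: share of samples that at least one variant flips; discrepancy: the largest share flipped by any single variant). The paper's exact formulas are not shown on this page, and the labels are made up.

```python
def ambiguity(ref_labels, variant_labels):
    """Fraction of samples where at least one variant disagrees with the reference."""
    n = len(ref_labels)
    contested = sum(
        any(v[i] != ref_labels[i] for v in variant_labels) for i in range(n)
    )
    return contested / n

def discrepancy(ref_labels, variant_labels):
    """Largest fraction of samples that any single variant flips."""
    n = len(ref_labels)
    return max(
        sum(a != b for a, b in zip(v, ref_labels)) / n for v in variant_labels
    )

ref = [0, 1, 1, 0, 1]                       # reference predictions
variants = [
    [0, 1, 0, 0, 1],                        # flips sample 2
    [1, 1, 1, 0, 1],                        # flips sample 0
    [0, 1, 0, 0, 0],                        # flips samples 2 and 4
]
print(ambiguity(ref, variants), discrepancy(ref, variants))
```

Here three of five samples are contested by someone (ambiguity 0.6), while the most divergent single variant changes two of five (discrepancy 0.4).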

What they found

Minutes instead of hours, with real diversity

Retraining still tops raw diversity in most cases — DIVERSE doesn't claim to beat it there. What it offers is comparable diversity at a tiny fraction of the cost, and a clear edge over fast dropout sampling.

EFFICIENCY

Roughly 88× less compute

Generating 640 model variants on CIFAR-10 took DIVERSE about 8.5 minutes, versus roughly 12.5 hours of retraining on the same GPU.
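A quick back-of-the-envelope check on those figures. Both timings are rounded on this page, so the results are approximate.

```python
n_variants = 640
diverse_seconds = 8.5 * 60        # about 8.5 minutes
retrain_seconds = 12.5 * 3600     # roughly 12.5 hours

speedup = retrain_seconds / diverse_seconds
print(f"speedup: about {speedup:.0f}x")
print(f"DIVERSE: {diverse_seconds / n_variants:.2f} s per variant")
print(f"retraining: {retrain_seconds / n_variants:.1f} s per variant")
```

That works out to a bit under one second per variant for DIVERSE against a little over a minute per retrained model.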

GENERALITY

3 architectures, one recipe

The same modulate-and-search procedure worked on a shallow MLP, VGG-16 and ResNet-50, with an encouraging preliminary result on a Vision Transformer.

DIVERSITY

Beats dropout

On CIFAR-10, DIVERSE exceeded the dropout baseline on every multiplicity metric — while judging validity on held-out validation data rather than the test set.

Honest read: on the hardest datasets only small latent sizes yielded valid sets, and DIVERSE remains slower than dropout, which runs in seconds. Its niche is the middle ground: far cheaper than retraining, more controllable and less optimistic than dropout.

Lessons from the experiments

Five practical takeaways

End to end

The pipeline, in five steps

From a single trained model to a validated set of accurate-but-different variants.

FITNESS

Reward difference, punish drift

The fitness multiplies a diversity score by a soft Gaussian penalty, so the search is pulled toward variants that disagree usefully without leaving the accuracy band.

DIVERSITY

Two kinds of disagreement

"Hard" disagreement counts label flips; "soft" disagreement (Total Variation Distance) compares output probabilities. DIVERSE mixes both, equally weighted.

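A minimal sketch of the two disagreement scores and their equal-weight mix. It assumes the mix is a plain average of the two (the page says only "equally weighted") and that both are averaged over samples; the probabilities are made up.

```python
def argmax(p):
    return max(range(len(p)), key=lambda i: p[i])

def hard_disagreement(P_ref, P_var):
    """Fraction of samples whose predicted label flips."""
    flips = sum(argmax(p) != argmax(q) for p, q in zip(P_ref, P_var))
    return flips / len(P_ref)

def soft_disagreement(P_ref, P_var):
    """Mean Total Variation Distance, TVD(p, q) = 0.5 * sum_i |p_i - q_i|."""
    tvd = [0.5 * sum(abs(a - b) for a, b in zip(p, q))
           for p, q in zip(P_ref, P_var)]
    return sum(tvd) / len(tvd)

def diversity(P_ref, P_var, w_hard=0.5):
    # Assumption: "equally weighted" means a 50/50 average.
    return (w_hard * hard_disagreement(P_ref, P_var)
            + (1.0 - w_hard) * soft_disagreement(P_ref, P_var))

P_ref = [[0.9, 0.1], [0.6, 0.4], [0.2, 0.8]]   # reference probabilities
P_var = [[0.8, 0.2], [0.4, 0.6], [0.3, 0.7]]   # one variant's probabilities
print(hard_disagreement(P_ref, P_var),
      soft_disagreement(P_ref, P_var),
      diversity(P_ref, P_var))
```

Only the middle sample flips its label, yet all three shift probability mass. The soft term registers those sub-threshold changes that a pure flip count would miss.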
PROTOCOL

Honest validation

Membership is decided on validation data and diversity reported on a held-out test set — stricter than the dropout baseline, which judges membership on the test set itself.
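The split can be sketched as a two-stage filter. This assumes the usual membership rule, validation accuracy no more than ε below the reference; the candidate records, field names, and numbers are hypothetical.

```python
def rashomon_members(candidates, ref_val_acc, eps):
    """Keep candidates whose validation accuracy is within eps of the reference."""
    return [c for c in candidates if c["val_acc"] >= ref_val_acc - eps]

# Hypothetical candidates: accuracy measured on validation data,
# disagreement with the reference measured on the held-out test set.
candidates = [
    {"name": "z1", "val_acc": 0.948, "test_disagreement": 0.031},
    {"name": "z2", "val_acc": 0.930, "test_disagreement": 0.080},
    {"name": "z3", "val_acc": 0.955, "test_disagreement": 0.012},
]

# Stage 1: membership uses validation accuracy only.
members = rashomon_members(candidates, ref_val_acc=0.95, eps=0.01)
# Stage 2: diversity is reported on test data, for members only.
reported = {c["name"]: c["test_disagreement"] for c in members}
print(reported)
```

The most different candidate, z2, is excluded because it falls outside the accuracy band on validation data. The test set never influences who gets in, which is the point of the protocol.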

Honest limits

What it can't (yet) do

L1

CMA-ES doesn't scale to big latent spaces

Full-covariance CMA-ES grows costly as the latent dimension rises, so search is limited to moderate sizes. Scalable variants such as DD-CMA-ES are a promising path.

L2

Different need not mean meaningful

FiLM gives controlled functional variation, but offers no guarantee that the variants differ in interpretable or causally meaningful ways.

L3

Mostly image tasks

Evidence spans three image datasets plus a small ViT test. Other modalities and a fuller transformer study are still needed to confirm generality.

L4

A local view

DIVERSE explores a local Rashomon set around one reference model, rather than the entire hypothesis space of all good models.

In one breath

A cheap map of a model's other selves

DIVERSE turns Rashomon-set exploration from an hours-long retraining job into a minutes-long search: freeze the trained network, wrap it in bounded FiLM layers, and evolve a small latent vector with gradient-free CMA-ES. It won't always match retraining's raw diversity, but it gets close, beats fast dropout sampling, and hands you an explicit knob on how different the variants are — practical groundwork for studying multiplicity, uncertainty and fairness at scale.