DIVERSE · Rashomon Set Exploration

The gap

Equally good models can quietly disagree

Picture a model that is 95% accurate at spotting a disease. There can be hundreds of other models, all 95% accurate, that decide differently on specific patients — one leans on shape, another on texture. Knowing this whole set matters for trust, fairness, and uncertainty. The catch is that finding it has been slow and hard to control.

The usual options

Expensive or uncontrolled

Retrain from scratch with new seeds — hours of compute, and equal accuracy isn't guaranteed.
Adversarial weight perturbation — scales poorly; costly for every sample-and-class pair.
Dropout sampling — fast, but diversity is uncontrolled and judged on the test set.

INSTEAD

DIVERSE

Modulate, don't retrain

Freeze the trained model and add bounded FiLM "knobs".
Evolve a tiny latent vector with gradient-free CMA-ES.
Tune how different the variants are — explicit diversity, in minutes.

The key reframe: instead of searching the network's enormous weight space, DIVERSE searches a small latent modulation space wrapped around the frozen model.

Contributions

What the paper adds

C1

A retraining-free Rashomon explorer

FiLM modulation plus CMA-ES finds diverse, accurate variants of a frozen model — with no retraining and no gradient access.

C2

Explicit, tunable diversity

A disagreement-aware fitness steers how different the variants are, while a soft penalty keeps them inside the accuracy tolerance.

C3

Validated across architectures

On an MLP, VGG-16 and ResNet-50 (plus a preliminary Vision Transformer), DIVERSE matches or beats dropout's diversity at far less cost than retraining.

The method · Figure 1

Where the FiLM “knobs” go

DIVERSE inserts frozen FiLM layers into a trained network in three places, depending on the architecture: after dense layers, after convolutional blocks (following batch-norm when present), and on residual skip connections. A single shared latent vector z drives every FiLM block at once, so one small vector reshapes the whole network’s behaviour while the original weights stay frozen.

**Figure 1.** FiLM placement (blue): (a) after dense layers; (b) after convolutional blocks, following batch normalization when present and otherwise directly after the convolution; (c) on residual skip connections. A shared latent vector z supplies the modulation parameters for every FiLM layer in a model.

Multiplicity, made visible · Figure 3

One input, many verdicts

For each dataset, this is the single most-contested sample — the input the Rashomon set disagrees on most. The bars show how the equally-accurate variants vote across the classes, with the true class in green. On MNIST most variants still land on the right answer; on CIFAR-10 the vote splinters across several classes; and on the binary chest-X-ray task the set splits almost evenly — equally good models, genuinely different calls.

**Figure 3.** Highest-disagreement samples for MNIST, CIFAR-10 and PneumoniaMNIST. Each panel shows the input image and the class-frequency distribution over the Rashomon-set members; the true class is green, all others grey.
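One way to find such a sample, as a minimal sketch: the text here does not say how "most contested" is scored, so this example assumes the entropy of the class votes as the criterion. The helper names (`vote_counts`, `vote_entropy`, `most_contested`) and the predictions are made up for illustration.

```python
import math
from collections import Counter

def vote_counts(preds_per_model, sample_idx):
    """Count how many set members predict each class for one sample."""
    return Counter(model_preds[sample_idx] for model_preds in preds_per_model)

def vote_entropy(counts):
    """Shannon entropy (nats) of the vote distribution; 0 means unanimous."""
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values())

def most_contested(preds_per_model):
    """Index of the sample with the highest vote entropy (assumed criterion)."""
    n_samples = len(preds_per_model[0])
    return max(range(n_samples),
               key=lambda i: vote_entropy(vote_counts(preds_per_model, i)))

# Toy data: 4 set members (rows) predicting labels for 3 samples (columns).
preds = [
    [0, 1, 2],
    [0, 1, 0],
    [0, 2, 1],
    [0, 1, 2],
]
idx = most_contested(preds)
histogram = vote_counts(preds, idx)  # class -> number of votes, as in the bars
```

The `histogram` for the selected sample is what each panel's bar chart would plot, one bar per class.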

How it works

Three ingredients

DIVERSE never touches the trained weights. It adds a controllable layer of variation on top and searches it without gradients.

01

FiLM — knobs on a frozen net

Feature-wise Linear Modulation applies a bounded affine tweak to a layer's activations: scale by γ and shift by β. With γ = 1 + tanh(zW) and β = tanh(zW), the scale stays within (0, 2) and the shift within (−1, 1), and the original weights never move.

FiLM · frozen weights
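A minimal sketch of the modulation for a single activation vector, in plain Python. It assumes separate projection matrices for scale and shift (`W_gamma`, `W_beta`), which the text does not spell out; the weights and activations are made-up values, and `film_params` and `film` are illustrative names.

```python
import math

def film_params(z, W_gamma, W_beta):
    """Map latent z (length d) to per-channel scale and shift.

    W_gamma and W_beta are d x C lists of lists (assumed separate projections).
    """
    n_channels = len(W_gamma[0])
    gamma, beta = [], []
    for c in range(n_channels):
        proj_g = sum(z[i] * W_gamma[i][c] for i in range(len(z)))
        proj_b = sum(z[i] * W_beta[i][c] for i in range(len(z)))
        gamma.append(1.0 + math.tanh(proj_g))  # scale is bounded by 0 and 2
        beta.append(math.tanh(proj_b))         # shift is bounded by -1 and 1
    return gamma, beta

def film(h, gamma, beta):
    """Apply the feature-wise affine modulation to one activation vector."""
    return [g * x + b for x, g, b in zip(h, gamma, beta)]

# Fixed, made-up projection weights: latent size 2, 3 channels.
W_gamma = [[0.5, -1.0, 2.0], [1.5, 0.3, -0.7]]
W_beta = [[-0.2, 0.8, 1.1], [0.4, -0.6, 0.9]]
h = [1.0, -2.0, 0.5]

# z = 0 gives gamma = 1 and beta = 0, so the activations pass through unchanged.
g0, b0 = film_params([0.0, 0.0], W_gamma, W_beta)
h_ref = film(h, g0, b0)

# A nonzero z produces a bounded variant of the same activations.
g1, b1 = film_params([3.0, -2.0], W_gamma, W_beta)
h_var = film(h, g1, b1)
```

Because tanh saturates, even a large z cannot push the scale or shift outside those bounds, which is what keeps the variants close to the frozen model.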

02

One shared latent vector z

A single low-dimensional vector drives every FiLM layer at once. At z = 0 you get the reference model exactly; any nonzero z is a new, coordinated, network-wide variant.

latent space · z = 0 anchor

03

CMA-ES — evolution, no gradients

Covariance Matrix Adaptation Evolution Strategy samples candidate z's, scores them, and adapts its search distribution toward better ones — no backprop, no gradient access required.

CMA-ES · gradient-free
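The sample, score, adapt loop can be illustrated with a deliberately simplified evolution strategy. This is not CMA-ES: it keeps an isotropic Gaussian and shrinks the step size on a fixed schedule instead of adapting a full covariance matrix. `simple_es` and `toy_fitness` are hypothetical names, and the fitness is a stand-in. What carries over is that the search only ever sees fitness values, never gradients.

```python
import random

def simple_es(fitness, dim, sigma0=0.3, popsize=16, n_elite=4,
              generations=40, seed=0):
    """Maximise `fitness` over R^dim with a basic (mu, lambda) evolution strategy."""
    rng = random.Random(seed)
    mean = [0.0] * dim          # start at z = 0, the reference model
    sigma = sigma0
    best_z, best_f = list(mean), fitness(mean)
    for _ in range(generations):
        # Sample candidates around the current mean.
        population = [[m + sigma * rng.gauss(0.0, 1.0) for m in mean]
                      for _ in range(popsize)]
        # Score them; only function values are needed.
        scored = sorted(((fitness(z), z) for z in population),
                        key=lambda pair: pair[0], reverse=True)
        if scored[0][0] > best_f:
            best_f, best_z = scored[0]
        # Move the mean to the average of the elite candidates.
        elite = [z for _, z in scored[:n_elite]]
        mean = [sum(z[i] for z in elite) / n_elite for i in range(dim)]
        sigma *= 0.95           # fixed decay; CMA-ES adapts its step size instead
    return best_z, best_f

def toy_fitness(z):
    """Hypothetical score peaking at z = (1, -1); stands in for the real fitness."""
    return -((z[0] - 1.0) ** 2 + (z[1] + 1.0) ** 2)

z_best, f_best = simple_es(toy_fitness, dim=2)
```

In practice a maintained CMA-ES implementation would replace `simple_es`; the surrounding loop of proposing z, evaluating the modulated model, and feeding back a score keeps the same shape.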

How they tested it

Setup at a glance

Three datasets spanning a complexity range, two baselines, and metrics for both the size and the internal structure of each Rashomon set.

DATA

3 datasets, 3 nets

MNIST (a 3-layer MLP), CIFAR-10 (VGG-16) and PneumoniaMNIST (ResNet-50), with each reference model's accuracy fixed as the baseline for the Rashomon threshold.

SEARCH

Latent & CMA-ES knobs

Latent sizes d ∈ {2…64}, 10 initializations, initial step sizes σ₀ from 0.1 to 0.5, and accuracy tolerances ε from 0.01 to 0.05.

BASE

Two baselines

Retraining (the costly gold standard) and dropout sampling (fast, but test-set-defined). Adversarial weight perturbation was left out as impractical.

METRIC

Size + structure

Rashomon Ratio for set size; ambiguity, discrepancy, Viable Prediction Range and Rashomon Capacity for how the variants differ.
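Two of these metrics are simple enough to show directly. In their usual definitions from the predictive-multiplicity literature, ambiguity is the share of samples whose label is flipped by at least one set member, and discrepancy is the largest share flipped by any single member, both measured against the reference model. The sketch below assumes those definitions and uses toy predictions; the function names are illustrative.

```python
def ambiguity(ref_preds, set_preds):
    """Fraction of samples where at least one set member disagrees with the reference."""
    n = len(ref_preds)
    flipped = sum(
        any(member[i] != ref_preds[i] for member in set_preds) for i in range(n)
    )
    return flipped / n

def discrepancy(ref_preds, set_preds):
    """Largest fraction of samples on which a single member disagrees with the reference."""
    n = len(ref_preds)
    return max(
        sum(m != r for m, r in zip(member, ref_preds)) / n for member in set_preds
    )

ref = [0, 1, 1, 0, 2]
members = [
    [0, 1, 1, 0, 1],   # flips sample 4
    [0, 0, 1, 0, 2],   # flips sample 1
    [0, 0, 1, 0, 1],   # flips samples 1 and 4
]
amb = ambiguity(ref, members)      # 2 of 5 samples are flipped by some member
disc = discrepancy(ref, members)   # the worst single member flips 2 of 5
```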

What they found

Minutes instead of hours, with real diversity

Retraining still tops raw diversity in most cases — DIVERSE doesn't claim to beat it there. What it offers is comparable diversity at a tiny fraction of the cost, and a clear edge over fast dropout sampling.

EFFICIENCY

~88× less compute

Generating 640 model variants on CIFAR-10 took DIVERSE about 8.5 minutes, versus roughly 12.5 hours of retraining on the same GPU.

GENERALITY

Several architectures, one recipe

The same modulate-and-search procedure worked on a shallow MLP, VGG-16 and ResNet-50, with an encouraging preliminary result on a Vision Transformer.

DIVERSITY

Beats dropout

On CIFAR-10, DIVERSE exceeded the dropout baseline on every multiplicity metric — while judging validity on held-out validation data rather than the test set.

Honest read: on the hardest datasets only small latent sizes yielded valid sets, and DIVERSE remains slower than dropout sampling, which takes only seconds. Its niche is the middle ground: far cheaper than retraining, more controllable and less optimistic than dropout.

Lessons from the experiments

Five practical takeaways

End to end

The pipeline, in five steps

From a single trained model to a validated set of accurate-but-different variants.

FITNESS

Reward difference, punish drift

The fitness multiplies a diversity score by a soft Gaussian penalty, so the search is pulled toward variants that disagree usefully without leaving the accuracy band.
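As a rough illustration of that product, the sketch below assumes one plausible shape for the penalty: a Gaussian in the accuracy drop, scaled by the tolerance ε. The exact functional form, and the names `soft_penalty` and `fitness`, are assumptions rather than the paper's definition.

```python
import math

def soft_penalty(acc, acc_ref, eps):
    """Gaussian-shaped penalty on the accuracy drop (assumed form).

    Returns 1.0 when the variant is at least as accurate as the reference and
    decays smoothly as the drop grows relative to the tolerance eps.
    """
    drop = max(0.0, acc_ref - acc)
    return math.exp(-0.5 * (drop / eps) ** 2)

def fitness(diversity, acc, acc_ref, eps):
    """Diversity score scaled by the soft accuracy penalty."""
    return diversity * soft_penalty(acc, acc_ref, eps)

# Same diversity, different accuracy drops, tolerance eps = 0.02.
no_drop = fitness(0.30, acc=0.95, acc_ref=0.95, eps=0.02)  # penalty is exactly 1
inside = fitness(0.30, acc=0.94, acc_ref=0.95, eps=0.02)   # small drop, mild penalty
outside = fitness(0.30, acc=0.85, acc_ref=0.95, eps=0.02)  # far outside the band
```

A soft penalty like this gives the search a slope to follow back toward the accuracy band, whereas a hard cutoff would score every out-of-band candidate identically.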

DIVERSITY

Two kinds of disagreement

"Hard" disagreement counts label flips; "soft" disagreement (Total Variation Distance) compares output probabilities. DIVERSE mixes both, equally weighted.
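The two measures and their equal-weight mix can be sketched in a few lines. The function names and the `w_hard` parameter are illustrative, and the probabilities are toy values chosen so the arithmetic is easy to follow.

```python
def hard_disagreement(probs_a, probs_b):
    """Fraction of samples where the two models' argmax labels differ."""
    def argmax(p):
        return max(range(len(p)), key=p.__getitem__)
    flips = sum(argmax(pa) != argmax(pb) for pa, pb in zip(probs_a, probs_b))
    return flips / len(probs_a)

def soft_disagreement(probs_a, probs_b):
    """Mean Total Variation Distance between the predicted distributions."""
    tvd = [0.5 * sum(abs(x - y) for x, y in zip(pa, pb))
           for pa, pb in zip(probs_a, probs_b)]
    return sum(tvd) / len(tvd)

def diversity(probs_ref, probs_var, w_hard=0.5):
    """Weighted mix of hard and soft disagreement; 0.5 gives the equal weighting."""
    return (w_hard * hard_disagreement(probs_ref, probs_var)
            + (1.0 - w_hard) * soft_disagreement(probs_ref, probs_var))

# Toy class probabilities: 2 samples, 2 classes.
ref = [[0.75, 0.25], [0.25, 0.75]]
var = [[0.25, 0.75], [0.25, 0.75]]

hard = hard_disagreement(ref, var)   # one label flip out of two samples
soft = soft_disagreement(ref, var)   # TVD is 0.5 on sample 0 and 0 on sample 1
score = diversity(ref, var)
```

The soft term still registers a shift in confidence when no label flips, which gives the search a graded signal before any hard disagreement appears.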

PROTOCOL

Honest validation

Membership is decided on validation data and diversity reported on a held-out test set — stricter than the dropout baseline, which judges membership on the test set itself.
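The split can be made concrete with a small sketch. It assumes membership means "validation accuracy within ε of the reference", and reads the Rashomon Ratio empirically as the share of sampled candidates that qualify; both readings, the helper names, and the numbers are assumptions for illustration.

```python
def select_members(val_accs, acc_ref_val, eps):
    """Keep candidates whose validation accuracy is within eps of the reference."""
    return [i for i, acc in enumerate(val_accs) if acc >= acc_ref_val - eps]

def flip_rate(ref_preds, member_preds):
    """Share of samples on which one member disagrees with the reference."""
    flips = sum(m != r for m, r in zip(member_preds, ref_preds))
    return flips / len(ref_preds)

# Toy numbers: validation accuracy per candidate, and test-set predictions.
val_accs = [0.95, 0.93, 0.90, 0.96]
candidate_test_preds = [[0, 1, 1, 0], [0, 1, 0, 0], [1, 0, 0, 1], [0, 0, 1, 0]]
ref_test_preds = [0, 1, 1, 0]

# Step 1: membership is decided on validation accuracy only.
members = select_members(val_accs, acc_ref_val=0.95, eps=0.03)
ratio = len(members) / len(val_accs)  # empirical share of qualifying candidates

# Step 2: test data is used only to report diversity for the fixed members.
report = {i: flip_rate(ref_test_preds, candidate_test_preds[i]) for i in members}
```

Keeping the test set out of step 1 is what makes the reported diversity less optimistic than a protocol that selects and evaluates on the same data.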

Honest limits

What it can't (yet) do

L1

CMA-ES doesn't scale to big latent spaces

Full-covariance CMA-ES grows costly as the latent dimension rises, so search is limited to moderate sizes. Scalable variants such as DD-CMA-ES are a promising path.

L2

Different need not mean meaningful

FiLM gives controlled functional variation, but offers no guarantee that the variants differ in interpretable or causally meaningful ways.

L3

Mostly image tasks

Evidence spans three image datasets plus a small ViT test. Other modalities and a fuller transformer study are still needed to confirm generality.

L4

A local view

DIVERSE explores a local Rashomon set around one reference model, rather than the entire hypothesis space of all good models.

In one breath

A cheap map of a model's other selves

DIVERSE turns Rashomon-set exploration from an hours-long retraining job into a minutes-long search: freeze the trained network, wrap it in bounded FiLM layers, and evolve a small latent vector with gradient-free CMA-ES. It won't always match retraining's raw diversity, but it gets close, beats fast dropout sampling, and hands you an explicit knob on how different the variants are — practical groundwork for studying multiplicity, uncertainty and fairness at scale.

arXiv:2601.20627 · Digital Future Lab