Train a classifier and you get one model. But many other models reach essentially the same accuracy by reasoning differently, and they can disagree on individual cases. That hidden family is a network's Rashomon set. DIVERSE maps it by freezing the trained model, adding small bounded "knobs", and evolving a tiny latent vector to surface accurate-but-different variants in minutes.
Picture a model that is 95% accurate at spotting a disease. There can be hundreds of other models, all 95% accurate, that decide differently on specific patients — one leans on shape, another on texture. Knowing this whole set matters for trust, fairness, and uncertainty. The catch is that finding it has been slow and hard to control.
The key reframe: instead of searching the network's enormous weight space, DIVERSE searches a small latent modulation space wrapped around the frozen model.
FiLM modulation plus CMA-ES finds diverse, accurate variants of a frozen model — with no retraining and no gradient access.
A disagreement-aware fitness steers how different the variants are, while a soft penalty keeps them inside the accuracy tolerance.
On an MLP, VGG-16 and ResNet-50 (plus a preliminary Vision Transformer), DIVERSE matches or beats dropout's diversity at far less cost than retraining.
Every model below is equally accurate, but each draws the boundary a little differently. Drag the latent vector to morph one variant, or reveal the whole set to see exactly which points the models can't agree on — that disagreement is predictive multiplicity.
Illustrative synthetic data — a 2-D stand-in for the paper's real networks and datasets. It conveys the mechanism (accurate variants that disagree on hard cases), not the paper's actual figures. The violet ring marks the single most-contested point, echoing the paper's highest-disagreement examples.
DIVERSE scores each candidate by how much it disagrees with the reference, times a soft penalty for losing accuracy. Their product peaks at a sweet spot. Move the Rashomon tolerance ε and watch where "as diverse as possible, still accurate" lands.
Illustrative curves following the paper's fitness design F(z) = Div(z) · exp(−Δ² / (2ε²)); the shapes are schematic, not measured.
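To make the sweet spot concrete, here is a minimal Python sketch of that fitness. It assumes Δ is the candidate's accuracy drop relative to the reference, which is this sketch's reading rather than something the page spells out, and the three candidate values are hypothetical numbers chosen for illustration.

```python
import math

def fitness(diversity, acc_drop, eps):
    """F(z) = Div(z) * exp(-delta^2 / (2 * eps^2)).

    acc_drop (delta) is assumed to be reference accuracy minus candidate accuracy.
    """
    if eps <= 0:
        raise ValueError("eps must be positive")
    penalty = math.exp(-(acc_drop ** 2) / (2.0 * eps ** 2))
    return diversity * penalty

# Hypothetical candidates as (diversity, accuracy drop) pairs.
candidates = [(0.05, 0.00), (0.20, 0.02), (0.40, 0.10)]
eps = 0.03
scores = [fitness(div, drop, eps) for div, drop in candidates]
best = max(range(len(scores)), key=lambda i: scores[i])
```

With these numbers the middle candidate scores highest: the first barely disagrees with the reference, and the third loses so much accuracy that the penalty nearly zeroes its score.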
DIVERSE never touches the trained weights. It adds a controllable layer of variation on top and searches it without gradients.
Feature-wise Linear Modulation applies a bounded affine tweak to a layer's activations: scale by γ and shift by β. With γ = 1 + tanh(zW) and β = tanh(zW), the changes stay small and the original weights never move.
A single low-dimensional vector drives every FiLM layer at once. At z = 0 you get the reference model exactly; any nonzero z is a new, coordinated, network-wide variant.
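The modulation can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the paper's implementation: it uses two separate projection matrices (the hypothetical names `w_gamma` and `w_beta`) with random entries, and it modulates a single four-channel activation vector instead of a real network layer.

```python
import math
import random

def film_params(z, w_gamma, w_beta):
    """Map a latent vector z (length d) to per-channel (gamma, beta).

    w_gamma and w_beta are d x C projection matrices given as lists of rows.
    Using two separate matrices is an assumption of this sketch.
    """
    d, channels = len(z), len(w_gamma[0])
    gamma = [1.0 + math.tanh(sum(z[i] * w_gamma[i][c] for i in range(d)))
             for c in range(channels)]
    beta = [math.tanh(sum(z[i] * w_beta[i][c] for i in range(d)))
            for c in range(channels)]
    return gamma, beta

def film(activations, gamma, beta):
    """Apply the affine tweak channel by channel: gamma * a + beta."""
    return [g * a + b for a, g, b in zip(activations, gamma, beta)]

rng = random.Random(0)
d, channels = 2, 4
w_gamma = [[rng.uniform(-1, 1) for _ in range(channels)] for _ in range(d)]
w_beta = [[rng.uniform(-1, 1) for _ in range(channels)] for _ in range(d)]
acts = [0.5, -1.2, 3.0, 0.0]

# z = 0 gives gamma = 1 and beta = 0, so the activations pass through unchanged.
gamma0, beta0 = film_params([0.0, 0.0], w_gamma, w_beta)
reference_out = film(acts, gamma0, beta0)

# A nonzero z shifts every channel at once, within fixed bounds.
gamma1, beta1 = film_params([0.4, -0.3], w_gamma, w_beta)
variant_out = film(acts, gamma1, beta1)
```

Because tanh is bounded, every γ stays in (0, 2) and every β in (−1, 1), whatever z the search proposes.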
Covariance Matrix Adaptation Evolution Strategy samples candidate z's, scores them, and adapts its search distribution toward better ones — no backprop, no gradient access required.
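The sample, score, adapt loop can be illustrated with a much simpler evolution strategy. The sketch below is not CMA-ES: it keeps a fixed isotropic Gaussian and only moves the mean toward the best samples, with no covariance or step-size adaptation. The toy fitness is a hypothetical stand-in for F(z), and a real run would use an actual CMA-ES implementation.

```python
import math
import random

def simple_es(fitness_fn, dim, sigma0=0.3, popsize=16, generations=30, seed=0):
    """Simplified evolution strategy: sample around a mean, keep the best quarter.

    Only the mean is adapted; sigma0 stays fixed (unlike CMA-ES).
    """
    rng = random.Random(seed)
    mean = [0.0] * dim                      # start at z = 0, the reference model
    best_z, best_f = list(mean), fitness_fn(mean)
    mu = max(1, popsize // 4)
    for _ in range(generations):
        # Sample candidate latent vectors around the current mean.
        pop = [[m + sigma0 * rng.gauss(0.0, 1.0) for m in mean]
               for _ in range(popsize)]
        # Score them using fitness values only; no gradients are involved.
        ranked = sorted(pop, key=fitness_fn, reverse=True)
        top_f = fitness_fn(ranked[0])
        if top_f > best_f:
            best_z, best_f = ranked[0], top_f
        # Move the search distribution toward the better candidates.
        elite = ranked[:mu]
        mean = [sum(z[i] for z in elite) / mu for i in range(dim)]
    return best_z, best_f

def toy_fitness(z):
    # Hypothetical objective: a smooth bump peaking at 0.5 in every coordinate.
    return math.exp(-sum((zi - 0.5) ** 2 for zi in z))

best_z, best_f = simple_es(toy_fitness, dim=4)
```

The loop only ever calls `fitness_fn`, which is why the same search can wrap a frozen model whose gradients are unavailable.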
Three datasets spanning a complexity range, two baselines, and metrics for both the size and the internal structure of each Rashomon set.
MNIST (a 3-layer MLP), CIFAR-10 (VGG-16) and PneumoniaMNIST (ResNet-50), with each reference model's accuracy frozen as the Rashomon threshold.
Latent sizes d ∈ {2…64}, 10 initializations, step sizes σ₀ from 0.1 to 0.5, and tolerances ε from 0.01 to 0.05.
Retraining (the costly gold standard) and dropout sampling (fast, but with set membership judged on the test set itself). Adversarial weight perturbation was left out as impractical.
Rashomon Ratio for set size; ambiguity, discrepancy, Viable Prediction Range and Rashomon Capacity for how the variants differ.
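Two of these metrics are easy to show on a toy example. The sketch below uses the usual definitions from the predictive-multiplicity literature: ambiguity is the share of points where at least one variant disagrees with the reference, and discrepancy is the largest share of points that any single variant flips. The paper's exact formulations may differ, and the labels are made up for illustration.

```python
def ambiguity(ref_labels, variant_labels):
    """Fraction of points where any variant's label differs from the reference."""
    n = len(ref_labels)
    contested = sum(
        1 for i in range(n)
        if any(v[i] != ref_labels[i] for v in variant_labels)
    )
    return contested / n

def discrepancy(ref_labels, variant_labels):
    """Largest fraction of points that a single variant labels differently."""
    n = len(ref_labels)
    return max(
        sum(1 for i in range(n) if v[i] != ref_labels[i]) / n
        for v in variant_labels
    )

# Hypothetical predicted labels for four points.
ref = [0, 1, 1, 0]
variants = [
    [0, 1, 0, 0],   # flips point 2
    [1, 1, 1, 0],   # flips point 0
]
amb = ambiguity(ref, variants)      # 0.5: points 0 and 2 are contested
disc = discrepancy(ref, variants)   # 0.25: each variant flips one point in four
```

Ambiguity can exceed discrepancy, as it does here, when different variants disagree with the reference on different points.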
Retraining still tops raw diversity in most cases — DIVERSE doesn't claim to beat it there. What it offers is comparable diversity at a tiny fraction of the cost, and a clear edge over fast dropout sampling.
Generating 640 model variants on CIFAR-10 took DIVERSE about 8.5 minutes, versus roughly 12.5 hours of retraining on the same GPU, a speedup of roughly 88×.
The same modulate-and-search procedure worked on a shallow MLP, VGG-16 and ResNet-50, with an encouraging preliminary result on a Vision Transformer.
On CIFAR-10, DIVERSE exceeded the dropout baseline on every multiplicity metric — while judging validity on held-out validation data rather than the test set.
Honest read: on the hardest datasets only small latent sizes yielded valid sets, and DIVERSE remains slower than dropout, which takes only seconds. Its niche is the middle ground: far cheaper than retraining, more controllable and less optimistic than dropout.
From a single trained model to a validated set of accurate-but-different variants.
The fitness multiplies a diversity score by a soft Gaussian penalty, so the search is pulled toward variants that disagree usefully without leaving the accuracy band.
"Hard" disagreement counts label flips; "soft" disagreement (Total Variation Distance) compares output probabilities. DIVERSE mixes both, equally weighted.
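A small sketch of the two disagreement measures and their mix follows. It assumes the mix is a plain average of the label-flip rate and the mean Total Variation Distance, which is one reading of "equally weighted"; the probability vectors are hypothetical.

```python
def argmax(values):
    """Index of the largest entry (the predicted class)."""
    return max(range(len(values)), key=lambda i: values[i])

def hard_disagreement(probs_a, probs_b):
    """Fraction of examples whose predicted label flips."""
    flips = sum(1 for pa, pb in zip(probs_a, probs_b)
                if argmax(pa) != argmax(pb))
    return flips / len(probs_a)

def soft_disagreement(probs_a, probs_b):
    """Mean Total Variation Distance: half the L1 gap between distributions."""
    tvd = [0.5 * sum(abs(a - b) for a, b in zip(pa, pb))
           for pa, pb in zip(probs_a, probs_b)]
    return sum(tvd) / len(tvd)

def diversity(probs_ref, probs_var, weight=0.5):
    """Assumed mix: weight * hard + (1 - weight) * soft, equal by default."""
    return (weight * hard_disagreement(probs_ref, probs_var)
            + (1.0 - weight) * soft_disagreement(probs_ref, probs_var))

# Hypothetical class probabilities for two examples.
probs_ref = [[0.9, 0.1], [0.6, 0.4]]
probs_var = [[0.8, 0.2], [0.4, 0.6]]
hard = hard_disagreement(probs_ref, probs_var)
soft = soft_disagreement(probs_ref, probs_var)
div = diversity(probs_ref, probs_var)
```

Here only the second example flips its label, so the hard score is 0.5, while the soft score also registers the smaller probability shift on the first example.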
Membership is decided on validation data and diversity reported on a held-out test set — stricter than the dropout baseline, which judges membership on the test set itself.
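The split between membership and reporting can be sketched as a simple filter. The membership rule used here, validation accuracy no lower than the reference minus ε, is an assumption of this sketch, and the candidate records and their field names are hypothetical.

```python
def select_members(candidates, ref_val_acc, eps):
    """Keep candidates whose validation accuracy is within eps of the reference."""
    return [c for c in candidates if c["val_acc"] >= ref_val_acc - eps]

# Hypothetical candidates: accuracy measured on validation data,
# disagreement with the reference measured on the held-out test set.
candidates = [
    {"name": "z1", "val_acc": 0.96, "test_disagreement": 0.04},
    {"name": "z2", "val_acc": 0.94, "test_disagreement": 0.07},
    {"name": "z3", "val_acc": 0.90, "test_disagreement": 0.15},
]

# Membership is decided without looking at the test set.
members = select_members(candidates, ref_val_acc=0.95, eps=0.02)

# Diversity is then reported on test data, for members only.
reported = max(m["test_disagreement"] for m in members)
```

The most diverse candidate, z3, is excluded because it falls outside the accuracy band, so the reported figure comes from z2.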
Full-covariance CMA-ES grows costly as the latent dimension rises, so search is limited to moderate sizes. Scalable variants such as DD-CMA-ES are a promising path.
FiLM gives controlled functional variation, but offers no guarantee that the variants differ in interpretable or causally meaningful ways.
Evidence spans three image datasets plus a small ViT test. Other modalities and a fuller transformer study are still needed to confirm generality.
DIVERSE explores a local Rashomon set around one reference model, rather than the entire hypothesis space of all good models.
DIVERSE turns Rashomon-set exploration from an hours-long retraining job into a minutes-long search: freeze the trained network, wrap it in bounded FiLM layers, and evolve a small latent vector with gradient-free CMA-ES. It won't always match retraining's raw diversity, but it gets close, beats fast dropout sampling, and hands you an explicit knob on how different the variants are — practical groundwork for studying multiplicity, uncertainty and fairness at scale.