Neural nets are usually inspected through static 2D projections (t-SNE, UMAP) that only show points the model has already seen. LAPEX replaces that frozen scatter plot with a Variational Autoencoder used as a generative proxy: every coordinate decodes to a plausible input, so you can sample, interpolate and probe how a black-box classifier behaves in the spaces between the data.
Dimensionality-reduction views give a great overview of clustering, but they are static and bounded to the dataset. Crucially, t-SNE and UMAP have no inverse mapping — you can't decode a point you pick back into an input. A VAE can, and that decoding path is what unlocks interaction.
The twist: the VAE is not explaining itself, and not just minting one-off counterfactuals — it's an interactive proxy for probing how a separate classifier reacts to generated, plausible inputs.
Reframe a VAE's latent space as an interactive, human-in-the-loop interface that probes a separate classifier's predictions — and implement it as LAPEX.
Generative views (continuous decision-space sampling, boundary tracing and counterfactual interpolation) that non-generative dimensionality reduction cannot provide.
A within-subjects comparison of Static t-SNE vs Interactive VAE that characterizes how continuous probing changes exploration strategies.
Seven actionable guidelines grounded in observed behavior, task outcomes, order effects and concrete failure cases.
A stylized re-creation of LAPEX on a FashionMNIST-style classifier. Flip between the Static baseline and the Interactive proxy, toggle the probes, hover to "decode" a point, and filter classes. The Static side only ever offers points and hulls — the exact asymmetry the paper studies.
In Interactive mode the generative probes are available. The latent grid fills the whole space by decoding a dense lattice; the latent triangles summarize predictions over a Delaunay triangulation of anchor points.
Pick two anchors, slide between them, and watch the decoded input — and the verdict — change continuously. You discover a decision boundary instead of guessing it. In the study, this one capability lifted counterfactual-task accuracy from 63% to 90%.
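The slider can be pictured as a walk along the straight line between two latent anchors, decoding and re-classifying at every step. The sketch below is a minimal illustration under stated assumptions: `toy_decode` and `toy_classify` are hypothetical stand-ins for the VAE decoder and the black-box classifier, and linear interpolation is assumed because the text does not say which scheme LAPEX uses.

```python
import math

def lerp(z_a, z_b, t):
    """Linear interpolation between two latent vectors at position t in [0, 1]."""
    return [(1.0 - t) * a + t * b for a, b in zip(z_a, z_b)]

# Hypothetical stand-ins: a real setup would call the VAE decoder and the
# black-box classifier here.
def toy_decode(z):
    """Pretend decoder: maps a 2D latent point to a tiny flat 'image'."""
    return [math.tanh(z[0]), math.tanh(z[1]), math.tanh(z[0] + z[1])]

def toy_classify(x):
    """Pretend classifier: returns (label, confidence) for a decoded input."""
    score = 1.0 / (1.0 + math.exp(-4.0 * x[2]))  # sigmoid on one feature
    return (1, score) if score >= 0.5 else (0, 1.0 - score)

def trace_boundary(z_a, z_b, decode, classify, steps=21):
    """Walk from z_a to z_b and record every t at which the predicted label flips."""
    path, flips = [], []
    prev_label = None
    for i in range(steps):
        t = i / (steps - 1)
        label, conf = classify(decode(lerp(z_a, z_b, t)))
        path.append((t, label, conf))
        if prev_label is not None and label != prev_label:
            flips.append(t)
        prev_label = label
    return path, flips

path, flips = trace_boundary([-1.0, -1.0], [1.0, 1.0], toy_decode, toy_classify)
```

Each entry in `flips` marks a slider position where the verdict changes, which is the point a participant would inspect to see what the boundary looks like in input space.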
Decoded tiles are hypotheses, not ground truth: smooth reconstructions are not guaranteed to stay on the data distribution. LAPEX therefore keeps real points visible as anchors and recommends flagging out-of-distribution status (Guideline G5).
Each probe is a toggleable mask over the latent space. They share one layout; only the available overlays differ between conditions.
The base 2D view of encoded instances — t-SNE in Static, the VAE's 2D latent (or a PCA of it) in Interactive. Points can show ground truth, prediction, and prediction-on-reconstruction, sorted into prototypes, criticisms and hull edges.
Per-class outlines showing overlap, single-class dominance and empty regions at a glance. Not a real decision boundary, and outlier-sensitive — so users fell back to raw points for fine calls.
A dense lattice decoded and re-classified everywhere; color = prediction, saturation = confidence. Fills gaps, resolves overlaps, and a slider interpolates between two points to trace boundaries. Best-received generative probe.
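A rough sketch of how such a grid could be computed: lay an evenly spaced lattice over the 2D latent view, run each coordinate through decode-then-classify, and keep the winning class (hue) and its probability (saturation). `toy_predict` is a hypothetical placeholder for the decoder and classifier chained together; the lattice bounds and resolution are arbitrary choices for illustration.

```python
import math

def make_lattice(lo, hi, n):
    """n x n evenly spaced 2D latent coordinates covering [lo, hi] on both axes."""
    step = (hi - lo) / (n - 1)
    axis = [lo + i * step for i in range(n)]
    return [[(x, y) for x in axis] for y in axis]

def toy_predict(z):
    """Hypothetical decode-then-classify step returning class probabilities."""
    logits = [z[0], z[1], -(z[0] + z[1])]  # three toy classes
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def grid_cells(lattice, predict):
    """Per lattice point: argmax class (color) and its probability (saturation)."""
    cells = []
    for row in lattice:
        out_row = []
        for z in row:
            probs = predict(z)
            label = max(range(len(probs)), key=probs.__getitem__)
            out_row.append({"z": z, "label": label, "saturation": probs[label]})
        cells.append(out_row)
    return cells

cells = grid_cells(make_lattice(-3.0, 3.0, 7), toy_predict)
```

Cells near the middle of this toy space come out with low saturation because all three classes are equally likely there, which is how an uncertain region would look in the rendered grid.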
A Delaunay triangulation over anchors, each triangle colored by the prediction on its decoded centroid. Selective and anchor-tied — but the most polarizing probe: "order in the chaos" to some, "too much" to others.
Decoded "bridges" between adjacent anchors (and barycentric ones inside triangles). They show which features change between two real samples — implicit counterfactuals for visual comparison.
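Both the bridges and the triangle centroids reduce to convex combinations of latent anchors. The following is a minimal sketch of that arithmetic only; the function names are hypothetical, and in the real system each resulting latent point would still be passed through the decoder and the classifier.

```python
def barycentric_mix(anchors, weights):
    """Convex combination of latent anchors; weights are non-negative and sum to 1."""
    if any(w < 0 for w in weights) or abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must be non-negative and sum to 1")
    dim = len(anchors[0])
    return [sum(w * a[d] for w, a in zip(weights, anchors)) for d in range(dim)]

def edge_bridge(z_a, z_b, steps=5):
    """Evenly spaced latent points on the segment between two adjacent anchors."""
    return [
        barycentric_mix([z_a, z_b], [1.0 - i / (steps - 1), i / (steps - 1)])
        for i in range(steps)
    ]

def triangle_centroid(z_a, z_b, z_c):
    """Equal-weight mix of three anchors, the point a triangle would be colored by."""
    return barycentric_mix([z_a, z_b, z_c], [1.0 / 3.0] * 3)

bridge = edge_bridge([0.0, 0.0], [4.0, 0.0])
centroid = triangle_centroid([0.0, 0.0], [3.0, 0.0], [0.0, 3.0])
```

Other weight triples, for example `[0.5, 0.25, 0.25]`, give the interior bridge points inside a triangle.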
Per Klein's data–frame theory, probes are low-cost scaffolds: form a hunch, test it with a small interpolation, revise. The decoder makes sampling sparse regions cheap, raising the yield of exploration.
Sixteen computer scientists completed the same tasks in both conditions, counterbalanced (Latin square) across conditions and datasets. Think-aloud throughout; sessions ~98 min. Approved by Hasselt University's ethics committee. The work appears at EICS 2026.
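One way to picture the counterbalancing is a 2x2 scheme that crosses which condition comes first with which dataset comes first, then cycles participants through the four resulting cells. This is an illustrative assumption: the text only states that a Latin square was used across conditions and datasets, not the exact assignment.

```python
from itertools import product

CONDITIONS = ("Static t-SNE", "Interactive VAE")
DATASETS = ("MNIST", "FashionMNIST")

def counterbalanced_schedule(n_participants):
    """Cycle participants through the 2x2 cells of condition order and dataset order."""
    # Each cell is (index of the first condition, index of the first dataset).
    cells = list(product((0, 1), (0, 1)))
    schedule = []
    for p in range(n_participants):
        c_first, d_first = cells[p % len(cells)]
        sessions = [
            (CONDITIONS[c_first], DATASETS[d_first]),
            (CONDITIONS[1 - c_first], DATASETS[1 - d_first]),
        ]
        schedule.append(sessions)
    return schedule

schedule = counterbalanced_schedule(16)
```

With sixteen participants every cell is filled four times, so each condition is seen first equally often and is paired with each dataset equally often.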
Conditions: Static t-SNE vs Interactive VAE. Datasets: MNIST & FashionMNIST (both 28×28, 10 classes), switched between conditions.
MNIST: a 2D latent, β = 16 (no extra reduction needed). FashionMNIST: a 16D latent, β = 8, projected via PCA for the 2D view.
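For readers unfamiliar with the β values above: assuming the proxies follow the standard β-VAE formulation (the text does not spell out the loss), β weights the KL term of the training objective, with encoder q_φ, decoder p_θ and prior p(z):

```latex
\mathcal{L}_{\beta}(\theta, \phi; x) =
  \mathbb{E}_{q_\phi(z \mid x)}\left[ -\log p_\theta(x \mid z) \right]
  + \beta \, D_{\mathrm{KL}}\left( q_\phi(z \mid x) \,\|\, p(z) \right)
```

Under that reading, β = 1 recovers the ordinary VAE objective, and larger values such as 8 or 16 pull the encoded distribution closer to the prior at some cost in reconstruction quality.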
System Causability (SCS), Explanation Satisfaction (ESS), and an XAI Trust scale — plus per-probe ratings and per-task confidence.
Not a speed test: the focus is exploration strategies and sensemaking. No significant timing differences were found between approaches.
Identify when the model picks one class over another; draw the boundary and justify it.
Find an image of class A; explain the minimal changes that flip it to class B. This is the task where the Interactive condition clearly won.
Rank three class pairs from most to least similar, with reasoning.
Across three regions — out-of-bounds Pilea, overlapping Soricida, single-class Arenaria — say where you trust the model.
The two approaches proved complementary: Static t-SNE gives clearer boundaries and more confident boundary calls; the Interactive VAE enables richer exploration and much better counterfactual reasoning. (Some effects were underpowered at N=16.)
90% Interactive vs 63% Static on the counterfactual task, a significant lift. 10 of 16 participants named image morphing the single most useful distinguishing feature.
Nine participants changed their trust ranking once the latent grid revealed confident predictions in the empty Pilea region — trusting confident generated samples over noisy real overlap.
Median XAI-Trust rose 0.81 → 0.83; "works very quickly" was rated significantly higher for the Interactive version. SCS median rose 0.75 → 0.81.
t-SNE's crisp clusters supported more confident boundary calls; the PCA-projected latent showed more overlap. A recurring failure mode: users expect an explicit line even where none exists.
Participants who started Interactive rated the Static version significantly worse afterwards — they kept reaching for the missing morphing & grid interpolation.
The scatter plot was the most used, most understandable and most trusted probe in both conditions. The latent grid and interpolation were the best-received generative probes; the triangles split opinion.
Train several VAE proxies, pick one on a Pareto front, and hand it to a modular API that exposes one Router per probe to a React frontend. Caching the heavy, immutable results (prototypes, criticisms, grid samples) cuts time-to-explore from minutes to seconds.
Sweep latent dimension and β; score on reconstruction (loss, FID), fidelity to the classifier (task accuracy, fidelity accuracy, ShapGAP) and reduction quality (random-triplet accuracy). Pick the proxy that best matches the classifier without sacrificing reconstruction.
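Picking from a Pareto front means discarding every candidate that another candidate beats or ties on all metrics and strictly beats on at least one. The sketch below shows that filter on two of the metrics; the candidate names and all numbers are made up for illustration and are not results from the study.

```python
def pareto_front(candidates, objectives):
    """Return the candidates that no other candidate dominates.

    `objectives` maps a metric name to "min" or "max".
    """
    def no_worse(a, b, key, sense):
        return a[key] <= b[key] if sense == "min" else a[key] >= b[key]

    def strictly_better(a, b, key, sense):
        return a[key] < b[key] if sense == "min" else a[key] > b[key]

    def dominates(a, b):
        return (all(no_worse(a, b, k, s) for k, s in objectives.items())
                and any(strictly_better(a, b, k, s) for k, s in objectives.items()))

    return [c for c in candidates
            if not any(dominates(other, c) for other in candidates if other is not c)]

# Made-up scores for four hypothetical proxies (illustration only).
sweep = [
    {"name": "proxy_a", "recon_loss": 0.30, "fidelity_acc": 0.80},
    {"name": "proxy_b", "recon_loss": 0.20, "fidelity_acc": 0.90},
    {"name": "proxy_c", "recon_loss": 0.15, "fidelity_acc": 0.88},
    {"name": "proxy_d", "recon_loss": 0.25, "fidelity_acc": 0.85},
]
front = pareto_front(sweep, {"recon_loss": "min", "fidelity_acc": "max"})
```

Here `proxy_b` and `proxy_c` survive: one has the higher fidelity, the other the lower reconstruction loss, and choosing between them is the judgment call the text describes.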
Python 3.13.2 · PyTorch · Flask. An abstract base class handles caching, metadata and requests; concrete subclasses implement the Static (scikit-learn t-SNE) or Interactive (VAE) method. Prototypes & criticisms via MMD-critic. Adding a probe = one Router + one frontend view.
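The class layout described above might look roughly like this. It is a sketch, not the project's code: every class, method and probe name is hypothetical, the Flask routing layer is left out, and an in-memory dictionary stands in for whatever cache LAPEX actually uses.

```python
from abc import ABC, abstractmethod

class ProbeMethod(ABC):
    """Hypothetical base class: shared caching and metadata, method-specific compute."""

    name = "base"

    def __init__(self):
        self._cache = {}
        self.compute_calls = 0  # counts cache misses, handy for checking the cache

    def get(self, probe, **params):
        """Serve a probe result, computing it only on the first request."""
        if probe not in self.supported_probes():
            raise KeyError(f"{self.name} method has no probe named {probe!r}")
        key = (probe, tuple(sorted(params.items())))  # params must be hashable
        if key not in self._cache:
            self.compute_calls += 1
            self._cache[key] = self._compute(probe, **params)
        return self._cache[key]

    def metadata(self):
        return {"method": self.name, "probes": sorted(self.supported_probes())}

    @abstractmethod
    def supported_probes(self):
        """Names of the probes this method can serve."""

    @abstractmethod
    def _compute(self, probe, **params):
        """Do the expensive work for one probe request."""

class StaticMethod(ProbeMethod):
    name = "static"

    def supported_probes(self):
        return {"scatter", "hulls"}  # points and hulls only

    def _compute(self, probe, **params):
        return {"method": self.name, "probe": probe, "params": params}  # placeholder

class InteractiveMethod(ProbeMethod):
    name = "interactive"

    def supported_probes(self):
        return {"scatter", "hulls", "grid", "triangles", "bridges"}

    def _compute(self, probe, **params):
        return {"method": self.name, "probe": probe, "params": params}  # placeholder
```

Calling `InteractiveMethod().get("grid", resolution=32)` twice on the same instance runs `_compute` once, while `StaticMethod().get("grid")` raises `KeyError`, mirroring the asymmetry between the two conditions.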
React over an HTML5 canvas with an InteractiveCanvas abstraction for pan/zoom/hit-testing. For local class comparisons, LAPEX swaps PCA for an SSIM ranker: perturb each latent dim by ±3 over 32 anchors and keep the two dimensions with the largest visual change.
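The SSIM ranker can be approximated as follows. Several things here are assumptions rather than the project's implementation: SSIM is computed over a single global window instead of the usual sliding window, each perturbed decode is compared against the unperturbed decode of the same anchor, the anchors are random, and `toy_decode` is a hypothetical decoder in which only dimensions 2 and 3 have a visible effect.

```python
import math
import random

def global_ssim(x, y, data_range=1.0):
    """Single-window SSIM over two flat images (simplified: no sliding window)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    vx = sum((a - mx) ** 2 for a in x) / n
    vy = sum((b - my) ** 2 for b in y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    c1, c2 = (0.01 * data_range) ** 2, (0.03 * data_range) ** 2
    return ((2 * mx * my + c1) * (2 * cov + c2)) / ((mx**2 + my**2 + c1) * (vx + vy + c2))

def rank_latent_dims(anchors, decode, delta=3.0, keep=2):
    """Score each latent dimension by the mean (1 - SSIM) after a +/- delta nudge."""
    n_dims = len(anchors[0])
    scores = []
    for d in range(n_dims):
        total = 0.0
        for z in anchors:
            base = decode(z)
            for sign in (1.0, -1.0):
                moved = list(z)
                moved[d] += sign * delta
                total += 1.0 - global_ssim(base, decode(moved))
        scores.append(total / (2 * len(anchors)))
    order = sorted(range(n_dims), key=lambda d: scores[d], reverse=True)
    return order[:keep], scores

# Hypothetical decoder: dimension 0 is dead, dimension 1 is almost dead.
SCALES = [0.0, 1e-4, 2.0, 1.0]

def toy_decode(z):
    pixels = []
    for i in range(16):
        pre = sum(s * (1.0 if (i >> d) & 1 else -1.0) * z[d]
                  for d, s in enumerate(SCALES))
        pixels.append(1.0 / (1.0 + math.exp(-pre)))
    return pixels

rng = random.Random(0)
anchors = [[rng.uniform(-1.0, 1.0) for _ in SCALES] for _ in range(32)]
top_dims, scores = rank_latent_dims(anchors, toy_decode)
```

The two indices in `top_dims` would then serve as the axes of the local 2D view in place of the PCA projection.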
You explore a learned proxy, not the classifier's true input distribution. Predictions are real, but reconstructions may drift off the data manifold — treat probed inputs as cues, not facts.
Exploration can be demanding; participants hesitated on harder tasks (with no clear link to AI expertise). LAPEX targets developers and researchers.
The SSIM-based local probes come from qualitative insights but weren't re-evaluated with users — read them as design implications.
Explanations can expose boundaries, shortcuts and sensitive information, opening models to attacks. Use with care outside controlled environments.
Static t-SNE stayed "very helpful" for orientation; the generative VAE layer made it "even better" for counterfactual discovery and reasoning about sparse or out-of-bounds regions. The best choice is task-dependent, so LAPEX offers both, defaults to the full generative interface, and stays modular enough to grow new probes.