LAPEX — Latent-space Proxy Exploration
From Embeddings to Exploration
Interactive latent-space visualizations for AI model sensemaking — turning a static 2D embedding into a navigable, generative workspace.

Neural nets are usually inspected through static 2D projections (t-SNE, UMAP) that only show points the model has already seen. LAPEX replaces that frozen scatter plot with a Variational Autoencoder used as a generative proxy: every coordinate decodes to a plausible input, so you can sample, interpolate and probe how a black-box classifier behaves in the spaces between the data.

Formative study · N = 16 | Static t-SNE vs Interactive VAE | MNIST · FashionMNIST | 7 guidelines
The gap

A scatter plot of past data can't answer "what if?"

Dimensionality-reduction views give a great overview of clustering, but they are static and bounded to the dataset. Crucially, t-SNE and UMAP have no inverse mapping — you can't decode a point you pick back into an input. A VAE can, and that decoding path is what unlocks interaction.

Static embeddings · t-SNE / UMAP

Inspect what the model has seen

  • Only existing data points are shown — no in-between inputs
  • No inverse mapping: you cannot decode a chosen location
  • Weak affordances for hypotheses & counterfactual reasoning
  • Boundaries look crisp, which can breed overconfidence
reframed as
Generative proxy · β-VAE

Inspect what the model does

  • A continuous latent space with an explicit decoder
  • Every coordinate decodes to a plausible input to re-classify
  • Supports continuous sampling, interpolation & region probing
  • Becomes a human-in-the-loop sensemaking workspace

The twist: the VAE is not explaining itself, and not just minting one-off counterfactuals — it's an interactive proxy for probing how a separate classifier reacts to generated, plausible inputs.
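The proxy loop described above (pick a coordinate, decode it, re-classify the result) can be sketched as follows. This is a minimal illustration under loud assumptions: `decode` and `classify` are toy linear stand-ins with random weights, not LAPEX's trained β-VAE or the classifier under study, and the name `probe` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins (assumptions, not LAPEX's models):
# a linear "decoder" from a 2D latent to 28x28 pixels and a
# linear softmax "classifier" over 10 classes.
W_dec = rng.normal(size=(2, 784))
W_clf = rng.normal(size=(784, 10))

def decode(z):
    """Map latent points of shape (n, 2) to flattened images in [0, 1]."""
    return 1.0 / (1.0 + np.exp(-(z @ W_dec)))

def classify(x):
    """Return class probabilities of shape (n, 10) for flattened images."""
    logits = x @ W_clf
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    p = np.exp(logits)
    return p / p.sum(axis=1, keepdims=True)

def probe(z):
    """Decode latent points, then ask the separate classifier about them."""
    z = np.atleast_2d(np.asarray(z, dtype=float))
    x = decode(z)
    probs = classify(x)
    return x, probs.argmax(axis=1), probs.max(axis=1)

# Any coordinate can be probed, including ones no data point occupies.
image, label, confidence = probe([0.3, -1.2])
```

The key design point is that the classifier only ever sees the decoded input, so the latent space acts purely as a navigation surface for generating inputs.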

Contributions

Four contributions, one system.

C1

Latent spaces as interfaces

Reframe a VAE's latent space as an interactive, human-in-the-loop interface that probes a separate classifier's predictions — and implement it as LAPEX.

C2

Interactive probes

Generative views (continuous decision-space sampling, boundary tracing and counterfactual interpolation) that are unattainable with non-generative dimensionality reduction.

C3

Formative study (N=16)

A within-subjects comparison of Static t-SNE vs Interactive VAE that characterizes how continuous probing changes exploration strategies.

C4

Engineering guidelines

Seven actionable guidelines grounded in observed behavior, task outcomes, order effects and concrete failure cases.

Try it · interactive

A live, miniature LAPEX.

A stylized re-creation of LAPEX on a FashionMNIST-style classifier. Flip between the Static baseline and the Interactive proxy, toggle the probes, hover to "decode" a point, and filter classes. The Static side only ever offers points and hulls — the exact asymmetry the paper studies.

Latent space · decision-space probes
drag to pan · scroll to zoom · hover to decode
Data scatter
Data points
Convex hulls
Latent-space insights · VAE only
Latent grid
Latent triangles
Confidence shading

Interactive mode · generative probes available. The latent grid fills the whole space by decoding a dense lattice; the latent triangles summarize predictions over a Delaunay triangulation of anchor points.

The headline interaction

Counterfactual interpolation — morph one class into another.

Pick two anchors, slide between them, and watch the decoded input — and the verdict — change continuously. You discover a decision boundary instead of guessing it. In the study, this one capability lifted counterfactual-task accuracy from 63% to 90%.

Demo: a slider moves the interpolation parameter t between anchor "7" and anchor "9". Near an anchor the classifier's prediction is confident and stable; a marker shows where the decision boundary is crossed.

Decoded tiles are hypotheses, not ground truth: smooth reconstructions are not guaranteed to stay on the data distribution. LAPEX therefore keeps real points visible as anchors and recommends flagging out-of-distribution status (Guideline G5).
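Counterfactual interpolation can be sketched as a straight-line walk between two latent anchors that records where the predicted label first changes. This is a sketch under stated assumptions: `toy_predict` is a hypothetical stand-in for "decode, then classify" (it splits the plane at x = 0.1), and `interpolate` and `first_flip` are illustrative names, not LAPEX functions.

```python
import numpy as np

def interpolate(z_a, z_b, steps=11):
    """Linear path in latent space from z_a (t = 0) to z_b (t = 1)."""
    t = np.linspace(0.0, 1.0, steps)[:, None]
    return (1.0 - t) * np.asarray(z_a, float) + t * np.asarray(z_b, float)

def first_flip(path, predict):
    """Return (index of first step whose label differs from the start, labels).

    The index is None when the label never changes along the path.
    """
    labels = predict(path)
    changed = np.nonzero(labels != labels[0])[0]
    return (int(changed[0]) if changed.size else None), labels

# Hypothetical stand-in for the classifier applied to decoded inputs:
# class 7 left of x = 0.1, class 9 otherwise.
def toy_predict(z):
    return np.where(z[:, 0] < 0.1, 7, 9)

path = interpolate([-1.0, 0.0], [1.0, 0.5], steps=11)
flip_index, labels = first_flip(path, toy_predict)
```

With a real decoder, the images on either side of `flip_index` are the pair worth inspecting: they differ minimally yet receive different labels, subject to the caveat above that decoded samples may leave the data distribution.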

The five probes

A familiar scatter plot, with generative overlays.

Each probe is a toggleable mask over the latent space. They share one layout; only the available overlays differ between conditions.

BOTH CONDITIONS

Scatter plot

The base 2D view of encoded instances — t-SNE in Static, the VAE's 2D latent (or a PCA of it) in Interactive. Points can show ground truth, prediction, and prediction-on-reconstruction, sorted into prototypes, criticisms and hull edges.

most used · reliable anchors
BOTH CONDITIONS

Convex hulls

Per-class outlines showing overlap, single-class dominance and empty regions at a glance. The hulls are not real decision boundaries and are sensitive to outliers, so users fell back to raw points for fine-grained calls.

overlap at a glance · outlier-sensitive
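Per-class hulls can be computed with any standard convex-hull routine. The sketch below uses Andrew's monotone chain in plain Python; it is an illustration, not the routine LAPEX uses, and the point coordinates and class names are made up.

```python
def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices counter-clockwise."""
    pts = sorted(set(map(tuple, points)))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        # z-component of (a - o) x (b - o); positive means a left turn.
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    # The last point of each chain is the first point of the other.
    return lower[:-1] + upper[:-1]

def hulls_by_class(points, labels):
    """Group 2D points by label and compute one hull per class."""
    groups = {}
    for p, y in zip(points, labels):
        groups.setdefault(y, []).append(p)
    return {y: convex_hull(ps) for y, ps in groups.items()}

# Made-up embedded points for two classes.
points = [(0, 0), (2, 0), (2, 2), (0, 2), (1, 1), (5, 5), (6, 5), (5, 6)]
labels = ["shirt"] * 5 + ["boot"] * 3
hulls = hulls_by_class(points, labels)
```

The outlier sensitivity mentioned above follows directly from the construction: a single far-away point of a class becomes a hull vertex and stretches the whole outline toward it.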
VAE ONLY

Latent grid + interpolation

A dense lattice decoded and re-classified everywhere; color = prediction, saturation = confidence. Fills gaps, resolves overlaps, and a slider interpolates between two points to trace boundaries. Best-received generative probe.

fills empty space · boundary tracing
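The latent grid can be sketched as a lattice of coordinates, each mapped to a predicted class (for color) and a maximum class probability (for saturation). In this sketch, `toy_predict_proba` is a hypothetical two-class stand-in for "decode, then classify", and the function names are illustrative assumptions.

```python
import numpy as np

def latent_grid(x_range, y_range, n=5):
    """Return an (n*n, 2) array of lattice coordinates covering the view."""
    xs = np.linspace(x_range[0], x_range[1], n)
    ys = np.linspace(y_range[0], y_range[1], n)
    gx, gy = np.meshgrid(xs, ys)
    return np.stack([gx.ravel(), gy.ravel()], axis=1)

def grid_overlay(z, predict_proba):
    """Color = predicted class, saturation = confidence of that class."""
    probs = predict_proba(z)
    return probs.argmax(axis=1), probs.max(axis=1)

# Hypothetical stand-in for the classifier on decoded inputs: two classes
# split at x = 0, with confidence growing with distance from the split.
def toy_predict_proba(z):
    p1 = 1.0 / (1.0 + np.exp(-3.0 * z[:, 0]))
    return np.stack([1.0 - p1, p1], axis=1)

z = latent_grid((-2.0, 2.0), (-2.0, 2.0), n=5)
labels, saturation = grid_overlay(z, toy_predict_proba)
label_map = labels.reshape(5, 5)      # rows follow y, columns follow x
sat_map = saturation.reshape(5, 5)
```

Low-saturation cells mark where the prediction is least certain, which is where a boundary is most likely to run.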
VAE ONLY

Latent triangles

A Delaunay triangulation over anchors, each triangle colored by the prediction on its decoded centroid. Selective and anchor-tied — but the most polarizing probe: "order in the chaos" to some, "too much" to others.

anchor-driven · polarizing
VAE ONLY

Interpolated images

Decoded "bridges" between adjacent anchors (and barycentric ones inside triangles). They show which features change between two real samples — implicit counterfactuals for visual comparison.

feature change · implicit counterfactuals
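Barycentric interpolation inside a triangle is a convex combination of its three latent anchors; the centroid used for triangle coloring is the special case with equal weights. The sketch below computes only the latent coordinates, which would then be passed to the decoder (not shown). `barycentric_point` is a hypothetical helper name.

```python
import numpy as np

def barycentric_point(anchors, weights):
    """Convex combination of three latent anchors.

    anchors: array-like of shape (3, d); weights: three non-negative
    numbers that sum to 1.
    """
    anchors = np.asarray(anchors, dtype=float)
    w = np.asarray(weights, dtype=float)
    if anchors.shape[0] != 3 or w.shape != (3,):
        raise ValueError("expected three anchors and three weights")
    if np.any(w < 0) or not np.isclose(w.sum(), 1.0):
        raise ValueError("weights must be non-negative and sum to 1")
    return w @ anchors

triangle = [[0.0, 0.0], [3.0, 0.0], [0.0, 3.0]]
centroid = barycentric_point(triangle, [1 / 3, 1 / 3, 1 / 3])
edge_midpoint = barycentric_point(triangle, [0.5, 0.5, 0.0])
```

Setting one weight to zero recovers the two-anchor "bridge" along an edge, so the same helper covers both the edge and the interior cases.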
SENSEMAKING LENS

Cheap-to-test frames

Per Klein's data–frame theory, probes are low-cost scaffolds: form a hunch, test it with a small interpolation, revise. The decoder makes sampling sparse regions cheap, raising the yield of exploration.

data–frame theory · information foraging
How it was studied

A within-subjects formative study.

Sixteen computer scientists completed the same tasks in both conditions, counterbalanced (Latin square) across conditions and datasets. Think-aloud throughout; sessions ~98 min. Approved by Hasselt University's ethics committee. The work appears at EICS 2026.

SETUP

2 × 2 design

Conditions: Static t-SNE vs Interactive VAE. Datasets: MNIST & FashionMNIST (both 28×28, 10 classes), switched between conditions.
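One plausible way to enumerate a counterbalanced 2 × 2 schedule is sketched below. It is an assumption for illustration only: the text does not specify how participants were assigned to orders, and `session_orders` and `assign` are hypothetical names.

```python
from itertools import product

conditions = ["Static t-SNE", "Interactive VAE"]
datasets = ["MNIST", "FashionMNIST"]

def session_orders():
    """All four orders in which both the condition and the dataset switch."""
    orders = []
    for first_cond, first_data in product(conditions, datasets):
        second_cond = conditions[1 - conditions.index(first_cond)]
        second_data = datasets[1 - datasets.index(first_data)]
        orders.append(((first_cond, first_data), (second_cond, second_data)))
    return orders

def assign(n_participants):
    """Cycle participants through the orders so each order is used equally."""
    orders = session_orders()
    return {p: orders[p % len(orders)] for p in range(n_participants)}

schedule = assign(16)
```

With 16 participants and four orders, each order is used four times, and every participant sees each condition and each dataset exactly once.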

MODELS

β-VAE proxies

MNIST: a 2D latent, β = 16 (no extra reduction needed). FashionMNIST: a 16D latent, β = 8, projected via PCA for the 2D view.

MEASURES

3 standardized scales

System Causability (SCS), Explanation Satisfaction (ESS), and an XAI Trust scale — plus per-probe ratings and per-task confidence.

LENS

Reasoning, not speed

Not a speed test: the focus is exploration strategies and sensemaking. No significant timing differences were found between approaches.

T1 · WHEN?

Decision boundary

Identify when the model picks one class over another; draw the boundary and justify it.

T2 · HOW TO?

Counterfactual morph

Find an image of class A; explain the minimal changes that flip it to class B. The interactive win.

T3 · SIMILARITY

Pairwise ranking

Rank three class pairs from most to least similar, with reasoning.

T4 · TRUST

Regional trust calibration

Across three regions — out-of-bounds Pilea, overlapping Soricida, single-class Arenaria — say where you trust the model.

What they found

Generative interaction reshaped how people reasoned.

The two approaches proved complementary: Static t-SNE gives clearer boundaries and more confident boundary calls; the Interactive VAE enables richer exploration and much better counterfactual reasoning. (Some effects were underpowered at N=16.)

TASK 2 · COUNTERFACTUALS

90% accuracy

Interactive vs 63% Static — a significant lift. 10 / 16 participants named image morphing the single most useful distinguishing feature.

TASK 4 · TRUST CALIBRATION

9 / 16 flipped

Nine participants changed their trust ranking once the latent grid revealed confident predictions in the empty Pilea region — trusting confident generated samples over noisy real overlap.

PERCEIVED EFFICIENCY

Trust ↑ 0.81 → 0.83

Median XAI-Trust rose 0.81 → 0.83; "works very quickly" was rated significantly higher for the Interactive version. SCS median rose 0.75 → 0.81.

TASK 1 · BOUNDARIES

Static wins here

t-SNE's crisp clusters supported more confident boundary calls; the PCA-projected latent showed more overlap. A recurring failure mode: users expect an explicit line even where none exists.

ORDER EFFECT

Can't go back

Participants who started Interactive rated the Static version significantly worse afterwards — they kept reaching for the missing morphing & grid interpolation.

PROBE RANKING

Scatter ≫ grid ≫ triangles

The scatter plot was most used, understandable & trusted in both conditions. The latent grid + interpolation were the best-received generative probes; triangles split opinion.

C4 · Engineering guidelines

Seven rules for building latent-space explorers.

C1 · How it's built

From a black box to a navigable workspace, in five stages.

Train several VAE proxies, pick one on a Pareto front, and hand it to a modular API that exposes one Router per probe to a React frontend. Caching the heavy, immutable results (prototypes, criticisms, grid samples) cuts time-to-explore from minutes to seconds.

PROXY SELECTION

Grid search + Pareto front

Sweep latent dimension and β; score on reconstruction (loss, FID), fidelity to the classifier (task accuracy, fidelity accuracy, ShapGAP) and reduction quality (random-triplet accuracy). Pick the proxy that best matches the classifier without sacrificing reconstruction.
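Selecting on a Pareto front means keeping only the candidates that no other candidate beats on every metric at once. The sketch below shows the filter for one metric to minimize and one to maximize. The candidate names and all numbers are invented for illustration; they are not results from the LAPEX sweep.

```python
def pareto_front(candidates, minimize, maximize):
    """Return the candidates that no other candidate dominates."""

    def dominates(a, b):
        # a dominates b if it is no worse on every metric and strictly
        # better on at least one.
        no_worse = (all(a[k] <= b[k] for k in minimize)
                    and all(a[k] >= b[k] for k in maximize))
        better = (any(a[k] < b[k] for k in minimize)
                  or any(a[k] > b[k] for k in maximize))
        return no_worse and better

    return [c for c in candidates
            if not any(dominates(other, c) for other in candidates)]

# Invented sweep results (hypothetical names and numbers).
candidates = [
    {"name": "proxy_a", "recon_loss": 0.30, "fidelity_acc": 0.90},
    {"name": "proxy_b", "recon_loss": 0.20, "fidelity_acc": 0.93},
    {"name": "proxy_c", "recon_loss": 0.15, "fidelity_acc": 0.85},
    {"name": "proxy_d", "recon_loss": 0.25, "fidelity_acc": 0.91},
]
front = pareto_front(candidates,
                     minimize=["recon_loss"],
                     maximize=["fidelity_acc"])
front_names = [c["name"] for c in front]
```

Here `proxy_b` and `proxy_c` survive: one matches the classifier better, the other reconstructs better. Choosing between them is the judgment call described above, favoring classifier fidelity as long as reconstruction stays acceptable.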

BACKEND

Routers on an abstract API

Python 3.13.2 · PyTorch · Flask. An abstract base class handles caching, metadata and requests; concrete subclasses implement the Static (scikit-learn t-SNE) or Interactive (VAE) method. Prototypes & criticisms via MMD-critic. Adding a probe = one Router + one frontend view.
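The division of labor between an abstract base class and per-probe subclasses can be sketched as below. This is a hypothetical outline, not LAPEX's actual API: the class names, the `handle` and `compute` methods and the cache keying are all assumptions, and the Flask wiring is omitted.

```python
from abc import ABC, abstractmethod

class Router(ABC):
    """Hypothetical base class: request handling, metadata and caching."""

    name = "router"

    def __init__(self):
        self._cache = {}
        self.compute_calls = 0  # exposed only to make caching observable

    def handle(self, **params):
        """Serve a request, computing the result once per parameter set."""
        key = tuple(sorted(params.items()))
        if key not in self._cache:
            self.compute_calls += 1
            self._cache[key] = self.compute(**params)
        return {"probe": self.name, "params": params,
                "result": self._cache[key]}

    @abstractmethod
    def compute(self, **params):
        """Produce the immutable result for one probe request."""

class LatentGridRouter(Router):
    """Hypothetical concrete probe: lattice coordinates in [-1, 1]^2."""

    name = "latent_grid"

    def compute(self, resolution=4):
        step = 2.0 / (resolution - 1)
        return [(-1.0 + i * step, -1.0 + j * step)
                for j in range(resolution) for i in range(resolution)]

router = LatentGridRouter()
first = router.handle(resolution=3)
second = router.handle(resolution=3)  # served from the cache
```

Under this layout, adding a probe means writing one subclass with its own `compute`, which mirrors the "one Router + one frontend view" rule stated above.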

FRONTEND & LOCAL MODE

Canvas + SSIM zoom

React over an HTML5 canvas with an InteractiveCanvas abstraction for pan/zoom/hit-testing. For local class comparisons, LAPEX swaps PCA for an SSIM ranker: perturb each latent dim by ±3 over 32 anchors and keep the two dimensions with the largest visual change.
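The SSIM ranker can be sketched as follows: shift each latent dimension by ±3 around a set of anchors, decode, and score the dimension by the average drop in structural similarity. Two simplifications are assumptions of this sketch: `global_ssim` computes SSIM over the whole image in a single window rather than with the usual sliding window, and `toy_decode` is a hypothetical decoder whose per-dimension gains are chosen so that the expected ranking is known.

```python
import numpy as np

def global_ssim(x, y, data_range=1.0):
    """Single-window SSIM over the whole image (a simplification)."""
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    mx, my = x.mean(), y.mean()
    cov = ((x - mx) * (y - my)).mean()
    num = (2 * mx * my + c1) * (2 * cov + c2)
    den = (mx ** 2 + my ** 2 + c1) * (x.var() + y.var() + c2)
    return num / den

def rank_dimensions(anchors, decode, delta=3.0, keep=2):
    """Score each latent dimension by mean (1 - SSIM) under +/- delta shifts."""
    base = decode(anchors)
    change = np.zeros(anchors.shape[1])
    for k in range(anchors.shape[1]):
        for sign in (1.0, -1.0):
            shifted = anchors.copy()
            shifted[:, k] += sign * delta
            moved = decode(shifted)
            change[k] += np.mean(
                [1.0 - global_ssim(a, b) for a, b in zip(base, moved)])
    top = np.argsort(change)[::-1][:keep]  # largest visual change first
    return top, change

# Hypothetical decoder: 4 latent dims, 64 pixels, with per-dimension gains
# so that dims 1 and 3 matter visually, dim 2 barely, dim 0 not at all.
rng = np.random.default_rng(0)
gains = np.array([0.0, 2.0, 0.1, 1.0])
basis = rng.normal(size=(4, 64))

def toy_decode(z):
    return 1.0 / (1.0 + np.exp(-((z * gains) @ basis)))

anchors = rng.normal(size=(32, 4))  # 32 anchors, as in the text
top, change = rank_dimensions(anchors, toy_decode)
```

The two indices in `top` would then define the axes of the local 2D view, replacing the two leading PCA components.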

Honest limits

What to keep in mind before deploying.

PROXY ≠ TRUTH

Decoded samples are hypotheses

You explore a learned proxy, not the classifier's true input distribution. Predictions are real, but reconstructions may drift off the data manifold — treat probed inputs as cues, not facts.

COGNITIVE LOAD

Not for non-technical users yet

Exploration can be demanding; participants hesitated on harder tasks (with no clear link to AI expertise). LAPEX targets developers and researchers.

VALIDATION GAP

Local probes not separately tested

The SSIM-based local probes come from qualitative insights but weren't re-evaluated with users — read them as design implications.

PRIVACY & IP

Exploration can leak

Explanations can expose boundaries, shortcuts and sensitive information, opening models to attacks. Use with care outside controlled environments.

In one breath

Proxies are effective exploration tools — and liked across expertise levels.

Static t-SNE stayed "very helpful" for orientation; the generative VAE layer made it "even better" for counterfactual discovery and reasoning about sparse or out-of-bounds regions. The best choice is task-dependent, so LAPEX offers both, defaults to the full generative interface, and stays modular enough to grow new probes.