Mechanistic Interpretability Explainer

Feature Dashboards

Feature dashboards matter when a deployment owner needs more than a pass/fail eval. They give the audit team a way to collect internal evidence, test a causal hypothesis, and state the limits before changing a production control.


Field guide

What the method is for

Make recurring internal features inspectable by operators, reviewers, and deployment owners.

The operator question

Can a reviewer see when policy, shortcut, refusal, sensitive-domain, or evidence-use features activate?

What the audit should produce

A monitoring surface for high-risk workflows, paired with eval metrics and human review notes.

Where the method fails

A dashboard can create false confidence if it shows labels without causal evidence or off-distribution checks.
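One way to make this failure mode concrete is to attach a small causal check to each label. The sketch below is illustrative only: it assumes the audit team can score a behavior (for example, a refusal rate) with the feature intact and with it ablated, on both an in-distribution and an off-distribution prompt set. The function names, the `min_effect` cutoff, and all scores are hypothetical.

```python
from statistics import mean


def mean_ablation_effect(baseline, ablated):
    """Average paired drop in a behavior score after ablating the feature."""
    if len(baseline) != len(ablated) or not baseline:
        raise ValueError("need equal-length, non-empty score lists")
    return mean(b - a for b, a in zip(baseline, ablated))


def label_status(in_dist_effect, off_dist_effect, min_effect=0.2):
    """Say how much weight a dashboard label can carry (min_effect is an assumed cutoff)."""
    if in_dist_effect < min_effect:
        return "label only: no causal evidence"
    if off_dist_effect < min_effect:
        return "causal in-distribution, unverified off-distribution"
    return "causal evidence holds on both sets"


# Made-up behavior scores for illustration only.
in_dist = mean_ablation_effect([0.9, 0.8, 1.0], [0.3, 0.2, 0.5])
off_dist = mean_ablation_effect([0.7, 0.6], [0.65, 0.6])
status = label_status(in_dist, off_dist)
```

With these made-up numbers the label would be flagged as unverified off-distribution, which is the kind of caveat the dashboard should show next to the label rather than hide.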

Evidence priority

What the audit must collect.

Top examples: 100

Must be collected before the interpretation is trusted.

False positives: 87

Use as supporting evidence and keep unresolved ambiguity visible.

False negatives: 74

Use as supporting evidence and keep unresolved ambiguity visible.

Activation thresholds: 61

Use as supporting evidence and keep unresolved ambiguity visible.

Workflow slices: 48

Use as supporting evidence and keep unresolved ambiguity visible.
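The five evidence items above could be gathered from a table of per-example feature activations and reviewer judgments. The following is a minimal sketch under stated assumptions, not a reference implementation: the `Record` fields, the 0.5 threshold, and the sample rows are all hypothetical, and a real audit would read activations from whatever tooling the team already uses.

```python
from dataclasses import dataclass


@dataclass
class Record:
    text: str
    activation: float  # feature activation on this example (hypothetical scale)
    concept_present: bool  # reviewer judgment: does the labelled concept apply?
    workflow: str  # workflow slice the example was drawn from


def collect_evidence(records, threshold, top_k=2):
    """Gather the five evidence items for one feature at one threshold."""
    top = sorted(records, key=lambda r: r.activation, reverse=True)[:top_k]
    false_pos = [r for r in records if r.activation >= threshold and not r.concept_present]
    false_neg = [r for r in records if r.activation < threshold and r.concept_present]
    slices = {}
    for r in records:
        fired, total = slices.get(r.workflow, (0, 0))
        slices[r.workflow] = (fired + int(r.activation >= threshold), total + 1)
    return {
        "top_examples": top,
        "false_positives": false_pos,
        "false_negatives": false_neg,
        "activation_threshold": threshold,
        # Fraction of examples in each workflow where the feature fired.
        "workflow_slices": {w: fired / total for w, (fired, total) in slices.items()},
    }


# Made-up records for illustration only.
records = [
    Record("Refund denied under the returns policy", 0.91, True, "billing"),
    Record("Insurance policy renewal reminder", 0.74, False, "billing"),
    Record("We cannot share that account detail", 0.32, True, "support"),
    Record("Thanks for your help", 0.05, False, "support"),
]
evidence = collect_evidence(records, threshold=0.5)
```

Re-running `collect_evidence` at several thresholds shows how the false-positive and false-negative lists trade off, which is one way to keep the unresolved ambiguity visible.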

Audit template

A review packet for this method.

Behavior under review: One narrow failure, policy behavior, shortcut, or refusal pattern.

Candidate mechanism: Can a reviewer see when policy, shortcut, refusal, sensitive-domain, or evidence-use features activate?

Evidence packet: Top examples; False positives; False negatives; Activation thresholds; Workflow slices

Decision boundary: What can change in production if the causal claim survives review.

Limit memo: A dashboard can create false confidence if it shows labels without causal evidence or off-distribution checks.
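The template fields above map naturally onto a small structured record that can be serialized to JSON. This is a sketch under assumptions rather than the page's actual packet format: the class name, field names, and the split into required and supporting evidence (taken from the evidence priority list, where only top examples must be collected before the interpretation is trusted) are illustrative choices.

```python
import json
from dataclasses import asdict, dataclass, field

REQUIRED_EVIDENCE = ("Top examples",)
SUPPORTING_EVIDENCE = (
    "False positives",
    "False negatives",
    "Activation thresholds",
    "Workflow slices",
)


@dataclass
class ReviewPacket:
    behavior_under_review: str
    candidate_mechanism: str
    decision_boundary: str
    limit_memo: str
    # Maps an evidence item name to a note or a pointer to where it is stored.
    evidence_packet: dict = field(default_factory=dict)

    def missing_required(self):
        """Evidence that must exist before the interpretation is trusted."""
        return [n for n in REQUIRED_EVIDENCE if not self.evidence_packet.get(n)]

    def unresolved_supporting(self):
        """Supporting evidence not yet collected; keep this list visible."""
        return [n for n in SUPPORTING_EVIDENCE if not self.evidence_packet.get(n)]

    def to_json(self):
        return json.dumps(asdict(self), indent=2)


# Hypothetical packet contents for illustration only.
packet = ReviewPacket(
    behavior_under_review="Refusal pattern on billing disputes",
    candidate_mechanism="A refusal feature activates on policy wording",
    decision_boundary="Alert threshold may change if the causal claim survives review",
    limit_memo="Labels are not yet backed by off-distribution checks",
    evidence_packet={"Top examples": "20 reviewed", "False positives": "3 logged"},
)
```

Calling `packet.unresolved_supporting()` before sign-off gives reviewers an explicit list of what is still open instead of leaving gaps implicit.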

Sources

Research trail.

Mapping the Mind of a Large Language Model

Sparse Autoencoders Find Highly Interpretable Features

Sparse Autoencoder portal