Mechanistic Interpretability Explainer

Sparse Autoencoders

Sparse autoencoders (SAEs) matter when a deployment owner needs more than a pass/fail eval. They give the audit team a way to collect internal evidence, test a causal hypothesis, and state the limits before changing a production control.


Field guide

What the method is for

Turn dense activations into sparse feature dictionaries that analysts can inspect.
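As a rough illustration of what "dense to sparse" means here, the sketch below encodes one activation vector into feature activations and decodes it back. It assumes a common SAE form (a ReLU encoder applied after subtracting the decoder bias, then a linear decoder); the toy weights, the names `sae_encode` and `sae_decode`, and the use of plain Python lists are illustrative assumptions, not a description of any particular trained dictionary.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def sae_encode(x: Vector, w_enc: Matrix, b_enc: Vector, b_dec: Vector) -> Vector:
    """Feature activations: ReLU(W_enc @ (x - b_dec) + b_enc)."""
    centered = [xi - bi for xi, bi in zip(x, b_dec)]
    pre = matvec(w_enc, centered)
    return [max(0.0, p + b) for p, b in zip(pre, b_enc)]


def sae_decode(f: Vector, w_dec: Matrix, b_dec: Vector) -> Vector:
    """Reconstruction: W_dec @ f + b_dec."""
    return [r + b for r, b in zip(matvec(w_dec, f), b_dec)]


# Toy (hypothetical) dictionary: 2 model dimensions, 4 features,
# one feature per signed axis direction. Real dictionaries are learned.
W_ENC = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]   # (n_features, d_model)
W_DEC = [[1.0, -1.0, 0.0, 0.0], [0.0, 0.0, 1.0, -1.0]]       # (d_model, n_features)
B_ENC = [0.0, 0.0, 0.0, 0.0]
B_DEC = [0.0, 0.0]

x = [0.5, -1.5]                                  # one dense activation vector
features = sae_encode(x, W_ENC, B_ENC, B_DEC)    # sparse: most entries are zero
reconstruction = sae_decode(features, W_DEC, B_DEC)
active = [i for i, f in enumerate(features) if f > 0]
print(features, reconstruction, active)
```

In this toy case only two of the four features fire and the input is reconstructed exactly; a trained dictionary would typically reconstruct with some error, which is part of what the audit has to report.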

The operator question

Which recurring features activate during the failure, refusal, shortcut, or policy behavior?
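One simple way to approach this question is to compare how often each feature fires on prompts that show the behavior against a baseline set. The sketch below is a minimal version under stated assumptions: activations are already reduced to one value per feature per prompt, a feature "fires" when it exceeds a threshold, and the function names and toy numbers are hypothetical.

```python
from typing import List, Tuple


def firing_rates(acts: List[List[float]], threshold: float = 0.0) -> List[float]:
    """Fraction of prompts on which each feature exceeds the threshold."""
    n_prompts = len(acts)
    n_features = len(acts[0])
    return [
        sum(1 for row in acts if row[j] > threshold) / n_prompts
        for j in range(n_features)
    ]


def rank_by_rate_gap(
    behavior_acts: List[List[float]],
    baseline_acts: List[List[float]],
    threshold: float = 0.0,
) -> List[Tuple[int, float]]:
    """Rank features by (firing rate on behavior prompts) - (rate on baseline)."""
    behavior = firing_rates(behavior_acts, threshold)
    baseline = firing_rates(baseline_acts, threshold)
    gaps = [(j, behavior[j] - baseline[j]) for j in range(len(behavior))]
    return sorted(gaps, key=lambda pair: pair[1], reverse=True)


# Toy data: rows are prompts, columns are features (values are made up).
behavior_acts = [[0.9, 0.0, 0.1], [1.2, 0.0, 0.0], [0.7, 0.3, 0.0], [0.0, 0.0, 0.2]]
baseline_acts = [[0.0, 0.5, 0.1], [0.0, 0.0, 0.3], [0.1, 0.4, 0.0], [0.0, 0.0, 0.2]]

ranked = rank_by_rate_gap(behavior_acts, baseline_acts)
print(ranked[0])  # feature with the largest excess firing rate on behavior prompts
```

A large gap only nominates a feature for review. It is correlational evidence and does not by itself show that the feature drives the behavior.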

What the audit should produce

Feature dashboard with example prompts, activation histograms, analyst labels, and unresolved ambiguity.

Where the method fails

Feature labels are hypotheses. Reliability depends on dictionary quality, feature consistency, and causal follow-up.
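Feature consistency can be checked roughly by training two dictionaries (for example with different seeds) and asking how many feature directions in one have a close match in the other. The sketch below uses a simple nearest-neighbor cosine match; the 0.9 threshold, the function names, and the toy directions are assumptions for illustration, and this is not necessarily the metric used in the work listed under Sources.

```python
import math
from typing import List


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity, returning 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def best_match_similarities(
    dict_a: List[List[float]], dict_b: List[List[float]]
) -> List[float]:
    """For each direction in dict_a, the highest cosine similarity found in dict_b."""
    return [max(cosine(u, v) for v in dict_b) for u in dict_a]


def consistency_fraction(
    dict_a: List[List[float]], dict_b: List[List[float]], threshold: float = 0.9
) -> float:
    """Share of dict_a directions with a match in dict_b at or above the threshold."""
    sims = best_match_similarities(dict_a, dict_b)
    return sum(1 for s in sims if s >= threshold) / len(sims)


# Toy dictionaries: each row is one feature's decoder direction (made-up values).
run_a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
run_b = [[2.0, 0.0], [0.0, -1.0], [1.0, 0.9]]

sims = best_match_similarities(run_a, run_b)
fraction = consistency_fraction(run_a, run_b)
print(sims, fraction)
```

Features without a close match in the second run are reasonable candidates to flag as unstable before an analyst label is attached to them.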

Evidence priority

What the audit must collect.

Activation dataset (priority 100)

Must be collected before the interpretation is trusted.

SAE dictionary health (priority 87)

Use as supporting evidence and keep unresolved ambiguity visible.
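"Dictionary health" is not defined on this page. One plausible reading, sketched below, is a small set of summary statistics: reconstruction error, average number of active features per input (L0), and the fraction of features that never fire on the audit data. The metric choice, the function name `dictionary_health`, and the toy values are assumptions.

```python
from typing import Dict, List


def dictionary_health(
    inputs: List[List[float]],
    reconstructions: List[List[float]],
    features: List[List[float]],
) -> Dict[str, float]:
    """Summary statistics for an SAE evaluated on a batch of activations."""
    n = len(inputs)
    # Mean squared reconstruction error, averaged over dimensions and examples.
    mse = sum(
        sum((x - r) ** 2 for x, r in zip(xs, rs)) / len(xs)
        for xs, rs in zip(inputs, reconstructions)
    ) / n
    # Average count of active (non-zero) features per example.
    mean_l0 = sum(sum(1 for f in row if f > 0) for row in features) / n
    # Features that never activate anywhere in this batch.
    n_features = len(features[0])
    dead = sum(
        1 for j in range(n_features) if all(row[j] <= 0 for row in features)
    )
    return {"mse": mse, "mean_l0": mean_l0, "dead_fraction": dead / n_features}


# Toy batch: 2 examples, 2 model dimensions, 4 features (made-up values).
inputs = [[1.0, 0.0], [0.0, 2.0]]
reconstructions = [[1.0, 0.0], [0.0, 1.0]]
features = [[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]

health = dictionary_health(inputs, reconstructions, features)
print(health)
```

A feature that looks dead on a small audit batch may still fire on other data, so the dead fraction is best reported together with the size and source of the batch.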

Top activating examples (priority 74)

Use as supporting evidence and keep unresolved ambiguity visible.
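Collecting top activating examples amounts to a top-k selection over per-prompt activations for one feature. The sketch below assumes each prompt has already been reduced to a single activation value per feature (for instance the maximum over tokens, which is an assumption here); the prompt strings and the name `top_activating` are placeholders.

```python
import heapq
from typing import List, Tuple


def top_activating(
    prompts: List[str],
    activations: List[List[float]],
    feature: int,
    k: int = 3,
) -> List[Tuple[float, str]]:
    """Return the k (activation, prompt) pairs with the highest activation."""
    scored = [(row[feature], prompt) for prompt, row in zip(prompts, activations)]
    return heapq.nlargest(k, scored, key=lambda pair: pair[0])


# Placeholder prompts and made-up activations (rows: prompts, columns: features).
prompts = ["prompt A", "prompt B", "prompt C", "prompt D"]
activations = [[0.1, 2.0], [0.0, 0.5], [0.9, 3.1], [0.4, 0.0]]

top = top_activating(prompts, activations, feature=1, k=2)
print(top)
```

Top examples show a feature at its most extreme. Sampling from lower activation ranges as well gives a less flattering but more representative view of what the feature responds to.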

Feature labels with counterexamples (priority 61)

Use as supporting evidence and keep unresolved ambiguity visible.
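One way to keep counterexamples visible is to score a label like a classifier: on a reviewed sample, compare where the feature fired against where an analyst judged the label to apply. The sketch below is an illustration under that assumption; the function name and the toy judgments are hypothetical.

```python
from typing import Dict, List, Optional


def label_agreement(
    fired: List[bool], matches_label: List[bool]
) -> Dict[str, Optional[float]]:
    """Precision and recall of a feature label on an analyst-reviewed sample.

    fired[i]: the feature activated on example i.
    matches_label[i]: an analyst judged that the label applies to example i.
    """
    pairs = list(zip(fired, matches_label))
    tp = sum(1 for f, m in pairs if f and m)
    fp = sum(1 for f, m in pairs if f and not m)   # fired off-label: counterexample
    fn = sum(1 for f, m in pairs if not f and m)   # on-label but silent
    precision = tp / (tp + fp) if (tp + fp) else None
    recall = tp / (tp + fn) if (tp + fn) else None
    return {"precision": precision, "recall": recall,
            "off_label_fires": fp, "silent_on_label": fn}


# Made-up review of six examples.
fired = [True, True, True, False, False, True]
matches_label = [True, True, False, True, False, True]

agreement = label_agreement(fired, matches_label)
print(agreement)
```

Both failure types belong in the dashboard: off-label fires suggest the label is too narrow or the feature is mixed, while silent on-label cases suggest the feature covers only part of the concept.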

Audit template

A review packet for this method.

Behavior under review: One narrow failure, policy behavior, shortcut, or refusal pattern.

Candidate mechanism: Which recurring features activate during the failure, refusal, shortcut, or policy behavior?

Evidence packet: Activation dataset; SAE dictionary health; top activating examples; feature labels with counterexamples.

Decision boundary: What can change in production if the causal claim survives review.

Limit memo: Feature labels are hypotheses. Reliability depends on dictionary quality, feature consistency, and causal follow-up.
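The template above can also be kept as a structured record so that missing evidence is caught before review. The sketch below is one possible shape: the class name `ReviewPacket`, its field names, and the placeholder values are assumptions that mirror the template headings, not a schema defined by this page.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Dict, List

# Evidence items named in the template above.
REQUIRED_EVIDENCE = [
    "Activation dataset",
    "SAE dictionary health",
    "Top activating examples",
    "Feature labels with counterexamples",
]


@dataclass
class ReviewPacket:
    behavior_under_review: str
    candidate_mechanism: str
    decision_boundary: str
    limit_memo: str
    # Maps an evidence item name to a note or artifact location.
    evidence: Dict[str, str] = field(default_factory=dict)

    def missing_evidence(self) -> List[str]:
        """Required evidence items that have no entry yet."""
        return [name for name in REQUIRED_EVIDENCE if not self.evidence.get(name)]

    def to_json(self) -> str:
        return json.dumps(asdict(self), indent=2)


# Placeholder content for illustration only.
packet = ReviewPacket(
    behavior_under_review="One narrow refusal pattern (placeholder)",
    candidate_mechanism="Recurring features that fire during the refusal (placeholder)",
    decision_boundary="What can change in production if the causal claim survives review",
    limit_memo="Feature labels are hypotheses.",
    evidence={"Activation dataset": "placeholder: where the activations are stored"},
)

missing = packet.missing_evidence()
print(missing)
print(packet.to_json())
```

Blocking sign-off while `missing_evidence()` is non-empty is one way to enforce that the activation dataset and the supporting items are collected before the interpretation is trusted.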

Sources

Research trail.

Towards Monosemanticity

Scaling Monosemanticity

Feature consistency in SAEs