Mechanistic Interpretability Explainer

Activation Patching

Activation Patching matters when a deployment owner needs more than a pass/fail eval. It gives the audit team a way to collect internal evidence, test a causal hypothesis, and state the limits before changing a production control.


Field guide

What the method is for

Test whether a candidate internal component causally affects behavior when its activation is swapped in from another run or ablated.

The operator question

Does changing the suspected feature, layer, head, or residual stream state change the model answer in the predicted direction?
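The operator question can be made concrete with a toy example. The sketch below is not tied to any real model or library: the two-layer network, its weights, and the names `hidden`, `output`, and `run` are all hypothetical, chosen only to show the mechanics of caching an activation from a clean run and forcing it into a corrupted run.

```python
import math

# Hypothetical toy model: 2 inputs -> 3 hidden units (tanh) -> 1 output.
# The weights are made up for illustration only.
W1 = [[1.0, -1.0], [0.5, 0.5], [-1.0, 2.0]]  # one row per hidden unit
W2 = [2.0, -1.0, 0.5]                         # output weights

def hidden(x):
    """Compute the hidden activations for input x."""
    return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in W1]

def output(h):
    """Map hidden activations to the scalar model output."""
    return sum(w * hi for w, hi in zip(W2, h))

def run(x, patch=None):
    """Forward pass; `patch` maps hidden-unit index -> activation to force."""
    h = hidden(x)
    if patch:
        for idx, value in patch.items():
            h[idx] = value
    return output(h)

clean_x = [1.0, 0.0]       # prompt where the behavior appears
corrupted_x = [0.0, 1.0]   # minimally different prompt where it does not

clean_h = hidden(clean_x)  # cache activations from the clean run
clean_out = run(clean_x)
corrupted_out = run(corrupted_x)

# Patch one hidden unit at a time from the clean run into the corrupted run.
patched_out = {
    i: run(corrupted_x, patch={i: clean_h[i]}) for i in range(len(clean_h))
}

# Patching every unit at once should reproduce the clean output exactly.
full_patch_out = run(corrupted_x, patch=dict(enumerate(clean_h)))
```

In this toy setup, unit 1 has the same activation on both inputs, so patching it leaves the corrupted output unchanged. It behaves like a built-in negative control, while units 0 and 2 move the output.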

What the audit should produce

Causal test memo showing effect size, negative controls, and cases where the hypothesis fails.
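One common way to report effect size, assumed here rather than taken from this guide, is the fraction of the clean-versus-corrupted gap that a patch recovers. The function name `recovered_fraction` and its arguments are illustrative.

```python
def recovered_fraction(clean, corrupted, patched):
    """Fraction of the clean-vs-corrupted metric gap restored by a patch.

    0.0 means the patch changed nothing; 1.0 means it fully restored the
    clean metric. Values outside [0, 1] are possible and worth reporting.
    """
    gap = clean - corrupted
    if gap == 0:
        raise ValueError("clean and corrupted metrics are equal; no gap to recover")
    return (patched - corrupted) / gap

# Example with made-up metric values (e.g. a logit difference).
example = recovered_fraction(clean=2.0, corrupted=0.0, patched=1.5)
```

Reporting the raw clean, corrupted, and patched values alongside the fraction keeps the memo auditable when the gap is small and the ratio is noisy.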

Where the method fails

Patch results can be brittle, distributed mechanisms can be missed, and prompt-pair design can bias the finding.

Evidence priority

What the audit must collect.

Clean/corrupted prompt pair (priority 100)

Must be collected before the interpretation is trusted.

Patch target (priority 87)

Use as supporting evidence and keep unresolved ambiguity visible.

Metric before and after patch (priority 74)

Use as supporting evidence and keep unresolved ambiguity visible.

Random-feature baseline (priority 61)

Use as supporting evidence and keep unresolved ambiguity visible.
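The random-feature baseline can be summarized as how often a randomly chosen component produces an effect at least as large as the candidate. The sketch below assumes the effects have already been measured; `baseline_exceedance` is a hypothetical helper and the numbers are made up.

```python
def baseline_exceedance(candidate_effect, random_effects):
    """Share of random-component effects at least as large as the candidate.

    Effects are compared by absolute value. A small share suggests the
    candidate stands out from arbitrary components; a large share suggests
    the patch result is not specific to the candidate.
    """
    if not random_effects:
        raise ValueError("need at least one random-baseline effect")
    target = abs(candidate_effect)
    hits = sum(1 for effect in random_effects if abs(effect) >= target)
    return hits / len(random_effects)

# Made-up example: one of four random components matches or beats the candidate.
share = baseline_exceedance(0.8, [0.1, 0.2, 0.9, 0.05])
```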

Audit template

A review packet for this method.

Behavior under review: One narrow failure, policy behavior, shortcut, or refusal pattern.

Candidate mechanism: Does changing the suspected feature, layer, head, or residual stream state change the model answer in the predicted direction?

Evidence packet: Clean/corrupted prompt pair; Patch target; Metric before and after patch; Random-feature baseline

Decision boundary: What can change in production if the causal claim survives review.

Limit memo: Patch results can be brittle, distributed mechanisms can be missed, and prompt-pair design can bias the finding.
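The template can also be kept as a structured record so that every review fills in the same fields. This is a minimal sketch: the `AuditPacket` class and its field names are assumptions derived from the template labels, not a format defined by this guide.

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class AuditPacket:
    """Hypothetical record mirroring the five template fields."""
    behavior_under_review: str
    candidate_mechanism: str
    evidence_packet: list
    decision_boundary: str
    limit_memo: str

    def to_json(self):
        """Serialize the packet for storage or review."""
        return json.dumps(asdict(self), indent=2)

# Placeholder values taken from the template wording, not from a real audit.
packet = AuditPacket(
    behavior_under_review="One narrow refusal pattern",
    candidate_mechanism="Suspected attention head changes the answer when patched",
    evidence_packet=[
        "Clean/corrupted prompt pair",
        "Patch target",
        "Metric before and after patch",
        "Random-feature baseline",
    ],
    decision_boundary="What can change in production if the causal claim survives review",
    limit_memo="Patch results can be brittle; distributed mechanisms can be missed",
)
packet_json = packet.to_json()
```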

Sources

Research trail.

Transformer Circuits thread; How Does Chain of Thought Think?; Towards Automated Circuit Discovery