Mechanistic Interpretability Explainer

Audit Limitations

Audit Limitations matters when a deployment owner needs more than a pass/fail eval. It gives the audit team a way to collect internal evidence, test a causal hypothesis, and state the limits before changing a production control.

Field guide

1. What the method is for

Prevent interpretability findings from being oversold as proof of safety or correctness.

2. The operator question

What would make this interpretation fail, and what decision is still unsupported?

3. What the audit should produce

A limit memo that states what the audit proves, what it does not prove, and which controls should remain in production.

4. Where the method fails

Sparse autoencoders (SAEs) can surface patterns in unexpected places, including random or weakly related systems, so operational claims need baselines and decision boundaries.

Evidence priority

What the audit must collect.

Counterexamples: 100

Must be collected before the interpretation is trusted.
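One way to make the counterexample requirement concrete is to list the cases where a candidate feature and the behavior disagree. The sketch below is illustrative only: the record layout, the `find_counterexamples` name, and the `threshold` of 0.5 are assumptions, not part of any specific tool, and the right firing threshold depends on the feature being audited.

```python
def find_counterexamples(records, threshold=0.5):
    """Split disagreements between a feature and a behavior into two lists.

    records: iterable of (example_id, activation, behavior_present) tuples.
    threshold: hypothetical cutoff above which the feature counts as "firing".
    """
    fires_without_behavior = []
    behavior_without_firing = []
    for example_id, activation, behavior_present in records:
        fires = activation > threshold
        if fires and not behavior_present:
            # Feature is active but the behavior did not occur.
            fires_without_behavior.append(example_id)
        elif behavior_present and not fires:
            # Behavior occurred without the feature being active.
            behavior_without_firing.append(example_id)
    return {
        "fires_without_behavior": fires_without_behavior,
        "behavior_without_firing": behavior_without_firing,
    }


records = [
    ("a", 0.9, True),   # agreement
    ("b", 0.8, False),  # feature fires, no behavior
    ("c", 0.1, True),   # behavior, feature silent
    ("d", 0.0, False),  # agreement
]
result = find_counterexamples(records)
```

Either list being non-empty is something the limit memo should report, since each kind of disagreement weakens a different part of the interpretation.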

Feature consistency checks: 87

Use as supporting evidence and keep unresolved ambiguity visible.

Random-model or random-feature baselines: 74

Use as supporting evidence and keep unresolved ambiguity visible.
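A random-feature baseline can be sketched as follows: score how well the candidate feature separates examples with and without the behavior, then ask how often randomly chosen other features score at least as well. Everything here is an assumption for illustration: the `separation` score (absolute difference in mean activation), the dictionary layout, and the function names are hypothetical, and a real audit may prefer a different statistic.

```python
import random
from statistics import mean


def separation(activations, labels):
    """Absolute difference in mean activation between examples with and without the behavior."""
    pos = [a for a, y in zip(activations, labels) if y]
    neg = [a for a, y in zip(activations, labels) if not y]
    if not pos or not neg:
        raise ValueError("need at least one example with and one without the behavior")
    return abs(mean(pos) - mean(neg))


def random_feature_baseline(feature_matrix, labels, candidate, n_samples=100, seed=0):
    """Compare a candidate feature against randomly sampled other features.

    feature_matrix: dict mapping feature id -> list of activations (one per example).
    Returns (candidate_score, fraction of sampled features scoring >= candidate).
    """
    others = [f for f in feature_matrix if f != candidate]
    if not others:
        raise ValueError("need at least one non-candidate feature to sample from")
    rng = random.Random(seed)  # fixed seed so the audit is reproducible
    candidate_score = separation(feature_matrix[candidate], labels)
    sampled = [rng.choice(others) for _ in range(n_samples)]
    baseline_scores = [separation(feature_matrix[f], labels) for f in sampled]
    exceed_fraction = sum(s >= candidate_score for s in baseline_scores) / n_samples
    return candidate_score, exceed_fraction


labels = [1, 1, 0, 0]
features = {
    "candidate": [1.0, 1.0, 0.0, 0.0],
    "flat": [0.5, 0.5, 0.5, 0.5],
    "noise": [0.2, 0.1, 0.2, 0.1],
}
score, exceed = random_feature_baseline(features, labels, "candidate")
```

A large exceed fraction suggests the candidate is not distinguishable from arbitrary features on this data. A small one is supporting evidence only; it does not by itself establish a causal claim.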

Out-of-domain tests: 61

Use as supporting evidence and keep unresolved ambiguity visible.

Audit template

A review packet for this method.

Behavior under review: One narrow failure, policy behavior, shortcut, or refusal pattern.
Candidate mechanism: What would make this interpretation fail, and what decision is still unsupported?
Evidence packet: Counterexamples; feature consistency checks; random-model or random-feature baselines; out-of-domain tests.
Decision boundary: What can change in production if the causal claim survives review.
Limit memo: SAEs can surface patterns in unexpected places, including random or weakly related systems, so operational claims need baselines and decision boundaries.
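
For teams that track review packets programmatically, the template could be represented as a small record type. This is a minimal sketch under stated assumptions: the `ReviewPacket` class, its field names, and the helper methods are hypothetical, chosen to mirror the five rows of the template and the four evidence types named in this guide.

```python
from dataclasses import dataclass, field

# Evidence types named in the guide, in priority order.
REQUIRED_EVIDENCE = (
    "Counterexamples",
    "Feature consistency checks",
    "Random-model or random-feature baselines",
    "Out-of-domain tests",
)


@dataclass
class ReviewPacket:
    behavior_under_review: str
    candidate_mechanism: str
    decision_boundary: str
    limit_memo: str
    # Maps an evidence type to a note or artifact reference once collected.
    evidence: dict = field(default_factory=dict)

    def missing_evidence(self):
        """Evidence types that have no note or artifact attached yet."""
        return [name for name in REQUIRED_EVIDENCE if not self.evidence.get(name)]

    def counterexamples_collected(self):
        """Counterexamples are the one item required before the interpretation is trusted."""
        return bool(self.evidence.get("Counterexamples"))


packet = ReviewPacket(
    behavior_under_review="One narrow refusal pattern",
    candidate_mechanism="Hypothetical feature linked to the refusal",
    decision_boundary="What can change in production if the claim survives review",
    limit_memo="What the audit proves and what it does not",
)
packet.evidence["Counterexamples"] = "see counterexample notes"
```

Keeping the missing evidence visible in the packet, rather than hiding it behind a single readiness flag, matches the guidance to keep unresolved ambiguity visible.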