Mechanistic Interpretability Explainer

Circuit Tracing

Circuit Tracing matters when a deployment owner needs more than a pass/fail eval. It gives the audit team a way to collect internal evidence, test a causal hypothesis, and state the limits before changing a production control.


Field guide

What the method is for

Build a graph-like hypothesis for how features and components combine to produce behavior.

The operator question

What path connects input features, intermediate features, and the final behavior the audit cares about?
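The operator question above can be made concrete with a small sketch. It assumes an attribution graph has already been exported as weighted edges; the node names, the weights, and the choice to score a path by the product of its edge weights are all illustrative assumptions, not the scoring used by any particular circuit-tracing tool.

```python
from collections import defaultdict

# Hypothetical attribution edges: (source, target, weight).
# Every name and number here is made up for illustration.
EDGES = [
    ("input:token_refund", "feat:billing_topic", 0.8),
    ("input:token_refund", "feat:complaint_tone", 0.3),
    ("feat:billing_topic", "feat:policy_lookup", 0.7),
    ("feat:complaint_tone", "feat:policy_lookup", 0.2),
    ("feat:policy_lookup", "output:refusal", 0.9),
    ("feat:complaint_tone", "output:refusal", 0.1),
]


def build_graph(edges):
    """Turn an edge list into an adjacency map: node -> [(next_node, weight)]."""
    graph = defaultdict(list)
    for src, dst, weight in edges:
        graph[src].append((dst, weight))
    return graph


def trace_paths(graph, start, goal, path=None, score=1.0):
    """Yield (path, score) for every acyclic path from start to goal.

    The score is the product of edge weights along the path, which is a
    simplifying assumption for ranking candidate mechanisms.
    """
    path = (path or []) + [start]
    if start == goal:
        yield path, score
        return
    for nxt, weight in graph.get(start, []):
        if nxt not in path:  # skip cycles
            yield from trace_paths(graph, nxt, goal, path, score * weight)


def ranked_paths(edges, start, goal):
    """Return all paths from start to goal, strongest first."""
    graph = build_graph(edges)
    return sorted(trace_paths(graph, start, goal), key=lambda p: p[1], reverse=True)


if __name__ == "__main__":
    for path, score in ranked_paths(EDGES, "input:token_refund", "output:refusal"):
        print(f"{score:.3f}  " + " -> ".join(path))
```

The strongest path is only a candidate mechanism. It still needs a causal test before it supports any claim about the model.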

What the audit should produce

Circuit audit note with a graph, the behavior it explains, causal tests, and open limits.
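One kind of causal test the note could record is a zero-ablation check: remove a candidate feature and measure how much the behavior changes. The sketch below uses a hand-built two-layer linear model as a stand-in for a real network. The model, the feature names, and the weights are assumptions made up for illustration.

```python
def toy_model(inputs, ablate=None):
    """Hand-built linear model standing in for a real network (assumption).

    Returns a single output score for the behavior under review.
    """
    # Intermediate "features" are weighted sums of the inputs.
    features = {
        "billing_topic": 1.0 * inputs["refund"] + 0.2 * inputs["angry"],
        "complaint_tone": 0.1 * inputs["refund"] + 1.0 * inputs["angry"],
    }
    if ablate is not None:
        features[ablate] = 0.0  # zero-ablate one feature
    return 2.0 * features["billing_topic"] + 0.5 * features["complaint_tone"]


def ablation_effect(inputs, feature):
    """Drop in the output score when one feature is zero-ablated."""
    baseline = toy_model(inputs)
    ablated = toy_model(inputs, ablate=feature)
    return baseline - ablated


if __name__ == "__main__":
    prompt = {"refund": 1.0, "angry": 1.0}
    for name in ("billing_topic", "complaint_tone"):
        print(name, round(ablation_effect(prompt, name), 3))
```

In this toy setup, ablating `billing_topic` moves the output far more than ablating `complaint_tone`, which is the pattern a causal claim about `billing_topic` would need to show. A real check would run over many prompts from the defined task distribution, not a single input.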

Where the method fails

Circuit traces are targeted explanations, not complete model maps. They should explain a specific behavior under a defined task distribution.

Evidence priority

What the audit must collect.

Attribution graph (priority 100)

Must be collected before the interpretation is trusted.

Feature-to-feature edges (priority 87)

Use as supporting evidence and keep unresolved ambiguity visible.

Intervention checks (priority 74)

Use as supporting evidence and keep unresolved ambiguity visible.

Behavioral case study (priority 61)

Use as supporting evidence and keep unresolved ambiguity visible.
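The priority list above can be turned into a simple gate that a review script runs before anyone reads the interpretation. This is a minimal sketch: the scores are the ones listed above, while the cutoff that separates required from supporting evidence and the function name `review_evidence` are assumptions.

```python
# Priority scores as listed in the evidence list above.
EVIDENCE_PRIORITY = {
    "Attribution graph": 100,
    "Feature-to-feature edges": 87,
    "Intervention checks": 74,
    "Behavioral case study": 61,
}

# Hypothetical cutoff: items at or above this score are hard requirements.
REQUIRED_CUTOFF = 100


def review_evidence(collected):
    """Report which evidence is missing and whether review is blocked.

    `collected` is a set of evidence names the audit team has gathered.
    """
    missing_required = [
        name for name, score in EVIDENCE_PRIORITY.items()
        if score >= REQUIRED_CUTOFF and name not in collected
    ]
    missing_supporting = [
        name for name, score in EVIDENCE_PRIORITY.items()
        if score < REQUIRED_CUTOFF and name not in collected
    ]
    return {
        "blocked": bool(missing_required),
        "missing_required": missing_required,
        # Kept in the report so unresolved ambiguity stays visible.
        "missing_supporting": missing_supporting,
    }


if __name__ == "__main__":
    print(review_evidence({"Feature-to-feature edges"}))
    print(review_evidence({"Attribution graph", "Intervention checks"}))
```

Passing the gate only means the required evidence exists. It says nothing about whether the interpretation is correct.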

Audit template

A review packet for this method.

Behavior under review: One narrow failure, policy behavior, shortcut, or refusal pattern.

Candidate mechanism: What path connects input features, intermediate features, and the final behavior the audit cares about?

Evidence packet: Attribution graph; Feature-to-feature edges; Intervention checks; Behavioral case study

Decision boundary: What can change in production if the causal claim survives review.

Limit memo: Circuit traces are targeted explanations, not complete model maps. They should explain a specific behavior under a defined task distribution.
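The template fields above map naturally onto a small record type. The sketch below is one possible encoding, assuming a team wants to store packets as JSON; the class name `ReviewPacket`, the field names, and the `unfilled` helper are hypothetical and not a published schema.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class ReviewPacket:
    """One review packet; fields mirror the audit template above."""
    behavior_under_review: str
    candidate_mechanism: str
    evidence_packet: list = field(default_factory=list)
    decision_boundary: str = ""
    limit_memo: str = ""

    def unfilled(self):
        """Names of fields that are still empty."""
        return [name for name, value in asdict(self).items() if not value]

    def to_json(self):
        """Serialize the packet for storage or review tooling."""
        return json.dumps(asdict(self), indent=2)


if __name__ == "__main__":
    packet = ReviewPacket(
        behavior_under_review="Refusal on refund requests (illustrative)",
        candidate_mechanism="billing_topic -> policy_lookup -> refusal (illustrative)",
        evidence_packet=["Attribution graph", "Intervention checks"],
    )
    print(packet.to_json())
    print("Still to fill in:", packet.unfilled())
```

Listing unfilled fields before sign-off makes it harder to ship a packet that states a mechanism but omits its decision boundary or limit memo.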

Sources

Research trail.

Circuit Tracing methods

Tracing the Thoughts of a Large Language Model

Anthropic Responsible Scaling Policy