# Activation Patching

Activation patching matters when a deployment owner needs more than a pass/fail eval. It gives the audit team a way to collect internal evidence, test a causal hypothesis about a specific component, and state the limits of that finding before changing a production control.

## Field Guide

### What the method is for

Test whether a candidate internal component causally affects model behavior when its activation is swapped in from another run or ablated.

### The operator question

Does changing the suspected feature, layer, head, or residual stream state change the model answer in the predicted direction?
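The operator question can be made concrete with a toy model. The sketch below is a minimal illustration and not a recipe for any particular framework: the two-layer NumPy network, its random weights, the `forward` and `logit_diff` helpers, and the choice of hidden units 0 to 3 as the patch target are all assumptions made for the example. It caches hidden activations from a clean run, writes a subset of them into a corrupted run, and compares the metric across the three runs.

```python
import numpy as np

# Toy two-layer network standing in for a real model (hypothetical weights).
rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 8))   # input -> hidden
W2 = rng.normal(size=(8, 2))   # hidden -> logits


def forward(x, patch=None):
    """Run the toy model, optionally overwriting selected hidden units.

    patch: optional (indices, values) pair written into the hidden layer
    before the output projection.
    """
    hidden = np.tanh(x @ W1)
    if patch is not None:
        idx, values = patch
        hidden = hidden.copy()
        hidden[idx] = values
    logits = hidden @ W2
    return hidden, logits


def logit_diff(logits):
    # Metric: how strongly the model prefers answer 0 over answer 1.
    return float(logits[0] - logits[1])


x_clean = rng.normal(size=4)     # stands in for the clean prompt
x_corrupt = rng.normal(size=4)   # stands in for the corrupted prompt

h_clean, logits_clean = forward(x_clean)
_, logits_corrupt = forward(x_corrupt)

# Patch the clean activations of the candidate units into the corrupted run.
target = np.array([0, 1, 2, 3])
_, logits_patched = forward(x_corrupt, patch=(target, h_clean[target]))

# Sanity check: patching every hidden unit must reproduce the clean output.
all_units = np.arange(8)
_, logits_full = forward(x_corrupt, patch=(all_units, h_clean[all_units]))

print("clean metric    :", logit_diff(logits_clean))
print("corrupted metric:", logit_diff(logits_corrupt))
print("patched metric  :", logit_diff(logits_patched))
```

If the patched metric moves from the corrupted value toward the clean value, the patched units carry at least part of the information that separates the two prompts. On a real model the same three runs apply, with the patch written through whatever activation hook the framework provides.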

### What the audit should produce

A causal test memo that reports the effect size, the negative controls, and the cases where the hypothesis fails.
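One common way to report effect size is the fraction of the clean-versus-corrupted gap that the patch recovers, compared against the same quantity for randomly chosen patch targets. The sketch below assumes that convention; the function names, the 0.95 quantile threshold, and every number in the usage lines are illustrative placeholders, not measured results.

```python
import numpy as np


def recovered_fraction(clean, corrupt, patched):
    """Share of the clean-vs-corrupted metric gap restored by the patch.

    0.0 means the patch changed nothing; 1.0 means it fully restored the
    clean behavior. Values outside [0, 1] are possible and worth reporting.
    """
    gap = clean - corrupt
    if abs(gap) < 1e-12:
        raise ValueError("clean and corrupted metrics are equal; "
                         "the prompt pair is uninformative")
    return (patched - corrupt) / gap


def exceeds_baseline(target_effect, baseline_effects, quantile=0.95):
    """Compare the target effect with effects from random patch targets."""
    threshold = float(np.quantile(np.abs(baseline_effects), quantile))
    return abs(target_effect) > threshold, threshold


# Placeholder metric values, for illustration only.
effect = recovered_fraction(clean=2.0, corrupt=-1.0, patched=1.25)

# Placeholder effects from patching randomly chosen components.
random_effects = np.array([0.05, -0.10, 0.02, 0.08, -0.03])
passes, threshold = exceeds_baseline(effect, random_effects)

print(f"recovered fraction: {effect:.2f}")
print(f"baseline threshold: {threshold:.3f}, exceeds baseline: {passes}")
```

Reporting the baseline threshold next to the effect lets a reviewer see whether the candidate component stands out from arbitrary components, which is the point of the negative control.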

### Where the method fails

Patch results can be brittle, distributed mechanisms can be missed, and prompt-pair design can bias the finding.

## Audit Template

| Field | Guidance |
| --- | --- |
| Behavior under review | One narrow failure, policy behavior, shortcut, or refusal pattern. |
| Candidate mechanism | The suspected feature, layer, head, or residual stream state, and the direction in which changing it is predicted to move the model answer. |
| Evidence packet | Clean/corrupted prompt pair; Patch target; Metric before and after patch; Random-feature baseline |
| Decision boundary | What can change in production if the causal claim survives review. |
| Limit memo | Patch results can be brittle, distributed mechanisms can be missed, and prompt-pair design can bias the finding. |

## Evidence To Collect

- Clean/corrupted prompt pair
- Patch target
- Metric before and after patch
- Random-feature baseline
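The four items above can be kept together as one record per experiment so that the memo and the raw evidence stay in sync. The sketch below is one possible layout, not a prescribed schema: the `PatchEvidence` class, its field names, and the example prompts and values are hypothetical.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class PatchEvidence:
    """One activation patching experiment, as filed in the evidence packet."""
    clean_prompt: str
    corrupted_prompt: str
    patch_target: str        # e.g. a layer, head, or feature identifier
    metric_name: str
    metric_before: float     # metric on the corrupted run, no patch
    metric_after: float      # metric on the corrupted run, with patch
    random_baseline: float   # same metric after patching a random target

    def missing_fields(self):
        # Text fields left empty make the record unusable for review.
        return [name for name, value in asdict(self).items()
                if value in ("", None)]


# Placeholder record, for illustration only.
record = PatchEvidence(
    clean_prompt="The Eiffel Tower is in the city of",
    corrupted_prompt="The Colosseum is in the city of",
    patch_target="layer 6, residual stream, final token",
    metric_name="logit difference (Paris minus Rome)",
    metric_before=-1.0,
    metric_after=1.25,
    random_baseline=-0.9,
)

assert not record.missing_fields()
print(json.dumps(asdict(record), indent=2))
```

Serializing the record to JSON keeps it diffable and easy to attach to the limit memo.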

## Sources

- [Transformer Circuits thread](https://transformer-circuits.pub/)
- [How Does Chain of Thought Think?](https://ojs.aaai.org/index.php/AAAI/article/view/40281)
- [Towards Automated Circuit Discovery](https://arxiv.org/abs/2304.14997)
