# Sparse Autoencoders

Sparse autoencoders (SAEs) matter when a deployment owner needs more than a pass/fail eval. They give the audit team a way to collect internal evidence, test a causal hypothesis, and state the limits of that evidence before changing a production control.

## Field Guide

### What the method is for

Decompose dense model activations into a sparse dictionary of features, each of which analysts can inspect through the inputs that activate it.
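To make the idea concrete, here is a minimal NumPy sketch of one common SAE formulation: a ReLU encoder, a linear decoder, and a reconstruction loss with an L1 sparsity penalty. The function names, the initialization, and the `l1_coeff` value are illustrative assumptions, not the setup of any particular published dictionary, and the sketch omits training entirely.

```python
import numpy as np

def init_sae(d_model, n_features, seed=0):
    """Randomly initialize SAE parameters (illustrative init, not a recipe)."""
    rng = np.random.default_rng(seed)
    w_dec = rng.normal(size=(n_features, d_model))
    # Normalize each dictionary direction (decoder row) to unit length.
    w_dec /= np.linalg.norm(w_dec, axis=1, keepdims=True)
    return {
        "W_enc": w_dec.T.copy(),        # (d_model, n_features)
        "b_enc": np.zeros(n_features),
        "W_dec": w_dec,                 # (n_features, d_model)
        "b_dec": np.zeros(d_model),
    }

def encode(params, x):
    """Map dense activations (batch, d_model) to non-negative feature activations."""
    pre = (x - params["b_dec"]) @ params["W_enc"] + params["b_enc"]
    return np.maximum(0.0, pre)

def decode(params, f):
    """Reconstruct dense activations from feature activations."""
    return f @ params["W_dec"] + params["b_dec"]

def sae_loss(params, x, l1_coeff=1e-3):
    """Mean squared reconstruction error plus an L1 penalty on feature activations."""
    f = encode(params, x)
    x_hat = decode(params, f)
    mse = np.mean(np.sum((x - x_hat) ** 2, axis=1))
    l1 = np.mean(np.sum(np.abs(f), axis=1))
    return mse + l1_coeff * l1
```

With a trained dictionary, each column of the `encode` output is one feature's activation across the batch, and each row of `W_dec` is the direction that feature writes back into activation space.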

### The operator question

Which recurring features activate during the failure, refusal, shortcut, or policy behavior?

### What the audit should produce

A feature dashboard with example prompts, activation histograms, analyst labels, and a record of unresolved ambiguities.
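Two of the dashboard ingredients can be computed directly from a matrix of feature activations. The sketch below assumes a hypothetical `feature_acts` array of shape `(n_prompts, n_features)` aligned with a list of `prompts`; both names and the per-prompt aggregation are assumptions for illustration.

```python
import numpy as np

def top_activating_examples(feature_acts, prompts, feature_idx, k=3):
    """Return the k prompts with the highest activation for one feature."""
    column = feature_acts[:, feature_idx]
    order = np.argsort(column)[::-1][:k]   # indices sorted by descending activation
    return [(prompts[i], float(column[i])) for i in order]

def activation_histogram(feature_acts, feature_idx, bins=10):
    """Histogram of one feature's non-zero activations (zeros would dominate otherwise)."""
    column = feature_acts[:, feature_idx]
    nonzero = column[column > 0]
    counts, edges = np.histogram(nonzero, bins=bins)
    return counts, edges
```

Showing only the top examples can make a feature look cleaner than it is, so a dashboard built this way is more useful if it also samples prompts from the lower histogram bins.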

### Where the method fails

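Feature labels are hypotheses, not findings. Their reliability depends on dictionary quality (how well the SAE reconstructs the activations), feature consistency (whether similar features reappear across training runs), and causal follow-up (whether intervening on a feature changes the behavior).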
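One rough way to probe feature consistency is to compare the decoder directions of two dictionaries trained on the same activations with different seeds. The sketch below uses best-match cosine similarity as a simple stand-in; it is not the metric from any of the cited sources, and the `threshold` of 0.7 is an arbitrary assumption.

```python
import numpy as np

def max_cosine_matches(dec_a, dec_b):
    """For each feature in A, find its most similar feature in B.

    dec_a, dec_b: decoder matrices of shape (n_features, d_model).
    Assumes no decoder row is all zeros.
    """
    a = dec_a / np.linalg.norm(dec_a, axis=1, keepdims=True)
    b = dec_b / np.linalg.norm(dec_b, axis=1, keepdims=True)
    sims = a @ b.T                       # pairwise cosine similarities
    return sims.max(axis=1), sims.argmax(axis=1)

def consistency_score(dec_a, dec_b, threshold=0.7):
    """Fraction of features in A with a close match in B (threshold is an assumption)."""
    best, _ = max_cosine_matches(dec_a, dec_b)
    return float(np.mean(best >= threshold))
```

A feature with no close match in the second dictionary is a weaker basis for a label, since it may be an artifact of one training run.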

## Audit Template

| Field | Guidance |
| --- | --- |
| Behavior under review | One narrow failure, policy behavior, shortcut, or refusal pattern. |
| Candidate mechanism | The recurring features that activate during the failure, refusal, shortcut, or policy behavior. |
| Evidence packet | Activation dataset; SAE dictionary health; top activating examples; feature labels with counterexamples. |
| Decision boundary | What can change in production if the causal claim survives review. |
| Limit memo | Feature labels are hypotheses. Reliability depends on dictionary quality, feature consistency, and causal follow-up. |

## Evidence To Collect

- Activation dataset
- SAE dictionary health
- Top activating examples
- Feature labels with counterexamples
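"SAE dictionary health" is not defined further here. As a sketch, three commonly reported summary statistics are the fraction of dead features, the mean number of active features per input (L0), and the fraction of variance left unexplained by the reconstruction. The function and key names below are assumptions, as is the choice of these three metrics.

```python
import numpy as np

def dictionary_health(x, x_hat, feature_acts):
    """Summarize dictionary health on one evaluation batch.

    x:            original activations, shape (n, d_model)
    x_hat:        SAE reconstructions, shape (n, d_model)
    feature_acts: SAE feature activations, shape (n, n_features)
    """
    fired = feature_acts > 0
    # Dead features never fire on this batch.
    dead_fraction = float(np.mean(~fired.any(axis=0)))
    # Mean L0: average number of active features per input.
    mean_l0 = float(fired.sum(axis=1).mean())
    # Fraction of variance unexplained: residual energy over centered total energy.
    residual = np.sum((x - x_hat) ** 2)
    total = np.sum((x - x.mean(axis=0)) ** 2)
    return {
        "dead_fraction": dead_fraction,
        "mean_l0": mean_l0,
        "fraction_variance_unexplained": float(residual / total),
    }
```

These numbers depend on the evaluation batch: a feature that looks dead on a small sample may fire on rarer inputs, so the batch should be drawn from the same activation dataset as the rest of the evidence packet.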

## Sources

- [Towards Monosemanticity](https://transformer-circuits.pub/2023/monosemantic-features/index.html)
- [Scaling Monosemanticity](https://transformer-circuits.pub/2024/scaling-monosemanticity/index.html)
- [Feature consistency in SAEs](https://huggingface.co/papers/2505.20254)
