# Benchmark Harness Kit

Adapter contracts and sample rows for moving real benchmark exports from provider, browser-agent, coding-agent, and support-agent harnesses into the Edxperimental Labs intake pipeline.

## Harness Lanes

1. **Export:** The harness emits raw JSONL or trace artifacts before a reviewer touches the score.
2. **Validate:** Schema checks confirm task identity, model identity, prompt packet, artifacts, latency, and cost fields.
3. **Review:** A human reviewer assigns a score, failure category, and acceptance state, with a short rationale.
4. **Intake:** Accepted rows enter `/api/benchmark-run-intake` or the CSV template for static generation.
5. **Publish:** Generated benchmark pages expose aggregate scores while preserving artifact links and evidence caveats.
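
The Validate lane can be sketched as a per-row field check over the exported JSONL. This is a minimal illustration, not the kit's actual validator: the field names (`task_id`, `model_id`, `prompt_packet_hash`, `artifact_uris`, `latency_ms`, `cost_usd`) are assumptions chosen to mirror the lane description, and the real schema may name or group them differently.

```python
import json

# Hypothetical field names and types; the real schema is defined by the
# adapter contracts, not by this sketch.
REQUIRED_FIELDS = {
    "task_id": str,
    "model_id": str,
    "prompt_packet_hash": str,
    "artifact_uris": list,
    "latency_ms": (int, float),
    "cost_usd": (int, float),
}


def validate_row(row: dict) -> list[str]:
    """Return a list of problems; an empty list means the row passes."""
    problems = []
    for field, expected in REQUIRED_FIELDS.items():
        if field not in row:
            problems.append(f"missing field: {field}")
        # bool is a subclass of int in Python, so reject it explicitly.
        elif isinstance(row[field], bool) or not isinstance(row[field], expected):
            problems.append(f"wrong type for {field}")
    if not problems and not row["artifact_uris"]:
        problems.append("artifact_uris is empty")
    return problems


def validate_jsonl(text: str) -> dict[int, list[str]]:
    """Map 1-based line numbers to the problems found on each failing row."""
    failures = {}
    for number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # skip blank lines
        try:
            row = json.loads(line)
        except json.JSONDecodeError:
            failures[number] = ["invalid JSON"]
            continue
        if not isinstance(row, dict):
            failures[number] = ["row is not a JSON object"]
            continue
        problems = validate_row(row)
        if problems:
            failures[number] = problems
    return failures
```

Reporting problems per line, rather than stopping at the first failure, lets a harness author fix a whole export in one pass before it reaches a reviewer.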

## Adapter Contracts

- [Provider API Harness](/reports/benchmark-harness-kit/provider-api-harness.md): One JSONL row per model run with model id, provider id, latency, token/cost fields, response artifact URI, and reviewer queue id.
- [Browser Agent Harness](/reports/benchmark-harness-kit/browser-agent-harness.md): Trace archive with DOM checkpoints, screenshot evidence, console warnings, timing, recovered errors, and final state proof.
- [Coding Agent Harness](/reports/benchmark-harness-kit/coding-agent-harness.md): Patch bundle with diff, terminal log, test output, browser proof when needed, and reviewer verdict.
- [Support Agent Harness](/reports/benchmark-harness-kit/support-agent-harness.md): Conversation transcript with policy citations, escalation choice, tone review, and final customer outcome.
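
As an illustration of the Provider API contract, one run could be modeled as a frozen record that serializes to a single JSONL line. The class name, field names, and units below are hypothetical; the linked contract page is the authority on the real row shape.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ProviderRunRow:
    # Hypothetical field names mirroring the contract's one-line summary.
    model_id: str
    provider_id: str
    latency_ms: float
    input_tokens: int
    output_tokens: int
    cost_usd: float
    response_artifact_uri: str
    reviewer_queue_id: str


def to_jsonl_line(row: ProviderRunRow) -> str:
    """Serialize one model run as a single JSONL line."""
    # json.dumps escapes newlines inside strings, so the output is one line.
    return json.dumps(asdict(row), sort_keys=True)


def from_jsonl_line(line: str) -> ProviderRunRow:
    """Parse one JSONL line; missing or unexpected keys raise TypeError."""
    return ProviderRunRow(**json.loads(line))
```

Freezing the record means a row cannot be mutated between export and review, which fits the requirement that the harness emit artifacts before a reviewer touches the score.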

## Quality Gates

- **Immutable task packet:** The row references a task id, source packet, prompt packet hash, and expected-output rubric. Prevents silently changing the task after a model has already run.
- **Exact system identity:** The row records provider, model id, model version, agent version, route, and settings. Prevents comparing vague brand names instead of reproducible systems.
- **State and artifact proof:** The row links raw answer, tool trace, screenshot, diff, terminal log, or transcript as appropriate. Prevents fluent completion claims from becoming benchmark evidence.
- **Human scoring reason:** The row includes reviewer, score, failure category, acceptance state, and reviewer note. Prevents a number from appearing without an auditable reason.
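
The immutable task packet gate depends on a stable prompt packet hash. The source does not say how that hash is computed, so the following is only a sketch under stated assumptions: SHA-256 over canonical JSON (sorted keys, compact separators), with `prompt_packet_hash` as a hypothetical field name.

```python
import hashlib
import json


def packet_hash(packet: dict) -> str:
    """Hash a prompt packet so that key order does not change the digest."""
    # Assumption: canonical form is JSON with sorted keys and no extra spaces.
    canonical = json.dumps(
        packet, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def packet_unchanged(row: dict, packet: dict) -> bool:
    """True when the row's recorded hash still matches the packet on disk."""
    # "prompt_packet_hash" is a hypothetical field name for this sketch.
    return row.get("prompt_packet_hash") == packet_hash(packet)
```

If a task prompt is edited after a model has run, `packet_unchanged` returns `False` for the older rows, which makes the drift visible instead of silent.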

## Local Handoff Commands

```bash
pnpm benchmarks:harness
pnpm benchmarks:intake
pnpm benchmarks:generate
pnpm benchmarks:replay --suite browser-operations-suite --task pricing-page-extraction --json
```
