# Benchmark Run Intake Runbook

A concrete intake kit for replacing synthetic benchmark rows with real provider, notebook, browser-agent, and coding-agent run exports.

## When To Use This

Use this kit when a benchmark run comes from a notebook, provider API script, browser-agent trace, coding-agent run, support-policy eval, or manually reviewed workflow packet. The goal is to preserve enough evidence that a public score can be inspected later.

## Intake Stages

1. **Capture:** Export raw model/provider/notebook runs with exact task ids and model ids before any scoring cleanup.
2. **Normalize:** Map the export into the CSV template and attach artifact URIs for prompts, outputs, traces, and screenshots (see the normalization sketch after this list).
3. **Review:** Score each run against the suite rubric, write a reviewer note, and flag missing evidence before publishing.
4. **Generate:** Run `pnpm benchmarks:generate` to merge accepted rows into static suite, trace, report, and artifact pages.
5. **Verify:** Run `pnpm benchmarks:replay` for representative traces and `pnpm verify:site` against the local Webpack dev server.
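
To make the Normalize step concrete, here is a minimal TypeScript sketch that maps a raw run export into the fields the CSV template expects. The `RawProviderExport` shape, the `normalizeExport` helper, and the field mapping are assumptions for illustration; real provider and notebook exports will differ.

```ts
// RawProviderExport is a hypothetical capture-stage shape; real provider and
// notebook exports will differ.
interface RawProviderExport {
  task_id: string;
  model: string;
  started_at: string;   // ISO timestamp recorded at capture time
  latency_ms: number;
  output_path: string;  // where the raw answer was written during capture
}

// Map one raw export into the fields the CSV template expects. Fields that
// come from review (provider, runSeed, traces, screenshots, cost, score,
// reviewer, reviewerNote) are filled in later in the pipeline.
function normalizeExport(
  raw: RawProviderExport,
  suiteSlug: string,
  promptPacketHash: string,
) {
  return {
    suiteSlug,
    taskId: raw.task_id,
    modelId: raw.model,
    startedAt: raw.started_at,
    latencyMs: raw.latency_ms,
    outputArtifactUri: raw.output_path,
    promptPacketHash,
  };
}
```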

## Required Fields

- **suiteSlug:** Existing benchmark suite slug, for example `indian-enterprise-workflow-suite`.
- **taskId:** Stable task id that matches the task packet or a new proposed packet.
- **modelId:** Exact model or agent identifier, including provider version when available.
- **provider:** Model API, provider, or local stack used for the run.
- **runSeed:** Deterministic seed or run attempt label.
- **startedAt:** ISO timestamp for the run start.
- **promptPacketHash:** Hash or immutable id for the prompt/task packet shown to the model.
- **inputArtifactUri:** Local path, object key, or redacted source packet id for the input.
- **outputArtifactUri:** Local path, object key, or redacted output packet id for the model answer.
- **toolTraceUri:** Tool-call trace, browser trace, terminal log, or `not_applicable`.
- **screenshotUri:** Browser/app screenshot proof or `not_applicable`.
- **rawCostUsd:** Observed provider cost for this run when available.
- **latencyMs:** Wall-clock latency from run start to final answer.
- **score:** Reviewer score on the suite rubric, 0 to 100.
- **reviewer:** Human reviewer or review queue owner.
- **reviewerNote:** Short note explaining acceptance, partial credit, or failure.
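
For reference, here is a minimal TypeScript sketch of a single intake row. The field names mirror the list above; the exact types, the optional cost field, and the `not_applicable` convention are assumptions about how the CSV template is consumed, not the project's actual schema.

```ts
// Sketch of one intake row. Field names mirror the required-field list above;
// the types and the "not_applicable" convention are illustrative assumptions.
export interface BenchmarkRunRow {
  suiteSlug: string;          // existing suite slug, e.g. "indian-enterprise-workflow-suite"
  taskId: string;             // stable task id matching the task packet
  modelId: string;            // exact model/agent identifier, with provider version when available
  provider: string;           // model API, provider, or local stack used for the run
  runSeed: string;            // deterministic seed or run attempt label
  startedAt: string;          // ISO timestamp for the run start
  promptPacketHash: string;   // hash or immutable id of the prompt/task packet
  inputArtifactUri: string;   // local path, object key, or redacted source packet id
  outputArtifactUri: string;  // local path, object key, or redacted output packet id
  toolTraceUri: string;       // tool-call/browser trace, terminal log, or "not_applicable"
  screenshotUri: string;      // screenshot proof or "not_applicable"
  rawCostUsd?: number;        // observed provider cost, when available
  latencyMs: number;          // wall-clock latency from run start to final answer
  score: number;              // reviewer score on the suite rubric, 0 to 100
  reviewer: string;           // human reviewer or review queue owner
  reviewerNote: string;       // why the run was accepted, partially accepted, or rejected
}
```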

## Acceptance Gates

- **Packet identity:** A benchmark result is unusable if the task, prompt, source packet, and model output cannot be tied to stable ids. Evidence: suiteSlug, taskId, promptPacketHash, inputArtifactUri, outputArtifactUri.
- **Run reproducibility:** Close leaderboard calls need repeated runs, exact model identifiers, seeds, and enough metadata to reproduce the route. Evidence: modelId, provider, runSeed, startedAt, latencyMs.
- **State proof:** Agent and browser tasks must prove completion through tool traces, screenshots, file artifacts, or terminal logs rather than claims. Evidence: toolTraceUri, screenshotUri, outputArtifactUri.
- **Scoring audit:** A score needs reviewer rationale so a buyer can inspect why a run was accepted, partially accepted, or rejected. Evidence: score, reviewer, reviewerNote.
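
The gates can be read as presence checks over the evidence fields listed above. The sketch below assumes the `BenchmarkRunRow` shape from the previous section; the gate names and the `checkAcceptanceGates` helper are illustrative, not part of the actual intake scripts.

```ts
// Illustrative gate check against the BenchmarkRunRow sketch above. The gate
// names and this helper are hypothetical; the real intake scripts may enforce
// these rules differently (e.g. rejecting "not_applicable" for agent tasks).
type GateFailure = { gate: string; missing: string[] };

function checkAcceptanceGates(row: BenchmarkRunRow): GateFailure[] {
  // A field counts as present when it is defined and not blank after trimming.
  const present = (value: string | number | undefined) =>
    value !== undefined && String(value).trim() !== "";

  const gates: Array<[string, Array<keyof BenchmarkRunRow>]> = [
    ["packet-identity", ["suiteSlug", "taskId", "promptPacketHash", "inputArtifactUri", "outputArtifactUri"]],
    ["run-reproducibility", ["modelId", "provider", "runSeed", "startedAt", "latencyMs"]],
    ["state-proof", ["toolTraceUri", "screenshotUri", "outputArtifactUri"]],
    ["scoring-audit", ["score", "reviewer", "reviewerNote"]],
  ];

  return gates
    .map(([gate, fields]) => ({
      gate,
      missing: fields.filter((field) => !present(row[field])),
    }))
    .filter((failure) => failure.missing.length > 0);
}
```

A row that fails any gate should stay in the review queue rather than flow into `pnpm benchmarks:generate`.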

## Local Commands

```bash
pnpm benchmarks:intake
pnpm benchmarks:generate   # merge accepted rows into static suite, trace, report, and artifact pages
pnpm benchmarks:replay --suite indian-enterprise-workflow-suite --task vendor-contract-renewal-risk   # replay a representative trace
SITE_URL=http://127.0.0.1:3000 pnpm verify:site   # verify against the local Webpack dev server
```
