# Benchmark Evidence Readiness

This ledger tracks evidence readiness so that synthetic benchmark scaffolds are not mistaken for production-grade leaderboard evidence.

Current benchmark pages are useful as scaffolds and as buyer-facing methodology examples, but they should not be treated as final public leaderboard evidence until raw harness logs, provider response IDs, screenshot or state proofs, repeated seeds, and reviewer signoff are imported.

Average readiness score across suites: 57/100

## Suites

| Suite | Status | Readiness | Trace packets | Model run rows | Blocking gates |
| --- | --- | ---: | ---: | ---: | --- |
| Indian Enterprise Workflow Suite | Not leaderboard-ready | 57 | 4 | 16 | Raw run log, Provider response identity, Screenshot or state proof, Repeat-run stability |
| Coding Agent Maintenance Suite | Not leaderboard-ready | 57 | 4 | 16 | Raw run log, Provider response identity, Screenshot or state proof, Repeat-run stability |
| Browser Operations Suite | Not leaderboard-ready | 57 | 4 | 16 | Raw run log, Provider response identity, Screenshot or state proof, Repeat-run stability |
| Support Agent Policy Suite | Not leaderboard-ready | 57 | 4 | 16 | Raw run log, Provider response identity, Screenshot or state proof, Repeat-run stability |
| AI Security & Risk Suite | Not leaderboard-ready | 57 | 4 | 16 | Raw run log, Provider response identity, Screenshot or state proof, Repeat-run stability |
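The readiness column above can be thought of as a weighted fraction of evidence gates passed. The ledger does not state its actual weighting, so the sketch below uses illustrative weights and hypothetical gate keys; it shows the shape of the computation, not the real scoring rule.

```python
# Illustrative readiness scoring: weighted share of evidence gates passed.
# Gate keys mirror the Evidence Gates section; the weights are assumptions.
GATES = {
    "task_design_and_split": 20,
    "gold_answer_and_rubric": 20,
    "raw_run_log": 15,
    "provider_response_identity": 10,
    "screenshot_or_state_proof": 10,
    "repeat_run_stability": 15,
    "reviewer_signoff": 10,
}

def readiness(passed: set[str]) -> int:
    """Score 0-100: share of total gate weight covered by passed gates."""
    total = sum(GATES.values())
    earned = sum(w for gate, w in GATES.items() if gate in passed)
    return round(100 * earned / total)

def blocking(passed: set[str]) -> list[str]:
    """Gates still open for a suite, in checklist order."""
    return [gate for gate in GATES if gate not in passed]
```

For example, a suite that has only passed the task-design and gold-answer gates would score `readiness({"task_design_and_split", "gold_answer_and_rubric"})` = 40 under these assumed weights, with the remaining five gates reported as blocking.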

## Evidence Gates

- **Task design and split:** Public/private split, task brief, expected output, scoring focus, leakage policy, and retirement rule exist for the suite.
- **Gold answer and rubric:** Every scored task has expected evidence, rubric components, and reviewer-facing scorecard.
- **Raw run log:** Every model run has raw harness logs, prompt packet id, run seed, exact model version, tool calls, retries, and answer excerpt.
- **Provider response identity:** Every model run preserves provider, model identifier, model version, timestamp, and response id or equivalent audit handle.
- **Screenshot or state proof:** Browser/app/document tasks preserve screenshot, DOM state, file diff, or comparable state proof for completion claims.
- **Repeat-run stability:** Leaderboard-boundary results are repeated across multiple seeds and instability is reported before ranking.
- **Reviewer signoff:** A named reviewer signs the scorecard, failure reason, and buyer recommendation after inspecting trace artifacts.
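The raw-run-log and provider-identity gates can be enforced mechanically before a run row is admitted to the ledger. A minimal sketch follows; the field names are assumptions drawn from the gate descriptions above, not the actual `run-schema.json` contract.

```python
# Minimal admission check for a model run row. Field names are assumptions
# derived from the gate descriptions, not the real run-schema.json.
REQUIRED_FIELDS = (
    # Raw run log gate
    "raw_log_path", "prompt_packet_id", "run_seed", "model_version",
    "tool_calls", "retries", "answer_excerpt",
    # Provider response identity gate
    "provider", "model_id", "timestamp", "response_id",
)

def missing_evidence(run_row: dict) -> list[str]:
    """Return the required fields that are absent or empty in a run row."""
    return [f for f in REQUIRED_FIELDS if run_row.get(f) in (None, "")]

def admissible(run_row: dict) -> bool:
    """A row is leaderboard-admissible only when every field is present."""
    return not missing_evidence(run_row)
```

Running this check at import time turns "Raw run log" and "Provider response identity" from review-time judgments into a deterministic precondition: a row missing, say, its `response_id` is rejected with the exact field named.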

## Next Import Artifacts

- /reports/benchmark-intake/runbook.md
- /reports/benchmark-intake/run-template.csv
- /reports/benchmark-intake/run-schema.json
- /reports/benchmark-intake/reviewer-checklist.md
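One plausible shape for `run-template.csv` is a header-only file whose columns track the raw-run-log and identity gates, with one appended row per model run. The column names below are hypothetical; the real template may differ.

```python
import csv
import io

# Hypothetical column set for run-template.csv; the real template may differ.
COLUMNS = [
    "suite", "task_id", "provider", "model_id", "model_version",
    "run_seed", "prompt_packet_id", "response_id", "timestamp",
    "raw_log_path", "state_proof_path", "score", "reviewer",
]

def blank_template() -> str:
    """Render a header-only CSV that per-run rows can be appended to."""
    buf = io.StringIO()
    csv.writer(buf).writerow(COLUMNS)
    return buf.getvalue()
```

Keeping the template as a single header row means every import tool and reviewer checklist can validate incoming rows against the same column list.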
