Downloadable benchmark brief
Generated from Edxperimental Labs benchmark data: model rows, task traces, task mix, leaderboard controls, and the next evidence to collect before production.
Executive readout
Usable for constrained workflows with fallback routing.
| Model class | Score | Recovery | Cost | P95 latency |
|---|---|---|---|---|
| Frontier reasoning model | 71 | 69 | 49 | 5764ms |
| Fast mid-tier model | 60 | 55 | 82 | 4874ms |
| Open-weight local model | 43 | 39 | 76 | 6000ms |
| Small routing model | 31 | 25 | 95 | 4732ms |
Representative trace packets
| Task | Domain | Split | Difficulty | Top run | Score |
|---|---|---|---|---|---|
| Pricing page extraction | Extraction | public | Medium | Frontier reasoning model | 73 |
| Multi-step demo form | Form completion | holdout | Hard | Frontier reasoning model | 69 |
| Invoice portal download | Navigation | public | Medium | Frontier reasoning model | 72 |
| Competitor feature map | Extraction | holdout | Hard | Frontier reasoning model | 70 |
Rubric
Task success
State verification
Recovery quality
Human handoff rate
Leaderboard controls
| Control | Rule |
|---|---|
| Freshness | The public sample is refreshed monthly; the private holdout stays sealed until replacement tasks exist. |
| Leakage policy | Do not use tasks sourced from public examples, vendor demos, or training-contaminated snippets without replacement variants. |
| Repeat-run rule | Repeat any result within five points of a leaderboard boundary across at least three seeds. |
| Retirement rule | Retire a task when frontier and mid-tier models cluster near the ceiling or when source material becomes widely circulated. |
| Required provenance | traceId, createdAt, split, source, modelVersion, runSeed, reviewerNote, retirementStatus |
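The repeat-run rule and the required provenance fields can both be checked mechanically before a result is posted. The following is a minimal sketch, not the benchmark's actual tooling. The function names, the idea of passing leaderboard boundaries as a list of cut-off scores, and the reading of "within five points" as inclusive are all assumptions. Only the field names, the five-point margin, and the three-seed minimum come from the table above.

```python
from typing import Iterable, Mapping, Sequence

# Field names copied from the "Required provenance" row.
REQUIRED_PROVENANCE = (
    "traceId", "createdAt", "split", "source",
    "modelVersion", "runSeed", "reviewerNote", "retirementStatus",
)

BOUNDARY_MARGIN = 5  # "within five points" (treated as inclusive: an assumption)
MIN_SEEDS = 3        # "at least three seeds"


def missing_provenance(record: Mapping[str, object]) -> list[str]:
    """Return the required fields that are absent or None (hypothetical helper)."""
    return [field for field in REQUIRED_PROVENANCE if record.get(field) is None]


def needs_repeat_runs(
    score: float,
    boundaries: Iterable[float],
    seeds: Sequence[int],
) -> bool:
    """True when a score sits near a boundary but has too few distinct seeds.

    `boundaries` is an assumed input: the brief does not define how
    leaderboard boundaries are chosen, so the caller supplies them.
    """
    near_boundary = any(abs(score - b) <= BOUNDARY_MARGIN for b in boundaries)
    return near_boundary and len(set(seeds)) < MIN_SEEDS


if __name__ == "__main__":
    # Illustrative values only; the boundary of 70 is hypothetical.
    record = {"traceId": "example-trace", "split": "holdout", "runSeed": 7}
    print(missing_provenance(record))
    print(needs_repeat_runs(71, boundaries=[70], seeds=[7]))
```

Counting distinct seeds rather than total runs is a deliberate choice in this sketch: rerunning the same seed three times would not show how much the score varies between seeds.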