Downloadable benchmark brief
Generated from Edxperimental Labs benchmark data: model rows, task traces, task mix, leaderboard controls, and the next evidence to collect before production.
Executive readout
Usable for constrained workflows with fallback routing.
| Model class | Score | Recovery | Cost | P95 latency |
|---|---|---|---|---|
| Frontier reasoning model | 76 | 74 | 47 | 5848ms |
| Fast mid-tier model | 62 | 58 | 79 | 4916ms |
| Open-weight local model | 51 | 45 | 71 | 6210ms |
| Small routing model | 34 | 28 | 94 | 4774ms |
Representative trace packets
| Task | Domain | Split | Difficulty | Top run | Score |
|---|---|---|---|---|---|
| Fix Command-K search regression | Frontend | public | Medium | Frontier reasoning model | 78 |
| Repair failing static build | Build | holdout | Hard | Frontier reasoning model | 74 |
| Add Playwright smoke test | Frontend QA | public | Medium | Frontier reasoning model | 77 |
| Refactor API error handling | Backend | holdout | Hard | Frontier reasoning model | 75 |
Rubric
Patch correctness
Regression rate
Tool discipline
Review readiness
Leaderboard controls
| Control | Policy |
|---|---|
| Freshness | Public sample refreshed monthly while private holdout stays sealed until replacement tasks exist. |
| Leakage policy | Do not use tasks sourced from public examples, vendor demos, or training-contaminated snippets without replacement variants. |
| Repeat-run rule | Repeat any result within five points of a leaderboard boundary across at least three seeds. |
| Retirement rule | Retire a task when frontier and mid-tier models cluster near the ceiling or when source material becomes widely circulated. |
| Required provenance | traceId, createdAt, split, source, modelVersion, runSeed, reviewerNote, retirementStatus |
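
The repeat-run rule and the required provenance fields can be checked mechanically before a result is posted. The following is a minimal sketch, not the benchmark's actual tooling. The function names, the idea that a "leaderboard boundary" is a numeric score threshold, and the inclusive reading of "within five points" are all assumptions made for illustration. Only the field names, the five-point margin, and the three-seed minimum come from the controls table.

```python
from typing import Collection, Iterable, Mapping

# Field names copied from the "Required provenance" row.
REQUIRED_PROVENANCE_FIELDS = (
    "traceId", "createdAt", "split", "source",
    "modelVersion", "runSeed", "reviewerNote", "retirementStatus",
)

BOUNDARY_MARGIN = 5  # "within five points"; treated as inclusive (assumption)
MIN_SEEDS = 3        # "at least three seeds"


def missing_provenance(record: Mapping[str, object]) -> list[str]:
    """Return required provenance fields that are absent, None, or empty strings."""
    return [
        field
        for field in REQUIRED_PROVENANCE_FIELDS
        if record.get(field) in (None, "")
    ]


def needs_repeat_run(
    score: float,
    boundaries: Iterable[float],
    seeds: Collection[int],
) -> bool:
    """Return True if the score sits near a boundary and too few distinct seeds exist.

    `boundaries` is a hypothetical list of score thresholds that separate
    leaderboard positions; the brief does not define how they are derived.
    """
    near_boundary = any(abs(score - b) <= BOUNDARY_MARGIN for b in boundaries)
    return near_boundary and len(set(seeds)) < MIN_SEEDS
```

For example, with a hypothetical boundary at 60, a score of 62 run on a single seed would be flagged for repetition, while the same score backed by three distinct seeds would not. A record that omits `runSeed` would be reported by `missing_provenance` and could be held back until the field is filled in.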