Downloadable benchmark brief

Coding Agent Maintenance Suite Buyer Brief

Generated from Edxperimental Labs benchmark data: model rows, task traces, task mix, leaderboard controls, and the next evidence to collect before production.

Executive readout

Buyer decision memo

Usable for constrained workflows with fallback routing.

Model class               Score  Recovery  Cost  P95
Frontier reasoning model  76     74        47    5848 ms
Fast mid-tier model       62     58        79    4916 ms
Open-weight local model   51     45        71    6210 ms
Small routing model       34     28        94    4774 ms

Representative trace packets

Inspectable tasks behind the score

Task                             Domain       Split    Difficulty  Top run                   Score
Fix Command-K search regression  Frontend     public   Medium      Frontier reasoning model  78
Repair failing static build      Build        holdout  Hard        Frontier reasoning model  74
Add Playwright smoke test        Frontend QA  public   Medium      Frontier reasoning model  77
Refactor API error handling      Backend      holdout  Hard        Frontier reasoning model  75

Rubric

Patch correctness
Regression rate
Tool discipline
Review readiness

Leaderboard controls

Controls attached to this run

Freshness: Public sample refreshed monthly while private holdout stays sealed until replacement tasks exist.
Leakage policy: Do not use tasks sourced from public examples, vendor demos, or training-contaminated snippets without replacement variants.
Repeat-run rule: Repeat any result within five points of a leaderboard boundary across at least three seeds.
Retirement rule: Retire a task when frontier and mid-tier models cluster near the ceiling or when source material becomes widely circulated.
Required provenance: traceId, createdAt, split, source, modelVersion, runSeed, reviewerNote, retirementStatus
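
Two of the controls above are mechanical enough to check in code. The sketch below is one possible reading, not the benchmark's own tooling. The provenance field names are taken from the list above. Everything else is an assumption made for illustration: the function names, the idea that leaderboard boundaries are plain score thresholds, and the choice to treat "within five points" as inclusive.

```python
from typing import Mapping, Sequence

# Field names copied from the "Required provenance" control.
REQUIRED_PROVENANCE = (
    "traceId", "createdAt", "split", "source",
    "modelVersion", "runSeed", "reviewerNote", "retirementStatus",
)

REPEAT_MARGIN = 5  # "within five points" (inclusive is an assumption)
MIN_SEEDS = 3      # "at least three seeds"


def missing_provenance(record: Mapping[str, object]) -> list[str]:
    """Return required fields that are absent, None, or an empty string."""
    return [
        name for name in REQUIRED_PROVENANCE
        if record.get(name) is None or record.get(name) == ""
    ]


def needs_repeat_runs(score: float, boundaries: Sequence[float]) -> bool:
    """True when the score lies within REPEAT_MARGIN of any boundary.

    `boundaries` is a hypothetical list of score thresholds; the brief
    does not say how leaderboard boundaries are defined.
    """
    return any(abs(score - b) <= REPEAT_MARGIN for b in boundaries)


def seeds_still_required(
    score: float, boundaries: Sequence[float], seeds_run: Sequence[int]
) -> int:
    """Number of additional distinct seeds needed to satisfy the rule."""
    if not needs_repeat_runs(score, boundaries):
        return 0
    return max(0, MIN_SEEDS - len(set(seeds_run)))
```

For example, with a hypothetical boundary at 80, a score of 76 falls inside the five-point margin, so `seeds_still_required(76, [80], [1])` reports that two more seeds are needed. A record whose `runSeed` is 0 still counts as complete, because only missing, None, or empty-string values are flagged.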