Downloadable benchmark brief
Generated from Edxperimental Labs benchmark data: model rows, task traces, task mix, leaderboard controls, and the next evidence to collect before production.
Executive readout
Strong candidate; inspect cost and latency before production use.
| Model class | Score | Recovery | Cost | P95 latency |
|---|---|---|---|---|
| Frontier reasoning model | 86 | 82 | 51 | 5512ms |
| Fast mid-tier model | 79 | 70 | 83 | 4790ms |
| Open-weight local model | 58 | 47 | 74 | 6042ms |
| Small routing model | 49 | 35 | 93 | 4816ms |
Representative trace packets
| Task | Domain | Split | Difficulty | Top run | Score |
|---|---|---|---|---|---|
| Refund policy boundary case | Refund decisions | public | Medium | Frontier reasoning model | 88 |
| Regional-language human handoff | Language handoff | holdout | Hard | Frontier reasoning model | 83 |
| Subscription downgrade save | Refund decisions | public | Medium | Frontier reasoning model | 87 |
| PII redaction escalation | Policy lookup | holdout | Hard | Frontier reasoning model | 85 |
Rubric
| Criterion |
|---|
| Resolution rate |
| Policy compliance |
| Tone control |
| Escalation precision |
Leaderboard controls
| Control | Rule |
|---|---|
| Freshness | The public sample is refreshed monthly; the private holdout stays sealed until replacement tasks exist. |
| Leakage policy | Do not use tasks sourced from public examples, vendor demos, or training-contaminated snippets without replacement variants. |
| Repeat-run rule | Rerun any result that falls within five points of a leaderboard boundary, using at least three seeds. |
| Retirement rule | Retire a task when frontier and mid-tier models cluster near the ceiling or when source material becomes widely circulated. |
| Required provenance | traceId, createdAt, split, source, modelVersion, runSeed, reviewerNote, retirementStatus |
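
The repeat-run rule and the required provenance fields can be checked mechanically. The sketch below is one possible reading, not the lab's own tooling. It assumes that a "leaderboard boundary" is a score supplied by the caller (for example, the score of an adjacent-ranked model), that "within five points" means an absolute difference of at most 5, and that a field counts as missing when it is absent, `None`, or an empty string. The helper names `missing_provenance` and `needs_repeat_run` and the sample record are hypothetical; only the field names come from the table above.

```python
from typing import Iterable, List, Mapping

# Field names copied from the "Required provenance" row.
REQUIRED_PROVENANCE = (
    "traceId", "createdAt", "split", "source",
    "modelVersion", "runSeed", "reviewerNote", "retirementStatus",
)

BOUNDARY_MARGIN = 5  # "within five points" (assumed to be inclusive)
MIN_SEEDS = 3        # "at least three seeds"


def missing_provenance(record: Mapping[str, object]) -> List[str]:
    """Return the required fields that are absent, None, or empty strings."""
    missing = []
    for field in REQUIRED_PROVENANCE:
        value = record.get(field)
        # A seed of 0 is a valid value, so only None and "" count as missing.
        if value is None or value == "":
            missing.append(field)
    return missing


def needs_repeat_run(score: float, boundaries: Iterable[float], seeds_run: int) -> bool:
    """True when a score is near any boundary and has fewer than MIN_SEEDS runs."""
    near_boundary = any(abs(score - b) <= BOUNDARY_MARGIN for b in boundaries)
    return near_boundary and seeds_run < MIN_SEEDS


# Illustrative values only; this record is made up for the example.
record = {"traceId": "t-001", "split": "holdout", "runSeed": 0}
print(missing_provenance(record))
# ['createdAt', 'source', 'modelVersion', 'reviewerNote', 'retirementStatus']
print(needs_repeat_run(83, boundaries=[79], seeds_run=1))  # True
print(needs_repeat_run(83, boundaries=[79], seeds_run=3))  # False
```

Keeping the margin and seed count as named constants makes the rule easy to adjust if the controls change. How boundaries are derived from the leaderboard is left to the caller, since the brief does not define it.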