Downloadable benchmark brief

Browser Operations Suite Buyer Brief

Generated from Edxperimental Labs benchmark data: model rows, task traces, task mix, leaderboard controls, and the next evidence to collect before production.

Executive readout

Buyer decision memo

Usable for constrained workflows with fallback routing.

Model class                Score  Recovery  Cost  P95
Frontier reasoning model   71     69        49    5764 ms
Fast mid-tier model        60     55        82    4874 ms
Open-weight local model    43     39        76    6000 ms
Small routing model        31     25        95    4732 ms

Representative trace packets

Inspectable tasks behind the score

Task                      Domain           Split    Difficulty  Top run                   Score
Pricing page extraction   Extraction       public   Medium      Frontier reasoning model  73
Multi-step demo form      Form completion  holdout  Hard        Frontier reasoning model  69
Invoice portal download   Navigation       public   Medium      Frontier reasoning model  72
Competitor feature map    Extraction       holdout  Hard        Frontier reasoning model  70
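
As a quick sanity check on the four trace scores listed above, the sketch below averages them overall and per split. It assumes an unweighted mean, which the brief does not state; the variable and function names are illustrative only. Under that assumption the overall mean matches the Frontier reasoning model's reported Score (read here as 71), and the public tasks score about three points higher than the holdout tasks.

```python
from statistics import mean

# (task, split, score) rows transcribed from the trace table above.
traces = [
    ("Pricing page extraction", "public", 73),
    ("Multi-step demo form", "holdout", 69),
    ("Invoice portal download", "public", 72),
    ("Competitor feature map", "holdout", 70),
]


def mean_by_split(rows):
    """Group scores by split and return the unweighted mean of each group."""
    by_split = {}
    for _task, split, score in rows:
        by_split.setdefault(split, []).append(score)
    return {split: mean(scores) for split, scores in by_split.items()}


# Assumption: an unweighted mean; the brief does not say how scores are aggregated.
overall = mean(score for _task, _split, score in traces)
per_split = mean_by_split(traces)

print(overall)    # 71
print(per_split)  # {'public': 72.5, 'holdout': 69.5}
```

Four tasks are too few to draw conclusions from the public/holdout gap; it is shown only as the kind of comparison the split column makes possible.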

Rubric

Task success

State verification

Recovery quality

Human handoff rate

Leaderboard controls

Controls attached to this run

Freshness: Public sample refreshed monthly while private holdout stays sealed until replacement tasks exist.
Leakage policy: Do not use tasks sourced from public examples, vendor demos, or training-contaminated snippets without replacement variants.
Repeat-run rule: Repeat any result within five points of a leaderboard boundary across at least three seeds.
Retirement rule: Retire a task when frontier and mid-tier models cluster near the ceiling or when source material becomes widely circulated.
Required provenance: traceId, createdAt, split, source, modelVersion, runSeed, reviewerNote, retirementStatus
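
Two of these controls are mechanical enough to check in code. The sketch below is a minimal illustration, not the suite's actual tooling: the function names, the inclusive reading of "within five points", and the example boundary values (50 and 70) are all assumptions. Only the provenance field names and the five-point and three-seed thresholds come from the controls above.

```python
from __future__ import annotations

# Field names copied from the "Required provenance" control.
REQUIRED_PROVENANCE = (
    "traceId", "createdAt", "split", "source",
    "modelVersion", "runSeed", "reviewerNote", "retirementStatus",
)

BOUNDARY_MARGIN = 5  # "within five points of a leaderboard boundary"
MIN_SEEDS = 3        # "across at least three seeds"


def missing_provenance(record: dict) -> list[str]:
    """Return the required provenance fields that are absent or empty.

    A runSeed of 0 counts as present; only None and "" count as empty.
    """
    return [f for f in REQUIRED_PROVENANCE if record.get(f) in (None, "")]


def needs_repeat_runs(score: float, boundaries: list[float], seeds_run: int) -> bool:
    """True when a score is near a boundary and has fewer than MIN_SEEDS runs.

    Assumption: "within five points" is treated as inclusive (<= 5).
    """
    near_boundary = any(abs(score - b) <= BOUNDARY_MARGIN for b in boundaries)
    return near_boundary and seeds_run < MIN_SEEDS


# Hypothetical leaderboard boundaries; the brief does not list the real ones.
example_boundaries = [50, 70]
print(needs_repeat_runs(71, example_boundaries, seeds_run=1))  # True
print(needs_repeat_runs(60, example_boundaries, seeds_run=1))  # False
```

With these hypothetical boundaries, a score of 71 from a single seed would be flagged for more runs, while a score of 60 is ten points from either boundary and would not.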