Downloadable benchmark brief

Indian Enterprise Workflow Suite Buyer Brief

Generated from Edxperimental Labs benchmark data: model rows, task traces, task mix, leaderboard controls, and the next evidence to collect before production.

Executive readout

Buyer decision memo

Strong candidate; inspect cost and latency before production use.

Model class | Score | Recovery | Cost | P95 latency
Frontier reasoning model | 88 | 83 | 52 | 5638 ms
Fast mid-tier model | 76 | 66 | 81 | 4832 ms
Open-weight local model | 61 | 49 | 73 | 6126 ms
Small routing model | 52 | 36 | 92 | 4858 ms

Representative trace packets

Inspectable tasks behind the score

Task | Domain | Split | Difficulty | Top run | Score
GST invoice discrepancy explanation | Finance | public | Medium | Frontier reasoning model | 91
Hindi-English refund escalation | Support | holdout | Hard | Frontier reasoning model | 86
Vendor contract renewal risk | Legal | public | Medium | Frontier reasoning model | 87
GST credit note reconciliation | Finance | holdout | Hard | Frontier reasoning model | 89

Rubric

Outcome correctness
Evidence citation
Escalation judgement
Cost per accepted output
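
The brief names "Cost per accepted output" as a rubric dimension but does not define it. One plausible reading, offered here only as an assumption, is total spend divided by the number of outputs a reviewer accepted. The function name, its parameters, and the figures below are hypothetical and are not taken from the benchmark data.

```python
def cost_per_accepted_output(total_cost: float, accepted: int) -> float:
    """Total spend divided by the number of accepted outputs.

    Assumed definition; the brief does not state the formula or the currency.
    """
    if accepted <= 0:
        # With no accepted outputs the ratio is undefined, so fail loudly.
        raise ValueError("no accepted outputs; cost per accepted output is undefined")
    return total_cost / accepted


# Illustrative numbers only: 120.0 units of spend, 80 accepted outputs.
print(cost_per_accepted_output(total_cost=120.0, accepted=80))  # 1.5
```

Dividing by accepted outputs rather than by all attempts means a cheap model with a high rejection rate is charged for its rejected work.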

Leaderboard controls

Controls attached to this run

Freshness: Public sample refreshed monthly while private holdout stays sealed until replacement tasks exist.
Leakage policy: Do not use tasks sourced from public examples, vendor demos, or training-contaminated snippets without replacement variants.
Repeat-run rule: Repeat any result within five points of a leaderboard boundary across at least three seeds.
Retirement rule: Retire a task when frontier and mid-tier models cluster near the ceiling or when source material becomes widely circulated.
Required provenance: traceId, createdAt, split, source, modelVersion, runSeed, reviewerNote, retirementStatus
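
The controls above could be checked mechanically before a run is accepted. The following is a minimal sketch, not the lab's actual tooling. The provenance field names, the five-point margin, and the three-seed minimum come from the controls. The function names, the notion of a numeric `boundary`, the inclusive reading of "within five points", and the `ceiling` value of 95.0 are assumptions made for illustration.

```python
# Field names copied from the "Required provenance" control.
REQUIRED_PROVENANCE = (
    "traceId", "createdAt", "split", "source",
    "modelVersion", "runSeed", "reviewerNote", "retirementStatus",
)


def missing_provenance(record: dict) -> list:
    """Return the required fields that are absent or None in a trace record."""
    return [name for name in REQUIRED_PROVENANCE if record.get(name) is None]


def needs_repeat_runs(score, boundary, seeds_run, margin=5.0, min_seeds=3):
    """Repeat-run rule: flag a result within `margin` points of a boundary
    that has been run on fewer than `min_seeds` seeds.

    Assumption: "within five points" is treated as inclusive.
    """
    return abs(score - boundary) <= margin and seeds_run < min_seeds


def near_ceiling(frontier_score, mid_tier_score, ceiling=95.0):
    """Retirement rule, first clause only: both model classes score at or
    above a ceiling threshold. The 95.0 default is a placeholder, not a
    value stated in the brief.
    """
    return frontier_score >= ceiling and mid_tier_score >= ceiling
```

For example, a hypothetical record that carries every field except `retirementStatus` would yield `["retirementStatus"]` from `missing_provenance`, and a score of 88 against an assumed boundary of 85 with a single seed would be flagged by `needs_repeat_runs`. The second retirement clause, source material becoming widely circulated, needs human review and is not modelled here.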