Benchmarks
Building a Useful AI Leaderboard Without Fooling Ourselves
A field guide for ranking AI systems without being fooled by contamination, brittle prompts, saturated tests, or one-number theatre.
Research lens
6 controls
4 leaderboard tracks
A leaderboard is only the visible layer. The real asset is the protocol: task sampling, private holdouts, scoring rubrics, repeat runs, audit logs, model settings, and a policy for retiring stale tasks. Without that protocol, the ranking becomes a design object with numbers attached.
Generic tests are useful for orientation, but deployment decisions need tasks that look like support tickets, GST invoice checks, sales-call summaries, bilingual handoffs, policy retrieval, and document-heavy back-office work. The unit of measurement should be the completed workflow, not isolated answer accuracy.
The site should use a rotating public sample, a private holdout, timestamped task provenance, adversarial prompt variants, and periodic benchmark retirement. LiveBench and SWE-bench-Live point in the right direction: refresh the task distribution so models cannot simply memorize yesterday's test.
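A minimal sketch of that protocol as configuration, assuming a Python harness; every field name, and the 40/60 public-to-private target, is illustrative rather than a fixed Edxperimental spec.

from dataclasses import dataclass, field
from datetime import date

@dataclass
class BenchmarkProtocol:
    # Share of the task pool published for trust; the rest stays private for signal.
    public_fraction: float = 0.40
    # How often the public sample rotates and stale tasks retire.
    refresh_cadence_days: int = 30
    # Every task carries a creation date and source window for contamination checks.
    require_provenance: bool = True
    # Adversarial paraphrases, missing-field cases, and multilingual variants per task.
    stress_variants_per_task: int = 3
    # Repeat runs separate real capability from one lucky prompt path.
    seeds: tuple = (0, 1, 2)
    temperatures: tuple = (0.0, 0.7)
    # Tasks retire once most models cluster near this ceiling.
    saturation_ceiling: float = 0.95
    last_refresh: date = field(default_factory=date.today)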
Visual
Illustrative trust score; higher is better.
Task freshness
Reduces training-data leakage
Monthly workflow refresh with provenance
Private holdout
Prevents direct optimization
Keep a non-public eval split
Multiple seeds
Controls prompt variance
Repeat runs across temperatures/settings (sketched after these cards)
Outcome rubrics
Measures business utility
Human review plus structured scoring
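The "Multiple seeds" control above is the one teams skip most often. A minimal sketch of repeat runs, assuming the harness exposes a run_model(task, seed, temperature) callable and a rubric-based score_fn; both names, and the seed and temperature defaults, are placeholders.

from itertools import product
from statistics import mean, pstdev

def score_with_repeats(task, run_model, score_fn, seeds=(0, 1, 2), temperatures=(0.0, 0.7)):
    """Run one task across seeds and temperatures, then report spread, not just a single number."""
    scores = []
    for seed, temp in product(seeds, temperatures):
        output = run_model(task, seed=seed, temperature=temp)
        scores.append(score_fn(task, output))  # rubric scorer returning 0..1
    return {
        "task_id": task["id"],
        "mean_score": mean(scores),
        "spread": pstdev(scores),  # high spread flags a brittle prompt path for review
        "runs": len(scores),
    }

A result whose spread crosses a leaderboard boundary should trigger the repeat-run rule in the control room below.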
Process
Contamination risk
High
Static public tests become weak evidence once models and prompt pipelines are optimized against them.
Best unit
Workflow
Score the completed business outcome, not only the model's isolated answer.
Refresh cycle
Monthly
Retire stale tasks and keep a dated public sample for reproducibility.
LiveBench pattern
Fresh, objective tasks
LiveBench limits contamination by drawing from recently released material and objective ground truth; Edxperimental should use the same idea for Indian workflow tasks.
SWE-bench-Live warning
Static scores can overstate readiness
Live issue-resolution benchmarks show why repository diversity, Dockerized execution, and update cadence matter for coding-agent claims.
Edxperimental control
Public sample plus private holdout
Publish enough tasks for trust, keep enough private tasks for signal, and rotate stale items out of the scoring pool.
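A minimal sketch of that rotation, assuming each task is a dict with created (a date), saturated, and public flags; the 180-day age limit and 40/60 public share are placeholders, not published policy.

from datetime import date, timedelta

def rotate_task_pool(tasks, today=None, max_age_days=180, public_fraction=0.40):
    """Retire stale or saturated tasks and re-draw the public sample from what remains."""
    today = today or date.today()
    active = [
        t for t in tasks
        if not t["saturated"] and (today - t["created"]) <= timedelta(days=max_age_days)
    ]
    # Oldest surviving tasks go public first so the private holdout stays freshest.
    active.sort(key=lambda t: t["created"])
    cutoff = int(len(active) * public_fraction)
    for i, t in enumerate(active):
        t["public"] = i < cutoff
    retired = [t for t in tasks if t not in active]
    return active, retired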
Research map
A useful leaderboard makes each layer inspectable, from task sourcing to buyer recommendation. The map below is the minimum evidence chain before a ranking deserves trust; a data-model sketch follows it.
Task source
Where did this task come from, and could it be in training data?
Trace provenance
Gold answer
What exactly counts as success before the model runs?
Rubric packet
Model run
Is this a stable result or one lucky prompt path?
Run ledger
Human audit
What broke, how severe was it, and who reviewed it?
Reviewer memo
Publication
Which buyer decision changes because of this score?
Risk note
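A minimal data-model sketch of that chain, assuming a Python harness; class and field names are illustrative. The point is that every published number links back through each layer to its raw evidence.

from dataclasses import dataclass
from datetime import date

@dataclass
class TaskSource:
    task_id: str
    origin: str            # e.g. anonymized support ticket or invoice sample
    source_window: tuple   # (start_date, end_date) for contamination checks
    created: date

@dataclass
class GoldAnswer:
    task_id: str
    expected_output: str
    rubric: dict           # criterion -> weight, fixed before any model runs

@dataclass
class ModelRun:
    trace_id: str
    task_id: str
    model_version: str
    seed: int
    temperature: float
    output: str
    latency_ms: int
    cost_usd: float

@dataclass
class HumanAudit:
    trace_id: str
    reviewer: str
    severity: str          # none / minor / major / critical
    failure_category: str
    notes: str

@dataclass
class PublishedScore:
    task_id: str
    trace_ids: list
    score: float
    risk_note: str         # which buyer decision this score should change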
Leaderboard control room
Freshness
Task creation date, source window, and retirement status
Refresh public sample monthly; retire saturated items.
Holdout
Public/private split, holdout size, and leakage policy
Keep high-value tasks private until replacement tasks exist.
Repeat runs
Seeds, temperature, run count, model version, and timestamps
Repeat any result near a leaderboard boundary.
Outcome rubric
Expected output, required evidence, severity labels, reviewer notes
Review rubric drift after every task-family update.
Provenance
Trace id, prompt packet, tool logs, latency, cost, and reviewer
Store raw traces before publishing aggregate claims.
Stress variants
Adversarial paraphrases, missing-field cases, multilingual variants
Add variants when the first failure cluster appears.
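A minimal sketch of how the stress variants in the last row could be derived, assuming tasks are dicts with prompt, fields, required_fields, and target_languages keys; paraphrases and translations are left to reviewers or a separate model rather than fabricated here.

import copy

def make_stress_variants(task, paraphrases=None):
    """Derive adversarial variants of one task: paraphrases, missing fields, language swaps."""
    variants = []
    # Adversarial paraphrases come from reviewers or a rewriting model, never auto-trusted.
    for i, text in enumerate(paraphrases or []):
        v = copy.deepcopy(task)
        v["prompt"] = text
        v["variant"] = f"paraphrase-{i}"
        variants.append(v)
    # Missing-field cases: drop one required field at a time and expect a graceful refusal.
    for name in task.get("required_fields", []):
        v = copy.deepcopy(task)
        v["fields"] = {k: val for k, val in task.get("fields", {}).items() if k != name}
        v["variant"] = f"missing-{name}"
        variants.append(v)
    # Multilingual variants are flagged for translation rather than generated blindly.
    for lang in task.get("target_languages", []):
        v = copy.deepcopy(task)
        v["variant"] = f"needs-translation-{lang}"
        variants.append(v)
    return variants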
Public/private split
40/60
Publish enough examples for trust, but keep most decision-grade tasks private so the benchmark remains useful after launch.
Minimum run packet
Trace + rubric
A score should never appear without a trace id, model version, scoring rubric, reviewer note, and failure category; a publish-gate sketch follows these cards.
Retirement trigger
Saturation
When most frontier and mid-tier models cluster near the ceiling, the task stops separating the options buyers care about and should leave the active scoring pool.
Buyer artifact
Risk memo
The published leaderboard should end in a deployment recommendation, not a single global winner.
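A minimal sketch tying the run-packet and retirement rules together, assuming a Python harness; the required field names, the 0.95 ceiling, and the 0.8 share are illustrative thresholds.

REQUIRED_PACKET_FIELDS = (
    "trace_id", "model_version", "rubric", "reviewer_note", "failure_category",
)

def can_publish(run_packet):
    """Block any score that arrives without its minimum run packet."""
    missing = [f for f in REQUIRED_PACKET_FIELDS if not run_packet.get(f)]
    return (len(missing) == 0, missing)

def is_saturated(task_scores, ceiling=0.95, share=0.8):
    """Flag a task for retirement once most models cluster near the ceiling."""
    if not task_scores:
        return False
    near_ceiling = sum(1 for s in task_scores.values() if s >= ceiling)
    return near_ceiling / len(task_scores) >= share

A task flagged by is_saturated would leave the active pool, while its dated record stays available for reproducibility.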
Recommendation
The right model, benchmark, or interpretability method depends on the workflow, risk tolerance, budget, latency target, data sensitivity, and the cost of a wrong answer.