Benchmarks
Building a Useful AI Leaderboard Without Fooling Ourselves
A field guide for ranking AI systems without being fooled by contamination, brittle prompts, saturated tests, or one-number theatre.
Research lens
6 controls
4 leaderboard tracks
A leaderboard is only the visible layer. The real asset is the protocol: task sampling, private holdouts, scoring rubrics, repeat runs, audit logs, model settings, and a policy for retiring stale tasks. Without that protocol, the ranking becomes a design object with numbers attached.
Generic tests are useful for orientation, but deployment decisions need tasks that look like support tickets, GST invoice checks, sales-call summaries, bilingual handoffs, policy retrieval, and document-heavy back-office work. The unit of measurement should be the completed workflow, not isolated answer accuracy.
The site should use a rotating public sample, a private holdout, timestamped task provenance, adversarial prompt variants, and periodic benchmark retirement. LiveBench and SWE-bench-Live point in the right direction: refresh the task distribution so models cannot simply memorize yesterday's test.
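A minimal sketch of that protocol as configuration, assuming a Python harness; every field name, and the 40/60 public-to-private target, is illustrative rather than a fixed Edxperimental spec.

from dataclasses import dataclass, field
from datetime import date

@dataclass
class BenchmarkProtocol:
    # Share of the task pool published for trust; the rest stays private for signal.
    public_fraction: float = 0.40
    # How often the public sample rotates and stale tasks retire.
    refresh_cadence_days: int = 30
    # Every task carries a creation date and source window for contamination checks.
    require_provenance: bool = True
    # Adversarial paraphrases, missing-field cases, and multilingual variants per task.
    stress_variants_per_task: int = 3
    # Repeat runs separate real capability from one lucky prompt path.
    seeds: tuple = (0, 1, 2)
    temperatures: tuple = (0.0, 0.7)
    # Tasks retire once most models cluster near this ceiling.
    saturation_ceiling: float = 0.95
    last_refresh: date = field(default_factory=date.today)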
Visual
Illustrative trust score; higher is better.
Task freshness
Reduces training-data leakage
Monthly workflow refresh with provenance
Private holdout
Prevents direct optimization
Keep a non-public eval split
Multiple seeds
Controls prompt variance
Repeat runs across temperatures/settings (sketched after these cards)
Outcome rubrics
Measures business utility
Human review plus structured scoring
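The "Multiple seeds" control above is the one teams skip most often. A minimal sketch of repeat runs, assuming the harness exposes a run_model(task, seed, temperature) callable and a rubric-based score_fn; both names, and the seed and temperature defaults, are placeholders.

from itertools import product
from statistics import mean, pstdev

def score_with_repeats(task, run_model, score_fn, seeds=(0, 1, 2), temperatures=(0.0, 0.7)):
    """Run one task across seeds and temperatures, then report spread, not just a single number."""
    scores = []
    for seed, temp in product(seeds, temperatures):
        output = run_model(task, seed=seed, temperature=temp)
        scores.append(score_fn(task, output))  # rubric scorer returning 0..1
    return {
        "task_id": task["id"],
        "mean_score": mean(scores),
        "spread": pstdev(scores),  # high spread flags a brittle prompt path for review
        "runs": len(scores),
    }

A result whose spread crosses a leaderboard boundary should trigger the repeat-run rule in the control room below.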
Process
Contamination risk
High
Static public tests become weak evidence once models and prompt pipelines are optimized against them.
Best unit
Workflow
Score the completed business outcome, not only the model's isolated answer.
Refresh cycle
Monthly
Retire stale tasks and keep a dated public sample for reproducibility.
LiveBench pattern
Fresh, objective tasks
LiveBench limits contamination by drawing from recently released material and objective ground truth; Edxperimental should use the same idea for Indian workflow tasks.
SWE-bench-Live warning
Static scores can overstate readiness
Live issue-resolution benchmarks show why repository diversity, Dockerized execution, and update cadence matter for coding-agent claims.
Edxperimental control
Public sample plus private holdout
Publish enough tasks for trust, keep enough private tasks for signal, and rotate stale items out of the scoring pool.
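A minimal sketch of that rotation, assuming each task is a dict with created (a date), saturated, and public flags; the 180-day age limit and 40/60 public share are placeholders, not published policy.

from datetime import date, timedelta

def rotate_task_pool(tasks, today=None, max_age_days=180, public_fraction=0.40):
    """Retire stale or saturated tasks and re-draw the public sample from what remains."""
    today = today or date.today()
    active = [
        t for t in tasks
        if not t["saturated"] and (today - t["created"]) <= timedelta(days=max_age_days)
    ]
    # Oldest surviving tasks go public first so the private holdout stays freshest.
    active.sort(key=lambda t: t["created"])
    cutoff = int(len(active) * public_fraction)
    for i, t in enumerate(active):
        t["public"] = i < cutoff
    retired = [t for t in tasks if t not in active]
    return active, retired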
Research map
A useful leaderboard makes each layer inspectable, from task sourcing to buyer recommendation. The map below is the minimum evidence chain before a ranking deserves trust; a data-model sketch follows it.
Task source
Where did this task come from, and could it be in training data?
Trace provenance
Gold answer
What exactly counts as success before the model runs?
Rubric packet
Model run
Is this a stable result or one lucky prompt path?
Run ledger
Human audit
What broke, how severe was it, and who reviewed it?
Reviewer memo
Publication
Which buyer decision changes because of this score?
Risk note
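A minimal data-model sketch of that chain, assuming a Python harness; class and field names are illustrative. The point is that every published number links back through each layer to its raw evidence.

from dataclasses import dataclass
from datetime import date

@dataclass
class TaskSource:
    task_id: str
    origin: str            # e.g. anonymized support ticket or invoice sample
    source_window: tuple   # (start_date, end_date) for contamination checks
    created: date

@dataclass
class GoldAnswer:
    task_id: str
    expected_output: str
    rubric: dict           # criterion -> weight, fixed before any model runs

@dataclass
class ModelRun:
    trace_id: str
    task_id: str
    model_version: str
    seed: int
    temperature: float
    output: str
    latency_ms: int
    cost_usd: float

@dataclass
class HumanAudit:
    trace_id: str
    reviewer: str
    severity: str          # none / minor / major / critical
    failure_category: str
    notes: str

@dataclass
class PublishedScore:
    task_id: str
    trace_ids: list
    score: float
    risk_note: str         # which buyer decision this score should change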
Leaderboard control room
Freshness
Task creation date, source window, and retirement status
Refresh public sample monthly; retire saturated items.
Holdout
Public/private split, holdout size, and leakage policy
Keep high-value tasks private until replacement tasks exist.
Repeat runs
Seeds, temperature, run count, model version, and timestamps
Repeat any result near a leaderboard boundary.
Outcome rubric
Expected output, required evidence, severity labels, reviewer notes
Review rubric drift after every task-family update.
Provenance
Trace id, prompt packet, tool logs, latency, cost, and reviewer
Store raw traces before publishing aggregate claims.
Stress variants
Adversarial paraphrases, missing-field cases, multilingual variants
Add variants when the first failure cluster appears.
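A minimal sketch of how the stress variants in the last row could be derived, assuming tasks are dicts with prompt, fields, required_fields, and target_languages keys; paraphrases and translations are left to reviewers or a separate model rather than fabricated here.

import copy

def make_stress_variants(task, paraphrases=None):
    """Derive adversarial variants of one task: paraphrases, missing fields, language swaps."""
    variants = []
    # Adversarial paraphrases come from reviewers or a rewriting model, never auto-trusted.
    for i, text in enumerate(paraphrases or []):
        v = copy.deepcopy(task)
        v["prompt"] = text
        v["variant"] = f"paraphrase-{i}"
        variants.append(v)
    # Missing-field cases: drop one required field at a time and expect a graceful refusal.
    for name in task.get("required_fields", []):
        v = copy.deepcopy(task)
        v["fields"] = {k: val for k, val in task.get("fields", {}).items() if k != name}
        v["variant"] = f"missing-{name}"
        variants.append(v)
    # Multilingual variants are flagged for translation rather than generated blindly.
    for lang in task.get("target_languages", []):
        v = copy.deepcopy(task)
        v["variant"] = f"needs-translation-{lang}"
        variants.append(v)
    return variants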
Public/private split
40/60
Publish enough examples for trust, but keep most decision-grade tasks private so the benchmark remains useful after launch.
Minimum run packet
Trace + rubric
A score should never appear without a trace id, model version, scoring rubric, reviewer note, and failure category; a publish-gate sketch follows these cards.
Retirement trigger
Saturation
When most frontier and mid-tier models cluster near the ceiling, the task stops separating the options buyers care about and should leave the active scoring pool.
Buyer artifact
Risk memo
The published leaderboard should end in a deployment recommendation, not a single global winner.
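A minimal sketch tying the run-packet and retirement rules together, assuming a Python harness; the required field names, the 0.95 ceiling, and the 0.8 share are illustrative thresholds.

REQUIRED_PACKET_FIELDS = (
    "trace_id", "model_version", "rubric", "reviewer_note", "failure_category",
)

def can_publish(run_packet):
    """Block any score that arrives without its minimum run packet."""
    missing = [f for f in REQUIRED_PACKET_FIELDS if not run_packet.get(f)]
    return (len(missing) == 0, missing)

def is_saturated(task_scores, ceiling=0.95, share=0.8):
    """Flag a task for retirement once most models cluster near the ceiling."""
    if not task_scores:
        return False
    near_ceiling = sum(1 for s in task_scores.values() if s >= ceiling)
    return near_ceiling / len(task_scores) >= share

A task flagged by is_saturated would leave the active pool, while its dated record stays available for reproducibility.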
Recommendation
The right model, benchmark, or interpretability method depends on the workflow, risk tolerance, budget, latency target, data sensitivity, and the cost of a wrong answer.