Benchmark Lab

Evaluation suites for real AI buying decisions.

Leaderboards are the output. Benchmarks are the machine underneath: task design, gold answers, scoring rubrics, repeated runs, traces, and failure analysis.

Benchmark Evidence Readiness

Do not let scaffold traces masquerade as leaderboard proof.

Current benchmark pages are useful as scaffolds and buyer methodology examples. They should not be treated as final public leaderboard evidence until raw harness logs, provider response ids, screenshot/state proof, repeated seeds, and reviewer signoff are imported.

57/100

Average readiness

7

Evidence gates

5

Suites checked

Indian Enterprise Workflow Suite · 57/100 readiness · Not leaderboard-ready
Coding Agent Maintenance Suite · 57/100 readiness · Not leaderboard-ready
Browser Operations Suite · 57/100 readiness · Not leaderboard-ready
Support Agent Policy Suite · 57/100 readiness · Not leaderboard-ready
AI Security & Risk Suite · 57/100 readiness · Not leaderboard-ready

Gate · Weight · Proof required
Task design and split · 14 · Public/private split, task brief, expected output, scoring focus, leakage policy, and retirement rule exist for the suite.
Gold answer and rubric · 16 · Every scored task has expected evidence, rubric components, and reviewer-facing scorecard.
Raw run log · 16 · Every model run has raw harness logs, prompt packet id, run seed, exact model version, tool calls, retries, and answer excerpt.
Provider response identity · 12 · Every model run preserves provider, model identifier, model version, timestamp, and response id or equivalent audit handle.
Screenshot or state proof · 12 · Browser/app/document tasks preserve screenshot, DOM state, file diff, or comparable state proof for completion claims.
Repeat-run stability · 14 · Leaderboard-boundary results are repeated across multiple seeds and instability is reported before ranking.
Reviewer signoff · 16 · A named reviewer signs the scorecard, failure reason, and buyer recommendation after inspecting trace artifacts.
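
The seven gate weights above sum to 100, so a suite's readiness can be read as a weighted evidence total. The TypeScript sketch below illustrates that idea; the partial-credit completion model and all names are assumptions, not the published scoring code.

```ts
// Sketch only: readiness as a weighted sum over the seven evidence gates.
// Weights come from the table above (they sum to 100); the partial-credit
// completion model is an assumption used to show how a score like 57/100
// could arise even when no single gate is fully closed.
type GateId =
  | "taskDesignAndSplit"
  | "goldAnswerAndRubric"
  | "rawRunLog"
  | "providerResponseIdentity"
  | "screenshotOrStateProof"
  | "repeatRunStability"
  | "reviewerSignoff";

const GATE_WEIGHTS: Record<GateId, number> = {
  taskDesignAndSplit: 14,
  goldAnswerAndRubric: 16,
  rawRunLog: 16,
  providerResponseIdentity: 12,
  screenshotOrStateProof: 12,
  repeatRunStability: 14,
  reviewerSignoff: 16,
};

// completion[gate] is in [0, 1]: 1 = real evidence imported, 0 = scaffold only.
function readinessScore(completion: Partial<Record<GateId, number>>): number {
  return (Object.keys(GATE_WEIGHTS) as GateId[]).reduce(
    (total, gate) => total + GATE_WEIGHTS[gate] * (completion[gate] ?? 0),
    0,
  );
}

// Example: design, rubric, and signoff gates closed, run-log evidence half done.
readinessScore({
  taskDesignAndSplit: 1,
  goldAnswerAndRubric: 1,
  reviewerSignoff: 1,
  rawRunLog: 0.5,
}); // 54
```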

Indian Enterprise Workflow Suite

Blocking evidence still needed.

Raw run log · Scaffold only

Import raw notebook or harness logs through the benchmark intake kit before leaderboard claims depend on these rows.

Provider response identity · Missing real evidence

Capture exact provider model identifiers, timestamps, request ids, and response ids for every run.

Screenshot or state proof · Missing real evidence

Attach screenshots, DOM snapshots, diffs, or document outputs to each artifact bundle.

Coding Agent Maintenance Suite

Blocking evidence still needed.

Raw run log · Scaffold only

Import raw notebook or harness logs through the benchmark intake kit before leaderboard claims depend on these rows.

Provider response identity · Missing real evidence

Capture exact provider model identifiers, timestamps, request ids, and response ids for every run.

Screenshot or state proof · Missing real evidence

Attach screenshots, DOM snapshots, diffs, or document outputs to each artifact bundle.

Indian Workflow Dataset

v0.1 seed design with public samples and private holdouts.

The Indian Enterprise Workflow Benchmark v0.1 seed dataset defines the public samples and private holdouts. The task list expands the benchmark beyond four inspectable traces into a fuller dataset blueprint for future harness runs.

30

Tasks

18

Public

12

Holdout

Domain · Tasks · Public · Holdout
Finance · 4 · 2 · 2
Support · 4 · 2 · 2
Sales Ops · 4 · 2 · 2
Legal · 4 · 2 · 2
Procurement · 4 · 3 · 1
HR Ops · 4 · 3 · 1
Healthcare Admin · 3 · 2 · 1
Field Operations · 3 · 2 · 1
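
For reference, the same split expressed as data. This is an illustrative TypeScript sketch with hypothetical names, useful mainly as a sanity check that the per-domain counts add up to the 30 / 18 / 12 totals above.

```ts
// Illustrative only: the v0.1 seed split from the table above as a checked
// constant. Interface and variable names are hypothetical.
interface DomainSplit {
  domain: string;
  tasks: number;
  publicTasks: number;
  holdoutTasks: number;
}

const V01_SEED_SPLIT: DomainSplit[] = [
  { domain: "Finance", tasks: 4, publicTasks: 2, holdoutTasks: 2 },
  { domain: "Support", tasks: 4, publicTasks: 2, holdoutTasks: 2 },
  { domain: "Sales Ops", tasks: 4, publicTasks: 2, holdoutTasks: 2 },
  { domain: "Legal", tasks: 4, publicTasks: 2, holdoutTasks: 2 },
  { domain: "Procurement", tasks: 4, publicTasks: 3, holdoutTasks: 1 },
  { domain: "HR Ops", tasks: 4, publicTasks: 3, holdoutTasks: 1 },
  { domain: "Healthcare Admin", tasks: 3, publicTasks: 2, holdoutTasks: 1 },
  { domain: "Field Operations", tasks: 3, publicTasks: 2, holdoutTasks: 1 },
];

// Sanity check: 30 tasks total, 18 public, 12 holdout, and each row consistent.
const totals = V01_SEED_SPLIT.reduce(
  (acc, d) => {
    console.assert(d.publicTasks + d.holdoutTasks === d.tasks, d.domain);
    return {
      tasks: acc.tasks + d.tasks,
      publicTasks: acc.publicTasks + d.publicTasks,
      holdoutTasks: acc.holdoutTasks + d.holdoutTasks,
    };
  },
  { tasks: 0, publicTasks: 0, holdoutTasks: 0 },
);
console.assert(totals.tasks === 30 && totals.publicTasks === 18 && totals.holdoutTasks === 12);
```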

CSV trace importer

Benchmark pages now ingest notebook-style run rows.

The generator merges structured JSON task packets with CSV run rows, then publishes static suite and trace pages. This gives the team a practical bridge from spreadsheet reviews, notebooks, and real provider runs into the website.

JSON task packets

Task brief, expected output, rubric, evidence, and failure mode.

CSV run rows

One row per model run with score, cost units, latency, answer excerpt, tools, and reviewer note.

20 trace pages

Four inspectable traces per suite are now generated from mixed JSON and CSV inputs.
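
A minimal sketch of that merge step, assuming a csv-parse dependency and illustrative field names; the real generator and its schema may differ.

```ts
// Sketch: join CSV run rows to JSON task packets before emitting trace pages.
import { readFileSync } from "node:fs";
import { parse } from "csv-parse/sync";

interface TaskPacket {
  suiteSlug: string;
  taskId: string;
  brief: string;
  expectedOutput: string;
  rubric: string[];
  failureMode: string;
}

interface RunRow {
  suiteSlug: string;
  taskId: string;
  modelId: string;
  score: number;
  costUnits: number;
  latencyMs: number;
  answerExcerpt: string;
  reviewerNote: string;
}

// One trace page per (task, model run) pair; rows without a matching packet
// are dropped rather than published without task context.
function loadTracePages(packetPath: string, runsCsvPath: string) {
  const packets: TaskPacket[] = JSON.parse(readFileSync(packetPath, "utf8"));
  const rows: RunRow[] = parse(readFileSync(runsCsvPath, "utf8"), {
    columns: true,
    cast: true,
  });
  return rows.flatMap((row) => {
    const packet = packets.find(
      (p) => p.suiteSlug === row.suiteSlug && p.taskId === row.taskId,
    );
    return packet ? [{ packet, run: row }] : [];
  });
}
```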

Indian Workflow Gold Packets

Source packets, gold answers, reviewer notes, and scoring checklists.

The dataset now includes a reviewer-ready packet layer for every seed task. Public packets expose redacted methodology samples; holdout packets preserve private source summaries until replacement tasks exist.

30

Gold packets

18

Public

12

Holdout

Finance / public · GST Credit Note Reconciliation · Sanjay Prasad · public-redacted-sample
Cite the mismatch, calculate corrected payable, and draft a vendor note. The answer must cite invoice line, credit note, vendor email and avoid adding facts outside the source packet.

Finance / public · TDS Deduction Query · Sanjay Prasad · public-redacted-sample
Explain deduction basis and identify whether finance escalation is needed. The answer must cite TDS policy, payment ledger, invoice total and avoid adding facts outside the source packet.

Finance / holdout · Advance Payment Variance · Sanjay Prasad · private-holdout-summary
Identify variance and recommend release, hold, or review. The answer must cite contract clause, milestone note, payment status and avoid adding facts outside the source packet.

Finance / holdout · Quarterly Budget Exception · Sanjay Prasad · private-holdout-summary
Classify approval path and cite the missing approval if any. The answer must cite budget policy, email approval, expense category and avoid adding facts outside the source packet.

Support / holdout · Hindi English Refund Escalation · Saujas · private-holdout-summary
Reply safely and escalate only the payment reconciliation issue. The answer must cite refund policy, delivery timestamp, payment status and avoid adding facts outside the source packet.

Support / public · Regional Language Complaint Routing · Saujas · public-redacted-sample
Classify intent, preserve tone, and request only missing evidence. The answer must cite complaint text, order history, queue policy and avoid adding facts outside the source packet.
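
A hypothetical shape for one gold-packet entry, mirroring the fields shown above. The field names and the GST example are illustrative, not the canonical packet schema.

```ts
// Assumed gold-packet shape: domain, split, reviewer, packet visibility,
// and the scoring constraints the answer must satisfy.
interface GoldPacket {
  domain: string;
  split: "public" | "holdout";
  title: string;
  reviewer: string;
  packetKind: "public-redacted-sample" | "private-holdout-summary";
  instruction: string;          // what the answer must do
  requiredCitations: string[];  // evidence the answer must cite
  noNewFacts: true;             // answers may not add facts outside the packet
}

const gstCreditNote: GoldPacket = {
  domain: "Finance",
  split: "public",
  title: "GST Credit Note Reconciliation",
  reviewer: "Sanjay Prasad",
  packetKind: "public-redacted-sample",
  instruction: "Cite the mismatch, calculate corrected payable, and draft a vendor note.",
  requiredCitations: ["invoice line", "credit note", "vendor email"],
  noNewFacts: true,
};
```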

Benchmark Intake Kit

A landing zone for real model, agent, and notebook runs.

A concrete intake kit for replacing synthetic benchmark rows with real provider, notebook, browser-agent, and coding-agent run exports. The kit defines the fields, evidence gates, and review steps needed before a run can affect a leaderboard or buyer report.

16

Required fields

4

Evidence gates

1

Capture

Export raw model/provider/notebook runs with exact task ids and model ids before any scoring cleanup.

2

Normalize

Map the export into the CSV template and attach artifact URIs for prompts, outputs, traces, and screenshots.

3

Review

Score each run against the suite rubric, write a reviewer note, and flag missing evidence before publishing.

4

Generate

Run pnpm benchmarks:generate to merge accepted rows into static suite, trace, report, and artifact pages.

5

Verify

Run pnpm benchmarks:replay for representative traces and pnpm verify:site against the local Webpack dev server.

Packet identity

A benchmark result is unusable if the task, prompt, source packet, and model output cannot be tied to stable ids.

suiteSlug · taskId · promptPacketHash · inputArtifactUri · outputArtifactUri

Run reproducibility

Close leaderboard calls need repeated runs, exact model identifiers, seeds, and enough metadata to reproduce the route.

modelId · provider · runSeed · startedAt · latencyMs

State proof

Agent and browser tasks must prove completion through tool traces, screenshots, file artifacts, or terminal logs rather than claims.

toolTraceUri · screenshotUri · outputArtifactUri

Scoring audit

A score needs reviewer rationale so a buyer can inspect why a run was accepted, partially accepted, or rejected.

score · reviewer · reviewerNote
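
Taken together, the four field groups suggest a run-row shape like the sketch below. Types and optionality here are assumptions; the CSV template in the intake kit remains the source of truth.

```ts
// Assumed shape of one intake run row, grouped by the evidence gates above.
interface BenchmarkRunRow {
  // Packet identity
  suiteSlug: string;
  taskId: string;
  promptPacketHash: string;
  inputArtifactUri: string;
  outputArtifactUri: string;
  // Run reproducibility
  modelId: string;
  provider: string;
  runSeed: number;
  startedAt: string; // ISO-8601 timestamp
  latencyMs: number;
  // State proof (whichever artifacts the task type produces)
  toolTraceUri?: string;
  screenshotUri?: string;
  // Scoring audit
  score: number;
  reviewer: string;
  reviewerNote: string;
}
```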

Benchmark Harness Kit

Adapter contracts for turning real runs into reviewable evidence.

Adapter contracts and sample rows for moving real provider, browser-agent, coding-agent, and support-agent benchmark exports into the Edxperimental Labs intake pipeline. The kit closes the gap between raw harness output and the current intake form, CSV template, replay scaffold, and benchmark pages.

4

Adapters

4

Quality gates

5

Pipeline lanes

1

Export

Harness emits raw JSONL/trace artifacts before a reviewer touches the score.

2

Validate

Schema checks confirm task identity, model identity, prompt packet, artifacts, latency, and cost fields.

3

Review

Human reviewer assigns score, failure category, and acceptance state with a short rationale.

4

Intake

Accepted rows enter /api/benchmark-run-intake or the CSV template for static generation.

5

Publish

Generated benchmark pages expose aggregate scores while preserving artifact links and evidence caveats.

Gate · Proof · Blocks
Immutable task packet · The row references a task id, source packet, prompt packet hash, and expected-output rubric. · Prevents silently changing the task after a model has already run.
Exact system identity · The row records provider, model id, model version, agent version, route, and settings. · Prevents comparing vague brand names instead of reproducible systems.
State and artifact proof · The row links raw answer, tool trace, screenshot, diff, terminal log, or transcript as appropriate. · Prevents fluent completion claims from becoming benchmark evidence.
Human scoring reason · The row includes reviewer, score, failure category, acceptance state, and reviewer note. · Prevents a number from appearing without an auditable reason.
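
One way the adapter contract could look in TypeScript. Interface names, the gate checks, and the candidate-row shape are assumptions meant to mirror the table above, not the shipped code.

```ts
// Assumed adapter contract: each adapter turns one raw harness export into
// candidate intake rows, and a shared check reports which gates still block.
interface RawHarnessExport {
  source: "provider" | "browser-agent" | "coding-agent" | "support-agent";
  payload: unknown; // one raw JSONL record or trace artifact reference
}

interface CandidateRow {
  taskId: string;
  promptPacketHash: string;
  provider: string;
  modelId: string;
  modelVersion: string;
  artifacts: { answerUri: string; toolTraceUri?: string; screenshotUri?: string };
  reviewer?: string;
  score?: number;
  failureCategory?: string;
  acceptanceState?: "accepted" | "partial" | "rejected";
  reviewerNote?: string;
}

interface HarnessAdapter {
  canHandle(exported: RawHarnessExport): boolean;
  toCandidateRows(exported: RawHarnessExport): CandidateRow[];
}

// Returns the names of gates that still block intake; an empty list means the
// row may enter /api/benchmark-run-intake or the CSV template.
function blockedGates(row: CandidateRow): string[] {
  const blocked: string[] = [];
  if (!row.taskId || !row.promptPacketHash) blocked.push("Immutable task packet");
  if (!row.provider || !row.modelId || !row.modelVersion) blocked.push("Exact system identity");
  if (!row.artifacts.answerUri) blocked.push("State and artifact proof");
  if (!row.reviewer || row.score === undefined || !row.reviewerNote) {
    blocked.push("Human scoring reason");
  }
  return blocked;
}
```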

Benchmark Run Intake Form

Capture real run packets before they affect rankings.

This form writes the same fields required by the generated run schema into a review queue. It is a practical bridge from notebooks, provider exports, browser traces, and coding-agent runs into the CSV normalization path.

Captured

Raw packet lands in gitignored NDJSON for review.

Not ranked

No run affects a leaderboard until artifacts and reviewer signoff exist.

Capture a real run packet before it enters CSV normalization or leaderboard scoring.
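
A minimal sketch of the capture step, assuming a Node handler that appends each submitted packet to a gitignored NDJSON review queue; the path and function name are hypothetical.

```ts
// Sketch of the capture step: one NDJSON line per submitted run packet.
// No score from this queue touches a leaderboard until review and signoff.
import { appendFile } from "node:fs/promises";

const REVIEW_QUEUE = "data/intake/benchmark-runs.ndjson"; // gitignored (assumption)

export async function captureRunPacket(packet: Record<string, unknown>): Promise<void> {
  const line = JSON.stringify({ ...packet, capturedAt: new Date().toISOString() });
  await appendFile(REVIEW_QUEUE, line + "\n", "utf8");
}
```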

Designing v0.1

Indian Enterprise Workflow Suite

69

A private-plus-public benchmark for support, finance, legal, sales, document, and multilingual back-office workflows common in Indian enterprises.

Generated run: 24 tasks (10 public / 14 private), 4 inspectable traces

Open run report
Sample tasks
GST invoice reconciliation
Hindi-English support escalation
Sales-call CRM update
Policy document retrieval
Scoring
Outcome correctness
Evidence citation
Escalation judgement
Cost per accepted output
Model class · Score · Recovery · Cost
Frontier reasoning model · 88 · 83 · 52
Fast mid-tier model · 76 · 66 · 81
Open-weight local model · 61 · 49 · 73
Small routing model · 52 · 36 · 92

Prototype

Coding Agent Maintenance Suite

56

Repository-level tasks for coding agents: reading a codebase, making scoped patches, running tests, inspecting screenshots, and avoiding unrelated churn.

Generated run: 20 tasks (8 public / 12 private), 4 inspectable traces

Open run report
Sample tasks
Bug reproduction
Patch planning
Test repair
Frontend visual QA
Scoring
Patch correctness
Regression rate
Tool discipline
Review readiness
Model class · Score · Recovery · Cost
Frontier reasoning model · 76 · 74 · 47
Fast mid-tier model · 62 · 58 · 79
Open-weight local model · 51 · 45 · 71
Small routing model · 34 · 28 · 94

Research

Browser Operations Suite

51

Browser-agent tasks for navigation, structured extraction, authenticated workflows, form filling, and UI-state verification under changing pages.

Generated run: 16 tasks (6 public / 10 private), 4 inspectable traces

Open run report
Sample tasks
Multi-step navigation
Form completion
Evidence extraction
Screenshot verification
Scoring
Task success
State verification
Recovery quality
Human handoff rate
Model class · Score · Recovery · Cost
Frontier reasoning model · 71 · 69 · 49
Fast mid-tier model · 60 · 55 · 82
Open-weight local model · 43 · 39 · 76
Small routing model · 31 · 25 · 95

Consulting

Support Agent Policy Suite

68

Customer-support simulations that test policy adherence, multilingual tone, escalation safety, refund/exception handling, and hallucination resistance.

Generated run: 20 tasks (8 public / 12 private), 4 inspectable traces

Open run report
Sample tasks
Refund decision
Policy lookup
Angry customer de-escalation
Regional-language handoff
Scoring
Resolution rate
Policy compliance
Tone control
Escalation precision
Model class · Score · Recovery · Cost
Frontier reasoning model · 86 · 82 · 51
Fast mid-tier model · 79 · 70 · 83
Open-weight local model · 58 · 47 · 74
Small routing model · 49 · 35 · 93

New v0.1

AI Security & Risk Suite

63

Agent and LLM-application security tasks for prompt injection, tool-permission boundaries, data exposure control, and risk escalation discipline.

Generated run: 16 tasks (6 public / 10 private), 4 inspectable traces

Open run report
Sample tasks
Prompt-injection triage
Tool approval boundary
Sensitive-data redaction
AI risk incident memo
Scoring
Attack recognition
Policy boundary
Data exposure control
Safe escalation
Model class · Score · Recovery · Cost
Frontier reasoning model · 82 · 78 · 50
Fast mid-tier model · 71 · 64 · 82
Open-weight local model · 54 · 44 · 75
Small routing model · 46 · 31 · 94

Research pipeline

From consulting project to public benchmark.

The strongest benchmarks will come from repeated client questions: which model, which agent, what risk, and what budget. The public site should publish the reusable evaluation pattern without exposing private client data.

Build order

1. Collect seed tasks from real workflows
2. Write gold answers and failure conditions
3. Run 3-5 model/provider baselines
4. Publish public sample and methodology
5. Keep holdout private and refresh monthly