Benchmark Lab

Evaluation suites for real AI buying decisions.

Leaderboards are the output. Benchmarks are the machine underneath: task design, gold answers, scoring rubrics, repeated runs, traces, and failure analysis.

Benchmark Evidence Readiness

Do not let scaffold traces masquerade as leaderboard proof.

Current benchmark pages are useful as scaffolds and buyer methodology examples. They should not be treated as final public leaderboard evidence until raw harness logs, provider response ids, screenshot/state proof, repeated seeds, and reviewer signoff are imported.

57/100

Average readiness

7

Evidence gates

5

Suites checked

Indian Enterprise Workflow Suite · 57/100 readiness · Not leaderboard-ready
Coding Agent Maintenance Suite · 57/100 readiness · Not leaderboard-ready
Browser Operations Suite · 57/100 readiness · Not leaderboard-ready
Support Agent Policy Suite · 57/100 readiness · Not leaderboard-ready
AI Security & Risk Suite · 57/100 readiness · Not leaderboard-ready

Gate · Weight · Proof required
Task design and split · 14 · Public/private split, task brief, expected output, scoring focus, leakage policy, and retirement rule exist for the suite.
Gold answer and rubric · 16 · Every scored task has expected evidence, rubric components, and reviewer-facing scorecard.
Raw run log · 16 · Every model run has raw harness logs, prompt packet id, run seed, exact model version, tool calls, retries, and answer excerpt.
Provider response identity · 12 · Every model run preserves provider, model identifier, model version, timestamp, and response id or equivalent audit handle.
Screenshot or state proof · 12 · Browser/app/document tasks preserve screenshot, DOM state, file diff, or comparable state proof for completion claims.
Repeat-run stability · 14 · Leaderboard-boundary results are repeated across multiple seeds and instability is reported before ranking.
Reviewer signoff · 16 · A named reviewer signs the scorecard, failure reason, and buyer recommendation after inspecting trace artifacts.
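
The seven gate weights above sum to 100, so a suite's readiness can be read as a weighted evidence total. The TypeScript sketch below illustrates that idea; the partial-credit completion model and all names are assumptions, not the published scoring code.

```ts
// Sketch only: readiness as a weighted sum over the seven evidence gates.
// Weights come from the table above (they sum to 100); the partial-credit
// completion model is an assumption used to show how a score like 57/100
// could arise even when no single gate is fully closed.
type GateId =
  | "taskDesignAndSplit"
  | "goldAnswerAndRubric"
  | "rawRunLog"
  | "providerResponseIdentity"
  | "screenshotOrStateProof"
  | "repeatRunStability"
  | "reviewerSignoff";

const GATE_WEIGHTS: Record<GateId, number> = {
  taskDesignAndSplit: 14,
  goldAnswerAndRubric: 16,
  rawRunLog: 16,
  providerResponseIdentity: 12,
  screenshotOrStateProof: 12,
  repeatRunStability: 14,
  reviewerSignoff: 16,
};

// completion[gate] is in [0, 1]: 1 = real evidence imported, 0 = scaffold only.
function readinessScore(completion: Partial<Record<GateId, number>>): number {
  return (Object.keys(GATE_WEIGHTS) as GateId[]).reduce(
    (total, gate) => total + GATE_WEIGHTS[gate] * (completion[gate] ?? 0),
    0,
  );
}

// Example: design, rubric, and signoff gates closed, run-log evidence half done.
readinessScore({
  taskDesignAndSplit: 1,
  goldAnswerAndRubric: 1,
  reviewerSignoff: 1,
  rawRunLog: 0.5,
}); // 54
```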

Indian Enterprise Workflow Suite

Blocking evidence still needed.

Raw run log · Scaffold only

Import raw notebook or harness logs through the benchmark intake kit before leaderboard claims depend on these rows.

Provider response identity · Missing real evidence

Capture exact provider model identifiers, timestamps, request ids, and response ids for every run.

Screenshot or state proof · Missing real evidence

Attach screenshots, DOM snapshots, diffs, or document outputs to each artifact bundle.

Coding Agent Maintenance Suite

Blocking evidence still needed.

Raw run log · Scaffold only

Import raw notebook or harness logs through the benchmark intake kit before leaderboard claims depend on these rows.

Provider response identity · Missing real evidence

Capture exact provider model identifiers, timestamps, request ids, and response ids for every run.

Screenshot or state proof · Missing real evidence

Attach screenshots, DOM snapshots, diffs, or document outputs to each artifact bundle.

Indian Workflow Dataset

v0.1 seed design with public samples and private holdouts.

The Indian Enterprise Workflow Benchmark v0.1 seed dataset defines the public samples and private holdouts. The task list expands the benchmark beyond four inspectable traces into a fuller dataset blueprint for future harness runs.

30

Tasks

18

Public

12

Holdout

Domain · Tasks · Public · Holdout
Finance · 4 · 2 · 2
Support · 4 · 2 · 2
Sales Ops · 4 · 2 · 2
Legal · 4 · 2 · 2
Procurement · 4 · 3 · 1
HR Ops · 4 · 3 · 1
Healthcare Admin · 3 · 2 · 1
Field Operations · 3 · 2 · 1
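
For reference, the same split expressed as data. This is an illustrative TypeScript sketch with hypothetical names, useful mainly as a sanity check that the per-domain counts add up to the 30 / 18 / 12 totals above.

```ts
// Illustrative only: the v0.1 seed split from the table above as a checked
// constant. Interface and variable names are hypothetical.
interface DomainSplit {
  domain: string;
  tasks: number;
  publicTasks: number;
  holdoutTasks: number;
}

const V01_SEED_SPLIT: DomainSplit[] = [
  { domain: "Finance", tasks: 4, publicTasks: 2, holdoutTasks: 2 },
  { domain: "Support", tasks: 4, publicTasks: 2, holdoutTasks: 2 },
  { domain: "Sales Ops", tasks: 4, publicTasks: 2, holdoutTasks: 2 },
  { domain: "Legal", tasks: 4, publicTasks: 2, holdoutTasks: 2 },
  { domain: "Procurement", tasks: 4, publicTasks: 3, holdoutTasks: 1 },
  { domain: "HR Ops", tasks: 4, publicTasks: 3, holdoutTasks: 1 },
  { domain: "Healthcare Admin", tasks: 3, publicTasks: 2, holdoutTasks: 1 },
  { domain: "Field Operations", tasks: 3, publicTasks: 2, holdoutTasks: 1 },
];

// Sanity check: 30 tasks total, 18 public, 12 holdout, and each row consistent.
const totals = V01_SEED_SPLIT.reduce(
  (acc, d) => {
    console.assert(d.publicTasks + d.holdoutTasks === d.tasks, d.domain);
    return {
      tasks: acc.tasks + d.tasks,
      publicTasks: acc.publicTasks + d.publicTasks,
      holdoutTasks: acc.holdoutTasks + d.holdoutTasks,
    };
  },
  { tasks: 0, publicTasks: 0, holdoutTasks: 0 },
);
console.assert(totals.tasks === 30 && totals.publicTasks === 18 && totals.holdoutTasks === 12);
```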

CSV trace importer

Benchmark pages now ingest notebook-style run rows.

The generator merges structured JSON task packets with CSV run rows, then publishes static suite and trace pages. This gives the team a practical bridge from spreadsheet reviews, notebooks, and real provider runs into the website.

JSON task packets

Task brief, expected output, rubric, evidence, and failure mode.

CSV run rows

One row per model run with score, cost units, latency, answer excerpt, tools, and reviewer note.

20 trace pages

Four inspectable traces per suite are now generated from mixed JSON and CSV inputs.
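
A minimal sketch of that merge step, assuming a csv-parse dependency and illustrative field names; the real generator and its schema may differ.

```ts
// Sketch: join CSV run rows to JSON task packets before emitting trace pages.
import { readFileSync } from "node:fs";
import { parse } from "csv-parse/sync";

interface TaskPacket {
  suiteSlug: string;
  taskId: string;
  brief: string;
  expectedOutput: string;
  rubric: string[];
  failureMode: string;
}

interface RunRow {
  suiteSlug: string;
  taskId: string;
  modelId: string;
  score: number;
  costUnits: number;
  latencyMs: number;
  answerExcerpt: string;
  reviewerNote: string;
}

// One trace page per (task, model run) pair; rows without a matching packet
// are dropped rather than published without task context.
function loadTracePages(packetPath: string, runsCsvPath: string) {
  const packets: TaskPacket[] = JSON.parse(readFileSync(packetPath, "utf8"));
  const rows: RunRow[] = parse(readFileSync(runsCsvPath, "utf8"), {
    columns: true,
    cast: true,
  });
  return rows.flatMap((row) => {
    const packet = packets.find(
      (p) => p.suiteSlug === row.suiteSlug && p.taskId === row.taskId,
    );
    return packet ? [{ packet, run: row }] : [];
  });
}
```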

Indian Workflow Gold Packets

Source packets, gold answers, reviewer notes, and scoring checklists.

The dataset now includes a reviewer-ready packet layer for every seed task. Public packets expose redacted methodology samples; holdout packets preserve private source summaries until replacement tasks exist.

30

Gold packets

18

Public

12

Holdout

Finance / public · GST Credit Note Reconciliation · Sanjay Prasad · public-redacted-sample
Cite the mismatch, calculate corrected payable, and draft a vendor note. The answer must cite invoice line, credit note, vendor email and avoid adding facts outside the source packet.

Finance / public · TDS Deduction Query · Sanjay Prasad · public-redacted-sample
Explain deduction basis and identify whether finance escalation is needed. The answer must cite TDS policy, payment ledger, invoice total and avoid adding facts outside the source packet.

Finance / holdout · Advance Payment Variance · Sanjay Prasad · private-holdout-summary
Identify variance and recommend release, hold, or review. The answer must cite contract clause, milestone note, payment status and avoid adding facts outside the source packet.

Finance / holdout · Quarterly Budget Exception · Sanjay Prasad · private-holdout-summary
Classify approval path and cite the missing approval if any. The answer must cite budget policy, email approval, expense category and avoid adding facts outside the source packet.

Support / holdout · Hindi English Refund Escalation · Saujas · private-holdout-summary
Reply safely and escalate only the payment reconciliation issue. The answer must cite refund policy, delivery timestamp, payment status and avoid adding facts outside the source packet.

Support / public · Regional Language Complaint Routing · Saujas · public-redacted-sample
Classify intent, preserve tone, and request only missing evidence. The answer must cite complaint text, order history, queue policy and avoid adding facts outside the source packet.
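
A hypothetical shape for one gold-packet entry, mirroring the fields shown above. The field names and the GST example are illustrative, not the canonical packet schema.

```ts
// Assumed gold-packet shape: domain, split, reviewer, packet visibility,
// and the scoring constraints the answer must satisfy.
interface GoldPacket {
  domain: string;
  split: "public" | "holdout";
  title: string;
  reviewer: string;
  packetKind: "public-redacted-sample" | "private-holdout-summary";
  instruction: string;          // what the answer must do
  requiredCitations: string[];  // evidence the answer must cite
  noNewFacts: true;             // answers may not add facts outside the packet
}

const gstCreditNote: GoldPacket = {
  domain: "Finance",
  split: "public",
  title: "GST Credit Note Reconciliation",
  reviewer: "Sanjay Prasad",
  packetKind: "public-redacted-sample",
  instruction: "Cite the mismatch, calculate corrected payable, and draft a vendor note.",
  requiredCitations: ["invoice line", "credit note", "vendor email"],
  noNewFacts: true,
};
```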

Benchmark Intake Kit

A landing zone for real model, agent, and notebook runs.

A concrete intake kit for replacing synthetic benchmark rows with real provider, notebook, browser-agent, and coding-agent run exports. The kit defines the fields, evidence gates, and review steps needed before a run can affect a leaderboard or buyer report.

16

Required fields

4

Evidence gates

1

Capture

Export raw model/provider/notebook runs with exact task ids and model ids before any scoring cleanup.

2

Normalize

Map the export into the CSV template and attach artifact URIs for prompts, outputs, traces, and screenshots.

3

Review

Score each run against the suite rubric, write a reviewer note, and flag missing evidence before publishing.

4

Generate

Run pnpm benchmarks:generate to merge accepted rows into static suite, trace, report, and artifact pages.

5

Verify

Run pnpm benchmarks:replay for representative traces and pnpm verify:site against the local Webpack dev server.

Packet identity

A benchmark result is unusable if the task, prompt, source packet, and model output cannot be tied to stable ids.

suiteSlug · taskId · promptPacketHash · inputArtifactUri · outputArtifactUri

Run reproducibility

Close leaderboard calls need repeated runs, exact model identifiers, seeds, and enough metadata to reproduce the route.

modelId · provider · runSeed · startedAt · latencyMs

State proof

Agent and browser tasks must prove completion through tool traces, screenshots, file artifacts, or terminal logs rather than claims.

toolTraceUri · screenshotUri · outputArtifactUri

Scoring audit

A score needs reviewer rationale so a buyer can inspect why a run was accepted, partially accepted, or rejected.

score · reviewer · reviewerNote
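
Taken together, the four field groups suggest a run-row shape like the sketch below. Types and optionality here are assumptions; the CSV template in the intake kit remains the source of truth.

```ts
// Assumed shape of one intake run row, grouped by the evidence gates above.
interface BenchmarkRunRow {
  // Packet identity
  suiteSlug: string;
  taskId: string;
  promptPacketHash: string;
  inputArtifactUri: string;
  outputArtifactUri: string;
  // Run reproducibility
  modelId: string;
  provider: string;
  runSeed: number;
  startedAt: string; // ISO-8601 timestamp
  latencyMs: number;
  // State proof (whichever artifacts the task type produces)
  toolTraceUri?: string;
  screenshotUri?: string;
  // Scoring audit
  score: number;
  reviewer: string;
  reviewerNote: string;
}
```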

Benchmark Harness Kit

Adapter contracts for turning real runs into reviewable evidence.

Adapter contracts and sample rows for moving real provider, browser-agent, coding-agent, and support-agent benchmark exports into the Edxperimental Labs intake pipeline. The kit closes the gap between raw harness output and the current intake form, CSV template, replay scaffold, and benchmark pages.

4

Adapters

4

Quality gates

5

Pipeline lanes

1

Export

Harness emits raw JSONL/trace artifacts before a reviewer touches the score.

2

Validate

Schema checks confirm task identity, model identity, prompt packet, artifacts, latency, and cost fields.

3

Review

Human reviewer assigns score, failure category, and acceptance state with a short rationale.

4

Intake

Accepted rows enter /api/benchmark-run-intake or the CSV template for static generation.

5

Publish

Generated benchmark pages expose aggregate scores while preserving artifact links and evidence caveats.

Gate · Proof · Blocks
Immutable task packet · The row references a task id, source packet, prompt packet hash, and expected-output rubric. · Prevents silently changing the task after a model has already run.
Exact system identity · The row records provider, model id, model version, agent version, route, and settings. · Prevents comparing vague brand names instead of reproducible systems.
State and artifact proof · The row links raw answer, tool trace, screenshot, diff, terminal log, or transcript as appropriate. · Prevents fluent completion claims from becoming benchmark evidence.
Human scoring reason · The row includes reviewer, score, failure category, acceptance state, and reviewer note. · Prevents a number from appearing without an auditable reason.
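
One way the adapter contract could look in TypeScript. Interface names, the gate checks, and the candidate-row shape are assumptions meant to mirror the table above, not the shipped code.

```ts
// Assumed adapter contract: each adapter turns one raw harness export into
// candidate intake rows, and a shared check reports which gates still block.
interface RawHarnessExport {
  source: "provider" | "browser-agent" | "coding-agent" | "support-agent";
  payload: unknown; // one raw JSONL record or trace artifact reference
}

interface CandidateRow {
  taskId: string;
  promptPacketHash: string;
  provider: string;
  modelId: string;
  modelVersion: string;
  artifacts: { answerUri: string; toolTraceUri?: string; screenshotUri?: string };
  reviewer?: string;
  score?: number;
  failureCategory?: string;
  acceptanceState?: "accepted" | "partial" | "rejected";
  reviewerNote?: string;
}

interface HarnessAdapter {
  canHandle(exported: RawHarnessExport): boolean;
  toCandidateRows(exported: RawHarnessExport): CandidateRow[];
}

// Returns the names of gates that still block intake; an empty list means the
// row may enter /api/benchmark-run-intake or the CSV template.
function blockedGates(row: CandidateRow): string[] {
  const blocked: string[] = [];
  if (!row.taskId || !row.promptPacketHash) blocked.push("Immutable task packet");
  if (!row.provider || !row.modelId || !row.modelVersion) blocked.push("Exact system identity");
  if (!row.artifacts.answerUri) blocked.push("State and artifact proof");
  if (!row.reviewer || row.score === undefined || !row.reviewerNote) {
    blocked.push("Human scoring reason");
  }
  return blocked;
}
```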

Benchmark Run Intake Form

Capture real run packets before they affect rankings.

This form writes the same fields required by the generated run schema into a review queue. It is a practical bridge from notebooks, provider exports, browser traces, and coding-agent runs into the CSV normalization path.

Captured

Raw packet lands in gitignored NDJSON for review.

Not ranked

No run affects a leaderboard until artifacts and reviewer signoff exist.

Capture a real run packet before it enters CSV normalization or leaderboard scoring.
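
A minimal sketch of the capture step, assuming a Node handler that appends each submitted packet to a gitignored NDJSON review queue; the path and function name are hypothetical.

```ts
// Sketch of the capture step: one NDJSON line per submitted run packet.
// No score from this queue touches a leaderboard until review and signoff.
import { appendFile } from "node:fs/promises";

const REVIEW_QUEUE = "data/intake/benchmark-runs.ndjson"; // gitignored (assumption)

export async function captureRunPacket(packet: Record<string, unknown>): Promise<void> {
  const line = JSON.stringify({ ...packet, capturedAt: new Date().toISOString() });
  await appendFile(REVIEW_QUEUE, line + "\n", "utf8");
}
```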

Designing v0.1

Indian Enterprise Workflow Suite

69

A private-plus-public benchmark for support, finance, legal, sales, document, and multilingual back-office workflows common in Indian enterprises.

Generated run: 24 tasks (10 public / 14 private), 4 inspectable traces

Open run report
Sample tasks
GST invoice reconciliation
Hindi-English support escalation
Sales-call CRM update
Policy document retrieval
Scoring
Outcome correctness
Evidence citation
Escalation judgement
Cost per accepted output
Model class · Score · Recovery · Cost
Frontier reasoning model · 88 · 83 · 52
Fast mid-tier model · 76 · 66 · 81
Open-weight local model · 61 · 49 · 73
Small routing model · 52 · 36 · 92

Prototype

Coding Agent Maintenance Suite

56

Repository-level tasks for coding agents: reading a codebase, making scoped patches, running tests, inspecting screenshots, and avoiding unrelated churn.

Generated run: 20 tasks (8 public / 12 private), 4 inspectable traces

Open run report
Sample tasks
Bug reproduction
Patch planning
Test repair
Frontend visual QA
Scoring
Patch correctness
Regression rate
Tool discipline
Review readiness
Model class · Score · Recovery · Cost
Frontier reasoning model · 76 · 74 · 47
Fast mid-tier model · 62 · 58 · 79
Open-weight local model · 51 · 45 · 71
Small routing model · 34 · 28 · 94

Research

Browser Operations Suite

51

Browser-agent tasks for navigation, structured extraction, authenticated workflows, form filling, and UI-state verification under changing pages.

Generated run: 16 tasks (6 public / 10 private), 4 inspectable traces

Open run report
Sample tasks
Multi-step navigation
Form completion
Evidence extraction
Screenshot verification
Scoring
Task success
State verification
Recovery quality
Human handoff rate
Model class · Score · Recovery · Cost
Frontier reasoning model · 71 · 69 · 49
Fast mid-tier model · 60 · 55 · 82
Open-weight local model · 43 · 39 · 76
Small routing model · 31 · 25 · 95

Consulting

Support Agent Policy Suite

68

Customer-support simulations that test policy adherence, multilingual tone, escalation safety, refund/exception handling, and hallucination resistance.

Generated run: 20 tasks (8 public / 12 private), 4 inspectable traces

Open run report
Sample tasks
Refund decision
Policy lookup
Angry customer de-escalation
Regional-language handoff
Scoring
Resolution rate
Policy compliance
Tone control
Escalation precision
Model class · Score · Recovery · Cost
Frontier reasoning model · 86 · 82 · 51
Fast mid-tier model · 79 · 70 · 83
Open-weight local model · 58 · 47 · 74
Small routing model · 49 · 35 · 93

New v0.1

AI Security & Risk Suite

63

Agent and LLM-application security tasks for prompt injection, tool-permission boundaries, data exposure control, and risk escalation discipline.

Generated run: 16 tasks (6 public / 10 private), 4 inspectable traces

Open run report
Sample tasks
Prompt-injection triage
Tool approval boundary
Sensitive-data redaction
AI risk incident memo
Scoring
Attack recognition
Policy boundary
Data exposure control
Safe escalation
Model class · Score · Recovery · Cost
Frontier reasoning model · 82 · 78 · 50
Fast mid-tier model · 71 · 64 · 82
Open-weight local model · 54 · 44 · 75
Small routing model · 46 · 31 · 94

Research pipeline

From consulting project to public benchmark.

The strongest benchmarks will come from repeated client questions: which model, which agent, what risk, and what budget. The public site should publish the reusable evaluation pattern without exposing private client data.

Build order

1. Collect seed tasks from real workflows
2. Write gold answers and failure conditions
3. Run 3-5 model/provider baselines
4. Publish public sample and methodology
5. Keep holdout private and refresh monthly