Benchmark Lab
Evaluation suites for real AI buying decisions.
Leaderboards are the output. Benchmarks are the machine underneath: task design, gold answers, scoring rubrics, repeated runs, traces, and failure analysis.
Benchmark Evidence Readiness
Do not let scaffold traces masquerade as leaderboard proof.
Current benchmark pages are useful as scaffolds and buyer methodology examples. They should not be treated as final public leaderboard evidence until raw harness logs, provider response ids, screenshot/state proof, repeated seeds, and reviewer signoff are imported.
57/100 average readiness · 7 evidence gates · 5 suites checked
Indian Enterprise Workflow Suite · Not leaderboard-ready
Coding Agent Maintenance Suite · Not leaderboard-ready
Browser Operations Suite · Not leaderboard-ready
Support Agent Policy Suite · Not leaderboard-ready
AI Security & Risk Suite · Not leaderboard-ready
Indian Enterprise Workflow Suite
Blocking evidence still needed.
Raw run log · Scaffold only
Import raw notebook or harness logs through the benchmark intake kit before leaderboard claims depend on these rows.
Provider response identity · Missing real evidence
Capture exact provider model identifiers, timestamps, request ids, and response ids for every run.
Screenshot or state proof · Missing real evidence
Attach screenshots, DOM snapshots, diffs, or document outputs to each artifact bundle.
Coding Agent Maintenance Suite
Blocking evidence still needed.
Raw run log · Scaffold only
Import raw notebook or harness logs through the benchmark intake kit before leaderboard claims depend on these rows.
Provider response identity · Missing real evidence
Capture exact provider model identifiers, timestamps, request ids, and response ids for every run.
Screenshot or state proof · Missing real evidence
Attach screenshots, DOM snapshots, diffs, or document outputs to each artifact bundle.
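Taken together, these blocking items sketch the shape of a per-run evidence bundle. A minimal TypeScript sketch, with illustrative field names rather than the intake kit's actual schema:

```ts
// Hypothetical shape of one run's evidence bundle; field names are
// illustrative, not the intake kit's actual schema.
interface EvidenceBundle {
  taskId: string;            // stable task id from the suite
  rawRunLogUri: string;      // unedited notebook or harness log
  provider: {
    modelId: string;         // exact provider model identifier
    requestId: string;       // provider request id
    responseId: string;      // provider response id
    timestamp: string;       // ISO-8601 time of the run
  };
  stateProofUris: string[];  // screenshots, DOM snapshots, diffs, document outputs
  reviewerSignoff?: string;  // reviewer id, present only after signoff
}

// A row only counts as leaderboard evidence once every item above is real.
function isLeaderboardEvidence(bundle: EvidenceBundle): boolean {
  return Boolean(
    bundle.rawRunLogUri &&
      bundle.provider.modelId &&
      bundle.provider.requestId &&
      bundle.provider.responseId &&
      bundle.stateProofUris.length > 0 &&
      bundle.reviewerSignoff
  );
}
```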
Indian Workflow Dataset
v0.1 seed design with public samples and private holdouts.
The Indian Enterprise Workflow Benchmark v0.1 seed dataset is designed around public samples and private holdouts. The task list expands the benchmark beyond its four inspectable traces into a fuller dataset blueprint for future harness runs.
30 tasks · 18 public · 12 holdout
Domain split:
Finance · 2 public · 2 holdout
Support · 2 public · 2 holdout
Sales Ops · 2 public · 2 holdout
Legal · 2 public · 2 holdout
Procurement · 3 public · 1 holdout
HR Ops · 3 public · 1 holdout
Healthcare Admin · 2 public · 1 holdout
Field Operations · 2 public · 1 holdout
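The same split can be expressed as a small config object for planning or validation; this is only a sketch, and the real dataset blueprint may store the split differently:

```ts
// v0.1 seed split per domain as [publicTasks, holdoutTasks].
const domainSplit: Record<string, [publicTasks: number, holdoutTasks: number]> = {
  "Finance": [2, 2],
  "Support": [2, 2],
  "Sales Ops": [2, 2],
  "Legal": [2, 2],
  "Procurement": [3, 1],
  "HR Ops": [3, 1],
  "Healthcare Admin": [2, 1],
  "Field Operations": [2, 1],
};

// Sanity check against the published totals: 18 public + 12 holdout = 30 tasks.
const totals = Object.values(domainSplit).reduce(
  (acc, [pub, hold]) => [acc[0] + pub, acc[1] + hold],
  [0, 0]
);
console.assert(totals[0] === 18 && totals[1] === 12);
```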
CSV trace importer
Benchmark pages now ingest notebook-style run rows.
The generator merges structured JSON task packets with CSV run rows, then publishes static suite and trace pages. This gives the team a practical bridge from spreadsheet reviews, notebooks, and real provider runs into the website. A sketch of both input shapes appears after this section.
JSON task packets
Task brief, expected output, rubric, evidence, and failure mode.
CSV run rows
One row per model run with score, cost units, latency, answer excerpt, tools, and reviewer note.
20 trace pages
Four inspectable traces per suite are now generated from mixed JSON and CSV inputs.
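A rough sketch of the two generator inputs, assuming TypeScript types and property names inferred from the field lists above (the real schema may differ):

```ts
// Property names are inferred from the field lists above and may not match
// the real packet or CSV schema.
interface TaskPacket {
  taskId: string;
  brief: string;            // task brief
  expectedOutput: string;   // gold answer
  rubric: string;           // scoring rubric
  evidence: string[];       // required evidence / artifact pointers
  failureMode: string;      // known failure mode to watch for
}

// One CSV row per model run, e.g.:
// task_id,model_id,score,cost_units,latency_ms,answer_excerpt,tools,reviewer_note
interface RunRow {
  taskId: string;
  modelId: string;
  score: number;
  costUnits: number;
  latencyMs: number;
  answerExcerpt: string;
  tools: string;
  reviewerNote: string;
}

// The generator joins run rows to task packets by task id before emitting
// static suite and trace pages.
function joinRuns(packets: TaskPacket[], rows: RunRow[]) {
  const byId = new Map(packets.map((p): [string, TaskPacket] => [p.taskId, p]));
  return rows.flatMap((run) => {
    const packet = byId.get(run.taskId);
    return packet ? [{ packet, run }] : [];
  });
}
```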
Indian Workflow Gold Packets
Source packets, gold answers, reviewer notes, and scoring checklists.
The dataset now includes a reviewer-ready packet layer for every seed task. Public packets expose redacted methodology samples; holdout packets preserve private source summaries until replacement tasks exist.
30 gold packets · 18 public · 12 holdout
Benchmark Intake Kit
A landing zone for real model, agent, and notebook runs.
A concrete intake kit for replacing synthetic benchmark rows with real provider, notebook, browser-agent, and coding-agent run exports. The kit defines the fields, evidence gates, and review steps needed before a run can affect a leaderboard or buyer report.
16 required fields · 4 evidence gates
Capture
Export raw model/provider/notebook runs with exact task ids and model ids before any scoring cleanup.
Normalize
Map the export into the CSV template and attach artifact URIs for prompts, outputs, traces, and screenshots.
Review
Score each run against the suite rubric, write a reviewer note, and flag missing evidence before publishing.
Generate
Run pnpm benchmarks:generate to merge accepted rows into static suite, trace, report, and artifact pages.
Verify
Run pnpm benchmarks:replay for representative traces and pnpm verify:site against the local Webpack dev server.
Packet identity
A benchmark result is unusable if the task, prompt, source packet, and model output cannot be tied to stable ids.
Run reproducibility
Close leaderboard calls need repeated runs, exact model identifiers, seeds, and enough metadata to reproduce the run.
State proof
Agent and browser tasks must prove completion through tool traces, screenshots, file artifacts, or terminal logs rather than claims.
Scoring audit
A score needs reviewer rationale so a buyer can inspect why a run was accepted, partially accepted, or rejected.
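One way to read the four gates is as a pre-generate check on every intake row. The sketch below assumes hypothetical field names and thresholds (for example, treating fewer than two repeats as a reproducibility failure); it is not the kit's actual API:

```ts
// Hypothetical review-time check mirroring the four gates; field names and
// thresholds are assumptions, not the intake kit's real API.
type Gate = "packet-identity" | "run-reproducibility" | "state-proof" | "scoring-audit";

interface IntakeRow {
  taskId: string;
  promptPacketId: string;
  modelId: string;
  seed?: number;
  repeatCount: number;         // repeated runs backing this row
  artifactUris: string[];      // tool traces, screenshots, files, terminal logs
  score: number;
  reviewerRationale?: string;  // why the run was accepted, partially accepted, or rejected
}

function failedGates(row: IntakeRow): Gate[] {
  const failures: Gate[] = [];
  if (!row.taskId || !row.promptPacketId || !row.modelId) failures.push("packet-identity");
  if (row.repeatCount < 2 || row.seed === undefined) failures.push("run-reproducibility");
  if (row.artifactUris.length === 0) failures.push("state-proof");
  if (!row.reviewerRationale) failures.push("scoring-audit");
  return failures;
}
```

Only rows with no failed gates would move on to pnpm benchmarks:generate.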
Benchmark Harness Kit
Adapter contracts for turning real runs into reviewable evidence.
Adapter contracts and sample rows for moving real provider, browser-agent, coding-agent, and support-agent benchmark exports into the Edxperimental Labs intake pipeline. The kit closes the gap between raw harness output and the current intake form, CSV template, replay scaffold, and benchmark pages.
4 adapters · 4 quality gates · 5 pipeline lanes
Export
Harness emits raw JSONL/trace artifacts before a reviewer touches the score.
Validate
Schema checks confirm task identity, model identity, prompt packet, artifacts, latency, and cost fields.
Review
Human reviewer assigns score, failure category, and acceptance state with a short rationale.
Intake
Accepted rows enter /api/benchmark-run-intake or the CSV template for static generation.
Publish
Generated benchmark pages expose aggregate scores while preserving artifact links and evidence caveats.
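The adapter contract could be as small as two methods, one per early pipeline lane. The interface below is illustrative; names and signatures are assumptions, not the harness kit's real types:

```ts
// Illustrative adapter contract; names and signatures are assumptions, not
// the harness kit's actual types.
interface RawHarnessRecord {
  jsonl: string;             // one raw JSONL line emitted by the harness
  artifactPaths: string[];   // trace files captured alongside the record
}

interface NormalizedRun {
  taskId: string;
  modelId: string;
  promptPacketId: string;
  latencyMs: number;
  costUnits: number;
  artifactUris: string[];
}

interface HarnessAdapter {
  /** e.g. "provider", "browser-agent", "coding-agent", "support-agent" */
  readonly name: string;
  /** Export lane: read raw records before a reviewer touches any score. */
  read(exportDir: string): Promise<RawHarnessRecord[]>;
  /** Validate lane: map a record to a normalized run, or say why it fails schema checks. */
  normalize(record: RawHarnessRecord):
    | { ok: true; run: NormalizedRun }
    | { ok: false; reason: string };
}
```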
Benchmark Run Intake Form
Capture real run packets before they affect rankings.
This form writes the same fields required by the generated run schema into a review queue. It is a practical bridge from notebooks, provider exports, browser traces, and coding-agent runs into the CSV normalization path.
Captured
Raw packet lands in gitignored NDJSON for review.
Not ranked
No run affects a leaderboard until artifacts and reviewer signoff exist.
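In code, the capture step amounts to appending one JSON object per line to a gitignored file. A minimal Node sketch, with a hypothetical file path:

```ts
// Minimal sketch of the capture step, assuming Node and a hypothetical
// gitignored path; the real intake endpoint's storage layout may differ.
import { appendFileSync } from "node:fs";

interface IntakePacket {
  taskId: string;
  modelId: string;
  score: number;
  artifactUris: string[];
  reviewerNote: string;
  submittedAt: string;       // ISO-8601 submission time
}

function captureIntakePacket(packet: IntakePacket): void {
  // One JSON object per line (NDJSON); the row cannot affect any leaderboard
  // until artifacts and reviewer signoff exist.
  appendFileSync("data/benchmark-run-intake.ndjson", JSON.stringify(packet) + "\n");
}
```

A captured packet stays in this review queue until it is normalized into the CSV template and signed off.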
Designing v0.1
Indian Enterprise Workflow Suite
A private-plus-public benchmark for support, finance, legal, sales, document, and multilingual back-office workflows common in Indian enterprises.
Generated run: 24 tasks (10 public / 14 private), 4 inspectable traces
Open run report
Prototype
Coding Agent Maintenance Suite
Repository-level tasks for coding agents: reading a codebase, making scoped patches, running tests, inspecting screenshots, and avoiding unrelated churn.
Generated run: 20 tasks (8 public / 12 private), 4 inspectable traces
Open run report
Research
Browser Operations Suite
Browser-agent tasks for navigation, structured extraction, authenticated workflows, form filling, and UI-state verification under changing pages.
Generated run: 16 tasks (6 public / 10 private), 4 inspectable traces
Open run report
Consulting
Support Agent Policy Suite
Customer-support simulations that test policy adherence, multilingual tone, escalation safety, refund/exception handling, and hallucination resistance.
Generated run: 20 tasks (8 public / 12 private), 4 inspectable traces
Open run report
New v0.1
AI Security & Risk Suite
Agent and LLM-application security tasks for prompt injection, tool-permission boundaries, data exposure control, and risk escalation discipline.
Generated run: 16 tasks (6 public / 10 private), 4 inspectable traces
Open run report
Research pipeline
From consulting project to public benchmark.
The strongest benchmarks will come from repeated client questions: which model, which agent, what risk, and what budget. The public site should publish the reusable evaluation pattern without exposing private client data.
Build order