Prototype

Coding Agent Maintenance Suite

Repository-level tasks for coding agents: reading a codebase, making scoped patches, running tests, inspecting screenshots, and avoiding unrelated churn.

Compare leaderboards Benchmark your workflow Open buyer report

Current leader

Frontier reasoning model

Usable for constrained workflows with fallback routing.

Score76

Pass rate78

Recovery74

Task mix

What the suite measures

Bug reproduction24%

Patch planning22%

Test repair28%

Frontend QA26%

Model classProviderScorePassRetryP95Reviewer note

Frontier reasoning modelFrontier API provider76789%5848msUsable for constrained workflows with fallback routing.

Fast mid-tier modelFast hosted API provider626514%4916msUsable for constrained workflows with fallback routing.

Open-weight local modelSelf-hosted/open-weight stack515418%6210msUse only for narrow routing, triage, or privacy-constrained baselines.

Small routing modelLow-cost routing endpoint343124%4774msUse only for narrow routing, triage, or privacy-constrained baselines.

Task trace evidence

Inspect representative benchmark runs

These trace packets show the task brief, expected evidence, model outcomes, cost units, latency, and reviewer notes behind the aggregate score.

Frontend / public / Medium

Fix Command-K search regression

Given a Next.js app where Command-K opens but result links do not navigate, patch the client component and verify keyboard and click behavior.

Top run: Frontier reasoning modelOpen trace

Build / holdout / Hard

Repair failing static build

Diagnose a Next.js static build failure caused by optional content fields and make the renderer type-safe without broad rewrites.

Top run: Frontier reasoning modelOpen trace

Frontend QA / public / Medium

Add Playwright smoke test

Add a bounded Playwright smoke test for Command-K search and verify it does not create runaway memory usage.

Top run: Frontier reasoning modelOpen trace

Backend / holdout / Hard

Refactor API error handling

Refactor a route handler so provider errors return typed client-safe messages while preserving server logs.

Top run: Frontier reasoning modelOpen trace

Scoring rubric

Patch correctness

Regression rate

Tool discipline

Review readiness

Run provenance

Generated at: 2026-05-16T00:00:00+05:30

Dataset version: 0.1.0

Trace ingest paths: data/benchmark-trace-input.json, data/benchmark-trace-runs.csv

Run date: 2026-05-16

Synthetic v0.1 benchmark dataset for website scaffolding. Aggregate suite rows are generated from script constants; task traces are ingested from data/benchmark-trace-input.json and data/benchmark-trace-runs.csv when present. Replace both with real model/provider runs as benchmark harnesses come online.

Leaderboard control metadata

Controls attached to this benchmark run

These fields make the suite auditable: the public/private split, freshness policy, leakage policy, repeat-run rule, retirement trigger, and provenance fields are generated with the benchmark data instead of being described only in prose.

Split

8 public / 12 private holdout tasks

Public share

40%

Holdout share

60%

Repeat rule

Repeat any result within five points of a leaderboard boundary across at least three seeds.

Freshness

Public sample refreshed monthly while private holdout stays sealed until replacement tasks exist.

Leakage policy

Do not use tasks sourced from public examples, vendor demos, or training-contaminated snippets without replacement variants.

Retirement rule

Retire a task when frontier and mid-tier models cluster near the ceiling or when source material becomes widely circulated.

Required provenance

traceId, createdAt, split, source, modelVersion, runSeed, reviewerNote, retirementStatus

Next data step

Replace this synthetic v0.1 run with real provider traces.

The page is wired to generated data already, including JSON task packets and CSV trace rows. The next engineering task is to point the importer at actual benchmark harness exports with model name, provider, settings, latency samples, retries, tool traces, and reviewer notes.

Read methodology