
Coding Agent Arena

A coding-agent evaluation track for repository edits, bug fixes, browser checks, terminal usage, and regression discipline.

Live Studio demo

Coding arena console

Score coding agents by reviewable work: narrow diagnosis, patch hygiene, command evidence, browser checks, and failure honesty.

Frontend · Medium · Public

Fix Command-K search regression

Given a Next.js app where Command-K opens but result links do not navigate, patch the client component and verify keyboard and click behavior.
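A hedged sketch of the kind of patch this task expects: the regression is typically that result rows highlight on selection but never route. Routing the selection through a router-like interface and closing the palette fixes both the click and Enter-key paths. `Router`, `CommandPalette`, and `onResultSelect` are illustrative names for this sketch, not identifiers from any real task repository.

```typescript
// Illustrative fix sketch: route the selection and close the palette.
interface Router {
  push(href: string): void; // stand-in for a Next.js-style router
}

interface CommandPalette {
  open: boolean;
}

// One shared handler so keyboard (Enter) and click selection take the same path.
function onResultSelect(router: Router, palette: CommandPalette, href: string): void {
  router.push(href);    // actually navigate instead of only highlighting the row
  palette.open = false; // close the palette after navigation
}

// Minimal check with a fake router that records navigations.
const pushed: string[] = [];
const fakeRouter: Router = { push: (href) => pushed.push(href) };
const palette: CommandPalette = { open: true };
onResultSelect(fakeRouter, palette, "/docs/search");
```

Funneling both input paths through one handler is also what makes the later keyboard-coverage check cheap to verify.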

Merge readiness

86%

Combines patch correctness, scoped diff behavior, and accepted proof.

Regression guard

83%

Rewards build, lint, test, and browser verification evidence.

Tool discipline

60%

Checks whether the agent read, patched, and verified in the right order.

Review load

49%

Lower is better; this estimates the human cleanup required.
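The four surface metrics above can be read as one composite signal. The weighting below is an assumption for this sketch, not the arena's actual formula; review load is inverted since lower is better.

```typescript
// Illustrative composite of the four console metrics. Weights are assumed.
interface ArenaMetrics {
  mergeReadiness: number;  // 0..1
  regressionGuard: number; // 0..1
  toolDiscipline: number;  // 0..1
  reviewLoad: number;      // 0..1, lower is better
}

function compositeScore(m: ArenaMetrics): number {
  const weighted =
    0.4 * m.mergeReadiness +
    0.3 * m.regressionGuard +
    0.2 * m.toolDiscipline +
    0.1 * (1 - m.reviewLoad); // invert: low review load is good
  return Math.round(weighted * 100);
}

// Using the demo values shown above: 86%, 83%, 60%, 49%.
const demo = compositeScore({
  mergeReadiness: 0.86,
  regressionGuard: 0.83,
  toolDiscipline: 0.6,
  reviewLoad: 0.49,
});
```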

Agent run ranking

Which patch should a reviewer open?

Frontier reasoning model

78/100

Accepted

Patched navigation, preserved focus behavior, and verified with Playwright.

Fast mid-tier model

66/100

Accepted with review

Fixed click path but needed reviewer prompt to add keyboard coverage.

Open-weight local model

52/100

Partial

Found the component but changed unrelated styles and skipped verification.

Small routing model

29/100

Rejected

Suggested a fix without reading the component boundary.

Acceptance evidence

patched client component · route navigation proof · no console warnings

A scoped patch that preserves keyboard open behavior, makes result navigation work, closes the palette after navigation, and verifies the flow with Playwright.

Arena verdict

Merge-ready only when the patch, terminal checks, and browser proof agree.

Best excerpt: Updated the result link path and verified Cmd-K, click navigation, URL change, and absence of console warnings.

read component · patch client navigation · run Playwright verification
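The verdict rule stated above reduces to a conjunction over the three evidence channels. A minimal sketch, with field names assumed for illustration:

```typescript
// Sketch of the arena verdict: merge-ready only when all evidence agrees.
interface RunEvidence {
  patchApplies: boolean;   // scoped patch builds cleanly
  terminalChecks: boolean; // build, lint, and tests pass in the terminal
  browserProof: boolean;   // browser flow verified (e.g. via Playwright)
}

function isMergeReady(e: RunEvidence): boolean {
  return e.patchApplies && e.terminalChecks && e.browserProof;
}
```

The point of the conjunction is that a passing test suite alone, or a clean diff alone, never produces a merge-ready verdict.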

Known failure mode

Agents often fix click navigation but forget the keyboard shortcut path, or leave focus trapped in the palette after navigation.
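The keyboard half of this failure mode is easy to regression-test in isolation. A framework-free sketch of the Cmd-K/Ctrl-K toggle, with `PaletteState` and the event shape as illustrative assumptions:

```typescript
// Sketch of the shortcut path agents tend to drop when fixing clicks.
interface KeyLike {
  key: string;
  metaKey: boolean; // Cmd on macOS
  ctrlKey: boolean; // Ctrl elsewhere
  preventDefault(): void;
}

interface PaletteState {
  open: boolean;
}

function handleCommandK(state: PaletteState, e: KeyLike): void {
  if ((e.metaKey || e.ctrlKey) && e.key.toLowerCase() === "k") {
    e.preventDefault(); // keep the browser's own Cmd-K binding out of the way
    state.open = !state.open;
  }
}

// Minimal check with a fake event: Cmd-K opens, a plain "k" does nothing.
const state: PaletteState = { open: false };
handleCommandK(state, { key: "k", metaKey: true, ctrlKey: false, preventDefault: () => {} });
handleCommandK(state, { key: "k", metaKey: false, ctrlKey: false, preventDefault: () => {} });
```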

How it works

Each Studio surface is designed as a practical operating loop: capture the buyer problem, run measured evidence, and return a decision artifact that can be acted on.

Current demo state

The live coding arena is connected to the Coding Agent Maintenance Suite; the next step is importing real agent patches, logs, and review artifacts.

1. Create issue-style tasks with starter repository state, acceptance criteria, and forbidden unrelated changes.

2. Run agents through code reading, implementation, tests, lint, browser verification, and final review notes.

3. Score correctness, regression risk, repository discipline, and whether the submitted patch is actually reviewable.

4. Compare agents by merge readiness instead of raw lines changed or benchmark pass rate alone.
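Step 1 implies a concrete task artifact. One plausible shape for it, sketched below; every field name is an assumption for illustration, not a published schema:

```typescript
// Hypothetical issue-style task record for the arena's step 1.
interface ArenaTask {
  id: string;
  title: string;
  difficulty: "easy" | "medium" | "hard";
  acceptanceCriteria: string[];
  forbiddenChanges: string[]; // unrelated edits that fail the run outright
}

const cmdKTask: ArenaTask = {
  id: "frontend-cmdk-001", // illustrative id
  title: "Fix Command-K search regression",
  difficulty: "medium",
  acceptanceCriteria: [
    "Cmd-K still opens the palette",
    "Result links navigate on click and Enter",
    "Palette closes after navigation",
    "No console warnings during the flow",
  ],
  forbiddenChanges: ["unrelated style edits", "dependency upgrades"],
};
```

Encoding forbidden changes in the task itself is what lets step 3 score repository discipline mechanically rather than by reviewer judgment alone.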

Buyer questions

These are the questions the product needs to answer before someone deploys, buys, or scales the system.

Can this agent work inside our existing codebase?

Does it respect ownership boundaries and avoid unrelated churn?

Can it debug failing tests without hiding the failure?

What tasks are safe to delegate today?

Deliverables

What a buyer gets

Patch review
Regression report
Tool-use transcript
Merge-readiness score

Studio packet

Buyer-ready demo packet.

This generated packet gives Sanjay and Saujas a consistent follow-up artifact for demos, consulting calls, and product conversations.

Next build step

Turn this Studio surface from a populated product brief into a live demo by wiring real run data, screenshots, and client-approved examples into the same page.