A coding-agent evaluation track for repository edits, bug fixes, browser checks, terminal usage, and regression discipline.
Live Studio demo
Score coding agents by reviewable work: narrow diagnosis, patch hygiene, command evidence, browser checks, and failure honesty.
Given a Next.js app where the Command-K palette opens but its result links do not navigate, patch the client component and verify both keyboard and click behavior.
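The fix the task asks for can be sketched as pure state logic. Everything below is illustrative, not code from the demo repository: PaletteState, PaletteEvent, and paletteReducer are hypothetical names, and the real component would hand the resulting route to router.push from next/navigation.

```typescript
// Hypothetical sketch of the palette state logic behind the fix.
// These names are illustrative, not identifiers from the demo repo.
type PaletteState = { open: boolean; route: string | null };

type PaletteEvent =
  | { kind: "keydown"; key: string; metaKey: boolean }
  | { kind: "select"; href: string };

function paletteReducer(state: PaletteState, event: PaletteEvent): PaletteState {
  if (event.kind === "keydown") {
    // Command-K toggles the palette; every other key leaves state unchanged.
    if (event.metaKey && event.key.toLowerCase() === "k") {
      return { ...state, open: !state.open };
    }
    return state;
  }
  // Selecting a result must both navigate and close the palette --
  // the close-after-navigate step is what the buggy component was missing.
  return { open: false, route: event.href };
}
```

Modeling navigation as returned data, rather than calling the router inside the reducer, is what keeps the open/close and navigation rules testable without a browser.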
Merge readiness
86%
Combines patch correctness, scoped diff behavior, and accepted proof.
Regression guard
83%
Rewards build, lint, test, and browser verification evidence.
Tool discipline
60%
Checks whether the agent read, patched, and verified in the right order.
Review load
49%
Lower is better; this estimates the human cleanup required.
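One way to read the four metrics above is as a single composite. The weights below are an assumption for illustration; the actual Studio formula is not stated on this page.

```typescript
// Hypothetical weighting -- the real Studio formula is not published here.
type Metrics = {
  mergeReadiness: number;  // 0-100, higher is better
  regressionGuard: number; // 0-100, higher is better
  toolDiscipline: number;  // 0-100, higher is better
  reviewLoad: number;      // 0-100, LOWER is better
};

// Invert review load so every term points the same direction,
// then take a weighted average (weights are illustrative).
function compositeScore(m: Metrics): number {
  const weighted =
    0.4 * m.mergeReadiness +
    0.3 * m.regressionGuard +
    0.2 * m.toolDiscipline +
    0.1 * (100 - m.reviewLoad);
  return Math.round(weighted);
}
```

Inverting review load first keeps the blend monotonic: improving any metric, including reducing cleanup work, raises the composite.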
Agent run ranking
Frontier reasoning model
78/100
Accepted
Patched navigation, preserved focus behavior, and verified with Playwright.
Fast mid-tier model
66/100
Accepted with review
Fixed the click path but needed a reviewer prompt to add keyboard coverage.
Open-weight local model
52/100
Partial
Found the component but changed unrelated styles and skipped verification.
Small routing model
29/100
Rejected
Suggested a fix without reading the component boundary.
Acceptance evidence
A scoped patch that preserves keyboard open behavior, makes result navigation work, closes the palette after navigation, and verifies the flow with Playwright.
Arena verdict
Merge-ready only when the patch, terminal checks, and browser proof agree.
Best excerpt: Updated the result link path and verified Command-K open, click navigation, the URL change, and the absence of console warnings.
Known failure mode
Agents often fix click navigation but forget the keyboard shortcut path or leave focus trapped after navigation.
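That failure mode can be caught mechanically. The sketch below uses hypothetical evidence labels (not the arena's real schema) and flags a run whose click path is verified but whose keyboard or focus behavior is not.

```typescript
// Illustrative evidence labels; these are assumptions, not the
// arena's real verification schema.
type EvidenceLabel = "click-nav" | "keyboard-nav" | "focus-restored" | "url-change";

function flagPartialFix(evidence: EvidenceLabel[]): string[] {
  const gaps: string[] = [];
  // The common failure: click navigation works, but the keyboard
  // path is never re-verified after the patch.
  if (evidence.includes("click-nav") && !evidence.includes("keyboard-nav")) {
    gaps.push("keyboard navigation unverified");
  }
  // Focus left trapped after navigation is a silent regression.
  if (!evidence.includes("focus-restored")) {
    gaps.push("focus restoration unverified");
  }
  return gaps;
}
```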
Each Studio surface is designed as a practical operating loop: capture the buyer problem, gather measured evidence, and return a decision artifact that can be acted on.
Current demo state
The live coding arena is connected to the Coding Agent Maintenance Suite; the next step is importing real agent patches, logs, and review artifacts.
Create issue-style tasks with starter repository state, acceptance criteria, and forbidden unrelated changes.
Run agents through code reading, implementation, tests, lint, browser verification, and final review notes.
Score correctness, regression risk, repository discipline, and whether the submitted patch is actually reviewable.
Compare agents by merge readiness instead of raw lines changed or benchmark pass rate alone.
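The loop above can be reduced to a decision record. The RunReport shape and verdict rule below are assumptions for illustration, mirroring the arena verdict that a run is merge-ready only when the patch, terminal checks, and browser proof agree.

```typescript
// Hypothetical run report; field names are illustrative, not the
// suite's real schema.
type RunReport = {
  patchApplies: boolean;       // scoped patch applies cleanly
  terminalChecksPass: boolean; // build, lint, and tests all green
  browserProofPass: boolean;   // browser flow (e.g. Playwright) verified
  unrelatedChanges: number;    // files touched outside the task scope
};

type Verdict = "merge-ready" | "needs-review" | "rejected";

function verdict(r: RunReport): Verdict {
  if (!r.patchApplies) return "rejected";
  // Merge-ready only when all three evidence channels agree and the
  // diff stays inside the task boundary.
  if (r.terminalChecksPass && r.browserProofPass && r.unrelatedChanges === 0) {
    return "merge-ready";
  }
  return "needs-review";
}
```

The strict unrelatedChanges === 0 check encodes "forbidden unrelated changes" as a hard gate rather than a score deduction, which matches comparing agents by merge readiness instead of lines changed.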
These are the questions the product needs to answer before someone deploys, buys, or scales the system.
Can this agent work inside our existing codebase?
Does it respect ownership boundaries and avoid unrelated churn?
Can it debug failing tests without hiding the failure?
What tasks are safe to delegate today?
Deliverables
Connected evidence
Studio packet
This generated packet gives Sanjay and Saujas a consistent follow-up artifact for demos, consulting calls, and product conversations.
Turn this Studio surface from a populated product brief into a live demo by wiring real run data, screenshots, and client-approved examples into the same page.