Frontend / public trace / Medium

Fix Command-K search regression

Given a Next.js app where Command-K opens but result links do not navigate, patch the client component and verify keyboard and click behavior.

Expected evidence

patched client component

route navigation proof

no console warnings

Scoring focus

patch correctness

test discipline

review readiness

Common failure mode

Agents often fix click navigation but forget the keyboard shortcut, or leave focus trapped in the palette after navigation.

Expected output

A scoped patch that preserves keyboard open behavior, makes result navigation work, closes the palette after navigation, and verifies the flow with Playwright.
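The behavior described above can be modeled as a small pure state transition, which makes the three requirements (shortcut opens, selection navigates, palette closes) easy to check before wiring them into a component. This is a minimal sketch, not the benchmark's actual component: `PaletteState`, `PaletteEvent`, `reducePalette`, and the `/docs` route are hypothetical names, and accepting Ctrl-K alongside Command-K is an assumption.

```typescript
// Hypothetical model of the palette; none of these names come from the benchmark repo.
type PaletteState = { open: boolean; path: string };

type PaletteEvent =
  | { kind: "keydown"; key: string; metaKey: boolean; ctrlKey: boolean }
  | { kind: "select"; href: string };

// Pure transition function: returns the next state for a given event.
function reducePalette(state: PaletteState, event: PaletteEvent): PaletteState {
  switch (event.kind) {
    case "keydown": {
      // Assumption: Ctrl-K is accepted as well as Command-K.
      const isShortcut =
        (event.metaKey || event.ctrlKey) && event.key.toLowerCase() === "k";
      if (isShortcut) return { open: true, path: state.path };
      if (event.key === "Escape") return { open: false, path: state.path };
      return state;
    }
    case "select":
      // Navigate and close in the same step so focus is not left inside the palette.
      return { open: false, path: event.href };
  }
}

// Replay the flow the task asks for: open with the shortcut, then pick a result.
const paletteEvents: PaletteEvent[] = [
  { kind: "keydown", key: "k", metaKey: true, ctrlKey: false },
  { kind: "select", href: "/docs" },
];
const initialPalette: PaletteState = { open: false, path: "/" };
const finalPalette = paletteEvents.reduce(reducePalette, initialPalette);
```

In a real patch the `select` branch would correspond to calling the router and the palette's close handler together; a Playwright check would then assert the URL change and that the dialog is no longer visible.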

Score breakdown

Patch correctness: 40

Keyboard behavior: 20

Verification: 25

Diff hygiene: 15

Trace provenance

Can this public trace be audited later?

Trace id: trace-coding-agent-maintenance-suite-fix-command-k-search-regression

Created: 2026-05-12

Last reviewed: 2026-05-16

Source: data/benchmark-trace-input.json

Leakage risk: Medium. A public sample can become saturated after publication.

Retirement status: Active public sample; review after monthly refresh.

Score calculation ledger

How the top score is allocated

The score is the sum of the points earned on each weighted rubric component. Points earned per component are allocated from the aggregate run score; for the top run, 31 + 16 + 19 + 12 = 78.

Patch correctness: 31/40

Keyboard behavior: 16/20

Verification: 19/25

Diff hygiene: 12/15
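The page does not state how the aggregate score is split into whole points per component. One rule that reproduces the ledger above is largest-remainder allocation: give each component its weight-proportional share rounded down, then hand the leftover points to the components with the largest remainders. The sketch below assumes that rule; `allocate` and the `RubricItem` type are hypothetical names. Plain rounding would not match, since 78 × 25 / 100 = 19.5 rounds to 20 and the components would then sum to 79.

```typescript
// Hypothetical shape for one rubric line; weights are taken from the score breakdown.
type RubricItem = { name: string; weight: number };

const rubric: RubricItem[] = [
  { name: "Patch correctness", weight: 40 },
  { name: "Keyboard behavior", weight: 20 },
  { name: "Verification", weight: 25 },
  { name: "Diff hygiene", weight: 15 },
];

// Assumed rule (not stated in the source): largest-remainder allocation of an
// integer aggregate score across components in proportion to their weights.
function allocate(
  aggregate: number,
  items: RubricItem[]
): { name: string; weight: number; earned: number }[] {
  const totalWeight = items.reduce((sum, item) => sum + item.weight, 0);
  // Integer arithmetic keeps the remainders exact (no floating-point drift).
  const shares = items.map((item) => ({
    name: item.name,
    weight: item.weight,
    earned: Math.floor((aggregate * item.weight) / totalWeight),
    remainder: (aggregate * item.weight) % totalWeight,
  }));
  let leftover = aggregate - shares.reduce((sum, s) => sum + s.earned, 0);
  // Visit components from largest to smallest remainder, one extra point each.
  const byRemainder = shares.slice().sort((a, b) => b.remainder - a.remainder);
  for (const share of byRemainder) {
    if (leftover === 0) break;
    share.earned += 1; // same object as in `shares`, so the result is updated
    leftover -= 1;
  }
  return shares.map((s) => ({ name: s.name, weight: s.weight, earned: s.earned }));
}

const ledger = allocate(78, rubric);
for (const line of ledger) {
  console.log(`${line.name}: ${line.earned}/${line.weight}`);
}
```

With the top run's aggregate of 78 this yields 31, 16, 19, and 12, matching the ledger, which is consistent with the assumed rule but does not confirm it is what the harness uses.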

Model version

frontier-reasoning-eval-public-2026-05

Run seed

2026051620

Prompt packet

fix-command-k-search-regression-public-packet-v0.1

Artifact bundle

Replay files for this trace

Replay scaffold generated from the current seed trace. Replace with real harness exports when model runs are available.

Input payload, Raw run log, Scorecard

Replay command

pnpm benchmarks:replay --suite coding-agent-maintenance-suite --task fix-command-k-search-regression

This command is intentionally documented before the real harness exists so the artifact contract is visible.

Payload preview

Split

public

Difficulty

Medium

Evidence fields

Model runs

Screenshot

Pending real browser or app screenshot artifact.

Model run evidence

Trace-level comparison

This is the inspection layer that keeps benchmark scores honest: each model class gets an outcome, cost proxy, latency, and reviewer note.

Frontier API provider

Frontier reasoning model

Accepted

Score: 78

Cost units: 5

Latency: 7060 ms

Patched navigation, preserved focus behavior, and verified with Playwright.

Answer excerpt

Updated the result link path and verified Cmd-K, click navigation, URL change, and absence of console warnings.

Failure reason

No major issue.

read component, patch client navigation, run Playwright verification

Fast hosted API provider

Fast mid-tier model

Accepted with review

Score: 66

Cost units: 2.2

Latency: 4020 ms

Fixed the click path but needed a reviewer prompt to add keyboard coverage.

Answer excerpt

Fixed result click navigation and confirmed the target route loaded.

Failure reason

Initial patch skipped keyboard coverage.

read component, patch link behavior

Self-hosted/open-weight stack

Open-weight local model

Partial

Score: 52

Cost units: 1.5

Latency: 6390 ms

Found the component but changed unrelated styles and skipped verification.

Answer excerpt

Changed the search component and adjusted palette styles.

Failure reason

Unrelated style churn and no browser verification.

read component

Low-cost routing endpoint

Small routing model

Rejected

Score: 29

Cost units: 0.4

Latency: 2140 ms

Suggested a fix without reading the component boundary.

Answer excerpt

Try replacing the search button with a link.

Failure reason

Suggested a fix without reading the component boundary.

classify issue

Why this trace matters

Aggregate scores are useful only when reviewers can inspect the task packet, expected evidence, and the exact failure mode. This page is the pattern for publishing public samples while keeping harder holdout tasks private.
