Frontier API provider
Frontier reasoning model
Frontend QA / public trace / Medium
Add a bounded Playwright smoke test for Command-K search and verify it does not create runaway memory usage.
Expected evidence
Scoring focus
Common failure mode
Weak agents add broad end-to-end tests that duplicate the verifier and slow local runs.
Expected output
A scoped test that opens search, types a query, navigates to a result, and exits cleanly.
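The four steps of the expected output can be sketched as a single function. This is a minimal illustration, not the benchmark's reference solution: the app URL, query, and result route are hypothetical placeholders, and the two interfaces cover only the small subset of Playwright's `BrowserContext` and `Page` that the sketch calls, so that the flow can also be exercised with a stub. It assumes a Playwright version that supports the `ControlOrMeta` key modifier.

```typescript
// Structural types covering only the Playwright calls used below.
// Real Playwright BrowserContext and Page objects are expected to satisfy them.
interface SmokePage {
  goto(url: string): Promise<unknown>;
  keyboard: {
    press(key: string): Promise<void>;
    type(text: string): Promise<void>;
  };
  waitForURL(url: RegExp): Promise<void>;
}

interface SmokeContext {
  newPage(): Promise<SmokePage>;
  close(): Promise<void>;
}

// Hypothetical values: replace with the real app URL, query, and result route.
const APP_URL = "http://localhost:3000/";
const QUERY = "benchmarks";
const RESULT_URL = /\/benchmarks/;

async function runCommandKSmoke(context: SmokeContext): Promise<void> {
  try {
    const page = await context.newPage();
    await page.goto(APP_URL);
    await page.keyboard.press("ControlOrMeta+K"); // open search
    await page.keyboard.type(QUERY);              // type a query
    await page.keyboard.press("Enter");           // navigate to the first result
    await page.waitForURL(RESULT_URL);            // confirm the navigation happened
  } finally {
    // Close the context even when a step throws, so no browser state is left open.
    await context.close();
  }
}
```

Inside a Playwright test, the function could be called as `await runCommandKSmoke(await browser.newContext())`. Keeping the teardown in `finally` is what addresses the "left the browser open" failure noted in the run evidence.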
Score breakdown
Trace provenance
Score calculation ledger
The score equals the sum of the weighted rubric components. Each component's earned points are allocated from the aggregate run score.
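One way to read the ledger rule is as a proportional split: each component receives a share of the aggregate score equal to its weight divided by the total weight, so the earned points sum back to the aggregate. The sketch below assumes that interpretation. The component names, weights, and the aggregate of 80 are hypothetical and do not come from a real run.

```typescript
interface RubricComponent {
  name: string;
  weight: number;
}

interface LedgerRow extends RubricComponent {
  earned: number;
}

// Allocate an aggregate run score across components in proportion to weight.
function allocate(aggregate: number, components: RubricComponent[]): LedgerRow[] {
  const totalWeight = components.reduce((sum, c) => sum + c.weight, 0);
  if (totalWeight <= 0) {
    throw new Error("weights must sum to a positive value");
  }
  return components.map((c) => ({
    ...c,
    earned: (aggregate * c.weight) / totalWeight,
  }));
}

// The score is the sum of the earned points across all rows.
function totalScore(rows: LedgerRow[]): number {
  return rows.reduce((sum, r) => sum + r.earned, 0);
}

// Hypothetical rubric and aggregate, for illustration only.
const ledger = allocate(80, [
  { name: "scoped test", weight: 50 },
  { name: "clean teardown", weight: 30 },
  { name: "verifier rerun", weight: 20 },
]);

console.log(ledger.map((r) => `${r.name}: ${r.earned}`)); // 40, 24, 16
console.log(totalScore(ledger)); // 80
```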
Model version
frontier-reasoning-eval-public-2026-05
Run seed
2026051700
Prompt packet
add-playwright-smoke-test-public-packet-v0.1
Artifact bundle
Replay scaffold generated from the current seed trace. Replace with real harness exports when model runs are available.
Replay command
pnpm benchmarks:replay --suite coding-agent-maintenance-suite --task add-playwright-smoke-test
This command is intentionally documented before the real harness exists so the artifact contract is visible.
Payload preview
Split
public
Difficulty
Medium
Evidence fields
3
Model runs
4
Screenshot
Pending real browser or app screenshot artifact.
Model run evidence
This is the inspection layer that keeps benchmark scores honest: each model class gets an outcome, cost proxy, latency, and reviewer note.
Frontier API provider
Accepted
Good scoped test and clean browser teardown.
Answer excerpt
Added one targeted browser test with context cleanup and reran the local verifier.
Failure reason
No major issue.
Fast hosted API provider
Accepted with review
Useful test; cleanup needed.
Answer excerpt
Added a search navigation check but initially left the browser open.
Failure reason
Reviewer requested explicit cleanup.
Self-hosted/open-weight stack
Partial
Understood intent but did not automate the smoke path.
Answer excerpt
Added a manual test note instead of an executable check.
Failure reason
No executable browser proof.
Low-cost routing endpoint
Rejected
Too shallow for a coding-agent benchmark.
Answer excerpt
Search should be tested.
Failure reason
No patch.
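The four runs above could be captured in a typed record for export. This is a sketch of one possible shape, not the harness's actual schema: all field names are assumptions, and `costProxy` and `latencyMs` are left optional and unset because this page does not show their values.

```typescript
type Outcome = "Accepted" | "Accepted with review" | "Partial" | "Rejected";

// Hypothetical record shape; field names are illustrative.
interface ModelRunEvidence {
  modelClass: string;
  outcome: Outcome;
  reviewerNote: string;
  answerExcerpt: string;
  failureReason: string;
  costProxy?: number; // not shown on this page
  latencyMs?: number; // not shown on this page
}

const runs: ModelRunEvidence[] = [
  {
    modelClass: "Frontier API provider",
    outcome: "Accepted",
    reviewerNote: "Good scoped test and clean browser teardown.",
    answerExcerpt:
      "Added one targeted browser test with context cleanup and reran the local verifier.",
    failureReason: "No major issue.",
  },
  {
    modelClass: "Fast hosted API provider",
    outcome: "Accepted with review",
    reviewerNote: "Useful test; cleanup needed.",
    answerExcerpt: "Added a search navigation check but initially left the browser open.",
    failureReason: "Reviewer requested explicit cleanup.",
  },
  {
    modelClass: "Self-hosted/open-weight stack",
    outcome: "Partial",
    reviewerNote: "Understood intent but did not automate the smoke path.",
    answerExcerpt: "Added a manual test note instead of an executable check.",
    failureReason: "No executable browser proof.",
  },
  {
    modelClass: "Low-cost routing endpoint",
    outcome: "Rejected",
    reviewerNote: "Too shallow for coding-agent benchmark.",
    answerExcerpt: "Search should be tested.",
    failureReason: "No patch.",
  },
];

// Count runs per outcome so a reviewer can compare against the aggregate.
function countByOutcome(records: ModelRunEvidence[]): Map<Outcome, number> {
  const counts = new Map<Outcome, number>();
  for (const r of records) {
    counts.set(r.outcome, (counts.get(r.outcome) ?? 0) + 1);
  }
  return counts;
}

console.log(countByOutcome(runs));
```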
Aggregate scores are useful only when reviewers can inspect the task packet, expected evidence, and the exact failure mode. This page is the pattern for publishing public samples while keeping harder holdout tasks private.