Frontend QA / public trace / Medium

Add Playwright smoke test

Add a bounded Playwright smoke test for Command-K search and verify it does not create runaway memory usage.

Expected evidence

new smoke test
bounded browser context
clean verifier run

Scoring focus

browser verification
resource discipline
search behavior

Common failure mode

Weak agents add broad end-to-end tests that duplicate the verifier and slow local runs.

Expected output

A scoped test that opens search, types a query, navigates to a result, and exits cleanly.
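
The four steps above could be sketched roughly as follows. This is an illustration under stated assumptions, not the suite's actual test: the function name `runSearchSmoke`, the `baseUrl`, `query`, and `resultUrl` parameters, the 10-second timeout, and the assumption that Enter opens the first result are all hypothetical. The small interfaces list only the Playwright methods the sketch calls, so a real Playwright `Browser` should be passable directly. The `ControlOrMeta+K` shortcut assumes Playwright 1.45 or later.

```typescript
// Structural types listing only the Playwright methods this sketch calls.
// A real Playwright Browser, BrowserContext, and Page satisfy these shapes.
interface PageLike {
  goto(url: string): Promise<unknown>;
  waitForURL(url: RegExp): Promise<void>;
  url(): string;
  keyboard: {
    press(key: string): Promise<void>;
    type(text: string): Promise<void>;
  };
}

interface ContextLike {
  setDefaultTimeout(timeout: number): void;
  newPage(): Promise<PageLike>;
  close(): Promise<void>;
}

interface BrowserLike {
  newContext(): Promise<ContextLike>;
}

async function runSearchSmoke(
  browser: BrowserLike,
  baseUrl: string, // hypothetical app URL
  query: string, // hypothetical search term
  resultUrl: RegExp, // hypothetical pattern matching a result page URL
): Promise<string> {
  // A fresh context per run keeps cookies, storage, and pages isolated.
  const context = await browser.newContext();
  // Bound navigation and wait calls so a hung page fails fast.
  context.setDefaultTimeout(10000);
  try {
    const page = await context.newPage();
    await page.goto(baseUrl);
    await page.keyboard.press("ControlOrMeta+K"); // open Command-K search
    await page.keyboard.type(query); // type a query
    await page.keyboard.press("Enter"); // assumed to open the first result
    await page.waitForURL(resultUrl); // confirm navigation reached a result
    return page.url();
  } finally {
    // Always release the context and its pages, even when a step throws.
    await context.close();
  }
}
```

With Playwright installed, the function could be driven by a browser from `chromium.launch()`, with the browser itself closed in an outer `finally`. A real test would likely wait for the search input to be visible before typing; no selector is shown because the source does not name one. The sketch bounds context lifetime and wait times only; it does not measure memory.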

Score breakdown

Test coverage: 35
Scope: 25
Runtime discipline: 25
Verification: 15

Trace provenance

Can this public trace be audited later?

Trace id: trace-coding-agent-maintenance-suite-add-playwright-smoke-test
Created: 2026-05-20
Last reviewed: 2026-05-16
Source: data/benchmark-trace-runs.csv
Leakage risk: Medium (public sample can become saturated after publication).
Retirement status: Active public sample; review after monthly refresh.

Score calculation ledger

How the top score is allocated

The score equals the sum of the weighted rubric components. Each component's earned points are allocated from the aggregate run score.

Test coverage: 27/35
Scope: 19/25
Runtime discipline: 19/25
Verification: 12/15
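
As a quick arithmetic check, the ledger above can be summed in a few lines. The `RubricComponent` type and `totalScore` helper are hypothetical names used only for this illustration; the numbers are taken from the ledger.

```typescript
// One row of the score ledger: points earned out of the component's weight.
interface RubricComponent {
  name: string;
  earned: number;
  max: number;
}

const ledger: RubricComponent[] = [
  { name: "Test coverage", earned: 27, max: 35 },
  { name: "Scope", earned: 19, max: 25 },
  { name: "Runtime discipline", earned: 19, max: 25 },
  { name: "Verification", earned: 12, max: 15 },
];

// Sum earned and maximum points, rejecting rows that exceed their weight.
function totalScore(components: RubricComponent[]): { earned: number; max: number } {
  let earned = 0;
  let max = 0;
  for (const c of components) {
    if (c.earned < 0 || c.earned > c.max) {
      throw new Error(`Component "${c.name}" is out of range`);
    }
    earned += c.earned;
    max += c.max;
  }
  return { earned, max };
}

const total = totalScore(ledger);
console.log(`${total.earned}/${total.max}`); // prints "77/100"
```

The total of 77 matches the score reported for the top run.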

Model version

frontier-reasoning-eval-public-2026-05

Run seed

2026051700

Prompt packet

add-playwright-smoke-test-public-packet-v0.1

Artifact bundle

Replay files for this trace

Replay scaffold generated from the current seed trace. Replace with real harness exports when model runs are available.

Replay command

pnpm benchmarks:replay --suite coding-agent-maintenance-suite --task add-playwright-smoke-test

This command is intentionally documented before the real harness exists so the artifact contract is visible.

Payload preview

Split

public

Difficulty

Medium

Evidence fields

3

Model runs

4

Screenshot

Pending real browser or app screenshot artifact.

Model run evidence

Trace-level comparison

This is the inspection layer that keeps benchmark scores honest: each model class gets an outcome, cost proxy, latency, and reviewer note.

Frontier API provider

Frontier reasoning model

Accepted

Score: 77
Cost units: 5.1
Latency: 7140 ms

Good scoped test and clean browser teardown.

Answer excerpt

Added one targeted browser test with context cleanup and reran the local verifier.

Failure reason

No major issue.

read verifier, add test, run playwright

Fast hosted API provider

Fast mid-tier model

Accepted with review

Score: 68
Cost units: 2.3
Latency: 4210 ms

Useful test; cleanup needed.

Answer excerpt

Added a search navigation check but initially left the browser open.

Failure reason

Reviewer requested explicit cleanup.

add test, run playwright

Self-hosted/open-weight stack

Open-weight local model

Partial

Score: 49
Cost units: 1.5
Latency: 6310 ms

Understood intent but did not automate the smoke path.

Answer excerpt

Added a manual test note instead of an executable check.

Failure reason

No executable browser proof.

write note

Low-cost routing endpoint

Small routing model

Rejected

Score: 26
Cost units: 0.4
Latency: 2050 ms

Too shallow for coding-agent benchmark.

Answer excerpt

Search should be tested.

Failure reason

No patch.

classify issue
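
For later auditing, the four runs above could be kept as typed records. The sketch below is one possible shape, not the harness's export format: the `ModelRun` field names and the `needsFollowUp` helper are hypothetical, while the values are copied from this page.

```typescript
type Outcome = "Accepted" | "Accepted with review" | "Partial" | "Rejected";

// One model run as shown in the trace-level comparison.
interface ModelRun {
  model: string;
  outcome: Outcome;
  score: number;
  costUnits: number;
  latencyMs: number;
  reviewerNote: string;
}

const runs: ModelRun[] = [
  { model: "Frontier reasoning model", outcome: "Accepted", score: 77, costUnits: 5.1, latencyMs: 7140, reviewerNote: "Good scoped test and clean browser teardown." },
  { model: "Fast mid-tier model", outcome: "Accepted with review", score: 68, costUnits: 2.3, latencyMs: 4210, reviewerNote: "Useful test; cleanup needed." },
  { model: "Open-weight local model", outcome: "Partial", score: 49, costUnits: 1.5, latencyMs: 6310, reviewerNote: "Understood intent but did not automate the smoke path." },
  { model: "Small routing model", outcome: "Rejected", score: 26, costUnits: 0.4, latencyMs: 2050, reviewerNote: "Too shallow for coding-agent benchmark." },
];

// Names of runs that were not accepted outright and so carry a reviewer concern.
function needsFollowUp(all: ModelRun[]): string[] {
  return all.filter((run) => run.outcome !== "Accepted").map((run) => run.model);
}

console.log(needsFollowUp(runs));
```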

Why this trace matters

Aggregate scores are useful only when reviewers can inspect the task packet, expected evidence, and the exact failure mode. This page is the pattern for publishing public samples while keeping harder holdout tasks private.
