Frontend QA / public trace / Medium

Add Playwright smoke test

Add a bounded Playwright smoke test for Command-K search and verify it does not create runaway memory usage.


Expected evidence

new smoke test

bounded browser context

clean verifier run

Scoring focus

browser verification

resource discipline

search behavior

Common failure mode

Weak agents add broad end-to-end tests that duplicate the verifier and slow local runs.

Expected output

A scoped test that opens search, types a query, navigates to a result, and exits cleanly.
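
The "bounded" and "exits cleanly" requirements can be illustrated with a small sketch. It assumes nothing about the real harness: `withBoundedResource` and `Closable` are hypothetical names introduced here, and the browser context is replaced by a stand-in object so the example runs without a browser. The idea is simply that the resource is always closed, and the smoke path fails fast if it exceeds a time budget.

```typescript
// Hypothetical helper, not part of Playwright or of this benchmark's harness.
// Anything with an async close() qualifies, e.g. a browser context.
interface Closable {
  close(): Promise<void>;
}

async function withBoundedResource<R extends Closable, T>(
  open: () => Promise<R>,
  body: (resource: R) => Promise<T>,
  timeoutMs: number,
): Promise<T> {
  const resource = await open();
  let timer: ReturnType<typeof setTimeout> | undefined;

  // Rejects once the time budget is spent, so a hung step cannot run forever.
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(
      () => reject(new Error(`smoke step exceeded ${timeoutMs} ms`)),
      timeoutMs,
    );
  });

  try {
    // Whichever settles first decides the outcome.
    return await Promise.race([body(resource), timeout]);
  } finally {
    // Teardown runs on success, failure, and timeout alike.
    if (timer !== undefined) clearTimeout(timer);
    await resource.close();
  }
}

// Stand-in resource so the sketch runs without a browser.
let demoClosed = false;
withBoundedResource(
  async () => ({ close: async () => { demoClosed = true; } }),
  async () => "search smoke path finished",
  5_000,
).then((result) => console.log(result, "| closed:", demoClosed));
```

In an actual Playwright test, `open` would typically be `() => browser.newContext()` and the body would drive the Command-K flow. Those calls are left out here, and the 5-second budget is an arbitrary illustrative value.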

Score breakdown

Test coverage: 35

Scope: 25

Runtime discipline: 25

Verification: 15

Trace provenance

Can this public trace be audited later?

Trace id: trace-coding-agent-maintenance-suite-add-playwright-smoke-test

Created: 2026-05-20

Last reviewed: 2026-05-16

Source: data/benchmark-trace-runs.csv

Leakage risk: Medium (public sample can become saturated after publication).

Retirement status: Active public sample; review after monthly refresh.

Score calculation ledger

How the top score is allocated

The score is the sum of the points earned on each weighted rubric component. Each component's earned points are allocated from the aggregate run score.

Test coverage: 27/35

Scope: 19/25

Runtime discipline: 19/25

Verification: 12/15
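
As a quick check of the ledger arithmetic, the following sketch sums the earned and maximum points of the four components listed above. The `RubricComponent` type and `totalScore` function are illustrative names, not part of the benchmark harness.

```typescript
// Illustrative types and names; only the numbers come from the ledger above.
interface RubricComponent {
  name: string;
  earned: number;
  max: number;
}

const ledger: RubricComponent[] = [
  { name: "Test coverage", earned: 27, max: 35 },
  { name: "Scope", earned: 19, max: 25 },
  { name: "Runtime discipline", earned: 19, max: 25 },
  { name: "Verification", earned: 12, max: 15 },
];

// Sum earned and maximum points across all components.
function totalScore(components: RubricComponent[]): { earned: number; max: number } {
  return components.reduce(
    (acc, c) => ({ earned: acc.earned + c.earned, max: acc.max + c.max }),
    { earned: 0, max: 0 },
  );
}

const total = totalScore(ledger);
console.log(`${total.earned}/${total.max}`); // prints "77/100"
```

The total of 77 matches the score reported for the top-scoring run in the trace-level comparison.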

Model version

frontier-reasoning-eval-public-2026-05

Run seed

2026051700

Prompt packet

add-playwright-smoke-test-public-packet-v0.1

Artifact bundle

Replay files for this trace

Replay scaffold generated from the current seed trace. Replace with real harness exports when model runs are available.

Input payload, raw run log, scorecard

Replay command

pnpm benchmarks:replay --suite coding-agent-maintenance-suite --task add-playwright-smoke-test

This command is intentionally documented before the real harness exists so the artifact contract is visible.

Payload preview

Split

public

Difficulty

Medium

Evidence fields

Model runs

Screenshot

Pending real browser or app screenshot artifact.

Model run evidence

Trace-level comparison

This is the inspection layer that keeps benchmark scores honest: each model class gets an outcome, cost proxy, latency, and reviewer note.

Frontier API provider

Frontier reasoning model

Accepted

Score: 77

Cost units: 5.1

Latency: 7140 ms

Good scoped test and clean browser teardown.

Answer excerpt

Added one targeted browser test with context cleanup and reran the local verifier.

Failure reason

No major issue.

read verifier, add test, run playwright

Fast hosted API provider

Fast mid-tier model

Accepted with review

Score: 68

Cost units: 2.3

Latency: 4210 ms

Useful test; cleanup needed.

Answer excerpt

Added a search navigation check but initially left the browser open.

Failure reason

Reviewer requested explicit cleanup.

add test, run playwright

Self-hosted/open-weight stack

Open-weight local model

Partial

Score: 49

Cost units: 1.5

Latency: 6310 ms

Understood intent but did not automate the smoke path.

Answer excerpt

Added a manual test note instead of an executable check.

Failure reason

No executable browser proof.

write note

Low-cost routing endpoint

Small routing model

Rejected

Score: 26

Cost units: 0.4

Latency: 2050 ms

Too shallow for coding-agent benchmark.

Answer excerpt

Search should be tested.

Failure reason

No patch.

classify issue

Why this trace matters

Aggregate scores are useful only when reviewers can inspect the task packet, expected evidence, and the exact failure mode. This page is the pattern for publishing public samples while keeping harder holdout tasks private.
