
Extraction / holdout trace / Hard

Competitor feature map

Visit three product pages; extract pricing tiers and feature gates; normalize them into a comparable table.

Expected evidence

source URLs
tier names
feature gates

Scoring focus

multi-page extraction
normalization
source proof

Common failure mode

Weak browser agents copy marketing bullets without normalizing feature availability by tier.
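The normalization step this failure mode skips can be sketched as mapping raw marketing bullets onto a shared feature vocabulary so availability is comparable across tiers. This is a minimal illustration, not the harness's extraction pipeline; the tier names, feature names, and keyword lists are invented for the example.

```python
def normalize_features(raw_tiers: dict[str, list[str]],
                       vocabulary: dict[str, list[str]]) -> dict[str, dict[str, bool]]:
    """Map raw bullet text to a tier -> {canonical feature: available} table.

    vocabulary maps a canonical feature name to keyword phrases that
    signal the feature in marketing copy.
    """
    table = {}
    for tier, bullets in raw_tiers.items():
        text = " ".join(bullets).lower()
        table[tier] = {
            feature: any(kw in text for kw in keywords)
            for feature, keywords in vocabulary.items()
        }
    return table

# Illustrative inputs (invented, not from the trace):
raw = {
    "Free": ["Basic dashboards", "Community support"],
    "Pro": ["Basic dashboards", "SSO login", "Priority support"],
}
vocab = {
    "sso": ["sso", "single sign-on"],
    "priority_support": ["priority support"],
}
normalized = normalize_features(raw, vocab)
```

A weak agent stops at `raw`; the scoring rubric rewards producing something shaped like `normalized`, where every tier answers the same feature questions.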

Expected output

A normalized comparison table with source URLs, tier names, gated features, and extraction confidence.
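One plausible row shape for that table, combining the four evidence fields named above. The class name and field types are assumptions for illustration, not the harness's actual output contract.

```python
from dataclasses import dataclass, field

@dataclass
class ComparisonRow:
    source_url: str                 # page the tier was extracted from
    tier_name: str                  # normalized tier label, e.g. "Pro"
    gated_features: list[str] = field(default_factory=list)  # features unlocked at this tier
    extraction_confidence: float = 0.0                       # 0.0-1.0 confidence in the extraction

# Illustrative row (values invented, not from the trace):
row = ComparisonRow(
    source_url="https://example.com/pricing",
    tier_name="Pro",
    gated_features=["sso", "priority_support"],
    extraction_confidence=0.9,
)
```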

Score breakdown

Extraction: 30
Normalization: 30
Source proof: 25
Confidence: 15

Trace provenance

Can this public trace be audited later?

Trace id: trace-browser-operations-suite-competitor-feature-map
Created: 2026-05-23
Last reviewed: 2026-05-16
Source: data/benchmark-trace-runs.csv
Leakage risk: Low (holdout task is not published in full).
Retirement status: Private holdout; keep sealed until replacement task exists.

Score calculation ledger

How the top score is allocated

The run score is the sum of its weighted rubric components: each component earns points against its rubric weight, and the earned points add up to the aggregate run score.

Extraction: 21/30
Normalization: 21/30
Source proof: 18/25
Confidence: 10/15
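The ledger arithmetic is simple enough to sketch directly; component names and the (earned, maximum) pairs below are taken from this trace, and the totals match the reported run score of 70 against a 100-point rubric.

```python
# (earned, maximum) points per rubric component, from this trace's ledger.
components = {
    "extraction": (21, 30),
    "normalization": (21, 30),
    "source_proof": (18, 25),
    "confidence": (10, 15),
}

total_earned = sum(earned for earned, _ in components.values())
total_max = sum(maximum for _, maximum in components.values())
# total_earned -> 70, total_max -> 100
```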

Model version

frontier-reasoning-eval-holdout-2026-05

Run seed

2026051730

Prompt packet

competitor-feature-map-holdout-packet-v0.1

Artifact bundle

Replay files for this trace

Replay scaffold generated from the current seed trace. Replace with real harness exports when model runs are available.

Replay command

pnpm benchmarks:replay --suite browser-operations-suite --task competitor-feature-map

This command is intentionally documented before the real harness exists so the artifact contract is visible.

Payload preview

Split

holdout

Difficulty

Hard

Evidence fields

3

Model runs

4

Screenshot

Pending real browser or app screenshot artifact.

Model run evidence

Trace-level comparison

This is the inspection layer that keeps benchmark scores honest: each model class gets an outcome, cost proxy, latency, and reviewer note.
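The four fields that sentence names can be captured in a small record type; this is a hedged sketch of the inspection layer, with the class name and field types assumed for illustration rather than taken from the harness schema. The two sample records use values from this page's model runs.

```python
from dataclasses import dataclass

@dataclass
class RunEvidence:
    provider: str      # e.g. "Frontier API provider"
    outcome: str       # "Accepted with review" | "Partial" | "Rejected"
    score: int         # out of 100
    cost_units: float  # relative cost proxy, not currency
    latency_ms: int
    reviewer_note: str

# Two of the four runs reported below, for illustration:
runs = [
    RunEvidence("Frontier API provider", "Accepted with review", 70, 5.0, 6890,
                "Strong multi-page extraction with minor label cleanup."),
    RunEvidence("Low-cost routing endpoint", "Rejected", 23, 0.4, 1960,
                "Classification only."),
]
best = max(runs, key=lambda r: r.score)
```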

Frontier API provider

Frontier reasoning model

Accepted with review

Score: 70
Cost units: 5
Latency: 6890 ms

Strong multi-page extraction with minor label cleanup.

Answer excerpt

Created a normalized tier table and preserved source URLs for all pages.

Failure reason

One confidence label needed review.

visit pages · extract tables · normalize features

Fast hosted API provider

Fast mid-tier model

Partial

Score: 57
Cost units: 2.2
Latency: 4080 ms

Incomplete but recoverable.

Answer excerpt

Collected feature bullets from two of three pages.

Failure reason

Missed one page and skipped normalization.

visit pages · extract bullets

Self-hosted/open-weight stack

Open-weight local model

Rejected

Score: 39
Cost units: 1.5
Latency: 6040 ms

Could not maintain state across pages.

Answer excerpt

Summarized the first page only.

Failure reason

Failed multi-page workflow.

visit first page

Low-cost routing endpoint

Small routing model

Rejected

Score: 23
Cost units: 0.4
Latency: 1960 ms

Classification only.

Answer excerpt

Competitive research task.

Failure reason

No execution.

classify workflow

Why this trace matters

Aggregate scores are useful only when reviewers can inspect the task packet, expected evidence, and the exact failure mode. This page is the pattern for publishing public samples while keeping harder holdout tasks private.
