Extraction / public trace / Medium

Pricing page extraction

Navigate a provider pricing page, extract input/output token prices, and return source-linked structured data.

Build this benchmark Read methodology

Expected evidence

pricing table

source URL

structured output

Scoring focus

navigation success

evidence extraction

state verification

Common failure mode

Browser agents extract stale snippets or fail to distinguish input, cached input, and output pricing.

Expected output

A source-linked table separating input, cached input, output, and batch pricing where available.

Score breakdown

Navigation25

Extraction35

Field separation25

Source proof15

Trace provenance

Can this public trace be audited later?

Trace id: trace-browser-operations-suite-pricing-page-extraction

Created: 2026-05-14

Last reviewed: 2026-05-16

Source: data/benchmark-trace-input.json

Leakage risk: Medium: public sample can become saturated after publication.

Retirement status: Active public sample; review after monthly refresh.

Score calculation ledger

How the top score is allocated

Score equals the sum of weighted rubric components. Component earned points are allocated from the aggregate run score.

Navigation18/25

Extraction26/35

Field separation18/25

Source proof11/15

Model version

frontier-reasoning-eval-public-2026-05

Run seed

2026051640

Prompt packet

pricing-page-extraction-public-packet-v0.1

Artifact bundle

Replay files for this trace

Replay scaffold generated from the current seed trace. Replace with real harness exports when model runs are available.

Input payload Raw run log Scorecard

Replay command

pnpm benchmarks:replay --suite browser-operations-suite --task pricing-page-extraction

This command is intentionally documented before the real harness exists so the artifact contract is visible.

Payload preview

Split

public

Difficulty

Medium

Evidence fields

Model runs

Screenshot

Pending real browser or app screenshot artifact.

Model run evidence

Trace-level comparison

This is the inspection layer that keeps benchmark scores honest: each model class gets an outcome, cost proxy, latency, and reviewer note.

Frontier API provider

Frontier reasoning model

Accepted

Score73

Cost units4.7

Latency6150ms

Extracted the right fields and preserved source URL; minor formatting cleanup needed.

Answer excerpt

Returned input, cached-input, and output prices with the source URL preserved for review.

Failure reason

Minor formatting cleanup needed.

open pricing pageextract pricing tablenormalize fields

Fast hosted API provider

Fast mid-tier model

Accepted with review

Score62

Cost units2

Latency3520ms

Reached page and extracted most prices but missed cached-input distinction.

Answer excerpt

Returned provider prices for input and output tokens.

Failure reason

Missed cached-input distinction.

open pricing pageextract visible table

Self-hosted/open-weight stack

Open-weight local model

Partial

Score44

Cost units1.4

Latency5480ms

Found pricing page but mixed plan names with API prices.

Answer excerpt

The page includes multiple API price plans.

Failure reason

Mixed plan names with API prices.

open pricing page

Low-cost routing endpoint

Small routing model

Rejected

Score28

Cost units0.4

Latency1760ms

Could classify the page type only.

Answer excerpt

Pricing page detected.

Failure reason

Could classify the page type only.

classify page

Why this trace matters

Aggregate scores are useful only when reviewers can inspect the task packet, expected evidence, and the exact failure mode. This page is the pattern for publishing public samples while keeping harder holdout tasks private.

Return to suite report