Navigation / public trace / Medium

Invoice portal download

Navigate a mock vendor portal, download the latest invoice, verify the file state, and extract the invoice number.

Expected evidence

downloaded file
invoice number
confirmation state

Scoring focus

browser state
artifact proof
extraction

Common failure mode

Weak browser agents claim success after clicking download without proving the file exists.
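One way to avoid this failure mode is to check the file on disk before reporting success. The sketch below is a minimal illustration, not the harness's actual check: the function name `verify_download` and the returned fields are hypothetical, and it assumes the download lands at a known local path.

```python
import hashlib
from pathlib import Path


def verify_download(path: Path) -> dict:
    """Return proof that a downloaded file exists and is non-empty.

    Hypothetical helper: raises instead of assuming the click succeeded.
    """
    if not path.is_file():
        raise FileNotFoundError(f"no downloaded artifact at {path}")
    size = path.stat().st_size
    if size == 0:
        raise ValueError(f"downloaded artifact {path.name} is empty")
    # A content hash lets a reviewer confirm later that the same file was inspected.
    digest = hashlib.sha256(path.read_bytes()).hexdigest()
    return {"name": path.name, "size_bytes": size, "sha256": digest}
```

An agent that calls a check like this after clicking download can attach the returned record to its answer, which is the kind of artifact proof the scoring focus rewards.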

Expected output

A browser-state proof containing the downloaded artifact name, the invoice number, and a confirmation screenshot.
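The three evidence fields could be carried in a small record so that a reviewer can see at a glance which one is missing. This is only a sketch: the class name `BrowserStateProof`, its field names, and the sample values are hypothetical and not part of the published packet.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class BrowserStateProof:
    artifact_name: str            # name of the downloaded file
    invoice_number: str           # value extracted from the invoice
    confirmation_screenshot: str  # path or id of the confirmation screenshot

    def missing_fields(self) -> list[str]:
        """List the evidence fields that are empty or whitespace only."""
        return [name for name, value in vars(self).items() if not value.strip()]


# Hypothetical values, for illustration only.
proof = BrowserStateProof("invoice-latest.pdf", "INV-0001", "")
print(proof.missing_fields())
```

A proof with an empty screenshot reference, as in the example, would report `confirmation_screenshot` as missing rather than passing silently.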

Score breakdown

Navigation: 25
Download proof: 30
Extraction: 25
Recovery: 20

Trace provenance

Can this public trace be audited later?

Trace id: trace-browser-operations-suite-invoice-portal-download
Created: 2026-05-22
Last reviewed: 2026-05-16
Source: data/benchmark-trace-runs.csv
Leakage risk: Medium. The public sample can become saturated after publication.
Retirement status: Active public sample; review after monthly refresh.

Score calculation ledger

How the top score is allocated

The score equals the sum of the weighted rubric components. Each component's earned points are allocated from the aggregate run score. The four components listed here sum to 72 out of 100.

Navigation: 18/25
Download proof: 22/30
Extraction: 18/25
Recovery: 14/20
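The page does not state the exact allocation rule. One rule that reproduces the ledger is to split the aggregate score in proportion to each component's weight and round to the nearest point. The sketch below assumes that rule; the function name `allocate` is hypothetical.

```python
# Rubric weights from the score breakdown (they sum to 100).
WEIGHTS = {"Navigation": 25, "Download proof": 30, "Extraction": 25, "Recovery": 20}


def allocate(aggregate: int, weights: dict) -> dict:
    """Split an aggregate score across components in proportion to weight.

    Assumed rule: proportional share, rounded to the nearest whole point.
    """
    total_weight = sum(weights.values())
    return {name: round(aggregate * w / total_weight) for name, w in weights.items()}


ledger = allocate(72, WEIGHTS)
print(ledger)
print(sum(ledger.values()))
```

With an aggregate of 72 this yields 18, 22, 18, and 14, matching the ledger. Rounding each share independently does not guarantee that the parts always sum back to the aggregate, so a real harness would need a tie-breaking rule for other scores.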

Model version

frontier-reasoning-eval-public-2026-05

Run seed

2026051720

Prompt packet

invoice-portal-download-public-packet-v0.1

Artifact bundle

Replay files for this trace

Replay scaffold generated from the current seed trace. Replace with real harness exports when model runs are available.

Replay command

pnpm benchmarks:replay --suite browser-operations-suite --task invoice-portal-download

This command is intentionally documented before the real harness exists so the artifact contract is visible.

Payload preview

Split

public

Difficulty

Medium

Evidence fields

3

Model runs

4

Screenshot

Pending real browser or app screenshot artifact.

Model run evidence

Trace-level comparison

This is the inspection layer that keeps benchmark scores honest: each model class gets an outcome, cost proxy, latency, and reviewer note.

Frontier API provider

Frontier reasoning model

Accepted

Score: 72
Cost units: 4.8
Latency: 6410 ms

Good artifact proof and extraction.

Answer excerpt

Downloaded the newest invoice and verified artifact state before extracting the number.

Failure reason

Minor delay only.

navigate portal, download file, verify artifact

Fast hosted API provider

Fast mid-tier model

Accepted with review

Score: 61
Cost units: 2.1
Latency: 3710 ms

Useful but needed stronger state verification.

Answer excerpt

Clicked the correct invoice and extracted its number.

Failure reason

Missing explicit artifact-state proof.

navigate portal, download file

Self-hosted/open-weight stack

Open-weight local model

Partial

Score: 42
Cost units: 1.4
Latency: 5660 ms

Navigation worked but date selection failed.

Answer excerpt

Reached the invoice list but selected the older invoice.

Failure reason

Wrong artifact.

navigate portal

Low-cost routing endpoint

Small routing model

Rejected

Score: 27
Cost units: 0.4
Latency: 1810 ms

Routing only.

Answer excerpt

Invoice task detected.

Failure reason

No browser action.

classify workflow
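The four runs above can be held as plain records for side-by-side inspection. The sketch below uses the values published on this page; the class name `ModelRun` and its field names are hypothetical, since the page does not define an export schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelRun:
    model: str
    outcome: str
    score: int
    cost_units: float
    latency_ms: int
    failure_reason: str


RUNS = [
    ModelRun("Frontier reasoning model", "Accepted", 72, 4.8, 6410, "Minor delay only."),
    ModelRun("Fast mid-tier model", "Accepted with review", 61, 2.1, 3710,
             "Missing explicit artifact-state proof."),
    ModelRun("Open-weight local model", "Partial", 42, 1.4, 5660, "Wrong artifact."),
    ModelRun("Small routing model", "Rejected", 27, 0.4, 1810, "No browser action."),
]

# Runs a reviewer accepted, with or without follow-up review.
accepted = [run.model for run in RUNS if run.outcome.startswith("Accepted")]
# Highest score first, so the top-scoring trace is easy to find.
ranked = sorted(RUNS, key=lambda run: run.score, reverse=True)

print(accepted)
print(ranked[0].model, ranked[0].score)
```

Keeping the failure reason next to the score is the point of the comparison: a reviewer can see why a run lost points, not only how many.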

Why this trace matters

Aggregate scores are useful only when reviewers can inspect the task packet, expected evidence, and the exact failure mode. This page is the pattern for publishing public samples while keeping harder holdout tasks private.
