Frontier API provider
Frontier reasoning model
Accepted
Good artifact proof and extraction.
Answer excerpt
Downloaded the newest invoice and verified artifact state before extracting the number.
Failure reason
Minor delay only.
Navigation / public trace / Medium
Navigate a mock vendor portal, download the latest invoice, verify the file state, and extract the invoice number.
Expected evidence
Scoring focus
Common failure mode
Weak browser agents claim success after clicking download without proving the file exists.
Expected output
A browser-state proof with the downloaded artifact name, the invoice number, and a confirmation screenshot.
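The gap between clicking download and proving the file exists can be illustrated with a minimal sketch. It assumes a Node.js runtime; the `ArtifactProof` shape, the `proveDownload` helper, and the placeholder file name are hypothetical and are not taken from the harness.

```typescript
import { createHash } from "node:crypto";
import { existsSync, mkdtempSync, readFileSync, statSync, writeFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { basename, join } from "node:path";

// Hypothetical proof record; field names are assumptions, not the harness schema.
interface ArtifactProof {
  artifactName: string;
  sizeBytes: number;
  sha256: string;
}

// Returns a proof only when the file is really on disk and non-empty.
// A click on "download" alone never produces a proof.
function proveDownload(filePath: string): ArtifactProof | null {
  if (!existsSync(filePath)) return null;
  const stats = statSync(filePath);
  if (!stats.isFile() || stats.size === 0) return null;
  const sha256 = createHash("sha256").update(readFileSync(filePath)).digest("hex");
  return { artifactName: basename(filePath), sizeBytes: stats.size, sha256 };
}

// Usage with a throwaway file standing in for a downloaded invoice.
const downloadDir = mkdtempSync(join(tmpdir(), "invoice-"));
const presentPath = join(downloadDir, "invoice.pdf");
writeFileSync(presentPath, "placeholder invoice bytes");

const proof = proveDownload(presentPath);
const missingProof = proveDownload(join(downloadDir, "never-downloaded.pdf"));
```

An agent that only clicked the link ends up in the `missingProof` case, with nothing to attach as evidence. The hash gives reviewers a way to confirm that the extracted invoice number came from the same file.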
Score breakdown
Trace provenance
Score calculation ledger
The score equals the sum of the weighted rubric components. Each component's earned points are allocated from the aggregate run score.
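One way to read the ledger rule is as a proportional allocation. The sketch below assumes that reading; the component names, the weights, and the aggregate score of 80 are illustrative values only, not figures from this run.

```typescript
// Hypothetical rubric component; names and weights below are illustrative only.
interface RubricComponent {
  name: string;
  weight: number; // share of the total score, in whole percent
}

interface LedgerRow extends RubricComponent {
  maxPoints: number;
  earnedPoints: number;
}

// Allocates an aggregate run score (0..100) across components
// in proportion to their weights (assumed allocation rule).
function buildLedger(components: RubricComponent[], aggregateScore: number): LedgerRow[] {
  const totalWeight = components.reduce((sum, c) => sum + c.weight, 0);
  if (totalWeight !== 100) {
    throw new Error(`weights must sum to 100, got ${totalWeight}`);
  }
  return components.map((c) => ({
    ...c,
    maxPoints: c.weight,
    earnedPoints: (c.weight * aggregateScore) / 100,
  }));
}

// The score is the sum of the earned points of every component.
function totalScore(rows: LedgerRow[]): number {
  return rows.reduce((sum, row) => sum + row.earnedPoints, 0);
}

const rubric: RubricComponent[] = [
  { name: "navigation", weight: 25 },
  { name: "artifact-state proof", weight: 50 },
  { name: "invoice number extraction", weight: 25 },
];
const ledger = buildLedger(rubric, 80);
```

Under this rule the component rows always add back up to the aggregate, so a reviewer can check the ledger against the headline score.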
Model version
frontier-reasoning-eval-public-2026-05
Run seed
2026051720
Prompt packet
invoice-portal-download-public-packet-v0.1
Artifact bundle
Replay scaffold generated from the current seed trace. Replace with real harness exports when model runs are available.
Replay command
pnpm benchmarks:replay --suite browser-operations-suite --task invoice-portal-download
This command is intentionally documented before the real harness exists so the artifact contract is visible.
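Because the harness does not exist yet, the artifact contract can only be sketched. The `ReplayArtifact` shape, `contractProblems`, and `replayCommand` below are hypothetical names; the only values taken from this page are the model version, run seed, prompt packet, suite, and task.

```typescript
// Hypothetical artifact shape; the real harness export may differ.
interface ReplayArtifact {
  modelVersion: string;
  runSeed: number;
  promptPacket: string;
  suite: string;
  task: string;
}

// Values copied from the trace provenance and replay command on this page.
const artifact: ReplayArtifact = {
  modelVersion: "frontier-reasoning-eval-public-2026-05",
  runSeed: 2026051720,
  promptPacket: "invoice-portal-download-public-packet-v0.1",
  suite: "browser-operations-suite",
  task: "invoice-portal-download",
};

// Lists every missing or malformed field instead of throwing on the first one,
// so a reviewer sees all contract problems at once.
function contractProblems(value: Partial<ReplayArtifact>): string[] {
  const problems: string[] = [];
  for (const key of ["modelVersion", "promptPacket", "suite", "task"] as const) {
    const field = value[key];
    if (typeof field !== "string" || field.length === 0) {
      problems.push(`${key} is missing`);
    }
  }
  if (!Number.isInteger(value.runSeed)) {
    problems.push("runSeed must be an integer");
  }
  return problems;
}

// Rebuilds the documented replay command from the artifact fields.
function replayCommand(value: ReplayArtifact): string {
  return `pnpm benchmarks:replay --suite ${value.suite} --task ${value.task}`;
}
```

Deriving the command from the artifact keeps the two in step: if a bundle cannot reproduce the documented command, it does not satisfy the contract.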
Payload preview
Split
public
Difficulty
Medium
Evidence fields
3
Model runs
4
Screenshot
Pending real browser or app screenshot artifact.
Model run evidence
This is the inspection layer that keeps benchmark scores honest: each model class gets an outcome, cost proxy, latency, and reviewer note.
Frontier API provider
Accepted
Good artifact proof and extraction.
Answer excerpt
Downloaded the newest invoice and verified artifact state before extracting the number.
Failure reason
Minor delay only.
Fast hosted API provider
Accepted with review
Useful but needed stronger state verification.
Answer excerpt
Clicked the correct invoice and extracted its number.
Failure reason
Missing explicit artifact-state proof.
Self-hosted/open-weight stack
Partial
Navigation worked but date selection failed.
Answer excerpt
Reached the invoice list but selected the older invoice.
Failure reason
Wrong artifact.
Low-cost routing endpoint
Rejected
Routing only.
Answer excerpt
Invoice task detected.
Failure reason
No browser action.
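The four runs above can be held in a single typed record so that outcomes and failure reasons stay inspectable together. This is a sketch only: the field names are assumptions, and the cost proxy and latency are left unset because this page does not show their values.

```typescript
type Outcome = "Accepted" | "Accepted with review" | "Partial" | "Rejected";

// Hypothetical record shape; field names are assumptions.
interface ModelRunEvidence {
  modelClass: string;
  outcome: Outcome;
  reviewerNote: string;
  answerExcerpt: string;
  failureReason: string;
  costProxy?: number; // not shown on this page, left unset
  latencyMs?: number; // not shown on this page, left unset
}

const runs: ModelRunEvidence[] = [
  {
    modelClass: "Frontier API provider",
    outcome: "Accepted",
    reviewerNote: "Good artifact proof and extraction.",
    answerExcerpt:
      "Downloaded the newest invoice and verified artifact state before extracting the number.",
    failureReason: "Minor delay only.",
  },
  {
    modelClass: "Fast hosted API provider",
    outcome: "Accepted with review",
    reviewerNote: "Useful but needed stronger state verification.",
    answerExcerpt: "Clicked the correct invoice and extracted its number.",
    failureReason: "Missing explicit artifact-state proof.",
  },
  {
    modelClass: "Self-hosted/open-weight stack",
    outcome: "Partial",
    reviewerNote: "Navigation worked but date selection failed.",
    answerExcerpt: "Reached the invoice list but selected the older invoice.",
    failureReason: "Wrong artifact.",
  },
  {
    modelClass: "Low-cost routing endpoint",
    outcome: "Rejected",
    reviewerNote: "Routing only.",
    answerExcerpt: "Invoice task detected.",
    failureReason: "No browser action.",
  },
];

// Every run that was not cleanly accepted needs a reviewer to read its failure reason.
function runsNeedingFollowUp(all: ModelRunEvidence[]): ModelRunEvidence[] {
  return all.filter((run) => run.outcome !== "Accepted");
}
```

Keeping the failure reason next to the outcome means a "Partial" or "Rejected" label cannot be published without its explanation.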
Aggregate scores are useful only when reviewers can inspect the task packet, expected evidence, and the exact failure mode. This page is the pattern for publishing public samples while keeping harder holdout tasks private.