Form completion / holdout trace / Hard

Multi-step demo form

Complete a multi-step demo request form, handle validation errors, and verify the confirmation state with screenshot evidence.

Expected evidence

form state

validation recovery

confirmation screenshot

Scoring focus

form completion

recovery quality

screenshot verification

Common failure mode

Agents claim success after submit click even when the form remains on a validation-error state.
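One way to guard against this failure mode is to treat the submit click as unverified until the page itself shows success. The sketch below is a minimal illustration, not the benchmark's actual checker; `FormState`, its fields, and `submission_verified` are hypothetical names introduced here.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class FormState:
    """Hypothetical snapshot of the page taken after the submit click."""
    confirmation_visible: bool
    validation_errors: List[str] = field(default_factory=list)
    screenshot_path: Optional[str] = None


def submission_verified(state: FormState) -> bool:
    """Return True only when the final confirmation state is proven."""
    # A remaining validation error means the form was not accepted.
    if state.validation_errors:
        return False
    # Clicking submit is not evidence; the confirmation must be visible.
    if not state.confirmation_visible:
        return False
    # The task also expects screenshot evidence of the confirmation state.
    return state.screenshot_path is not None
```

An agent that reports success while `validation_errors` is non-empty would fail this check, which is the behaviour the failure mode describes.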

Expected output

A confirmed demo-request submission with evidence that the final confirmation state is visible.

Score breakdown

Form completion: 30

Validation recovery: 30

State verification: 25

Handoff safety: 15

Trace provenance

Can this public trace be audited later?

Trace id: trace-browser-operations-suite-multi-step-demo-form

Created: 2026-05-15

Last reviewed: 2026-05-16

Source: data/benchmark-trace-input.json

Leakage risk: Low (holdout task is not published in full).

Retirement status: Private holdout; keep sealed until replacement task exists.

Score calculation ledger

How the top score is allocated

The score is the sum of the weighted rubric components. Each component's earned points are allocated from the aggregate run score.

Form completion: 21/30

Validation recovery: 21/30

State verification: 17/25

Handoff safety: 10/15
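The ledger above is consistent with splitting the top run's aggregate score of 69 across the rubric weights and rounding each share to the nearest point. The sketch below assumes that proportional rule, which the page does not state explicitly; `allocate` is a hypothetical helper.

```python
# Rubric weights from the score breakdown (they sum to 100).
WEIGHTS = {
    "Form completion": 30,
    "Validation recovery": 30,
    "State verification": 25,
    "Handoff safety": 15,
}


def allocate(aggregate_score: int, weights: dict) -> dict:
    """Split an aggregate score across components in proportion to weight.

    Assumption: each share is rounded to the nearest whole point.
    """
    total_weight = sum(weights.values())
    return {
        name: round(aggregate_score * weight / total_weight)
        for name, weight in weights.items()
    }


ledger = allocate(69, WEIGHTS)
for name, earned in ledger.items():
    print(f"{name}: {earned}/{WEIGHTS[name]}")
```

Independent rounding does not guarantee that the shares always sum back to the aggregate; it happens to for 69, but a real harness would likely need a tie-breaking rule.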

Model version

frontier-reasoning-eval-holdout-2026-05

Run seed

2026051650

Prompt packet

multi-step-demo-form-holdout-packet-v0.1

Artifact bundle

Replay files for this trace

Replay scaffold generated from the current seed trace. Replace with real harness exports when model runs are available.

Input payload / Raw run log / Scorecard

Replay command

pnpm benchmarks:replay --suite browser-operations-suite --task multi-step-demo-form

This command is intentionally documented before the real harness exists so the artifact contract is visible.

Payload preview

Split

holdout

Difficulty

Hard

Evidence fields

Model runs

Screenshot

Pending real browser or app screenshot artifact.

Model run evidence

Trace-level comparison

This is the inspection layer that keeps benchmark scores honest: each model class gets an outcome, cost proxy, latency, and reviewer note.

Frontier API provider

Frontier reasoning model

Accepted with review

Score: 69

Cost units: 5.1

Latency: 6980 ms

Recovered from one validation error and captured final state; slow but reliable.

Answer excerpt

Submitted the form after correcting one validation error and captured final confirmation state.

Failure reason

Slow but reliable.

fill form, handle validation, capture confirmation

Fast hosted API provider

Fast mid-tier model

Partial

Score: 58

Cost units: 2.2

Latency: 3840 ms

Completed visible fields but missed a hidden required dropdown.

Answer excerpt

Filled the visible fields and clicked submit.

Failure reason

Missed a hidden required dropdown.

fill visible fields, click submit

Self-hosted/open-weight stack

Open-weight local model

Rejected

Score: 41

Cost units: 1.5

Latency: 5920 ms

Failed to recover after validation error.

Answer excerpt

The form could not be submitted due to validation errors.

Failure reason

Failed to recover after validation error.

fill visible fields, click submit

Low-cost routing endpoint

Small routing model

Rejected

Score: 24

Cost units: 0.4

Latency: 1920 ms

Not suitable for stateful browser execution.

Answer excerpt

This is a browser form task.

Failure reason

Not suitable for stateful browser execution.

classify task
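The four runs above share the same fields, so they can be held in one record type for side-by-side inspection. This is a sketch only: the `ModelRun` class and its field names are hypothetical and do not reflect the harness's export schema; the values are copied from this page.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ModelRun:
    """Hypothetical record for one row of the trace-level comparison."""
    model_class: str
    outcome: str
    score: int
    cost_units: float
    latency_ms: int
    reviewer_note: str


RUNS: List[ModelRun] = [
    ModelRun("Frontier reasoning model", "Accepted with review", 69, 5.1, 6980,
             "Recovered from one validation error and captured final state; slow but reliable."),
    ModelRun("Fast mid-tier model", "Partial", 58, 2.2, 3840,
             "Completed visible fields but missed a hidden required dropdown."),
    ModelRun("Open-weight local model", "Rejected", 41, 1.5, 5920,
             "Failed to recover after validation error."),
    ModelRun("Small routing model", "Rejected", 24, 0.4, 1920,
             "Not suitable for stateful browser execution."),
]


def accepted(runs: List[ModelRun]) -> List[ModelRun]:
    """Keep only runs whose outcome starts with 'Accepted'."""
    return [run for run in runs if run.outcome.startswith("Accepted")]


# List every run with its outcome and reviewer note for manual review.
for run in RUNS:
    print(f"{run.model_class}: {run.outcome} ({run.score}) - {run.reviewer_note}")
```

Keeping the reviewer note next to the score makes it harder to read a number without the failure reason behind it.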

Why this trace matters

Aggregate scores are useful only when reviewers can inspect the task packet, expected evidence, and the exact failure mode. This page is the pattern for publishing public samples while keeping harder holdout tasks private.
