Browser Agent Evaluation Kit

Browser-agent tasks for navigation, form filling, extraction, screenshot QA, and resilient recovery from UI changes.

Live Studio demo

Evaluate whether a browser agent actually reached the target state: navigation, extraction, recovery, screenshot evidence, and handoff risk.

Browser scenario

Strict state proof

Require confirmation text, URL/state match, and screenshot evidence before accepting completion.
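
As an illustration of the strict state proof rule, the three requirements can be expressed as a small acceptance check. This is a minimal sketch, not the kit's actual implementation: `RunEvidence`, `ExpectedState`, and `strict_state_proof` are hypothetical names, and the matching rules (case-insensitive text search, URL prefix match) are assumptions chosen for simplicity.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RunEvidence:
    """Evidence returned at the end of a browser run (hypothetical shape)."""
    final_url: str
    page_text: str
    screenshot_path: Optional[str] = None


@dataclass
class ExpectedState:
    """What the scenario author expects to see (hypothetical shape)."""
    url_prefix: str
    confirmation_text: str


def strict_state_proof(evidence: RunEvidence, expected: ExpectedState) -> list[str]:
    """Return the failed checks; an empty list means completion is accepted."""
    failures = []
    # Assumption: confirmation text is matched case-insensitively.
    if expected.confirmation_text.lower() not in evidence.page_text.lower():
        failures.append("confirmation text missing")
    # Assumption: URL/state match is approximated by a prefix comparison.
    if not evidence.final_url.startswith(expected.url_prefix):
        failures.append("URL/state mismatch")
    if not evidence.screenshot_path:
        failures.append("screenshot evidence missing")
    return failures
```

Returning the list of failed checks, rather than a single boolean, keeps the reason for a rejection visible to a reviewer.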

Extraction / Medium

Pricing page extraction

Navigate a provider pricing page, extract input/output token prices, and return source-linked structured data.
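
One possible shape for the source-linked structured data is sketched below. The field names, the per-million-token USD unit, and the `to_structured_output` helper are all assumptions made for illustration; the page does not specify a schema. The cached-input price is kept as its own optional field so it cannot be silently merged with the regular input price.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class TokenPricing:
    """One extracted pricing row (hypothetical schema, prices in USD per 1M tokens)."""
    model_name: str
    input_per_million_usd: float
    output_per_million_usd: float
    cached_input_per_million_usd: Optional[float]  # None when the page lists no cached price
    source_url: str


def to_structured_output(rows: list[TokenPricing]) -> str:
    """Serialize rows to JSON, rejecting any row without a usable source link."""
    for row in rows:
        if not row.source_url.startswith(("http://", "https://")):
            raise ValueError(f"row for {row.model_name!r} has no valid source URL")
    return json.dumps([asdict(row) for row in rows], indent=2)


# Placeholder values only; these are not real provider prices.
example = TokenPricing("example-model", 1.0, 2.0, 0.5, "https://example.com/pricing")
print(to_structured_output([example]))
```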

navigation success

evidence extraction

state verification

State proof

Recovery

Screenshot

Handoff risk

Browser run comparison

Common failure modes: browser agents extract stale snippets or fail to distinguish input, cached-input, and output pricing.

Browser Operations Suite

Frontier reasoning model (score 73): Accepted, 6150 ms. Minor formatting cleanup needed.

Fast mid-tier model (score 62): Accepted with review, 3520 ms. Missed cached-input distinction.

Open-weight local model (score 44): Partial, 5480 ms. Mixed plan names with API prices.

Small routing model (score 28): Rejected, 1760 ms. Could classify the page type only.

Evidence required

pricing table

source URL

structured output

Top trace proof

Returned input, cached-input, and output prices with the source URL preserved for review.

open pricing page, extract pricing table, normalize fields

Deployment readout

Strict proof is enabled: this browser workflow should remain in supervised mode until confirmation-state checks and screenshot capture are stable.
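
The readout does not define "stable", so the following is only one way such a gate might look. It assumes stability means a minimum pass rate over a recent window of runs; the function name, the window size, and the threshold are hypothetical values, not settings from the kit.

```python
def deployment_mode(
    confirmation_checks: list[bool],
    screenshot_captures: list[bool],
    window: int = 20,            # assumed: number of recent runs to inspect
    min_pass_rate: float = 0.95,  # assumed: pass rate that counts as stable
) -> str:
    """Return "supervised" unless both signals are stable over the recent window."""

    def stable(results: list[bool]) -> bool:
        recent = results[-window:]
        # Too little history is treated as unstable rather than as a pass.
        if len(recent) < window:
            return False
        return sum(recent) / len(recent) >= min_pass_rate

    if stable(confirmation_checks) and stable(screenshot_captures):
        return "unsupervised"
    return "supervised"
```

Treating short histories as unstable keeps a new workflow in supervised mode by default.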

How it works

Each Studio surface is designed as a practical operating loop: capture the buyer problem, gather measured evidence, and return a decision artifact the buyer can act on.

Current demo state

Research kit with browser-operation scoring; real authenticated workflow packs can be built for consulting clients.

Define realistic browser jobs with target URL, expected state, blocked shortcuts, and screenshot evidence.

Run agents through navigation, extraction, form fill, confirmation, and recovery scenarios.

Capture DOM state, screenshot checks, console warnings, timeout behavior, and human handoff points.

Publish a robustness report by site pattern rather than a single aggregate browser score.
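
The four steps above can be sketched as a job definition plus a report grouped by site pattern. All class, field, and function names here are hypothetical, and the metrics (run count, pass rate, handoff count) are assumed examples of what a per-pattern report might contain.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class BrowserJob:
    """A browser job definition (hypothetical fields mirroring the steps above)."""
    name: str
    site_pattern: str  # e.g. "pricing table", "multi-step form"
    target_url: str
    expected_state: str
    blocked_shortcuts: list[str] = field(default_factory=list)
    requires_screenshot: bool = True


@dataclass
class RunResult:
    """Outcome of one agent run against one job."""
    job: BrowserJob
    passed: bool
    handoff_needed: bool = False


def robustness_by_pattern(results: list[RunResult]) -> dict[str, dict]:
    """Group results by site pattern instead of computing one aggregate score."""
    grouped: dict[str, list[RunResult]] = defaultdict(list)
    for result in results:
        grouped[result.job.site_pattern].append(result)
    return {
        pattern: {
            "runs": len(runs),
            "pass_rate": sum(r.passed for r in runs) / len(runs),
            "handoffs": sum(r.handoff_needed for r in runs),
        }
        for pattern, runs in grouped.items()
    }
```

Grouping by pattern lets a weak result on one kind of page stay visible instead of being averaged away.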

Buyer questions

These are the questions the product needs to answer before someone deploys, buys, or scales the system.

Can the agent prove the page reached the right state?

What happens when a modal or validation error appears?

Which workflows are stable enough for automation?

Where should a human remain in the loop?

Deliverables

What a buyer gets

Browser task report

Screenshot evidence

Selector fragility map

Handoff recommendation

Connected evidence

Read the benchmark trail

Browser Operations Suite

Agent benchmarks article

Studio request form

Studio packet

Buyer-ready demo packet.

This generated packet gives Sanjay and Saujas a consistent follow-up artifact for demos, consulting calls, and product conversations.

Markdown packet: Browser Agent Evaluation Kit brief for client follow-up.

Structured JSON: Machine-readable fields for future CRM or Studio workflows.

Next build step

Turn this Studio surface from a populated product brief into a live demo by wiring real run data, screenshots, and client-approved examples into the same page.
