Browser Agent Evaluation Kit

Browser-agent tasks for navigation, form filling, extraction, screenshot QA, and resilient recovery from UI changes.

Live Studio demo

Evaluate whether a browser agent actually reached the target state: navigation, extraction, recovery, screenshot evidence, and handoff risk.

Extraction / Medium

Pricing page extraction

Navigate a provider pricing page, extract input/output token prices, and return source-linked structured data.

navigation success
evidence extraction
state verification

State proof: 67
Recovery: 50
Screenshot: 74
Handoff risk: 41

Browser run comparison

Browser agents extract stale snippets or fail to distinguish input, cached input, and output pricing.

Browser Operations Suite

Frontier reasoning model: 73. Accepted / 6150 ms. Minor formatting cleanup needed.
Fast mid-tier model: 62. Accepted with review / 3520 ms. Missed cached-input distinction.
Open-weight local model: 44. Partial / 5480 ms. Mixed plan names with API prices.
Small routing model: 28. Rejected / 1760 ms. Could classify the page type only.

Evidence required

pricing table
source URL
structured output
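
A minimal sketch of how the three evidence items above might be checked on a run record. The dictionary keys and the `missing_evidence` helper are hypothetical names chosen for illustration; the kit's actual record format is not shown on this page.

```python
# Evidence items named in the kit. The key names are assumptions for this sketch.
REQUIRED_EVIDENCE = ("pricing_table", "source_url", "structured_output")


def missing_evidence(run: dict) -> list:
    """Return the required evidence keys that are absent or empty in a run record."""
    return [key for key in REQUIRED_EVIDENCE if not run.get(key)]


# Hypothetical run record: the table and URL were captured, the structured output was not.
run = {
    "pricing_table": "<table>...</table>",
    "source_url": "https://example.com/pricing",
    "structured_output": None,
}

print(missing_evidence(run))
```

A run would only count as evidenced when the returned list is empty; anything else names exactly what a reviewer still needs.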

Top trace proof

Returned input, cached-input, and output prices with the source URL preserved for review.

open pricing page
extract pricing table
normalize fields
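
One way the "normalize fields" step and the source-linked record could look, as a sketch only. The `TokenPrice` fields, the per-million-token unit, and the sample prices are all assumptions for illustration, not values from a real provider page.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TokenPrice:
    """Hypothetical structured record; prices assumed to be USD per million tokens."""
    model: str
    input_price: float
    cached_input_price: Optional[float]  # None when the page lists no cached-input price
    output_price: float
    source_url: str


def normalize_price(text: str) -> float:
    """Turn a scraped price string such as ' $1,250.00 ' into a float."""
    return float(text.strip().replace("$", "").replace(",", ""))


def build_record(model: str, raw: dict, source_url: str) -> TokenPrice:
    """Keep input, cached-input, and output prices in separate fields."""
    cached = raw.get("cached_input")
    return TokenPrice(
        model=model,
        input_price=normalize_price(raw["input"]),
        cached_input_price=normalize_price(cached) if cached else None,
        output_price=normalize_price(raw["output"]),
        source_url=source_url,
    )


# Made-up cell values standing in for an extracted pricing table row.
raw_row = {"input": "$2.00", "cached_input": "$0.50", "output": " $8.00 "}
record = build_record("example-model", raw_row, "https://example.com/pricing")
print(record)
```

Keeping cached-input as its own optional field makes the distinction that weaker runs missed explicit in the output schema.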

Deployment readout

Strict proof is enabled: this browser workflow should remain in supervised mode until confirmation-state checks and screenshot capture are stable.
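
The readout above can be read as a gate on two signals. The sketch below assumes "stable" means a minimum pass rate over a minimum number of recent runs; the thresholds, function names, and the "autonomous" label are hypothetical, since the page only names supervised mode.

```python
def is_stable(results: list, min_runs: int = 20, min_pass_rate: float = 0.95) -> bool:
    """Treat a check as stable only with enough runs and a high enough pass rate.

    Both thresholds are illustrative assumptions, not values from the kit.
    """
    if len(results) < min_runs:
        return False
    return sum(results) / len(results) >= min_pass_rate


def deployment_mode(confirmation_checks: list, screenshot_captures: list) -> str:
    """Stay supervised unless both evidence streams are stable."""
    if is_stable(confirmation_checks) and is_stable(screenshot_captures):
        return "autonomous"
    return "supervised"


# Hypothetical history: confirmation checks always pass, screenshots fail a quarter of the time.
confirmations = [True] * 20
screenshots = [True, True, True, False] * 5

print(deployment_mode(confirmations, screenshots))
```

Requiring a minimum run count prevents a short lucky streak from lifting supervision.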

How it works

Each Studio surface is designed as a practical operating loop: capture the buyer problem, gather measured evidence, and return a decision artifact that can be acted on.

Current demo state

Research kit with browser-operation scoring; real authenticated workflow packs can be built for consulting clients.

1. Define realistic browser jobs with target URL, expected state, blocked shortcuts, and screenshot evidence.

2. Run agents through navigation, extraction, form fill, confirmation, and recovery scenarios.

3. Capture DOM state, screenshot checks, console warnings, timeout behavior, and human handoff points.

4. Publish a robustness report by site pattern rather than a single aggregate browser score.
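
The first and last steps above can be sketched as a job definition plus a report grouped by site pattern. The `BrowserJob` fields mirror the attributes listed in step 1; the `site_pattern` labels, the scores, and the function name are made up for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from statistics import mean


@dataclass
class BrowserJob:
    """Hypothetical job definition covering the attributes named in step 1."""
    name: str
    target_url: str
    expected_state: str
    site_pattern: str  # assumed grouping label, e.g. "pricing-table" or "multi-step-form"
    blocked_shortcuts: list = field(default_factory=list)
    requires_screenshot: bool = True


def report_by_pattern(results: list) -> dict:
    """Average scores per site pattern instead of producing one aggregate number."""
    by_pattern = defaultdict(list)
    for job, score in results:
        by_pattern[job.site_pattern].append(score)
    return {pattern: round(mean(scores), 1) for pattern, scores in sorted(by_pattern.items())}


# Illustrative jobs and scores, not measured results.
pricing = BrowserJob(
    name="Pricing page extraction",
    target_url="https://example.com/pricing",
    expected_state="pricing table visible",
    site_pattern="pricing-table",
    blocked_shortcuts=["cached search snippet"],
)
signup = BrowserJob(
    name="Signup form fill",
    target_url="https://example.com/signup",
    expected_state="confirmation message shown",
    site_pattern="multi-step-form",
)

report = report_by_pattern([(pricing, 80.0), (pricing, 60.0), (signup, 40.0)])
print(report)
```

Grouping by pattern keeps a weak result on forms from being hidden behind a strong result on static pricing tables.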

Buyer questions

These are the questions the product needs to answer before someone deploys, buys, or scales the system.

Can the agent prove the page reached the right state?

What happens when a modal or validation error appears?

Which workflows are stable enough for automation?

Where should a human remain in the loop?

Deliverables

What a buyer gets

Browser task report
Screenshot evidence
Selector fragility map
Handoff recommendation

Studio packet

Buyer-ready demo packet.

This generated packet gives Sanjay and Saujas a consistent follow-up artifact for demos, consulting calls, and product conversations.

Next build step

Turn this Studio surface from a populated product brief into a live demo by wiring real run data, screenshots, and client-approved examples into the same page.