# Browser Operations Suite Buyer Brief

Generated: 2026-05-16T00:00:00+05:30
Dataset: 0.1.0
Status: Research

## Executive Readout

- Average score: 51
- Current leader: Frontier reasoning model (71)
- Sample size: 16 tasks
- Public/private split: 6 public / 10 private holdout tasks
- Inspectable trace packets: 4

Assessment: usable for constrained workflows with fallback routing.

## Model Comparison

| Model class | Provider | Score | Pass rate | Recovery | Cost index | P95 latency |
| --- | --- | --- | --- | --- | --- | --- |
| Frontier reasoning model | Frontier API provider | 71 | 74 | 69 | 49 | 5764ms |
| Fast mid-tier model | Fast hosted API provider | 60 | 62 | 55 | 82 | 4874ms |
| Open-weight local model | Self-hosted/open-weight stack | 43 | 46 | 39 | 76 | 6000ms |
| Small routing model | Low-cost routing endpoint | 31 | 28 | 25 | 95 | 4732ms |
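
The headline figures in the Executive Readout can be checked against this table. The sketch below assumes the average is the unweighted mean of the four model-class scores, which is consistent with the reported 51 but is not confirmed by the brief; the function names are illustrative.

```python
# Scores copied from the Model Comparison table.
scores = {
    "Frontier reasoning model": 71,
    "Fast mid-tier model": 60,
    "Open-weight local model": 43,
    "Small routing model": 31,
}


def average_score(scores):
    """Unweighted mean across model classes (assumed aggregation)."""
    return sum(scores.values()) / len(scores)


def leader(scores):
    """Model class with the highest score."""
    return max(scores, key=scores.get)


mean = average_score(scores)
print(f"{mean:.2f} -> {round(mean)}")  # 51.25 -> 51
print(leader(scores))                  # Frontier reasoning model
```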

## Representative Trace Packets

| Task | Domain | Split | Difficulty | Top run | Score |
| --- | --- | --- | --- | --- | --- |
| Pricing page extraction | Extraction | public | Medium | Frontier reasoning model | 73 |
| Multi-step demo form | Form completion | holdout | Hard | Frontier reasoning model | 69 |
| Invoice portal download | Navigation | public | Medium | Frontier reasoning model | 72 |
| Competitor feature map | Extraction | holdout | Hard | Frontier reasoning model | 70 |

## Task Mix

| Category | Share |
| --- | --- |
| Navigation | 30% |
| Form completion | 22% |
| Extraction | 24% |
| Screenshot verification | 24% |

## Scoring Rubric

- Task success
- State verification
- Recovery quality
- Human handoff rate

## Leaderboard Controls

- Freshness: The public sample is refreshed monthly; the private holdout stays sealed until replacement tasks exist.
- Leakage policy: Do not use tasks sourced from public examples, vendor demos, or training-contaminated snippets without replacement variants.
- Repeat-run rule: Rerun any result that falls within five points of a leaderboard boundary, using at least three seeds.
- Retirement rule: Retire a task when frontier and mid-tier models cluster near the ceiling or when source material becomes widely circulated.
- Required provenance: traceId, createdAt, split, source, modelVersion, runSeed, reviewerNote, retirementStatus
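
The repeat-run rule above could be checked mechanically along these lines. This is a minimal sketch, not the harness's actual logic: it assumes "within five points" is inclusive and that boundaries are supplied as plain score thresholds. The boundary value of 70 and both function names are hypothetical.

```python
def needs_repeat(score, boundaries, margin=5):
    """True when score lies within `margin` points of any boundary (inclusive, by assumption)."""
    return any(abs(score - b) <= margin for b in boundaries)


def repeat_satisfied(run_seeds, min_seeds=3):
    """True when the recorded runs cover at least `min_seeds` distinct seeds."""
    return len(set(run_seeds)) >= min_seeds


# Hypothetical leaderboard boundary, chosen only for illustration.
boundaries = [70]

print(needs_repeat(71, boundaries))  # True: 71 is one point from 70
print(needs_repeat(60, boundaries))  # False: 60 is ten points from 70
print(repeat_satisfied([1, 2, 2]))   # False: only two distinct seeds
print(repeat_satisfied([1, 2, 3]))   # True
```

Counting distinct seeds rather than runs prevents a rerun with a duplicated seed from satisfying the rule.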

## Recommended Next Step

Use this brief to decide which workflow should become a real private eval run. Replace synthetic rows with harness exports that include raw prompts, exact model identifiers, latency samples, screenshots or tool logs, scorer identity, and replay links.
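
Before a harness export replaces a synthetic row, it may help to confirm that the row carries every required provenance field listed under Leaderboard Controls. The sketch below assumes exports arrive as flat dictionaries keyed by those field names; the export shape, the function name, and the sample values are all hypothetical.

```python
# Field names taken from the "Required provenance" control.
REQUIRED_PROVENANCE = (
    "traceId",
    "createdAt",
    "split",
    "source",
    "modelVersion",
    "runSeed",
    "reviewerNote",
    "retirementStatus",
)


def missing_provenance(record):
    """Return required fields that are absent, None, or an empty string."""
    return [f for f in REQUIRED_PROVENANCE if record.get(f) in (None, "")]


# Hypothetical export row with illustrative values only.
row = {
    "traceId": "trace-001",
    "createdAt": "2026-05-16T00:00:00+05:30",
    "split": "public",
    "source": "internal",
    "modelVersion": "example-model-v1",
    "runSeed": 0,
    "reviewerNote": "",
    "retirementStatus": "active",
}

print(missing_provenance(row))  # ['reviewerNote']
```

A seed of `0` counts as present here, since only `None` and empty strings are treated as missing.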

Contact: sanjay@edxperimentallabs.com or saujas@edxperimentallabs.com
