
Agent Benchmark Explorer

A structured benchmark surface for measuring whether agents can plan, use tools, recover from errors, and complete useful work rather than only answer prompts.

Live Studio demo

Agent trace explorer

Inspect benchmark traces by task: expected evidence, common failure mode, model outcomes, latency, score, and tool-call pattern.
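The fields in this view correspond to a per-run trace record. A minimal sketch of one possible shape, in Python; the class and field names are illustrative assumptions, not the explorer's actual schema:

    from dataclasses import dataclass, field

    @dataclass
    class ToolCall:
        # One tool invocation inside a run; names are illustrative.
        tool: str            # e.g. "parse invoice table"
        succeeded: bool
        latency_ms: int

    @dataclass
    class BenchmarkTrace:
        # One model's run on one task, mirroring the columns listed above.
        task_id: str
        model: str
        expected_evidence: list[str]   # evidence the reviewer expects to see cited
        evidence_cited: list[str]      # evidence the model actually cited
        outcome: str                   # "accepted", "accepted with review", "partial", "rejected"
        failure_mode: str              # common failure mode observed, if any
        score: int                     # reviewer score, 0-100
        latency_ms: int
        tool_calls: list[ToolCall] = field(default_factory=list)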

Finance / Public / Medium

GST invoice discrepancy explanation

Given an invoice, purchase order, and short email thread, identify the GST mismatch and draft a vendor-facing explanation with cited evidence.

Expected evidence

invoice tax line
purchase order rate
vendor email exception

Top model

Frontier reasoning model

Accepted runs

2/4

Avg latency

4,155 ms

Model trace ranking

Models often explain the mismatch correctly but cite the wrong source row or miss the exception email.

Indian Enterprise Workflow Suite

Frontier reasoning model (score 91)

Accepted

Correct mismatch, cited all three evidence points, and proposed a clean vendor follow-up.

Fast mid-tier model (score 82)

Accepted with review

Correct calculation and tone; reviewer had to repair one citation label.

Open-weight local model (score 61)

Partial

Found a mismatch but confused CGST/SGST allocation and needed human correction.

Small routing model (score 38)

Rejected

Useful triage signal only; generated an unsupported tax explanation.

Top answer excerpt

The invoice uses a GST rate that does not match the purchase order, and the exception email explains why the vendor needs to revise the tax line before payment.

Failure reason to watch

Minor formatting cleanup only.

Tool-call pattern

parse invoice table → compare PO tax rate → draft vendor note
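A pattern like this can be derived directly from the ordered tool calls in a trace. A minimal sketch in Python, using only the tool names shown above:

    def tool_call_pattern(tool_names: list[str]) -> str:
        # Join the recorded tool calls, in order, into the display string used above.
        return " → ".join(tool_names)

    print(tool_call_pattern(["parse invoice table", "compare PO tax rate", "draft vendor note"]))
    # parse invoice table → compare PO tax rate → draft vendor note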

How it works

Each Studio surface is designed as a practical operating loop: capture the buyer problem, produce measured evidence from benchmark runs, and return a decision artifact that can be acted on.

Current demo state

Preview dashboard with synthetic benchmark traces; real trace ingestion is the next build step.

1. Select a workflow track: support, browser operations, coding maintenance, document work, or internal tool use.

2. Load task packets with expected outcomes, allowed tools, private holdouts, and reviewer rubrics.

3. Run agent traces with tool calls, recovery attempts, timing, retries, and cost captured automatically (see the sketch after this list).

4. Publish a buyer-readable scorecard with strengths, failure modes, and deployment recommendation.
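A minimal sketch of steps 2 through 4 in Python; the TaskPacket fields, the agent_fn callable, and the scorecard keys are illustrative assumptions rather than the product's actual interfaces:

    from dataclasses import dataclass
    import time

    @dataclass
    class TaskPacket:
        # Step 2: what the agent is given, plus what reviewers hold back.
        task_id: str
        expected_outcome: str
        allowed_tools: list[str]
        holdout_checks: list[str]   # private holdouts, never shown to the agent
        rubric: list[str]           # reviewer rubric items

    def run_trace(agent_fn, packet: TaskPacket) -> dict:
        # Step 3: run the agent once and capture timing and cost; recovery attempts
        # and retries are assumed to be recorded inside the returned tool_calls.
        start = time.monotonic()
        tool_calls, answer, cost_usd = agent_fn(packet.expected_outcome, packet.allowed_tools)
        return {
            "task_id": packet.task_id,
            "tool_calls": tool_calls,
            "answer": answer,
            "latency_ms": int((time.monotonic() - start) * 1000),
            "cost_usd": cost_usd,
        }

    def scorecard(traces: list[dict], accepted_task_ids: set[str]) -> dict:
        # Step 4: fold reviewed traces into a buyer-readable summary.
        accepted = [t for t in traces if t["task_id"] in accepted_task_ids]
        return {
            "accepted_runs": f"{len(accepted)}/{len(traces)}",
            "avg_latency_ms": sum(t["latency_ms"] for t in traces) // max(len(traces), 1),
            "cost_per_accepted_usd": round(
                sum(t["cost_usd"] for t in traces) / max(len(accepted), 1), 2),
        }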

Buyer questions

These are the questions the product needs to answer before someone deploys, buys, or scales the system; the sketch after the list illustrates what the first two look like around a single tool call.

Can the agent recover after a bad tool call?

Does it verify state before claiming completion?

How much does each accepted workflow actually cost?

Which failures should trigger human handoff?
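The first two questions are behaviors the harness has to observe before a run can be accepted. A minimal sketch of recovery plus pre-completion verification, assuming Python and hypothetical tool_fn / verify_fn callables (not a product API):

    import time

    def call_with_recovery(tool_fn, args: dict, verify_fn, max_attempts: int = 3):
        # Recover after a bad tool call, and verify state before claiming completion.
        attempts = []
        for attempt in range(1, max_attempts + 1):
            try:
                result = tool_fn(**args)
            except Exception as exc:
                # A failed call is recorded and retried with simple backoff.
                attempts.append({"attempt": attempt, "error": str(exc)})
                time.sleep(0.5 * attempt)
                continue
            if verify_fn(result):
                # Completion is only claimed once the post-condition actually holds.
                attempts.append({"attempt": attempt, "verified": True})
                return result, attempts
            attempts.append({"attempt": attempt, "verified": False})
        # Unrecovered failures are the natural trigger for human handoff.
        raise RuntimeError(f"human handoff needed after {max_attempts} attempts: {attempts}")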

Deliverables

What a buyer gets

Agent scorecard
Trace review table
Failure taxonomy
Cost per resolved task
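Of these, cost per resolved task is the easiest to misread; the definition assumed here is spend across every run, accepted or not, divided by the runs that were accepted. A minimal sketch with illustrative numbers (the per-run costs below are made up for the example; only the 2/4 acceptance figure comes from the trace above):

    def cost_per_resolved_task(run_costs_usd: list[float], accepted_runs: int) -> float:
        # Total spend on all runs, including rejected ones, divided by accepted runs.
        return sum(run_costs_usd) / max(accepted_runs, 1)

    # Illustrative: four runs costing $0.40 in total, two accepted (the 2/4 above).
    print(cost_per_resolved_task([0.12, 0.08, 0.10, 0.10], accepted_runs=2))  # 0.20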

Studio packet

Buyer-ready demo packet.

This generated packet gives Sanjay and Saujas a consistent follow-up artifact for demos, consulting calls, and product conversations.

Next build step

Turn this Studio surface from a populated product brief into a live demo by wiring real run data, screenshots, and client-approved examples into the same page.