
Agent Benchmark Explorer

A structured benchmark surface for measuring whether agents can plan, use tools, recover from errors, and complete useful work rather than only answer prompts.

Live Studio demo

Agent trace explorer

Inspect benchmark traces by task: expected evidence, common failure mode, model outcomes, latency, score, and tool-call pattern.
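The fields in this view correspond to a per-run trace record. A minimal sketch of one possible shape, in Python; the class and field names are illustrative assumptions, not the explorer's actual schema:

    from dataclasses import dataclass, field

    @dataclass
    class ToolCall:
        # One tool invocation inside a run; names are illustrative.
        tool: str            # e.g. "parse invoice table"
        succeeded: bool
        latency_ms: int

    @dataclass
    class BenchmarkTrace:
        # One model's run on one task, mirroring the columns listed above.
        task_id: str
        model: str
        expected_evidence: list[str]   # evidence the reviewer expects to see cited
        evidence_cited: list[str]      # evidence the model actually cited
        outcome: str                   # "accepted", "accepted with review", "partial", "rejected"
        failure_mode: str              # common failure mode observed, if any
        score: int                     # reviewer score, 0-100
        latency_ms: int
        tool_calls: list[ToolCall] = field(default_factory=list)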

Finance / Public / Medium

GST invoice discrepancy explanation

Given an invoice, purchase order, and short email thread, identify the GST mismatch and draft a vendor-facing explanation with cited evidence.

Expected evidence

invoice tax line
purchase order rate
vendor email exception

Top model

Frontier reasoning model

Accepted runs

2/4

Avg latency

4,155 ms

Model trace ranking

Models often explain the mismatch correctly but cite the wrong source row or miss the exception email.

Indian Enterprise Workflow Suite

Frontier reasoning model (score 91)

Accepted

Correct mismatch, cited all three evidence points, and proposed a clean vendor follow-up.

Fast mid-tier model (score 82)

Accepted with review

Correct calculation and tone; reviewer had to repair one citation label.

Open-weight local model (score 61)

Partial

Found a mismatch but confused CGST/SGST allocation and needed human correction.

Small routing model (score 38)

Rejected

Useful triage signal only; generated an unsupported tax explanation.

Top answer excerpt

The invoice uses a GST rate that does not match the purchase order, and the exception email explains why the vendor needs to revise the tax line before payment.

Failure reason to watch

Minor formatting cleanup only.

Tool-call pattern

parse invoice table → compare PO tax rate → draft vendor note
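A pattern like this can be derived directly from the ordered tool calls in a trace. A minimal sketch in Python, using only the tool names shown above:

    def tool_call_pattern(tool_names: list[str]) -> str:
        # Join the recorded tool calls, in order, into the display string used above.
        return " → ".join(tool_names)

    print(tool_call_pattern(["parse invoice table", "compare PO tax rate", "draft vendor note"]))
    # parse invoice table → compare PO tax rate → draft vendor note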

How it works

Each Studio surface is designed as a practical operating loop: capture the buyer problem, produce measured evidence from benchmark runs, and return a decision artifact that can be acted on.

Current demo state

Preview dashboard with synthetic benchmark traces; real trace ingestion is the next build step.

1. Select a workflow track: support, browser operations, coding maintenance, document work, or internal tool use.

2. Load task packets with expected outcomes, allowed tools, private holdouts, and reviewer rubrics.

3. Run agent traces with tool calls, recovery attempts, timing, retries, and cost captured automatically (see the sketch after this list).

4. Publish a buyer-readable scorecard with strengths, failure modes, and deployment recommendation.
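A minimal sketch of steps 2 through 4 in Python; the TaskPacket fields, the agent_fn callable, and the scorecard keys are illustrative assumptions rather than the product's actual interfaces:

    from dataclasses import dataclass
    import time

    @dataclass
    class TaskPacket:
        # Step 2: what the agent is given, plus what reviewers hold back.
        task_id: str
        expected_outcome: str
        allowed_tools: list[str]
        holdout_checks: list[str]   # private holdouts, never shown to the agent
        rubric: list[str]           # reviewer rubric items

    def run_trace(agent_fn, packet: TaskPacket) -> dict:
        # Step 3: run the agent once and capture timing and cost; recovery attempts
        # and retries are assumed to be recorded inside the returned tool_calls.
        start = time.monotonic()
        tool_calls, answer, cost_usd = agent_fn(packet.expected_outcome, packet.allowed_tools)
        return {
            "task_id": packet.task_id,
            "tool_calls": tool_calls,
            "answer": answer,
            "latency_ms": int((time.monotonic() - start) * 1000),
            "cost_usd": cost_usd,
        }

    def scorecard(traces: list[dict], accepted_task_ids: set[str]) -> dict:
        # Step 4: fold reviewed traces into a buyer-readable summary.
        accepted = [t for t in traces if t["task_id"] in accepted_task_ids]
        return {
            "accepted_runs": f"{len(accepted)}/{len(traces)}",
            "avg_latency_ms": sum(t["latency_ms"] for t in traces) // max(len(traces), 1),
            "cost_per_accepted_usd": round(
                sum(t["cost_usd"] for t in traces) / max(len(accepted), 1), 2),
        }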

Buyer questions

These are the questions the product needs to answer before someone deploys, buys, or scales the system; the sketch after the list illustrates what the first two look like around a single tool call.

Can the agent recover after a bad tool call?

Does it verify state before claiming completion?

How much does each accepted workflow actually cost?

Which failures should trigger human handoff?
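The first two questions are behaviors the harness has to observe before a run can be accepted. A minimal sketch of recovery plus pre-completion verification, assuming Python and hypothetical tool_fn / verify_fn callables (not a product API):

    import time

    def call_with_recovery(tool_fn, args: dict, verify_fn, max_attempts: int = 3):
        # Recover after a bad tool call, and verify state before claiming completion.
        attempts = []
        for attempt in range(1, max_attempts + 1):
            try:
                result = tool_fn(**args)
            except Exception as exc:
                # A failed call is recorded and retried with simple backoff.
                attempts.append({"attempt": attempt, "error": str(exc)})
                time.sleep(0.5 * attempt)
                continue
            if verify_fn(result):
                # Completion is only claimed once the post-condition actually holds.
                attempts.append({"attempt": attempt, "verified": True})
                return result, attempts
            attempts.append({"attempt": attempt, "verified": False})
        # Unrecovered failures are the natural trigger for human handoff.
        raise RuntimeError(f"human handoff needed after {max_attempts} attempts: {attempts}")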

Deliverables

What a buyer gets

Agent scorecard
Trace review table
Failure taxonomy
Cost per resolved task
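Of these, cost per resolved task is the easiest to misread; the definition assumed here is spend across every run, accepted or not, divided by the runs that were accepted. A minimal sketch with illustrative numbers (the per-run costs below are made up for the example; only the 2/4 acceptance figure comes from the trace above):

    def cost_per_resolved_task(run_costs_usd: list[float], accepted_runs: int) -> float:
        # Total spend on all runs, including rejected ones, divided by accepted runs.
        return sum(run_costs_usd) / max(accepted_runs, 1)

    # Illustrative: four runs costing $0.40 in total, two accepted (the 2/4 above).
    print(cost_per_resolved_task([0.12, 0.08, 0.10, 0.10], accepted_runs=2))  # 0.20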

Studio packet

Buyer-ready demo packet.

This generated packet gives Sanjay and Saujas a consistent follow-up artifact for demos, consulting calls, and product conversations.

Next build step

Turn this Studio surface from a populated product brief into a live demo by wiring real run data, screenshots, and client-approved examples into the same page.