Can the agent recover after a bad tool call?
Agent Benchmark Explorer
A structured benchmark surface for measuring whether agents can plan, use tools, recover from errors, and complete useful work rather than only answer prompts.
Live Studio demo
Agent trace explorer
Inspect benchmark traces by task: expected evidence, common failure mode, model outcomes, latency, score, and tool-call pattern.
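For illustration, one explorer row might be shaped like the TypeScript sketch below; every field name is an assumption, not the product's actual schema.

```ts
// Hypothetical shape of one explorer row; field names are
// illustrative assumptions, not the product's actual schema.
type TraceRow = {
  taskId: string;                               // e.g. the GST discrepancy task
  expectedEvidence: string[];                   // citations the rubric expects
  commonFailureMode: string;                    // the failure reason to watch
  outcome: "accepted" | "accepted_with_review" | "partial" | "rejected";
  latencyMs: number;                            // end-to-end wall-clock time
  score: number;                                // reviewer rubric score
  toolCalls: { tool: string; ok: boolean }[];   // ordered tool-call pattern
};
```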
Finance / Public / Medium
GST invoice discrepancy explanation
Given an invoice, purchase order, and short email thread, identify the GST mismatch and draft a vendor-facing explanation with cited evidence.
Top model
Frontier reasoning model
Accepted runs
2/4
Avg latency
4,155 ms
Model trace ranking
Models often explain the mismatch correctly but cite the wrong source row or miss the exception email.
Indian Enterprise Workflow Suite
Accepted
Identified the mismatch correctly, cited all three evidence points, and proposed a clean vendor follow-up.
Accepted with review
Correct calculation and tone; reviewer had to repair one citation label.
Partial
Found a mismatch but confused CGST/SGST allocation and needed human correction.
Rejected
Useful triage signal only; generated an unsupported tax explanation.
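As a rough sketch, the four grades above could be modeled as a small taxonomy with a handoff rule; both the names and the rule are assumptions, not the product's grading code.

```ts
// Illustrative grade taxonomy; the names and the handoff rule are assumptions.
type RunGrade = "accepted" | "accepted_with_review" | "partial" | "rejected";

// Hypothetical routing rule: anything graded below "accepted with review"
// goes to a human before the artifact ships.
function needsHumanHandoff(grade: RunGrade): boolean {
  return grade === "partial" || grade === "rejected";
}
```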
Top answer excerpt
The invoice uses a GST rate that does not match the purchase order, and the exception email explains why the vendor needs to revise the tax line before payment.
Failure reason to watch
The top trace needed only minor formatting cleanup.
Tool-call pattern
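For context, a recorded tool-call pattern for this task might look like the sequence below, including one recovery after a failed call; the tool names are hypothetical placeholders, not the benchmark's actual tool set.

```ts
// Hypothetical tool-call pattern for the GST discrepancy task.
// Tool names are illustrative; a real trace would use the task's allowed tools.
const toolCallPattern = [
  { tool: "fetch_invoice", ok: true },
  { tool: "fetch_purchase_order", ok: true },
  { tool: "search_email_thread", ok: false }, // first call misses the exception email
  { tool: "search_email_thread", ok: true },  // recovery attempt succeeds on retry
  { tool: "draft_vendor_reply", ok: true },
];
```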
How it works
Each Studio surface is designed as a practical operating loop: capture the buyer's problem, run the work and measure the evidence, and return a decision artifact that can be acted on.
Current demo state
Preview dashboard with synthetic benchmark traces; real trace ingestion is the next build step.
Select a workflow track: support, browser operations, coding maintenance, document work, or internal tool use.
Load task packets with expected outcomes, allowed tools, private holdouts, and reviewer rubrics.
Run agent traces with tool calls, recovery attempts, timing, retries, and cost captured automatically; a schema sketch follows these steps.
Publish a buyer-readable scorecard with strengths, failure modes, and deployment recommendation.
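A minimal sketch of what a task packet and a captured run record might contain, following the steps above; every field name here is an assumption about the eventual schema.

```ts
// Illustrative task packet: what the "load task packets" step provides.
// Every field name is an assumption about the eventual schema.
type TaskPacket = {
  track: "support" | "browser_ops" | "coding" | "documents" | "internal_tools";
  expectedOutcome: string;    // what an accepted run must produce
  allowedTools: string[];     // tools the agent may call
  privateHoldout: string[];   // evidence withheld from the agent, used for grading
  reviewerRubric: string;     // criteria reviewers score against
};

// Illustrative run record: what the trace run captures automatically.
type RunRecord = {
  taskId: string;
  toolCalls: { tool: string; ok: boolean; retries: number }[];
  recoveryAttempts: number;   // retries made after a failed tool call
  latencyMs: number;
  costUsd: number;
};
```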
Buyer questions
These are the questions the product needs to answer before someone deploys, buys, or scales the system.
Does it verify state before claiming completion?
How much does each accepted workflow actually cost? (A worked sketch follows these questions.)
Which failures should trigger human handoff?
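Taking the cost question as an example: cost per accepted workflow divides total spend across all attempts by accepted runs only, since failed runs still cost money. A rough sketch with assumed numbers:

```ts
// Hypothetical cost calculation: rejected and partial runs still cost money,
// so cost per accepted workflow divides total spend by accepted runs only.
function costPerAcceptedWorkflow(
  runs: { costUsd: number; accepted: boolean }[],
): number {
  const totalCost = runs.reduce((sum, r) => sum + r.costUsd, 0);
  const acceptedCount = runs.filter((r) => r.accepted).length;
  return acceptedCount > 0 ? totalCost / acceptedCount : Infinity;
}
```

With the 2/4 acceptance rate shown above and an assumed $0.40 per run, total spend is $1.60 and cost per accepted workflow is $0.80, double the naive per-run figure.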
Deliverables
What a buyer gets
Connected evidence
Read the benchmark trail
Studio packet
Buyer-ready demo packet.
This generated packet gives Sanjay and Saujas a consistent follow-up artifact for demos, consulting calls, and product conversations.
Next build step
Turn this Studio surface from a populated product brief into a live demo by wiring real run data, screenshots, and client-approved examples into the same page.