Studio / Designing v0.1
A workflow benchmark for Indian business tasks: finance, support, multilingual handoffs, document reasoning, sales ops, and evidence-grounded escalation.
Live Studio demo
Evaluate AI on workflows that resemble real Indian operations: GST evidence, bilingual tickets, policy decisions, and escalation boundaries.
Given an invoice, purchase order, and short email thread, identify the GST mismatch and draft a vendor-facing explanation with cited evidence.
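As a sketch of how such a task could be represented inside a workflow packet, the Python below uses a hypothetical schema; the `WorkflowTask` type and its field names are illustrative assumptions, not the suite's published format.

```python
# Hypothetical task schema; illustrative only, not the suite's real format.
from dataclasses import dataclass


@dataclass
class WorkflowTask:
    """One evidence-grounded task inside a workflow packet."""
    task_id: str
    prompt: str                    # instruction shown to the model
    artifacts: dict[str, str]      # artifact id -> document text
    expected_evidence: list[str]   # artifact ids a correct answer must cite
    should_escalate: bool          # whether a human queue is the right outcome


gst_task = WorkflowTask(
    task_id="gst-mismatch-001",
    prompt=("Identify the GST mismatch between the invoice and the purchase "
            "order, then draft a vendor-facing explanation citing evidence."),
    artifacts={
        "invoice": "...",          # invoice text with GST line items
        "purchase_order": "...",   # agreed rates and tax treatment
        "email_thread": "...",     # exception email requesting a revised tax line
    },
    expected_evidence=["invoice", "purchase_order", "email_thread"],
    should_escalate=False,
)
```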
Evidence coverage
94%
Checks whether the answer cites the right operational artifacts.
Escalation judgement
96%
Separates safe automation from tasks requiring a human queue.
Localization load
92%
Captures mixed-language, regional, and policy-language pressure.
Cost risk
70%
Lower is better; cost rises with retries, reviews, and high-cost routes.
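As one way to ground these numbers, the sketch below computes per-task scores under assumed definitions; the rubric here (set overlap for evidence coverage, exact match for escalation) is a guess for illustration, not the benchmark's published scoring.

```python
# Assumed metric definitions; the suite's real rubric may differ.

def evidence_coverage(cited: set[str], expected: set[str]) -> float:
    """Fraction of the expected artifacts that the answer actually cites."""
    return len(cited & expected) / len(expected) if expected else 1.0


def escalation_judgement(predicted: bool, should_escalate: bool) -> float:
    """1.0 when the model routes the task to the right queue, else 0.0."""
    return 1.0 if predicted == should_escalate else 0.0


# An answer citing the invoice and PO but missing the exception email
# covers 2 of 3 expected artifacts.
print(evidence_coverage({"invoice", "purchase_order"},
                        {"invoice", "purchase_order", "email_thread"}))  # ~0.67
```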
Task mix v0.1
Model comparison
Frontier reasoning model
91/100
Identified the mismatch correctly, cited all three evidence points, and proposed a clean vendor follow-up.
Fast mid-tier model
82/100
Correct calculation and tone; the reviewer had to repair one citation label.
Open-weight local model
61/100
Found a mismatch but confused CGST/SGST allocation and needed human correction.
Small routing model
38/100
Useful triage signal only; generated an unsupported tax explanation.
Expected evidence
Best answer excerpt
The invoice uses a GST rate that does not match the purchase order, and the exception email explains why the vendor needs to revise the tax line before payment.
Benchmark readiness
Holdout-pressure mode is on: publish aggregate scores, keep the exact private tasks sealed, and require cited evidence before deployment.
Failure mode to watch: Models often explain the mismatch correctly but cite the wrong source row or miss the exception email.
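A hedged sketch of a check for exactly this failure mode follows; the answer shape assumed here (a correctness flag plus a list of cited artifact ids) is an illustration, not the grader's actual interface.

```python
# Assumed answer shape: {"explanation_correct": bool, "cited": [artifact ids]}.

def flag_citation_drift(answer: dict, expected_evidence: set[str]) -> list[str]:
    """Warn when the explanation is right but the citations are not."""
    warnings: list[str] = []
    cited = set(answer.get("cited", []))
    if answer.get("explanation_correct") and cited != expected_evidence:
        missing = expected_evidence - cited
        spurious = cited - expected_evidence
        if missing:
            warnings.append(f"missing evidence: {sorted(missing)}")
        if spurious:
            warnings.append(f"unexpected source cited: {sorted(spurious)}")
    return warnings


# A correct explanation that skips the exception email gets flagged.
print(flag_citation_drift(
    {"explanation_correct": True, "cited": ["invoice", "purchase_order"]},
    {"invoice", "purchase_order", "email_thread"},
))  # ["missing evidence: ['email_thread']"]
```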
Each Studio surface is designed as a practical operating loop: capture the buyer's problem, run measurements that produce evidence, and return a decision artifact that can be acted on.
Current demo state
The live benchmark console is connected to the Indian Enterprise Workflow Suite; the next step is replacing seed traces with real, client-approved examples.
Build public and private workflow packets from Indian enterprise task patterns.
Run model and agent classes through finance, support, sales, legal, document, and multilingual scenarios.
Score outcome correctness, cited evidence, escalation judgement, localization robustness, latency, and cost.
Publish aggregate findings while keeping private holdout tasks unavailable to model tuning loops (see the harness sketch after this list).
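A minimal harness sketch for this loop, with the assumptions labeled: `run_model` stands in for the model-under-test call, tasks are plain dicts, and the coverage scoring repeats the illustrative rubric above rather than the suite's actual pipeline.

```python
# Minimal run loop; run_model is a caller-supplied placeholder, and only
# aggregate numbers leave publish_aggregates, so holdout texts stay sealed.
import statistics
from typing import Callable


def score_split(run_model: Callable[[dict], dict], tasks: list[dict]) -> list[float]:
    """Score cited-evidence coverage for every task in one split."""
    scores = []
    for task in tasks:
        answer = run_model(task)  # expected to return {"cited": [...], ...}
        expected = set(task["expected_evidence"])
        cited = set(answer.get("cited", []))
        scores.append(len(cited & expected) / len(expected) if expected else 1.0)
    return scores


def publish_aggregates(run_model, public_tasks: list[dict],
                       private_tasks: list[dict]) -> dict:
    """Return aggregates only; per-task holdout traces never leave this scope."""
    scores = score_split(run_model, public_tasks) + score_split(run_model, private_tasks)
    return {"evidence_coverage_mean": statistics.mean(scores),
            "n_tasks": len(scores)}
```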
These are the questions the product needs to answer before someone deploys, buys, or scales the system.
Which models survive Indian document and support workflows?
Where do multilingual or policy tasks fail?
What can be safely automated versus escalated?
How do quality, latency, and cost change by workflow type?
Deliverables
Connected evidence
Studio packet
This generated packet gives Sanjay and Saujas a consistent follow-up artifact for demos, consulting calls, and product conversations.
Turn this Studio surface from a populated product brief into a live demo by wiring real run data, screenshots, and client-approved examples into the same page.