
Indian Workflow Benchmark

A workflow benchmark for Indian business tasks: finance, support, multilingual handoffs, document reasoning, sales ops, and evidence-grounded escalation.

Live Studio demo

Indian workflow console

Evaluate AI on workflows that resemble real Indian operations: GST evidence, bilingual tickets, policy decisions, and escalation boundaries.

Finance · Medium · Public

GST invoice discrepancy explanation

Given an invoice, purchase order, and short email thread, identify the GST mismatch and draft a vendor-facing explanation with cited evidence.

Evidence coverage

94%

Checks whether the answer cites the right operational artifacts.

Escalation judgement

96%

Separates safe automation from tasks requiring a human queue.

Localization load

92%

Captures mixed-language, regional, and policy-language pressure.

Cost risk

70%

Lower is better; cost rises with retries, reviews, and high-cost routes.
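The four metrics can be rolled into a single readiness number. The sketch below is purely illustrative, with made-up weights and a hypothetical `composite_score` helper, not the benchmark's actual aggregation formula; the only property it preserves from the descriptions above is that cost risk is inverted because lower is better.

```python
# Hypothetical composite score. Weights and function name are
# illustrative assumptions, not the benchmark's real formula.
def composite_score(evidence, escalation, localization, cost_risk):
    """All inputs are percentages (0-100). Cost risk counts
    against the score, so it is inverted before weighting."""
    weights = {"evidence": 0.3, "escalation": 0.3,
               "localization": 0.2, "cost": 0.2}
    return (weights["evidence"] * evidence
            + weights["escalation"] * escalation
            + weights["localization"] * localization
            + weights["cost"] * (100 - cost_risk))

# With the demo figures above (94, 96, 92, 70):
print(round(composite_score(94, 96, 92, 70), 1))  # → 81.4
```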

Task mix v0.1

Support: 20%
Finance: 18%
Legal: 14%
Sales: 16%
Documents: 20%
Multilingual: 12%
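The task mix above is a set of percentage weights, so a harness can sample workflow categories in proportion to it. This is a minimal sketch under that assumption; `TASK_MIX` and `sample_category` are hypothetical names, not part of the benchmark's published API.

```python
import random

# v0.1 task mix; the weights are percentages and must sum to 100.
TASK_MIX = {"Support": 20, "Finance": 18, "Legal": 14,
            "Sales": 16, "Documents": 20, "Multilingual": 12}
assert sum(TASK_MIX.values()) == 100

def sample_category(rng=random):
    """Draw one workflow category in proportion to the v0.1 mix."""
    categories, weights = zip(*TASK_MIX.items())
    return rng.choices(categories, weights=weights, k=1)[0]
```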

Model comparison

Who survives the workflow?

Frontier reasoning model

91/100

Correct mismatch, cited all three evidence points, and proposed a clean vendor follow-up.

Fast mid-tier model

82/100

Correct calculation and tone; reviewer had to repair one citation label.

Open-weight local model

61/100

Found a mismatch but confused CGST/SGST allocation and needed human correction.

Small routing model

38/100

Useful triage signal only; generated an unsupported tax explanation.

Expected evidence

invoice tax line · purchase order rate · vendor email exception

Best answer excerpt

The invoice uses a GST rate that does not match the purchase order, and the exception email explains why the vendor needs to revise the tax line before payment.

Benchmark readiness

Holdout-pressure mode is on: publish aggregate scores, keep exact private tasks sealed, and require cited evidence before deployment.

Failure mode to watch: Models often explain the mismatch correctly but cite the wrong source row or miss the exception email.
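That failure mode suggests scoring citations separately from the explanation: credit only the expected artifacts that are actually cited, so a wrong source row or a missed exception email lowers the score even when the prose is right. A minimal sketch, assuming a simple set-overlap metric (the function name and scoring rule are illustrative, not the benchmark's published scorer):

```python
# The three expected artifacts for the GST task above.
EXPECTED_EVIDENCE = {"invoice tax line", "purchase order rate",
                     "vendor email exception"}

def evidence_coverage(cited):
    """Fraction (0.0-1.0) of expected artifacts actually cited.
    Extra or wrong citations earn no credit."""
    return len(set(cited) & EXPECTED_EVIDENCE) / len(EXPECTED_EVIDENCE)

# The failure mode to watch: correct explanation, but one citation
# points at the wrong source row, so coverage drops to 2/3.
score = evidence_coverage({"invoice tax line", "purchase order rate",
                           "wrong source row"})
```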

How it works

Each Studio surface is designed as a practical operating loop: capture the buyer problem, run measured evaluations to gather evidence, and return a decision artifact that can be acted on.

Current demo state

Live benchmark console is connected to the Indian Enterprise Workflow Suite; next step is replacing seed traces with real client-approved examples.

1. Build public and private workflow packets from Indian enterprise task patterns.
2. Run model and agent classes through finance, support, sales, legal, document, and multilingual scenarios.
3. Score outcome correctness, cited evidence, escalation judgement, localization robustness, latency, and cost.
4. Publish aggregate findings while keeping private holdout tasks unavailable to model tuning loops.
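The scoring and publishing steps of this loop can be sketched as an aggregation pass that emits per-model rates while the private holdout task contents stay inside the harness. All names here (`RunResult`, `aggregate`, the three boolean fields) are hypothetical, chosen to mirror the scored dimensions above.

```python
from dataclasses import dataclass

@dataclass
class RunResult:
    """One scored holdout run; task contents are never stored here."""
    model: str
    correct: bool
    evidence_ok: bool
    escalated_correctly: bool

def aggregate(results):
    """Return per-model pass rates, the only data that gets published."""
    by_model = {}
    for r in results:
        s = by_model.setdefault(r.model, {"n": 0, "correct": 0,
                                          "evidence": 0, "escalation": 0})
        s["n"] += 1
        s["correct"] += r.correct
        s["evidence"] += r.evidence_ok
        s["escalation"] += r.escalated_correctly
    return {m: {k: v / s["n"] for k, v in s.items() if k != "n"}
            for m, s in by_model.items()}
```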

Buyer questions

These are the questions the product needs to answer before someone deploys, buys, or scales the system.

Which models survive Indian document and support workflows?

Where do multilingual or policy tasks fail?

What can be safely automated versus escalated?

How do quality, latency, and cost change by workflow type?

Deliverables

What a buyer gets

Workflow task pack
Model comparison memo
Evidence audit
Deployment readiness map

Studio packet

Buyer-ready demo packet.

This generated packet gives Sanjay and Saujas a consistent follow-up artifact for demos, consulting calls, and product conversations.

Next build step

Turn this Studio surface from a populated product brief into a live demo by wiring real run data, screenshots, and client-approved examples into the same page.