Indian Workflows
Designing the Indian Enterprise AI Workflow Benchmark
A concrete v0.1 benchmark plan for Indian business workflows: data shape, scoring, task refresh, and buyer-facing outputs.
Research lens
6 domains
24 seed tasks
The most commercially useful AI benchmark is not a puzzle set. It is a battery of repeated business tasks: support escalation, invoice reconciliation, CRM cleanup, policy lookup, sales-call summarization, and bilingual document handling.
A task is ready for the benchmark only when the expected output, evidence requirement, failure conditions, and scoring rubric are all written before the model run. Otherwise the evaluation becomes a vibes-based review after the fact.
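One way to make the readiness rule mechanical is to store each task as a record and refuse to run it while any of the four parts is empty. The sketch below is an illustration only: the `TaskSpec` class, its field names, and the sample task are assumptions, not part of the benchmark plan itself.

```python
from dataclasses import dataclass


@dataclass
class TaskSpec:
    """Hypothetical task record; all field names are illustrative assumptions."""
    task_id: str
    domain: str
    expected_output: str
    evidence_requirement: str
    failure_conditions: tuple[str, ...]
    scoring_rubric: dict[str, float]  # criterion name -> weight


def readiness_gaps(spec: TaskSpec) -> list[str]:
    """Return the names of the required parts that are still missing."""
    gaps = []
    if not spec.expected_output.strip():
        gaps.append("expected_output")
    if not spec.evidence_requirement.strip():
        gaps.append("evidence_requirement")
    if not spec.failure_conditions:
        gaps.append("failure_conditions")
    if not spec.scoring_rubric:
        gaps.append("scoring_rubric")
    return gaps


def is_ready(spec: TaskSpec) -> bool:
    """A task may enter a model run only when nothing is missing."""
    return not readiness_gaps(spec)


# Made-up draft task: the rubric exists, but evidence and failure conditions do not.
draft = TaskSpec(
    task_id="support-001",
    domain="Support",
    expected_output="Escalate to the refunds team with a polite bilingual reply.",
    evidence_requirement="",
    failure_conditions=(),
    scoring_rubric={"escalation": 0.7, "tone": 0.3},
)
print(readiness_gaps(draft))
```

Running the check before any model call keeps the rubric from being adjusted to fit whatever the model happened to produce.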
Indian workflows are useful because they expose multilingual switching, document variability, policy ambiguity, price sensitivity, and operational constraints. Those same properties matter to global buyers evaluating production AI systems.
Visual
Initial allocation for a private holdout plus public sample benchmark.
Support: Refund escalation with mixed Hindi-English context (scoring focus: correct escalation and tone)
Finance: GST invoice discrepancy explanation (scoring focus: numerical accuracy and citation)
Legal: Policy clause retrieval from long document (scoring focus: grounded answer with source span)
Sales: Call transcript to CRM fields (scoring focus: structured output completeness)
Documents: Form extraction across messy scans (scoring focus: field-level precision and abstention)
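As an illustration of one of these scoring targets, field-level precision with abstention for form extraction could be computed along the following lines. This is a minimal sketch under stated assumptions: a prediction of `None` is treated as an abstention, abstentions lower coverage but not precision, and matching is a whitespace- and case-insensitive string comparison. The function names and the sample invoice values are made up.

```python
from typing import Optional


def normalize(value: str) -> str:
    """Collapse whitespace and ignore case before comparing field values."""
    return " ".join(value.split()).casefold()


def score_fields(predicted: dict[str, Optional[str]], gold: dict[str, str]) -> dict[str, float]:
    """Score one extracted form against its gold fields.

    Assumption: a missing or None prediction counts as an abstention.
    """
    answered = 0
    correct = 0
    for name, gold_value in gold.items():
        guess = predicted.get(name)
        if guess is None:
            continue  # abstained: not counted as answered
        answered += 1
        if normalize(guess) == normalize(gold_value):
            correct += 1
    total = len(gold)
    return {
        "precision": correct / answered if answered else 0.0,
        "coverage": answered / total if total else 0.0,
        "abstained": total - answered,
    }


# Made-up example: one correct field, one abstention, one wrong total.
gold = {"gstin": "27ABCDE1234F1Z5", "invoice_no": "INV-1042", "total": "11800.00"}
predicted = {"gstin": "27abcde1234f1z5", "invoice_no": None, "total": "11300.00"}
result = score_fields(predicted, gold)
print(result)
```

Reporting precision and coverage separately makes the trade-off visible: a model that abstains on unreadable scans can keep precision high, while coverage shows how much work is left for a human.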
Process
First release: 24 seed tasks. Enough to demonstrate methodology without pretending to be a finished universal benchmark.
Private split: 60%. Keep the harder holdout private so the public sample remains useful but not gameable.
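A rough sketch of how such a split might be drawn is shown below. It assumes each task carries a numeric difficulty rating, which the plan does not define, and that the private count is rounded to the nearest whole task (60% of 24 is 14.4). The function name and the sample ratings are hypothetical.

```python
def split_holdout(tasks: list[tuple[str, int]], private_share: float = 0.6) -> tuple[list[str], list[str]]:
    """Split (task_id, difficulty) pairs into private and public task IDs.

    Assumption: higher difficulty means harder, and the hardest tasks go private.
    Ties are broken by task_id so the split is deterministic.
    """
    ranked = sorted(tasks, key=lambda t: (-t[1], t[0]))
    n_private = round(len(ranked) * private_share)
    private = [task_id for task_id, _ in ranked[:n_private]]
    public = [task_id for task_id, _ in ranked[n_private:]]
    return private, public


# Made-up difficulty ratings for 24 placeholder tasks.
tasks = [(f"task-{i:02d}", i % 5) for i in range(24)]
private, public = split_holdout(tasks)
print(len(private), len(public))
```

A purely difficulty-ranked split can leave a domain with no public examples, so a real release might apply the same rule within each domain instead.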
Buyer artifact: Decision memo. Every benchmark run should end with a provider/model recommendation and a risk register.
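The memo can be kept as structured data so that every run produces the same two parts. The sketch below is only one possible shape: the `DecisionMemo` and `Risk` classes, the severity levels, and the placeholder model name are assumptions rather than a prescribed format.

```python
from dataclasses import dataclass, field


@dataclass
class Risk:
    """One row of the risk register (hypothetical fields)."""
    description: str
    severity: str  # assumed levels: "high", "medium", "low"
    mitigation: str


@dataclass
class DecisionMemo:
    """Buyer-facing output of one benchmark run (illustrative structure)."""
    workflow: str
    recommended_model: str
    rationale: str
    risk_register: list[Risk] = field(default_factory=list)

    def render(self) -> str:
        """Render the memo as plain text, highest-severity risks first."""
        order = {"high": 0, "medium": 1, "low": 2}
        lines = [
            f"Workflow: {self.workflow}",
            f"Recommendation: {self.recommended_model}",
            f"Rationale: {self.rationale}",
            "Risk register:",
        ]
        ranked = sorted(self.risk_register, key=lambda r: order.get(r.severity, len(order)))
        for risk in ranked:
            lines.append(f"- [{risk.severity}] {risk.description} (mitigation: {risk.mitigation})")
        return "\n".join(lines)


# Placeholder content; the model name and risks are invented for illustration.
memo = DecisionMemo(
    workflow="GST invoice discrepancy explanation",
    recommended_model="provider-a/model-x",
    rationale="Best numerical accuracy on the private split within the latency target.",
    risk_register=[
        Risk("Tone drifts in mixed Hindi-English replies", "low", "Add a style check"),
        Risk("Totals occasionally cited without a source line", "high", "Require a citation per figure"),
    ],
)
print(memo.render())
```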
Recommendation
The right model, benchmark, or interpretability method depends on the workflow, risk tolerance, budget, latency target, data sensitivity, and the cost of a wrong answer.