

Designing the Indian Enterprise AI Workflow Benchmark

A concrete v0.1 benchmark plan for Indian business workflows: data shape, scoring, task refresh, and buyer-facing outputs.

20 May 2026 · 16 min read

Research lens · 6 domains · 24 seed tasks

The benchmark should start with boring work

The most commercially useful AI benchmark is not a puzzle set. It is a battery of repeated business tasks: support escalation, invoice reconciliation, CRM cleanup, policy lookup, sales-call summarization, and bilingual document handling.

Every task needs an acceptance test

A task is ready for the benchmark only when the expected output, evidence requirement, failure conditions, and scoring rubric are written before the model run. Otherwise the evaluation becomes vibes-based review after the fact.
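To make "ready" testable, the task record itself can carry those four artifacts. Below is a minimal sketch in Python; the field names and the readiness check are my own illustration, not a schema the plan specifies.

```python
from dataclasses import dataclass

@dataclass
class AcceptanceTest:
    """Everything that must exist BEFORE the model run (illustrative schema)."""
    expected_output: str           # gold answer or canonical structured output
    evidence_required: list[str]   # e.g. source spans or invoice line items to cite
    failure_conditions: list[str]  # outputs that auto-fail regardless of fluency
    rubric: dict[str, float]       # criterion -> weight; weights sum to 1.0

@dataclass
class BenchmarkTask:
    task_id: str
    domain: str                    # "support", "finance", "legal", ...
    prompt: str                    # the workflow input shown to the model
    acceptance: AcceptanceTest

    def is_ready(self) -> bool:
        # A task enters the benchmark only when every pre-run artifact
        # exists and the rubric weights are normalized.
        a = self.acceptance
        return bool(
            a.expected_output
            and a.evidence_required
            and a.failure_conditions
            and abs(sum(a.rubric.values()) - 1.0) < 1e-9
        )
```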

India-specific does not mean parochial

Indian workflows are useful because they expose multilingual switching, document variability, policy ambiguity, price sensitivity, and operational constraints. Those same properties matter to global buyers evaluating production AI systems.


Candidate workflow mix for v0.1

Initial allocation, as a percentage of tasks, for a private holdout plus public sample benchmark.

Support        20%
Finance        18%
Legal          14%
Sales          16%
Documents      20%
Multilingual   12%
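The shares sum to 100 percent. One way to turn them into whole task counts, assuming the 24 seed tasks are allocated proportionally to these shares (my assumption, not stated in the plan), is largest-remainder rounding:

```python
import math

MIX = {"support": 20, "finance": 18, "legal": 14,
       "sales": 16, "documents": 20, "multilingual": 12}  # percent shares

def allocate(total_tasks: int, mix: dict[str, int]) -> dict[str, int]:
    """Largest-remainder rounding: floor every share, then hand the
    leftover tasks to the domains with the biggest fractional parts."""
    exact = {d: total_tasks * pct / 100 for d, pct in mix.items()}
    counts = {d: math.floor(x) for d, x in exact.items()}
    leftover = total_tasks - sum(counts.values())
    for d in sorted(exact, key=lambda d: exact[d] - counts[d], reverse=True)[:leftover]:
        counts[d] += 1
    return counts

print(allocate(24, MIX))
# {'support': 5, 'finance': 4, 'legal': 3, 'sales': 4, 'documents': 5, 'multilingual': 3}
```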

Example seed tasks and their scoring focus:

Support: refund escalation with mixed Hindi-English context; scored on correct escalation and tone.
Finance: GST invoice discrepancy explanation; scored on numerical accuracy and citation.
Legal: policy clause retrieval from a long document; scored on a grounded answer with source span.
Sales: call transcript to CRM fields; scored on structured output completeness.
Documents: form extraction across messy scans; scored on field-level precision and abstention.
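The Documents task is worth pausing on, because it rewards abstention: a model that declines an unreadable field should score differently from one that guesses wrong. A minimal scorer under that reading (the metric definition and the GSTIN example are my assumptions, not the published rubric) might look like this:

```python
def score_extraction(predicted: dict[str, str | None],
                     gold: dict[str, str]) -> dict[str, float]:
    """Field-level precision with abstention: None means the model
    abstained, so the field is excluded from precision and tracked
    separately instead of counted as an error."""
    answered = {f: v for f, v in predicted.items() if v is not None}
    correct = sum(1 for f, v in answered.items() if gold.get(f) == v)
    return {
        "precision": correct / len(answered) if answered else 0.0,
        "coverage": len(answered) / len(gold),
        "abstention_rate": 1 - len(answered) / len(gold),
    }

print(score_extraction(
    {"gstin": "27AAACI1234A1Z5", "invoice_date": None, "total": "1,180.00"},
    {"gstin": "27AAACI1234A1Z5", "invoice_date": "2026-03-14", "total": "1,180.00"},
))
# precision 1.0, coverage ~0.67, abstention_rate ~0.33
```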


How to read the analysis

1. Workflow intake
2. Gold answer
3. Rubric
4. Model run
5. Human audit
6. Buyer report
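Read this as a strict pipeline: no task should reach the buyer report without passing human audit first. A minimal sketch of a run record that enforces the ordering, with stage names taken from the list above and everything else assumed:

```python
from enum import IntEnum

class Stage(IntEnum):
    WORKFLOW_INTAKE = 1
    GOLD_ANSWER = 2
    RUBRIC = 3
    MODEL_RUN = 4
    HUMAN_AUDIT = 5
    BUYER_REPORT = 6

class RunRecord:
    """Tracks one task through the pipeline; stages cannot be skipped."""
    def __init__(self, task_id: str):
        self.task_id = task_id
        self.stage = Stage.WORKFLOW_INTAKE

    def advance(self, to: Stage) -> None:
        if to != self.stage + 1:
            raise ValueError(
                f"{self.task_id}: cannot jump from {self.stage.name} to {to.name}"
            )
        self.stage = to

record = RunRecord("support-refund-007")   # hypothetical task ID
record.advance(Stage.GOLD_ANSWER)          # ok
# record.advance(Stage.MODEL_RUN)          # raises: RUBRIC was skipped
```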

First release: 24 seed tasks, enough to demonstrate methodology without pretending to be a finished universal benchmark.

Private split: 60%. Keep the harder holdout private so the public sample remains useful but not gameable.
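One reproducible way to implement the split, assuming tasks carry stable IDs (the hashing scheme and salt are illustrations, not the plan's stated method), is to hash each task ID instead of shuffling, so the assignment survives task refreshes:

```python
import hashlib

PRIVATE_SHARE = 0.60  # from the plan: 60% stays in the private holdout

def is_private(task_id: str, salt: str = "iw-bench-v0.1") -> bool:
    """Deterministic split: the same task always lands on the same side,
    even as new tasks are added. Keeping the salt private avoids leaking
    which side a known task ID falls on."""
    digest = hashlib.sha256(f"{salt}:{task_id}".encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket < PRIVATE_SHARE

print(is_private("finance-gst-003"))  # hypothetical task ID
```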

Buyer artifact: a decision memo. Every benchmark run should end with a provider/model recommendation and a risk register.
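A minimal shape for that artifact, with field names that are my guesses at what a buyer-facing memo needs rather than a defined template:

```python
from dataclasses import dataclass, field

@dataclass
class Risk:
    description: str  # e.g. "hallucinated GST rates on edge-case invoices"
    severity: str     # "low" | "medium" | "high"
    mitigation: str   # e.g. "human review on high-value invoices"

@dataclass
class DecisionMemo:
    workflow: str                  # which workflow was benchmarked
    recommended_model: str         # the provider/model the scores support
    runner_up: str                 # next-best option and why it lost
    risk_register: list[Risk] = field(default_factory=list)
    caveats: str = ""              # where the benchmark itself is weak
```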

Recommendation

Use this as a decision tool, not a belief system.

The right model, benchmark, or interpretability method depends on the workflow, risk tolerance, budget, latency target, data sensitivity, and the cost of a wrong answer.