Designing v0.1

Indian Enterprise Workflow Suite

A private-plus-public benchmark for support, finance, legal, sales, document, and multilingual back-office workflows common in Indian enterprises.


Current leader

Frontier reasoning model

Strong candidate; inspect cost and latency before production use.

Score: 88

Pass rate: 91

Recovery: 83

Task mix

What the suite measures

Support: 20%

Finance: 18%

Legal: 14%

Sales: 16%

Documents: 20%

Multilingual: 12%

| Model class | Provider | Score | Pass rate | Retry rate | P95 latency | Reviewer note |
| --- | --- | --- | --- | --- | --- | --- |
| Frontier reasoning model | Frontier API provider | 88 | 91 | 6% | 5638 ms | Strong candidate; inspect cost and latency before production use. |
| Fast mid-tier model | Fast hosted API provider | 76 | 80 | 11% | 4832 ms | Usable for constrained workflows with fallback routing. |
| Open-weight local model | Self-hosted/open-weight stack | 61 | 64 | 17% | 6126 ms | Usable for constrained workflows with fallback routing. |
| Small routing model | Low-cost routing endpoint | 52 | 48 | 21% | 4858 ms | Use only for narrow routing, triage, or privacy-constrained baselines. |

Task trace evidence

Inspect representative benchmark runs

These trace packets show the task brief, expected evidence, model outcomes, cost units, latency, and reviewer notes behind the aggregate score.

Finance / public / Medium

GST invoice discrepancy explanation

Given an invoice, purchase order, and short email thread, identify the GST mismatch and draft a vendor-facing explanation with cited evidence.

Top run: Frontier reasoning model

Support / holdout / Hard

Hindi-English refund escalation

Classify a mixed Hindi-English support ticket, apply refund policy, and decide whether to escalate based on payment and delivery evidence.

Top run: Frontier reasoning model

Legal / public / Medium

Vendor contract renewal risk

Review a vendor renewal clause, identify the renewal deadline, surface termination-notice risk, and draft an internal note with evidence.

Top run: Frontier reasoning model

Finance / holdout / Hard

GST credit note reconciliation

Compare the invoice, credit note, and ledger export to decide whether the vendor credit has been applied correctly.

Top run: Frontier reasoning model

Scoring rubric

Outcome correctness

Evidence citation

Escalation judgement

Cost per accepted output
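The rubric names cost per accepted output without defining it. One plausible reading, sketched here, divides the cost units spent across all runs (accepted or not) by the number of runs a reviewer accepted. The `Run` type, its field names, and the sample values are hypothetical and do not come from the benchmark data.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Run:
    cost_units: float  # hypothetical field: cost units consumed by this run
    accepted: bool     # hypothetical field: whether a reviewer accepted the output


def cost_per_accepted_output(runs: list[Run]) -> Optional[float]:
    """Total cost of all runs divided by the number of accepted runs.

    Rejected runs still count toward cost, so retries and failures make
    each accepted output more expensive. Returns None when nothing was
    accepted, since the ratio is undefined.
    """
    accepted = sum(1 for r in runs if r.accepted)
    if accepted == 0:
        return None
    total_cost = sum(r.cost_units for r in runs)
    return total_cost / accepted


# Illustrative values only, not benchmark results.
sample = [Run(4.0, True), Run(3.0, False), Run(5.0, True)]
print(cost_per_accepted_output(sample))  # 6.0
```

Charging rejected runs to the accepted ones is a design choice; a suite that only wants the marginal cost of a good output would sum over accepted runs instead.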

Run provenance

Generated at: 2026-05-16T00:00:00+05:30

Dataset version: 0.1.0

Trace ingest paths: data/benchmark-trace-input.json, data/benchmark-trace-runs.csv

Run date: 2026-05-16

Synthetic v0.1 benchmark dataset for website scaffolding. Aggregate suite rows are generated from script constants; task traces are ingested from data/benchmark-trace-input.json and data/benchmark-trace-runs.csv when present. Replace both with real model/provider runs as benchmark harnesses come online.
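The "when present" behaviour described above could look roughly like the following. This is a minimal sketch, not the site's actual generator: the function name `load_trace_inputs` is hypothetical, requiring both files is an assumption, and returning `None` so the caller can fall back to synthetic data is also an assumption. No schema is assumed for either file.

```python
import csv
import json
from pathlib import Path
from typing import Any, Optional

# Paths as listed under "Trace ingest paths".
TRACE_INPUT = Path("data/benchmark-trace-input.json")
TRACE_RUNS = Path("data/benchmark-trace-runs.csv")


def load_trace_inputs(
    json_path: Path = TRACE_INPUT, csv_path: Path = TRACE_RUNS
) -> Optional[dict[str, Any]]:
    """Return parsed task packets and run rows, or None if either file is missing.

    Returning None is an assumption: the caller would then keep using
    whatever synthetic data the generator script already holds.
    """
    if not (json_path.is_file() and csv_path.is_file()):
        return None
    with json_path.open(encoding="utf-8") as f:
        packets = json.load(f)  # structure is passed through untouched
    with csv_path.open(newline="", encoding="utf-8") as f:
        rows = list(csv.DictReader(f))  # one dict per CSV row, keyed by header
    return {"packets": packets, "rows": rows}
```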

Leaderboard control metadata

Controls attached to this benchmark run

These fields make the suite auditable: the public/private split, freshness policy, leakage policy, repeat-run rule, retirement trigger, and provenance fields are generated with the benchmark data instead of being described only in prose.

Split

10 public / 14 private holdout tasks

Public share

42%

Holdout share

58%

Repeat rule

Repeat any result within five points of a leaderboard boundary across at least three seeds.
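A minimal sketch of how the repeat rule might be checked. It assumes a "leaderboard boundary" means the gap between two entries' scores, so any pair within five points could plausibly swap ranks, and it treats exactly five points as inside the margin. The function names are hypothetical.

```python
MARGIN_POINTS = 5.0  # "within five points"
MIN_SEEDS = 3        # "at least three seeds"


def needs_repeat(scores: dict[str, float], margin: float = MARGIN_POINTS) -> list[str]:
    """Names whose score sits within `margin` of another entry's score.

    Assumption: a boundary is the gap between two entries, so any pair
    closer than the margin could swap places on a rerun.
    """
    flagged = []
    for name, score in scores.items():
        others = (s for n, s in scores.items() if n != name)
        if any(abs(score - s) <= margin for s in others):
            flagged.append(name)
    return flagged


def repeat_satisfied(seeds_run: set[int], min_seeds: int = MIN_SEEDS) -> bool:
    """True once a flagged result has been run on enough distinct seeds."""
    return len(seeds_run) >= min_seeds


# Scores from the synthetic v0.1 table: no two entries are within five points.
table = {
    "Frontier reasoning model": 88,
    "Fast mid-tier model": 76,
    "Open-weight local model": 61,
    "Small routing model": 52,
}
print(needs_repeat(table))  # []
```

If the suite instead defines boundaries as fixed tier thresholds, the same check applies with the thresholds substituted for the other entries' scores.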

Freshness

The public sample is refreshed monthly; the private holdout stays sealed until replacement tasks exist.

Leakage policy

Do not use tasks sourced from public examples, vendor demos, or training-contaminated snippets without replacement variants.

Retirement rule

Retire a task when frontier and mid-tier models cluster near the ceiling or when source material becomes widely circulated.
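The retirement rule leaves "near the ceiling" unquantified. The sketch below picks illustrative numbers (a 100-point ceiling and a 5-point band) purely as assumptions, and treats either trigger as sufficient on its own. The function name and parameters are hypothetical.

```python
CEILING = 100.0     # assumed maximum task score
NEAR_CEILING = 5.0  # assumed: "near the ceiling" means within 5 points of it


def should_retire(
    frontier_score: float,
    mid_tier_score: float,
    source_widely_circulated: bool,
    ceiling: float = CEILING,
    near: float = NEAR_CEILING,
) -> bool:
    """Apply the two retirement triggers; either one is sufficient."""
    # Both model classes must be near the ceiling for the task to count as saturated.
    saturated = min(frontier_score, mid_tier_score) >= ceiling - near
    return saturated or source_widely_circulated
```

For example, `should_retire(98, 96, False)` is true because both classes sit in the band, while `should_retire(98, 80, False)` is false because the mid-tier model still has headroom.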

Required provenance

traceId, createdAt, split, source, modelVersion, runSeed, reviewerNote, retirementStatus
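A small check along these lines could reject trace records that lack any required provenance field. The field names come from the list above; the helper name `missing_provenance` and the sample record values are hypothetical, and treating an empty string as missing is an assumption.

```python
REQUIRED_PROVENANCE = (
    "traceId", "createdAt", "split", "source",
    "modelVersion", "runSeed", "reviewerNote", "retirementStatus",
)


def missing_provenance(record: dict) -> list[str]:
    """Required fields that are absent, None, or an empty string.

    A runSeed of 0 is a valid value and is not reported as missing.
    """
    return [k for k in REQUIRED_PROVENANCE if record.get(k) in (None, "")]


# Hypothetical record, for illustration only.
record = {
    "traceId": "trace-001",
    "createdAt": "2026-05-16T00:00:00+05:30",
    "split": "public",
    "source": "synthetic",
    "modelVersion": "example-model-v1",
    "runSeed": 0,
}
print(missing_provenance(record))  # ['reviewerNote', 'retirementStatus']
```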

Next data step

Replace this synthetic v0.1 run with real provider traces.

The page is wired to generated data already, including JSON task packets and CSV trace rows. The next engineering task is to point the importer at actual benchmark harness exports with model name, provider, settings, latency samples, retries, tool traces, and reviewer notes.
