# Indian Enterprise Workflow Suite Buyer Brief

Generated: 2026-05-16T00:00:00+05:30
Dataset: 0.1.0
Status: Designing v0.1

## Executive Readout

- Average score: 69 (mean across the four model classes)
- Current leader: Frontier reasoning model (88)
- Sample size: 24 tasks
- Public/private split: 10 public / 14 private holdout tasks
- Inspectable trace packets: 4
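The readout figures can be cross-checked against the Model Comparison table. Below is a minimal sketch. It assumes "Average score" is the unweighted mean of the four model-class scores: 277 / 4 = 69.25, which rounds to 69. It also assumes the public and private holdout counts should add up to the stated sample size. The constant names are hypothetical.

```python
# Scores copied from this brief's Model Comparison table.
MODEL_SCORES = {
    "Frontier reasoning model": 88,
    "Fast mid-tier model": 76,
    "Open-weight local model": 61,
    "Small routing model": 52,
}

# Task counts from the Executive Readout.
PUBLIC_TASKS = 10
PRIVATE_HOLDOUT_TASKS = 14
SAMPLE_SIZE = 24


def average_score(scores: dict[str, int]) -> int:
    """Unweighted mean of the model-class scores, rounded to the nearest integer."""
    # 277 / 4 = 69.25 is not a tie, so round() yields 69.
    return round(sum(scores.values()) / len(scores))


def current_leader(scores: dict[str, int]) -> tuple[str, int]:
    """Return the (model class, score) pair with the highest score."""
    name = max(scores, key=scores.get)
    return name, scores[name]


# The public and private splits should account for every task in the sample.
assert PUBLIC_TASKS + PRIVATE_HOLDOUT_TASKS == SAMPLE_SIZE

print("Average score:", average_score(MODEL_SCORES))
print("Current leader:", current_leader(MODEL_SCORES))
```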

The current leader is a strong candidate, but inspect its cost and latency before production use.

## Model Comparison

| Model class | Provider | Score | Pass rate | Recovery | Cost index | P95 latency (ms) |
| --- | --- | --- | --- | --- | --- | --- |
| Frontier reasoning model | Frontier API provider | 88 | 91 | 83 | 52 | 5,638 |
| Fast mid-tier model | Fast hosted API provider | 76 | 80 | 66 | 81 | 4,832 |
| Open-weight local model | Self-hosted/open-weight stack | 61 | 64 | 49 | 73 | 6,126 |
| Small routing model | Low-cost routing endpoint | 52 | 48 | 36 | 92 | 4,858 |
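The cost and latency caveat in the readout can be made explicit by filtering on a minimum score and a P95 latency budget. This is a minimal sketch. The thresholds (70 points, 5,000 ms) and the `ModelRow` type are hypothetical. The Cost index column is left out because the brief does not say whether a higher index means cheaper or more expensive.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelRow:
    model_class: str
    score: int
    p95_latency_ms: int


# Values copied from the Model Comparison table.
ROWS = [
    ModelRow("Frontier reasoning model", 88, 5638),
    ModelRow("Fast mid-tier model", 76, 4832),
    ModelRow("Open-weight local model", 61, 6126),
    ModelRow("Small routing model", 52, 4858),
]


def shortlist(rows: list[ModelRow], min_score: int, max_p95_ms: int) -> list[ModelRow]:
    """Keep rows that meet both thresholds, highest score first."""
    eligible = [r for r in rows if r.score >= min_score and r.p95_latency_ms <= max_p95_ms]
    return sorted(eligible, key=lambda r: r.score, reverse=True)


# Hypothetical thresholds: at least 70 points and a 5,000 ms P95 budget.
for row in shortlist(ROWS, min_score=70, max_p95_ms=5000):
    print(row.model_class, row.score, row.p95_latency_ms)
```

With these thresholds only the fast mid-tier model qualifies. Relaxing the budget to 6,000 ms brings the frontier model back to the top of the list.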

## Representative Trace Packets

| Task | Domain | Split | Difficulty | Top run | Score |
| --- | --- | --- | --- | --- | --- |
| GST invoice discrepancy explanation | Finance | public | Medium | Frontier reasoning model | 91 |
| Hindi-English refund escalation | Support | holdout | Hard | Frontier reasoning model | 86 |
| Vendor contract renewal risk | Legal | public | Medium | Frontier reasoning model | 87 |
| GST credit note reconciliation | Finance | holdout | Hard | Frontier reasoning model | 89 |

## Task Mix

| Category | Share |
| --- | --- |
| Support | 20% |
| Finance | 18% |
| Legal | 14% |
| Sales | 16% |
| Documents | 20% |
| Multilingual | 12% |
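Percentage shares rarely divide a small sample evenly. For example, 20% of 24 tasks is 4.8. Below is a minimal sketch of one way to convert the shares into whole task counts, using the largest-remainder (Hamilton) method. The brief does not state how its 24 tasks were actually apportioned, so the resulting counts are illustrative only.

```python
# Shares copied from the Task Mix table, as integer percentages.
SHARES_PCT = {
    "Support": 20,
    "Finance": 18,
    "Legal": 14,
    "Sales": 16,
    "Documents": 20,
    "Multilingual": 12,
}


def allocate(total: int, shares_pct: dict[str, int]) -> dict[str, int]:
    """Largest-remainder allocation of `total` tasks by integer percentage shares."""
    if sum(shares_pct.values()) != 100:
        raise ValueError("shares must sum to 100%")
    counts: dict[str, int] = {}
    remainders = []
    for order, (category, pct) in enumerate(shares_pct.items()):
        # Integer arithmetic keeps the remainders exact (no float drift).
        quota, remainder = divmod(pct * total, 100)
        counts[category] = quota
        # Sort key: largest remainder first, table order breaks ties.
        remainders.append((-remainder, order, category))
    leftover = total - sum(counts.values())
    for _, _, category in sorted(remainders)[:leftover]:
        counts[category] += 1
    return counts


print(allocate(24, SHARES_PCT))
```

For the 24-task sample, this yields Support 5, Finance 4, Legal 3, Sales 4, Documents 5, and Multilingual 3. The four leftover tasks go to the categories with the largest fractional quotas.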

## Scoring Rubric

- Outcome correctness
- Evidence citation
- Escalation judgement
- Cost per accepted output
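The first three criteria are quality checks, and the fourth ties cost to those checks. Below is a minimal sketch of cost per accepted output. It assumes an output counts as accepted only when it passes all three quality checks, and that spend on rejected runs still counts toward the total. The `TaskResult` fields are hypothetical and are not taken from an actual harness.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskResult:
    # Hypothetical per-run fields; the brief names the criteria but not a schema.
    correct: bool          # outcome correctness
    cites_evidence: bool   # evidence citation
    escalation_ok: bool    # escalation judgement
    cost: float            # spend for the run, in any consistent unit


def is_accepted(r: TaskResult) -> bool:
    """Assumption: an output is accepted only if it passes all three quality checks."""
    return r.correct and r.cites_evidence and r.escalation_ok


def cost_per_accepted_output(results: list[TaskResult]) -> float:
    """Total spend across all runs (rejected ones included) divided by accepted outputs."""
    accepted = sum(1 for r in results if is_accepted(r))
    if accepted == 0:
        return float("inf")  # money was spent but nothing usable came back
    return sum(r.cost for r in results) / accepted


runs = [
    TaskResult(True, True, True, 2.0),
    TaskResult(True, False, True, 2.0),  # rejected: no evidence cited
    TaskResult(True, True, True, 2.0),
]
print(cost_per_accepted_output(runs))
```

Charging rejected runs to the total keeps a cheap model that fails often from looking artificially economical.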

## Leaderboard Controls

- Freshness: The public sample is refreshed monthly, while the private holdout stays sealed until replacement tasks exist.
- Leakage policy: Do not use tasks sourced from public examples, vendor demos, or training-contaminated snippets without replacement variants.
- Repeat-run rule: Rerun any result that falls within five points of a leaderboard boundary, using at least three seeds.
- Retirement rule: Retire a task when frontier and mid-tier models cluster near the ceiling or when its source material becomes widely circulated.
- Required provenance: `traceId`, `createdAt`, `split`, `source`, `modelVersion`, `runSeed`, `reviewerNote`, `retirementStatus`
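Two of these controls are mechanical enough to enforce in code. Below is a minimal sketch with three assumptions. First, a result record is a flat dictionary keyed by the provenance field names listed above. Second, "within five points" is inclusive. Third, leaderboard boundaries are supplied as a list of score cutoffs. The boundary values and record contents in the example are hypothetical.

```python
REQUIRED_PROVENANCE = (
    "traceId", "createdAt", "split", "source",
    "modelVersion", "runSeed", "reviewerNote", "retirementStatus",
)

BOUNDARY_MARGIN = 5  # points; assumed inclusive
MIN_SEEDS = 3


def missing_provenance(record: dict) -> list[str]:
    """Return required provenance fields that are absent or empty (0 is a valid runSeed)."""
    return [f for f in REQUIRED_PROVENANCE if record.get(f) in (None, "")]


def needs_repeat(score: float, boundaries: list[float], seeds_run: int) -> bool:
    """True when a score sits within the margin of any boundary but has too few seeds."""
    near_boundary = any(abs(score - b) <= BOUNDARY_MARGIN for b in boundaries)
    return near_boundary and seeds_run < MIN_SEEDS


# Hypothetical record and boundary, for illustration only.
record = {"traceId": "trace-001", "split": "public", "runSeed": 0}
print(missing_provenance(record))
print(needs_repeat(score=88, boundaries=[85], seeds_run=1))
```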

## Recommended Next Step

Use this brief to decide which workflow should become a real private eval run. Replace synthetic rows with harness exports that include raw prompts, exact model identifiers, latency samples, screenshots or tool logs, scorer identity, and replay links.
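Harness exports that keep raw latency samples allow the P95 column to be recomputed rather than copied. Below is a minimal sketch using the nearest-rank percentile definition. The brief does not say which percentile method produced its P95 figures. Interpolating methods, such as NumPy's default `linear` method for `percentile`, can give slightly different values.

```python
def p95_nearest_rank(samples_ms: list[float]) -> float:
    """95th percentile by nearest rank: the ceil(0.95 * n)-th smallest sample."""
    if not samples_ms:
        raise ValueError("need at least one latency sample")
    ordered = sorted(samples_ms)
    n = len(ordered)
    rank = (95 * n + 99) // 100  # integer ceil(95n / 100), 1-based
    return ordered[rank - 1]


# Hypothetical latency samples in milliseconds.
samples = [100 * i for i in range(1, 21)]  # 100, 200, ..., 2000
print(p95_nearest_rank(samples))
```

Integer arithmetic for the rank avoids a floating-point product like `0.95 * n` landing just above a whole number and pushing the rank up by one.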

Contact: sanjay@edxperimentallabs.com or saujas@edxperimentallabs.com
