Leaderboards

Rankings for workflows, agents, cost, and latency.

These are prototype tracks for a public Indian AI benchmarking layer. The goal is not a single universal score; it is a set of buyer-relevant views with visible methodology.

Designing

Indian Workflow Index

69

Support, finance, legal, sales, document, and multilingual tasks scored by outcome, not trivia recall.

Task completion
Hindi-English robustness
Citation quality
Escalation handling

Prototype

Agentic Reliability Index

53

Measures tool-use planning, recovery, browser discipline, terminal usage, and whether agents can verify their own work.

Planning
Tool calls
Recovery
Verification

Live draft

Cost Efficiency Index

75

Converts provider pricing into cost per completed workflow using cache hit assumptions, output length, and retries.

Input cost
Output cost
Retry rate
Batch/cache leverage
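The pricing-to-workflow conversion described above can be sketched as follows. All numbers and parameter names here are illustrative assumptions, not real provider rates or measured workloads.

```python
# Sketch: convert per-token pricing into cost per completed workflow,
# folding in cache hits and retries. Prices are USD per 1M tokens and
# purely illustrative.
def cost_per_completed_workflow(
    input_price_per_m: float,
    output_price_per_m: float,
    input_tokens: int,
    output_tokens: int,
    cache_hit_rate: float,    # fraction of input tokens served from cache
    cached_discount: float,   # cached tokens billed at this fraction of list price
    retry_rate: float,        # expected extra attempts per completed workflow
) -> float:
    cached = input_tokens * cache_hit_rate
    uncached = input_tokens - cached
    input_cost = (uncached + cached * cached_discount) * input_price_per_m / 1e6
    output_cost = output_tokens * output_price_per_m / 1e6
    attempts = 1 + retry_rate
    return (input_cost + output_cost) * attempts

# Example: $3/$15 pricing, 20k-in/1k-out workflow, 50% cache hits at a
# 90% discount, and a 20% retry rate.
cost = cost_per_completed_workflow(3.0, 15.0, 20_000, 1_000, 0.5, 0.1, 0.2)
```

The point of the model is that retries and cache leverage shift rankings: a cheap model that retries often can cost more per completed workflow than a pricier model that finishes in one attempt.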

Collecting

Latency and Throughput Index

74

Separates time-to-first-token, output tokens per second, queueing behavior, and provider variance.

TTFT
Tokens/sec
P95 latency
Availability
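A minimal sketch of how three of these views can be computed from raw request samples, assuming each sample records time-to-first-token, total wall time, and output token count (availability and queueing are omitted here):

```python
import statistics

def latency_summary(samples):
    """samples: list of (ttft_seconds, total_seconds, output_tokens)."""
    ttfts = sorted(s[0] for s in samples)
    totals = sorted(s[1] for s in samples)
    # Tokens/sec over the generation phase, i.e. after the first token.
    tps = [tok / (total - ttft) for ttft, total, tok in samples if total > ttft]

    def p95(xs):
        # Nearest-rank percentile on a sorted list; coarse but adequate
        # for dashboard-level summaries.
        return xs[min(len(xs) - 1, int(0.95 * len(xs)))]

    return {
        "ttft_median": statistics.median(ttfts),
        "tokens_per_sec_median": statistics.median(tps),
        "p95_total_latency": p95(totals),
    }
```

Separating TTFT from generation speed matters because a provider can have excellent streaming throughput while queueing badly under load, and a single averaged latency number hides exactly that variance.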

Methodology before rankings.

Every public number should trace back to task provenance, model settings, sample count, scoring rubric, and the failure cases the score hides.

Read methodology
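One way to make that traceability concrete is to attach a provenance record to every published number. The field names and values below are illustrative assumptions, not a published schema.

```python
import json

# Hypothetical provenance record for one published score; every field
# name and value here is illustrative, not a real schema or dataset.
record = {
    "task_provenance": {"suite": "support-hi-en", "source": "curated tickets"},
    "model_settings": {"temperature": 0.2, "max_output_tokens": 1024},
    "sample_count": 16,
    "scoring_rubric": "task_completion_v1",
    "known_failure_cases": ["escalation loops", "citation drift"],
}
print(json.dumps(record, indent=2))
```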

Prototype model comparison

Frontier reasoning model: best quality and recovery.
Fast mid-tier model: best default for support.
Open-weight local model: useful when privacy/control beats maximum benchmark quality.
Small routing model: cheap classifier/router.

Agentic Reliability Formula

A trace-derived index for coding, browser, support, and security agents.

Compares coding, browser, and support agents by completion, state proof, recovery, tool and policy correctness, and cost-latency discipline. The formula keeps each of those components visible in the published scores instead of hiding them behind a single rank.

4 suites
16 traces
5 weights

Task completion (30%)
Measures whether the agent produced the accepted workflow result.

Evidence and state verification (25%)
Rewards proof that the browser, codebase, support policy, or tool state actually reached the target.

Recovery behavior (20%)
Separates agents that recover from validation errors, failed tools, and partial state from agents that simply stop.

Tool and policy correctness (15%)
Captures tool discipline, policy adherence, escalation quality, and repository hygiene.

Cost and latency discipline (10%)
Prevents slow or expensive agents from ranking well unless quality justifies the operating cost.
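The five weights above combine into a single index as a weighted sum. The weights come from the rubric; the component scores in the example are illustrative assumptions, not published measurements.

```python
# Weights copied from the rubric above. Component scores are assumed to
# be on a 0-100 scale; the example values are illustrative only.
WEIGHTS = {
    "task_completion": 0.30,
    "evidence_and_state": 0.25,
    "recovery": 0.20,
    "tool_and_policy": 0.15,
    "cost_and_latency": 0.10,
}

def reliability_index(scores: dict) -> float:
    """Weighted sum of component scores."""
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing components: {missing}")
    return sum(WEIGHTS[k] * scores[k] for k in WEIGHTS)

example = {
    "task_completion": 80,
    "evidence_and_state": 66,
    "recovery": 76,
    "tool_and_policy": 72,   # assumed value for illustration
    "cost_and_latency": 55,
}
# reliability_index(example) comes out around 72
```

Keeping the components alongside the index is the whole point: two agents with the same index can differ sharply on recovery or cost discipline, and that difference should stay visible to buyers.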

Model class               Index  Complete  Proof  Recover  Cost/latency
Frontier reasoning model     72        80     66       76            55
Fast mid-tier model          65        69     59       62            83
Open-weight local model      47        49     44       44            67
Small routing model          37        32     32       30            94

Prototype run matrix

How a buyer should read benchmark tradeoffs

Open Benchmark Lab
Model class               Workflow  Agentic  Cost  Latency  Interpretation
Frontier reasoning model        88       79    50       60  Best quality and recovery, expensive for high-volume routing.
Fast mid-tier model             76       68    81       84  Best default for support, extraction, and most back-office work.
Open-weight local model         61       52    74       58  Useful when privacy/control beats maximum benchmark quality.
Small routing model             52       40    94       92  Cheap classifier/router, not a final-answer model for hard cases.