Leaderboards

Rankings for workflows, agents, cost, and latency.

These are prototype tracks for a public Indian AI benchmarking layer. The goal is not a single universal score; it is a set of buyer-relevant views with visible methodology.

Designing

Indian Workflow Index

69

Support, finance, legal, sales, document, and multilingual tasks scored by outcome, not trivia recall.

Task completion
Hindi-English robustness
Citation quality
Escalation handling

Prototype

Agentic Reliability Index

53

Measures tool-use planning, recovery, browser discipline, terminal usage, and whether agents can verify their own work.

Planning
Tool calls
Recovery
Verification

Live draft

Cost Efficiency Index

75

Converts provider pricing into cost per completed workflow using cache hit assumptions, output length, and retries.

Input cost
Output cost
Retry rate
Batch/cache leverage
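The pricing-to-workflow conversion described above can be sketched as follows. All numbers and parameter names here are illustrative assumptions, not real provider rates or measured workloads.

```python
# Sketch: convert per-token pricing into cost per completed workflow,
# folding in cache hits and retries. Prices are USD per 1M tokens and
# purely illustrative.
def cost_per_completed_workflow(
    input_price_per_m: float,
    output_price_per_m: float,
    input_tokens: int,
    output_tokens: int,
    cache_hit_rate: float,    # fraction of input tokens served from cache
    cached_discount: float,   # cached tokens billed at this fraction of list price
    retry_rate: float,        # expected extra attempts per completed workflow
) -> float:
    cached = input_tokens * cache_hit_rate
    uncached = input_tokens - cached
    input_cost = (uncached + cached * cached_discount) * input_price_per_m / 1e6
    output_cost = output_tokens * output_price_per_m / 1e6
    attempts = 1 + retry_rate
    return (input_cost + output_cost) * attempts

# Example: $3/$15 pricing, 20k-in/1k-out workflow, 50% cache hits at a
# 90% discount, and a 20% retry rate.
cost = cost_per_completed_workflow(3.0, 15.0, 20_000, 1_000, 0.5, 0.1, 0.2)
```

The point of the model is that retries and cache leverage shift rankings: a cheap model that retries often can cost more per completed workflow than a pricier model that finishes in one attempt.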

Collecting

Latency and Throughput Index

74

Separates time-to-first-token, output tokens per second, queueing behavior, and provider variance.

TTFT
Tokens/sec
P95 latency
Availability
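A minimal sketch of how three of these views can be computed from raw request samples, assuming each sample records time-to-first-token, total wall time, and output token count (availability and queueing are omitted here):

```python
import statistics

def latency_summary(samples):
    """samples: list of (ttft_seconds, total_seconds, output_tokens)."""
    ttfts = sorted(s[0] for s in samples)
    totals = sorted(s[1] for s in samples)
    # Tokens/sec over the generation phase, i.e. after the first token.
    tps = [tok / (total - ttft) for ttft, total, tok in samples if total > ttft]

    def p95(xs):
        # Nearest-rank percentile on a sorted list; coarse but adequate
        # for dashboard-level summaries.
        return xs[min(len(xs) - 1, int(0.95 * len(xs)))]

    return {
        "ttft_median": statistics.median(ttfts),
        "tokens_per_sec_median": statistics.median(tps),
        "p95_total_latency": p95(totals),
    }
```

Separating TTFT from generation speed matters because a provider can have excellent streaming throughput while queueing badly under load, and a single averaged latency number hides exactly that variance.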

Methodology before rankings.

Every public number should trace back to task provenance, model settings, sample count, scoring rubric, and the failure cases the score hides.

Read methodology
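One way to make that traceability concrete is to attach a provenance record to every published number. The field names and values below are illustrative assumptions, not a published schema.

```python
import json

# Hypothetical provenance record for one published score; every field
# name and value here is illustrative, not a real schema or dataset.
record = {
    "task_provenance": {"suite": "support-hi-en", "source": "curated tickets"},
    "model_settings": {"temperature": 0.2, "max_output_tokens": 1024},
    "sample_count": 16,
    "scoring_rubric": "task_completion_v1",
    "known_failure_cases": ["escalation loops", "citation drift"],
}
print(json.dumps(record, indent=2))
```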

Prototype model comparison

Frontier reasoning model: best quality and recovery.
Fast mid-tier model: best default for support.
Open-weight local model: useful when privacy/control beats maximum benchmark quality.
Small routing model: cheap classifier/router.

Agentic Reliability Formula

A trace-derived index for coding, browser, support, and security agents.

Compares coding, browser, and support agents by completion, state proof, recovery, tool and policy correctness, and cost-latency discipline. The formula keeps each of those components visible in the published scores instead of hiding them behind a single rank.

4 suites
16 traces
5 weights

Task completion (30%)
Measures whether the agent produced the accepted workflow result.

Evidence and state verification (25%)
Rewards proof that the browser, codebase, support policy, or tool state actually reached the target.

Recovery behavior (20%)
Separates agents that recover from validation errors, failed tools, and partial state from agents that simply stop.

Tool and policy correctness (15%)
Captures tool discipline, policy adherence, escalation quality, and repository hygiene.

Cost and latency discipline (10%)
Prevents slow or expensive agents from ranking well unless quality justifies the operating cost.
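The five weights above combine into a single index as a weighted sum. The weights come from the rubric; the component scores in the example are illustrative assumptions, not published measurements.

```python
# Weights copied from the rubric above. Component scores are assumed to
# be on a 0-100 scale; the example values are illustrative only.
WEIGHTS = {
    "task_completion": 0.30,
    "evidence_and_state": 0.25,
    "recovery": 0.20,
    "tool_and_policy": 0.15,
    "cost_and_latency": 0.10,
}

def reliability_index(scores: dict) -> float:
    """Weighted sum of component scores."""
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing components: {missing}")
    return sum(WEIGHTS[k] * scores[k] for k in WEIGHTS)

example = {
    "task_completion": 80,
    "evidence_and_state": 66,
    "recovery": 76,
    "tool_and_policy": 72,   # assumed value for illustration
    "cost_and_latency": 55,
}
# reliability_index(example) comes out around 72
```

Keeping the components alongside the index is the whole point: two agents with the same index can differ sharply on recovery or cost discipline, and that difference should stay visible to buyers.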

Model class               Index  Complete  Proof  Recover  Cost/latency
Frontier reasoning model     72        80     66       76            55
Fast mid-tier model          65        69     59       62            83
Open-weight local model      47        49     44       44            67
Small routing model          37        32     32       30            94

Prototype run matrix

How a buyer should read benchmark tradeoffs

Open Benchmark Lab
Model class               Workflow  Agentic  Cost  Latency  Interpretation
Frontier reasoning model        88       79    50       60  Best quality and recovery, expensive for high-volume routing.
Fast mid-tier model             76       68    81       84  Best default for support, extraction, and most back-office work.
Open-weight local model         61       52    74       58  Useful when privacy/control beats maximum benchmark quality.
Small routing model             52       40    94       92  Cheap classifier/router, not a final-answer model for hard cases.