# Agentic Reliability Index

The Agentic Reliability Index is a trace-derived score for comparing coding, browser, and support agents on five components: task completion, state proof, recovery, tool and policy correctness, and cost-latency discipline.

Scores are computed from the current benchmark task traces and suite rows. Replace any synthetic rows with real provider or agent harness exports before treating the index as a public procurement ranking.

## Formula Weights

| Component | Weight | Source field | Rationale |
| --- | ---: | --- | --- |
| Task completion | 30% | score | Measures whether the agent produced the accepted workflow result. |
| Evidence and state verification | 25% | scoreCalculation evidence/proof labels | Rewards proof that the browser, codebase, support policy, or tool state actually reached the target. |
| Recovery behavior | 20% | recovery | Separates agents that recover from validation errors, failed tools, and partial state from agents that simply stop. |
| Tool and policy correctness | 15% | toolCalls and scoringFocus | Captures tool discipline, policy adherence, escalation quality, and repository hygiene. |
| Cost and latency discipline | 10% | costIndex and latencyIndex | Prevents slow or expensive agents from ranking well unless quality justifies the operating cost. |
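
The weights above can be combined as a plain weighted average. The following is a minimal sketch, not the harness's actual scoring code: it assumes each component has already been reduced to a single score on a 0-100 scale, and the names `WEIGHTS` and `weighted_score` and the component keys are hypothetical. How the raw source fields (for example costIndex and latencyIndex) are turned into component scores is not specified here and is left out.

```python
from typing import Mapping

# Percentage weights copied from the Formula Weights table; they sum to 100.
# The dictionary keys are hypothetical names chosen for this sketch.
WEIGHTS = {
    "task_completion": 30,
    "evidence_state_verification": 25,
    "recovery": 20,
    "tool_policy_correctness": 15,
    "cost_latency_discipline": 10,
}


def weighted_score(components: Mapping[str, float]) -> float:
    """Return the weighted average of the component scores.

    Assumption: every component score is already on a 0-100 scale.
    """
    missing = WEIGHTS.keys() - components.keys()
    if missing:
        raise ValueError(f"missing component scores: {sorted(missing)}")
    for name in WEIGHTS:
        if not 0 <= components[name] <= 100:
            raise ValueError(f"{name} must be between 0 and 100")
    # Multiply each score by its percentage weight, then divide by 100.
    return sum(WEIGHTS[name] * components[name] for name in WEIGHTS) / 100


if __name__ == "__main__":
    # Made-up component scores for illustration only, not benchmark data.
    example = {
        "task_completion": 90,
        "evidence_state_verification": 80,
        "recovery": 70,
        "tool_policy_correctness": 60,
        "cost_latency_discipline": 50,
    }
    print(weighted_score(example))  # 75.0
```

Keeping the weights as integer percentages makes it easy to check that they sum to 100 and avoids small floating-point drift in the result.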

## Current Rows

| Model class | Weighted score | Traces | Interpretation |
| --- | ---: | ---: | --- |
| Frontier reasoning model | 72 | 16 | Useful on constrained workflows with reviewer oversight and fallback routing. |
| Fast mid-tier model | 65 | 16 | Useful on constrained workflows with reviewer oversight and fallback routing. |
| Open-weight local model | 47 | 16 | Research or routing role only until recovery and state proof improve. |
| Small routing model | 37 | 16 | Research or routing role only until recovery and state proof improve. |