# Coding Agent Maintenance Suite Buyer Brief

Generated: 2026-05-16T00:00:00+05:30
Dataset: 0.1.0
Status: Prototype

## Executive Readout

- Average score: 56
- Current leader: Frontier reasoning model (76)
- Sample size: 20 tasks
- Public/private split: 8 public / 12 private holdout tasks
- Inspectable trace packets: 4

Verdict: usable for constrained workflows with fallback routing.

## Model Comparison

| Model class | Provider | Score | Pass rate | Recovery | Cost index | P95 latency |
| --- | --- | --- | --- | --- | --- | --- |
| Frontier reasoning model | Frontier API provider | 76 | 78 | 74 | 47 | 5848ms |
| Fast mid-tier model | Fast hosted API provider | 62 | 65 | 58 | 79 | 4916ms |
| Open-weight local model | Self-hosted/open-weight stack | 51 | 54 | 45 | 71 | 6210ms |
| Small routing model | Low-cost routing endpoint | 34 | 31 | 28 | 94 | 4774ms |
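
The brief does not state how the headline average is derived. As a minimal sketch, assuming it is the unweighted mean of the four Score values in the table above, the arithmetic can be checked as follows (the variable and function names are illustrative, not part of any harness):

```python
# Scores copied from the Model Comparison table above.
scores = {
    "Frontier reasoning model": 76,
    "Fast mid-tier model": 62,
    "Open-weight local model": 51,
    "Small routing model": 34,
}


def mean_score(by_model: dict[str, int]) -> float:
    """Unweighted mean across model classes (an assumption about the brief's method)."""
    return sum(by_model.values()) / len(by_model)


average = mean_score(scores)
leader = max(scores, key=scores.get)

print(f"mean={average:.2f} rounded={round(average)} leader={leader}")
```

Under that assumption the mean is 55.75, which rounds to the reported 56. A task-weighted or split-weighted average could give a different figure.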

## Representative Trace Packets

| Task | Domain | Split | Difficulty | Top run | Score |
| --- | --- | --- | --- | --- | --- |
| Fix Command-K search regression | Frontend | public | Medium | Frontier reasoning model | 78 |
| Repair failing static build | Build | holdout | Hard | Frontier reasoning model | 74 |
| Add Playwright smoke test | Frontend QA | public | Medium | Frontier reasoning model | 77 |
| Refactor API error handling | Backend | holdout | Hard | Frontier reasoning model | 75 |

## Task Mix

| Category | Share |
| --- | --- |
| Bug reproduction | 24% |
| Patch planning | 22% |
| Test repair | 28% |
| Frontend QA | 26% |

## Scoring Rubric

- Patch correctness
- Regression rate
- Tool discipline
- Review readiness

## Leaderboard Controls

- Freshness: The public sample is refreshed monthly; the private holdout stays sealed until replacement tasks exist.
- Leakage policy: Do not use tasks sourced from public examples, vendor demos, or training-contaminated snippets without replacement variants.
- Repeat-run rule: Re-run any result that falls within five points of a leaderboard boundary on at least three seeds.
- Retirement rule: Retire a task when frontier and mid-tier models cluster near the ceiling or when source material becomes widely circulated.
- Required provenance: traceId, createdAt, split, source, modelVersion, runSeed, reviewerNote, retirementStatus
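
The provenance and repeat-run controls above could be enforced mechanically. The sketch below is one possible shape, not the suite's actual implementation: the function names, the constant names, and the idea of passing boundaries as a list of scores are all assumptions, and "within five points" is read as inclusive.

```python
from typing import Any, Mapping

# Field names copied from the "Required provenance" bullet.
REQUIRED_PROVENANCE = (
    "traceId", "createdAt", "split", "source",
    "modelVersion", "runSeed", "reviewerNote", "retirementStatus",
)

# The numbers come from the repeat-run rule; the constant names are hypothetical.
BOUNDARY_MARGIN = 5  # "within five points", treated as inclusive (assumption)
MIN_SEEDS = 3        # "at least three seeds"


def missing_provenance(record: Mapping[str, Any]) -> list[str]:
    """Return the required provenance fields that are absent or empty."""
    return [key for key in REQUIRED_PROVENANCE if record.get(key) in (None, "")]


def needs_repeat_run(score: float, boundaries: list[float], seeds_run: int) -> bool:
    """True when a score is near any boundary and has been run on too few seeds."""
    near_boundary = any(abs(score - b) <= BOUNDARY_MARGIN for b in boundaries)
    return near_boundary and seeds_run < MIN_SEEDS
```

For example, with a hypothetical boundary at 60, `needs_repeat_run(62, [60], seeds_run=1)` flags the result for more seeds, while a score of 76 with the same boundary is left alone. The brief does not define where leaderboard boundaries sit, so those values would need to come from the leaderboard configuration.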

## Recommended Next Step

Use this brief to decide which workflow should become a real private eval run. Replace synthetic rows with harness exports that include raw prompts, exact model identifiers, latency samples, screenshots or tool logs, scorer identity, and replay links.
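
As a rough illustration of what one harness export row might carry, the sketch below maps each item in the list above to a field and derives a P95 latency from the raw samples. Every field name is hypothetical, and the nearest-rank percentile is only one of several conventions; the brief does not say which one the harness uses.

```python
from dataclasses import dataclass


@dataclass
class HarnessExport:
    """One exported run; all field names here are hypothetical."""
    raw_prompt: str
    model_id: str               # exact model identifier
    latency_ms: list[int]       # latency samples in milliseconds
    artifact_paths: list[str]   # screenshots or tool logs
    scorer_id: str              # scorer identity
    replay_url: str             # replay link


def p95_nearest_rank(samples: list[int]) -> int:
    """95th percentile using the nearest-rank method."""
    if not samples:
        raise ValueError("need at least one latency sample")
    ordered = sorted(samples)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n) in integer arithmetic
    return ordered[rank - 1]


# Placeholder values for illustration only; none of these come from a real run.
example = HarnessExport(
    raw_prompt="(raw prompt text)",
    model_id="example-model-id",
    latency_ms=list(range(100, 2100, 100)),
    artifact_paths=["logs/example-tool-log.txt"],
    scorer_id="example-scorer",
    replay_url="https://example.invalid/replay/1",
)
print(p95_nearest_rank(example.latency_ms))
```

Keeping the raw latency samples in the export, rather than only a precomputed P95, lets a reviewer recompute the percentile under a different convention.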

Contact: sanjay@edxperimentallabs.com or saujas@edxperimentallabs.com
