# Support Agent Policy Suite Buyer Brief

Generated: 2026-05-16T00:00:00+05:30
Dataset: 0.1.0
Status: Consulting

## Executive Readout

- Average score: 68
- Current leader: Frontier reasoning model (86)
- Sample size: 20 tasks
- Public/private split: 8 public / 12 private holdout tasks
- Inspectable trace packets: 4

Strong candidate; inspect cost and latency before production use.

## Model Comparison

| Model class | Provider | Score | Pass rate | Recovery | Cost index | P95 latency |
| --- | --- | --- | --- | --- | --- | --- |
| Frontier reasoning model | Frontier API provider | 86 | 89 | 82 | 51 | 5512ms |
| Fast mid-tier model | Fast hosted API provider | 79 | 83 | 70 | 83 | 4790ms |
| Open-weight local model | Self-hosted/open-weight stack | 58 | 60 | 47 | 74 | 6042ms |
| Small routing model | Low-cost routing endpoint | 49 | 45 | 35 | 93 | 4816ms |

## Representative Trace Packets

| Task | Domain | Split | Difficulty | Top run | Score |
| --- | --- | --- | --- | --- | --- |
| Refund policy boundary case | Refund decisions | public | Medium | Frontier reasoning model | 88 |
| Regional-language human handoff | Language handoff | holdout | Hard | Frontier reasoning model | 83 |
| Subscription downgrade save | Refund decisions | public | Medium | Frontier reasoning model | 87 |
| PII redaction escalation | Policy lookup | holdout | Hard | Frontier reasoning model | 85 |

## Task Mix

| Category | Share |
| --- | --- |
| Refund decisions | 25% |
| Policy lookup | 25% |
| De-escalation | 25% |
| Language handoff | 25% |

## Scoring Rubric

- Resolution rate
- Policy compliance
- Tone control
- Escalation precision

## Leaderboard Controls

- Freshness: Public sample refreshed monthly while private holdout stays sealed until replacement tasks exist.
- Leakage policy: Do not use tasks sourced from public examples, vendor demos, or training-contaminated snippets without replacement variants.
- Repeat-run rule: Repeat any result within five points of a leaderboard boundary across at least three seeds.
- Retirement rule: Retire a task when frontier and mid-tier models cluster near the ceiling or when source material becomes widely circulated.
- Required provenance: traceId, createdAt, split, source, modelVersion, runSeed, reviewerNote, retirementStatus

## Recommended Next Step

Use this brief to decide which workflow should become a real private eval run. Replace synthetic rows with harness exports that include raw prompts, exact model identifiers, latency samples, screenshots or tool logs, scorer identity, and replay links.

Contact: sanjay@edxperimentallabs.com or saujas@edxperimentallabs.com
