Back to Benchmark Lab

Consulting

Support Agent Policy Suite

Customer-support simulations that test policy adherence, multilingual tone, escalation safety, refund/exception handling, and hallucination resistance.

Current leader

Frontier reasoning model

Strong candidate; inspect cost and latency before production use.

Score86
Pass rate89
Recovery82

Task mix

What the suite measures

Refund decisions25%
Policy lookup25%
De-escalation25%
Language handoff25%
Model classProviderScorePassRetryP95Reviewer note
Frontier reasoning modelFrontier API provider86896%5512msStrong candidate; inspect cost and latency before production use.
Fast mid-tier modelFast hosted API provider798310%4790msUsable for constrained workflows with fallback routing.
Open-weight local modelSelf-hosted/open-weight stack586018%6042msUse only for narrow routing, triage, or privacy-constrained baselines.
Small routing modelLow-cost routing endpoint494522%4816msUse only for narrow routing, triage, or privacy-constrained baselines.

Scoring rubric

Resolution rate
Policy compliance
Tone control
Escalation precision

Run provenance

Generated at: 2026-05-16T00:00:00+05:30
Dataset version: 0.1.0
Trace ingest paths: data/benchmark-trace-input.json, data/benchmark-trace-runs.csv
Run date: 2026-05-16
Synthetic v0.1 benchmark dataset for website scaffolding. Aggregate suite rows are generated from script constants; task traces are ingested from data/benchmark-trace-input.json and data/benchmark-trace-runs.csv when present. Replace both with real model/provider runs as benchmark harnesses come online.

Leaderboard control metadata

Controls attached to this benchmark run

These fields make the suite auditable: the public/private split, freshness policy, leakage policy, repeat-run rule, retirement trigger, and provenance fields are generated with the benchmark data instead of being described only in prose.

Split

8 public / 12 private holdout tasks

Public share

40%

Holdout share

60%

Repeat rule

Repeat any result within five points of a leaderboard boundary across at least three seeds.

Freshness

Public sample refreshed monthly while private holdout stays sealed until replacement tasks exist.

Leakage policy

Do not use tasks sourced from public examples, vendor demos, or training-contaminated snippets without replacement variants.

Retirement rule

Retire a task when frontier and mid-tier models cluster near the ceiling or when source material becomes widely circulated.

Required provenance

traceId, createdAt, split, source, modelVersion, runSeed, reviewerNote, retirementStatus

Next data step

Replace this synthetic v0.1 run with real provider traces.

The page is wired to generated data already, including JSON task packets and CSV trace rows. The next engineering task is to point the importer at actual benchmark harness exports with model name, provider, settings, latency samples, retries, tool traces, and reviewer notes.

Read methodology