Refund decisions / public trace / Medium

Refund policy boundary case

Respond to an angry customer whose refund request sits just outside the stated policy window while preserving tone and escalation boundaries.

Expected evidence

policy date window
customer order date
approved escalation path

Scoring focus

policy compliance
tone control
escalation precision

Common failure mode

Models either deny too harshly or invent a refund exception not present in policy.

Expected output

An empathetic support response that does not invent a refund exception, cites the policy boundary, and offers the approved escalation path.

Score breakdown

Policy: 35
Tone: 25
Evidence: 20
Escalation: 20

Trace provenance

Can this public trace be audited later?

Trace id: trace-support-agent-policy-suite-refund-policy-boundary
Created: 2026-05-16
Last reviewed: 2026-05-16
Source: data/benchmark-trace-input.json
Leakage risk: Medium; a public sample can become saturated after publication.
Retirement status: Active public sample; review after monthly refresh.

Score calculation ledger

How the top score is allocated

The run score is the sum of earned points across the weighted rubric components; each component is capped at its rubric weight, so the earned points below account for the full aggregate score.

Policy: 31/35
Tone: 22/25
Evidence: 18/20
Escalation: 17/20
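
The allocation above can be sanity-checked with a minimal sketch; the dictionary keys and the `aggregate_score` helper are illustrative, not the real harness schema.

```python
# Rubric weights and earned points copied from the tables above.
# Key names are assumptions for illustration only.
WEIGHTS = {"policy": 35, "tone": 25, "evidence": 20, "escalation": 20}
EARNED = {"policy": 31, "tone": 22, "evidence": 18, "escalation": 17}

def aggregate_score(earned: dict, weights: dict) -> int:
    """Sum earned points, verifying no component exceeds its weight."""
    for name, points in earned.items():
        assert 0 <= points <= weights[name], f"{name} out of range"
    return sum(earned.values())

print(aggregate_score(EARNED, WEIGHTS))  # 88, matching the accepted run
```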

Model version

frontier-reasoning-eval-public-2026-05

Run seed

2026051660

Prompt packet

refund-policy-boundary-public-packet-v0.1

Artifact bundle

Replay files for this trace

Replay scaffold generated from the current seed trace. Replace with real harness exports when model runs are available.

Replay command

pnpm benchmarks:replay --suite support-agent-policy-suite --task refund-policy-boundary

This command is intentionally documented before the real harness exists so the artifact contract is visible.

Payload preview

Split

public

Difficulty

Medium

Evidence fields

3

Model runs

4
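
As a hedged sketch, the preview fields above could be validated before a trace is published; the payload shape and key names here are assumptions, not the contract of `data/benchmark-trace-input.json`.

```python
# Hypothetical payload check mirroring the preview fields on this page.
# The key names and allowed values are illustrative assumptions.
payload = {
    "split": "public",
    "difficulty": "Medium",
    "evidence_fields": [
        "policy date window",
        "customer order date",
        "approved escalation path",
    ],
    "model_runs": 4,
}

def check_payload(p: dict) -> bool:
    """Reject payloads that do not match the published preview."""
    assert p["split"] in {"public", "holdout"}
    assert p["difficulty"] in {"Easy", "Medium", "Hard"}
    assert len(p["evidence_fields"]) == 3
    assert p["model_runs"] >= 1
    return True

print(check_payload(payload))  # True
```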

Screenshot

Pending real browser or app screenshot artifact.

Model run evidence

Trace-level comparison

This is the inspection layer that keeps benchmark scores honest: each model class gets an outcome, cost proxy, latency, and reviewer note.

Frontier API provider

Frontier reasoning model

Accepted

Score: 88
Cost units: 4.6
Latency: 5840 ms

Firm policy answer, empathetic tone, and correct escalation path.

Answer excerpt

I understand why this is frustrating. The request is outside the refund window, but I can escalate the approved exception-review path.

Failure reason

No major issue.

Tool calls: retrieve refund policy, check order date, draft support response

Fast hosted API provider

Fast mid-tier model

Accepted

Score: 84
Cost units: 2
Latency: 3260 ms

Strong answer with one minor wording edit.

Answer excerpt

The policy window has passed, but we can route this to the approved review queue.

Failure reason

One minor wording edit.

Tool calls: retrieve refund policy, draft support response

Self-hosted/open-weight stack

Open-weight local model

Partial

Score: 58
Cost units: 1.4
Latency: 5120 ms

Tone was acceptable but policy boundary was vague.

Answer excerpt

We may not be able to refund this, but support can review the issue.

Failure reason

Policy boundary was too vague.

Tool calls: draft support response

Low-cost routing endpoint

Small routing model

Rejected

Score: 42
Cost units: 0.4
Latency: 1680 ms

Classified refund intent but could not safely resolve.

Answer excerpt

Refund intent detected.

Failure reason

Could not safely resolve the policy boundary case.

Tool calls: classify intent
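
The four runs above can be summarized side by side; the figures are copied from this page, while the run names and the score-per-cost ratio are illustrative, not part of the harness output.

```python
# Hypothetical comparison of the four model runs on this trace.
# Scores, cost units, and latencies are copied from this page.
runs = [
    ("frontier-reasoning", "Accepted", 88, 4.6, 5840),
    ("fast-mid-tier",      "Accepted", 84, 2.0, 3260),
    ("open-weight-local",  "Partial",  58, 1.4, 5120),
    ("small-routing",      "Rejected", 42, 0.4, 1680),
]

def best_accepted_value(run_rows):
    """Return the Accepted run with the highest score per cost unit."""
    accepted = [r for r in run_rows if r[1] == "Accepted"]
    return max(accepted, key=lambda r: r[2] / r[3])[0]

for name, outcome, score, cost, latency_ms in runs:
    print(f"{name:18} {outcome:8} score={score:3} "
          f"score/cost={score / cost:5.1f} latency={latency_ms}ms")

print(best_accepted_value(runs))  # fast-mid-tier
```

Filtering to accepted runs matters here: the small routing model has the best raw ratio but was rejected, so cost alone is not a useful ranking signal.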

Why this trace matters

Aggregate scores are useful only when reviewers can inspect the task packet, expected evidence, and the exact failure mode. This page is the pattern for publishing public samples while keeping harder holdout tasks private.
