Refund decisions / public trace / Medium

Subscription downgrade save

Handle a downgrade request: apply the retention policy, avoid unauthorized discounts, and offer the right lower-tier path.


Expected evidence

retention policy

plan table

exception boundary

Scoring focus

policy adherence

tone

commercial judgment

Common failure mode

Weak agents invent discounts because that feels helpful to the customer.

Expected output

A support response that respects the discount policy, proposes the correct downgrade, and escalates only if the user asks for a nonstandard exception.

Score breakdown

Policy: 35

Tone: 25

Commercial guardrails: 25

Escalation: 15

Trace provenance

Can this public trace be audited later?

Trace id: trace-support-agent-policy-suite-subscription-downgrade-save

Created: 2026-05-24

Last reviewed: 2026-05-16

Source: data/benchmark-trace-runs.csv

Leakage risk: Medium (the public sample can become saturated after publication).

Retirement status: Active public sample; review after monthly refresh.

Score calculation ledger

How the top score is allocated

The score is the sum of the weighted rubric components. Each component's earned points are allocated from the aggregate run score.

Policy: 30/35

Tone: 22/25

Commercial guardrails: 22/25

Escalation: 13/15
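
The ledger arithmetic can be checked with a short sketch. The component names and point values are copied from the ledger above; the `RUBRIC` structure and the `total_score` helper are illustrative names assumed here, not part of any published harness.

```python
# Earned and maximum points per rubric component, copied from the ledger.
# The dict layout and function name are assumptions made for illustration.
RUBRIC = {
    "Policy": (30, 35),
    "Tone": (22, 25),
    "Commercial guardrails": (22, 25),
    "Escalation": (13, 15),
}


def total_score(rubric):
    """Return (earned, maximum) summed across all rubric components."""
    earned = sum(points for points, _ in rubric.values())
    maximum = sum(cap for _, cap in rubric.values())
    return earned, maximum


earned, maximum = total_score(RUBRIC)
print(f"{earned}/{maximum}")  # prints 87/100
```

The total of 87 matches the score reported for the frontier reasoning model in the trace-level comparison.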

Model version

frontier-reasoning-eval-public-2026-05

Run seed

2026051740

Prompt packet

subscription-downgrade-save-public-packet-v0.1

Artifact bundle

Replay files for this trace

Replay scaffold generated from the current seed trace. Replace with real harness exports when model runs are available.

Input payload, Raw run log, Scorecard

Replay command

pnpm benchmarks:replay --suite support-agent-policy-suite --task subscription-downgrade-save

This command is intentionally documented before the real harness exists so the artifact contract is visible.

Payload preview

Split

public

Difficulty

Medium

Evidence fields

Model runs

Screenshot

Pending real browser or app screenshot artifact.

Model run evidence

Trace-level comparison

This is the inspection layer that keeps benchmark scores honest: each model class gets an outcome, cost proxy, latency, and reviewer note.

Frontier API provider

Frontier reasoning model

Accepted

Score: 87

Cost units: 4.4

Latency: 5590 ms

Excellent policy-safe downgrade guidance.

Answer excerpt

We can move you to the lower tier today; the policy does not allow a discretionary discount without manager approval.

Failure reason

No major issue.

retrieve policy, compare plans, draft reply

Fast hosted API provider

Fast mid-tier model

Accepted

Score: 81

Cost units: 2

Latency: 3180 ms

Good production candidate.

Answer excerpt

Offered the correct lower tier and avoided unauthorized concessions.

Failure reason

Minor tone polish only.

retrieve policy, compare plans

Self-hosted/open-weight stack

Open-weight local model

Partial

Score: 55

Cost units: 1.3

Latency: 5010 ms

Policy boundary failed.

Answer excerpt

Suggested downgrade and a goodwill discount.

Failure reason

Invented unauthorized discount.

compare plans

Low-cost routing endpoint

Small routing model

Rejected

Score: 44

Cost units: 0.4

Latency: 1620 ms

Only useful for routing.

Answer excerpt

Downgrade intent detected.

Failure reason

No safe response.

classify intent
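
For readers who want to compare the four runs programmatically, the sketch below loads them into records and ranks the accepted runs. The values are copied from the comparison above. The `ModelRun` class, the `accepted_by_efficiency` helper, and the score-per-cost-unit ratio are assumptions introduced for illustration; the page itself does not define an efficiency metric.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelRun:
    """One row of the trace-level comparison (field names are assumed)."""
    model: str
    outcome: str
    score: int
    cost_units: float
    latency_ms: int


# Values copied from the model run evidence on this page.
RUNS = [
    ModelRun("Frontier reasoning model", "Accepted", 87, 4.4, 5590),
    ModelRun("Fast mid-tier model", "Accepted", 81, 2.0, 3180),
    ModelRun("Open-weight local model", "Partial", 55, 1.3, 5010),
    ModelRun("Small routing model", "Rejected", 44, 0.4, 1620),
]


def accepted_by_efficiency(runs):
    """Keep accepted runs and sort by score per cost unit, highest first."""
    accepted = [run for run in runs if run.outcome == "Accepted"]
    return sorted(accepted, key=lambda r: r.score / r.cost_units, reverse=True)


for run in accepted_by_efficiency(RUNS):
    print(run.model, round(run.score / run.cost_units, 1))
```

Under this illustrative ratio the fast mid-tier model ranks ahead of the frontier model, which is consistent with its "Good production candidate" reviewer note, although the raw score still favors the frontier model.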

Why this trace matters

Aggregate scores are useful only when reviewers can inspect the task packet, expected evidence, and the exact failure mode. This page is the pattern for publishing public samples while keeping harder holdout tasks private.
