Language handoff / holdout trace / Hard

Regional-language human handoff

Handle a regional-language complaint with partial English context, summarize the issue, and hand off to the right human queue.

Expected evidence

issue summary
language identification
handoff queue

Scoring focus

language robustness
handoff safety
summary fidelity

Common failure mode

Weak support agents translate loosely and lose the operational issue that determines the queue.

Expected output

A faithful issue summary with language identification, operational reason, and correct queue handoff.
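For reviewers who want a concrete target, a minimal sketch of a passing handoff record, assuming a JSON-style contract. Every field name here is illustrative, not the harness schema.

// Hypothetical shape of a passing handoff record; field names are
// illustrative and do not come from the published harness.
interface HandoffRecord {
  language: string;          // detected customer language, e.g. a BCP 47 tag
  issueSummary: string;      // faithful, operational summary of the complaint
  operationalReason: string; // why this queue, stated in operational terms
  queue: string;             // identifier of the receiving human queue
}

// Example values are invented for illustration only.
const passingExample: HandoffRecord = {
  language: "ta-IN",
  issueSummary: "Parcel left at the wrong pickup point; customer cannot collect it.",
  operationalReason: "Logistics correction handled by an agent who can reply in the customer's language.",
  queue: "regional-language-logistics",
};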

Score breakdown

Language: 30
Summary: 30
Queue choice: 25
Safety: 15

Trace provenance

Can this public trace be audited later?

Trace id: trace-support-agent-policy-suite-regional-language-handoff
Created: 2026-05-17
Last reviewed: 2026-05-16
Source: data/benchmark-trace-input.json
Leakage risk: Low (the holdout task is not published in full).
Retirement status: Private holdout; keep sealed until replacement task exists.
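Assuming the audit log mirrors the fields above, the provenance record might serialize to something like this; the key names are guesses, and the values are copied from this page.

// Hypothetical serialization of the provenance fields above. Key names
// are assumptions; values are taken verbatim from this trace page.
const provenance = {
  traceId: "trace-support-agent-policy-suite-regional-language-handoff",
  created: "2026-05-17",
  lastReviewed: "2026-05-16",
  source: "data/benchmark-trace-input.json",
  leakageRisk: "low",
  retirementStatus: "private-holdout",
} as const;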

Score calculation ledger

How the top score is allocated

The run score is the sum of the weighted rubric components; each line below shows the points a component earned out of its maximum weight. A minimal summation sketch follows the ledger.

Language: 25/30
Summary: 25/30
Queue choice: 21/25
Safety: 12/15
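As a sanity check, a minimal summation sketch, assuming earned points add straight across components; the numbers come from the ledger above.

// Minimal sketch: the run score as the straight sum of earned component
// points from the ledger above. Weights are kept for reference only.
const components = [
  { name: "Language", earned: 25, weight: 30 },
  { name: "Summary", earned: 25, weight: 30 },
  { name: "Queue choice", earned: 21, weight: 25 },
  { name: "Safety", earned: 12, weight: 15 },
];

const runScore = components.reduce((sum, c) => sum + c.earned, 0);
console.log(runScore); // 83, matching the frontier run recorded below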

Model version

frontier-reasoning-eval-holdout-2026-05

Run seed

2026051670

Prompt packet

regional-language-handoff-holdout-packet-v0.1

Artifact bundle

Replay files for this trace

Replay scaffold generated from the current seed trace. Replace with real harness exports when model runs are available.

Replay command

pnpm benchmarks:replay --suite support-agent-policy-suite --task regional-language-handoff

This command is intentionally documented before the real harness exists so the artifact contract is visible.
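Until the harness ships, one plausible reading of that contract, sketched as the options the replay entry point would need to accept. The flag names come from the documented command; everything else is an assumption.

// Hypothetical replay entry point implied by the command above. Only the
// --suite and --task flags are documented; the rest is an assumption.
interface ReplayOptions {
  suite: string; // "support-agent-policy-suite"
  task: string;  // "regional-language-handoff"
}

function replay(opts: ReplayOptions): void {
  // Placeholder: a real harness would load the artifact bundle for the
  // task and re-score the recorded model runs against the rubric.
  console.log(`replaying ${opts.suite}/${opts.task}`);
}

replay({ suite: "support-agent-policy-suite", task: "regional-language-handoff" });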

Payload preview

Split: holdout
Difficulty: Hard
Evidence fields: 3
Model runs: 4
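Assuming the preview mirrors data/benchmark-trace-input.json, the seed payload might reduce to something like this; the key names are inferred from the preview, not read from the file.

// Hypothetical reduction of the seed payload behind the preview above.
// Key names are inferred; only the values appear on this page.
const payloadPreview = {
  split: "holdout",
  difficulty: "Hard",
  evidenceFields: ["issue summary", "language identification", "handoff queue"],
  modelRuns: 4,
} as const;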

Screenshot

Pending: a real browser or app screenshot artifact has not yet been captured.

Model run evidence

Trace-level comparison

This is the inspection layer that keeps benchmark scores honest: each model class gets an outcome, cost proxy, latency, and reviewer note.
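A hedged sketch of the per-run record that comparison implies: the inspection fields named above plus the scoring fields each card shows. The type names are assumptions; the real export format is not published with this trace.

// Hypothetical record for one row of the comparison below. Field names
// mirror the cards on this page; the real export format may differ.
type Outcome = "accepted" | "accepted-with-review" | "partial" | "rejected";

interface ModelRunEvidence {
  provider: string;        // e.g. "Frontier API provider"
  model: string;           // e.g. "Frontier reasoning model"
  outcome: Outcome;
  score: number;           // 0-100 for this run
  costUnits: number;       // cost proxy, not a billed amount
  latencyMs: number;
  reviewerNote: string;
  answerExcerpt: string;
  failureReason: string;
  evidenceSteps: string[]; // e.g. ["identify language", "summarize issue"]
}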

Frontier API provider

Frontier reasoning model

Accepted

Score: 83
Cost units: 5
Latency: 6380 ms

Correct issue summary, language tag, and human queue recommendation.

Answer excerpt

The customer is reporting a regional-language delivery issue and should be handed to the language-capable logistics queue.

Failure reason

No major issue.

identify language · summarize issue · select queue

Fast hosted API provider

Fast mid-tier model

Accepted with review

Score: 76
Cost units: 2.2
Latency: 3610 ms

Preserved core complaint but needed reviewer edit for nuance.

Answer excerpt

The complaint should be routed to regional-language support for logistics follow-up.

Failure reason

Needed reviewer edit for nuance.

identify language · summarize issue

Self-hosted/open-weight stack

Open-weight local model

Partial

Score: 53
Cost units: 1.5
Latency: 5570 ms

Identified language but summarized the operational issue too broadly.

Answer excerpt

The customer needs regional-language support.

Failure reason

Summarized the operational issue too broadly.

identify language

Low-cost routing endpoint

Small routing model

Rejected

Score: 33
Cost units: 0.4
Latency: 1880 ms

Language detection only; unsafe final handoff.

Answer excerpt

Regional language detected.

Failure reason

Language detection only; unsafe final handoff.

detect language

Why this trace matters

Aggregate scores are useful only when reviewers can inspect the task packet, expected evidence, and the exact failure mode. This page is the pattern for publishing public samples while keeping harder holdout tasks private.
