Support / holdout trace / Hard

Hindi-English refund escalation

Classify a mixed Hindi-English support ticket, apply refund policy, and decide whether to escalate based on payment and delivery evidence.

Expected evidence

refund policy clause
delivery timestamp
payment status

Scoring focus

policy adherence
language robustness
escalation judgment

Common failure mode

Weak agents over-apologize and promise a refund without checking the delivery timestamp.

Expected output

A bilingual-safe response that applies refund policy, cites delivery and payment evidence, and escalates only the payment reconciliation issue.

Score breakdown

Policy: 35
Language: 25
Evidence: 25
Escalation: 15

Trace provenance

Can this trace be audited later?

Trace id: trace-indian-enterprise-workflow-suite-hindi-english-refund-escalation
Created: 2026-05-11
Last reviewed: 2026-05-16
Source: data/benchmark-trace-input.json
Leakage risk: Low (the holdout task is not published in full).
Retirement status: Private holdout; keep sealed until replacement task exists.

Score calculation ledger

How the top score is allocated

The run score is the sum of the weighted rubric components; each component's earned points are allocated out of its listed weight, and the components add up to the aggregate run score (a worked sketch follows the list).

Policy: 30/35
Language: 22/25
Evidence: 21/25
Escalation: 13/15
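
For concreteness, a minimal sketch of how the ledger rolls up, assuming a plain sum; the component names and points come from this page, while the LedgerEntry shape is illustrative rather than part of the real harness.

// Rubric components for this run, copied from the ledger above.
type LedgerEntry = { component: string; earned: number; max: number };

const ledger: LedgerEntry[] = [
  { component: "Policy", earned: 30, max: 35 },
  { component: "Language", earned: 22, max: 25 },
  { component: "Evidence", earned: 21, max: 25 },
  { component: "Escalation", earned: 13, max: 15 },
];

// The run score is the sum of earned component points:
// 30 + 22 + 21 + 13 = 86, matching the frontier model's accepted score below.
const runScore = ledger.reduce((total, entry) => total + entry.earned, 0);
const maxScore = ledger.reduce((total, entry) => total + entry.max, 0); // 35 + 25 + 25 + 15 = 100

console.log(`${runScore}/${maxScore}`); // "86/100"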

Model version

frontier-reasoning-eval-holdout-2026-05

Run seed

2026051610

Prompt packet

hindi-english-refund-escalation-holdout-packet-v0.1

Artifact bundle

Replay files for this trace

The replay scaffold is generated from the current seed trace; replace it with real harness exports when model runs are available.

Replay command

pnpm benchmarks:replay --suite indian-enterprise-workflow-suite --task hindi-english-refund-escalation

This command is intentionally documented before the real harness exists so the artifact contract is visible.
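
Because the harness does not exist yet, the following is only a hedged sketch of an entrypoint that would honor the documented contract; the file layout, the payload field names, and the argument handling are assumptions, not the real implementation.

// Hypothetical entrypoint behind the documented command:
//   pnpm benchmarks:replay --suite indian-enterprise-workflow-suite --task hindi-english-refund-escalation
import { readFile } from "node:fs/promises";
import { parseArgs } from "node:util";

async function main(): Promise<void> {
  // --suite and --task mirror the flags shown in the replay command above.
  const { values } = parseArgs({
    options: {
      suite: { type: "string" },
      task: { type: "string" },
    },
  });

  // The payload path mirrors the Source field in the trace provenance block.
  // The split and difficulty field names are assumed, not confirmed by the page.
  const raw = await readFile("data/benchmark-trace-input.json", "utf8");
  const payload = JSON.parse(raw);

  console.log(`Replaying ${values.suite}/${values.task}`);
  console.log(`Split: ${payload.split}, difficulty: ${payload.difficulty}`);
}

main().catch((error) => {
  console.error(error);
  process.exit(1);
});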

Payload preview

Split

holdout

Difficulty

Hard

Evidence fields

3

Model runs

4
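
Put together, the preview implies a payload roughly like the sketch below; only the values (holdout, Hard, three evidence fields, four model runs) come from this page, and the field names are assumptions made for illustration.

// Illustrative shape only; the real data/benchmark-trace-input.json may differ.
interface TracePayloadPreview {
  split: "public" | "holdout";
  difficulty: string; // "Hard" for this trace
  evidenceFields: string[]; // the expected-evidence list for this task
  modelRunCount: number;
}

const preview: TracePayloadPreview = {
  split: "holdout",
  difficulty: "Hard",
  evidenceFields: ["refund policy clause", "delivery timestamp", "payment status"],
  modelRunCount: 4,
};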

Screenshot

Pending real browser or app screenshot artifact.

Model run evidence

Trace-level comparison

This is the inspection layer that keeps benchmark scores honest: each model class gets an outcome, cost proxy, latency, and reviewer note.
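
Each comparison row below can be read as a record of roughly this shape; the type and the example are a sketch based on the frontier run, not an export from the harness.

// Sketch of one comparison row; field names are illustrative.
interface ModelRunEvidence {
  provider: string;
  model: string;
  outcome: "Accepted" | "Accepted with review" | "Partial" | "Rejected";
  score: number;      // out of 100
  costUnits: number;  // relative cost proxy, not currency
  latencyMs: number;
  reviewerNote: string;
  steps: string[];    // the step chips shown under each run
}

const frontierRun: ModelRunEvidence = {
  provider: "Frontier API provider",
  model: "Frontier reasoning model",
  outcome: "Accepted",
  score: 86,
  costUnits: 5.3,
  latencyMs: 6840,
  reviewerNote: "Handled code-switching, cited policy, and escalated only the payment reconciliation issue.",
  steps: ["detect language mix", "retrieve refund policy", "check payment status"],
};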

Frontier API provider

Frontier reasoning model

Accepted

Score: 86
Cost units: 5.3
Latency: 6840 ms

Handled code-switching, cited policy, and escalated only the payment reconciliation issue.

Answer excerpt

I understand the concern. The delivery timestamp means the standard refund path does not apply, but the payment-status mismatch should be escalated.

Failure reason

No major issue.

detect language mix · retrieve refund policy · check payment status

Fast hosted API provider

Fast mid-tier model

Accepted with review

Score: 79
Cost units: 2.4
Latency: 3770 ms

Good policy application; tone needed minor localization edits.

Answer excerpt

Refund policy does not allow automatic refund here, but the payment issue can be escalated.

Failure reason

Tone needed minor localization edits.

retrieve refund policy · check payment status

Self-hosted/open-weight stack

Open-weight local model

Partial

Score: 56
Cost units: 1.6
Latency: 5710 ms

Understood complaint but missed the delivery evidence and escalated too broadly.

Answer excerpt

We should escalate this refund request for manual review.

Failure reason

Missed delivery evidence and escalated too broadly.

detect language mix

Low-cost routing endpoint

Small routing model

Rejected

Score: 45
Cost units: 0.5
Latency: 1970 ms

Classified intent but could not produce a safe final answer.

Answer excerpt

Refund intent detected.

Failure reason

Could classify intent but could not produce a safe final answer.

classify intent

Why this trace matters

Aggregate scores are useful only when reviewers can inspect the task packet, expected evidence, and the exact failure mode. This page is the pattern for publishing public samples while keeping harder holdout tasks private.
