
Finance / public trace / Medium

GST invoice discrepancy explanation

Given an invoice, purchase order, and short email thread, identify the GST mismatch and draft a vendor-facing explanation with cited evidence.

Expected evidence

invoice tax line
purchase order rate
vendor email exception

Scoring focus

numerical accuracy
evidence citation
professional escalation

Common failure mode

Models often explain the mismatch correctly but cite the wrong source row or miss the exception email.

Expected output

A short vendor-facing explanation identifying the GST mismatch, citing the invoice tax line, purchase order rate, and vendor email exception.
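The core check the task asks for can be sketched in a few lines. This is a hypothetical illustration, not the actual packet schema: the field names (`taxable_value`, `rate_percent`, `tax_amount`) and the sample figures are assumptions.

```python
def find_gst_mismatch(invoice_line: dict, po_rate_percent: float) -> dict:
    """Compare an invoice tax line against the PO rate and report any gap."""
    expected_tax = invoice_line["taxable_value"] * po_rate_percent / 100
    billed_tax = invoice_line["tax_amount"]
    return {
        "rate_mismatch": invoice_line["rate_percent"] != po_rate_percent,
        "tax_delta": round(billed_tax - expected_tax, 2),
    }

# Hypothetical case: invoice billed at 18% while the PO agreed 12%.
result = find_gst_mismatch(
    {"taxable_value": 10000.0, "rate_percent": 18.0, "tax_amount": 1800.0},
    po_rate_percent=12.0,
)
# result -> {"rate_mismatch": True, "tax_delta": 600.0}
```

The vendor-facing note would then cite the invoice tax line, the PO rate, and the exception email alongside the computed delta.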

Score breakdown

Calculation: 35
Evidence: 30
Escalation: 20
Tone: 15

Trace provenance

Can this public trace be audited later?

Trace id: trace-indian-enterprise-workflow-suite-gst-invoice-discrepancy
Created: 2026-05-10
Last reviewed: 2026-05-16
Source: data/benchmark-trace-input.json
Leakage risk: Medium (a public sample can become saturated after publication).
Retirement status: Active public sample; review after monthly refresh.

Score calculation ledger

How the top score is allocated

The run score is the sum of the weighted rubric components; each component below shows points earned against its maximum weight.

Calculation: 32/35
Evidence: 27/30
Escalation: 18/20
Tone: 14/15
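The ledger above can be verified with a straight sum over the component earned/maximum pairs, a minimal sketch:

```python
# Rubric components from the ledger: (earned, maximum) per component.
components = {
    "Calculation": (32, 35),
    "Evidence": (27, 30),
    "Escalation": (18, 20),
    "Tone": (14, 15),
}

total_earned = sum(earned for earned, _ in components.values())
total_max = sum(maximum for _, maximum in components.values())
print(total_earned, total_max)  # prints: 91 100
```

The total of 91/100 matches the accepted frontier run's score below.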

Model version

frontier-reasoning-eval-public-2026-05

Run seed

2026051600

Prompt packet

gst-invoice-discrepancy-public-packet-v0.1

Artifact bundle

Replay files for this trace

Replay scaffold generated from the current seed trace. Replace with real harness exports when model runs are available.

Replay command

pnpm benchmarks:replay --suite indian-enterprise-workflow-suite --task gst-invoice-discrepancy

This command is intentionally documented before the real harness exists so the artifact contract is visible.

Payload preview

Split

public

Difficulty

Medium

Evidence fields

3

Model runs

4

Screenshot

Pending real browser or app screenshot artifact.

Model run evidence

Trace-level comparison

This is the inspection layer that keeps benchmark scores honest: each model class gets an outcome, cost proxy, latency, and reviewer note.

Frontier API provider

Frontier reasoning model

Accepted

Score: 91
Cost units: 4.8
Latency: 6120 ms

Correct mismatch, cited all three evidence points, and proposed a clean vendor follow-up.

Answer excerpt

The invoice uses a GST rate that does not match the purchase order, and the exception email explains why the vendor needs to revise the tax line before payment.

Failure reason

Minor formatting cleanup only.

parse invoice table
compare PO tax rate
draft vendor note

Fast hosted API provider

Fast mid-tier model

Accepted with review

Score: 82
Cost units: 2.1
Latency: 3410 ms

Correct calculation and tone; reviewer had to repair one citation label.

Answer excerpt

The GST discrepancy appears to come from a rate mismatch between the invoice and the PO.

Failure reason

One citation label pointed to the right document but the wrong row.

parse invoice table
compare PO tax rate

Self-hosted/open-weight stack

Open-weight local model

Partial

Score: 61
Cost units: 1.4
Latency: 5280 ms

Found a mismatch but confused CGST/SGST allocation and needed human correction.

Answer excerpt

The invoice has a tax mismatch and should be checked against the purchase order.

Failure reason

Confused CGST/SGST allocation and omitted the vendor exception email.

parse invoice table
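The CGST/SGST confusion flagged in this run is a common one: for intra-state supplies the total GST rate splits equally between CGST and SGST, while inter-state supplies charge IGST at the full rate. A minimal sketch of the correct allocation (the function name and return shape are illustrative):

```python
def allocate_gst(total_rate: float, intra_state: bool) -> dict:
    """Split a total GST rate into its components.

    Intra-state: the rate divides equally between CGST and SGST.
    Inter-state: IGST applies at the full rate.
    """
    if intra_state:
        return {"CGST": total_rate / 2, "SGST": total_rate / 2, "IGST": 0.0}
    return {"CGST": 0.0, "SGST": 0.0, "IGST": total_rate}

allocate_gst(18.0, intra_state=True)
# -> {"CGST": 9.0, "SGST": 9.0, "IGST": 0.0}
```

A model that allocates the full 18% to CGST alone, as the open-weight run effectively did, fails this check even when it has spotted the underlying rate mismatch.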

Low-cost routing endpoint

Small routing model

Rejected

Score: 38
Cost units: 0.4
Latency: 1810 ms

Useful triage signal only; generated an unsupported tax explanation.

Answer excerpt

This looks like a finance escalation.

Failure reason

Could route the task but generated an unsupported tax explanation.

classify workflow

Why this trace matters

Aggregate scores are useful only when reviewers can inspect the task packet, expected evidence, and the exact failure mode. This page is the pattern for publishing public samples while keeping harder holdout tasks private.
