
Finance / holdout trace / Hard

GST credit note reconciliation

Compare the invoice, credit note, and ledger export to decide whether the vendor credit has been applied correctly.

Expected evidence

invoice amount
credit note amount
ledger entry

Scoring focus

reconciliation
evidence citation
finance action

Common failure mode

Weak models confuse credit note issuance with ledger application and close the task too early.

Expected output

A reconciliation memo showing the invoice amount, credit note amount, ledger treatment, and required finance action.
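A minimal sketch of the core check this task demands, and the one weak models miss: a credit note being issued is not the same as it being applied in the ledger. All type and field names here are hypothetical; the real packet schema is not published.

```typescript
// Hypothetical input shape; the actual holdout packet fields are sealed.
interface CreditNoteCheck {
  invoiceAmount: number;    // gross invoice value from the vendor invoice
  creditNoteAmount: number; // value of the credit note the vendor issued
  ledgerApplied: boolean;   // true only if the ledger shows the credit posted
}

interface ReconciliationResult {
  status: "applied" | "issued-not-applied" | "mismatch";
  action: string;
}

// The key distinction: issuance alone must not close the task.
function reconcileCredit(check: CreditNoteCheck): ReconciliationResult {
  if (check.creditNoteAmount > check.invoiceAmount) {
    return { status: "mismatch", action: "Escalate: credit note exceeds invoice." };
  }
  if (!check.ledgerApplied) {
    return {
      status: "issued-not-applied",
      action: "Keep the payable open and request posting confirmation.",
    };
  }
  return { status: "applied", action: "Close the reconciliation item." };
}
```

The `issued-not-applied` branch mirrors the accepted frontier answer on this trace; the `mismatch` branch is an illustrative extra guard, not part of the published rubric.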

Score breakdown

Math: 35
Evidence: 30
Ledger mapping: 20
Action: 15

Trace provenance

Can this public trace be audited later?

Trace id: trace-indian-enterprise-workflow-suite-gst-credit-note-reconciliation
Created: 2026-05-19
Last reviewed: 2026-05-16
Source: data/benchmark-trace-runs.csv
Leakage risk: Low (the holdout task is not published in full).
Retirement status: Private holdout; keep sealed until replacement task exists.

Score calculation ledger

How the top score is allocated

The score equals the sum of weighted rubric components; each component's earned points are allocated from the aggregate run score.

Math: 31/35
Evidence: 27/30
Ledger mapping: 18/20
Action: 13/15
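The allocation above can be sketched as a capped sum of earned component points. Names and shapes are illustrative, not the harness API; for this trace the earned points (31 + 27 + 18 + 13) sum to 89, the frontier run's score.

```typescript
// Illustrative rubric structure; weights and earned points from this trace.
interface RubricComponent {
  name: string;
  weight: number; // maximum points for the component
  earned: number; // points allocated from the run
}

const components: RubricComponent[] = [
  { name: "Math", weight: 35, earned: 31 },
  { name: "Evidence", weight: 30, earned: 27 },
  { name: "Ledger mapping", weight: 20, earned: 18 },
  { name: "Action", weight: 15, earned: 13 },
];

// Score is the sum of earned points; each component is capped at its weight.
const total = components.reduce(
  (sum, c) => sum + Math.min(c.earned, c.weight),
  0,
);
```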

Model version

frontier-reasoning-eval-holdout-2026-05

Run seed

2026051690

Prompt packet

gst-credit-note-reconciliation-holdout-packet-v0.1

Artifact bundle

Replay files for this trace

Replay scaffold generated from the current seed trace. Replace with real harness exports when model runs are available.

Replay command

pnpm benchmarks:replay --suite indian-enterprise-workflow-suite --task gst-credit-note-reconciliation

This command is intentionally documented before the real harness exists so the artifact contract is visible.

Payload preview

Split

holdout

Difficulty

Hard

Evidence fields

3

Model runs

4

Screenshot

Pending real browser or app screenshot artifact.

Model run evidence

Trace-level comparison

This is the inspection layer that keeps benchmark scores honest: each model class gets an outcome, cost proxy, latency, and reviewer note.
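One way to sketch that inspection record, with hypothetical field names (the real export format is not published):

```typescript
// Hypothetical per-run record for trace-level comparison.
interface ModelRunRecord {
  provider: string;
  model: string;
  outcome: "Accepted" | "Accepted with review" | "Partial" | "Rejected";
  score: number;       // 0-100 rubric total
  costUnits: number;   // normalized cost proxy, not currency
  latencyMs: number;
  reviewerNote: string;
  steps: string[];     // parse/compare steps the run executed
}

// Order runs strongest-first so reviewers inspect the best evidence first.
function rankRuns(runs: ModelRunRecord[]): ModelRunRecord[] {
  return [...runs].sort((a, b) => b.score - a.score);
}
```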

Frontier API provider

Frontier reasoning model

Accepted

Score: 89
Cost units: 5.2
Latency: 6720 ms

Strong reconciliation with correct distinction between issued and applied credit.

Answer excerpt

The credit note exists but the ledger has not applied it against the invoice; finance should keep the payable open and request posting confirmation.

Failure reason

No major issue.

parse invoice → parse credit note → compare ledger

Fast hosted API provider

Fast mid-tier model

Accepted with review

Score: 80
Cost units: 2.5
Latency: 3890 ms

Correct conclusion with one evidence gap.

Answer excerpt

The credit note matches the invoice discrepancy; ledger posting still needs confirmation.

Failure reason

Reviewer added one missing ledger citation.

parse invoice → compare ledger

Self-hosted/open-weight stack

Open-weight local model

Partial

Score: 59
Cost units: 1.6
Latency: 5840 ms

Needed human correction on ledger interpretation.

Answer excerpt

The vendor has issued a credit note so the mismatch appears resolved.

Failure reason

Incorrectly treated issuance as application.

parse invoice → parse credit note

Low-cost routing endpoint

Small routing model

Rejected

Score: 37
Cost units: 0.4
Latency: 1860 ms

Routing only; not safe for finance action.

Answer excerpt

Finance discrepancy detected.

Failure reason

No reconciliation.

classify workflow

Why this trace matters

Aggregate scores are useful only when reviewers can inspect the task packet, expected evidence, and the exact failure mode. This page is the pattern for publishing public samples while keeping harder holdout tasks private.
