Backend / holdout trace / Hard

Refactor API error handling

Refactor a route handler so provider errors return typed client-safe messages while preserving server logs.
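The task statement can be illustrated with a minimal sketch. The holdout task itself is not published, so every name here (`ProviderErrorKind`, `CLIENT_ERRORS`, `toClientError`, the status codes and messages) is a hypothetical stand-in, not the benchmark's actual code. The idea is one typed map from provider failure kinds to client-safe responses, with the raw provider payload sent only to the server log.

```typescript
// Hypothetical provider failure kinds; the real task's types are not published.
type ProviderErrorKind = "timeout" | "rate_limited" | "auth_failed" | "unknown";

interface ProviderFailure {
  kind: ProviderErrorKind;
  payload: unknown; // raw provider detail, never sent to the client
}

interface ClientError {
  status: number;
  code: string;
  message: string;
}

// Typed error map: Record<ProviderErrorKind, ...> makes the compiler
// reject the map if any failure kind lacks a client-safe entry.
const CLIENT_ERRORS: Record<ProviderErrorKind, ClientError> = {
  timeout: { status: 504, code: "UPSTREAM_TIMEOUT", message: "The upstream service timed out." },
  rate_limited: { status: 429, code: "RATE_LIMITED", message: "Too many requests. Retry later." },
  auth_failed: { status: 502, code: "UPSTREAM_UNAVAILABLE", message: "The upstream service is unavailable." },
  unknown: { status: 502, code: "UPSTREAM_ERROR", message: "The request could not be completed." },
};

// Runtime guard so arbitrary thrown values fall back to "unknown".
function isProviderFailure(err: unknown): err is ProviderFailure {
  if (typeof err !== "object" || err === null) return false;
  const kind = (err as { kind?: unknown }).kind;
  return typeof kind === "string" && Object.prototype.hasOwnProperty.call(CLIENT_ERRORS, kind);
}

type LogFn = (entry: Record<string, unknown>) => void;

// Logs the full failure server-side, then returns only the redacted response.
function toClientError(err: unknown, log: LogFn): ClientError {
  const failure: ProviderFailure = isProviderFailure(err)
    ? err
    : { kind: "unknown", payload: String(err) };
  log({ level: "error", event: "provider_failure", kind: failure.kind, payload: failure.payload });
  return CLIENT_ERRORS[failure.kind];
}
```

A route handler would call `toClientError(err, logger)` inside its `catch` block and send back only the returned object. The logger is passed in as a parameter so tests can capture the server log entries.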


Expected evidence

typed error map

redacted message

server log path

Scoring focus

error handling

security

test discipline

Common failure mode

Weak agents either leak provider payloads or swallow errors without observability.

Expected output

A narrow patch with typed error mapping, no secret leakage, and tests for known provider failures.
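As a rough illustration of what "tests for known provider failures" could look like, here is a table-driven sketch. The failure kinds, the `mapFailure` function, and the sample payloads are all assumptions made for this example; they stand in for the unpublished route under test. Each case checks both the mapped status and that the raw provider payload does not appear in the client message.

```typescript
// Hypothetical failure kinds and mapper, standing in for the route under test.
type FailureKind = "timeout" | "rate_limited" | "bad_gateway";

interface SafeResponse {
  status: number;
  message: string;
}

// The raw payload is accepted but deliberately never copied into the response.
function mapFailure(kind: FailureKind, _rawPayload: string): SafeResponse {
  switch (kind) {
    case "timeout":
      return { status: 504, message: "Upstream timed out." };
    case "rate_limited":
      return { status: 429, message: "Too many requests." };
    case "bad_gateway":
      return { status: 502, message: "Upstream error." };
  }
}

// One row per known provider failure, including the timeout case.
const cases: Array<{ kind: FailureKind; raw: string; status: number }> = [
  { kind: "timeout", raw: "deadline exceeded at internal-host-7", status: 504 },
  { kind: "rate_limited", raw: "quota exhausted for account 42", status: 429 },
  { kind: "bad_gateway", raw: "stack trace: provider.js:10", status: 502 },
];

// Returns a list of failure descriptions; an empty list means all cases passed.
function runCases(): string[] {
  const failures: string[] = [];
  for (const c of cases) {
    const res = mapFailure(c.kind, c.raw);
    if (res.status !== c.status) {
      failures.push(`${c.kind}: expected ${c.status}, got ${res.status}`);
    }
    if (res.message.indexOf(c.raw) !== -1) {
      failures.push(`${c.kind}: raw provider payload leaked into client message`);
    }
  }
  return failures;
}
```

Keeping the cases in a table makes a missing failure kind easy to spot in review, which is the kind of gap a reviewer would otherwise have to patch by hand.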

Score breakdown

Types: 30

Security: 25

Tests: 25

Scope: 20

Trace provenance

Can this public trace be audited later?

Trace id: trace-coding-agent-maintenance-suite-refactor-api-error-handling

Created: 2026-05-21

Last reviewed: 2026-05-16

Source: data/benchmark-trace-runs.csv

Leakage risk: Low. The holdout task is not published in full.

Retirement status: Private holdout; keep sealed until replacement task exists.

Score calculation ledger

How the top score is allocated

The score equals the sum of the weighted rubric components. Each component's earned points are allocated from the aggregate run score.

Types: 22/30

Security: 19/25

Tests: 19/25

Scope: 15/20
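The ledger arithmetic can be checked with a short sketch. The component values are taken from the ledger above; the `RubricComponent` type and variable names are illustrative only.

```typescript
// Rubric components as listed in the ledger: earned points out of the weight.
interface RubricComponent {
  name: string;
  earned: number;
  max: number;
}

const components: RubricComponent[] = [
  { name: "Types", earned: 22, max: 30 },
  { name: "Security", earned: 19, max: 25 },
  { name: "Tests", earned: 19, max: 25 },
  { name: "Scope", earned: 15, max: 20 },
];

// Sum earned points and weights across all components.
const total = components.reduce((sum, c) => sum + c.earned, 0);
const maxTotal = components.reduce((sum, c) => sum + c.max, 0);

console.log(`${total}/${maxTotal}`); // prints "75/100"
```

The total of 75 matches the top score reported for the frontier reasoning model in the trace-level comparison.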

Model version

frontier-reasoning-eval-holdout-2026-05

Run seed

2026051710

Prompt packet

refactor-api-error-handling-holdout-packet-v0.1

Artifact bundle

Replay files for this trace

Replay scaffold generated from the current seed trace. Replace with real harness exports when model runs are available.

Input payload, Raw run log, Scorecard

Replay command

pnpm benchmarks:replay --suite coding-agent-maintenance-suite --task refactor-api-error-handling

This command is intentionally documented before the real harness exists so the artifact contract is visible.

Payload preview

Split

holdout

Difficulty

Hard

Evidence fields

Model runs

Screenshot

Pending real browser or app screenshot artifact.

Model run evidence

Trace-level comparison

This is the inspection layer that keeps benchmark scores honest: each model class gets an outcome, cost proxy, latency, and reviewer note.

Frontier API provider

Frontier reasoning model

Accepted

Score: 75

Cost units: 5.4

Latency: 7360 ms

Solid patch with good security posture.

Answer excerpt

Mapped provider failures to safe response codes while preserving structured server logs.

Failure reason

No major issue.

read route, patch error map, run tests

Fast hosted API provider

Fast mid-tier model

Accepted with review

Score: 63

Cost units: 2.4

Latency: 4560 ms

Good structure with a missing edge case.

Answer excerpt

Added client-safe errors but missed one provider timeout case.

Failure reason

Reviewer added timeout coverage.

patch route, run tests

Self-hosted/open-weight stack

Open-weight local model

Partial

Score: 46

Cost units: 1.6

Latency: 6620 ms

Patch reduced error clarity.

Answer excerpt

Wrapped the handler in a generic try/catch.

Failure reason

Lost observability and did not test provider types.

patch route

Low-cost routing endpoint

Small routing model

Rejected

Score: 24

Cost units: 0.4

Latency: 2190 ms

Cannot complete repository edits.

Answer excerpt

This is a backend bug.

Failure reason

No implementation.

classify issue

Why this trace matters

Aggregate scores are useful only when reviewers can inspect the task packet, expected evidence, and the exact failure mode. This page is the pattern for publishing public samples while keeping harder holdout tasks private.
