Backend / holdout trace / Hard

Refactor API error handling

Refactor a route handler so provider errors return typed client-safe messages while preserving server logs.

Expected evidence

typed error map
redacted message
server log path

Scoring focus

error handling
security
test discipline

Common failure mode

Weak agents either leak provider payloads to the client or swallow errors, leaving no server-side log trail.

Expected output

A narrow patch with typed error mapping, no secret leakage, and tests for known provider failures.
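As a rough illustration of what such a patch might contain, the sketch below pairs a typed error map with a separate server log path. The holdout task itself is not published, so every name here (`ProviderErrorKind`, `CLIENT_ERRORS`, `toClientError`, the status codes and messages) is a hypothetical stand-in rather than the task's actual code.

```typescript
// Hypothetical provider failure kinds; the real task's provider types are not published.
type ProviderErrorKind = "timeout" | "rate_limited" | "auth_failed" | "unknown";

interface ProviderError {
  kind: ProviderErrorKind;
  // Raw provider payload: may contain secrets, so it must never reach the client.
  payload: unknown;
}

interface ClientError {
  status: number;
  code: string;
  message: string;
}

// Typed error map: a Record over the union makes the compiler reject any unhandled kind.
const CLIENT_ERRORS: Record<ProviderErrorKind, ClientError> = {
  timeout: { status: 504, code: "UPSTREAM_TIMEOUT", message: "The upstream service timed out." },
  rate_limited: { status: 429, code: "RATE_LIMITED", message: "Too many requests. Try again later." },
  auth_failed: { status: 502, code: "UPSTREAM_ERROR", message: "The upstream service is unavailable." },
  unknown: { status: 500, code: "INTERNAL_ERROR", message: "An unexpected error occurred." },
};

interface ServerLogEntry {
  level: "error";
  kind: ProviderErrorKind;
  payload: unknown;
}

// The log sink is injected so tests can capture entries without a real logger.
type ServerLog = (entry: ServerLogEntry) => void;

function toClientError(err: ProviderError, log: ServerLog): ClientError {
  // Server log path: keep the full detail for observability.
  log({ level: "error", kind: err.kind, payload: err.payload });
  // Client path: return only a copy of the static, redacted entry from the map.
  return { ...CLIENT_ERRORS[err.kind] };
}
```

Because the client response is built only from the static map, the provider payload has no route into the response body, while the injected log sink still receives it.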

Score breakdown

Types: 30
Security: 25
Tests: 25
Scope: 20

Trace provenance

Can this public trace be audited later?

Trace id: trace-coding-agent-maintenance-suite-refactor-api-error-handling
Created: 2026-05-21
Last reviewed: 2026-05-16
Source: data/benchmark-trace-runs.csv
Leakage risk: Low. The holdout task is not published in full.
Retirement status: Private holdout; keep sealed until replacement task exists.

Score calculation ledger

How the top score is allocated

The score is the sum of the weighted rubric components. Each component's earned points are allocated from the aggregate run score.

Types: 22/30
Security: 19/25
Tests: 19/25
Scope: 15/20
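The ledger does not say how the aggregate score is split across components. One scheme that happens to reproduce the figures above is largest-remainder rounding of each component's proportional share; the sketch below assumes that scheme, an integer score, and weights that sum to 100. `RUBRIC` and `allocate` are hypothetical names.

```typescript
interface Component {
  name: string;
  weight: number;
}

// Rubric weights from the score breakdown; they sum to 100.
const RUBRIC: Component[] = [
  { name: "Types", weight: 30 },
  { name: "Security", weight: 25 },
  { name: "Tests", weight: 25 },
  { name: "Scope", weight: 20 },
];

// Assumed allocation: largest-remainder rounding of (weight * score / 100).
function allocate(score: number, rubric: Component[]): number[] {
  const exact = rubric.map((c) => (c.weight * score) / 100);
  const earned = exact.map((x) => Math.floor(x));
  // Points still unassigned after flooring every share.
  let left = score - earned.reduce((a, b) => a + b, 0);
  // Component indices by descending fractional part; ties keep rubric order.
  const order = exact
    .map((x, i) => ({ i, frac: x - Math.floor(x) }))
    .sort((a, b) => b.frac - a.frac || a.i - b.i);
  for (const { i } of order) {
    if (left <= 0) break;
    earned[i] += 1;
    left -= 1;
  }
  return earned;
}

const top = allocate(75, RUBRIC); // [22, 19, 19, 15]
```

For the top score of 75, the exact shares are 22.5, 18.75, 18.75, and 15. Flooring gives 73 points, and the two leftover points go to Security and Tests, which have the largest fractional parts.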

Model version

frontier-reasoning-eval-holdout-2026-05

Run seed

2026051710

Prompt packet

refactor-api-error-handling-holdout-packet-v0.1

Artifact bundle

Replay files for this trace

Replay scaffold generated from the current seed trace. Replace with real harness exports when model runs are available.

Replay command

pnpm benchmarks:replay --suite coding-agent-maintenance-suite --task refactor-api-error-handling

This command is intentionally documented before the real harness exists so the artifact contract is visible.

Payload preview

Split

holdout

Difficulty

Hard

Evidence fields

3

Model runs

4

Screenshot

Pending real browser or app screenshot artifact.

Model run evidence

Trace-level comparison

This is the inspection layer that keeps benchmark scores honest: each model class gets an outcome, cost proxy, latency, and reviewer note.

Frontier API provider

Frontier reasoning model

Accepted

Score: 75
Cost units: 5.4
Latency: 7360 ms

Solid patch with good security posture.

Answer excerpt

Mapped provider failures to safe response codes while preserving structured server logs.

Failure reason

No major issue.

read route, patch error map, run tests

Fast hosted API provider

Fast mid-tier model

Accepted with review

Score: 63
Cost units: 2.4
Latency: 4560 ms

Good structure with a missing edge case.

Answer excerpt

Added client-safe errors but missed one provider timeout case.

Failure reason

Reviewer added timeout coverage.

patch route, run tests
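The reviewer's timeout coverage is not shown in this trace. A table-driven check along the following lines is one way such a gap might be closed. The `classify` helper and the case table are hypothetical; the error shapes rely on `TimeoutError` and `AbortError` being the names raised by timed-out or aborted fetch calls and `ETIMEDOUT` being a Node.js system error code.

```typescript
type ProviderErrorKind = "timeout" | "unknown";

// Hypothetical classifier: recognises the common shapes a provider timeout can take.
function classify(err: unknown): ProviderErrorKind {
  if (typeof err === "object" && err !== null) {
    const e = err as { name?: unknown; code?: unknown };
    if (e.name === "TimeoutError" || e.name === "AbortError" || e.code === "ETIMEDOUT") {
      return "timeout";
    }
  }
  return "unknown";
}

// Each row pairs an input error with the kind the handler is expected to report.
const cases: Array<[unknown, ProviderErrorKind]> = [
  [{ name: "TimeoutError" }, "timeout"],
  [{ name: "AbortError" }, "timeout"],
  [{ code: "ETIMEDOUT" }, "timeout"],
  [new Error("boom"), "unknown"],
  [null, "unknown"],
];

// Returns the number of rows whose classification does not match the expectation.
function runTimeoutCases(): number {
  let failures = 0;
  for (const [input, expected] of cases) {
    if (classify(input) !== expected) failures += 1;
  }
  return failures;
}
```

Listing the timeout shapes as rows makes a missing case visible in review: a new provider failure mode becomes one more line in the table rather than a new test function.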

Self-hosted/open-weight stack

Open-weight local model

Partial

Score: 46
Cost units: 1.6
Latency: 6620 ms

Patch reduced error clarity.

Answer excerpt

Wrapped the handler in a generic try/catch.

Failure reason

Lost observability and did not test provider types.

patch route

Low-cost routing endpoint

Small routing model

Rejected

Score: 24
Cost units: 0.4
Latency: 2190 ms

Cannot complete repository edits.

Answer excerpt

This is a backend bug.

Failure reason

No implementation.

classify issue

Why this trace matters

Aggregate scores are useful only when reviewers can inspect the task packet, expected evidence, and the exact failure mode. This page is the pattern for publishing public samples while keeping harder holdout tasks private.