Frontier API provider
Frontier reasoning model
Accepted
Solid patch with good security posture.
Answer excerpt
Mapped provider failures to safe response codes while preserving structured server logs.
Failure reason
No major issue.
Backend / holdout trace / Hard
Refactor a route handler so provider errors return typed client-safe messages while preserving server logs.
Expected evidence
Scoring focus
Common failure mode
Weak agents either leak provider payloads or swallow errors without observability.
Expected output
A narrow patch with typed error mapping, no secret leakage, and tests for known provider failures.
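As a rough illustration of the typed error mapping this task asks for, the sketch below maps a provider failure to a client-safe response and sends the full detail only to a server-side log. Every name here (ProviderFailure, ClientError, toClientError, the error codes, and the choice of status codes) is a hypothetical stand-in, not part of the task packet or of any real provider SDK.

```typescript
// Hypothetical failure kinds; a real provider SDK defines its own error types.
type ProviderFailure =
  | { kind: "timeout"; elapsedMs: number }
  | { kind: "rate_limited"; retryAfterSec: number }
  | { kind: "auth"; rawBody: string }
  | { kind: "unknown"; rawBody: string };

// The only shape that is allowed to reach the client.
interface ClientError {
  status: number;
  code: "UPSTREAM_TIMEOUT" | "UPSTREAM_BUSY" | "UPSTREAM_UNAVAILABLE";
  message: string;
}

// Assumed structured logger: takes one plain object per event.
type LogFn = (entry: Record<string, unknown>) => void;

function toClientError(failure: ProviderFailure, log: LogFn): ClientError {
  // Full provider detail is recorded server-side and never returned.
  log({ event: "provider_failure", ...failure });

  switch (failure.kind) {
    case "timeout":
      return { status: 504, code: "UPSTREAM_TIMEOUT", message: "The upstream service timed out." };
    case "rate_limited":
      return { status: 503, code: "UPSTREAM_BUSY", message: "The service is busy. Please retry later." };
    case "auth":
    case "unknown":
      // Both collapse to one generic message so no provider payload leaks.
      return { status: 502, code: "UPSTREAM_UNAVAILABLE", message: "The upstream service is unavailable." };
    default: {
      // Compile-time exhaustiveness check: adding a new kind breaks the build here.
      const unreachable: never = failure;
      return unreachable;
    }
  }
}
```

The discriminated union is what makes the timeout case hard to forget: a new failure kind that is not handled in the switch fails to compile, which is the sort of gap a test for each known provider failure would also catch.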
Score breakdown
Trace provenance
Score calculation ledger
The score is the sum of the weighted rubric components. Each component's earned points are allocated from the aggregate run score.
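One way to read the ledger rule is as a proportional allocation: the aggregate run score is split across components according to their weights, so the earned points sum back to the aggregate. The sketch below assumes exactly that. The component names and weights are invented for illustration, and the real harness may allocate differently.

```typescript
interface RubricComponent {
  name: string;
  weight: number; // maximum points available for this component
}

interface ScoredComponent extends RubricComponent {
  earned: number; // points allocated from the aggregate run score
}

// Assumption: allocation is proportional to weight. `aggregate` is on the
// same scale as the total weight (for example 80 out of 100).
function allocateScore(components: RubricComponent[], aggregate: number): ScoredComponent[] {
  const totalWeight = components.reduce((sum, c) => sum + c.weight, 0);
  if (totalWeight <= 0) {
    throw new Error("Rubric weights must sum to a positive number");
  }
  if (aggregate < 0 || aggregate > totalWeight) {
    throw new RangeError("Aggregate score must lie between 0 and the total weight");
  }
  return components.map((c) => ({ ...c, earned: (aggregate * c.weight) / totalWeight }));
}

// The score is the sum of the earned points across components.
function totalScore(scored: ScoredComponent[]): number {
  return scored.reduce((sum, c) => sum + c.earned, 0);
}

// Illustrative rubric only; these weights do not come from the report.
const exampleRubric: RubricComponent[] = [
  { name: "typed error mapping", weight: 50 },
  { name: "no secret leakage", weight: 30 },
  { name: "tests for known provider failures", weight: 20 },
];

const exampleLedger = allocateScore(exampleRubric, 80);
```

Under this reading the ledger is a presentation of the aggregate rather than an independent per-component judgment, which is worth keeping in mind when comparing component rows across runs.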
Model version
frontier-reasoning-eval-holdout-2026-05
Run seed
2026051710
Prompt packet
refactor-api-error-handling-holdout-packet-v0.1
Artifact bundle
Replay scaffold generated from the current seed trace. Replace with real harness exports when model runs are available.
Replay command
pnpm benchmarks:replay --suite coding-agent-maintenance-suite --task refactor-api-error-handling
This command is intentionally documented before the real harness exists so the artifact contract is visible.
Payload preview
Split
holdout
Difficulty
Hard
Evidence fields
3
Model runs
4
Screenshot
Pending real browser or app screenshot artifact.
Model run evidence
This is the inspection layer that keeps benchmark scores honest: each model class gets an outcome, cost proxy, latency, and reviewer note.
Frontier API provider
Accepted
Solid patch with good security posture.
Answer excerpt
Mapped provider failures to safe response codes while preserving structured server logs.
Failure reason
No major issue.
Fast hosted API provider
Accepted with review
Good structure with a missing edge case.
Answer excerpt
Added client-safe errors but missed one provider timeout case.
Failure reason
Reviewer added timeout coverage.
Self-hosted/open-weight stack
Partial
Patch reduced error clarity.
Answer excerpt
Wrapped the handler in a generic try/catch.
Failure reason
Lost observability and did not test provider types.
Low-cost routing endpoint
Rejected
Cannot complete repository edits.
Answer excerpt
This is a backend bug.
Failure reason
No implementation.
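The four runs above could be captured as structured records so that reviewers can filter them programmatically. The sketch below is one hypothetical shape. The type and field names are assumptions, and costProxy and latencyMs are left null because this page's text does not show those values.

```typescript
type Outcome = "Accepted" | "Accepted with review" | "Partial" | "Rejected";

// Hypothetical record shape; field names are not taken from a real harness export.
interface ModelRunEvidence {
  modelClass: string;
  outcome: Outcome;
  reviewerNote: string;
  answerExcerpt: string;
  failureReason: string;
  costProxy: number | null; // null: value not present in the page text
  latencyMs: number | null; // null: value not present in the page text
}

const runs: ModelRunEvidence[] = [
  {
    modelClass: "Frontier API provider",
    outcome: "Accepted",
    reviewerNote: "Solid patch with good security posture.",
    answerExcerpt: "Mapped provider failures to safe response codes while preserving structured server logs.",
    failureReason: "No major issue.",
    costProxy: null,
    latencyMs: null,
  },
  {
    modelClass: "Fast hosted API provider",
    outcome: "Accepted with review",
    reviewerNote: "Good structure with a missing edge case.",
    answerExcerpt: "Added client-safe errors but missed one provider timeout case.",
    failureReason: "Reviewer added timeout coverage.",
    costProxy: null,
    latencyMs: null,
  },
  {
    modelClass: "Self-hosted/open-weight stack",
    outcome: "Partial",
    reviewerNote: "Patch reduced error clarity.",
    answerExcerpt: "Wrapped the handler in a generic try/catch.",
    failureReason: "Lost observability and did not test provider types.",
    costProxy: null,
    latencyMs: null,
  },
  {
    modelClass: "Low-cost routing endpoint",
    outcome: "Rejected",
    reviewerNote: "Cannot complete repository edits.",
    answerExcerpt: "This is a backend bug.",
    failureReason: "No implementation.",
    costProxy: null,
    latencyMs: null,
  },
];

// Runs that were not accepted outright are the ones a reviewer inspects first.
const flagged: string[] = runs
  .filter((run) => run.outcome !== "Accepted")
  .map((run) => run.modelClass);
```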
Aggregate scores are useful only when reviewers can inspect the task packet, expected evidence, and the exact failure mode. This page is the pattern for publishing public samples while keeping harder holdout tasks private.