Build / holdout trace / Hard

Repair failing static build

Diagnose a Next.js static build failure caused by optional content fields and make the renderer type-safe without broad rewrites.


Expected evidence

type-safe fallback

successful build

scoped diff

Scoring focus

debugging

minimal patch

build verification

Common failure mode

Weak coding agents silence TypeScript with broad casts instead of fixing the optional data shape.

Expected output

A narrow fix that handles optional content safely and proves the static build succeeds.
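As a rough illustration of what such a narrow fix can look like, the sketch below defaults an optional field at the point of use instead of casting the whole object. The `Article` shape, the `researchNotes` field name, and the `renderNotes` helper are hypothetical; the holdout task's real types are not published.

```typescript
// Hypothetical content shape: `researchNotes` is optional, so some articles omit it.
interface Article {
  title: string;
  researchNotes?: string[];
}

// Narrow fix: fall back to an empty list where the optional field is read.
// The type of `notes` is string[], so no cast is needed downstream.
function renderNotes(article: Article): string {
  const notes = article.researchNotes ?? [];
  if (notes.length === 0) {
    return "";
  }
  return notes.map((note) => `- ${note}`).join("\n");
}

// Anti-pattern for contrast (left as a comment so it is never executed):
//   const notes = (article as any).researchNotes;
//   notes.map(...)  // type-checks, but throws when the field is missing

const withNotes: Article = { title: "A", researchNotes: ["first", "second"] };
const withoutNotes: Article = { title: "B" };

console.log(renderNotes(withNotes));
console.log(renderNotes(withoutNotes) === "");
```

The fallback keeps the data shape untouched and limits the diff to the renderer, which is what the scope-control component of the rubric rewards.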

Score breakdown

Diagnosis: 30

Type safety: 30

Scope control: 20

Build proof: 20

Trace provenance

Can this public trace be audited later?

Trace id: trace-coding-agent-maintenance-suite-repair-failing-static-build

Created: 2026-05-13

Last reviewed: 2026-05-16

Source: data/benchmark-trace-input.json

Leakage risk: Low. The holdout task is not published in full.

Retirement status: Private holdout; keep sealed until replacement task exists.

Score calculation ledger

How the top score is allocated

The score equals the sum of the weighted rubric components. Each component's earned points are allocated from the aggregate run score.

Diagnosis: 22/30

Type safety: 22/30

Scope control: 15/20

Build proof: 15/20
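The ledger above can be reproduced with a small sketch. It assumes each component earns the aggregate score scaled by its weight and rounded to the nearest integer; that rounding rule is an assumption that happens to match the numbers shown, not a documented part of the harness. The `allocate` function and its types are hypothetical.

```typescript
// Rubric weights as listed in the score breakdown (they sum to 100).
interface RubricComponent {
  name: string;
  weight: number;
}

const rubric: RubricComponent[] = [
  { name: "Diagnosis", weight: 30 },
  { name: "Type safety", weight: 30 },
  { name: "Scope control", weight: 20 },
  { name: "Build proof", weight: 20 },
];

// Assumption: earned points = round(aggregate * weight / totalWeight).
function allocate(
  aggregate: number,
  components: RubricComponent[],
): { name: string; earned: number }[] {
  const totalWeight = components.reduce((sum, c) => sum + c.weight, 0);
  return components.map((c) => ({
    name: c.name,
    earned: Math.round((aggregate * c.weight) / totalWeight),
  }));
}

// Aggregate score of the top run in this trace.
const ledger = allocate(74, rubric);
const total = ledger.reduce((sum, row) => sum + row.earned, 0);

console.log(ledger);
console.log(total);
```

Per-component rounding does not guarantee that the parts always add back up to the aggregate; a real ledger would likely need a remainder-distribution step for scores where they do not.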

Model version

frontier-reasoning-eval-holdout-2026-05

Run seed

2026051630

Prompt packet

repair-failing-static-build-holdout-packet-v0.1

Artifact bundle

Replay files for this trace

Replay scaffold generated from the current seed trace. Replace with real harness exports when model runs are available.

Input payload, raw run log, scorecard

Replay command

pnpm benchmarks:replay --suite coding-agent-maintenance-suite --task repair-failing-static-build

This command is intentionally documented before the real harness exists so the artifact contract is visible.

Payload preview

Split

holdout

Difficulty

Hard

Evidence fields

Model runs

Screenshot

Pending real browser or app screenshot artifact.

Model run evidence

Trace-level comparison

This is the inspection layer that keeps benchmark scores honest: each model class gets an outcome, cost proxy, latency, and reviewer note.

Frontier API provider

Frontier reasoning model

Accepted

Score: 74

Cost units: 5.5

Latency: 7420 ms

Identified optional field issue and added a narrow fallback with clean build proof.

Answer excerpt

Added a null-safe fallback for optional research notes and reran the Webpack build.

Failure reason

No major issue.

read build error, patch renderer, run lint, run build

Fast hosted API provider

Fast mid-tier model

Accepted with review

Score: 64

Cost units: 2.3

Latency: 4380 ms

Fixed build but used a wider type assertion than necessary.

Answer excerpt

Used a type assertion to bypass the optional value issue.

Failure reason

Build passed but type assertion was wider than necessary.

read build error, patch renderer

Self-hosted/open-weight stack

Open-weight local model

Partial

Score: 48

Cost units: 1.5

Latency: 6540 ms

Changed data shape but left one route unverified.

Answer excerpt

Changed the data object so every article has the same field.

Failure reason

Changed data shape globally and left one route unverified.

patch content

Low-cost routing endpoint

Small routing model

Rejected

Score: 27

Cost units: 0.4

Latency: 2310 ms

Misdiagnosed the issue as a dependency mismatch.

Answer excerpt

This seems like a dependency mismatch.

Failure reason

Misdiagnosed the issue as a dependency mismatch.

classify error
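The four runs above share the same fields, so they could be captured in a typed record along these lines. This is only a sketch: the `ModelRun` interface, its field names, and the `needsFollowUp` helper are hypothetical and do not describe the real harness export format. The values are copied from the runs listed above.

```typescript
// Outcome labels as they appear in this trace.
type Outcome = "Accepted" | "Accepted with review" | "Partial" | "Rejected";

// Hypothetical record shape: one entry per model class.
interface ModelRun {
  modelClass: string;
  outcome: Outcome;
  score: number;
  costUnits: number; // cost proxy, unitless
  latencyMs: number;
  reviewerNote: string;
}

const runs: ModelRun[] = [
  { modelClass: "Frontier reasoning model", outcome: "Accepted", score: 74, costUnits: 5.5, latencyMs: 7420,
    reviewerNote: "Identified optional field issue and added a narrow fallback with clean build proof." },
  { modelClass: "Fast mid-tier model", outcome: "Accepted with review", score: 64, costUnits: 2.3, latencyMs: 4380,
    reviewerNote: "Fixed build but used a wider type assertion than necessary." },
  { modelClass: "Open-weight local model", outcome: "Partial", score: 48, costUnits: 1.5, latencyMs: 6540,
    reviewerNote: "Changed data shape but left one route unverified." },
  { modelClass: "Small routing model", outcome: "Rejected", score: 27, costUnits: 0.4, latencyMs: 2310,
    reviewerNote: "Misdiagnosed the issue as dependency mismatch." },
];

// Runs a reviewer would still need to look at: anything not cleanly accepted.
function needsFollowUp(all: ModelRun[]): ModelRun[] {
  return all.filter((run) => run.outcome !== "Accepted");
}

const flagged = needsFollowUp(runs).map((run) => run.modelClass);
console.log(flagged);
```

Typing the outcome as a closed union means a misspelled or new outcome label fails at compile time rather than slipping into a report.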

Why this trace matters

Aggregate scores are useful only when reviewers can inspect the task packet, expected evidence, and the exact failure mode. This page is the pattern for publishing public samples while keeping harder holdout tasks private.
