
Agents

Agent Benchmarks That Survive Real Work

A map of coding, terminal, browser, OS, and customer-support agent benchmarks, and what each misses when used alone.

18 May 2026 · 15 min read

Research lens · 6 suites · 5 failure modes

Agents are evaluated at the system boundary

A useful agent benchmark should measure the full loop: task understanding, planning, tool selection, state tracking, error recovery, verification, and final handoff. A model that answers a prompt well can still fail when the page changes, the terminal output is noisy, or a tool returns partial evidence.
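
A minimal sketch of what scoring at the system boundary looks like. The `env` and `agent` interfaces here are hypothetical, not a real benchmark API; the point is that the score comes from the environment's final state, not from how good any individual model message looked:

```python
# Episode-level scoring sketch. `env` and `agent` are assumed
# interfaces: env.reset/step/goal_satisfied and agent.act are
# placeholders, not names from a real harness.

from dataclasses import dataclass

@dataclass
class EpisodeResult:
    resolved: bool      # did the final state satisfy the task goal?
    steps: int          # how many actions the agent took
    recovered: bool     # did it get back on track after an error?

def run_episode(env, agent, max_steps: int = 50) -> EpisodeResult:
    obs = env.reset()
    saw_error, recovered = False, False
    for step in range(1, max_steps + 1):
        action = agent.act(obs)          # planning + tool selection
        obs, error = env.step(action)    # tool action + new state
        if error:
            saw_error = True             # e.g. failing test, modal
        elif saw_error:
            recovered = True             # made progress past the error
        if env.goal_satisfied():         # verification at the boundary
            return EpisodeResult(True, step, recovered)
    return EpisodeResult(False, max_steps, recovered)
```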

Coding benchmarks are not browser benchmarks

SWE-bench-style repository work tests whether code edits make failing tests pass; terminal tasks test shell discipline; browser suites test UI state and navigation; customer-support simulations test policy and conversation control. Edxperimental Labs should publish separate tracks and a combined agentic reliability score.

The hidden variable is recovery

Most agent failures are not first-step failures. They happen when the agent sees an unexpected modal, a failing test, a missing selector, stale instructions, or an ambiguous customer policy. Recovery should be scored explicitly, not buried inside pass/fail.
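
One way to make recovery a first-class number rather than a hidden cause of failure: inject a perturbation mid-episode and report recovery rate separately from resolution rate. A sketch reusing `run_episode` from above; `env_factory` and `inject_perturbation` are assumed hooks, and real suites differ in how perturbations are introduced:

```python
def recovery_rate(env_factory, agent, perturbations, trials=20):
    """Of the episodes where a perturbation was injected (unexpected
    modal, failing test, missing selector, ...), in what fraction did
    the agent still resolve the task?"""
    recovered, total = 0, 0
    for perturbation in perturbations:
        for _ in range(trials):
            env = env_factory()
            env.inject_perturbation(perturbation)  # hypothetical hook
            result = run_episode(env, agent)
            total += 1
            recovered += result.resolved
    return recovered / total
```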

Visual

Six benchmark families by operating surface

Illustrative coverage score across code, shell, browser, OS, and support workflows.

SWE-bench: 74
Terminal-Bench: 68
BrowserGym: 61
OSWorld: 58
WebArena: 52
tau-bench: 49

SWE-bench (code repositories): Can the agent make real patches against issue-style tasks?

Terminal-Bench (shell and CLI): Can the agent operate a terminal without derailing?

BrowserGym/WebArena (browser tasks): Can the agent navigate and verify web state?

OSWorld (desktop OS): Can the agent handle computer-use workflows?

tau-bench (tool-agent dialogue): Can the agent follow policy while using tools?

Process

How to read the analysis

1. Instruction
2. Plan
3. Tool action
4. State check
5. Recovery
6. Final proof
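
Each episode can be read against those six steps. A sketch of a per-step trace record (field and enum names are illustrative, not from any published harness) that attributes a failure to the step where it happened rather than to the episode as a whole:

```python
from dataclasses import dataclass, field
from enum import Enum

class Step(Enum):
    INSTRUCTION = 1   # was the task understood?
    PLAN = 2          # was a workable plan formed?
    TOOL_ACTION = 3   # was the right tool called correctly?
    STATE_CHECK = 4   # did the agent verify the new state?
    RECOVERY = 5      # did it react sensibly to surprises?
    FINAL_PROOF = 6   # did it hand over evidence of completion?

@dataclass
class Trace:
    outcomes: dict[Step, bool] = field(default_factory=dict)

    def first_failure(self) -> Step | None:
        """Attribute a failed episode to the earliest failed step;
        a missing entry is treated as a pass."""
        for step in Step:
            if not self.outcomes.get(step, True):
                return step
        return None
```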

Core metric: Resolved task

Do not score tool calls alone; score whether the target job was completed with evidence.
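
The contrast in code, using a repository task as the example. Both `tests_pass` and `diff_touches_target` are assumed checks, not a real benchmark API:

```python
def naive_score(trace) -> bool:
    # Scoring tool calls alone rewards activity, not outcomes.
    return len(trace.tool_calls) > 0

def resolved_score(repo, trace) -> bool:
    # The job counts as done only if the final state carries
    # evidence: the tests pass and the diff addresses the issue.
    return repo.tests_pass() and repo.diff_touches_target()
```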

Weak spot: Recovery

Unexpected UI and failing tests are where agent quality separates.

Edxperimental track: Agentic reliability

Blend coding, browser, support, and terminal tasks into a buyer-readable index.
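
A buyer-readable index can be as simple as a weighted mean over per-track resolution rates, with recovery weighted in explicitly. The track names and weights below are placeholders, not Edxperimental's published methodology:

```python
# Illustrative weights; a real index would justify these per buyer.
TRACK_WEIGHTS = {
    "coding": 0.30,
    "terminal": 0.20,
    "browser": 0.25,
    "support": 0.25,
}

def agentic_reliability(track_scores: dict[str, float],
                        recovery: float,
                        recovery_weight: float = 0.3) -> float:
    """Blend per-track resolution rates (0..1) with an explicit
    recovery rate (0..1) into a single 0..100 index."""
    base = sum(TRACK_WEIGHTS[t] * track_scores[t] for t in TRACK_WEIGHTS)
    blended = (1 - recovery_weight) * base + recovery_weight * recovery
    return round(100 * blended, 1)
```

For example, track scores of 0.62 (coding), 0.55 (terminal), 0.48 (browser), 0.51 (support) with a 0.40 recovery rate blend to an index of about 51, and the recovery term keeps a brittle-but-lucky agent from scoring like a robust one.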

Recommendation

Use this as a decision tool, not a belief system.

The right model or benchmark depends on the workflow, risk tolerance, budget, latency target, data sensitivity, and the cost of a wrong answer.