Agents
Agent Benchmarks That Survive Real Work
A map of coding, terminal, browser, OS, and customer-support agent benchmarks, and what each misses when used alone.
Research lens
6 suites
5 failure modes
A useful agent benchmark should measure the full loop: task understanding, planning, tool selection, state tracking, error recovery, verification, and final handoff. A model that answers a prompt well can still fail when the page changes, the terminal output is noisy, or a tool returns partial evidence.
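To make that loop measurable, a harness can log and score each stage instead of collapsing the run into one pass/fail bit. A minimal sketch, assuming a hypothetical per-stage trace; the stage names and the StageResult record are illustrative, not any published benchmark's schema:

```python
from dataclasses import dataclass

# Stages of the full agent loop, in the order the article lists them.
STAGES = [
    "task_understanding", "planning", "tool_selection",
    "state_tracking", "error_recovery", "verification", "final_handoff",
]

@dataclass
class StageResult:
    stage: str
    succeeded: bool
    evidence: str  # e.g. a log excerpt, diff, or screenshot path

def score_full_loop(trace: list[StageResult]) -> dict:
    """Score each observed stage separately instead of one pass/fail bit."""
    by_stage = {r.stage: r.succeeded for r in trace}
    covered = [s for s in STAGES if s in by_stage]
    passed = [s for s in covered if by_stage[s]]
    return {
        "stage_pass_rate": len(passed) / len(STAGES),
        # Count the task as resolved only if every observed stage succeeded
        # and the loop actually reached a final handoff.
        "resolved": by_stage.get("final_handoff", False)
                    and all(by_stage[s] for s in covered),
    }
```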
SWE-bench-style repository work exercises real code edits validated by test suites; terminal tasks test shell discipline; browser suites test UI state and navigation; customer-support simulations test policy adherence and conversation control. Edxperimental Labs should publish separate tracks and a combined agentic reliability score.
Most agent failures are not first-step failures. They happen when the agent sees an unexpected modal, a failing test, a missing selector, stale instructions, or an ambiguous customer policy. Recovery should be scored explicitly, not buried inside pass/fail.
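One way to score recovery explicitly is to inject known perturbations, such as a surprise modal or a failing test, and check whether the agent got back on track afterward. A minimal sketch under that assumption; the event-log format below is hypothetical:

```python
def recovery_rate(events: list[dict]) -> float:
    """Fraction of injected perturbations the agent recovered from.

    `events` is a hypothetical run log. Each injected perturbation is an
    entry like {"type": "perturbation", "kind": "modal", "recovered": True},
    where `kind` might be any of the failure modes named above: modal,
    failing_test, missing_selector, stale_instructions, ambiguous_policy.
    """
    perturbations = [e for e in events if e.get("type") == "perturbation"]
    if not perturbations:
        return 1.0  # nothing went wrong, so nothing to recover from
    recovered = sum(1 for e in perturbations if e.get("recovered"))
    return recovered / len(perturbations)
```

Reporting this number alongside the resolved rate keeps recovery visible instead of buried inside pass/fail.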
Visual: Illustrative coverage score across code, shell, browser, OS, and support workflows.
SWE-bench (code repositories): Can the agent make real patches against issue-style tasks?
Terminal-Bench (shell and CLI): Can the agent operate a terminal without derailing?
BrowserGym/WebArena (browser tasks): Can the agent navigate and verify web state?
OSWorld (desktop OS): Can the agent handle computer-use workflows?
tau-bench (tool-agent dialogue): Can the agent follow policy while using tools?
Process
Core metric: Resolved tasks. Do not score tool calls alone; score whether the target job was completed with evidence.
Weak spot: Recovery. Unexpected UI and failing tests are where agent quality separates.
Edxperimental track: Agentic reliability. Blend coding, browser, support, and terminal tasks into a buyer-readable index, as sketched below.
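Such an index could be a weighted blend of per-track resolved and recovery rates. A minimal sketch; the track names mirror the table above, and every number and weight is a made-up placeholder a buyer would replace with measured results:

```python
# Hypothetical per-track results: resolved and recovery rates in [0, 1].
# All values below are illustrative placeholders, not measured scores.
TRACKS = {
    "code":     {"resolved": 0.41, "recovery": 0.55, "weight": 0.30},
    "terminal": {"resolved": 0.38, "recovery": 0.48, "weight": 0.15},
    "browser":  {"resolved": 0.29, "recovery": 0.40, "weight": 0.25},
    "os":       {"resolved": 0.22, "recovery": 0.35, "weight": 0.10},
    "support":  {"resolved": 0.51, "recovery": 0.62, "weight": 0.20},
}

def agentic_reliability(tracks: dict, recovery_share: float = 0.3) -> float:
    """Weighted blend of resolved and recovery rates across tracks."""
    total_weight = sum(t["weight"] for t in tracks.values())
    score = 0.0
    for t in tracks.values():
        # Mix task resolution with explicit recovery scoring per track.
        per_track = ((1 - recovery_share) * t["resolved"]
                     + recovery_share * t["recovery"])
        score += t["weight"] * per_track
    return score / total_weight

if __name__ == "__main__":
    print(f"Agentic reliability: {agentic_reliability(TRACKS):.2f}")
```

The weights encode the buyer's workflow mix; a support-heavy deployment would shift weight toward the support and browser tracks.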
Recommendation
The right benchmark mix depends on the workflow, risk tolerance, budget, latency target, data sensitivity, and the cost of a wrong answer.