Articles and reports
Research notes for choosing and deploying AI models.
Benchmarks, model economics, mechanistic interpretability, and Indian enterprise workflow analysis in one editorial system.
Featured note
Building a Useful AI Leaderboard Without Fooling Ourselves
A field guide for ranking AI systems without being fooled by contamination, brittle prompts, saturated tests, or one-number theatre.
Read article
Sources
5
Visual modules
6
Reading time
12 min read
Benchmarks
15 May 2026
Building a Useful AI Leaderboard Without Fooling Ourselves
A field guide for ranking AI systems without being fooled by contamination, brittle prompts, saturated tests, or one-number theatre.
Mechanistic Interpretability
16 May 2026
Mechanistic Interpretability for Operators, Not Mystics
A practical explanation of circuits, sparse autoencoders, feature dashboards, and why interpretability should become part of AI system audits.
Model Economics
17 May 2026
Cost Curves for Frontier Reasoning Models
A buyer's framework for converting token prices into workflow cost, including reasoning tokens, cache hits, batch discounts, and tool calls.
Agents
18 May 2026
Agent Benchmarks That Survive Real Work
A map of coding, terminal, browser, OS, and customer-support agent benchmarks, and what each misses when used alone.
Model Economics
19 May 2026
Prompt Caching, Batch API, and the Real AI Cost Stack
How prompt caching, batch execution, retries, tool calls, and review loops change the economics of production AI workloads.
Indian Workflows
20 May 2026
Designing the Indian Enterprise AI Workflow Benchmark
A concrete v0.1 benchmark plan for Indian business workflows: data shape, scoring, task refresh, and buyer-facing outputs.
Model Economics
21 May 2026
Open-Weight Inference Economics for Enterprise AI
A buyer's map for deciding when Mistral, DeepSeek, Qwen, and other open-weight routes belong beside managed frontier APIs.
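The Model Economics notes listed above describe converting token prices into workflow cost. A minimal sketch of that arithmetic follows. Every price, field name, and billing rule here is a hypothetical assumption for illustration (for example, that reasoning tokens are billed at the output rate and that retries scale the model spend linearly), not a vendor's actual pricing.

```python
from dataclasses import dataclass


@dataclass
class Pricing:
    # Hypothetical prices in USD per million tokens.
    input_per_m: float
    cached_input_per_m: float
    output_per_m: float
    batch_discount: float = 0.0  # fraction off when run as a batch job


@dataclass
class Call:
    input_tokens: int
    cached_tokens: int     # part of input_tokens served from the prompt cache
    output_tokens: int     # visible answer tokens
    reasoning_tokens: int  # assumed to be billed at the output rate


def call_cost(p: Pricing, c: Call, batched: bool = False) -> float:
    """Cost of one model call in USD."""
    fresh = c.input_tokens - c.cached_tokens
    cost = (
        fresh * p.input_per_m
        + c.cached_tokens * p.cached_input_per_m
        + (c.output_tokens + c.reasoning_tokens) * p.output_per_m
    ) / 1_000_000
    if batched:
        cost *= 1 - p.batch_discount
    return cost


def workflow_cost(p, calls, tool_calls=0, tool_call_fee=0.0,
                  retry_rate=0.0, batched=False) -> float:
    """Cost of one workflow run: model calls, expected retries, tool fees."""
    model_spend = sum(call_cost(p, c, batched) for c in calls)
    return model_spend * (1 + retry_rate) + tool_calls * tool_call_fee


pricing = Pricing(input_per_m=2.0, cached_input_per_m=0.5,
                  output_per_m=10.0, batch_discount=0.5)
step = Call(input_tokens=10_000, cached_tokens=8_000,
            output_tokens=500, reasoning_tokens=1_500)
per_run = workflow_cost(pricing, [step, step], tool_calls=3,
                        tool_call_fee=0.01, retry_rate=0.1)
```

In this toy setup the reasoning tokens, not the prompt, dominate each call, which is the kind of effect a per-workflow view surfaces and a per-token price list hides.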
Research Evidence Library
Source trails, proof gaps, and review lanes.
Source-linked evidence library for Edxperimental Labs research articles and visual essays.
31
Source trails
5
Claim reviews
3
Chart tables
31
Source packets
Every major article claim gets a proof lane.
The claim matrix tracks weakest links, proof required before publishing, and the next artifact needed before a research note is treated as buyer-facing evidence.
Source notes that say what each link can and cannot prove.
31 source packets now include excerpt-level notes, cautions, freshness checks, and visual encoding hints for article charts.
Deep research stays useful only when the missing proof is visible.
Proof-gap rows separate what the draft can argue today from what still needs a benchmark run, source refresh, client trace, or reviewer sign-off.
Dates, method notes, and chart-ready tables.
The library has 31 bibliography rows and 3 chart-ready tables for the visual essays.
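One way to picture a claim-matrix row with its proof gaps is sketched below. The class and field names are hypothetical and are not taken from the library's actual files; the four proof types mirror the ones named above (benchmark run, source refresh, client trace, reviewer sign-off).

```python
from dataclasses import dataclass, field
from enum import Enum


class Proof(Enum):
    BENCHMARK_RUN = "benchmark run"
    SOURCE_REFRESH = "source refresh"
    CLIENT_TRACE = "client trace"
    REVIEWER_SIGNOFF = "reviewer sign-off"


@dataclass
class ClaimRow:
    # Hypothetical row shape: one article claim and what it still lacks.
    claim: str
    weakest_link: str
    missing: set = field(default_factory=set)  # Proof items still required
    next_artifact: str = ""

    @property
    def buyer_facing(self) -> bool:
        # A claim counts as buyer-facing evidence only when no proof is missing.
        return not self.missing


def proof_gaps(rows):
    """Return the rows that still have visible missing proof."""
    return [r for r in rows if not r.buyer_facing]


rows = [
    ClaimRow("Example claim A", weakest_link="single vendor source",
             missing={Proof.SOURCE_REFRESH, Proof.REVIEWER_SIGNOFF},
             next_artifact="refreshed source packet"),
    ClaimRow("Example claim B", weakest_link="none identified"),
]
open_rows = proof_gaps(rows)
```

Keeping `missing` as an explicit set means a draft cannot silently pass as evidence: the row either lists what is still owed or is empty.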
Evaluation operations
Agent benchmarks need an operating system, not only a score.
The evidence library converts benchmark papers into artifacts Edxperimental Labs can actually run: traces, holdouts, calibration memos, reviewer packets, failure taxonomies, and procurement notes.
UK AISI Inspect evaluation framework
Freeze the eval manifest
/reports/benchmark-harness-kit/adapter-spec.json
UK AISI Inspect Evals
Reuse open eval patterns carefully
/reports/research-evidence-library.md
UK AISI autonomous systems evaluation standard
Attach a threat model to agent autonomy
/reports/benchmark-harness-kit/runbook.md
NIST AI Risk Management Framework 1.0
Translate evidence into risk ownership
/roadmap#launch-control
OWASP Top 10 for LLM Applications
Score LLM application security explicitly
/reports/benchmark-evidence-readiness/readiness.md
MITRE ATLAS
Name adversarial techniques
/reports/benchmark-harness-kit/support-agent-harness.md
Agent benchmark map
Six benchmark families by operating surface.
Repository-level coding agents
SWE-bench
Use as one coding-agent slice, then add repository-specific review hygiene, visual QA, and deployment checks.
Shell and command-line agents
Terminal-Bench
Pair with coding tasks to separate code reasoning from terminal discipline and recovery behavior.
Browser-agent research ecosystem
BrowserGym
Use as a harness reference for browser suites, then add buyer-specific proof requirements and screenshot checks.
Self-hosted web environments
WebArena
Use for browser-task taxonomy, then build private suites around target URLs, expected state, and visual proof.
Desktop computer-use agents
OSWorld
Use as the computer-use lane inside a broader agent reliability index with browser and workflow holdouts.
Tool-agent-user customer workflows
tau-bench
Use as the support-agent lane, then add Indian policy, language, refund, escalation, and CRM-specific cases.
Mechanistic playbook
From behavior failure to causal audit finding.
A mechanistic audit finding should include behavior evidence, candidate feature/component evidence, a causal intervention, counterexamples, and a deployment decision boundary.
8 min read
Sparse Autoencoders
Which recurring features activate during the failure, refusal, shortcut, or policy behavior?
8 min read
Activation Patching
Does changing the suspected feature, layer, head, or residual stream state change the model answer in the predicted direction?
8 min read
Feature Dashboards
Can a reviewer see when policy, shortcut, refusal, sensitive-domain, or evidence-use features activate?
8 min read
Circuit Tracing
What path connects input features, intermediate features, and the final behavior the audit cares about?
8 min read
Audit Limitations
What would make this interpretation fail, and what decision is still unsupported?
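The Activation Patching question above can be illustrated on a toy network. This is a minimal sketch under stated assumptions: a hand-written two-unit ReLU layer with made-up weights stands in for a real model, and the function names are hypothetical. Each hidden activation from a clean run is copied into a corrupted run, one unit at a time, and the change in output is recorded.

```python
def hidden(x, W1):
    # ReLU hidden layer: one activation per row of W1.
    return [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in W1]


def output(h, w2):
    # Linear readout over the hidden activations.
    return sum(w * hi for w, hi in zip(w2, h))


def patch_effects(clean_x, corrupt_x, W1, w2):
    """Effect on the output of patching each hidden unit, one at a time.

    For unit i, the corrupted run's activation is replaced with the clean
    run's activation and the output shift from the corrupted baseline is
    returned. A large shift marks unit i as causally relevant here.
    """
    clean_h = hidden(clean_x, W1)
    corrupt_h = hidden(corrupt_x, W1)
    baseline = output(corrupt_h, w2)
    effects = []
    for i in range(len(corrupt_h)):
        patched = list(corrupt_h)
        patched[i] = clean_h[i]
        effects.append(output(patched, w2) - baseline)
    return effects


# Toy weights: only hidden unit 0 feeds the output.
W1 = [[1.0, 0.0], [0.0, 1.0]]
w2 = [2.0, 0.0]
effects = patch_effects([1.0, 1.0], [0.0, 0.0], W1, w2)
```

Here patching unit 0 moves the output and patching unit 1 does not, so only unit 0 would be a candidate for the causal-intervention part of a finding. On a real model the same loop runs over layers, heads, or residual stream positions, and counterexample inputs are still needed before drawing a deployment boundary.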
Editorial publishing kit
Article source files, visual sidecars, source indexes, and publish checklists are generated into reviewable repository artifacts.
7 notes
Articles now have generated editorial metadata
7 research notes are tracked with dates, owners, reviewers, source counts, visual modules, and next editorial actions.
7 visual sidecars
A working kit for publishing the next research note
The site now ships a file-backed article template, source registry, article index, checklist, and public report bundle for new drafts.
7 source files
A bridge from code-backed articles to file-backed publishing
Article bodies live in Markdown while visual data lives in JSON sidecars, so research pages can stay visual without hiding the source trail.
Stay file-backed Markdown/JSON for the next publishing phase.
The current system already provides version control, reviewable diffs, generated public drafts, source registries, article indexes, visual sidecars, and Command-K discovery without adding CMS tokens, preview routes, editor permissions, or live-content caching complexity.
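A file-backed layout like the one described above can be checked with a few lines of code. The sketch below assumes a hypothetical naming convention in which an article body and its visual sidecar share a file stem (`note.md` and `note.json`); the actual repository layout is not given here, so the paths are placeholders.

```python
from pathlib import PurePosixPath


def pair_sidecars(paths):
    """Group Markdown article bodies with JSON visual sidecars by shared stem.

    Returns (paired, unpaired): a dict mapping stem to (body, sidecar),
    and a sorted list of stems that have only one of the two files.
    """
    bodies, sidecars = {}, {}
    for p in map(PurePosixPath, paths):
        if p.suffix == ".md":
            bodies[p.stem] = str(p)
        elif p.suffix == ".json":
            sidecars[p.stem] = str(p)
    paired = {s: (bodies[s], sidecars[s]) for s in bodies if s in sidecars}
    unpaired = sorted(set(bodies) ^ set(sidecars))
    return paired, unpaired


# Placeholder paths for illustration only.
paired, unpaired = pair_sidecars([
    "articles/leaderboard.md",
    "articles/leaderboard.json",
    "articles/cost-curves.md",
])
```

A check like this can run in review before publishing, so a missing sidecar shows up as a diff comment rather than a blank chart.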