Articles and reports

Research notes for choosing and deploying AI models.

Benchmarks, model economics, mechanistic interpretability, and Indian enterprise workflow analysis in one editorial system.

Featured note

Building a Useful AI Leaderboard Without Fooling Ourselves

A field guide for ranking AI systems without being fooled by contamination, brittle prompts, saturated tests, or one-number theatre.

Reading time: 12 min read

Benchmarks

15 May 2026

Building a Useful AI Leaderboard Without Fooling Ourselves

A field guide for ranking AI systems without being fooled by contamination, brittle prompts, saturated tests, or one-number theatre.

12 min read · Published draft

Mechanistic Interpretability

16 May 2026

Mechanistic Interpretability for Operators, Not Mystics

A practical explanation of circuits, sparse autoencoders, feature dashboards, and why interpretability should become part of AI system audits.

14 min read · Published draft

Model Economics

17 May 2026

Cost Curves for Frontier Reasoning Models

A buyer's framework for converting token prices into workflow cost, including reasoning tokens, cache hits, batch discounts, and tool calls.

13 min read · Published draft

Agents

18 May 2026

Agent Benchmarks That Survive Real Work

A map of coding, terminal, browser, OS, and customer-support agent benchmarks, and what each misses when used alone.

15 min read · Published draft

Model Economics

19 May 2026

Prompt Caching, Batch API, and the Real AI Cost Stack

How prompt caching, batch execution, retries, tool calls, and review loops change the economics of production AI workloads.

11 min read · Published draft

Indian Workflows

20 May 2026

Designing the Indian Enterprise AI Workflow Benchmark

A concrete v0.1 benchmark plan for Indian business workflows: data shape, scoring, task refresh, and buyer-facing outputs.

16 min read · Published draft

Model Economics

21 May 2026

Open-Weight Inference Economics for Enterprise AI

A buyer's map for deciding when Mistral, DeepSeek, Qwen, and other open-weight routes belong beside managed frontier APIs.

17 min read · Published draft

Mechanistic Interpretability

22 May 2026

Natural Language Autoencoders for Tiny Qwen

How Anthropic's Natural Language Autoencoder idea changes mechanistic interpretability, what it does not solve, and how to build a small local Qwen 0.5B activation-reading demo without pretending it is a trained NLA.

12 min read · Published draft
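
The Model Economics notes listed above describe converting token prices into workflow cost. As a rough sketch of that arithmetic, the function below combines the cost drivers those notes name (reasoning tokens, cache hits, batch discounts, tool calls, retries). The function name, every parameter, the default discounts, and the choice to bill reasoning tokens as output tokens are assumptions made for illustration; they are not taken from the articles or from any vendor's price list.

```python
def cost_per_task(
    input_tokens: int,
    output_tokens: int,            # assumed to include reasoning tokens
    cached_fraction: float,        # share of input tokens served from cache
    price_in: float,               # USD per million input tokens (hypothetical)
    price_out: float,              # USD per million output tokens (hypothetical)
    cache_discount: float = 0.9,   # assumed discount on cached input tokens
    batch_discount: float = 0.0,   # assumed discount for batch execution
    tool_call_cost: float = 0.0,   # flat USD cost of tool calls per task
    retry_rate: float = 0.0,       # expected extra attempts per task
) -> float:
    """Estimate the USD cost of one workflow task under the stated assumptions."""
    cached = input_tokens * cached_fraction
    fresh = input_tokens - cached
    # Cached tokens are billed at a reduced rate; fresh tokens at full price.
    input_cost = (fresh + cached * (1 - cache_discount)) * price_in / 1e6
    output_cost = output_tokens * price_out / 1e6
    # The batch discount applies to model spend only, not to tool calls.
    model_cost = (input_cost + output_cost) * (1 - batch_discount)
    # Retries repeat the whole attempt, including its tool calls.
    return (model_cost + tool_call_cost) * (1 + retry_rate)
```

Calling `cost_per_task(1_000_000, 100_000, 0.5, 2.0, 10.0)` with these made-up prices shows how a 50% cache hit rate lowers the input side of the bill while the output side is unchanged.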

Research Evidence Library

Source trails, proof gaps, and review lanes.

Source-linked evidence library for Edxperimental Labs research articles and visual essays.

Source trails

Claim reviews

Chart tables

Source packets

Every major article claim gets a proof lane.

The claim matrix tracks weakest links, proof required before publishing, and the next artifact needed before a research note is treated as buyer-facing evidence.

Source notes that say what each link can and cannot prove.

31 source packets now include excerpt-level notes, cautions, freshness checks, and visual encoding hints for article charts.

Deep research stays useful only when the missing proof is visible.

Proof-gap rows separate what the draft can argue today from what still needs a benchmark run, source refresh, client trace, or reviewer sign-off.
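
One way to picture a proof-gap row is as a small record that lists what is still missing before a claim counts as buyer-facing evidence. The sketch below is illustrative only: the `ProofGap` class, its field names, and the `buyer_facing` rule are assumptions, not the library's actual schema. Only the four artifact kinds come from the sentence above.

```python
from dataclasses import dataclass, field

# Artifact kinds named in the text above; everything else here is hypothetical.
ARTIFACTS = {"benchmark run", "source refresh", "client trace", "reviewer sign-off"}


@dataclass
class ProofGap:
    claim: str                                      # what the draft argues today
    missing: set[str] = field(default_factory=set)  # artifacts still required

    def __post_init__(self) -> None:
        unknown = self.missing - ARTIFACTS
        if unknown:
            raise ValueError(f"unknown artifact kinds: {sorted(unknown)}")

    @property
    def buyer_facing(self) -> bool:
        # Assumed rule: a claim is buyer-facing only when nothing is missing.
        return not self.missing

    def close(self, artifact: str) -> None:
        """Mark one missing artifact as delivered."""
        self.missing.discard(artifact)
```

A row created with `{"benchmark run", "reviewer sign-off"}` stays out of buyer-facing evidence until both artifacts are closed.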

Dates, method notes, and chart-ready tables.

The library has 31 bibliography rows and 3 chart-ready tables for the visual essays.

Evaluation operations

Agent benchmarks need an operating system, not only a score.

The evidence library converts benchmark papers into artifacts Edxperimental Labs can actually run: traces, holdouts, calibration memos, reviewer packets, failure taxonomies, and procurement notes.

UK AISI Inspect evaluation framework

Freeze the eval manifest

/reports/benchmark-harness-kit/adapter-spec.json

UK AISI Inspect Evals

Reuse open eval patterns carefully

/reports/research-evidence-library.md

UK AISI autonomous systems evaluation standard

Attach a threat model to agent autonomy

/reports/benchmark-harness-kit/runbook.md

NIST AI Risk Management Framework 1.0

Translate evidence into risk ownership

/roadmap#launch-control

OWASP Top 10 for LLM Applications

Score LLM application security explicitly

/reports/benchmark-evidence-readiness/readiness.md

MITRE ATLAS

Name adversarial techniques

/reports/benchmark-harness-kit/support-agent-harness.md

Agent benchmark map

Six benchmark families by operating surface.

Reliability index

Repository-level coding agents

SWE-bench

Use as one coding-agent slice, then add repository-specific review hygiene, visual QA, and deployment checks.

Shell and command-line agents

Terminal-Bench

Pair with coding tasks to separate code reasoning from terminal discipline and recovery behavior.

Browser-agent research ecosystem

BrowserGym

Use as a harness reference for browser suites, then add buyer-specific proof requirements and screenshot checks.

Self-hosted web environments

WebArena

Use for browser-task taxonomy, then build private suites around target URLs, expected state, and visual proof.

Desktop computer-use agents

OSWorld

Use as the computer-use lane inside a broader agent reliability index with browser and workflow holdouts.

Tool-agent-user customer workflows

tau-bench

Use as the support-agent lane, then add Indian policy, language, refund, escalation, and CRM-specific cases.

Mechanistic playbook

From behavior failure to causal audit finding.

A mechanistic audit finding should include behavior evidence, candidate feature/component evidence, a causal intervention, counterexamples, and a deployment decision boundary.

8 min read

Sparse Autoencoders

Which recurring features activate during the failure, refusal, shortcut, or policy behavior?

8 min read

Activation Patching

Does changing the suspected feature, layer, head, or residual stream state change the model answer in the predicted direction?

8 min read

Feature Dashboards

Can a reviewer see when policy, shortcut, refusal, sensitive-domain, or evidence-use features activate?

8 min read

Circuit Tracing

What path connects input features, intermediate features, and the final behavior the audit cares about?

8 min read

Audit Limitations

What would make this interpretation fail, and what decision is still unsupported?
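
The activation-patching question above can be made concrete with a toy example. The sketch below uses a hand-written two-layer network with hypothetical fixed weights, not a real language model: it copies each hidden activation from a "clean" run into a "corrupted" run and records how much the output moves. The weights, inputs, and the `forward` helper are all assumptions chosen for illustration.

```python
import numpy as np

# Toy two-layer network with fixed, hypothetical weights (not a real model).
W1 = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])  # 3 hidden units x 2 inputs
W2 = np.array([2.0, -1.0, 0.5])                      # readout over hidden units


def forward(x, patch=None):
    """Run the network; `patch` maps a hidden index to a value to overwrite."""
    hidden = np.maximum(W1 @ x, 0.0)  # ReLU hidden activations
    if patch:
        for idx, value in patch.items():
            hidden[idx] = value
    return float(W2 @ hidden), hidden


clean_x = np.array([1.0, 0.0])    # input where the behavior looks right
corrupt_x = np.array([0.0, 1.0])  # input where the behavior fails

clean_out, clean_hidden = forward(clean_x)
corrupt_out, _ = forward(corrupt_x)

# Patch one hidden unit at a time from the clean run into the corrupted run
# and measure the shift in the output relative to the unpatched corrupted run.
effects = {}
for i in range(len(clean_hidden)):
    patched_out, _ = forward(corrupt_x, patch={i: clean_hidden[i]})
    effects[i] = patched_out - corrupt_out
```

In this toy setup the units with a non-zero entry in `effects` are the ones whose state causally separates the two runs. In a real audit the same comparison would be run on a layer, head, or residual stream position of the model under review.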

Editorial publishing kit

Article source files, visual sidecars, source indexes, and publish checklists are generated into reviewable repository artifacts.

8 notes

Articles now have generated editorial metadata

8 research notes are tracked with dates, owners, reviewers, source counts, visual modules, and next editorial actions.

7 visual sidecars

A working kit for publishing the next research note

The site now ships a file-backed article template, source registry, article index, checklist, and public report bundle for new drafts.

7 source files

A bridge from code-backed articles to file-backed publishing

Article bodies live in Markdown while visual data lives in JSON sidecars, so research pages can stay visual without hiding the source trail.
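
A file-backed layout like this makes simple consistency checks possible. As a minimal sketch, assuming each article is stored as `<slug>.md` with its visual data in `<slug>.json` in the same directory (a hypothetical naming convention, not confirmed by the site), the helper below lists articles that have no sidecar yet.

```python
from pathlib import Path


def missing_sidecars(root: Path) -> list[str]:
    """Return slugs of Markdown articles with no JSON sidecar beside them."""
    return sorted(
        md.stem
        for md in root.glob("*.md")
        if not md.with_suffix(".json").exists()
    )
```

Run against the article directory, an empty list would mean every Markdown body has a matching visual sidecar.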

Stay file-backed Markdown/JSON for the next publishing phase.

The current system already provides version control, reviewable diffs, generated public drafts, source registries, article indexes, visual sidecars, and Command-K discovery without adding CMS tokens, preview routes, editor permissions, or live-content caching complexity.