Articles and reports

Research notes for choosing and deploying AI models.

Benchmarks, model economics, mechanistic interpretability, and Indian enterprise workflow analysis in one editorial system.

Featured note

Building a Useful AI Leaderboard Without Fooling Ourselves

A field guide for ranking AI systems without being fooled by contamination, brittle prompts, saturated tests, or one-number theatre.


Sources: 5

Visual modules: 6

Reading time: 12 min read

Benchmarks

15 May 2026

Building a Useful AI Leaderboard Without Fooling Ourselves

A field guide for ranking AI systems without being fooled by contamination, brittle prompts, saturated tests, or one-number theatre.

12 min read · Published draft

Mechanistic Interpretability

16 May 2026

Mechanistic Interpretability for Operators, Not Mystics

A practical explanation of circuits, sparse autoencoders, feature dashboards, and why interpretability should become part of AI system audits.

14 min read · Published draft

Model Economics

17 May 2026

Cost Curves for Frontier Reasoning Models

A buyer's framework for converting token prices into workflow cost, including reasoning tokens, cache hits, batch discounts, and tool calls.

13 min read · Published draft

Agents

18 May 2026

Agent Benchmarks That Survive Real Work

A map of coding, terminal, browser, OS, and customer-support agent benchmarks, and what each misses when used alone.

15 min read · Published draft

Model Economics

19 May 2026

Prompt Caching, Batch API, and the Real AI Cost Stack

How prompt caching, batch execution, retries, tool calls, and review loops change the economics of production AI workloads.

11 min read · Published draft

Indian Workflows

20 May 2026

Designing the Indian Enterprise AI Workflow Benchmark

A concrete v0.1 benchmark plan for Indian business workflows: data shape, scoring, task refresh, and buyer-facing outputs.

16 min read · Published draft

Model Economics

21 May 2026

Open-Weight Inference Economics for Enterprise AI

A buyer's map for deciding when Mistral, DeepSeek, Qwen, and other open-weight routes belong beside managed frontier APIs.

17 min read · Published draft

Research Evidence Library

Source trails, proof gaps, and review lanes.

Source-linked evidence library for Edxperimental Labs research articles and visual essays.

Source trails: 31

Claim reviews: 5

Chart tables: 3

Source packets: 31

Every major article claim gets a proof lane.

The claim matrix tracks weakest links, proof required before publishing, and the next artifact needed before a research note is treated as buyer-facing evidence.

Source notes that say what each link can and cannot prove.

31 source packets now include excerpt-level notes, cautions, freshness checks, and visual encoding hints for article charts.

Deep research stays useful only when the missing proof is visible.

Proof-gap rows separate what the draft can argue today from what still needs a benchmark run, source refresh, client trace, or reviewer sign-off.
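One way to picture a proof-gap row is as a small record that keeps what the draft can argue today apart from the proof still outstanding. The sketch below is illustrative only: the class names, field names, and the `buyer_facing` rule are assumptions rather than the library's actual schema, and the two example rows are invented placeholders.

```python
from dataclasses import dataclass
from enum import Enum


class MissingProof(Enum):
    # The four kinds of outstanding proof named in the text above.
    BENCHMARK_RUN = "benchmark run"
    SOURCE_REFRESH = "source refresh"
    CLIENT_TRACE = "client trace"
    REVIEWER_SIGN_OFF = "reviewer sign-off"


@dataclass(frozen=True)
class ProofGapRow:
    # Hypothetical row shape; field names are assumptions.
    claim: str                      # the claim the draft makes
    arguable_today: str             # what current evidence supports
    still_needs: tuple[MissingProof, ...] = ()

    @property
    def buyer_facing(self) -> bool:
        # Assumed rule: a claim is buyer-facing only when nothing is missing.
        return not self.still_needs


def open_gaps(rows):
    """Return the rows that still have at least one missing proof."""
    return [row for row in rows if not row.buyer_facing]


# Placeholder rows for illustration, not real library entries.
rows = [
    ProofGapRow(
        claim="Example claim A",
        arguable_today="Supported by a dated public source",
    ),
    ProofGapRow(
        claim="Example claim B",
        arguable_today="Plausible from vendor documentation",
        still_needs=(MissingProof.BENCHMARK_RUN, MissingProof.REVIEWER_SIGN_OFF),
    ),
]

for row in open_gaps(rows):
    needs = ", ".join(item.value for item in row.still_needs)
    print(f"{row.claim}: still needs {needs}")
```

Keeping the missing proof as an explicit field, rather than a free-text note, is what makes the gap visible when rows are filtered or counted.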

Dates, method notes, and chart-ready tables.

The library has 31 bibliography rows and 3 chart-ready tables for the visual essays.

Evaluation operations

Agent benchmarks need an operating system, not only a score.

The evidence library converts benchmark papers into artifacts Edxperimental Labs can actually run: traces, holdouts, calibration memos, reviewer packets, failure taxonomies, and procurement notes.

UK AISI Inspect evaluation framework: Freeze the eval manifest (/reports/benchmark-harness-kit/adapter-spec.json)

UK AISI Inspect Evals: Reuse open eval patterns carefully (/reports/research-evidence-library.md)

UK AISI autonomous systems evaluation standard: Attach a threat model to agent autonomy (/reports/benchmark-harness-kit/runbook.md)

NIST AI Risk Management Framework 1.0: Translate evidence into risk ownership (/roadmap#launch-control)

OWASP Top 10 for LLM Applications: Score LLM application security explicitly (/reports/benchmark-evidence-readiness/readiness.md)

MITRE ATLAS: Name adversarial techniques (/reports/benchmark-harness-kit/support-agent-harness.md)
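As a rough illustration of what "Freeze the eval manifest" could mean in practice, the sketch below content-hashes a manifest so that any later change to the tasks, prompts, or scorer shows up as a different digest. It uses only the Python standard library and is not the Inspect API; the manifest keys and values are hypothetical and do not reflect the contents of adapter-spec.json.

```python
import hashlib
import json


def manifest_digest(manifest: dict) -> str:
    """Return a SHA-256 digest of a manifest in canonical JSON form.

    Sorting keys and fixing separators makes the digest independent of
    key order and whitespace, so only real content changes alter it.
    """
    canonical = json.dumps(manifest, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Hypothetical manifest; every key and value here is an assumption.
manifest = {
    "benchmark": "support-agent-harness",
    "task_ids": ["task-001", "task-002"],
    "scorer": "rubric-v1",
    "seed": 7,
}

frozen = manifest_digest(manifest)

# The same content with keys in a different order gives the same digest.
reordered = {
    "seed": 7,
    "scorer": "rubric-v1",
    "task_ids": ["task-001", "task-002"],
    "benchmark": "support-agent-harness",
}
print(manifest_digest(reordered) == frozen)

# Changing any field, such as the scorer version, gives a different digest.
changed = dict(manifest, scorer="rubric-v2")
print(manifest_digest(changed) == frozen)
```

Recording the digest alongside each run's results gives reviewers a cheap way to confirm that two scores came from the same frozen evaluation.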