Models, agents and hardware
Models do not ship alone. Production systems combine models, agents, tools, retrieval, browsers, GPUs, API providers, cache policy, and monitoring. This page keeps that stack in one place.
Inference Economics Playbook
A playbook for comparing managed APIs, hosted open-weight inference, dedicated endpoints, and self-hosted GPU stacks.
Cost per accepted output at target p95 latency, reported with cache hit rate, batch eligibility, retry rate, utilization, and fallback route.
Route
Managed frontier API
High-stakes reasoning, low ops burden, fastest path to production evidence.
Nominal token price can be cheaper than self-hosting when quality reduces retries and review.
Route
Hosted open-weight inference
Testing Mistral, Qwen, DeepSeek, Llama, and other open models without owning GPUs.
Provider/model choice, batching, and replica-local cache behavior can dominate the final curve.
Route
Dedicated endpoint
Predictable production traffic, data boundary control, and more consistent latency.
A dedicated endpoint can be more expensive than serverless if utilization stays low.
Route
Self-hosted GPU stack
Strict data control, high utilization, custom kernels, fine-tuned models, and deep optimization.
The real cost includes upgrades, observability, security, model serving, and incident response.
Cost levers
Inference route choice should be proven with workload traces: token ledgers, latency percentiles, queue behavior, utilization, and cost per accepted output.
Batch eligibility
Can this workflow wait minutes or hours instead of responding live?
Queue delay tolerance, SLA class, batch endpoint availability, accepted-output score at batch settings.
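A minimal sketch of the eligibility math, assuming a flat batch discount and a queue-delay tolerance set by the product; the discount, prices, and function name are illustrative, not provider terms.

```python
def batch_route_savings(live_price_per_mtok: float,
                        batch_discount: float,
                        monthly_tokens: float,
                        max_queue_delay_s: float,
                        batch_turnaround_s: float) -> float:
    """Monthly savings from moving batch-eligible traffic off the live route.

    batch_discount is an assumed flat discount (0.5 = 50%); real batch endpoints
    publish their own pricing and turnaround guarantees.
    """
    if batch_turnaround_s > max_queue_delay_s:
        return 0.0  # the workflow cannot wait, so the batch route is not eligible
    live_cost = monthly_tokens / 1e6 * live_price_per_mtok
    return live_cost * batch_discount

# Example: 200M tokens/month that can wait a day, assumed 50% discount, ~4h turnaround.
print(batch_route_savings(2.50, 0.50, 200e6, 24 * 3600, 4 * 3600))  # 250.0
```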
Prompt cache shape
Which policy, schema, instruction, or retrieval blocks repeat across requests?
Stable prefix design, cache hit rate, cache-read/write pricing, session affinity behavior.
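A sketch of how a repeated prefix bends the input-cost curve, assuming separate cache-read and cache-write prices and a measured hit rate; all prices here are placeholders, not any provider's rate card.

```python
def blended_input_price_per_mtok(base_price: float,
                                 cache_read_price: float,
                                 cache_write_price: float,
                                 hit_rate: float,
                                 prefix_share: float) -> float:
    """Effective input price per 1M tokens when a stable prefix is cacheable.

    prefix_share: fraction of input tokens in the repeated policy/schema prefix.
    hit_rate: fraction of requests that find that prefix already cached.
    """
    # Hits read the prefix at the cheaper rate; misses pay the cache-write rate.
    prefix_price = hit_rate * cache_read_price + (1.0 - hit_rate) * cache_write_price
    return (1.0 - prefix_share) * base_price + prefix_share * prefix_price

# Example: 60% of each prompt is a stable prefix and 90% of requests hit the cache.
print(blended_input_price_per_mtok(3.00, 0.30, 3.75, 0.90, 0.60))  # ≈ 1.59
```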
Throughput and utilization
How many accepted tokens per second does the system produce at target latency?
Tokens/sec/user, requests/sec, GPU occupancy, queue depth, p50/p95/p99 latency.
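A sketch of the occupancy math: cost per 1M accepted output tokens on a dedicated or self-hosted GPU, under an assumed hourly rate and measured busy-time throughput.

```python
def gpu_cost_per_mtok(gpu_hour_price: float,
                      tokens_per_sec_when_busy: float,
                      occupancy: float) -> float:
    """Cost per 1M accepted output tokens on a dedicated or self-hosted GPU.

    tokens_per_sec_when_busy: accepted-token throughput while the replica serves;
    occupancy: fraction of paid GPU-hours actually spent serving.
    gpu_hour_price is a placeholder, not a quoted rate.
    """
    tokens_per_paid_hour = tokens_per_sec_when_busy * 3600 * occupancy
    return gpu_hour_price / tokens_per_paid_hour * 1e6

# The same GPU at 35% vs 80% occupancy: idle hours more than double the unit cost.
print(gpu_cost_per_mtok(4.00, 900, 0.35))  # ≈ 3.53 per 1M tokens
print(gpu_cost_per_mtok(4.00, 900, 0.80))  # ≈ 1.54 per 1M tokens
```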
Queueing and tail latency
What happens at peak load, long context, and retry storms?
p95/p99 latency, timeout rate, retry count, queue wait, rate-limit behavior.
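A sketch of the tail summary computed from raw per-request traces; the field names and numbers are assumptions, not the trace-kit contract.

```python
import statistics

def tail_latency_summary(latency_ms: list[float],
                         timed_out: list[bool],
                         retries: list[int]) -> dict:
    """Summarise tail behaviour from per-request traces (three parallel lists)."""
    cuts = statistics.quantiles(latency_ms, n=100, method="inclusive")
    return {
        "p50_ms": cuts[49],
        "p95_ms": cuts[94],
        "p99_ms": cuts[98],
        "timeout_rate": sum(timed_out) / len(timed_out),
        "mean_retries": sum(retries) / len(retries),
    }

# A long tail: 90 fast requests, a few slow ones, two stalls behind the queue.
lat = [800.0] * 90 + [2500.0] * 8 + [9000.0, 12000.0]
print(tail_latency_summary(lat, [False] * 98 + [True] * 2, [0] * 95 + [1] * 5))
```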
Quality-adjusted cost
What is the cost per accepted workflow after retries and human review?
Pass rate, review minutes, failure severity, fallback route, cost per accepted output.
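A sketch of the quality-adjusted calculation, assuming failed attempts are retried until one passes and review is billed at a flat hourly rate; the numbers are illustrative only.

```python
def cost_per_accepted_output(cost_per_attempt: float,
                             pass_rate: float,
                             review_minutes: float,
                             reviewer_hourly_rate: float) -> float:
    """Cost per accepted output after retries and human review.

    Assumes failures are retried until one passes (expected attempts = 1/pass_rate)
    and review is billed at a flat hourly rate; both are simplifying assumptions.
    """
    model_cost = cost_per_attempt / pass_rate
    review_cost = review_minutes / 60.0 * reviewer_hourly_rate
    return model_cost + review_cost

# The "cheap" route loses once review minutes are priced in.
print(cost_per_accepted_output(0.02, 0.70, 4.0, 60.0))  # ≈ 4.03
print(cost_per_accepted_output(0.08, 0.95, 1.0, 60.0))  # ≈ 1.08
```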
Inference Trace Kit
Import contract for measured latency, throughput, cache, batch, cost, and acceptance traces across model/provider routes. This is the bridge from provider logs and notebooks into the cost-curve articles, Studio workbench, and buyer reports.
16 trace fields
5 derived metrics
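A sketch of what a trace row and a few derived metrics could look like; the field names below are hypothetical and abbreviated, not the kit's actual 16-field contract or its 5 reported metrics.

```python
from dataclasses import dataclass

@dataclass
class TraceRow:
    # Hypothetical, abbreviated fields; the kit's import contract defines its own names.
    route: str          # e.g. "managed-api", "hosted-open-weight", "self-hosted"
    model: str
    input_tokens: int
    cached_tokens: int
    output_tokens: int
    latency_ms: float
    queued_ms: float
    batched: bool
    retries: int
    cost_usd: float
    accepted: bool

def derived_metrics(rows: list[TraceRow]) -> dict:
    """A few derived metrics of the kind reported per model/provider route."""
    accepted = [r for r in rows if r.accepted]
    total_input = sum(r.input_tokens for r in rows)
    return {
        "cache_hit_share": sum(r.cached_tokens for r in rows) / max(total_input, 1),
        "batch_share": sum(r.batched for r in rows) / len(rows),
        "retry_rate": sum(r.retries for r in rows) / len(rows),
        "cost_per_accepted_output": sum(r.cost_usd for r in rows) / max(len(accepted), 1),
    }
```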
A cheap model is not cheap if it creates rejected outputs or review fallout.
Interactive agents fail on slow tails, not average latency.
Repeated policy and schema prefixes should visibly bend the cost curve.
Offline work should not be priced like live chat if the product can wait.
Dedicated or self-hosted routes only make sense when utilization can stay high.
Hardware Procurement Matrix
A matrix for deciding when an AI buyer should stay on managed APIs, test hosted open-weight inference, reserve dedicated endpoints, self-host on cloud GPUs, or buy hardware. The decision is not API versus hardware; it is which route survives demand shape, utilization, memory fit, operations ownership, and fallback economics.
Enough traffic evidence to compare hosted, dedicated, and self-hosted routes.
Needs measured occupancy from real provider or GPU traces before buying capacity.
Serving, rollback, upgrade, and incident ownership must be explicit before self-hosting.
Route hard cases back to a managed model and report blended cost per accepted output.
Security, residency, vendor support, and budget horizon still need buyer-specific input.
Demand shape
Is the workload bursty, steady, interactive, or offline?
Hourly request histogram, p95 latency target, batchable share, concurrency peaks, and idle windows.
Utilization
Can the buyer keep the accelerator busy enough to beat hosted inference?
Accepted output tokens per second, GPU occupancy, queue depth, autoscaling floor, and peak-to-idle ratio. A break-even sketch follows this matrix.
Memory fit
Do the model, KV cache, batch size, and context window fit the planned hardware route?
Model size, quantization plan, context length, batch size, KV-cache budget, and observed out-of-memory events.
Operations ownership
Who owns serving, upgrades, observability, security patches, and incident response?
Runbook owner, monitoring plan, rollback route, model upgrade policy, and security review.
Fallback economics
What happens when the open or self-hosted route fails the task or saturates?
Fallback model, routing threshold, retry policy, review minutes, and fallback cost per accepted output.
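A sketch of the break-even occupancy behind the utilization row above, assuming a flat GPU hourly rate and a single hosted per-token price; it ignores operations, observability, and incident cost, which push the real threshold higher.

```python
def breakeven_occupancy(gpu_hour_price: float,
                        hosted_price_per_mtok: float,
                        tokens_per_sec_when_busy: float) -> float:
    """Occupancy at which a self-hosted GPU matches a hosted per-token price.

    All inputs are placeholders to be replaced with measured traces; operations
    and incident cost are excluded, so the real break-even sits higher.
    """
    hosted_cost_per_busy_hour = (
        tokens_per_sec_when_busy * 3600 / 1e6 * hosted_price_per_mtok
    )
    return gpu_hour_price / hosted_cost_per_busy_hour

# $4/hr GPU vs a hosted route at $2.50 per 1M tokens, 900 tok/s while busy:
# below ~49% occupancy the hosted route stays cheaper per token.
print(breakeven_occupancy(4.00, 2.50, 900))  # ≈ 0.49
```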
Models
Track frontier, mini, and open-weight models by reasoning quality, instruction following, multilingual robustness, context length, latency, and deployment constraints.
Agents
Treat agents as systems built on models, tools, memory, browsers, terminals, retrieval, policies, and verification loops.
Hardware
Compare GPUs, inference providers, local deployment, cache strategy, batch processing, and throughput bottlenecks by workload.