Models, agents and hardware
Models do not ship alone. Production systems combine models, agents, tools, retrieval, browsers, GPUs, API providers, cache policy, and monitoring. This page keeps that stack in one place.
Inference Economics Playbook
A playbook for comparing managed APIs, hosted open-weight inference, dedicated endpoints, and self-hosted GPU stacks.
Cost per accepted output at target p95 latency, reported with cache hit rate, batch eligibility, retry rate, utilization, and fallback route.
Route
Managed frontier API
High-stakes reasoning, low ops burden, fastest path to production evidence.
Nominal token price can be cheaper than self-hosting when quality reduces retries and review.
Route
Hosted open-weight inference
Testing Mistral, Qwen, DeepSeek, Llama, and other open models without owning GPUs.
Provider/model choice, batching, and replica-local cache behavior can dominate the final curve.
Route
Dedicated endpoint
Predictable production traffic, data boundary control, and more consistent latency.
A dedicated endpoint can be more expensive than serverless if utilization stays low.
Route
Self-hosted GPU stack
Strict data control, high utilization, custom kernels, fine-tuned models, and deep optimization.
The real cost includes upgrades, observability, security, model serving, and incident response.
Cost levers
Inference route choice should be proven with workload traces: token ledgers, latency percentiles, queue behavior, utilization, and cost per accepted output.
Batch eligibility
Can this workflow wait minutes or hours instead of responding live?
Queue delay tolerance, SLA class, batch endpoint availability, accepted-output score at batch settings.
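A minimal sketch of the eligibility math, assuming a flat batch discount and a queue-delay tolerance set by the product; the discount, prices, and function name are illustrative, not provider terms.

```python
def batch_route_savings(live_price_per_mtok: float,
                        batch_discount: float,
                        monthly_tokens: float,
                        max_queue_delay_s: float,
                        batch_turnaround_s: float) -> float:
    """Monthly savings from moving batch-eligible traffic off the live route.

    batch_discount is an assumed flat discount (0.5 = 50%); real batch endpoints
    publish their own pricing and turnaround guarantees.
    """
    if batch_turnaround_s > max_queue_delay_s:
        return 0.0  # the workflow cannot wait, so the batch route is not eligible
    live_cost = monthly_tokens / 1e6 * live_price_per_mtok
    return live_cost * batch_discount

# Example: 200M tokens/month that can wait a day, assumed 50% discount, ~4h turnaround.
print(batch_route_savings(2.50, 0.50, 200e6, 24 * 3600, 4 * 3600))  # 250.0
```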
Prompt cache shape
Which policy, schema, instruction, or retrieval blocks repeat across requests?
Stable prefix design, cache hit rate, cache-read/write pricing, session affinity behavior.
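A sketch of how a repeated prefix bends the input-cost curve, assuming separate cache-read and cache-write prices and a measured hit rate; all prices here are placeholders, not any provider's rate card.

```python
def blended_input_price_per_mtok(base_price: float,
                                 cache_read_price: float,
                                 cache_write_price: float,
                                 hit_rate: float,
                                 prefix_share: float) -> float:
    """Effective input price per 1M tokens when a stable prefix is cacheable.

    prefix_share: fraction of input tokens in the repeated policy/schema prefix.
    hit_rate: fraction of requests that find that prefix already cached.
    """
    # Hits read the prefix at the cheaper rate; misses pay the cache-write rate.
    prefix_price = hit_rate * cache_read_price + (1.0 - hit_rate) * cache_write_price
    return (1.0 - prefix_share) * base_price + prefix_share * prefix_price

# Example: 60% of each prompt is a stable prefix and 90% of requests hit the cache.
print(blended_input_price_per_mtok(3.00, 0.30, 3.75, 0.90, 0.60))  # ≈ 1.59
```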
Throughput and utilization
How many accepted tokens per second does the system produce at target latency?
Tokens/sec/user, requests/sec, GPU occupancy, queue depth, p50/p95/p99 latency.
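A sketch of the occupancy math: cost per 1M accepted output tokens on a dedicated or self-hosted GPU, under an assumed hourly rate and measured busy-time throughput.

```python
def gpu_cost_per_mtok(gpu_hour_price: float,
                      tokens_per_sec_when_busy: float,
                      occupancy: float) -> float:
    """Cost per 1M accepted output tokens on a dedicated or self-hosted GPU.

    tokens_per_sec_when_busy: accepted-token throughput while the replica serves;
    occupancy: fraction of paid GPU-hours actually spent serving.
    gpu_hour_price is a placeholder, not a quoted rate.
    """
    tokens_per_paid_hour = tokens_per_sec_when_busy * 3600 * occupancy
    return gpu_hour_price / tokens_per_paid_hour * 1e6

# The same GPU at 35% vs 80% occupancy: idle hours more than double the unit cost.
print(gpu_cost_per_mtok(4.00, 900, 0.35))  # ≈ 3.53 per 1M tokens
print(gpu_cost_per_mtok(4.00, 900, 0.80))  # ≈ 1.54 per 1M tokens
```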
Queueing and tail latency
What happens at peak load, long context, and retry storms?
p95/p99 latency, timeout rate, retry count, queue wait, rate-limit behavior.
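A sketch of the tail summary computed from raw per-request traces; the field names and numbers are assumptions, not the trace-kit contract.

```python
import statistics

def tail_latency_summary(latency_ms: list[float],
                         timed_out: list[bool],
                         retries: list[int]) -> dict:
    """Summarise tail behaviour from per-request traces (three parallel lists)."""
    cuts = statistics.quantiles(latency_ms, n=100, method="inclusive")
    return {
        "p50_ms": cuts[49],
        "p95_ms": cuts[94],
        "p99_ms": cuts[98],
        "timeout_rate": sum(timed_out) / len(timed_out),
        "mean_retries": sum(retries) / len(retries),
    }

# A long tail: 90 fast requests, a few slow ones, two stalls behind the queue.
lat = [800.0] * 90 + [2500.0] * 8 + [9000.0, 12000.0]
print(tail_latency_summary(lat, [False] * 98 + [True] * 2, [0] * 95 + [1] * 5))
```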
Quality-adjusted cost
What is the cost per accepted workflow after retries and human review?
Pass rate, review minutes, failure severity, fallback route, cost per accepted output.
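A sketch of the quality-adjusted calculation, assuming failed attempts are retried until one passes and review is billed at a flat hourly rate; the numbers are illustrative only.

```python
def cost_per_accepted_output(cost_per_attempt: float,
                             pass_rate: float,
                             review_minutes: float,
                             reviewer_hourly_rate: float) -> float:
    """Cost per accepted output after retries and human review.

    Assumes failures are retried until one passes (expected attempts = 1/pass_rate)
    and review is billed at a flat hourly rate; both are simplifying assumptions.
    """
    model_cost = cost_per_attempt / pass_rate
    review_cost = review_minutes / 60.0 * reviewer_hourly_rate
    return model_cost + review_cost

# The "cheap" route loses once review minutes are priced in.
print(cost_per_accepted_output(0.02, 0.70, 4.0, 60.0))  # ≈ 4.03
print(cost_per_accepted_output(0.08, 0.95, 1.0, 60.0))  # ≈ 1.08
```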
Inference Trace Kit
Import contract for measured latency, throughput, cache, batch, cost, and acceptance traces across model/provider routes. This is the bridge from provider logs and notebooks into the cost-curve articles, Studio workbench, and buyer reports.
16 trace fields
5 derived metrics
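A sketch of what a trace row and a few derived metrics could look like; the field names below are hypothetical and abbreviated, not the kit's actual 16-field contract or its 5 reported metrics.

```python
from dataclasses import dataclass

@dataclass
class TraceRow:
    # Hypothetical, abbreviated fields; the kit's import contract defines its own names.
    route: str          # e.g. "managed-api", "hosted-open-weight", "self-hosted"
    model: str
    input_tokens: int
    cached_tokens: int
    output_tokens: int
    latency_ms: float
    queued_ms: float
    batched: bool
    retries: int
    cost_usd: float
    accepted: bool

def derived_metrics(rows: list[TraceRow]) -> dict:
    """A few derived metrics of the kind reported per model/provider route."""
    accepted = [r for r in rows if r.accepted]
    total_input = sum(r.input_tokens for r in rows)
    return {
        "cache_hit_share": sum(r.cached_tokens for r in rows) / max(total_input, 1),
        "batch_share": sum(r.batched for r in rows) / len(rows),
        "retry_rate": sum(r.retries for r in rows) / len(rows),
        "cost_per_accepted_output": sum(r.cost_usd for r in rows) / max(len(accepted), 1),
    }
```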
A cheap model is not cheap if it creates rejected outputs or review fallout.
Interactive agents fail on slow tails, not average latency.
Repeated policy and schema prefixes should visibly bend the cost curve.
Offline work should not be priced like live chat if the product can wait.
Dedicated or self-hosted routes only make sense when utilization can stay high.
Hardware Procurement Matrix
A matrix for deciding when an AI buyer should stay on managed APIs, test hosted open-weight inference, reserve dedicated endpoints, self-host on cloud GPUs, or buy hardware. The decision is not API versus hardware; it is which route survives demand shape, utilization, memory fit, operations ownership, and fallback economics.
Enough traffic evidence to compare hosted, dedicated, and self-hosted routes.
Needs measured occupancy from real provider or GPU traces before buying capacity.
Serving, rollback, upgrade, and incident ownership must be explicit before self-hosting.
Route hard cases back to a managed model and report blended cost per accepted output.
Security, residency, vendor support, and budget horizon still need buyer-specific input.
Demand shape
Is the workload bursty, steady, interactive, or offline?
Hourly request histogram, p95 latency target, batchable share, concurrency peaks, and idle windows.
Utilization
Can the buyer keep the accelerator busy enough to beat hosted inference?
Accepted output tokens per second, GPU occupancy, queue depth, autoscaling floor, and peak-to-idle ratio. A break-even sketch follows this matrix.
Memory fit
Do the model, KV cache, batch size, and context window fit the planned hardware route?
Model size, quantization plan, context length, batch size, KV-cache budget, and observed out-of-memory events.
Operations ownership
Who owns serving, upgrades, observability, security patches, and incident response?
Runbook owner, monitoring plan, rollback route, model upgrade policy, and security review.
Fallback economics
What happens when the open or self-hosted route fails the task or saturates?
Fallback model, routing threshold, retry policy, review minutes, and fallback cost per accepted output.
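A sketch of the break-even occupancy behind the utilization row above, assuming a flat GPU hourly rate and a single hosted per-token price; it ignores operations, observability, and incident cost, which push the real threshold higher.

```python
def breakeven_occupancy(gpu_hour_price: float,
                        hosted_price_per_mtok: float,
                        tokens_per_sec_when_busy: float) -> float:
    """Occupancy at which a self-hosted GPU matches a hosted per-token price.

    All inputs are placeholders to be replaced with measured traces; operations
    and incident cost are excluded, so the real break-even sits higher.
    """
    hosted_cost_per_busy_hour = (
        tokens_per_sec_when_busy * 3600 / 1e6 * hosted_price_per_mtok
    )
    return gpu_hour_price / hosted_cost_per_busy_hour

# $4/hr GPU vs a hosted route at $2.50 per 1M tokens, 900 tok/s while busy:
# below ~49% occupancy the hosted route stays cheaper per token.
print(breakeven_occupancy(4.00, 2.50, 900))  # ≈ 0.49
```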
Models
Track frontier, mini, and open-weight models by reasoning quality, instruction following, multilingual robustness, context length, latency, and deployment constraints.
Agents
Treat agents as systems built on models, tools, memory, browsers, terminals, retrieval, policies, and verification loops.
Hardware
Compare GPUs, inference providers, local deployment, cache strategy, batch processing, and throughput bottlenecks by workload.