# Inference Economics Playbook

A playbook for comparing the economics of managed APIs, hosted open-weight inference, dedicated endpoints, and self-hosted GPU stacks.

Recommended metric: Cost per accepted output at target p95 latency, reported with cache hit rate, batch eligibility, retry rate, utilization, and fallback route.
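
A minimal sketch of the metric, assuming a fully loaded reviewer rate and that every output gets reviewed; the function name and the $1/minute default are illustrative, not a standard:

```python
# Sketch: cost per accepted output, including reviewer time.
# The reviewer rate and field names are illustrative assumptions.

def cost_per_accepted_output(
    cost_usd: float,                  # observed or modeled spend per run
    accepted_output_rate: float,      # share of outputs accepted (0..1)
    review_minutes: float = 0.0,      # human review minutes per output
    reviewer_rate_usd_per_min: float = 1.0,  # assumed loaded reviewer cost
) -> float:
    """Effective cost of one accepted output."""
    if accepted_output_rate <= 0:
        raise ValueError("no accepted outputs; metric is undefined")
    # Review happens on every output, so it is amortized over accepted ones.
    per_output = cost_usd + review_minutes * reviewer_rate_usd_per_min
    return per_output / accepted_output_rate

# $0.04/run at 80% acceptance with 0.5 review minutes:
# (0.04 + 0.5 * 1.0) / 0.8 = $0.675 per accepted output.
print(cost_per_accepted_output(0.04, 0.8, 0.5))
```

Filter runs to those meeting the p95 latency target before aggregating, so the number is reported at the latency the workload actually needs.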

## Route Map

| Route | Best for | Main watchout |
| --- | --- | --- |
| Managed frontier API | High-stakes reasoning, low ops burden, fastest path to production evidence. | Nominal token price looks expensive, yet effective cost can beat self-hosting when quality reduces retries and review. |
| Hosted open-weight serverless | Testing Mistral, Qwen, DeepSeek, Llama, and other open models without owning GPUs. | Provider/model choice, batching, and replica-local cache behavior can dominate the final curve. |
| Dedicated endpoint | Predictable production traffic, data boundary control, and more consistent latency. | A dedicated endpoint can be more expensive than serverless if utilization stays low. |
| Self-hosted GPU stack | Strict data control, high utilization, custom kernels, fine-tuned models, and deep optimization. | The real cost includes upgrades, observability, security, model serving, and incident response. |

## Cost Levers

| Lever | Buyer question | Evidence |
| --- | --- | --- |
| Batch eligibility | Can this workflow wait minutes or hours instead of responding live? | Queue delay tolerance, SLA class, batch endpoint availability, accepted-output score at batch settings. |
| Prompt cache shape | Which policy, schema, instruction, or retrieval blocks repeat across requests? | Stable prefix design, cache hit rate, cache-read/write pricing, session affinity behavior. |
| Throughput and utilization | How many accepted tokens per second does the system produce at target latency? | Tokens/sec/user, requests/sec, GPU occupancy, queue depth, p50/p95/p99 latency. |
| Queueing and tail latency | What happens at peak load, long context, and retry storms? | p95/p99 latency, timeout rate, retry count, queue wait, rate-limit behavior. |
| Quality-adjusted cost | What is the cost per accepted workflow after retries and human review? | Pass rate, review minutes, failure severity, fallback route, cost per accepted output. |
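
To see how the cache lever bends the curve, a sketch under assumed pricing; the per-million-token price and the 10% cache-read discount are placeholders, not any provider's published rates:

```python
# Sketch of cache-adjusted input cost under placeholder pricing.

def effective_input_cost(
    input_tokens: int,
    cache_hit_rate: float,              # 0..1 share of input tokens read from cache
    price_per_mtok: float = 3.00,       # assumed uncached input price per million tokens
    cache_read_discount: float = 0.10,  # assumed cached reads bill at 10% of full price
) -> float:
    cached = input_tokens * cache_hit_rate
    uncached = input_tokens - cached
    return (uncached * price_per_mtok + cached * price_per_mtok * cache_read_discount) / 1e6

# A stable 8k-token policy prefix reused across 10k-token requests:
print(effective_input_cost(10_000, 0.80))  # ~$0.0084 with the cache
print(effective_input_cost(10_000, 0.00))  # $0.0300 without it
```

Cache-write charges and session-affinity misses would raise the cached figure somewhat; the point is the shape of the curve, not the exact numbers.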

## Sources

- [NVIDIA Blackwell MLPerf inference results](https://blogs.nvidia.com/blog/blackwell-mlperf-inference/): Rack-scale inference performance and throughput can change materially with hardware generation and serving software.
- [NVIDIA Blackwell Ultra MLPerf inference](https://blogs.nvidia.com/blog/mlperf-inference-blackwell-ultra/): Reasoning inference throughput is an active benchmark surface, not a fixed property of a GPU alone.
- [Together AI pricing](https://www.together.ai/pricing): Hosted inference surfaces serverless, dedicated, and batch economics that should be compared by workload class.
- [Together AI Batch Inference docs](https://docs.together.ai/docs/inference/batch): Batch is appropriate for offline classification, evals, generation, and summarization, not interactive work.
- [Fireworks pricing](https://fireworks.ai/pricing): Serverless token pricing and on-demand deployment pricing create distinct economics for open-model routes.
- [Fireworks serverless overview](https://docs.fireworks.ai/serverless/overview): Prompt caching and affinity behavior can affect effective cost and latency for repeated prefixes.

## Hardware Procurement Matrix

A matrix for deciding when an AI buyer should stay on managed APIs, test hosted open-weight inference, reserve dedicated endpoints, self-host on cloud GPUs, or buy hardware.

Decision gates:
- **Demand shape:** Is the workload bursty, steady, interactive, or offline? Evidence: Hourly request histogram, p95 latency target, batchable share, concurrency peaks, and idle windows.
- **Utilization proof:** Can the buyer keep the accelerator busy enough to beat hosted inference? Evidence: Accepted output tokens per second, GPU occupancy, queue depth, autoscaling floor, and peak-to-idle ratio.
- **Memory and context fit:** Do the model, KV cache, batch size, and context window fit the planned hardware route? Evidence: Model size, quantization plan, context length, batch size, KV-cache budget, and observed out-of-memory events.
- **Operations maturity:** Who owns serving, upgrades, observability, security patches, and incident response? Evidence: Runbook owner, monitoring plan, rollback route, model upgrade policy, and security review.
- **Fallback economics:** What happens when the open or self-hosted route fails the task or saturates? Evidence: Fallback model, routing threshold, retry policy, review minutes, and fallback cost per accepted output.

Route choices (a gate-driven selection sketch follows the table):

| Route | Stage | Use when | Minimum evidence | Avoid when |
| --- | --- | --- | --- | --- |
| Managed API only | Baseline | The buyer needs speed, high-quality reasoning, uncertain demand, and minimal infrastructure ownership. | Private task pass rate, cost per accepted output, p95 latency, and retry/review fallout. | The workload is steady, privacy-constrained, or large enough that dedicated capacity can be proven. |
| Hosted open-weight serverless | Optionality | Teams want to test Mistral, Qwen, DeepSeek, Llama, or specialist models without buying capacity. | Model fit by task class, token price, cold-start behavior, cache policy, and acceptance rate. | Tail latency and provider variability break the workflow or procurement needs a fixed data boundary. |
| Dedicated hosted endpoint | Production lane | Traffic is predictable enough to reserve replicas, and latency/governance need tighter control. | Seven-day traffic shape, autoscaling floor, utilization, p95/p99 latency, and fallback route. | Utilization is weak, model choice is still changing weekly, or incident ownership is unclear. |
| Cloud GPU self-hosting | Control lane | The buyer needs custom serving, data control, fine-tuned models, or high-utilization open-weight inference. | GPU-hour cost, serving stack benchmark, occupancy, queue depth, upgrade runbook, and security review. | The team is still learning which model solves the workflow or cannot monitor serving reliably. |
| Owned hardware | Strategic capacity | Workload volume, data boundary, procurement horizon, and engineering maturity justify long-lived capacity. | Three-month demand forecast, depreciation model, power/cooling plan, support contract, and fallback API. | Demand is speculative, model architecture is volatile, or utilization cannot be audited continuously. |
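
One hedged way to wire the five gates to these routes is a fail-closed selector. All thresholds and field names below are illustrative assumptions, not a validated policy; owned hardware is omitted because it needs a demand forecast this sketch does not model:

```python
# Sketch: map decision-gate evidence onto a route. Thresholds are placeholders.

from dataclasses import dataclass

@dataclass
class GateEvidence:
    batchable_share: float        # share of traffic that can wait (0..1)
    sustained_utilization: float  # projected accelerator occupancy (0..1)
    fits_memory: bool             # model + KV cache + batch fits target hardware
    ops_owner_named: bool         # someone owns serving, upgrades, incidents
    fallback_costed: bool         # fallback route priced per accepted output

def choose_route(g: GateEvidence) -> str:
    # Fail closed: missing operational evidence keeps the buyer on managed APIs.
    if not (g.ops_owner_named and g.fallback_costed):
        return "managed API only"
    if not g.fits_memory:
        return "hosted open-weight serverless"
    if g.sustained_utilization >= 0.70:
        # High, auditable utilization is the precondition for owning capacity.
        return "cloud GPU self-hosting"
    if g.sustained_utilization >= 0.40 or g.batchable_share >= 0.50:
        return "dedicated hosted endpoint"
    return "hosted open-weight serverless"

print(choose_route(GateEvidence(0.3, 0.75, True, True, True)))  # cloud GPU self-hosting
```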

## Inference Trace Kit

An import contract for measured latency, throughput, cache, batch, cost, and acceptance traces across model and provider routes; a schema sketch follows the field list.

Required fields:
- **route:** Managed frontier API, hosted open-weight serverless, dedicated endpoint, or self-hosted GPU stack.
- **provider:** Provider, deployment owner, or local stack name.
- **model:** Exact model identifier, quantization, or endpoint alias.
- **workload:** Task family such as support policy, browser extraction, coding agent, or invoice review.
- **inputTokens:** Observed input tokens for the run.
- **outputTokens:** Observed output tokens for the run.
- **cacheHitRate:** Observed cache hit rate from 0 to 1, or 0 if unavailable.
- **batchEligible:** true if the workload can run through a batch/offline lane.
- **concurrency:** Concurrent request count during the measurement.
- **ttftMs:** Time to first token in milliseconds.
- **latencyP50Ms:** p50 end-to-end latency in milliseconds.
- **latencyP95Ms:** p95 end-to-end latency in milliseconds.
- **outputTokensPerSecond:** Sustained output-token generation rate.
- **acceptedOutputRate:** Share of outputs accepted by the evaluator or reviewer.
- **costUsd:** Observed or modeled cost for this measurement row.
- **reviewMinutes:** Human review minutes per output after model generation.
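
The field list doubles as a row schema. A minimal Python sketch that keeps the contract's camelCase names for import fidelity; the types are assumptions, not a published spec:

```python
# Minimal trace-row schema mirroring the required fields above.

from dataclasses import dataclass

@dataclass
class TraceRow:
    route: str                    # managed API, hosted serverless, dedicated, self-hosted
    provider: str                 # provider, deployment owner, or local stack name
    model: str                    # exact model id, quantization, or endpoint alias
    workload: str                 # task family, e.g. "invoice review"
    inputTokens: int
    outputTokens: int
    cacheHitRate: float           # 0..1, 0 if unavailable
    batchEligible: bool
    concurrency: int
    ttftMs: float
    latencyP50Ms: float
    latencyP95Ms: float
    outputTokensPerSecond: float
    acceptedOutputRate: float     # 0..1
    costUsd: float
    reviewMinutes: float
```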

Derived metrics:
- **Cost per accepted output:** (costUsd + review cost) / acceptedOutputRate, where review cost prices reviewMinutes at a loaded reviewer rate. A cheap model is not cheap if it creates rejected outputs or review fallout.
- **p95 latency at concurrency:** latencyP95Ms grouped by workload and route. Interactive agents fail on slow tails, not average latency.
- **Cache-adjusted input burden:** inputTokens * (1 - cacheHitRate). Repeated policy and schema prefixes should visibly bend the cost curve.
- **Batch displacement share:** batchEligible rows / all rows by workload. Offline work should not be priced like live chat if the product can wait.
- **Throughput headroom:** outputTokensPerSecond * concurrency * acceptedOutputRate. Dedicated or self-hosted routes only make sense when utilization can stay high.
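
Pulling these together, a sketch that computes the derived metrics over TraceRow records from the schema above; the aggregation choices (means, sums, worst-row maxima) are illustrative assumptions:

```python
# Sketch: summarize one workload/route slice of TraceRow records.

def derived_metrics(rows) -> dict:
    if not rows:
        raise ValueError("no trace rows to summarize")
    n = len(rows)
    mean_acceptance = sum(r.acceptedOutputRate for r in rows) / n
    return {
        # Cost per accepted output, before reviewer time is priced in.
        "cost_per_accepted_output": (sum(r.costUsd for r in rows) / n)
        / max(mean_acceptance, 1e-9),
        # Interactive agents fail on tails: report the worst row-level p95.
        "worst_p95_latency_ms": max(r.latencyP95Ms for r in rows),
        # Tokens the provider actually re-processes after cache hits.
        "cache_adjusted_input_tokens": sum(
            r.inputTokens * (1 - r.cacheHitRate) for r in rows
        ),
        # Share of rows that could move to a batch/offline lane.
        "batch_displacement_share": sum(r.batchEligible for r in rows) / n,
        # Accepted-token throughput available at observed concurrency.
        "throughput_headroom": max(
            r.outputTokensPerSecond * r.concurrency * r.acceptedOutputRate
            for r in rows
        ),
    }
```

Group rows by workload and route before calling it; mixing interactive and batch traffic hides exactly the effects these metrics exist to expose.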
