# Hardware Procurement Matrix

This matrix helps an AI buyer decide when to stay on managed APIs, test hosted open-weight inference, reserve dedicated endpoints, self-host on cloud GPUs, or buy hardware outright.

## Decision Gates

| Gate | Buyer question | Evidence needed | Failure if skipped |
| --- | --- | --- | --- |
| Demand shape | Is the workload bursty, steady, interactive, or offline? | Hourly request histogram, p95 latency target, batchable share, concurrency peaks, and idle windows (see the demand-shape sketch after this table). | Teams reserve expensive capacity for a workload that could have stayed serverless or batched. |
| Utilization proof | Can the buyer keep the accelerator busy enough to beat hosted inference? | Accepted output tokens per second, GPU occupancy, queue depth, autoscaling floor, and peak-to-idle ratio (see the break-even sketch after this table). | A self-hosted route looks cheap on token math but loses money during idle time and incidents. |
| Memory and context fit | Do the model, KV cache, batch size, and context window fit the planned hardware route? | Model size, quantization plan, context length, batch size, KV-cache budget, and observed out-of-memory events (see the KV-cache sketch after this table). | The selected GPU can run a demo but cannot hold production context, throughput, or concurrent sessions. |
| Operations maturity | Who owns serving, upgrades, observability, security patches, and incident response? | Runbook owner, monitoring plan, rollback route, model upgrade policy, and security review. | The hardware route becomes an unowned platform project instead of a cheaper inference path. |
| Fallback economics | What happens when the open or self-hosted route fails the task or saturates? | Fallback model, routing threshold, retry policy, review minutes, and fallback cost per accepted output. | Savings disappear because hard cases silently route through a premium model without being measured. |
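
The demand-shape gate reduces to a few ratios over an hourly request log. A minimal sketch, assuming the buyer already exports an hourly request histogram; the function name, field names, and the 10% idle threshold are illustrative, not a standard:

```python
from statistics import mean

def demand_shape(hourly_requests: list[int]) -> dict:
    """Summarize an hourly request histogram into the evidence the
    demand-shape gate asks for: peak concurrency proxy, peak-to-idle
    ratio, and share of idle hours."""
    peak = max(hourly_requests)
    floor = min(hourly_requests)
    # Hours below 10% of peak are treated as idle; the threshold is a placeholder.
    idle_share = sum(1 for r in hourly_requests if r < 0.1 * peak) / len(hourly_requests)
    return {
        "peak_per_hour": peak,
        "mean_per_hour": mean(hourly_requests),
        "peak_to_idle_ratio": peak / max(floor, 1),  # guard against zero-traffic hours
        "idle_share": idle_share,
    }

# A spiky weekday trace: a high peak_to_idle_ratio and idle_share argue for
# serverless or batch routes rather than reserved capacity.
# demand_shape([2, 1, 0, 0, 5, 40, 90, 120, 110, 60, 30, 8])
```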
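
The utilization gate is arithmetic: a GPU-hour only beats hosted token pricing above a break-even occupancy. A back-of-envelope sketch; the GPU price, throughput, and hosted rate in the comment are placeholders, not quotes:

```python
def self_host_break_even(gpu_hour_usd: float,
                         tokens_per_second: float,
                         hosted_usd_per_million_tokens: float) -> float:
    """Return the sustained utilization (0-1) at which a self-hosted GPU
    matches the hosted per-token price. Above this, self-hosting wins on
    token math alone; the ops-maturity and fallback gates still apply."""
    tokens_per_hour_at_full_load = tokens_per_second * 3600
    hosted_cost_at_full_load = (tokens_per_hour_at_full_load / 1e6
                                * hosted_usd_per_million_tokens)
    return gpu_hour_usd / hosted_cost_at_full_load

# Placeholder numbers: a $4/hr GPU serving 2,500 output tok/s against a
# $0.60-per-million hosted rate needs ~74% sustained utilization to break even.
# self_host_break_even(4.0, 2500, 0.60)  # -> ~0.74
```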
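
The memory gate is also checkable before any hardware is selected. This sketch uses the standard dense-transformer KV-cache estimate (K and V tensors per layer, one element per KV head per token); the model dimensions in the comment are illustrative:

```python
def kv_cache_gib(layers: int, kv_heads: int, head_dim: int,
                 context_len: int, batch_size: int,
                 bytes_per_elem: int = 2) -> float:
    """Dense-transformer KV-cache estimate. bytes_per_elem=2 assumes an
    fp16/bf16 cache; quantized caches shrink this proportionally."""
    elems = 2 * layers * kv_heads * head_dim * context_len * batch_size
    return elems * bytes_per_elem / 2**30

def fits_on_gpu(gpu_gib: float, weights_gib: float, cache_gib: float,
                overhead_gib: float = 2.0) -> bool:
    """Weights + KV cache + a serving-stack allowance must fit; the 2 GiB
    overhead is a placeholder that depends on the chosen serving stack."""
    return weights_gib + cache_gib + overhead_gib <= gpu_gib

# Illustrative 8B-class model (32 layers, 8 KV heads, head_dim 128) at 32k
# context and batch 8: ~32 GiB of KV cache alone, before weights.
# kv_cache_gib(32, 8, 128, 32_768, 8)  # -> 32.0
```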

## Route Choices

| Route | Stage | Use when | Minimum evidence | Avoid when | Owner |
| --- | --- | --- | --- | --- | --- |
| Managed API only | Baseline | The buyer needs speed, high-quality reasoning, uncertain demand, and minimal infrastructure ownership. | Private task pass rate, cost per accepted output, p95 latency, and retry/review fallout. | The workload is steady, privacy-constrained, or large enough that dedicated capacity can be proven. | Product and AI lead |
| Hosted open-weight serverless | Optionality | Teams want to test Mistral, Qwen, DeepSeek, Llama, or specialist models without buying capacity. | Model fit by task class, token price, cold-start behavior, cache policy, and acceptance rate. | Tail latency and provider variability break the workflow or procurement needs a fixed data boundary. | AI engineer |
| Dedicated hosted endpoint | Production lane | Traffic is predictable enough to reserve replicas, and latency/governance need tighter control. | Seven-day traffic shape, autoscaling floor, utilization, p95/p99 latency, and fallback route. | Utilization is weak, model choice is still changing weekly, or incident ownership is unclear. | Platform and operations |
| Cloud GPU self-hosting | Control lane | The buyer needs custom serving, data control, fine-tuned models, or high-utilization open-weight inference. | GPU-hour cost, serving stack benchmark, occupancy, queue depth, upgrade runbook, and security review. | The team is still learning which model solves the workflow or cannot monitor serving reliably. | Infrastructure owner |
| Owned hardware | Strategic capacity | Workload volume, data boundary, procurement horizon, and engineering maturity justify long-lived capacity. | Three-month demand forecast, depreciation model, power/cooling plan, support contract, and fallback API. | Demand is speculative, model architecture is volatile, or utilization cannot be audited continuously. | Executive sponsor and infrastructure |
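
The table can be read as a top-down cascade: stay one lane earlier whenever a gate fails. A sketch of that reading, with boolean inputs standing in for real gate evidence and a placeholder 60% utilization threshold rather than a recommendation:

```python
def recommend_route(demand_proven: bool, model_stable: bool, ops_owned: bool,
                    utilization: float, long_horizon: bool) -> str:
    """Walk the route table top-down, falling back one lane per failed gate.
    Inputs map to the Decision Gates evidence; 0.6 is a placeholder."""
    if not demand_proven:
        return "Managed API only"                # baseline until traffic shape is measured
    if not model_stable:
        return "Hosted open-weight serverless"   # keep optionality while model choice churns
    if not ops_owned:
        return "Dedicated hosted endpoint"       # provider keeps serving and incident ownership
    if utilization < 0.6:
        return "Dedicated hosted endpoint"       # control without paying for idle GPUs
    if long_horizon:
        return "Owned hardware"                  # strategic capacity with an executive sponsor
    return "Cloud GPU self-hosting"              # occupancy justifies owning the stack
```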

## Readiness Scores

- **Demand forecast:** 82/100. Enough traffic evidence to compare hosted, dedicated, and self-hosted routes.
- **Utilization model:** 68/100. Needs occupancy measured from real provider traces or GPU traces before any capacity is reserved or bought.
- **Ops ownership:** 56/100. Serving, rollback, upgrade, and incident ownership must be explicit before self-hosting.
- **Fallback design:** 74/100. Route hard cases back to a managed model and report blended cost per accepted output (see the sketch after this list).
- **Procurement clarity:** 61/100. Security, residency, vendor support, and budget horizon still need buyer-specific input.
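
The fallback-design score hinges on one number: blended cost per accepted output. A minimal sketch, assuming rejected open-route outputs are retried once on the managed fallback; all prices and acceptance rates below are placeholders:

```python
def blended_cost_per_accepted_output(requests: int,
                                     open_route_usd: float,     # cost per request, open route
                                     open_accept_rate: float,   # share of open outputs accepted
                                     fallback_usd: float,       # cost per request, managed fallback
                                     fallback_accept_rate: float) -> float:
    """Total spend divided by accepted outputs, with rejects re-routed to
    the managed fallback. This is the number the fallback gate asks for."""
    open_accepted = requests * open_accept_rate
    fallback_requests = requests - open_accepted
    fallback_accepted = fallback_requests * fallback_accept_rate
    total_usd = requests * open_route_usd + fallback_requests * fallback_usd
    return total_usd / (open_accepted + fallback_accepted)

# Placeholder numbers: 1,000 requests at $0.002 on the open route with 80%
# acceptance, rejects retried on a $0.02 managed model at 95% acceptance,
# blends to ~$0.0061 per accepted output, not $0.002.
# blended_cost_per_accepted_output(1000, 0.002, 0.80, 0.02, 0.95)
```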
