What drives the cost of common features
Most LLM features fall into a handful of shapes, each concentrating cost in a different place. Knowing where tells you which lever to pull — and when a cheaper tier is good enough.
The sticker price (input and output rates per million tokens) is the same for every feature you build on a given model. What changes is the shape of the workload: how much you send, how much comes back, how often the same context repeats, and whether the task is hard enough to need a capable model. Two features on the same model can differ by 100× in cost per call purely because of shape.
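A minimal sketch of that arithmetic, assuming hypothetical prices of $1 per million input tokens and $4 per million output tokens. The token counts are illustrative and are not taken from any real workload:

```python
def cost_per_call(input_tokens, output_tokens, input_price_per_m, output_price_per_m):
    """Dollar cost of one call, given per-million-token prices."""
    return (input_tokens * input_price_per_m
            + output_tokens * output_price_per_m) / 1_000_000

# Hypothetical prices for a single model (assumptions, not real quotes).
IN_PRICE, OUT_PRICE = 1.0, 4.0

# A short classification call: small input, a few tokens out.
classify = cost_per_call(300, 5, IN_PRICE, OUT_PRICE)

# A long-context answer: large input, a couple of thousand tokens out.
long_answer = cost_per_call(24_000, 2_000, IN_PRICE, OUT_PRICE)

# Same model, same prices; only the shape of the call differs.
ratio = long_answer / classify
print(f"classify: ${classify:.5f}  long answer: ${long_answer:.5f}  ratio: {ratio:.0f}x")
```

With these made-up numbers the two calls land about two orders of magnitude apart, even though the per-token prices are identical.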
Five common shapes. For each: a worked example, where cost concentrates, the dominant lever, and the cheaper tier that often does the job. Tier calls are a cost comparison, not a quality verdict — validate against your own eval before you ship.
Output costs 4.0× input, on average
Output costs more than input at most providers, but not everywhere: across 280 models the output-to-input multiple ranges from 0.1× to 12.2×.
Live from the index — the per-1M spread every shape below is multiplied against.
RAG / question-over-documents
You retrieve a handful of documents, send them as context, and the model answers in a paragraph. The retrieval does the hard part; the model mostly reads carefully and declines to invent. This shape is overwhelmingly input-heavy — thousands of tokens go in to get a couple of hundred back.
Where the money concentrates: context, by a wide margin. The single most effective lever is retrieval precision — sending three relevant documents instead of ten cuts input cost more than any model swap. If the system prompt and document set repeat across requests, cached input (typically 50–90% off) changes which model comes out cheapest. Tier that often suffices: mid. The retrieval carries the reasoning load, so a frontier model rarely earns its premium here.
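To make the two levers concrete, here is a rough sketch that compares sending ten documents, sending three, and sending three with the repeated prefix served from cache. The token sizes, the $1 per million input price, and the 75% cache discount are all assumptions chosen for illustration. The sketch also ignores any cache-write surcharge a provider may apply.

```python
def input_cost(tokens, price_per_m, cached_tokens=0, cache_discount=0.0):
    """Dollar cost of the input side of one call.

    cached_tokens are billed at (1 - cache_discount) of the normal price;
    the rest are billed in full.
    """
    fresh = tokens - cached_tokens
    billable = fresh + cached_tokens * (1.0 - cache_discount)
    return billable * price_per_m / 1_000_000

# Hypothetical sizes: system prompt, one retrieved document, the user question.
SYSTEM, DOC, QUESTION = 500, 800, 50
PRICE = 1.0  # assumed $ per 1M input tokens

ten_docs = input_cost(SYSTEM + 10 * DOC + QUESTION, PRICE)
three_docs = input_cost(SYSTEM + 3 * DOC + QUESTION, PRICE)

# Same three documents, but the system prompt and documents repeat across
# requests and are read from cache at an assumed 75% discount.
three_docs_cached = input_cost(
    SYSTEM + 3 * DOC + QUESTION,
    PRICE,
    cached_tokens=SYSTEM + 3 * DOC,
    cache_discount=0.75,
)

for label, cost in [("10 docs", ten_docs), ("3 docs", three_docs),
                    ("3 docs, cached", three_docs_cached)]:
    print(f"{label:>15}: ${cost:.6f} input per call")
```

Under these assumptions, tightening retrieval cuts input cost by roughly two thirds before any model swap, and caching cuts what remains again.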
Multi-turn chat
A conversation that accumulates. Each turn re-sends the entire history so the model has context, which means the input grows with every exchange — and you pay to reprocess the whole transcript on every single turn. A short reply is cheap; the conversation behind it is not.
Where the money concentrates: re-sent history. Each turn's input grows linearly with the number of turns before it, so the cumulative cost of a conversation grows roughly with the square of its length if you do nothing. The levers: cached input (the stable prefix of the transcript is highly cacheable, which makes this the single biggest win for chat), trimming or summarising old turns once they stop being relevant, and capping context length. Tier that often suffices: small to mid for general chat; reserve frontier for genuinely hard reasoning threads.
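The quadratic growth can be checked with a short sketch. It assumes, as a simplification, that every turn adds the same number of tokens (user message plus reply) and that nothing is trimmed or cached:

```python
def total_input_tokens(turns, tokens_per_turn):
    """Total input tokens billed over a conversation that re-sends full history.

    Turn k sends roughly k * tokens_per_turn tokens: everything said so far
    plus the new message.
    """
    return sum(k * tokens_per_turn for k in range(1, turns + 1))

short_chat = total_input_tokens(10, 200)  # 200 * (1 + 2 + ... + 10) = 11000
long_chat = total_input_tokens(20, 200)   # 200 * (1 + 2 + ... + 20) = 42000

print(short_chat, long_chat, round(long_chat / short_chat, 2))
```

Doubling the conversation from 10 to 20 turns nearly quadruples the input tokens billed, which is why trimming and caching matter more the longer a chat runs.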
Classification and extraction
A document in, a label or a small JSON object out. Moderation, routing, tagging, pulling structured fields from messy text. The output is tiny — sometimes a single token — so the usual input/output balance collapses: output is a rounding error and model choice dominates everything.
Where the money concentrates: the per-token input rate, multiplied by volume. Because output is negligible, the lever that moves the bill most is tier choice; this is the classic shape where a small model is purpose-built for the job. Watch the real metric, though: cost per accepted result, not cost per call. A cheap model that mislabels and forces human review isn't cheap. Tier that often suffices: small. These jobs are also rarely latency-sensitive, making them prime candidates for batch pricing on top.
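One way to sketch cost per accepted result, assuming every mislabelled item is caught and sent to a human reviewer at a fixed cost. The call costs, accuracies, and review cost below are hypothetical placeholders; substitute figures from your own eval:

```python
def cost_per_accepted(call_cost, accuracy, review_cost):
    """Expected spend per item that ends up correct.

    Assumes each error is detected and fixed by one human review,
    so every item is eventually accepted.
    """
    return call_cost + (1.0 - accuracy) * review_cost

REVIEW_COST = 0.50  # assumed cost of one human review, in dollars

# Hypothetical tiers: (cost per call in dollars, accuracy on your eval set).
small_tier = cost_per_accepted(0.0002, 0.92, REVIEW_COST)
mid_tier = cost_per_accepted(0.0020, 0.98, REVIEW_COST)

print(f"small: ${small_tier:.4f} per accepted result")
print(f"mid:   ${mid_tier:.4f} per accepted result")
```

In this made-up case the small tier is ten times cheaper per call but more expensive per accepted result, because review cost dominates. With a cheaper review step or a smaller accuracy gap, the ordering flips back, which is why the eval has to decide.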
Summarisation
The most input-heavy shape there is: a long document in, a short summary out. Tier choice gets multiplied by every page you feed in, so the input rate matters more here than almost anywhere else. Output is modest and well-bounded.
Where the money concentrates: the input meter, scaled by document length. The dominant lever is tier choice on input price, followed by not over-sending — strip boilerplate, headers, and repeated front-matter before the document goes in. Cheap long-context models have made routine summarisation a small-tier task; pay for mid or above only when missing a clause buried mid-document actually carries a cost. Tier that often suffices: small to mid.
Coding agents
This is the shape that breaks the pattern. An agent reads a codebase, plans, edits, runs tools, reads the results, and tries again — across many steps. Three cost drivers stack up at once: long context (the codebase and growing history), tool loops (every step is another round-trip), and often reasoning tokens (billed as output, frequently many times longer than the visible action).
Where the money concentrates: re-sent context, mostly served from cache, plus reasoning output. The levers, in order: cached input (the dominant line item — two frontier models with the same sticker can differ sharply once caching enters the maths), reasoning-effort control, and delegating sub-tasks to a cheaper tier. Tier that often suffices: frontier — for the main loop. This is the one shape where compounding per-step errors usually justify frontier prices: a model 95% as good per step is far less than 95% as good after thirty steps.
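The compounding claim is simple to check numerically. This sketch assumes each step succeeds or fails independently and that one failed step sinks the whole task, which is a simplification (real agents can sometimes recover):

```python
def relative_success(per_step_ratio, steps):
    """How a per-step quality ratio compounds over a multi-step task.

    If a cheaper model is per_step_ratio as reliable as the reference
    model at each step, it is per_step_ratio ** steps as reliable overall.
    """
    return per_step_ratio ** steps

for steps in (1, 10, 30):
    print(f"{steps:>2} steps: {relative_success(0.95, steps):.2f}")
```

Under that assumption, a model 95% as good per step ends up roughly a fifth as likely to finish a thirty-step task cleanly.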
Reading your own feature
Place the feature on two axes. First, the input/output ratio: input-bound features (RAG, summarisation, classification) live or die on input price and context discipline; output-heavy features make output rate the meter to watch. Second, how much context repeats: anything re-sending a stable prefix — chat history, a fixed system prompt, an agent's growing transcript — is a caching opportunity, often the largest single saving available.
The dominant lever usually picks itself from there: tier choice where one meter dwarfs the other, retrieval precision and context trimming where you're sending too much, and caching wherever the same tokens go out twice. These are cost decisions, not quality decisions. Make the cheaper choice, then let an eval tell you whether you can keep it.