Feature cost shapes · June 2026

What drives the cost of common features

Most LLM features fall into a handful of shapes, each concentrating cost in a different place. Knowing where tells you which lever to pull — and when a cheaper tier is good enough.

The sticker price — input and output per million tokens — is the same for every feature you build. What changes is the shape of the workload: how much you send, how much comes back, how often the same context repeats, and whether the task is hard enough to need a capable model. Two features on the same model can differ by 100× in cost per call purely because of shape.
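A minimal sketch of that arithmetic, assuming one hypothetical model priced at $1 in / $5 out per 1M tokens. The function name `call_cost` and the 80,000-token "long read" call are illustrative assumptions, not figures from the index.

```python
def call_cost(input_tokens, output_tokens, price_in, price_out):
    """Dollar cost of one call. Prices are dollars per 1M tokens."""
    return (input_tokens * price_in + output_tokens * price_out) / 1_000_000


# One hypothetical model at $1 in / $5 out, two differently shaped calls.
tagging = call_cost(800, 3, 1.00, 5.00)            # short document in, label out
long_read = call_cost(80_000, 1_000, 1.00, 5.00)   # long context in, short answer out

print(f"tagging:   ${tagging:.6f}")
print(f"long read: ${long_read:.6f}")
print(f"ratio:     {long_read / tagging:.0f}x")
```

The prices never change between the two calls; only the token counts do, and that alone puts the two calls roughly two orders of magnitude apart.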

Five common shapes. For each: a worked example, where cost concentrates, the dominant lever, and the cheaper tier that often does the job. Tier calls are a cost comparison, not a quality verdict — validate against your own eval before you ship.

Live · prices today

Output costs 4.0× input, on average

Output costs more than input across every provider. Across 280 models the multiple ranges from 0.1× to 12.2×.

Model: input / 1M tokens · output / 1M tokens · output-to-input multiple

Ling-2.6-flash: $0.01 in · $0.03 out · 3.0×

Llama 3.1 8B Instruct: $0.02 in · $0.03 out · 1.5×

Mistral Nemo: $0.02 in · $0.03 out · 1.5×

Llama 3 8B Lunaris: $0.04 in · $0.05 out · 1.3×

MythoMax 13B: $0.06 in · $0.06 out · 1.0×

Granite 4.0 Micro: $0.017 in · $0.112 out · 6.6×

Live from the index — the per-1M spread every shape below is multiplied against.

RAG / question-over-documents

You retrieve a handful of documents, send them as context, and the model answers in a paragraph. The retrieval does the hard part; the model mostly reads carefully and declines to invent. This shape is overwhelmingly input-heavy — thousands of tokens go in to get a couple of hundred back.

Cost breakdown — single RAG query

System prompt: 500 tokens

4 retrieved docs: 4,000 tokens

User question: 50 tokens

Total input: 4,550 tokens

Answer: 250 tokens (output)

Input cost: 4,550 × $1.00 / 1M = $0.00455

Output cost: 250 × $5.00 / 1M = $0.00125

Total per query: $0.0058

Input is ~78% of the bill. Assumes a mid-tier model at $1 in / $5 out; your numbers will differ — this is the shape, not a quote.

Where the money concentrates: context, by a wide margin. The single most effective lever is retrieval precision — sending three relevant documents instead of ten cuts input cost more than any model swap. If the system prompt and document set repeat across requests, cached input (typically 50–90% off) changes which model comes out cheapest. Tier that often suffices: mid. The retrieval carries the reasoning load, so a frontier model rarely earns its premium here.
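To compare the two levers side by side, here is a rough sketch built on the breakdown's own numbers. It assumes 1,000 tokens per retrieved document (4,000 tokens across 4 docs) and a 90% cache discount applied to the system prompt only; `rag_query_cost` and its parameters are hypothetical names.

```python
def rag_query_cost(n_docs, tokens_per_doc=1_000, system_tokens=500,
                   question_tokens=50, answer_tokens=250,
                   price_in=1.00, price_out=5.00,
                   cached_tokens=0, cache_discount=0.0):
    """Dollar cost of one RAG query. Prices are dollars per 1M tokens.

    cached_tokens is the part of the input billed at the discounted rate.
    """
    input_tokens = system_tokens + n_docs * tokens_per_doc + question_tokens
    fresh_tokens = input_tokens - cached_tokens
    billed_input = fresh_tokens + cached_tokens * (1 - cache_discount)
    return (billed_input * price_in + answer_tokens * price_out) / 1_000_000


print(f"10 docs:                 ${rag_query_cost(10):.5f}")
print(f"4 docs (the breakdown):  ${rag_query_cost(4):.5f}")
print(f"3 docs:                  ${rag_query_cost(3):.5f}")
print(f"4 docs, prompt cached:   ${rag_query_cost(4, cached_tokens=500, cache_discount=0.9):.5f}")
```

Under these assumptions, dropping one document saves more than caching the system prompt does. Caching starts to dominate only when the document set itself repeats across requests.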

Multi-turn chat

A conversation that accumulates. Each turn re-sends the entire history so the model has context, which means the input grows with every exchange — and you pay to reprocess the whole transcript on every single turn. A short reply is cheap; the conversation behind it is not.

Cost over a 10-turn conversation

System prompt (sent every turn): 400 tokens

Per-turn user + assistant text: ~300 tokens

Turn 1 input: ~550 tokens

Turn 10 input: ~3,100 tokens

Cumulative input (all 10 turns): ~18,000 tokens

Cumulative output: ~1,500 tokens

Conversation cost @ $1 / $5 per 1M: $0.0255

The same 10 turns with no history accumulation would cost roughly half as much (about $0.013, at ~550 input tokens per turn). History is the meter: the late turns cost several times what the early ones did.

Where the money concentrates: re-sent history. Cost grows roughly with the square of conversation length if you do nothing. The levers: cached input (the stable prefix of the transcript is highly cacheable — the single biggest win for chat), trimming or summarising old turns once they stop being relevant, and capping context length. Tier that often suffices: small to mid for general chat; reserve frontier for genuinely hard reasoning threads.
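The growth is easy to reproduce with a small simulation. This sketch assumes each turn adds a 150-token user message and a 150-token reply (one way to split the ~300 tokens per turn), so its totals land near the rounded figures in the table rather than exactly on them. `chat_costs` is a hypothetical helper.

```python
def chat_costs(turns=10, system_tokens=400, user_tokens=150, reply_tokens=150,
               price_in=1.00, price_out=5.00):
    """Simulate a chat that re-sends the full history on every turn.

    Returns (input tokens per turn, total input, total output, dollar cost).
    Prices are dollars per 1M tokens.
    """
    history = 0
    per_turn_input = []
    for _ in range(turns):
        # Each request carries the system prompt, all prior turns, and the new message.
        per_turn_input.append(system_tokens + history + user_tokens)
        history += user_tokens + reply_tokens
    total_in = sum(per_turn_input)
    total_out = turns * reply_tokens
    cost = (total_in * price_in + total_out * price_out) / 1_000_000
    return per_turn_input, total_in, total_out, cost


for n in (10, 20):
    _, total_in, total_out, cost = chat_costs(turns=n)
    print(f"{n} turns: {total_in:,} input tokens, {total_out:,} output tokens, ${cost:.4f}")
```

Doubling the conversation from 10 to 20 turns multiplies cumulative input by about 3.6, not 2, which is the near-quadratic growth described above.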

Classification and extraction

A document in, a label or a small JSON object out. Moderation, routing, tagging, pulling structured fields from messy text. The output is tiny — sometimes a single token — so the usual input/output balance collapses: output is a rounding error and model choice dominates everything.

Same classification call — small vs frontier tier

Document in: 800 tokens

Label out: 3 tokens

Small tier: 803 tokens @ ~$0.10 / 1M in = ~$0.00008

Frontier tier: 803 tokens @ ~$3.00 / 1M in = ~$0.00241

Cost difference: ~30×

Output is so small it barely registers. At 1M documents/day that's the difference between ~$80 and ~$2,400 — every day — for the same task.

Where the money concentrates: the per-token input rate, multiplied by volume. Because output is negligible, the only lever that moves the bill is tier choice — the classic shape where a small model is purpose-built for the job. Watch the real metric, though: cost per accepted result, not cost per call. A cheap model that mislabels and forces human review isn't cheap. Tier that often suffices: small. These jobs are also rarely latency-sensitive, making them prime candidates for batch pricing on top.
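One way to put a number on "cost per accepted result" is to add the expected cost of human review to each call. The sketch below uses the per-call costs from the comparison above; the acceptance rates and the $0.10 review cost are made-up values for illustration, and `effective_cost` is a hypothetical helper.

```python
def effective_cost(call_cost, accept_rate, review_cost):
    """Expected dollars per usable label.

    Every document pays for one model call; the share the model gets wrong
    (1 - accept_rate) also pays for a human review.
    """
    return call_cost + (1 - accept_rate) * review_cost


SMALL_CALL, FRONTIER_CALL = 0.00008, 0.00241   # per-call costs from the comparison above
REVIEW = 0.10                                  # hypothetical cost of one human review

for name, call, rate in [
    ("small, 97% accepted", SMALL_CALL, 0.97),
    ("small, 95% accepted", SMALL_CALL, 0.95),
    ("frontier, 99% accepted", FRONTIER_CALL, 0.99),
]:
    per_result = effective_cost(call, rate, REVIEW)
    print(f"{name}: ${per_result:.5f} per result, ${per_result * 1_000_000:,.0f} per 1M documents")
```

With these illustrative numbers the small tier still wins at 97% acceptance but loses to the frontier tier at 95%. The crossover depends entirely on your own error rates and review costs, which is why the eval comes before the tier decision.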

Summarisation

One of the most input-heavy shapes by sheer volume: a long document in, a short summary out. Tier choice gets multiplied by every page you feed in, so the input rate matters more here than almost anywhere else. Output is modest and well-bounded.

Summarise a 30-page report

Document in: 22,000 tokens

Summary out: 600 tokens

Input cost: 22,000 × $1.00 / 1M = $0.022

Output cost: 600 × $5.00 / 1M = $0.003

Total per summary: $0.025

Input is ~88% of the bill. Move to a frontier model at $3 in (output rate held at $5) and the same job costs ~$0.069; the document gets repriced page by page.

Where the money concentrates: the input meter, scaled by document length. The dominant lever is tier choice on input price, followed by not over-sending — strip boilerplate, headers, and repeated front-matter before the document goes in. Cheap long-context models have made routine summarisation a small-tier task; pay for mid or above only when missing a clause buried mid-document actually carries a cost. Tier that often suffices: small to mid.
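As one possible way to avoid over-sending, the sketch below drops lines that recur across pages, which are usually running headers and footers. It assumes the document has already been split into one string per page; `strip_repeated_lines` and the `min_repeats` threshold are hypothetical and would need tuning on real documents.

```python
from collections import Counter


def strip_repeated_lines(pages, min_repeats=3):
    """Drop lines that appear on min_repeats or more pages (likely headers/footers)."""
    page_counts = Counter()
    for page in pages:
        # Count each distinct non-empty line once per page.
        page_counts.update({line.strip() for line in page.splitlines() if line.strip()})

    kept = []
    for page in pages:
        for line in page.splitlines():
            if line.strip() and page_counts[line.strip()] >= min_repeats:
                continue  # repeated boilerplate: skip it
            kept.append(line)
    return "\n".join(kept)


pages = [
    "ACME Confidential\nIntro text",
    "ACME Confidential\nBody text",
    "ACME Confidential\nEnd text",
]
print(strip_repeated_lines(pages))
```

Every line removed this way comes straight off the input meter, at whichever tier you end up using.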

Coding agents

This is the shape that breaks the pattern. An agent reads a codebase, plans, edits, runs tools, reads the results, and tries again — across many steps. Three cost drivers stack up at once: long context (the codebase and growing history), tool loops (every step is another round-trip), and often reasoning tokens (billed as output, frequently many times longer than the visible action).

One agent task — 15 tool-loop steps

Context at step 1: ~8,000 tokens

Context at step 15 (accumulated): ~40,000 tokens

Cumulative input across steps: ~360,000 tokens

Output + reasoning across steps: ~30,000 tokens

Cost per task (frontier, cached): ~$0.30–$1.50

The same context is re-sent on every step, so most of the input is cache reads — cached-input price matters more here than the headline input rate. Without caching, the same task can cost several times as much.
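A rough sketch of how caching changes this task. It assumes context grows linearly from step 1 to step 15, a frontier model at $3 in / $15 out, cache reads at $0.30 per 1M, and 90% of input served from cache. Apart from the $3 input rate and the token counts, these values are illustrative assumptions, as is the helper name `agent_task_cost`.

```python
def agent_task_cost(steps=15, start_ctx=8_000, end_ctx=40_000, output_tokens=30_000,
                    price_in=3.00, price_out=15.00, cached_price_in=0.30,
                    cache_hit=0.9):
    """Return (cumulative input tokens, input cost, total cost) for one agent task.

    Prices are dollars per 1M tokens; cache_hit is the share of input read from cache.
    """
    if steps < 2:
        raise ValueError("steps must be at least 2")
    growth = (end_ctx - start_ctx) / (steps - 1)          # tokens added per step
    total_in = sum(start_ctx + growth * i for i in range(steps))
    cached = total_in * cache_hit
    fresh = total_in - cached
    input_cost = (fresh * price_in + cached * cached_price_in) / 1_000_000
    total_cost = input_cost + output_tokens * price_out / 1_000_000
    return total_in, input_cost, total_cost


for label, hit in [("cached", 0.9), ("uncached", 0.0)]:
    total_in, input_cost, total_cost = agent_task_cost(cache_hit=hit)
    print(f"{label}: {total_in:,.0f} input tokens, input ${input_cost:.2f}, total ${total_cost:.2f}")
```

Under these assumptions the input line falls by roughly five times with caching, while the whole task falls by a bit more than half because reasoning output is billed either way. The heavier the input relative to output, the closer the overall saving gets to the input-line saving.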

Where the money concentrates: re-sent context, mostly served from cache, plus reasoning output. The levers, in order: cached input (the dominant line item — two frontier models with the same sticker can differ sharply once caching enters the maths), reasoning-effort control, and delegating sub-tasks to a cheaper tier. Tier that often suffices: frontier — for the main loop. This is the one shape where compounding per-step errors usually justify frontier prices: a model 95% as good per step is far less than 95% as good after thirty steps.
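The compounding claim can be illustrated with a deliberately simple model: assume every step succeeds independently with the same probability and any failed step sinks the task. Real agents can recover from errors, so treat this as a pessimistic sketch; `task_success` is a hypothetical helper.

```python
def task_success(per_step, steps):
    """Chance that every one of `steps` independent steps succeeds."""
    return per_step ** steps


for per_step in (0.99, 0.95):
    for steps in (15, 30):
        print(f"{per_step:.0%} per step over {steps} steps: "
              f"{task_success(per_step, steps):.0%} of tasks finish cleanly")
```

In this model a small per-step gap turns into a large gap in finished tasks, and the gap widens with every additional step.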

Reading your own feature

Place the feature on two axes. First, the input/output ratio: input-bound features (RAG, summarisation, classification) live or die on input price and context discipline; output-heavy features make output rate the meter to watch. Second, how much context repeats: anything re-sending a stable prefix — chat history, a fixed system prompt, an agent's growing transcript — is a caching opportunity, often the largest single saving available.

The dominant lever usually picks itself from there: tier choice where one meter dwarfs the other, retrieval precision and context trimming where you're sending too much, and caching wherever the same tokens go out twice. These are cost decisions, not quality decisions. Make the cheaper choice, then let an eval tell you whether you can keep it.
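The two axes can be turned into a first-pass triage. The sketch below is only a heuristic: the 3× dominance ratio and the 50% repeat threshold are arbitrary assumptions chosen for illustration, and `suggest_levers` is a hypothetical helper, not a rule from the index.

```python
def suggest_levers(input_tokens, output_tokens, repeated_input_tokens,
                   price_in=1.00, price_out=5.00):
    """Suggest cost levers from a feature's token shape. Thresholds are illustrative."""
    if input_tokens <= 0:
        raise ValueError("input_tokens must be positive")
    input_cost = input_tokens * price_in
    output_cost = output_tokens * price_out

    levers = []
    # Axis 2: how much of the input is a stable prefix that repeats between calls.
    if repeated_input_tokens / input_tokens >= 0.5:
        levers.append("cache the repeated prefix")
    # Axis 1: which meter dwarfs the other.
    if input_cost >= 3 * output_cost:
        levers.append("tier choice on input price, plus context trimming")
    elif output_cost >= 3 * input_cost:
        levers.append("output price and output length")
    else:
        levers.append("both meters matter; compare models on blended cost")
    return levers


print("RAG query:", suggest_levers(4_550, 250, 500))
print("10-turn chat:", suggest_levers(19_000, 1_500, 15_000))
```

The output is a starting list of things to try, in the same spirit as the tier calls above: a cost comparison to validate against an eval, not a verdict.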
