Cost-cutting strategies and what they save
Six concrete levers, each with a rough saving and its trade-off. None changes what your feature does.
Cutting an LLM bill is rarely about finding a cheaper model. It's about matching each call to the cheapest way to serve it: caching what repeats, batching what can wait, routing easy work to small models, and not paying for output tokens you don't need. These savings stack: a workload applying three or four of them often runs at a fraction of its naive cost.
All figures state their assumptions. Every cross-tier swap is a cost comparison, not a quality verdict — validate the cheaper path against your own eval before you rely on it.
Output costs 4.0× input, on average
Output usually costs more than input, but not always. Across 280 models the multiple ranges from 0.1× to 12.2×, so a few models price output below input.
Live from the index — output is the pricey meter these levers target.
Prompt caching: 50–90% off cached input
Mechanism: if you send the same prefix on every request — a system prompt, a fixed instruction block, a set of retrieved documents, a few-shot example set — the provider can serve those tokens from a KV cache instead of reprocessing them. You pay a steep discount on the repeated portion.
Rough saving: 50–90% off the cached portion of input (the exact discount and mechanism vary by provider — some cache prefixes automatically, some require you to mark content explicitly). Trade-off: only the stable prefix benefits, so a system prompt or document set that changes per request is a cache miss; caches also expire after a short idle window. When it applies: anything with a repeating prefix — chat, RAG, agents, few-shot prompts.
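As a rough illustration of the arithmetic, the sketch below estimates the input cost of one request when a stable prefix is cached. The function name, price, discount, and hit rate are hypothetical placeholders, not any provider's actual rates, and the model ignores any surcharge a provider may apply when writing to the cache.

```python
def cached_input_cost(prefix_tokens, suffix_tokens, price_per_mtok,
                      cache_discount, hit_rate):
    """Expected input cost in dollars for one request with prefix caching.

    All parameters are illustrative assumptions:
      price_per_mtok  dollars per million input tokens
      cache_discount  fraction taken off cached tokens (e.g. 0.9 for 90% off)
      hit_rate        fraction of requests that find the prefix in cache
    """
    per_token = price_per_mtok / 1_000_000
    # Only the stable prefix is discounted, and only on cache hits.
    prefix_cost = prefix_tokens * per_token * (1 - cache_discount * hit_rate)
    # The per-request suffix is always billed at the full rate.
    suffix_cost = suffix_tokens * per_token
    return prefix_cost + suffix_cost


baseline = cached_input_cost(8000, 500, 3.0, cache_discount=0.0, hit_rate=0.0)
with_cache = cached_input_cost(8000, 500, 3.0, cache_discount=0.9, hit_rate=0.8)
saving = 1 - with_cache / baseline
```

With these made-up numbers the input cost falls from about $0.0255 to about $0.0082 per request, roughly 68% lower. The saving shrinks quickly as the hit rate drops, which is why the idle-expiry window matters.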
Batch processing: ~50% off
Mechanism: submit work as a non-real-time job and the provider runs it when capacity is free, usually within a few hours. In exchange you get roughly half off both input and output. The discount applies to the whole call, not just a prefix, so it composes with everything else.
Rough saving: ~50% off standard prices at most providers. Trade-off: latency. Results arrive in minutes to hours, not seconds, with no firm SLA. When it applies: anything not in a user's live path, such as overnight extraction, bulk summarisation, evals, and embedding backfills. If nobody is waiting on the response, batch should be the default.
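One way to apply the "is anyone waiting?" rule is to tag each job and split the queue before dispatch. This is a minimal sketch: the `Job` type, the `user_waiting` flag, and the 50% default discount are assumptions for illustration, and the actual submission to a provider's batch endpoint is left out.

```python
from dataclasses import dataclass


@dataclass
class Job:
    job_id: str
    user_waiting: bool  # True if a person is blocked on the response


def split_by_urgency(jobs):
    """Return (live, deferred); deferred jobs are candidates for batch."""
    live = [j for j in jobs if j.user_waiting]
    deferred = [j for j in jobs if not j.user_waiting]
    return live, deferred


def blended_cost(live_cost, deferred_cost, batch_discount=0.5):
    """Total spend when deferred work gets the (assumed) batch discount."""
    return live_cost + deferred_cost * (1 - batch_discount)


jobs = [
    Job("chat-1", user_waiting=True),
    Job("nightly-extract", user_waiting=False),
    Job("eval-run", user_waiting=False),
]
live, deferred = split_by_urgency(jobs)
total = blended_cost(live_cost=40.0, deferred_cost=60.0)
```

Here a hypothetical $100 workload with $60 of deferrable work comes to $70 once the deferred share is batched.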
Model routing and tiering: often 10×+ on routed calls
Mechanism: not every call needs your best model. Send easy, high-volume work to a cheaper tier and reserve the frontier model for the calls that genuinely need it — either by classifying difficulty up front, or by trying a cheap model first and escalating only on failure.
Rough saving: proportional to how much traffic you can safely move down. This is frequently the single largest lever, because the price spread between tiers can exceed 10× on both input and output. Trade-off: a routing decision can be wrong, and a too-cheap route shows up as quality regressions. When it applies: any mixed workload where some calls are clearly easier than others, which is most of them. Let an eval, not a guess, set the routing threshold.
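The "try cheap first, escalate on failure" pattern can be sketched as a small cascade. The model callables and the acceptance check below are stand-in stubs rather than real API calls, and the cost units are arbitrary; in practice the check would come from your eval.

```python
from typing import Callable, Tuple


def cascade(prompt: str,
            cheap_model: Callable[[str], str],
            strong_model: Callable[[str], str],
            is_acceptable: Callable[[str], bool]) -> Tuple[str, str]:
    """Try the cheap model first; escalate only if its answer fails the check."""
    answer = cheap_model(prompt)
    if is_acceptable(answer):
        return answer, "cheap"
    return strong_model(prompt), "strong"


def expected_cost(cheap_cost: float, strong_cost: float,
                  escalation_rate: float) -> float:
    # Every call pays for the cheap attempt; escalated calls pay for both.
    return cheap_cost + escalation_rate * strong_cost


# Stand-in stubs (hypothetical): the cheap model gives up on "hard" prompts.
def cheap_stub(prompt: str) -> str:
    return "" if "hard" in prompt else "label: spam"


def strong_stub(prompt: str) -> str:
    return "label: ham"


def looks_valid(answer: str) -> bool:
    return answer.startswith("label:")


easy = cascade("easy email", cheap_stub, strong_stub, looks_valid)
hard = cascade("hard email", cheap_stub, strong_stub, looks_valid)
```

If the strong model costs 10 units per call, the cheap one 1 unit, and 20% of calls escalate, `expected_cost(1.0, 10.0, 0.2)` gives 3.0 units against 10 for sending everything to the strong model. Note that a cascade pays twice on escalated calls, so it only wins when the escalation rate is low.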
Shorter, structured output: output is the pricey meter
Mechanism: output typically costs 2–5× input, and every token is generated sequentially, so it is the meter most worth shrinking. Ask for exactly what you need — a label, a JSON object, a short answer — instead of a discursive essay. Constrained or structured output formats keep responses tight and parseable at the same time.
Rough saving: scales directly with how much output you remove — often the difference between a cheap call and an expensive one. Trade-off: over-constraining can clip answers that genuinely needed room. When it applies: almost everywhere, but especially classification, extraction, and routing, where a long answer was never the point.
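For a classification call, tight output can be as simple as asking for a one-field JSON object and validating it on the way back. The label set, prompt wording, and price below are hypothetical; this is a sketch of the idea, not a specific provider's structured-output feature.

```python
import json

ALLOWED_LABELS = {"spam", "ham"}  # hypothetical label set

# Instruction appended to the prompt to keep the response short and parseable.
PROMPT_SUFFIX = 'Reply with only a JSON object like {"label": "spam"}. No explanation.'


def parse_label(raw: str) -> str:
    """Parse a constrained response and reject anything outside the label set."""
    data = json.loads(raw)
    label = data["label"]
    if label not in ALLOWED_LABELS:
        raise ValueError(f"unexpected label: {label!r}")
    return label


def output_cost(tokens: int, out_price_per_mtok: float) -> float:
    """Output-side cost in dollars at an assumed price per million tokens."""
    return tokens * out_price_per_mtok / 1_000_000


label = parse_label('{"label": "spam"}')
essay_cost = output_cost(400, 10.0)  # a 400-token explanation
json_cost = output_cost(8, 10.0)     # an 8-token JSON object (rough guess)
```

At the assumed price, the 400-token answer costs 50 times what the 8-token one does, and the validated label is easier to consume downstream.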
Trimming context and tightening retrieval: cut input directly
Mechanism: for input-heavy features, the fastest cut is sending less. Tighten retrieval so you pass three relevant documents instead of ten; strip boilerplate, headers, and duplicated front-matter; summarise or drop stale conversation turns. Every token you don't send is a token you don't pay for — on every request, forever.
Rough saving: proportional to the context you remove — frequently 30–70% of input on bloated prompts. Trade-off: trim too aggressively and you drop the document that held the answer, raising failure and retry rates. The right target is precision, not simply less. When it applies: every input-heavy shape — RAG, summarisation, long-context chat.
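A minimal sketch of tightening retrieval: keep only the highest-scoring documents that clear a relevance floor and fit a token budget. The function names, the score threshold, and the four-characters-per-token estimate are all assumptions; a real pipeline would use the model's own tokenizer and a threshold set by an eval.

```python
def estimate_tokens(text: str) -> int:
    # Rough heuristic (assumption): about 4 characters per token for English.
    return max(1, len(text) // 4)


def select_context(scored_docs, max_docs=3, token_budget=2000, min_score=0.0):
    """Pick documents by descending score within count and token limits.

    scored_docs is a list of (score, text) pairs from a retriever.
    Returns (chosen_texts, tokens_used).
    """
    chosen, used = [], 0
    for score, text in sorted(scored_docs, key=lambda d: d[0], reverse=True):
        if score < min_score or len(chosen) == max_docs:
            break  # everything after this point is less relevant or over the cap
        cost = estimate_tokens(text)
        if used + cost > token_budget:
            continue  # too big to fit; a smaller, lower-ranked doc may still fit
        chosen.append(text)
        used += cost
    return chosen, used


docs = [(0.9, "a" * 400), (0.2, "b" * 400), (0.8, "c" * 4000), (0.7, "d" * 800)]
chosen, used = select_context(docs, max_docs=3, token_budget=400, min_score=0.5)
```

In this toy run the oversized second-ranked document is skipped, the low-scoring one is dropped, and two documents totalling an estimated 300 tokens are sent. The relevance floor is the guard against trimming by size alone.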
Reasoning-effort control: thinking is billed as output
Mechanism: reasoning models generate an internal chain of thought before answering. Those thinking tokens are billed at the output rate, and the chain of thought is often many times longer than the visible response. Most providers let you dial reasoning effort up or down (or off). Turning it down on tasks that don't need deliberation removes a large, invisible block of output tokens.
Rough saving: can be the largest of any lever on reasoning models, because thinking tokens often outnumber the visible answer by 10–20×. Trade-off: hard tasks genuinely benefit from reasoning, and turning it down too far costs accuracy on exactly the problems you chose a reasoning model for. The decision is per-task. When it applies: any reasoning-capable model on a mixed workload.
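To make the hidden cost concrete, the sketch below bills thinking tokens alongside visible ones and pairs that with a per-task effort policy. The token counts, the price, the task names, and the effort levels are illustrative assumptions; how effort is actually set differs by provider.

```python
def reasoning_call_cost(visible_tokens: int, thinking_tokens: int,
                        out_price_per_mtok: float) -> float:
    """Output-side cost in dollars; thinking is billed at the output rate."""
    return (visible_tokens + thinking_tokens) * out_price_per_mtok / 1_000_000


# Hypothetical per-task policy: deliberate only where it pays off.
EFFORT_BY_TASK = {"classify": "low", "extract": "low", "proof": "high"}


def effort_for(task: str) -> str:
    """Look up the effort level for a task, defaulting to a middle setting."""
    return EFFORT_BY_TASK.get(task, "medium")


# A 200-token answer with 15x as many thinking tokens, versus light thinking.
high_effort = reasoning_call_cost(200, 3000, out_price_per_mtok=10.0)
low_effort = reasoning_call_cost(200, 200, out_price_per_mtok=10.0)
```

With these numbers the same visible answer costs about $0.032 at high effort and about $0.004 at low effort, an 8× difference that never appears in the response itself.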
Stacking the levers
These savings multiply. Route the easy 80% to a small model. Cache the stable prefix on what's left. Batch anything no user is waiting on. Trim retrieval. Ask for tight output. Applied together, a workload can land at a small fraction of its naive cost without changing what it delivers. Priority order: routing and caching usually move the most; batch is free wherever latency allows; output discipline and context trimming tighten what's left.
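A worked example shows how the levers compound. Everything below is hypothetical: the two price tiers sit 10× apart, 80% of calls are routed down, the cached prefix always hits at 90% off, and trimming halves the per-request context and cuts output from 800 to 300 tokens. Batch is left at zero here, as if every call were interactive.

```python
from dataclasses import dataclass


@dataclass
class Tier:
    in_price: float   # dollars per million input tokens (hypothetical)
    out_price: float  # dollars per million output tokens (hypothetical)


def request_cost(tier, prefix, suffix, output,
                 cache_discount=0.0, batch_discount=0.0):
    """Cost of one request in dollars under the given levers."""
    # Cached prefix tokens are discounted; the suffix is billed in full.
    in_cost = (prefix * (1 - cache_discount) + suffix) * tier.in_price
    out_cost = output * tier.out_price
    # A batch discount, if any, applies to the whole call.
    return (in_cost + out_cost) / 1_000_000 * (1 - batch_discount)


frontier = Tier(in_price=3.0, out_price=12.0)
small = Tier(in_price=0.3, out_price=1.2)

# Naive: everything on the frontier tier, no caching, long context and output.
naive = request_cost(frontier, prefix=6000, suffix=2000, output=800)

# Stacked: cache the prefix, trim context and output, route 80% to the small tier.
routed = request_cost(small, 6000, 1000, 300, cache_discount=0.9)
kept = request_cost(frontier, 6000, 1000, 300, cache_discount=0.9)
stacked = 0.8 * routed + 0.2 * kept

ratio = stacked / naive
```

Under these assumptions the average request drops from about $0.0336 to about $0.00235, or 7% of the naive cost. Whether the routed 80% still meets the quality bar is outside what this arithmetic can show.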
None of these levers is meant to trade quality for cost; they trade convenience for cost. Whether the cheaper path still does the job is the one thing only your eval can tell you. Estimate the saving first, then verify.