Learn › Cost-cutting · June 2026

Cost-cutting strategies and what they save

Six concrete levers, each with a rough saving and its trade-off. None changes what your feature does.

Cutting an LLM bill is rarely about finding a cheaper model. It's about matching each call to the cheapest way to serve it: caching what repeats, batching what can wait, routing easy work to small models, not paying for output tokens you didn't need. These savings stack — a workload applying three or four of them often runs at a fraction of its naive cost.

All figures state their assumptions. Every cross-tier swap is a cost comparison, not a quality verdict — validate the cheaper path against your own eval before you rely on it.

Live · prices today

Output costs 4.0× input, on average

Output usually costs more than input, though not for every model: across 280 models the multiple ranges from 0.1× to 12.2×.


Live from the index — output is the pricey meter these levers target.


Prompt caching: 50–90% off cached input

Mechanism: if you send the same prefix on every request — a system prompt, a fixed instruction block, a set of retrieved documents, a few-shot example set — the provider can serve those tokens from a KV cache instead of reprocessing them. You pay a steep discount on the repeated portion.

RAG query: without vs with caching
Stable prefix (sys + docs): 3,500 tokens
Fresh per-request input: 50 tokens
Output: 250 tokens
No cache: (3,550 × $3 + 250 × $15) / 1M = $0.01440
Cached: (3,500 × $3 at 90% off + 50 × $3 + 250 × $15) / 1M = $0.00495
Saving on this call: ~66%
The more of your input is a stable, repeating prefix, the larger the win. Agent loops re-sending a long context save the most.

Rough saving: 50–90% off the cached portion of input (the exact discount and mechanism vary by provider — some cache prefixes automatically, some require you to mark content explicitly). Trade-off: only the stable prefix benefits, so a system prompt or document set that changes per request is a cache miss; caches also expire after a short idle window. When it applies: anything with a repeating prefix — chat, RAG, agents, few-shot prompts.
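The caching arithmetic can be sketched as a small function. This is a minimal illustration, not any provider's billing formula: `call_cost` and its parameters are hypothetical names, every call is assumed to be a cache hit, and any surcharge a provider may apply for writing to the cache is ignored. The token counts and prices are the ones from the RAG example.

```python
def call_cost(prefix_tokens, fresh_tokens, output_tokens,
              input_price, output_price, cache_discount=0.0):
    """Dollar cost of one call. Prices are dollars per 1M tokens.

    cache_discount is the fraction taken off the stable prefix
    (0.0 = no caching, 0.9 = 90% off). Assumes a cache hit.
    """
    prefix = prefix_tokens * input_price * (1.0 - cache_discount)
    fresh = fresh_tokens * input_price
    output = output_tokens * output_price
    return (prefix + fresh + output) / 1_000_000


no_cache = call_cost(3_500, 50, 250, input_price=3.0, output_price=15.0)
cached = call_cost(3_500, 50, 250, input_price=3.0, output_price=15.0,
                   cache_discount=0.90)
saving = 1.0 - cached / no_cache

print(f"no cache: ${no_cache:.5f}")
print(f"cached:   ${cached:.5f}")
print(f"saving:   {saving:.0%}")
```

Raising `prefix_tokens` relative to the other two inputs shows how the saving grows as more of the request becomes a stable prefix.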


Batch processing: ~50% off

Mechanism: submit work as a non-real-time job and the provider runs it when capacity is free, usually within a few hours. In exchange you get roughly half off both input and output. The discount applies to the whole call, not just a prefix, so it composes with everything else.

10,000 documents: real-time vs batch
Per-doc cost (real-time): $0.0025
10,000 docs, real-time: $25.00
10,000 docs, batch @ 50% off: $12.50
Saving: $12.50 (50%)
Free money on any job a user isn't waiting on. The only cost is wall-clock latency.

Rough saving: ~50% off standard prices at most providers. Trade-off: latency — results arrive in minutes to hours, not seconds, with no firm SLA. When it applies: anything not in a user's live path — overnight extraction, bulk summarisation, evals, embeddings backfills. If nobody is waiting on the response, batch should be the default.
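The "batch by default" rule can be sketched as a one-line decision per job. This is an illustration under stated assumptions: `Job`, `plan`, and `BATCH_DISCOUNT` are hypothetical names, and the 50% discount is the rough figure from the text rather than any specific provider's price.

```python
from dataclasses import dataclass

BATCH_DISCOUNT = 0.50  # assumption: roughly half off standard prices


@dataclass
class Job:
    name: str
    calls: int
    cost_per_call: float  # dollars at standard real-time prices
    user_waiting: bool    # True if someone is blocked on the result


def plan(job):
    """Return (lane, total_cost): batch unless a user is waiting."""
    full_price = job.calls * job.cost_per_call
    if job.user_waiting:
        return "realtime", full_price
    return "batch", full_price * (1.0 - BATCH_DISCOUNT)


jobs = [
    Job("overnight extraction", 10_000, 0.0025, user_waiting=False),
    Job("chat reply", 1, 0.0025, user_waiting=True),
]
for job in jobs:
    lane, cost = plan(job)
    print(f"{job.name}: {lane}, ${cost:.4f}")
```

The only input that decides the lane is `user_waiting`, which mirrors the trade-off: the discount is paid for in latency and nothing else.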


Model routing and tiering: often 10×+ on routed calls

Mechanism: not every call needs your best model. Send easy, high-volume work to a cheaper tier and reserve the frontier model for the calls that genuinely need it — either by classifying difficulty up front, or by trying a cheap model first and escalating only on failure.

100k calls/day: all-frontier vs routed
All frontier: 100k × $0.012 = $1,200 / day
80% routed to small: 80k × $0.0006 = $48 / day
20% kept on frontier: 20k × $0.012 = $240 / day
Routed total: $288 / day
A 76% cut, just by sending the easy 80% to a model that handles them fine. The saving tracks how much of your traffic is genuinely easy.

Rough saving: proportional to how much traffic you can safely move down. This is frequently the single largest lever, because the price spread between tiers can exceed 10× on both input and output. Trade-off: a routing decision can be wrong, and a too-cheap route shows up as quality regressions. When it applies: any mixed workload where some calls are clearly easier than others, which is most of them. Let an eval, not a guess, set the routing threshold.
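The two routing styles described in the mechanism (classify up front, or try cheap and escalate) have slightly different cost shapes. The sketch below is illustrative only: `cascade`, `routed_cost`, `cascade_cost`, and the stub models are hypothetical, and the acceptance check stands in for whatever eval-derived test you would actually use.

```python
def cascade(prompt, cheap_model, strong_model, accept):
    """Try the cheap model first; escalate only if its answer is rejected."""
    answer = cheap_model(prompt)
    if accept(answer):
        return answer, "small"
    return strong_model(prompt), "frontier"


def routed_cost(calls, small_share, small_cost, frontier_cost):
    """Up-front routing: each call pays for exactly one model."""
    return calls * (small_share * small_cost
                    + (1.0 - small_share) * frontier_cost)


def cascade_cost(calls, pass_rate, small_cost, frontier_cost):
    """Cascade: every call pays the small model; failures also pay frontier."""
    return calls * (small_cost + (1.0 - pass_rate) * frontier_cost)


# Stub models for illustration only; real ones would call an API.
cheap = lambda p: "" if "hard" in p else "short answer"
strong = lambda p: "careful answer"
accept = lambda a: bool(a)  # placeholder acceptance check

print(cascade("easy question", cheap, strong, accept))
print(cascade("hard question", cheap, strong, accept))
print(round(routed_cost(100_000, 0.80, 0.0006, 0.012), 2))
print(round(cascade_cost(100_000, 0.80, 0.0006, 0.012), 2))
```

With the per-call costs from the example, the cascade comes out a little more expensive than up-front routing, because escalated calls pay for both models. In exchange it needs no separate difficulty classifier.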


Shorter, structured output: output is the pricey meter

Mechanism: output typically costs 2–5× input, and every token is generated sequentially, so it is the meter most worth shrinking. Ask for exactly what you need — a label, a JSON object, a short answer — instead of a discursive essay. Constrained or structured output formats keep responses tight and parseable at the same time.

Same task: verbose vs terse output
Input (same both ways): 1,000 tokens
Verbose answer: 800 tokens out
Terse / structured answer: 120 tokens out
Verbose: (1,000 × $1 + 800 × $5) / 1M = $0.00500
Terse: (1,000 × $1 + 120 × $5) / 1M = $0.00160
Saving: ~68%
Output dominates the bill here even though it's a fraction of the token count. Trimming it moves the needle far more than trimming input.

Rough saving: scales directly with how much output you remove — often the difference between a cheap call and an expensive one. Trade-off: over-constraining can clip answers that genuinely needed room. When it applies: almost everywhere, but especially classification, extraction, and routing, where a long answer was never the point.
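One way to hold output to "a label, a JSON object" is to ask for a tiny JSON reply and reject anything else. The sketch below assumes a hypothetical three-label classifier; `ALLOWED_LABELS`, `FORMAT_INSTRUCTION`, and `parse_label` are made-up names, and the instruction text is only an example of a terse format request, not a provider feature.

```python
import json

ALLOWED_LABELS = {"billing", "bug", "other"}  # hypothetical label set

# Hypothetical instruction appended to the prompt to keep output terse.
FORMAT_INSTRUCTION = (
    'Reply with only a JSON object of the form {"label": "<one label>"}, '
    "where the label is one of: billing, bug, other."
)


def parse_label(raw):
    """Return the label if the reply is the terse JSON we asked for, else None."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(obj, dict):
        return None
    label = obj.get("label")
    if isinstance(label, str) and label in ALLOWED_LABELS:
        return label
    return None


print(parse_label('{"label": "bug"}'))
print(parse_label("Sure! I think this is a bug because..."))
```

A `None` result is the signal to retry or fall back, which is where the over-constraining trade-off shows up in practice.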


Trimming context and tightening retrieval: cut input directly

Mechanism: for input-heavy features, the fastest cut is sending less. Tighten retrieval so you pass three relevant documents instead of ten; strip boilerplate, headers, and duplicated front-matter; summarise or drop stale conversation turns. Every token you don't send is a token you don't pay for — on every request, forever.

RAG retrieval: loose vs precise
Loose (top-10 docs): 9,000 tokens in
Precise (top-3 docs): 2,700 tokens in
Loose input cost: 9,000 × $1 / 1M = $0.00900
Precise input cost: 2,700 × $1 / 1M = $0.00270
Saving on input: ~70%
Better retrieval often improves answer quality too — less irrelevant context means less to get lost in. One of the rare levers that cuts cost and helps accuracy at once.

Rough saving: proportional to the context you remove — frequently 30–70% of input on bloated prompts. Trade-off: trim too aggressively and you drop the document that held the answer, raising failure and retry rates. The right target is precision, not simply less. When it applies: every input-heavy shape — RAG, summarisation, long-context chat.
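Tightening retrieval usually comes down to a few caps applied after ranking. The sketch below is one possible shape, assuming each retrieved document already carries a relevance score and a token count; `select_context` and its default thresholds are hypothetical and would need tuning against an eval.

```python
def select_context(docs, max_docs=3, token_budget=3_000, min_score=0.5):
    """Pick the highest-scoring docs that fit the caps.

    docs is a list of (score, token_count, text) tuples.
    Returns (chosen_texts, tokens_used).
    """
    ranked = sorted(docs, key=lambda d: d[0], reverse=True)
    chosen, used = [], 0
    for score, tokens, text in ranked:
        if len(chosen) >= max_docs or score < min_score:
            break  # ranked order: nothing later is more relevant
        if used + tokens > token_budget:
            continue  # too large for what is left; try a smaller doc
        chosen.append(text)
        used += tokens
    return chosen, used


# Ten retrieved docs of 900 tokens each, scores from 0.95 downward.
docs = [(0.95 - 0.05 * i, 900, f"doc-{i}") for i in range(10)]
texts, used = select_context(docs)
print(texts, used)
```

The `min_score` floor is the part that targets precision rather than simply less: it drops weak matches even when there is budget left for them.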


Reasoning-effort control: thinking is billed as output

Mechanism: reasoning models generate an internal chain of thought before answering. Those thinking tokens are billed at the output rate, and the chain of thought is often many times longer than the visible response. Most providers let you dial reasoning effort up or down (or off). Turning it down on tasks that don't need deliberation removes a large, invisible block of output tokens.

Same answer: high vs low reasoning effort
Visible answer (both): 200 tokens
High effort, thinking: 3,800 tokens
Low effort, thinking: 400 tokens
High: 4,000 output × $15 / 1M = $0.06000
Low: 600 output × $15 / 1M = $0.00900
Saving: ~85%
The visible answer is identical; the hidden reasoning is where the money went. On easy tasks, most of that thinking was unnecessary.

Rough saving: can be the largest of any lever on reasoning models, because thinking tokens often outnumber the visible answer by 10–20×. Trade-off: hard tasks genuinely benefit from reasoning, and turning it down too far costs accuracy on exactly the problems you chose a reasoning model for. The decision is per-task. When it applies: any reasoning-capable model on a mixed workload.
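Because the decision is per-task, it can help to keep the effort setting in a small lookup next to the cost estimate. The sketch below is illustrative: `output_cost`, `EFFORT_BY_TASK`, and `effort_for` are hypothetical, and the effort names are placeholders rather than any provider's actual parameter values.

```python
def output_cost(visible_tokens, thinking_tokens, output_price):
    """Dollar cost of output; thinking tokens bill at the output rate."""
    return (visible_tokens + thinking_tokens) * output_price / 1_000_000


# Hypothetical per-task policy; task names and effort levels are made up.
EFFORT_BY_TASK = {
    "classification": "low",
    "extraction": "low",
    "multi_step_planning": "high",
}


def effort_for(task_type, default="high"):
    """Look up reasoning effort per task, defaulting to the cautious setting."""
    return EFFORT_BY_TASK.get(task_type, default)


high = output_cost(200, 3_800, output_price=15.0)
low = output_cost(200, 400, output_price=15.0)
print(f"high: ${high:.5f}  low: ${low:.5f}  saving: {1 - low / high:.0%}")
print(effort_for("classification"), effort_for("unknown_task"))
```

Defaulting unknown task types to high effort is a deliberate choice here: it costs more, but it avoids silently under-thinking a task nobody has evaluated yet.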


Stacking the levers

These savings multiply. Route the easy 80% to a small model. Cache the stable prefix on what's left. Batch anything no user is waiting on. Trim retrieval. Ask for tight output. Applied together, a workload can land at a small fraction of its naive cost without changing what it delivers. Priority order: routing and caching usually move the most; batch is free wherever latency allows; output discipline and context trimming tighten what's left.
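A rough way to see how savings multiply is to treat each lever as a fraction taken off whatever bill remains. This is a deliberately simplified sketch: `remaining_fraction` is a hypothetical helper, the caching and output figures are made up for illustration, and real levers act on different parts of the bill (input, output, or only some calls), so the true combined figure needs a per-call model.

```python
import math


def remaining_fraction(savings):
    """Fraction of the naive bill left after applying each saving in turn.

    Assumes each saving is a fraction of whatever bill remains at that
    point, which is a simplification of how the levers really interact.
    """
    return math.prod(1.0 - s for s in savings)


# Routing figure from the routed-calls example; the other two are made up.
levers = {"routing": 0.76, "caching": 0.30, "tight output": 0.20}
left = remaining_fraction(levers.values())
print(f"remaining: {left:.1%} of naive cost")
```

Under these assumptions the order of the levers does not change the result, which is why the priority order is about effort and risk rather than arithmetic.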

None of these trade quality for cost — they trade convenience for cost. Whether the cheaper path still does the job is the one thing only your eval can tell you. Estimate the saving first, then verify.