June 2026

Cost-cutting strategies and what they save

Six concrete levers, each with a rough saving and its trade-off. None changes what your feature does.

Cutting an LLM bill is rarely about finding a cheaper model. It's about matching each call to the cheapest way to serve it: caching what repeats, batching what can wait, routing easy work to small models, not paying for output tokens you didn't need. These savings stack — a workload applying three or four of them often runs at a fraction of its naive cost.

All figures state their assumptions. Every cross-tier swap is a cost comparison, not a quality verdict — validate the cheaper path against your own eval before you rely on it.

Live · prices today

Output costs 4.0× input, on average

Output usually costs more than input, though not always: across 280 models the multiple ranges from 0.1× to 12.2×.

Prices per 1M tokens, with the output-to-input multiple:
Ling-2.6-flash: $0.01 in · $0.03 out · 3.0×
Llama 3.1 8B Instruct: $0.02 in · $0.03 out · 1.5×
Mistral Nemo: $0.02 in · $0.03 out · 1.5×
Llama 3 8B Lunaris: $0.04 in · $0.05 out · 1.3×
MythoMax 13B: $0.06 in · $0.06 out · 1.0×
Granite 4.0 Micro: $0.017 in · $0.112 out · 6.6×

Live from the index — output is the pricey meter these levers target.

Prompt caching · 50–90% off cached input

Mechanism: if you send the same prefix on every request — a system prompt, a fixed instruction block, a set of retrieved documents, a few-shot example set — the provider can serve those tokens from a KV cache instead of reprocessing them. You pay a steep discount on the repeated portion.

RAG query — without vs with caching

Stable prefix (sys + docs): 3,500 tokens
Fresh per-request input: 50 tokens
Output: 250 tokens
No cache: (3,550 × $3 + 250 × $15) / 1M = $0.01440
Cached: (3,500 × $3 at 90% off + 50 × $3 + 250 × $15) / 1M = $0.00495
Saving on this call: ~66%

The more of your input is a stable, repeating prefix, the larger the win. Agent loops re-sending a long context save the most.

Rough saving: 50–90% off the cached portion of input (the exact discount and mechanism vary by provider — some cache prefixes automatically, some require you to mark content explicitly). Trade-off: only the stable prefix benefits, so a system prompt or document set that changes per request is a cache miss; caches also expire after a short idle window. When it applies: anything with a repeating prefix — chat, RAG, agents, few-shot prompts.
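The arithmetic behind the RAG example can be sketched as a small function. This is a minimal sketch, not any provider's billing formula: the name `call_cost` and its parameters are hypothetical, the prices ($3 input, $15 output per 1M tokens) are the illustrative ones from the example, and a flat 90% discount on cached reads is assumed. Some providers also charge extra to write to the cache, which is ignored here.

```python
def call_cost(prefix_tokens, fresh_tokens, output_tokens,
              input_price=3.0, output_price=15.0, cache_discount=0.0):
    """Dollar cost of one call. Prices are per 1M tokens (illustrative).

    cache_discount is the assumed fraction taken off the stable prefix
    when it is served from cache (0.0 means no caching).
    """
    prefix_cost = prefix_tokens * input_price * (1 - cache_discount)
    fresh_cost = fresh_tokens * input_price
    output_cost = output_tokens * output_price
    return (prefix_cost + fresh_cost + output_cost) / 1_000_000


no_cache = call_cost(3_500, 50, 250)
cached = call_cost(3_500, 50, 250, cache_discount=0.90)
saving = 1 - cached / no_cache
print(f"no cache ${no_cache:.5f} · cached ${cached:.5f} · saving {saving:.0%}")
```

Changing `prefix_tokens` relative to `fresh_tokens` shows how the saving grows as more of the input becomes a stable prefix.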

Batch processing · ~50% off

Mechanism: submit work as a non-real-time job and the provider runs it when capacity is free, usually within a few hours. In exchange you get roughly half off both input and output. The discount applies to the whole call, not just a prefix, so it composes with everything else.

10,000 documents — real-time vs batch

Per-doc cost (real-time): $0.0025
10,000 docs, real-time: $25.00
10,000 docs, batch @ 50% off: $12.50
Saving: $12.50 (50%)

Free money on any job a user isn't waiting on. The only cost is wall-clock latency.

Rough saving: ~50% off standard prices at most providers. Trade-off: latency — results arrive in minutes to hours, not seconds, with no firm SLA. When it applies: anything not in a user's live path — overnight extraction, bulk summarisation, evals, embeddings backfills. If nobody is waiting on the response, batch should be the default.
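The "batch by default unless someone is waiting" rule can be expressed as a simple pricing decision. The sketch below is illustrative only: `Job`, `job_cost`, and the flat 50% `BATCH_DISCOUNT` are assumptions, the first job mirrors the 10,000-document example, and the second job is a hypothetical live workload added for contrast.

```python
from dataclasses import dataclass

BATCH_DISCOUNT = 0.50  # assumed flat discount; check your provider's actual rate


@dataclass
class Job:
    name: str
    calls: int
    cost_per_call: float  # real-time price per call, in dollars
    user_waiting: bool    # True if a person is blocked on the response


def job_cost(job: Job) -> float:
    """Price a job at the real-time rate if a user is waiting, else at the batch rate."""
    full_price = job.calls * job.cost_per_call
    if job.user_waiting:
        return full_price
    return full_price * (1 - BATCH_DISCOUNT)


jobs = [
    Job("overnight extraction", 10_000, 0.0025, user_waiting=False),
    Job("live chat replies", 10_000, 0.0025, user_waiting=True),  # hypothetical
]
for job in jobs:
    print(f"{job.name}: ${job_cost(job):.2f}")
```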

Model routing and tiering · often 10×+ on routed calls

Mechanism: not every call needs your best model. Send easy, high-volume work to a cheaper tier and reserve the frontier model for the calls that genuinely need it — either by classifying difficulty up front, or by trying a cheap model first and escalating only on failure.

100k calls/day — all-frontier vs routed

All frontier: 100k × $0.012 = $1,200 / day
80% routed to small: 80k × $0.0006 = $48 / day
20% kept on frontier: 20k × $0.012 = $240 / day
Routed total: $288 / day

A 76% cut, just by sending the easy 80% to a model that handles them fine. The saving tracks how much of your traffic is genuinely easy.

Rough saving: proportional to how much traffic you can safely move down. This is frequently the single largest lever, because the price spread between tiers can exceed 10× (it is 20× per call in the example above). Trade-off: a routing decision can be wrong, and a too-cheap route shows up as quality regressions. When it applies: any mixed workload where some calls are clearly easier than others, which is most of them. Let an eval, not a guess, set the routing threshold.
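The second routing style mentioned above, trying a cheap model first and escalating only on failure, might look like the sketch below. Everything here is a stand-in: `cascade`, `small_model`, `frontier_model`, and `non_empty` are hypothetical names, the stub models return canned strings rather than calling any API, and the per-call costs are the illustrative ones from the example. Note that an escalated call pays for both attempts, which the up-front classification in the example does not.

```python
SMALL_COST = 0.0006    # per-call cost of the small model (illustrative)
FRONTIER_COST = 0.012  # per-call cost of the frontier model (illustrative)


def cascade(prompt, small, frontier, accept):
    """Try the small model first; escalate only if accept() rejects its answer.

    Returns (answer, cost). An escalated call pays for both attempts.
    """
    answer = small(prompt)
    if accept(answer):
        return answer, SMALL_COST
    return frontier(prompt), SMALL_COST + FRONTIER_COST


# Hypothetical stand-ins for real model calls and a real quality check.
def small_model(prompt):
    return "" if prompt.startswith("hard:") else f"small answer to {prompt!r}"


def frontier_model(prompt):
    return f"frontier answer to {prompt!r}"


def non_empty(answer):
    return bool(answer.strip())


prompts = ["classify this ticket", "hard: reconcile these two contracts"]
results = [cascade(p, small_model, frontier_model, non_empty) for p in prompts]
total = sum(cost for _, cost in results)
print(f"total cost for {len(prompts)} calls: ${total:.4f}")
```

In a real system the `accept` check would be whatever your eval says separates good answers from bad ones; a non-empty test is only a placeholder.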

Shorter, structured output · output is the pricey meter

Mechanism: output typically costs 2–5× input, and every token is generated sequentially, so it is the meter most worth shrinking. Ask for exactly what you need — a label, a JSON object, a short answer — instead of a discursive essay. Constrained or structured output formats keep responses tight and parseable at the same time.

Same task — verbose vs terse output

Input (same both ways): 1,000 tokens
Verbose answer: 800 tokens out
Terse / structured answer: 120 tokens out
Verbose: (1,000 × $1 + 800 × $5) / 1M = $0.00500
Terse: (1,000 × $1 + 120 × $5) / 1M = $0.00160
Saving: ~68%

Output dominates the bill here even though it's a fraction of the token count. Trimming it moves the needle far more than trimming input.

Rough saving: scales directly with how much output you remove — often the difference between a cheap call and an expensive one. Trade-off: over-constraining can clip answers that genuinely needed room. When it applies: almost everywhere, but especially classification, extraction, and routing, where a long answer was never the point.
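To see why output dominates the bill, it helps to split a call's cost into its input and output shares. The following is a rough sketch using the illustrative prices from the example ($1 input, $5 output per 1M tokens); `cost_breakdown` is a hypothetical helper, not part of any SDK.

```python
def cost_breakdown(input_tokens, output_tokens, input_price=1.0, output_price=5.0):
    """Return (total_cost_in_dollars, output_share_of_cost).

    Prices are per 1M tokens and purely illustrative.
    """
    input_cost = input_tokens * input_price / 1_000_000
    output_cost = output_tokens * output_price / 1_000_000
    total = input_cost + output_cost
    return total, output_cost / total


verbose_total, verbose_share = cost_breakdown(1_000, 800)
terse_total, terse_share = cost_breakdown(1_000, 120)

print(f"verbose: ${verbose_total:.5f}, output is {verbose_share:.0%} of the bill")
print(f"terse:   ${terse_total:.5f}, output is {terse_share:.0%} of the bill")
print(f"saving:  {1 - terse_total / verbose_total:.0%}")
```

In the verbose case the output is under half of the tokens but most of the cost, which is the asymmetry this lever exploits.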

Trimming context and tightening retrieval · cut input directly

Mechanism: for input-heavy features, the fastest cut is sending less. Tighten retrieval so you pass three relevant documents instead of ten; strip boilerplate, headers, and duplicated front-matter; summarise or drop stale conversation turns. Every token you don't send is a token you don't pay for — on every request, forever.

RAG retrieval — loose vs precise

Loose (top-10 docs): 9,000 tokens in
Precise (top-3 docs): 2,700 tokens in
Loose input cost: 9,000 × $1 / 1M = $0.00900
Precise input cost: 2,700 × $1 / 1M = $0.00270
Saving on input: ~70%

Better retrieval often improves answer quality too — less irrelevant context means less to get lost in. One of the rare levers that cuts cost and helps accuracy at once.

Rough saving: proportional to the context you remove — frequently 30–70% of input on bloated prompts. Trade-off: trim too aggressively and you drop the document that held the answer, raising failure and retry rates. The right target is precision, not simply less. When it applies: every input-heavy shape — RAG, summarisation, long-context chat.
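One way to make "precision, not simply less" concrete is to cap retrieved context by both document count and token budget, keeping the highest-scoring documents first. This is a sketch under loud assumptions: `select_context` and `approx_tokens` are hypothetical helpers, relevance scores are taken as given by your retriever, and token counts are crudely approximated as whitespace-separated words rather than by a real tokenizer.

```python
def approx_tokens(text):
    # Crude assumption: one token per whitespace-separated word.
    # Swap in your model's tokenizer for real counts.
    return len(text.split())


def select_context(scored_docs, max_docs=3, token_budget=2_700):
    """Pick the highest-scoring docs that fit within max_docs and token_budget.

    scored_docs is a list of (relevance_score, text) pairs.
    Returns (chosen_texts, tokens_used).
    """
    ranked = sorted(scored_docs, key=lambda doc: doc[0], reverse=True)
    chosen, used = [], 0
    for _score, text in ranked:
        if len(chosen) == max_docs:
            break
        size = approx_tokens(text)
        if used + size > token_budget:
            continue  # too large for the remaining budget; try the next doc
        chosen.append(text)
        used += size
    return chosen, used


docs = [
    (0.91, "alpha beta gamma"),
    (0.15, "noise " * 50),
    (0.88, "delta epsilon"),
    (0.40, "zeta eta theta iota"),
    (0.72, "kappa lambda"),
]
context, tokens_used = select_context(docs, max_docs=3, token_budget=10)
print(context, tokens_used)
```

The right values for `max_docs` and `token_budget` are the ones at which your eval's answer quality stops improving, not the smallest ones that run.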

Reasoning-effort control · thinking is billed as output

Mechanism: reasoning models generate an internal chain of thought before answering. Those thinking tokens are billed at the output rate, and there are often many times more of them than in the visible response. Most providers let you dial reasoning effort up or down (or off). Turning it down on tasks that don't need deliberation removes a large, invisible block of output tokens.

Same answer — high vs low reasoning effort

Visible answer (both): 200 tokens
High effort, thinking: 3,800 tokens
Low effort, thinking: 400 tokens
High: 4,000 output × $15 / 1M = $0.06000
Low: 600 output × $15 / 1M = $0.00900
Saving: ~85%

The visible answer is identical; the hidden reasoning is where the money went. On easy tasks, most of that thinking was unnecessary.

Rough saving: can be the largest of any lever on reasoning models, because thinking tokens often outnumber the visible answer by 10–20×. Trade-off: hard tasks genuinely benefit from reasoning, and turning it down too far costs accuracy on exactly the problems you chose a reasoning model for. The decision is per-task. When it applies: any reasoning-capable model on a mixed workload.
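Because the decision is per-task, one way to reason about it is a small policy table mapping task types to effort levels. The sketch below is an illustration only: the token counts and the $15 output price come from the example, while the task names, the traffic mix, and the `EFFORT_BY_TASK` policy are hypothetical. It does not call any provider's reasoning-effort parameter; it only estimates cost.

```python
OUTPUT_PRICE = 15.0  # dollars per 1M output tokens (illustrative)
VISIBLE_TOKENS = 200
THINKING_TOKENS = {"high": 3_800, "low": 400}  # from the example

# Hypothetical per-task policy; in practice an eval should decide these.
EFFORT_BY_TASK = {"classification": "low", "multi-step planning": "high"}


def call_cost(effort):
    """Dollar cost of one call: thinking tokens are billed as output."""
    billed_output = VISIBLE_TOKENS + THINKING_TOKENS[effort]
    return billed_output * OUTPUT_PRICE / 1_000_000


def workload_cost(calls_by_task, policy):
    """Total cost given call counts per task and an effort level per task."""
    return sum(n * call_cost(policy[task]) for task, n in calls_by_task.items())


calls = {"classification": 900, "multi-step planning": 100}  # hypothetical mix
all_high = workload_cost(calls, {task: "high" for task in calls})
per_task = workload_cost(calls, EFFORT_BY_TASK)
print(f"all high effort: ${all_high:.2f} · per-task effort: ${per_task:.2f}")
```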

Stacking the levers

These savings multiply. Route the easy 80% to a small model. Cache the stable prefix on what's left. Batch anything no user is waiting on. Trim retrieval. Ask for tight output. Applied together, a workload can land at a small fraction of its naive cost without changing what it delivers. Priority order: routing and caching usually move the most; batch is free wherever latency allows; output discipline and context trimming tighten what's left.
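A rough way to estimate a stacked saving is to apply each lever to the part of the bill it touches: trimming and terse output change token counts, caching discounts part of the input, and batch discounts the whole call. The sketch below is an estimate under stated assumptions, not a billing model: `stacked_cost` is a hypothetical helper, all prices and fractions are illustrative, and it assumes cache and batch discounts can be combined, which is not true at every provider.

```python
def stacked_cost(input_tokens, output_tokens, input_price, output_price,
                 cached_fraction=0.0, cache_discount=0.0, batch_discount=0.0):
    """Dollar cost of one call with caching and batch discounts applied.

    Prices are per 1M tokens. cached_fraction is the share of input served
    from cache; both discounts are fractions between 0 and 1.
    """
    cached = input_tokens * cached_fraction
    fresh = input_tokens - cached
    input_cost = (cached * (1 - cache_discount) + fresh) * input_price
    output_cost = output_tokens * output_price
    return (input_cost + output_cost) * (1 - batch_discount) / 1_000_000


# Naive call: loose retrieval, verbose output, no discounts.
naive = stacked_cost(9_000, 800, input_price=1.0, output_price=5.0)

# Same task with trimmed input, terse output, half the input cached, run in batch.
optimised = stacked_cost(2_700, 120, input_price=1.0, output_price=5.0,
                         cached_fraction=0.5, cache_discount=0.90,
                         batch_discount=0.50)

print(f"naive ${naive:.5f} · optimised ${optimised:.5f} · "
      f"{optimised / naive:.0%} of naive cost")
```

Routing is left out of this sketch because it changes the prices themselves; to include it, call the function once per tier with that tier's prices and weight the results by traffic share.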

None of these trade quality for cost — they trade convenience for cost. Whether the cheaper path still does the job is the one thing only your eval can tell you. Estimate the saving first, then verify.
