June 2026

How LLM pricing works

Every provider charges for the same thing — tokens in, tokens out — but the bill has more dimensions than the sticker price suggests. Here's what you're actually paying for.

Tokens: the unit you're buying

LLMs don't read words — they read tokens, fragments of text that the model's tokeniser has split your input into. Most English words are one token. Longer or rarer words get split into pieces. Punctuation and whitespace are tokens too.

Here's how a real sentence gets tokenised:

The | cost | of | an | API | call | depends | on | token | isation

The sentence "The cost of an API call depends on tokenisation" becomes 10 tokens. Notice "tokenisation" splits into two: the model knows "token" and "isation" separately. Roughly, 1 million tokens ≈ 750,000 words — about 8–10 novels.

Prices are quoted per million tokens (per 1M). When you see $3.00 on a pricing page, that's $3.00 per million tokens — or $0.000003 per token. Tiny per call, but it compounds fast at volume.
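As a rough sketch of the arithmetic above, the snippet below converts a per-1M price into a per-token price and estimates the cost of a given word count. The 0.75 words-per-token ratio is the rule of thumb from this section, not an exact tokeniser count, and the function names are illustrative only.

```python
TOKENS_PER_MILLION = 1_000_000
WORDS_PER_TOKEN = 0.75  # rule of thumb from the text: 1M tokens ≈ 750,000 words


def price_per_token(price_per_million: float) -> float:
    """Convert a quoted per-1M-token price (USD) into a per-token price."""
    return price_per_million / TOKENS_PER_MILLION


def estimate_tokens(word_count: int) -> int:
    """Approximate token count for English text; real tokenisers will differ."""
    return round(word_count / WORDS_PER_TOKEN)


def estimate_cost(word_count: int, price_per_million: float) -> float:
    """Estimated USD cost of processing `word_count` words at the given price."""
    return estimate_tokens(word_count) * price_per_token(price_per_million)


print(price_per_token(3.00))         # per-token price for a $3.00 / 1M model
print(estimate_tokens(750_000))      # roughly one million tokens
print(estimate_cost(750_000, 3.00))  # roughly $3.00
```

For a real estimate, count tokens with the provider's own tokeniser, since the words-per-token ratio shifts with language and content type.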


Input and output: two separate meters

Every API call has two billable components, priced separately:

You send: input tokens → Model: processes → You receive: output tokens

Input is everything you send: your system prompt, user message, conversation history, retrieved documents, images. Output is everything the model generates in response.

Output almost always costs more — typically 2–5× input price. Why? The model reads input tokens in parallel (fast, efficient), but generates output tokens one at a time, each requiring a full forward pass through the model. Generation is computationally harder than comprehension.

For example, Llama 4 Maverick charges $0.15 input / $0.60 output per 1M tokens, so output is 4× the input price.
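A minimal sketch of the two meters, using the Llama 4 Maverick prices quoted above. The token counts (2,000 in, 500 out) are made up for illustration, and the function names are not from any provider SDK.

```python
def call_cost(input_tokens: int, output_tokens: int,
              input_per_m: float, output_per_m: float) -> tuple[float, float, float]:
    """Return (input_cost, output_cost, total) in USD for one API call.

    Prices are USD per 1M tokens, as quoted on pricing pages.
    """
    input_cost = input_tokens * input_per_m / 1_000_000
    output_cost = output_tokens * output_per_m / 1_000_000
    return input_cost, output_cost, input_cost + output_cost


def output_premium(input_per_m: float, output_per_m: float) -> float:
    """How many times more an output token costs than an input token."""
    return output_per_m / input_per_m


# Hypothetical call: 2,000 input tokens and 500 output tokens.
in_cost, out_cost, total = call_cost(2_000, 500, input_per_m=0.15, output_per_m=0.60)
print(f"input ${in_cost:.6f}, output ${out_cost:.6f}, total ${total:.6f}")
print(f"output premium: {output_premium(0.15, 0.60):.1f}x")
```

With a 4× premium, the 500 output tokens here cost about the same as the 2,000 input tokens, which is why the two meters need to be tracked separately.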

Where your money actually goes

Most workloads are input-heavy. A RAG query might send 4,000 tokens of context and get back 200. A classification call sends a document and gets back one word. The ratio matters because it determines which price to optimise:

Input tokens: 80% · Output tokens: 20%

A typical chat/agent workload runs about 3:1 input-to-output. That's why the blended price on this site weights input at 75% — it's a closer approximation of your real bill than a simple average.
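The blended figure is a weighted average, sketched below. The $3.00 / $15.00 prices are borrowed from the worked example later in this article purely for illustration, and the helper names are assumptions rather than anything the site exposes.

```python
def blended_price(input_per_m: float, output_per_m: float,
                  input_weight: float = 0.75) -> float:
    """Weighted per-1M price; 0.75 corresponds to a 3:1 input-to-output token mix."""
    return input_weight * input_per_m + (1 - input_weight) * output_per_m


def input_share_of_cost(input_tokens: int, output_tokens: int,
                        input_per_m: float, output_per_m: float) -> float:
    """Fraction of a call's cost that comes from input tokens."""
    input_cost = input_tokens * input_per_m
    output_cost = output_tokens * output_per_m
    return input_cost / (input_cost + output_cost)


# Illustrative prices: $3.00 input / $15.00 output per 1M tokens.
print(blended_price(3.00, 15.00))   # 75/25 weighting
print((3.00 + 15.00) / 2)           # simple average, for comparison

# RAG-style call from the text: 4,000 tokens in, 200 tokens out.
print(input_share_of_cost(4_000, 200, 3.00, 15.00))
```

At these example prices the 75/25 blend comes to $6.00 per 1M, against $9.00 for the simple average, so the simple average would overstate an input-heavy bill by half.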


The five pricing dimensions

The sticker price — input and output — is where most people stop. But the real bill has up to five dimensions:

1. Standard input: your prompt, context, and conversation history. The base price you see on every pricing page. Typical: $0.10 – $10 / 1M tokens.
2. Standard output: the model's response. 2–5× input price because each token is generated sequentially. Typical: $0.40 – $30 / 1M tokens.
3. Cached input: repeated prefixes (system prompts, docs) served from cache. 2–10× cheaper than standard input. Typical: 50–90% off standard input.
4. Batch: non-real-time jobs processed in bulk, if you can wait hours for the results. Typical: 50% off standard prices, on both input and output.
5. Thinking tokens: reasoning models (o3, Claude with extended thinking) generate internal chain-of-thought tokens that are often many times longer than the visible response. A 200-token answer might have 2,000 thinking tokens behind it, and you pay for all of them. Billed at the output token rate.
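One way to see how the five dimensions combine is to put them in a single cost function. This is a sketch under stated assumptions: the class and field names are invented for illustration, and it assumes the batch discount applies on top of the cached rate, which varies by provider.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PriceSheet:
    input_per_m: float            # standard input, USD per 1M tokens
    output_per_m: float           # standard output, USD per 1M tokens
    cached_discount: float = 0.0  # fraction off standard input, e.g. 0.9
    batch_discount: float = 0.0   # fraction off the whole call, e.g. 0.5


@dataclass(frozen=True)
class Usage:
    input_tokens: int             # uncached input tokens
    cached_tokens: int = 0        # input tokens served from cache
    output_tokens: int = 0        # visible response tokens
    thinking_tokens: int = 0      # hidden reasoning tokens, billed as output
    batch: bool = False


def bill(prices: PriceSheet, usage: Usage) -> float:
    """Total USD cost of one call across all five pricing dimensions."""
    cached_rate = prices.input_per_m * (1 - prices.cached_discount)
    billed_output = usage.output_tokens + usage.thinking_tokens
    cost = (usage.input_tokens * prices.input_per_m
            + usage.cached_tokens * cached_rate
            + billed_output * prices.output_per_m) / 1_000_000
    if usage.batch:
        # Assumption: the batch discount stacks with caching.
        cost *= 1 - prices.batch_discount
    return cost


sheet = PriceSheet(input_per_m=3.00, output_per_m=15.00,
                   cached_discount=0.9, batch_discount=0.5)
print(bill(sheet, Usage(input_tokens=3_050, cached_tokens=500, output_tokens=250)))
print(bill(sheet, Usage(input_tokens=3_050, cached_tokens=500, output_tokens=250,
                        thinking_tokens=2_000)))
```

The second call differs only in its 2,000 thinking tokens, yet it costs more than three times as much as the first.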

Cached input: the discount most people miss

If you send the same prefix on every request — a system prompt, a set of retrieved documents, a few-shot example block — the provider can serve those tokens from a KV cache instead of reprocessing them. The discount is steep:

Anthropic: 90% off · OpenAI: 50% off · Google: 75% off

The catch: the mechanism varies by provider. Anthropic and OpenAI use prefix-based caching: matching starts at the first token of your prompt and extends for as long as the content is identical, so any change near the start (an edited system prompt, for instance) causes a cache miss for everything after it. Google uses explicit caching, where you designate the content to cache via the API. Either way, the principle is the same: put stable, repeating content at the start of the prompt and variable content at the end.

For agent loops — where the same growing conversation gets resent on every step — cached input is often the largest line item on the bill. Two frontier models with the same headline price can differ sharply once caching enters the maths.
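The agent-loop effect can be sketched with a small simulation. It assumes idealised prefix caching: everything sent on the previous step is served from cache on the next one, with no cache-write surcharge and no expiry. Real providers differ on both points, and the step sizes here are invented.

```python
def agent_loop_input_cost(steps: int, base_tokens: int, tokens_added_per_step: int,
                          input_per_m: float, cached_discount: float) -> tuple[float, float]:
    """Return (cost_without_cache, cost_with_cache) in USD for input tokens only.

    Each step resends the whole conversation so far. With caching, the part
    already sent on the previous step is billed at the discounted rate.
    """
    cached_rate = input_per_m * (1 - cached_discount)
    without_cache = 0.0
    with_cache = 0.0
    context = base_tokens   # tokens sent on the current step
    previous = 0            # tokens sent on the prior step (assumed cached)
    for _ in range(steps):
        without_cache += context * input_per_m / 1_000_000
        new_tokens = context - previous
        with_cache += (previous * cached_rate + new_tokens * input_per_m) / 1_000_000
        previous = context
        context += tokens_added_per_step
    return without_cache, with_cache


# Hypothetical loop: 2,000-token starting context, 500 new tokens per step, 20 steps.
plain, cached = agent_loop_input_cost(20, 2_000, 500, input_per_m=3.00, cached_discount=0.9)
print(f"no caching: ${plain:.4f}   with caching: ${cached:.4f}")
```

Because the resent context grows every step while the new tokens per step stay constant, the cached share of input grows with the length of the loop.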


Worked example: what does a RAG query cost?

Let's price a real request. A customer asks a question, your app retrieves three documents and sends them to the model, and the model answers in a paragraph.

Cost breakdown: single RAG query
  System prompt: 500 tokens
  3 retrieved docs: 3,000 tokens
  User question: 50 tokens
  Total input: 3,550 tokens
  Model response: 250 tokens (output)
  Input cost: 3,550 × $3.00 / 1M = $0.01065
  Output cost: 250 × $15.00 / 1M = $0.00375
  Total per query: $0.0144
At 10,000 queries/day → ~$144/day · $4,320/month

Now with cached input — if the system prompt (500 tokens) hits cache at 90% off:

Same query, with prompt caching
  Cached input: 500 × $0.30 / 1M = $0.00015
  Uncached input: 3,050 × $3.00 / 1M = $0.00915
  Output: 250 × $15.00 / 1M = $0.00375
  Total per query: $0.01305
Saving: ~9% on this query ($0.00135). The savings scale with how much of your input is cacheable; agent loops with long, repeating contexts save far more.
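The two breakdowns above can be reproduced with one function that varies only the number of cached tokens. The second scenario, where the retrieved documents are cached too, is a hypothetical: it only applies if the same documents recur in the same order across queries.

```python
INPUT_PER_M = 3.00    # standard input, USD per 1M tokens
OUTPUT_PER_M = 15.00  # output, USD per 1M tokens
CACHED_PER_M = 0.30   # cached input at 90% off


def query_cost(cached_tokens: int, total_input: int = 3_550,
               output_tokens: int = 250) -> float:
    """USD cost of the worked-example query with `cached_tokens` served from cache."""
    uncached_tokens = total_input - cached_tokens
    return (cached_tokens * CACHED_PER_M
            + uncached_tokens * INPUT_PER_M
            + output_tokens * OUTPUT_PER_M) / 1_000_000


baseline = query_cost(cached_tokens=0)
scenarios = [
    ("system prompt cached", 500),
    ("system prompt + docs cached (hypothetical)", 3_500),
]
print(f"no caching: ${baseline:.5f} per query")
for label, cached in scenarios:
    cost = query_cost(cached)
    saving = 1 - cost / baseline
    print(f"{label}: ${cost:.5f} per query, {saving:.1%} saved")
```

Caching only the 500-token system prompt trims the bill slightly, whereas caching 3,500 of the 3,550 input tokens cuts it by roughly two thirds, leaving output as the dominant cost.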

Thinking tokens: the invisible multiplier

Reasoning models — OpenAI's o-series, Anthropic's Claude in extended thinking mode, Google's Gemini thinking — don't just produce a response. They first generate an internal chain of thought: planning, reasoning, self-correcting. These thinking tokens are billed as output, at output price.

The multiplier can be dramatic:

Reasoning model: visible vs actual output
  Visible response: 200 tokens
  Thinking (hidden): 3,800 tokens
  Billed output: 4,000 tokens
Your "200-token answer" actually consumed 4,000 output tokens, 20× what the visible response suggests.

This is why reasoning model pricing looks deceptively similar to standard models until you check the invoice. The per-token rate might be comparable, but the token count explodes. Always check whether a model uses thinking tokens before estimating cost — the pricing pages don't always make this prominent.
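A short sketch of the multiplier, using the token counts from the table above. The $15.00 per 1M output price is an assumption borrowed from the worked RAG example, not a quoted price for any particular reasoning model.

```python
def billed_output_tokens(visible_tokens: int, thinking_tokens: int) -> int:
    """Thinking tokens are billed as output, so both counts are added together."""
    return visible_tokens + thinking_tokens


def thinking_multiplier(visible_tokens: int, thinking_tokens: int) -> float:
    """Ratio of billed output tokens to the tokens you actually see."""
    return billed_output_tokens(visible_tokens, thinking_tokens) / visible_tokens


def output_cost(visible_tokens: int, thinking_tokens: int, output_per_m: float) -> float:
    """USD output cost of one response, including hidden reasoning."""
    return billed_output_tokens(visible_tokens, thinking_tokens) * output_per_m / 1_000_000


OUTPUT_PER_M = 15.00  # assumed output price, USD per 1M tokens

print(billed_output_tokens(200, 3_800))                # tokens on the invoice
print(thinking_multiplier(200, 3_800))                 # billed vs visible
print(f"${output_cost(200, 0, OUTPUT_PER_M):.4f}")     # cost if only visible tokens were billed
print(f"${output_cost(200, 3_800, OUTPUT_PER_M):.4f}") # cost actually billed
```

When estimating a reasoning workload, use the billed output count reported in the API's usage data rather than the length of the visible response.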


How we present this

Every model on the table shows normalised pricing in USD per 1M tokens — input, output, and cached where available. The blended price column gives a single sortable figure weighted 75% input / 25% output (approximating a typical 3:1 workload mix).

When a provider changes their prices, we don't overwrite — we add a new snapshot. The trends chart shows the full history, and every price point on a model's detail page carries the date it was recorded and where it came from.