How LLM pricing works
Every provider charges for the same thing — tokens in, tokens out — but the bill has more dimensions than the sticker price suggests. Here's what you're actually paying for.
Tokens: the unit you're buying
LLMs don't read words; they read tokens, the fragments of text a model's tokeniser splits your input into. Most common English words are a single token. Longer or rarer words get split into pieces. Punctuation and whitespace count towards the total too.
Here's how a real sentence gets tokenised:
The sentence "The cost of an API call depends on tokenisation" becomes 10 tokens. Notice "tokenisation" splits into two: the model knows "token" and "isation" separately. Roughly, 1 million tokens ≈ 750,000 words — about 8–10 novels.
Prices are quoted per million tokens (per 1M). When you see $3.00 on a pricing page, that's $3.00 per million tokens — or $0.000003 per token. Tiny per call, but it compounds fast at volume.
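The per-million arithmetic can be sketched in a few lines of Python. The function names here are hypothetical, and the words-to-tokens ratio is only the rough rule of thumb quoted above (1 million tokens ≈ 750,000 words), not an exact tokeniser count.

```python
TOKENS_PER_MILLION = 1_000_000
WORDS_PER_TOKEN = 0.75  # rule of thumb only: 1M tokens is roughly 750,000 English words


def per_token_price(price_per_million: float) -> float:
    """Convert a quoted per-1M-token price (USD) into a per-token price."""
    return price_per_million / TOKENS_PER_MILLION


def estimate_tokens(word_count: int) -> int:
    """Rough token estimate for English text; a real tokeniser will differ."""
    return round(word_count / WORDS_PER_TOKEN)


def token_cost(tokens: int, price_per_million: float) -> float:
    """Cost in USD of a number of tokens at a per-1M-token price."""
    return tokens * price_per_million / TOKENS_PER_MILLION


if __name__ == "__main__":
    # The $3.00 sticker price from the text, expressed per token.
    print(f"per token: ${per_token_price(3.00):.6f}")
    # A 1,500-word document, priced as input at $3.00 per 1M tokens.
    tokens = estimate_tokens(1_500)
    print(f"{tokens} tokens cost about ${token_cost(tokens, 3.00):.4f}")
```

A 1,500-word document comes to roughly 2,000 tokens under this estimate, which is well under a cent at $3.00 per 1M. The cost only becomes visible once the call is repeated thousands of times.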
Input and output: two separate meters
Every API call has two billable components, priced separately:
Input is everything you send: your system prompt, user message, conversation history, retrieved documents, images. Output is everything the model generates in response.
Output almost always costs more, typically 2–5× the input price. Why? The model reads input tokens in parallel (fast, efficient), but generates output tokens one at a time, each requiring a full forward pass through the model. Per token, generation takes more compute than reading.
For example, Llama 4 Maverick charges $0.15 input / $0.60 output per 1M tokens, so output is 4× the input price.
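As a minimal sketch of the two meters, the snippet below prices a single call from its input and output token counts. The `Pricing` class and `call_cost` function are illustrative names, not any provider's API. The prices are the Llama 4 Maverick figures quoted above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """Prices in USD per 1M tokens."""
    input_per_million: float
    output_per_million: float


def call_cost(pricing: Pricing, input_tokens: int, output_tokens: int) -> float:
    """Total cost in USD of one call: the input meter plus the output meter."""
    input_cost = input_tokens * pricing.input_per_million / 1_000_000
    output_cost = output_tokens * pricing.output_per_million / 1_000_000
    return input_cost + output_cost


maverick = Pricing(input_per_million=0.15, output_per_million=0.60)

if __name__ == "__main__":
    # Same token count on both meters: the output side costs 4x as much.
    print(f"input only:  ${call_cost(maverick, 1_000, 0):.6f}")
    print(f"output only: ${call_cost(maverick, 0, 1_000):.6f}")
    print(f"both:        ${call_cost(maverick, 1_000, 1_000):.6f}")
```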
Where your money actually goes
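Most workloads are input-heavy. A RAG query might send 4,000 tokens of context and get back 200. A classification call sends a document and gets back one word. The ratio matters because it determines which price to optimise: at 20:1 input-to-output, as in that RAG query, input remains the larger share of the bill even when output costs 4× as much per token.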
A typical chat/agent workload runs about 3:1 input-to-output. That's why the blended price on this site weights input at 75% — it's a closer approximation of your real bill than a simple average.
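The 75% weighting follows directly from the 3:1 ratio: three parts input out of four parts total. A minimal sketch of the calculation, with hypothetical function names and the Llama 4 Maverick prices from earlier in this section as sample input:

```python
def input_weight_from_ratio(input_parts: float, output_parts: float) -> float:
    """Share of tokens that are input for an input:output ratio, e.g. 3:1 gives 0.75."""
    return input_parts / (input_parts + output_parts)


def blended_price(input_per_million: float, output_per_million: float,
                  input_weight: float = 0.75) -> float:
    """Weighted average of input and output prices, in USD per 1M tokens."""
    if not 0.0 <= input_weight <= 1.0:
        raise ValueError("input_weight must be between 0 and 1")
    return input_weight * input_per_million + (1.0 - input_weight) * output_per_million


if __name__ == "__main__":
    weight = input_weight_from_ratio(3, 1)
    blended = blended_price(0.15, 0.60, weight)
    simple_average = blended_price(0.15, 0.60, 0.5)
    print(f"blended (75/25): ${blended:.4f} per 1M")
    print(f"simple average:  ${simple_average:.4f} per 1M")
```

For these prices the blended figure is about $0.2625 per 1M, against $0.375 for a simple average. A simple average would overstate the cost of an input-heavy workload.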
The five pricing dimensions
The sticker price — input and output — is where most people stop. But the real bill has up to five dimensions:
Cached input: the discount most people miss
If you send the same prefix on every request (a system prompt, a set of retrieved documents, a few-shot example block), the provider can serve those tokens from a KV cache instead of reprocessing them. The discount is steep.
The catch: the mechanism varies by provider. Anthropic and OpenAI use prefix-based caching: matching starts from the beginning of your prompt and extends continuously, so a changed system prompt means a cache miss. Google uses explicit caching: you designate content to cache via the API. Either way, the principle is the same: put stable, repeating content at the start of the prompt and content that changes per request at the end.
For agent loops — where the same growing conversation gets resent on every step — cached input is often the largest line item on the bill. Two frontier models with the same headline price can differ sharply once caching enters the maths.
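To see why cached input dominates in an agent loop, here is a simplified sketch. It assumes every step resends the whole conversation, that everything sent on the previous step is a cache hit, and that only the newly appended tokens are billed at the full input rate. The token counts, the $3.00 input price, and the 90% discount are all hypothetical. Real providers may also apply cache write surcharges, minimum cacheable lengths, and cache expiry, none of which are modelled here.

```python
def agent_loop_tokens(steps: int, base_prompt_tokens: int,
                      new_tokens_per_step: int) -> tuple[int, int]:
    """Return (cached_tokens, fresh_tokens) summed over all steps of the loop.

    Assumes the prompt grows by new_tokens_per_step on each step and that the
    entire previous prompt is served from cache (an idealised prefix cache).
    """
    cached = 0
    fresh = 0
    previous_prompt = 0
    for step in range(steps):
        prompt = base_prompt_tokens + step * new_tokens_per_step
        cached += previous_prompt          # prefix already seen on the last step
        fresh += prompt - previous_prompt  # tokens the provider has not seen yet
        previous_prompt = prompt
    return cached, fresh


if __name__ == "__main__":
    INPUT_PRICE = 3.00                   # hypothetical USD per 1M input tokens
    CACHED_PRICE = INPUT_PRICE * 0.10    # hypothetical 90% cache discount

    cached, fresh = agent_loop_tokens(steps=50, base_prompt_tokens=2_000,
                                      new_tokens_per_step=500)
    with_cache = (cached * CACHED_PRICE + fresh * INPUT_PRICE) / 1_000_000
    without_cache = (cached + fresh) * INPUT_PRICE / 1_000_000
    print(f"cached tokens: {cached}, fresh tokens: {fresh}")
    print(f"input cost with cache:    ${with_cache:.4f}")
    print(f"input cost without cache: ${without_cache:.4f}")
```

Under these assumptions a 50-step loop sends 686,000 cached tokens and only 26,500 fresh ones. Even at a tenth of the price, the cached tokens cost more in total than the fresh ones, so the cached rate drives the input bill.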
Worked example: what does a RAG query cost?
Let's price a real request. A customer asks a question, your app retrieves three documents and sends them to the model, and the model answers in a paragraph.
Now with cached input — if the system prompt (500 tokens) hits cache at 90% off:
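A sketch of that calculation follows, under stated assumptions. The 500-token system prompt and the 90% cache discount come from the text. Everything else is a hypothetical figure chosen for illustration: three documents of 1,000 tokens each, a 50-token question, a 200-token answer, and prices of $3.00 input / $15.00 output per 1M tokens.

```python
PER_MILLION = 1_000_000


def rag_query_cost(system_tokens: int, document_tokens: int, question_tokens: int,
                   answer_tokens: int, input_price: float, output_price: float,
                   cache_discount: float = 0.0) -> float:
    """Cost in USD of one RAG request.

    Only the system prompt is treated as cacheable; cache_discount=0.9 means
    those tokens are billed at 10% of the normal input price.
    """
    cached_price = input_price * (1.0 - cache_discount)
    system_cost = system_tokens * cached_price / PER_MILLION
    fresh_cost = (document_tokens + question_tokens) * input_price / PER_MILLION
    output_cost = answer_tokens * output_price / PER_MILLION
    return system_cost + fresh_cost + output_cost


if __name__ == "__main__":
    request = dict(
        system_tokens=500,        # from the text
        document_tokens=3_000,    # hypothetical: 3 documents x 1,000 tokens
        question_tokens=50,       # hypothetical
        answer_tokens=200,        # hypothetical: roughly a paragraph
        input_price=3.00,         # hypothetical USD per 1M
        output_price=15.00,       # hypothetical USD per 1M
    )
    uncached = rag_query_cost(**request)
    cached = rag_query_cost(**request, cache_discount=0.9)
    print(f"without cache: ${uncached:.5f}")
    print(f"with cache:    ${cached:.5f}")
    print(f"per 1M requests: ${uncached * 1_000_000:,.0f} vs ${cached * 1_000_000:,.0f}")
```

With these numbers the request costs about $0.01365 uncached and about $0.0123 with the system prompt cached. The saving per call is small because only 500 of the 3,550 input tokens are cacheable. Caching the retrieved documents as well, where they repeat across requests, would move the figure far more.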
Thinking tokens: the invisible multiplier
Reasoning models — OpenAI's o-series, Anthropic's Claude in extended thinking mode, Google's Gemini thinking — don't just produce a response. They first generate an internal chain of thought: planning, reasoning, self-correcting. These thinking tokens are billed as output, at output price.
The multiplier can be dramatic.
This is why reasoning model pricing looks deceptively similar to standard models until you check the invoice. The per-token rate might be comparable, but the token count explodes. Always check whether a model uses thinking tokens before estimating cost — the pricing pages don't always make this prominent.
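The effect on a single call can be sketched as follows. The token counts and the $15.00 output price are hypothetical, and the function names are illustrative. The only rule taken from the text is that thinking tokens are billed at the output rate.

```python
def billed_output_cost(visible_tokens: int, thinking_tokens: int,
                       output_price_per_million: float) -> float:
    """Output-side cost in USD when thinking tokens are billed as output."""
    billed_tokens = visible_tokens + thinking_tokens
    return billed_tokens * output_price_per_million / 1_000_000


def thinking_multiplier(visible_tokens: int, thinking_tokens: int) -> float:
    """How many times larger the billed output is than the visible answer."""
    if visible_tokens <= 0:
        raise ValueError("visible_tokens must be positive")
    return (visible_tokens + thinking_tokens) / visible_tokens


if __name__ == "__main__":
    OUTPUT_PRICE = 15.00  # hypothetical USD per 1M output tokens
    visible, thinking = 200, 3_000  # hypothetical: short answer, long reasoning

    print(f"answer only:   ${billed_output_cost(visible, 0, OUTPUT_PRICE):.4f}")
    print(f"with thinking: ${billed_output_cost(visible, thinking, OUTPUT_PRICE):.4f}")
    print(f"multiplier:    {thinking_multiplier(visible, thinking):.0f}x")
```

In this example a 200-token answer preceded by 3,000 thinking tokens is billed as 3,200 output tokens, sixteen times the visible response, at an unchanged per-token rate.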
How we present this
Every model on the table shows normalised pricing in USD per 1M tokens — input, output, and cached where available. The blended price column gives a single sortable figure weighted 75% input / 25% output (approximating a typical 3:1 workload mix).
When a provider changes their prices, we don't overwrite — we add a new snapshot. The trends chart shows the full history, and every price point on a model's detail page carries the date it was recorded and where it came from.
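The snapshot approach can be illustrated with a small append-only structure. This is a sketch of the general idea only, not the site's actual implementation. The class names, field names, dates, prices, and source labels below are all made up for the example.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class PriceSnapshot:
    """One recorded price point; frozen so a snapshot cannot be edited later."""
    recorded_on: date
    source: str
    input_per_million: float
    output_per_million: float


class PriceHistory:
    """Append-only price history for a single model."""

    def __init__(self) -> None:
        self._snapshots: list[PriceSnapshot] = []

    def record(self, snapshot: PriceSnapshot) -> None:
        # A price change adds a new entry; earlier entries are never overwritten.
        self._snapshots.append(snapshot)

    def current(self) -> Optional[PriceSnapshot]:
        """Most recently dated snapshot, or None if nothing has been recorded."""
        if not self._snapshots:
            return None
        return max(self._snapshots, key=lambda s: s.recorded_on)

    def timeline(self) -> list[PriceSnapshot]:
        """All snapshots in date order, e.g. for plotting a trend."""
        return sorted(self._snapshots, key=lambda s: s.recorded_on)


if __name__ == "__main__":
    history = PriceHistory()
    # Made-up data points for illustration only.
    history.record(PriceSnapshot(date(2025, 1, 10), "example pricing page", 3.00, 15.00))
    history.record(PriceSnapshot(date(2025, 6, 1), "example pricing page", 2.50, 12.00))
    for snap in history.timeline():
        print(snap.recorded_on, snap.input_per_million, snap.output_per_million)
    print("current:", history.current())
```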