June 2026

How LLM pricing works

Every provider charges for the same thing — tokens in, tokens out — but the bill has more dimensions than the sticker price suggests.

Tokens: the unit of billing

LLMs don't read words — they read tokens, fragments of text that the model's tokeniser has split your input into. Most English words are one token. Longer or rarer words get split into pieces. Punctuation and whitespace are tokens too.

A real sentence tokenised:

The | cost | of | an | API | call | depends | on | token | isation

The sentence "The cost of an API call depends on tokenisation" becomes 10 tokens. Notice "tokenisation" splits into two: the model knows "token" and "isation" separately. Roughly, 1 million tokens ≈ 750,000 words — about 8–10 novels.
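The words-to-tokens ratio above gives a quick way to size a prompt before running a real tokeniser. A rough sketch, assuming English prose and the 0.75 words-per-token rule of thumb; `estimate_tokens` is a hypothetical helper, and only the provider's own tokeniser gives an exact count.

```python
WORDS_PER_TOKEN = 0.75  # rule of thumb from the text: 1M tokens is roughly 750,000 words


def estimate_tokens(text: str) -> int:
    """Rough token estimate from a whitespace word count (English prose only)."""
    words = len(text.split())
    return round(words / WORDS_PER_TOKEN)


sentence = "The cost of an API call depends on tokenisation"
estimate = estimate_tokens(sentence)  # 9 words / 0.75 = 12
print(f"estimated tokens: {estimate}")
```

The estimate is 12 against an actual count of 10, which shows the heuristic is suited to budgeting, not to reconciling a bill.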

Prices are quoted per million tokens (per 1M). When you see $3.00 on a pricing page, that's $3.00 per million tokens — or $0.000003 per token. Tiny per call, but it compounds fast at volume.
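The per-million convention is easy to turn into a helper. A minimal sketch: the $3.00 rate comes from the paragraph above, while the 2,000-token call and the 50,000 calls per day are made-up figures chosen only to show how the cost compounds.

```python
TOKENS_PER_MILLION = 1_000_000


def cost_usd(tokens: int, price_per_million: float) -> float:
    """Cost in USD of `tokens` tokens at a price quoted per 1M tokens."""
    return tokens * price_per_million / TOKENS_PER_MILLION


per_token = cost_usd(1, 3.00)              # one token at $3.00 per 1M
one_call = cost_usd(2_000, 3.00)           # hypothetical 2,000-token call
daily = cost_usd(2_000 * 50_000, 3.00)     # hypothetical 50,000 such calls per day

print(f"per token: ${per_token:.6f}")
print(f"per call:  ${one_call:.4f}")
print(f"per day:   ${daily:,.2f}")
```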

Input and output: two separate meters

Every API call has two billable components, priced separately:

You send input tokens → Model processes → You receive output tokens

Input is everything you send: your system prompt, user message, conversation history, retrieved documents, images. Output is everything the model generates in response.

Output almost always costs more, typically 2–5× the input price and sometimes higher. Why? The model reads input tokens in parallel (fast, efficient), but generates output tokens one at a time, each requiring a full forward pass through the model. Generation is computationally harder than comprehension.

For example, GPT-5 charges $1.25 input / $10.00 output per 1M tokens, so output is 8× the input price, above the typical range.

Where your money actually goes

Most workloads are input-heavy. A RAG query might send 4,000 tokens of context and get back 200. A classification call sends a document and gets back one word. The ratio matters because it determines which price to optimise:

Input tokens: 80% · Output tokens: 20%

A typical chat or agent run is roughly 3:1 input-to-output, and RAG or classification skew far higher. That's why the table sorts on input price by default: for most workloads it's the dimension that dominates the bill. Sort by output when your mix is output-heavy (reasoning, code generation).
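Whether input or output dominates in dollar terms depends on both the token mix and the price gap. The sketch below uses the 4,000-in / 200-out RAG query from this section; the $3.00 / $15.00 prices are illustrative assumptions, and both function names are hypothetical.

```python
def input_share(input_tokens: int, output_tokens: int,
                input_price: float, output_price: float) -> float:
    """Fraction of a call's cost that comes from input tokens (prices per 1M)."""
    input_cost = input_tokens * input_price
    output_cost = output_tokens * output_price
    return input_cost / (input_cost + output_cost)


def break_even_ratio(input_price: float, output_price: float) -> float:
    """Input:output token ratio at which both meters cost the same."""
    return output_price / input_price


rag_share = input_share(4_000, 200, 3.00, 15.00)   # 12,000 vs 3,000 price-weighted
ratio = break_even_ratio(3.00, 15.00)

print(f"input share of RAG bill: {rag_share:.0%}")
print(f"break-even input:output ratio: {ratio:.1f}")
```

At these assumed prices, a mix above 5:1 input-to-output is dominated by the input price, and a mix below it is dominated by the output price.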

The five pricing dimensions

The sticker price — input and output — is where most people stop. But the real bill has up to five dimensions:

Standard input

Your prompt, context, and conversation history. The base price you see on every pricing page.

Typical: $0.10 – $10 / 1M tokens

Standard output

The model's response. Typically 2–5× input price because each token is generated sequentially.

Typical: $0.40 – $30 / 1M tokens

Cached input

Repeated prefixes (system prompts, docs) served from cache. 2–10× cheaper than standard input.

Typical: 50–90% off standard input

Batch

Non-real-time jobs processed in bulk. Usually 50% off both input and output — if you can wait hours.

Typical: 50% off standard prices

Thinking tokens

Reasoning models (o3, Claude with extended thinking) generate internal chain-of-thought tokens billed as output — often many times longer than the visible response. A 200-token answer might have 2,000 thinking tokens behind it, and you pay for all of them at output price.

Billed at output token rate
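The five dimensions can be combined into one bill. This is a sketch under stated assumptions, not any provider's billing logic: the prices and token counts are illustrative, the `Prices` and `Usage` types are hypothetical, and the batch discount is assumed to apply to the whole bill, cached tokens included, which may not hold for every provider.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Prices:
    """USD per 1M tokens. Illustrative figures, not a real price list."""
    input: float
    output: float
    cached_input: float
    batch_discount: float = 0.5  # fraction off standard prices for batch jobs


@dataclass(frozen=True)
class Usage:
    uncached_input: int
    cached_input: int
    visible_output: int
    thinking: int = 0


def bill_usd(p: Prices, u: Usage, batch: bool = False) -> float:
    # Thinking tokens are billed on the output meter.
    billed_output = u.visible_output + u.thinking
    total = (
        u.uncached_input * p.input
        + u.cached_input * p.cached_input
        + billed_output * p.output
    ) / 1_000_000
    # Assumption: the batch discount applies to the whole bill.
    return total * (1 - p.batch_discount) if batch else total


prices = Prices(input=3.00, output=15.00, cached_input=0.30)
usage = Usage(uncached_input=1_000, cached_input=3_000,
              visible_output=200, thinking=2_000)

realtime = bill_usd(prices, usage)
batched = bill_usd(prices, usage, batch=True)
print(f"real-time: ${realtime:.4f}  batch: ${batched:.5f}")
```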

Cached input: the discount worth knowing

If you send the same prefix on every request — a system prompt, a set of retrieved documents, a few-shot example block — the provider can serve those tokens from a KV cache instead of reprocessing them. The discount is steep:

Anthropic: 90% off · OpenAI: 50% off · Google: 75% off

The catch: the mechanism varies by provider. Anthropic and OpenAI use prefix-based caching — the matching starts from the beginning of your prompt and extends continuously, so a changed system prompt means a cache miss. Google uses explicit caching — you designate content to cache via the API. Either way, the principle is the same: design your prompts with stable, repeating content front and centre.

For agent loops — where the same growing conversation gets resent on every step — cached input is often the largest line item on the bill. Two frontier models with the same headline price can differ sharply once caching enters the maths.
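A small simulation shows why caching matters so much in agent loops. It assumes every step resends the whole conversation, the previous step's input is always a cache hit, and there is no cache-write surcharge or expiry. The step count, token sizes, and prices are made up for illustration.

```python
def agent_loop_input_cost(steps: int, base_tokens: int, tokens_added_per_step: int,
                          input_price: float, cached_price: float) -> tuple[float, float]:
    """Return (input cost without caching, input cost with prefix caching) in USD."""
    no_cache = 0.0
    with_cache = 0.0
    prefix = 0             # tokens the provider already saw on the previous step
    context = base_tokens  # tokens sent on the current step
    for _ in range(steps):
        no_cache += context * input_price
        new_tokens = context - prefix
        with_cache += prefix * cached_price + new_tokens * input_price
        prefix = context
        context += tokens_added_per_step
    return no_cache / 1_000_000, with_cache / 1_000_000


# Hypothetical run: 2,000-token prompt, 10 steps, 500 new tokens per step.
no_cache, with_cache = agent_loop_input_cost(
    steps=10, base_tokens=2_000, tokens_added_per_step=500,
    input_price=3.00, cached_price=0.30,
)
saving = 1 - with_cache / no_cache
print(f"no cache: ${no_cache:.4f}  with cache: ${with_cache:.4f}  saving: {saving:.0%}")
```

Because the resent context grows on every step while the new tokens per step stay constant, the saving grows with the length of the loop.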

Worked example: a RAG query

A customer asks a question. Your app retrieves three documents and sends them to the model. The model answers in a paragraph. The figures below assume a model priced at $3.00 input / $15.00 output per 1M tokens.

Cost breakdown: single RAG query

System prompt: 500 tokens
3 retrieved docs: 3,000 tokens
User question: 50 tokens
Total input: 3,550 tokens
Model response: 250 tokens (output)

Input cost: 3,550 × $3.00 / 1M = $0.01065
Output cost: 250 × $15.00 / 1M = $0.00375
Total per query: $0.0144

At 10,000 queries/day → ~$144/day · $4,320/month

Now with cached input — if the system prompt (500 tokens) hits cache at 90% off:

Same query, with prompt caching

Cached input: 500 × $0.30 / 1M = $0.00015
Uncached input: 3,050 × $3.00 / 1M = $0.00915
Output: 250 × $15.00 / 1M = $0.00375
Total per query: $0.01305

Saving: about 9% on this query ($0.00135 of $0.0144). The savings scale with how much of your input is cacheable, so agent loops with long, repeating contexts save far more.
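The worked example can be reproduced in a few lines, which makes it easy to vary how much of the input is cacheable. The token counts and prices are the ones used above. The last case, where the retrieved documents also hit the cache, is a hypothetical extension, and `query_cost` is an illustrative helper.

```python
INPUT_PRICE = 3.00      # USD per 1M input tokens (the example's assumed rate)
OUTPUT_PRICE = 15.00    # USD per 1M output tokens
CACHE_DISCOUNT = 0.90   # cached input at 90% off

TOTAL_INPUT = 500 + 3_000 + 50   # system prompt + retrieved docs + question
OUTPUT_TOKENS = 250


def query_cost(cached_tokens: int) -> float:
    """Cost in USD of one query when `cached_tokens` of the input hit the cache."""
    cached_price = INPUT_PRICE * (1 - CACHE_DISCOUNT)
    uncached_tokens = TOTAL_INPUT - cached_tokens
    return (
        cached_tokens * cached_price
        + uncached_tokens * INPUT_PRICE
        + OUTPUT_TOKENS * OUTPUT_PRICE
    ) / 1_000_000


baseline = query_cost(0)
prompt_cached = query_cost(500)
saving = 1 - prompt_cached / baseline
monthly = baseline * 10_000 * 30        # 10,000 queries/day over 30 days
docs_cached_too = query_cost(3_500)     # hypothetical: docs hit the cache as well

print(f"baseline ${baseline:.5f}, cached prompt ${prompt_cached:.5f}, saving {saving:.1%}")
print(f"monthly at baseline: ${monthly:,.0f}")
print(f"prompt and docs cached: ${docs_cached_too:.5f}")
```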

Thinking tokens: the invisible multiplier

Reasoning models — OpenAI's o-series, Anthropic's Claude in extended thinking mode, Google's Gemini thinking — don't just produce a response. They first generate an internal chain of thought: planning, reasoning, self-correcting. These thinking tokens are billed as output, at output price.

The multiplier is often large:

Reasoning model: visible vs actual output

Visible response: 200 tokens
Thinking (hidden): 3,800 tokens
Billed output: 4,000 tokens

Your "200-token answer" actually consumed 4,000 output tokens — 20× what the visible response suggests.

Reasoning model pricing can look similar to standard models on paper. The per-token rate may be comparable, but the token count is not. Check whether a model uses thinking tokens before estimating cost — pricing pages don't always make this prominent.
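The gap between visible and billed output is simple to quantify once the thinking-token count is known. A sketch using the 200 / 3,800 split from this section; the $15.00 output price is an assumed figure and the helper names are hypothetical.

```python
def billed_output(visible_tokens: int, thinking_tokens: int) -> int:
    """Output tokens on the bill: the visible response plus hidden thinking."""
    return visible_tokens + thinking_tokens


def output_cost_usd(visible_tokens: int, thinking_tokens: int,
                    output_price: float) -> float:
    """Output cost in USD at a price quoted per 1M output tokens."""
    return billed_output(visible_tokens, thinking_tokens) * output_price / 1_000_000


visible, thinking = 200, 3_800
billed = billed_output(visible, thinking)
multiplier = billed / visible

naive = output_cost_usd(visible, 0, 15.00)           # estimate from visible text only
actual = output_cost_usd(visible, thinking, 15.00)   # what the output meter records

print(f"billed output: {billed} tokens ({multiplier:.0f}x visible)")
print(f"naive estimate ${naive:.4f} vs actual ${actual:.4f}")
```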

How prices are shown here

Every model on the table shows normalised pricing in USD per 1M tokens — input, output, and cached where available. It sorts on input price by default, since most workloads are input-heavy; sort by output or cached input when your mix differs.

When a provider changes their prices, we don't overwrite — we add a new snapshot. Every price point on a model's detail page carries the date it was recorded and where it came from, and the market-events timeline tracks the launches and price moves across the market.
