Reasoning & "thinking" tokens
Reasoning models think before they answer, and that thinking bills at the output rate. It's usually the biggest line on the invoice and the easiest one to misjudge.
A standard model reads your prompt and writes a reply. A reasoning model adds a step in between: before it answers, it generates an internal chain of thought — planning, working through cases, checking itself. Those are thinking tokens, and they're billed as output, at the output rate, exactly like the visible reply. The difference is you mostly don't see them. The answer is 200 tokens; the thinking behind it might be 4,000.
So the meter runs on tokens you never read. That's what makes reasoning cost hard to estimate from a pricing page: the per-token rate looks ordinary, but the token count can be many times what the visible output suggests.
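As a rough illustration of the gap between what you read and what you pay for, here is a minimal sketch using the 200-token answer and 4,000 thinking tokens from the example above. The function names are made up for this illustration.

```python
def billed_output_tokens(visible_tokens: int, thinking_tokens: int) -> int:
    """Output tokens on the invoice: the visible reply plus the hidden thinking."""
    return visible_tokens + thinking_tokens


def billed_to_visible_ratio(visible_tokens: int, thinking_tokens: int) -> float:
    """How many output tokens are billed for each token the reader actually sees."""
    return billed_output_tokens(visible_tokens, thinking_tokens) / visible_tokens


# The example from the text: a 200-token answer with 4,000 tokens of thinking.
print(billed_output_tokens(200, 4_000))     # 4200
print(billed_to_visible_ratio(200, 4_000))  # 21.0
```

In that example the invoice counts 21 output tokens for every one that appears in the reply.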
Effort is a volume dial, not a price dial
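Every provider gives you a way to turn reasoning up or down. The exact control varies, but it comes in three shapes: an on/off switch, a named effort level such as low, medium, or high, and a token budget that caps how much the model may think.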
None of these change the price per token. They change how many output tokens the model spends, and those tokens bill at the output rate; there is no separate reasoning rate. To the invoice, a thinking token is an output token like any other. Effort is a volume control on the most expensive meter there is.
Output costs 4.0× input, on average
Output usually costs more than input, but not always. Across 283 models the multiple ranges from 0.1× to 12.2×, so at the low end output is actually cheaper than input.
Live from the index — thinking bills at the output rate, so this spread is the reasoning tax.
What the dial costs
Take the cheapest frontier model on the index today, Llama 4 Maverick, at $0.60 output per 1M tokens. Hold the visible answer at 300 tokens and move only the thinking, and the output cost of a call is (300 + thinking tokens) × $0.60 / 1,000,000.
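The arithmetic can be sketched in a few lines. The $0.60 rate and the 300-token answer come from the text; the four thinking-token counts and their labels are hypothetical, chosen only to show the shape of the curve, and are not measurements of any model.

```python
OUTPUT_RATE_PER_M = 0.60   # USD per 1M output tokens, the rate quoted in the text
VISIBLE_TOKENS = 300       # visible answer held fixed, as in the text

# Hypothetical thinking-token counts, for illustration only.
THINKING_LEVELS = {"off": 0, "low": 1_000, "medium": 4_000, "high": 16_000}


def output_cost_usd(thinking_tokens: int) -> float:
    """Output cost of one call: thinking and visible tokens share one rate."""
    return (VISIBLE_TOKENS + thinking_tokens) * OUTPUT_RATE_PER_M / 1_000_000


baseline = output_cost_usd(0)
for label, thinking in THINKING_LEVELS.items():
    cost = output_cost_usd(thinking)
    print(
        f"{label:<7}{thinking:>7,} thinking  "
        f"${cost * 1_000:.2f} per 1,000 calls  "
        f"{cost / baseline:.1f}x"
    )
```

Under these assumed counts, 1,000 calls cost about $0.18 with thinking off and about $9.78 at the highest setting. The rate never changes; only the token count does.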
Why it's often the biggest line
Output is already the expensive meter. It runs several times the input rate because the model generates each token sequentially. Reasoning multiplies that meter specifically, so a reasoning-heavy call can spend more on thinking the reader never sees than on its entire input and visible answer combined.
The effect is largest in agent loops. An agent makes many calls per task, and when each step reasons, the thinking tax lands on every step, on top of a transcript that grows each turn. The per-step reasoning and the lengthening transcript compound, so the cost of a reasoning agent climbs faster than either effect would on its own.
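A simplified model of an agent loop makes the two effects visible. Everything here is an assumption made for illustration: the rates, the token counts, and the rule that only the visible reply is appended to the transcript. Some APIs also carry reasoning forward between turns, which would make the growth steeper.

```python
# All figures below are hypothetical; substitute your own rates and token counts.
INPUT_RATE_PER_M = 0.15    # USD per 1M input tokens (assumed)
OUTPUT_RATE_PER_M = 0.60   # USD per 1M output tokens (assumed)


def agent_task_cost(steps: int, thinking_per_step: int,
                    visible_per_step: int = 300, prompt_tokens: int = 2_000) -> float:
    """Cost of one multi-step task where each step re-reads the whole transcript."""
    transcript = prompt_tokens
    total = 0.0
    for _ in range(steps):
        input_cost = transcript * INPUT_RATE_PER_M / 1_000_000
        output_cost = (thinking_per_step + visible_per_step) * OUTPUT_RATE_PER_M / 1_000_000
        total += input_cost + output_cost
        transcript += visible_per_step   # the next step reads a longer transcript
    return total


for steps in (1, 10, 20):
    plain = agent_task_cost(steps, thinking_per_step=0)
    reasoning = agent_task_cost(steps, thinking_per_step=4_000)
    print(f"{steps:>2} steps  no thinking ${plain:.4f}  with thinking ${reasoning:.4f}")
```

Running it with your own step counts and token figures shows how much of a task's cost comes from per-step thinking and how much from the transcript being re-read.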
There's no fixed multiplier per model
How much more high effort costs on a given model depends on the task, not on the model alone. The same model might spend twice its baseline thinking on a simple lookup and eight times as much on a hard proof. Effort sets a ceiling and a tendency, and the prompt decides where within that range a given call lands.
That's why this site prices what it can verify, the published per-token rates, and doesn't publish a per-model "effort multiplier." Any single number would be wrong for most workloads. The reliable way to know your multiplier is to measure it: run a representative sample of your own prompts at each effort level and read the output-token counts back from the API's usage field. Once you have those counts, the cost per call is the measured output-token count times the published output rate.
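The measurement can be as simple as averaging logged token counts. This sketch assumes you have already recorded one entry per call; the record layout, the field names, and the sample numbers are hypothetical, and the usage field itself is named differently by each provider.

```python
from statistics import mean

# Hypothetical records: one entry per call, with the output-token count copied
# from the API's usage field. Field names differ by provider.
samples = [
    {"effort": "low", "output_tokens": 520},
    {"effort": "low", "output_tokens": 480},
    {"effort": "high", "output_tokens": 3_900},
    {"effort": "high", "output_tokens": 4_100},
]


def mean_output_tokens(records: list[dict], effort: str) -> float:
    """Average billed output tokens for the calls made at one effort level."""
    return mean(r["output_tokens"] for r in records if r["effort"] == effort)


def effort_multiplier(records: list[dict], effort: str, baseline: str = "low") -> float:
    """Ratio of mean output tokens at `effort` to the baseline level."""
    return mean_output_tokens(records, effort) / mean_output_tokens(records, baseline)


def cost_per_call_usd(records: list[dict], effort: str, output_rate_per_m: float) -> float:
    """Measured mean output tokens times the published output rate."""
    return mean_output_tokens(records, effort) * output_rate_per_m / 1_000_000


print(effort_multiplier(samples, "high"))        # 8.0
print(cost_per_call_usd(samples, "high", 0.60))
```

Four records are far too few for a real estimate. A useful sample covers the range of prompts you actually send, since the ratio depends on the task.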
Turning it down
Reasoning effort is one of the largest levers on a bill, because thinking tokens routinely outnumber the visible answer by 10–20×. There are a few ways to bring it down:
- Turn it off for easy work. Classification, extraction, formatting, and routing rarely need deliberation. Run them with thinking disabled or at the minimum.
- Set effort per task, not per app. Reserve high effort for the calls where correctness pays for the tokens, and dial the rest down.
- Cap the budget. Where the provider takes a token budget, a ceiling stops a hard prompt from running the meter unbounded.
- Route the easy share. Send high-volume, low-difficulty traffic to a cheaper or non-reasoning model entirely.
The catch is the same one that applies to every cost lever: hard tasks genuinely benefit from reasoning, and turning it down too far costs accuracy on exactly the problems you reached for a reasoning model to solve. The decision is per-task, and only your eval settles it.
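The per-task approach in the list above might look like this in a request planner. The task types, effort labels, model names, and budgets are all hypothetical placeholders, not any provider's parameters; the right mapping is whatever your own eval supports.

```python
# Hypothetical task types, effort labels, and model names, for illustration only.
EFFORT_BY_TASK = {
    "classification": "off",
    "extraction": "off",
    "formatting": "off",
    "routing": "off",
    "code_review": "medium",
    "proof": "high",
}

# Hypothetical ceilings on thinking tokens for each effort label.
THINKING_BUDGET = {"off": 0, "medium": 4_000, "high": 16_000}


def plan_call(task_type: str) -> dict:
    """Pick a model, effort level, and thinking cap for one request."""
    effort = EFFORT_BY_TASK.get(task_type, "medium")  # unknown tasks get a middle setting
    model = "small-non-reasoning-model" if effort == "off" else "reasoning-model"
    return {
        "model": model,
        "effort": effort,
        "max_thinking_tokens": THINKING_BUDGET[effort],
    }


print(plan_call("extraction"))
print(plan_call("proof"))
```

Keeping the mapping in one table makes it easy to change a single task's setting and rerun the eval, rather than tuning one global effort level for the whole app.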