Which model should you actually use?
The short version
Staring at a model dropdown in Claude Code, Codex, Cursor or a chat app? You don't need token maths to make a good pick — almost every lineup is really just three tiers, and the names change but the trade-off doesn't:
Fast and cheap. Haiku, GPT-mini, Flash. Good for quick edits, search, simple Q&A and high-volume routine work, and a reasonable default for more tasks than it is usually given.
The all-rounder. Sonnet, GPT standard, Gemini Pro. The sweet spot for everyday coding, writing and answering questions over documents — good enough for most things without frontier prices.
The capable, slower, costly one. Fable, Opus, GPT's reasoning models. Worth it for hard, multi-step work — agentic coding, difficult debugging, deep analysis — where being a little better each step compounds. Often 10× the cost of a small model, and more than most everyday tasks need.
Rule of thumb: start one tier lower than feels comfortable, and only move up when the cheaper model actually falls short. The rest of this page is the longer version — task by task — and the price table shows what each tier costs.
A common and costly pattern is using a frontier model for everything. It's understandable — "the good one" feels like the safe default — but the price spread between tiers is large: within Anthropic's own lineup, Claude Fable 5 costs 10× what Claude Haiku 4.5 does on every dimension, and across providers the gap between the priciest frontier model and the cheapest small one is more than 300×. A tier mismatch isn't a rounding error; on a high-volume task it can dominate the bill.
A simple rule works well: use the cheapest model that passes your eval, and route up when it fails. With no eval yet, the rule can run in reverse — prototype on a frontier model to find the quality ceiling, then walk down tiers until something breaks. The eval is the arbiter, not this page and not a leaderboard. As a starting point, here's a guide to the five most common task shapes: what tier is usually enough, and which price dimension tends to dominate the cost — often not the one on the sticker.
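The "cheapest model that passes, route up when it fails" rule can be sketched as a small cascade. This is a minimal illustration, not a reference implementation: `cheapest_passing`, the stub models and the `non_empty` check are all hypothetical names, and a real eval would score answers against labelled examples rather than just checking they are non-empty.

```python
from typing import Callable, Optional, Sequence, Tuple

Model = Callable[[str], str]


def cheapest_passing(
    prompt: str,
    tiers: Sequence[Tuple[str, Model]],
    passes_eval: Callable[[str, str], bool],
) -> Tuple[Optional[str], Optional[str]]:
    """Try tiers in the order given (cheapest first) and return the first
    (tier_name, answer) whose answer passes the eval, or (None, None)."""
    for name, call_model in tiers:
        answer = call_model(prompt)
        if passes_eval(prompt, answer):
            return name, answer
    return None, None


# Stand-in models for illustration only: the "small" stub gives up on long prompts.
def small_stub(prompt: str) -> str:
    return prompt.upper() if len(prompt) <= 20 else ""


def mid_stub(prompt: str) -> str:
    return prompt.upper()


# Hypothetical eval: accept any non-empty answer.
def non_empty(prompt: str, answer: str) -> bool:
    return bool(answer)


TIERS = [("small", small_stub), ("mid", mid_stub)]

easy = cheapest_passing("short task", TIERS, non_empty)
hard = cheapest_passing("a much longer and harder task", TIERS, non_empty)
print(easy[0], hard[0])  # prints: small mid
```

Note that a failed attempt on a lower tier is still paid for, so a cascade like this only saves money when the cheap tier passes most of the time.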
Classification and routing
tier: small · cost driver: input price
Labelling tickets, moderating content, routing requests — a paragraph or a page in, a token or two out. This is the task small models are built for, and providers themselves position their smallest models for classification and routing. (A large label taxonomy or subtle multi-label calls can push the choice up a tier — that's what the eval is for.) Because output is near-zero, input price is effectively the entire bill, and small-tier input starts around $0.10 per million tokens. A frontier model here adds cost without adding much that the task needs.
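To see why input price is effectively the whole bill, here is a rough calculation. The token counts and the $0.40 output price are assumptions chosen for illustration; only the $0.10 input figure comes from the text above.

```python
def input_share_of_bill(
    input_tokens: int,
    output_tokens: int,
    input_price_per_m: float,
    output_price_per_m: float,
) -> float:
    """Fraction of one request's cost that comes from input tokens."""
    input_cost = input_tokens * input_price_per_m / 1_000_000
    output_cost = output_tokens * output_price_per_m / 1_000_000
    return input_cost / (input_cost + output_cost)


# Assumed shape: a 500-token ticket in, a 2-token label out.
share = input_share_of_bill(500, 2, input_price_per_m=0.10, output_price_per_m=0.40)
print(f"{share:.1%}")  # prints: 98.4%
```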
Data extraction to JSON
tier: small–mid · cost driver: input price
Pulling structured fields out of messy text. Small models handle typical schemas; stepping up to mid helps when the source is adversarial or a missed field is expensive. Deep nesting is a separate problem — it trips up every tier, so it's usually better to simplify the schema than to upgrade the model. The metric that matters here is cost per accepted document, not cost per request: a cheap model that fails twice and needs human review isn't really cheap. Output is short and mechanical, so input price dominates again — and extraction is a classic candidate for batch pricing, which halves the bill at most providers when results can wait minutes instead of seconds.
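A minimal sketch of the cost-per-accepted-document idea, under a simplifying assumption: every document goes through the model once, and any output that fails validation is sent to a human reviewer who always fixes it. All prices and acceptance rates below are made up for illustration.

```python
def cost_per_accepted_document(
    model_cost_per_doc: float,
    acceptance_rate: float,
    review_cost_per_doc: float,
) -> float:
    """Average cost of ending up with one accepted document, assuming
    rejected outputs go to human review exactly once."""
    if not 0.0 <= acceptance_rate <= 1.0:
        raise ValueError("acceptance_rate must be between 0 and 1")
    return model_cost_per_doc + (1.0 - acceptance_rate) * review_cost_per_doc


# Hypothetical numbers: the small model is 5x cheaper per call but fails 10x as often.
small = cost_per_accepted_document(0.001, acceptance_rate=0.90, review_cost_per_doc=0.50)
mid = cost_per_accepted_document(0.005, acceptance_rate=0.99, review_cost_per_doc=0.50)
print(f"small: ${small:.3f}  mid: ${mid:.3f}")
```

With these assumed figures the mid-tier model comes out cheaper per accepted document, because the review cost of failures outweighs the per-call saving. With a higher acceptance rate or cheaper review, the comparison flips back.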
RAG answering
tier: mid · cost driver: input price, then cached input
Answering questions over retrieved documents. The retrieval does the hard part; the model mostly needs to read carefully and decline to invent. Mid-tier models are reliably good enough here, and the economics are lopsided: thousands of tokens of context go in to get a paragraph back, so the spend is overwhelmingly on input tokens. If the system prompt and document set repeat across requests, cached-input pricing — typically 4–10× cheaper than regular input — can change which model comes out cheapest.
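The effect of cached-input pricing on a single RAG request can be sketched as follows. The two models and all their prices are hypothetical; they are picked so that one has a 4× cache discount and the other a 10× discount, matching the range mentioned above. Cache-write surcharges and cache expiry, which some providers apply, are ignored here.

```python
def request_cost(
    input_tokens: int,
    cached_tokens: int,
    output_tokens: int,
    input_price: float,
    cached_price: float,
    output_price: float,
) -> float:
    """Cost in dollars of one request; prices are per million tokens.
    cached_tokens is the part of input_tokens served from the prompt cache."""
    fresh_tokens = input_tokens - cached_tokens
    total = (
        fresh_tokens * input_price
        + cached_tokens * cached_price
        + output_tokens * output_price
    )
    return total / 1_000_000


# Assumed shape: 8,000 tokens in (6,000 of them a repeated prompt and document set), 300 out.
model_a = request_cost(8_000, 6_000, 300, input_price=1.00, cached_price=0.25, output_price=4.00)
model_b = request_cost(8_000, 6_000, 300, input_price=1.25, cached_price=0.125, output_price=4.00)
print(f"A: ${model_a:.5f}  B: ${model_b:.5f}")
```

In this made-up comparison, model B has the higher sticker input price but the lower cost per request, because three quarters of the input is billed at its deeper cache discount.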
Long-document summarisation
tier: small–mid · cost driver: input price
The most input-heavy shape there is: a hundred pages in, one out, so the price gap between tiers is multiplied by every page fed in. Cheap long-context models have made routine summarisation a small-tier task; mid or above is worth paying for when missing a clause in the middle of page sixty actually carries a cost, because "lost in the middle" is reduced, not solved. And because summarisation is rarely latency-sensitive, it's another natural batch-pricing candidate.
Agentic coding
tier: frontier · cost driver: cached input, then output
This is the main exception. Multi-step agents that read a codebase, plan, edit, and re-check compound their errors: if per-step quality simply multiplies, a model that's 95% as good per step is only about 21% as good after thirty steps (0.95^30 ≈ 0.21). This is the task where frontier prices tend to pay off. (For the main loop, at least: an emerging pattern is a frontier orchestrator delegating search and exploration to small-tier subagents.) One detail is easy to overlook: agents re-send the same growing context on every step, so most of the spend goes on cache reads, and cached-input price matters more than headline input price. Two frontier models with the same sticker price can differ sharply once caching enters the maths, so it's worth checking the cached column on the table, not just the first one.
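A rough model of where an agent run's spend goes. It assumes everything sent on an earlier step is read back from cache on the next one, and that only newly added tokens are billed at the full input price. The token counts and both price lists are hypothetical, and cache-write surcharges and cache expiry are ignored.

```python
from typing import Dict


def agent_run_cost(
    steps: int,
    base_context: int,
    added_per_step: int,
    output_per_step: int,
    input_price: float,
    cached_price: float,
    output_price: float,
) -> Dict[str, float]:
    """Dollar cost breakdown of one agent run; prices are per million tokens."""
    fresh_tokens = 0
    cached_tokens = 0
    context = base_context
    for step in range(steps):
        if step == 0:
            fresh_tokens += context  # nothing is cached on the first step
        else:
            cached_tokens += context - added_per_step  # re-sent, read from cache
            fresh_tokens += added_per_step  # new since the previous step
        context += added_per_step  # the context grows before the next step
    costs = {
        "fresh_input": fresh_tokens * input_price / 1_000_000,
        "cache_reads": cached_tokens * cached_price / 1_000_000,
        "output": steps * output_per_step * output_price / 1_000_000,
    }
    costs["total"] = sum(costs.values())
    return costs


# Two hypothetical frontier models: same input and output prices, different cached price.
run = dict(steps=30, base_context=30_000, added_per_step=2_000, output_per_step=500)
model_x = agent_run_cost(**run, input_price=5.00, cached_price=0.50, output_price=25.00)
model_y = agent_run_cost(**run, input_price=5.00, cached_price=1.25, output_price=25.00)
print(model_x)
print(model_y)
```

Under these assumptions cache reads are the largest single line item for both models, and the model with the weaker cache discount costs noticeably more per run despite an identical sticker price.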
Caveats
These are dated judgments, not benchmarks — they say "good enough for typical schemas," never "scores 87.2," and when the market moves (a mid model gets frontier-grade at coding, say) these verdicts get revised, visibly. The tier floor is also only half the decision: a "small" verdict still leaves more than a 10× spread between the cheapest and priciest small models, which is what the table is for.
And the other half — "fine, mid tier for RAG, but what does that cost at my volume?" — is the question this site exists to answer next. A cost calculator is on the way: pick a task shape like the ones above, set your volume, and get a ranked monthly bill across every model tracked here, at live prices, shareable as a link. Until then: start cheap, write an eval, and only pay frontier prices where the task actually earns them.