Best starting models for a RAG support bot, priced per call.
Retrieval does the hard part. The model reads a few fetched passages and answers without inventing, so most of the work is careful reading, not reasoning.
That makes the shape input-heavy: thousands of context tokens go in to get a short paragraph back. The retrieved context is the meter, and it runs on every query.
- Retrieved context dominates the input, on every query.
- Output is short, so the input rate sets the bill.
- Retrieval precision moves cost more than any model swap.
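The second bullet can be checked with a little arithmetic. The sketch below is illustrative only: it takes the generate-answer shape listed in the pipeline (500 sys + 4.6K in, 250 out) and assumes placeholder rates where output is billed at four times the input rate. Those rates are not real prices, and the function name `input_cost_share` is hypothetical.

```python
def input_cost_share(input_tokens: int, output_tokens: int,
                     in_rate: float, out_rate: float) -> float:
    """Return the fraction of one call's cost that comes from input tokens.

    Rates can be in any consistent unit (for example, price per million
    tokens); only their ratio affects the share.
    """
    input_cost = input_tokens * in_rate
    output_cost = output_tokens * out_rate
    return input_cost / (input_cost + output_cost)


# Generate-answer shape from the pipeline: 500 sys + 4.6K in, 250 out.
INPUT_TOKENS = 500 + 4_600
OUTPUT_TOKENS = 250

# Placeholder rates, NOT real prices: output billed at 4x the input rate.
share = input_cost_share(INPUT_TOKENS, OUTPUT_TOKENS, in_rate=1.0, out_rate=4.0)
print(f"input share of cost: {share:.0%}")
```

Under these assumed rates, input still accounts for roughly five sixths of the call's cost even though each output token costs four times as much. Swap in the rates from your own price table to see the share for a specific model.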
The pipeline
A feature is a chain of calls, each with a different job. Steps run top to bottom.
- 01 embed query
  turn the question into a vector to search the index
  per-call shape: 1 sys + 50 in + 1 out
  Not priced here. This runs on a separate embeddings endpoint with its own rates, which the price table doesn't carry.
  GPT-4.1 Nano, Mistral Small 4
- 02 retrieve / rerank
  score and order candidate passages so only the best go in
  per-call shape: 200 sys + 2K in + 30 out
- 03 generate answer
  read the retrieved context and answer without inventing
  per-call shape: 500 sys + 4.6K in + 250 out
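As a rough sketch, the priced steps above can be written down as data and compared by token volume. This assumes "K" means 1,000 tokens. The `Step` class and `heaviest_step` helper are hypothetical names for illustration, and the embed step is left out because it bills on a separate embeddings endpoint.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    name: str
    system_tokens: int
    input_tokens: int
    output_tokens: int

    @property
    def tokens_in(self) -> int:
        """Everything the model reads: system prompt plus per-call input."""
        return self.system_tokens + self.input_tokens

    @property
    def tokens_total(self) -> int:
        """All tokens billed on one call of this step."""
        return self.tokens_in + self.output_tokens


# Priced steps only, with shapes copied from the pipeline (assuming K = 1,000).
PIPELINE = [
    Step("retrieve / rerank", system_tokens=200, input_tokens=2_000, output_tokens=30),
    Step("generate answer", system_tokens=500, input_tokens=4_600, output_tokens=250),
]


def heaviest_step(steps: list[Step]) -> Step:
    """Return the step that moves the most tokens per call."""
    return max(steps, key=lambda s: s.tokens_total)


for step in PIPELINE:
    ratio = step.tokens_in / step.output_tokens
    print(f"{step.name}: {step.tokens_in} in, {step.output_tokens} out ({ratio:.0f}:1)")

print("heaviest step:", heaviest_step(PIPELINE).name)
```

Token volume is only half of the bill. Each step may run on a model with different rates, so multiply these counts by the rates from your price table before comparing steps.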
How to choose for a RAG support bot
The chain is embed, retrieve, then generate answer. Almost the whole bill lands on the last step, where thousands of retrieved tokens go in for a short paragraph back, so generate answer is both the cost-driver step and the capable-model step.
Pick the cheapest model that answers your retrieved context faithfully and start there. Spend on retrieval precision before you spend on a bigger model: three relevant passages instead of ten cuts the input meter on every call, and a model swap rarely beats that. Move up a tier only when the answers miss something a human reading the same passages would catch.
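To make the three-versus-ten comparison concrete, here is a minimal sketch of how passage count drives the generate-answer input. The per-passage size (450 tokens) and question size (100 tokens) are hypothetical values, chosen only so that ten passages reproduce the 4.6K input shown in the pipeline. Real passage lengths depend on your chunking.

```python
def generate_input_tokens(passages: int,
                          tokens_per_passage: int = 450,   # hypothetical chunk size
                          question_tokens: int = 100,      # hypothetical question size
                          system_tokens: int = 500) -> int:
    """Tokens the generate-answer step reads for one query."""
    return system_tokens + question_tokens + passages * tokens_per_passage


before = generate_input_tokens(passages=10)
after = generate_input_tokens(passages=3)
saving = 1 - after / before

print(f"10 passages: {before} input tokens")
print(f" 3 passages: {after} input tokens")
print(f"input reduction: {saving:.0%}")
```

Under these assumptions the input drops by a little over 60 percent on every call. Since input dominates the cost, a cheaper model would need an input rate well under half the current one to match that saving, and it would still have to read the passages as faithfully.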
The takeaway
The cost-driver step and the capable-model step are the same one: generate answer. Spend there; keep the rest small.
No fabricated bills, no rankings.