Guide

Best starting models for a RAG support bot, priced per call.

Retrieval does the hard part. The model reads a few fetched passages and answers without inventing, so most of the work is careful reading, not reasoning.

That makes the shape input-heavy: thousands of context tokens go in to get a short paragraph back. The retrieved context is the meter, and it runs on every query.

Retrieved context dominates the input, on every query.
Output is short, so the input rate sets the bill.
Retrieval precision moves cost more than any model swap.
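As a rough check on the claim that the input rate sets the bill, the sketch below computes what fraction of one call's cost comes from input tokens. It uses the generate-answer shape listed later in this guide; the rates ($1.00 in, $5.00 out per 1M tokens) are assumptions chosen for illustration, and `input_share` is a hypothetical helper name.

```python
def input_share(input_tokens: int, output_tokens: int,
                in_rate: float, out_rate: float) -> float:
    """Return the fraction of one call's cost that comes from input tokens.

    Rates are USD per 1M tokens; the 1M divisor cancels in the ratio.
    """
    input_cost = input_tokens * in_rate
    output_cost = output_tokens * out_rate
    return input_cost / (input_cost + output_cost)


# Shape of the generate-answer step in this guide: 500 sys + 4.6K in + 250 out.
# ASSUMPTION: $1.00 in / $5.00 out per 1M tokens, for illustration only.
share = input_share(500 + 4_600, 250, in_rate=1.00, out_rate=5.00)
print(f"input share of the call: {share:.0%}")  # prints "input share of the call: 80%"
```

Even with output priced at five times the input rate in this assumed example, input still accounts for about four fifths of the call.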

The pipeline

A feature is a chain of calls, each with a different job. Steps run top to bottom.

01

embed query

turn the question into a vector to search the index

Small

per-call shape 1 sys + 50 in + 1 out

Not priced here. This step runs on a separate embeddings endpoint with its own rates, which the price table doesn't carry.

GPT-4.1 Nano, Mistral Small 4
02

retrieve / rerank

score and order candidate passages so only the best go in

Small

per-call shape 200 sys + 2K in + 30 out

cheap default: Claude Haiku 4.5 ≈ $0.0024 per call

step-up for quality: Gemini 3.5 Flash ≈ $0.0036 per call

open-weight option: Mistral Small 4 ≈ $0.0002 per call
See all small-tier models in the price table
03

generate answer

read the retrieved context and answer without inventing

Mid (cost-driver step, capable-model step)

per-call shape 500 sys + 4.6K in + 250 out

cheap default: Claude Haiku 4.5 ≈ $0.0063 per call

step-up for quality: Claude Sonnet 4.6 ≈ $0.019 per call

open-weight option: Llama 4 Maverick ≈ $0.0009 per call
See all mid-tier models in the price table
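The per-call figures above appear to follow from the per-call shape and a model's per-1M-token rates. A minimal sketch of that arithmetic is below. The `Shape` class and `cost_per_call` function are hypothetical names, and the two rate pairs are assumptions (not taken from the price table) that happen to land within rounding of the figures listed for the cheap default and the mid-tier step-up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Shape:
    """Token counts for one call, as written in a 'per-call shape' line."""
    sys_tokens: int
    in_tokens: int
    out_tokens: int


def cost_per_call(shape: Shape, in_rate: float, out_rate: float) -> float:
    """USD for one call. Rates are USD per 1M tokens.

    System and context tokens are both billed at the input rate.
    """
    input_tokens = shape.sys_tokens + shape.in_tokens
    return (input_tokens * in_rate + shape.out_tokens * out_rate) / 1_000_000


# Shapes copied from the pipeline steps above.
RERANK = Shape(sys_tokens=200, in_tokens=2_000, out_tokens=30)
GENERATE = Shape(sys_tokens=500, in_tokens=4_600, out_tokens=250)

# ASSUMED rates (in, out) in USD per 1M tokens; check the price table for real ones.
ASSUMED_SMALL = (1.00, 5.00)
ASSUMED_MID = (3.00, 15.00)

rerank_small = cost_per_call(RERANK, *ASSUMED_SMALL)
generate_small = cost_per_call(GENERATE, *ASSUMED_SMALL)
generate_mid = cost_per_call(GENERATE, *ASSUMED_MID)

for label, usd in [("rerank, small rates", rerank_small),
                   ("generate, small rates", generate_small),
                   ("generate, mid rates", generate_mid)]:
    print(f"{label}: ${usd:.5f} per call")
```

Swapping in another model's rates from the price table gives that model's per-call figure for the same shape.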

How to choose for a RAG support bot

The chain is embed query, retrieve / rerank, then generate answer. Most of the priced bill lands on the last step, where thousands of retrieved tokens go in for a short paragraph back: at the cheap defaults above, roughly $0.0063 of every $0.0087 per query. That makes generate answer both the cost-driver step and the capable-model step.
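To see how a single query's cost splits across the priced steps, the listed per-call figures can simply be summed. The sketch below uses only the cheap-default figures from the pipeline section; the embed step is left out because this guide does not price it.

```python
# Per-call figures for the cheap defaults, in USD, as listed in the pipeline above.
# The embed query step is omitted: it is not priced in this guide.
per_call = {
    "retrieve / rerank": 0.0024,
    "generate answer": 0.0063,
}

total = sum(per_call.values())
shares = {step: cost / total for step, cost in per_call.items()}

for step, fraction in shares.items():
    print(f"{step}: {fraction:.0%} of the priced cost per query")
```

The split shifts further toward generate answer when only that step moves up a tier.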

Pick the cheapest model that answers your retrieved context faithfully and start there. Spend on retrieval precision before you spend on a bigger model: three relevant passages instead of ten cuts the input meter on every call, and a model swap rarely beats that. Move up a tier only when the answers miss something a human reading the same passages would catch.
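The passage-count point can be worked through with the same per-call arithmetic. This is a sketch under stated assumptions: each passage is taken to be about 460 tokens (so ten passages match the 4.6K input in the generate-answer shape), the rates are an assumed $1.00 in and $5.00 out per 1M tokens, and the "half-price model" is purely hypothetical. `generate_cost` is an illustrative helper, not part of any library.

```python
def generate_cost(passages: int, tokens_per_passage: int = 460,
                  sys_tokens: int = 500, out_tokens: int = 250,
                  in_rate: float = 1.00, out_rate: float = 5.00) -> float:
    """USD for one generate-answer call. Rates are USD per 1M tokens.

    ASSUMPTIONS: 460 tokens per passage and the default rates are illustrative.
    """
    input_tokens = sys_tokens + passages * tokens_per_passage
    return (input_tokens * in_rate + out_tokens * out_rate) / 1_000_000


ten = generate_cost(passages=10)    # ten retrieved passages, same model
three = generate_cost(passages=3)   # three relevant passages, same model

# HYPOTHETICAL model at half the assumed rates, still fed ten passages.
half_price_model = generate_cost(passages=10, in_rate=0.50, out_rate=2.50)

saving = 1 - three / ten
print(f"ten passages:      ${ten:.5f}")
print(f"three passages:    ${three:.5f} ({saving:.0%} lower)")
print(f"half-price model:  ${half_price_model:.5f}")
```

Under these assumptions, trimming to three passages lowers the call cost slightly more than switching to a model at half the price would, and it does so without changing the model that reads the context.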

The takeaway

The cost-driver step and the capable-model step are the same one: generate answer. Spend there; keep the rest small.

No fabricated bills, no rankings.

Go deeper

Explainer: See the full cost breakdown. What this task costs and why, worked through line by line with live prices.

Price table: Every model, priced per 1M tokens. Sort and filter the full catalog the options above link into.
