Best starting models for a RAG support bot, priced per call.

Retrieval does the hard part. The model reads a few fetched passages and answers without inventing, so most of the work is careful reading, not reasoning.

That makes the shape input-heavy: thousands of context tokens go in to get a short paragraph back. The retrieved context is the meter, and it runs on every query.

  • Retrieved context dominates the input, on every query.
  • Output is short, so the input rate sets the bill.
  • Retrieval precision moves cost more than any model swap.
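
To see why the input rate sets the bill, here is a minimal sketch that splits one call's cost into its input and output parts. The token counts are the generate-answer shape from the pipeline section (500 sys + 4.6K in + 250 out). The two rates are placeholder assumptions and the function name is made up for illustration; read the real rates from the price table.

```python
# Assumed placeholder rates in USD per million tokens. These are NOT taken
# from the price table; substitute the real rates for your chosen model.
ASSUMED_IN_RATE = 1.00
ASSUMED_OUT_RATE = 5.00


def call_cost(sys_tokens, in_tokens, out_tokens, in_rate, out_rate):
    """Return (input_cost, output_cost) in USD for a single call."""
    input_cost = (sys_tokens + in_tokens) * in_rate / 1_000_000
    output_cost = out_tokens * out_rate / 1_000_000
    return input_cost, output_cost


# Generate-answer shape: 500 sys + 4.6K in + 250 out.
input_cost, output_cost = call_cost(500, 4600, 250, ASSUMED_IN_RATE, ASSUMED_OUT_RATE)
total = input_cost + output_cost
input_share = input_cost / total

print(f"input  ${input_cost:.5f}")
print(f"output ${output_cost:.5f}")
print(f"total  ${total:.5f} (input share {input_share:.0%})")
```

Under these assumed rates the input side is roughly four fifths of the call, even though each output token is priced five times higher than an input token.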

The pipeline

A feature is a chain of calls, each with a different job. Steps run top to bottom.

  1. embed query

    turn the question into a vector to search the index

    tier: Small
    per-call shape (tokens): 1 sys + 50 in + 1 out

    Not priced here. This step runs on a separate embeddings endpoint with its own rates, which the price table doesn't carry.

    GPT-4.1 Nano, Mistral Small 4
  2. retrieve / rerank

    score and order candidate passages so only the best go in

    tier: Small
    per-call shape (tokens): 200 sys + 2K in + 30 out
    cheap default: Claude Haiku 4.5 ≈ $0.0024 per call
    step-up for quality: Gemini 3.5 Flash ≈ $0.0036 per call
    open-weight option: Mistral Small 4 ≈ $0.0002 per call
    See all small-tier models in the price table
  3. generate answer

    read the retrieved context and answer without inventing

    tier: Mid (cost-driver step, capable-model step)
    per-call shape (tokens): 500 sys + 4.6K in + 250 out
    cheap default: Claude Haiku 4.5 ≈ $0.0063 per call
    step-up for quality: Claude Sonnet 4.6 ≈ $0.019 per call
    open-weight option: Llama 4 Maverick ≈ $0.0009 per call
    See all mid-tier models in the price table
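
The per-call figures in the steps above can be added up per query. The sketch below copies those figures as listed and reports, for each option, the priced total and each step's share of it. The embed step is left out because it is not priced here, and the dictionary and function names are made up for illustration.

```python
# Per-call figures in USD, copied from the pipeline steps above.
# The embed step is omitted: it is not priced in this section.
PIPELINE = {
    "cheap default": {"retrieve / rerank": 0.0024, "generate answer": 0.0063},
    "step-up for quality": {"retrieve / rerank": 0.0036, "generate answer": 0.019},
    "open-weight option": {"retrieve / rerank": 0.0002, "generate answer": 0.0009},
}


def per_query(steps):
    """Return (total cost per query, {step name: share of that total})."""
    total = sum(steps.values())
    shares = {name: cost / total for name, cost in steps.items()}
    return total, shares


for option, steps in PIPELINE.items():
    total, shares = per_query(steps)
    breakdown = ", ".join(f"{name} {share:.0%}" for name, share in shares.items())
    print(f"{option}: ${total:.4f} per query ({breakdown})")
```

In all three options the generate answer step carries the larger share, which is why it is tagged as the cost-driver step.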

How to choose for a RAG support bot

The chain is embed, retrieve, then generate answer. Most of the bill lands on the last step, where thousands of retrieved tokens go in for a short paragraph back: with the cheap defaults above, that is about $0.0063 of the $0.0087 priced per query. That makes generate answer both the cost-driver step and the capable-model step.

Pick the cheapest model that answers faithfully from your retrieved context and start there. Spend on retrieval precision before you spend on a bigger model: three relevant passages instead of ten cuts the input meter on every call, and a model swap rarely beats that. Move up a tier only when the answers miss something a human reading the same passages would catch.
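
The effect of passing three passages instead of ten can be sketched with a little arithmetic. Everything numeric here is an assumption for illustration: the passage size, the question size, and both rates are made up, chosen so that ten passages reproduce the 500 sys + 4.6K in + 250 out shape of the generate answer step. The function name is hypothetical too.

```python
# All constants are illustrative assumptions, not figures from the price table.
SYS_TOKENS = 500        # system prompt, matching the generate-answer shape
OUT_TOKENS = 250        # answer length, matching the generate-answer shape
QUESTION_TOKENS = 100   # assumed size of the user's question
PASSAGE_TOKENS = 450    # assumed size of one retrieved passage
IN_RATE = 1.00          # assumed USD per million input tokens
OUT_RATE = 5.00         # assumed USD per million output tokens


def generate_cost(n_passages: int) -> float:
    """Cost in USD of one generate-answer call with n_passages in context."""
    input_tokens = SYS_TOKENS + QUESTION_TOKENS + n_passages * PASSAGE_TOKENS
    return (input_tokens * IN_RATE + OUT_TOKENS * OUT_RATE) / 1_000_000


ten = generate_cost(10)
three = generate_cost(3)
saving = 1 - three / ten

print(f"10 passages: ${ten:.5f} per call")
print(f" 3 passages: ${three:.5f} per call")
print(f"saving: {saving:.0%}")
```

Under these assumptions, trimming the context from ten passages to three cuts the call cost by about half without changing the model.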

The takeaway

The cost-driver step and the capable-model step are the same one: generate answer. Spend there; keep the rest small.

No fabricated bills, no rankings.