Best starting models for Support chatbot, priced per call.

A conversation accumulates. Every turn re-sends the transcript so the model keeps context, which means input grows with the dialogue and you pay to reprocess the history on each reply.
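
The growth described above can be sketched in a few lines. This is a minimal illustration, not a measurement: the function name and the token counts (a 400-token system prompt, 300 tokens added per turn) are assumptions chosen for the example.

```python
def resent_input_tokens(turns: int, system_tokens: int, tokens_per_turn: int) -> list[int]:
    """Input tokens sent on each turn when the full transcript is re-sent.

    Assumes every turn adds the same number of tokens to the transcript,
    which is a simplification for illustration.
    """
    # Turn k sends the system prompt plus k turns of accumulated dialogue.
    return [system_tokens + tokens_per_turn * k for k in range(1, turns + 1)]


# Hypothetical numbers: 400-token system prompt, 300 tokens added per turn.
per_turn = resent_input_tokens(turns=5, system_tokens=400, tokens_per_turn=300)
total = sum(per_turn)

print(per_turn)  # [700, 1000, 1300, 1600, 1900]
print(total)     # 6500
```

Per-turn input grows linearly, so the total input billed over a conversation grows roughly with the square of its length.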

The chain is intent, retrieve, generate. The cheap classify step runs every turn; the generate step does the answering and is the only place that needs a capable model.

  • Re-sent transcript grows with every turn.
  • The classify step runs once per turn, cheaply.
  • The stable prefix is highly cacheable.
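
As a rough sketch of the intent, retrieve, generate chain, the three steps can be wired together as below. Every name here (`Conversation`, `classify_intent`, `retrieve_passages`, `generate_reply`, `handle_turn`) is hypothetical, and the step bodies are keyword-matching placeholders standing in for model calls, not a real implementation.

```python
from dataclasses import dataclass, field


@dataclass
class Conversation:
    system_prompt: str
    transcript: list[str] = field(default_factory=list)


def classify_intent(message: str) -> str:
    """Placeholder router; a real system would call a small model here."""
    text = message.lower()
    if "human" in text or "agent" in text:
        return "handoff"
    if "order" in text or "refund" in text:
        return "tool"
    return "faq"


def retrieve_passages(message: str, corpus: dict[str, str]) -> list[str]:
    """Placeholder retrieval: return passages whose key appears in the message."""
    words = set(message.lower().split())
    return [text for key, text in corpus.items() if key in words]


def generate_reply(conv: Conversation, message: str, passages: list[str]) -> str:
    """Placeholder for the capable-model call; builds the prompt that would be sent."""
    # The whole transcript is re-sent, which is why this step's input grows.
    prompt = "\n".join([conv.system_prompt, *conv.transcript, *passages, message])
    return f"[reply based on {len(prompt.split())} prompt words]"


def handle_turn(conv: Conversation, message: str, corpus: dict[str, str]) -> tuple[str, str]:
    """Run the three steps in order and append the exchange to the transcript."""
    intent = classify_intent(message)
    passages = retrieve_passages(message, corpus)
    reply = generate_reply(conv, message, passages)
    conv.transcript += [f"user: {message}", f"assistant: {reply}"]
    return intent, reply


conv = Conversation("You are a support assistant.")
corpus = {"password": "Reset your password from Settings."}
first_intent, first_reply = handle_turn(conv, "how do I reset my password", corpus)
second_intent, second_reply = handle_turn(conv, "I want a human", corpus)
```

Only `generate_reply` sees the accumulated transcript; the other two steps work from the latest message alone, which is what keeps them cheap.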

The pipeline

A feature is a chain of calls, each with a different job. Steps run top to bottom.

  1. intent / route

    classify the message and pick a path (FAQ, handoff, tool)

    Small
    per-call shape 400 sys + 300 in + 5 out
    cheap default GPT-4.1 Nano ≈ <$0.0001 per call
    step-up for quality Claude Haiku 4.5 ≈ $0.0007 per call
    open-weight option Mistral Small 4 ≈ <$0.0001 per call
    See all small-tier models in the price table
  2. retrieve

    pull relevant help-centre passages for grounding

    Small
    per-call shape 200 sys + 1.5K in + 30 out
    cheap default Claude Haiku 4.5 ≈ $0.0018 per call
    step-up for quality Gemini 3.5 Flash ≈ $0.0028 per call
    open-weight option Llama 4 Scout ≈ $0.0001 per call
    See all small-tier models in the price table
  3. generate reply

    answer in context, re-sending the accumulating transcript each turn

    Mid (cost-driver step, capable-model step)
    per-call shape 400 sys + 3.1K in + 250 out
    cheap default Claude Haiku 4.5 ≈ $0.0048 per call
    step-up for quality Claude Sonnet 4.6 ≈ $0.014 per call
    open-weight option Llama 4 Maverick ≈ $0.0007 per call
    See all mid-tier models in the price table
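
A per-call estimate follows directly from a per-call shape: add the system and input tokens, price them at the input rate, and price the output tokens at the output rate. The sketch below assumes that "sys" and "in" are both billed as input, and it uses placeholder rates of $1 and $5 per million tokens that are illustrative only; substitute the rates from the price table for the model being evaluated. The function and variable names are hypothetical.

```python
def call_cost(system_tokens: int, input_tokens: int, output_tokens: int,
              usd_per_m_input: float, usd_per_m_output: float) -> float:
    """Estimated USD cost of one call, given a shape and per-million-token rates."""
    billed_input = system_tokens + input_tokens  # assumption: both billed as input
    return (billed_input * usd_per_m_input + output_tokens * usd_per_m_output) / 1_000_000


# Placeholder rates in USD per million tokens (illustrative, not from the price table).
ASSUMED_IN, ASSUMED_OUT = 1.00, 5.00

# Per-call shapes from the pipeline above: (sys, in, out).
steps = {
    "intent / route": (400, 300, 5),
    "retrieve": (200, 1500, 30),
    "generate reply": (400, 3100, 250),
}

costs = {name: call_cost(*shape, ASSUMED_IN, ASSUMED_OUT) for name, shape in steps.items()}
per_turn_total = sum(costs.values())

for name, usd in costs.items():
    print(f"{name}: ${usd:.4f}")
```

At these placeholder rates, generate reply works out to about $0.0048 per call, more than the other two steps combined, because it carries the largest input and nearly all of the output.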

How to choose for Support chatbot

Every turn runs intent / route, then retrieve, then generate reply, and the transcript grows each turn. The cost-driver step and the capable-model step are the same: generate reply carries the growing transcript and is the only step that needs a capable model, while intent / route and retrieve run on every turn on small models.

Start generate reply on a small or mid model and reserve a step up for genuinely hard threads. Keep intent / route and retrieve small; they run constantly and should cost almost nothing per call. The stable prefix of the transcript, meaning the earlier turns that are re-sent unchanged, is highly cacheable, which makes caching the single biggest cost lever for this call shape.
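
To see why caching matters so much here, compare the input cost of one generate call with and without a cached prefix. This is a sketch under stated assumptions: the function name is hypothetical, the 0.1 multiplier for cached reads is a placeholder (discounts vary by provider), and any surcharge for writing to the cache is ignored.

```python
def input_cost_with_cache(prefix_tokens: int, new_tokens: int,
                          usd_per_m_input: float,
                          cached_read_multiplier: float) -> float:
    """Input cost in USD when the stable prefix is read from cache at a discount."""
    cached = prefix_tokens * usd_per_m_input * cached_read_multiplier
    fresh = new_tokens * usd_per_m_input
    return (cached + fresh) / 1_000_000


# Hypothetical turn: 3,000 tokens of stable prefix, 500 new tokens,
# placeholder rate of $1 per million input tokens.
uncached = input_cost_with_cache(3000, 500, 1.00, cached_read_multiplier=1.0)
cached = input_cost_with_cache(3000, 500, 1.00, cached_read_multiplier=0.1)
savings = 1 - cached / uncached

print(f"uncached ${uncached:.4f}, cached ${cached:.4f}")
```

The longer the conversation, the larger the prefix is relative to the new tokens, so the saving grows with the same transcript that drives the cost.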

The takeaway

The cost-driver step and the capable-model step are the same one: generate reply. Spend there; keep the rest small.

No fabricated bills, no rankings.