Best starting models for a coding agent, priced per call.

An agent reads a repo, plans a change, edits, runs tools, then re-checks. Each loop re-sends a growing context, so the same tokens get billed again and again.

Three cost drivers stack at once:

  • A long context, re-sent on every loop step.
  • A tool loop that repeats many times per task.
  • Reasoning tokens, billed as output, often longer than the visible action.

The plan step is where capability earns its keep; the looping edit step is where the spend piles up.
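To make the re-sent context concrete, the sketch below counts the input tokens billed across a loop. The function name and every token count are hypothetical, chosen only for illustration. It assumes the full context is re-sent on each step and grows by a fixed amount each time.

```python
def billed_input_tokens(base_context: int, growth_per_step: int, steps: int) -> int:
    """Total input tokens billed when the whole context is re-sent on every step."""
    total = 0
    context = base_context
    for _ in range(steps):
        total += context            # the entire context is billed again on this step
        context += growth_per_step  # tool output and edits enlarge the next request
    return total


# Hypothetical numbers: a 10K-token starting context that grows 2K per step.
one_step = billed_input_tokens(10_000, 2_000, 1)
ten_steps = billed_input_tokens(10_000, 2_000, 10)
print(one_step, ten_steps)
```

Under these assumptions ten steps bill 190,000 input tokens, 19 times the single-step figure, even though the final request is only 28,000 tokens long.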

The pipeline

A feature is a chain of calls, each with a different job. Steps run top to bottom.

  1. plan

    decompose the task and decide what to change

    tier: Frontier (capable-model step)
    per-call shape: 2.5K sys + 8K in + 1.2K out
    cheap default: Claude Sonnet 4.6 ≈ $0.050 per call
    step-up for quality: Claude Opus 4.8 ≈ $0.082 per call
    open-weight option: DeepSeek V4 Pro ≈ $0.0056 per call
    See all frontier-tier models in the price table
  2. edit / tool-call

    make changes and run tools, re-sending the growing context each step

    tier: Frontier (repeats; cost-driver step)
    per-call shape: 2.5K sys + 24K in + 2K out
    cheap default: Claude Sonnet 4.6 ≈ $0.110 per call
    step-up for quality: Claude Opus 4.8 ≈ $0.182 per call
    open-weight option: DeepSeek V4 Pro ≈ $0.013 per call
    See all frontier-tier models in the price table
  3. verify

    read tool output and decide whether the change is correct

    tier: Small
    per-call shape: 1K sys + 6K in + 200 out
    cheap default: Claude Haiku 4.5 ≈ $0.0080 per call
    step-up for quality: Gemini 3.5 Flash ≈ $0.012 per call
    open-weight option: Mistral Small 4 ≈ $0.0008 per call
    See all small-tier models in the price table
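The per-call figures above can be approximated from each step's per-call shape. This is a minimal sketch, assuming system and input tokens are billed at one input rate and output tokens at a separate output rate. The function name and the rates used here (3 and 15 dollars per million tokens) are illustrative assumptions, not values taken from the price table.

```python
def call_cost(sys_tokens: int, in_tokens: int, out_tokens: int,
              input_rate_per_m: float, output_rate_per_m: float) -> float:
    """Dollar cost of one call; rates are dollars per million tokens."""
    input_tokens = sys_tokens + in_tokens  # assumption: the system prompt bills as input
    dollars = input_tokens * input_rate_per_m + out_tokens * output_rate_per_m
    return dollars / 1_000_000


# Hypothetical rates; substitute the values from the price table.
INPUT_RATE, OUTPUT_RATE = 3.0, 15.0

plan = call_cost(2_500, 8_000, 1_200, INPUT_RATE, OUTPUT_RATE)   # plan shape
edit = call_cost(2_500, 24_000, 2_000, INPUT_RATE, OUTPUT_RATE)  # edit / tool-call shape
print(f"plan ${plan:.4f}, edit ${edit:.4f}")
```

With these assumed rates the plan shape comes to about $0.0495 and the edit shape to about $0.1095, in line with the ≈ $0.050 and ≈ $0.110 listed for the cheap default.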

How to choose for a coding agent

Three steps, three jobs: plan decides what to change, edit / tool-call makes it across a looping context, and verify checks the result. The cost-driver step and the capable-model step are different here, and getting that split right is the whole game.

Put the capable model on plan, the step that decides the change. The spend piles up on edit / tool-call, where a long context is re-sent on every loop, so the lever there is cached input, not a bigger model. Keep verify on a small model. A frontier model across the whole loop pays frontier rates for steps that never needed it.
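Why cached input is the lever on the edit step can be sketched with simple arithmetic. This assumes a provider that bills cached input tokens at a discounted rate. The function name, all three rates, and the share of context served from cache are hypothetical, and any cache-write surcharge is ignored.

```python
def edit_call_cost(fresh_in: int, cached_in: int, out_tokens: int,
                   input_rate: float, cached_rate: float, output_rate: float) -> float:
    """Dollar cost of one call when part of the input is read from cache.

    Rates are dollars per million tokens.
    """
    dollars = (fresh_in * input_rate
               + cached_in * cached_rate
               + out_tokens * output_rate)
    return dollars / 1_000_000


# Hypothetical rates: cached input billed at one tenth of the fresh input rate.
INPUT_RATE, CACHED_RATE, OUTPUT_RATE = 3.0, 0.3, 15.0

# Edit shape from the pipeline: 26.5K input tokens (sys + in), 2K output tokens.
uncached = edit_call_cost(26_500, 0, 2_000, INPUT_RATE, CACHED_RATE, OUTPUT_RATE)
# Assumption: 21.2K of those input tokens are unchanged since the previous step.
cached = edit_call_cost(5_300, 21_200, 2_000, INPUT_RATE, CACHED_RATE, OUTPUT_RATE)
print(f"uncached ${uncached:.4f}, cached ${cached:.4f}")
```

Under these assumptions the call drops from about $0.110 to about $0.052 with no change of model. Output tokens are unaffected by caching, so they set the floor.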

The takeaway

The cost-driver step is edit / tool-call. The capable-model step is plan. They are different, so put the capable model on plan, cut edit / tool-call spend with cached input rather than a bigger model, and keep verify small.

No fabricated bills, no rankings.