Guide

Best starting models for a coding agent, priced per call.

An agent reads a repo, plans a change, edits, runs tools, then re-checks. Each loop re-sends a growing context, so the same tokens get billed again and again.

Three drivers stack at once: a long re-sent context, a tool loop that repeats, and reasoning that bills as output. The plan step is where capability earns its keep; the looping edit step is where the spend piles up.

A long context, re-sent on every loop step.
A tool loop that repeats many times per task.
Reasoning tokens, billed as output, often longer than the visible action.
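How the first two drivers compound can be sketched in a few lines. This is a minimal illustration, not a billing formula. The function name `cumulative_input_tokens`, the 2,000 tokens of growth per step, and the 10-step loop are assumptions made for the example. Only the 2.5K system prompt and the 24K input come from the edit / tool-call shape in this guide.

```python
def cumulative_input_tokens(system: int, start_context: int,
                            growth_per_step: int, steps: int) -> int:
    """Total input tokens billed when the full prompt is re-sent on every loop step."""
    total = 0
    context = start_context
    for _ in range(steps):
        total += system + context    # the whole prompt is billed again on each call
        context += growth_per_step   # tool output and edits are appended (assumed growth)
    return total


# Assumed loop: 10 steps, context growing by 2,000 tokens per step.
billed = cumulative_input_tokens(system=2_500, start_context=24_000,
                                 growth_per_step=2_000, steps=10)
final_prompt = 2_500 + 24_000 + 2_000 * 9  # size of the last prompt sent
print(f"billed input tokens: {billed:,}")
print(f"final prompt size:   {final_prompt:,}")
```

Under these assumptions the loop bills several times more input tokens than the final prompt contains, which is the sense in which the same tokens get billed again and again.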

The pipeline

A feature is a chain of calls, each with a different job. Steps run top to bottom.

01

plan

decompose the task and decide what to change

Frontier, capable-model step

per-call shape: 2.5K sys + 8K in + 1.2K out

cheap default: Claude Sonnet 4.6 ≈ $0.050 per call

step-up for quality: Claude Opus 4.8 ≈ $0.082 per call

open-weight option: DeepSeek V4 Pro ≈ $0.0056 per call

See all frontier-tier models in the price table

02

edit / tool-call

make changes and run tools, re-sending the growing context each step

Frontier, repeats, cost-driver step

per-call shape: 2.5K sys + 24K in + 2K out

cheap default: Claude Sonnet 4.6 ≈ $0.110 per call

step-up for quality: Claude Opus 4.8 ≈ $0.182 per call

open-weight option: DeepSeek V4 Pro ≈ $0.013 per call

See all frontier-tier models in the price table

03

verify

read tool output and decide whether the change is correct

Small

per-call shape: 1K sys + 6K in + 200 out

cheap default: Claude Haiku 4.5 ≈ $0.0080 per call

step-up for quality: Gemini 3.5 Flash ≈ $0.012 per call

open-weight option: Mistral Small 4 ≈ $0.0008 per call

See all small-tier models in the price table
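The per-call figures above follow from each step's shape multiplied by a per-1M-token price. The sketch below shows that arithmetic. `CallShape`, `per_call_cost`, and the two rates are illustrative assumptions, not values quoted from the price table. With these placeholder rates, the plan and edit / tool-call shapes happen to land near the ≈ $0.050 and ≈ $0.110 figures listed for the cheap default.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CallShape:
    system: int   # system-prompt tokens
    user_in: int  # remaining input tokens (repo context, tool output)
    out: int      # output tokens; reasoning billed as output would count here


def per_call_cost(shape: CallShape, usd_per_m_in: float, usd_per_m_out: float) -> float:
    """Cost in USD of one call, given prices per 1M input and output tokens."""
    input_tokens = shape.system + shape.user_in
    return (input_tokens * usd_per_m_in + shape.out * usd_per_m_out) / 1_000_000


# Shapes taken from the three pipeline steps.
PLAN = CallShape(system=2_500, user_in=8_000, out=1_200)
EDIT = CallShape(system=2_500, user_in=24_000, out=2_000)
VERIFY = CallShape(system=1_000, user_in=6_000, out=200)

# Placeholder rates in USD per 1M tokens, assumed for illustration only.
RATE_IN, RATE_OUT = 3.00, 15.00

for name, shape in [("plan", PLAN), ("edit / tool-call", EDIT), ("verify", VERIFY)]:
    print(f"{name}: ${per_call_cost(shape, RATE_IN, RATE_OUT):.4f} per call")
```

Swapping in the rates of any model from the price table gives that model's per-call figure for the same shape.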

How to choose for a coding agent

Three steps, three jobs: plan decides what to change, edit / tool-call makes it across a looping context, and verify checks the result. The cost-driver step and the capable-model step are different here, and getting that split right is the whole game.
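To see why the split matters, the per-call figures for the cheap defaults can be multiplied out over one feature. The call counts below are hypothetical, and the sketch treats every edit / tool-call at the fixed listed shape even though a real context keeps growing. The helper name `feature_breakdown` is also an assumption made for the example.

```python
# Per-call figures for the cheap defaults, copied from the pipeline in this guide.
PER_CALL_USD = {"plan": 0.050, "edit / tool-call": 0.110, "verify": 0.0080}

# Hypothetical call counts for one feature; real loops vary widely.
CALLS = {"plan": 1, "edit / tool-call": 12, "verify": 3}


def feature_breakdown(per_call: dict, calls: dict):
    """Return per-step cost, total cost, and each step's share of the total."""
    costs = {step: per_call[step] * calls[step] for step in per_call}
    total = sum(costs.values())
    shares = {step: cost / total for step, cost in costs.items()}
    return costs, total, shares


costs, total, shares = feature_breakdown(PER_CALL_USD, CALLS)
for step in costs:
    print(f"{step}: ${costs[step]:.3f} ({shares[step]:.0%} of total)")
print(f"total: ${total:.3f}")
```

With these assumed counts the looping step takes well over 90% of the total, while plan, the step where capability matters most, is a small fraction of it.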

Put the capable model on plan, the step that decides the change. The spend piles up on edit / tool-call, where a long context is re-sent on every loop, so the lever there is cached input, not a bigger model. Keep verify on a small model. A frontier model across the whole loop pays frontier rates for steps that never needed it.
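The cached-input lever can be sketched on the edit / tool-call shape. Everything priced here is an assumption: the placeholder rates, the 0.10 multiplier for cached input, and the split of 22.5K cached versus 4K fresh tokens. Providers price cache reads and writes differently, and this sketch ignores any cache-write surcharge.

```python
def call_cost(cached_in: int, fresh_in: int, out: int,
              usd_per_m_in: float, usd_per_m_out: float,
              cached_multiplier: float) -> float:
    """Cost in USD of one call when part of the input is read from a prompt cache."""
    cached = cached_in * usd_per_m_in * cached_multiplier  # discounted cache reads
    fresh = fresh_in * usd_per_m_in                        # new tokens at the full input rate
    output = out * usd_per_m_out                           # output is never discounted here
    return (cached + fresh + output) / 1_000_000


RATE_IN, RATE_OUT = 3.00, 15.00  # placeholder USD per 1M tokens, not quoted prices
CACHED_MULTIPLIER = 0.10         # hypothetical discount for cached input

# Edit / tool-call shape: 2.5K sys + 24K in + 2K out.
uncached = call_cost(0, 26_500, 2_000, RATE_IN, RATE_OUT, CACHED_MULTIPLIER)
# Assume the system prompt plus 20K of earlier context is already cached.
cached = call_cost(22_500, 4_000, 2_000, RATE_IN, RATE_OUT, CACHED_MULTIPLIER)
print(f"uncached: ${uncached:.4f}  cached: ${cached:.4f}")
```

Under these assumptions the cached call costs less than half of the uncached one without changing the model, and output tokens become the largest remaining part of the bill.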

The takeaway

The cost-driver step is edit / tool-call. The capable-model step is plan. They are different, so put the capable model on plan, cut the cost of edit / tool-call with cached input rather than a bigger model, and keep verify small.

No fabricated bills, no rankings.

Go deeper

Explainer: See the full cost breakdown. What this task costs and why, worked through line by line with live prices.

Price table: Every model, priced per 1M tokens. Sort and filter the full catalog the options above link into.
