TL;DR

Sonnet 4 averaged 25.6/30 — the highest of all five models tested. Both reasoning models (Opus 4 at 24.5 and Grok 4 at 24.2) scored lower.
Opus 4 costs 5.2× more than Sonnet 4. Grok 4 costs 4.4× more than Sonnet 4 and burns 65,150 output tokens across 20 prompts, most of them hidden reasoning tokens spent on prompts that don’t need reasoning.
Haiku 4.5 was within 0.3 points of Sonnet (25.3 vs 25.6) at 2.4× lower cost. For most categories, Haiku is the correct default — not Sonnet.
Cost per 1000 quality points: Opus $2.47, Grok 4 $2.11, Sonnet $0.46, Haiku $0.19, Gemma 2 2B $0.00. Sonnet and Haiku are the only serious options on value.
My own Claude Code bill over the last 90 days was $18,136, 97% of it on Opus 4. With the routing rules this benchmark validates, it would have been $4,163 — a $13,973 (77%) save.
Grok 4 is a reasoning model that burns tokens even on trivial prompts. On a single “reply OK” test it used 320 hidden reasoning tokens before outputting 1 visible token. On debug prompts it used 11,000-14,000 reasoning tokens per response. If you’re calling Grok 4 for routine coding, you’re paying several times more than Sonnet (4.4× across this benchmark) for worse output.
Local Gemma 2 2B scored 12.9/30 — usable for simple extraction, not a drop-in replacement. Gemma 4 26B timed out on CPU inference on my M-series laptop.
Raw data, grader, prompts, and the reproducible harness are in the repo. Re-run it and prove me wrong if the numbers don’t hold.

Why I ran this

I use Claude Code daily and my bill has been bouncing between $200 and $900 a month without me understanding why. Every monitoring tool (Helicone, Langfuse, Portkey) shows you what you already spent. None of them answer the actually interesting question: “Could a cheaper model have done that task just as well?”

So I built a small benchmark harness and ran it against five realistic options:

Claude Opus 4 — Anthropic’s flagship, the default “use when you really need it” tier
Claude Sonnet 4 — the workhorse most Claude Code sessions use
Claude Haiku 4.5 — the cheap Anthropic tier
Grok 4 (grok-4-0709) — xAI’s reasoning model, competitive pricing with Sonnet
Gemma 2 2B running locally via Ollama — the zero-cost tier

I put together 20 prompts designed to match the category distribution of my actual Claude Code history (10 task types: implement, debug, refactor, explain, review, data, test, config, ui, docs). Each prompt is self-contained, with no filesystem access and no tool use, so local Gemma is on a level playing field.

Every prompt ran through all five models with the same settings. Then I asked Sonnet 4 to blindly grade the outputs, labeled A through E (randomized order per prompt, seeded for reproducibility), on a three-axis rubric: correctness, quality, and completeness, each 0–10.

The bottom-line numbers

Model	Avg /30	Wins	Cost (20 tasks)	Tokens out	Avg lat
Sonnet 4 ★	25.6	4 (20%)	$0.2330	15,148	12.5s
Haiku 4.5	25.3	5 (25%)	$0.0977	19,149	6.8s
Opus 4	24.5	6 (30%)	$1.2088	15,730	17.0s
Grok 4	24.2	5 (25%)	$1.0229	65,150	48.8s
Gemma 2 2B (local)	12.9	0 (0%)	$0.0000	11,780	12.5s

★ Highest average quality across all five models. “Wins” counts outright wins per prompt — Opus has the most wins but also the most bad outliers, hence the lower average.

The Opus result, in more detail

The most surprising finding for me was that Opus 4 was not the best model on this benchmark. Sonnet 4 scored 25.6/30 on average and Opus scored 24.5/30, a 1.1-point gap, though with a single run per prompt I can’t say how much of that is noise. Opus won 6 of 20 prompts outright, more than any other model, but Sonnet (4 wins) and Haiku (5 wins) both finished with a higher average than Opus. Grok 4 also won 5, with a slightly lower average (24.2).

I did not expect this and I’m still a little suspicious, so I want to be specific about what happened.

On one debug prompt, Opus produced 629 tokens of output describing a bug that did not exist in the code. It hallucinated a “mutable reference” issue, walked itself back mid-explanation, then invented a second fake bug. Sonnet and Haiku both correctly identified the real bug (an accumulator being mutated during iteration). The judge scored Opus 12/30 on that prompt and Sonnet 25/30. I reviewed the output myself and the judge was right. It wasn’t a judging artifact — Opus really did flub it.

On another prompt, Opus wrote a longer, more confident answer that was subtly wrong about a React hook dependency. Sonnet wrote a shorter, correct answer. Opus loses when it’s confidently verbose on something it got wrong. Sonnet is more conservative and it’s helping.

The one place Opus keeps pace at the top is refactor, where it ties Sonnet at 28.5 in the per-category table. When the task is “take this messy code and make it clean,” the extra headroom seems to matter. That’s the one category where I’ll keep routing to Opus.

For everything else — debug, implement, explain, review, test, data, docs, ui, config — the ~1.1 point edge Sonnet has on average at 5.2× lower cost is the actionable finding.

The Grok 4 result

Grok 4 is a reasoning model like o1 and o3. That means it generates hidden “reasoning tokens” in addition to the visible output, and you’re billed for both. I included it in the benchmark because xAI has been pitching Grok 4 as a serious coding model, and I wanted to see how it compared on my own prompts.

It did not go well. Grok 4 averaged 24.2/30 — the lowest of the four cloud models. Worse, it burned 65,150 output tokens across the 20 prompts versus 15,148 for Sonnet. Most of those were hidden reasoning tokens the user never sees. On one debug prompt Grok used 14,109 reasoning tokens — more than a full-length blog post of hidden thinking — to produce an answer that scored lower than Haiku’s 561-token answer on the same prompt.

A simple “reply OK” test during setup cost 320 reasoning tokens. Grok 4 reasons about everything, whether the prompt needs reasoning or not.
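To make the billing effect concrete, here is a minimal sketch of how hidden reasoning tokens inflate the cost of a response when they are billed as output. The function names and the `PRICE` value are illustrative assumptions, not xAI's actual rate card; only the token counts (320 reasoning tokens, 1 visible token) come from the setup test described above.

```python
def billed_tokens_per_visible(visible_tokens, reasoning_tokens):
    """How many output tokens are billed for each token the user actually sees."""
    return (visible_tokens + reasoning_tokens) / visible_tokens


def output_cost(visible_tokens, reasoning_tokens, usd_per_million_output):
    """Output cost of one response when reasoning tokens are billed as output."""
    billed = visible_tokens + reasoning_tokens
    return billed * usd_per_million_output / 1_000_000


# ASSUMPTION: placeholder price in USD per million output tokens.
# Substitute the provider's current rate before drawing conclusions.
PRICE = 15.0

# The "reply OK" setup test: 1 visible token, 320 hidden reasoning tokens.
print(billed_tokens_per_visible(1, 320))  # 321.0
print(output_cost(1, 320, PRICE))
```

The ratio is the useful number here: it does not depend on the price, only on how much hidden thinking the model does per visible token.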

The total Grok 4 cost for this benchmark was $1.02 — 4.4× the cost of Sonnet for 1.4 fewer quality points and 3.9× slower response times (48.8 seconds average vs Sonnet’s 12.5). If you’re calling Grok 4 for routine Claude Code tasks, you’re paying reasoning-model prices for the kind of work where reasoning doesn’t help.

Grok 4 did win 5 of 20 prompts on my benchmark — it was very good at debug #5 (scoring 26 where Opus scored 12) and debug #6 (I had to regrade that prompt twice because the judge kept producing malformed JSON on 5-way comparisons, but Grok ranked near the top both times). On genuinely reasoning-heavy debug tasks the extended thinking helps. On everything else, it’s overkill.

I want to be clear: this is 20 prompts and Grok 4 is a capable model. The story isn’t “Grok 4 is bad.” The story is “reasoning models are overkill for the majority of coding tasks, and you’re paying a huge premium for capability you don’t need.” Opus and Grok 4 both make the same mistake: they think harder than the task requires.

Per-category breakdown

Category	Opus 4	Sonnet 4	Haiku 4.5	Grok 4	Gemma 2B	Winner
refactor	28.5	28.5	24.5	25.0	14.0	Opus/Sonnet
implement	26.5	25.5	27.2	22.5	13.0	Haiku
explain	24.7	25.3	27.7	26.0	14.0	Haiku
test	21.0	24.0	29.0	19.0	14.0	Haiku
data	18.0	22.5	22.0	22.5	8.0	Sonnet/Grok
ui	26.0	29.0	24.0	23.0	10.0	Sonnet
review	26.5	26.5	25.5	26.0	18.0	Sonnet/Opus
debug	21.3	25.7	22.3	25.7	8.7	Sonnet/Grok
config	24.0	21.0	22.0	23.0	9.0	Opus
docs	29.0	28.0	28.0	28.0	25.0	Opus*

* Opus wins docs by 1 point, essentially a tie with the others. Only 1 prompt in this category so take with a grain of salt.

The pattern is clearer than I expected. Opus leads or ties where there’s room to be “more elegant”: refactor, config, and docs. Sonnet leads or ties where you need to reason through existing code: debug, review, and ui. Haiku wins where you need accurate recall and clean generation: implement, explain, and test. On data, Sonnet and Grok tie, with Haiku half a point behind.

Cost per quality point — the headline chart

Model	$ / 1000 quality points	vs Haiku
Opus 4	$2.47	12.8× more
Grok 4	$2.11	10.9× more
Sonnet 4	$0.46	2.4× more
Haiku 4.5	$0.19	—
Gemma 2 2B (local)	$0.00	free but 2× worse quality

The two reasoning models, Opus 4 and Grok 4, are both roughly 11-13× more expensive per quality point than Haiku 4.5. If you’re defaulting to either for everyday coding, you’re paying an order of magnitude more for lower quality than the cheaper non-reasoning alternatives. That’s the headline.
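The cost-per-quality-point figures can be reproduced from the bottom-line table. The sketch below assumes the metric is total cost divided by total points earned (average score × 20 prompts), scaled to 1,000 points; the function and variable names are mine, not the harness's.

```python
N_PROMPTS = 20

# (average score out of 30, total USD cost for the 20 prompts),
# copied from the bottom-line table.
RESULTS = {
    "Opus 4": (24.5, 1.2088),
    "Grok 4": (24.2, 1.0229),
    "Sonnet 4": (25.6, 0.2330),
    "Haiku 4.5": (25.3, 0.0977),
}


def usd_per_1000_points(avg_score, total_cost, n_prompts=N_PROMPTS):
    """Dollars spent per 1,000 quality points earned across the benchmark."""
    total_points = avg_score * n_prompts
    return total_cost / total_points * 1000


baseline = usd_per_1000_points(*RESULTS["Haiku 4.5"])
for model, (avg, cost) in RESULTS.items():
    value = usd_per_1000_points(avg, cost)
    print(f"{model:10s} ${value:.2f} per 1000 points, {value / baseline:.1f}x Haiku")
```

Run against the table values, this gives $2.47, $2.11, $0.46, and $0.19, with ratios of 12.8×, 10.9×, and 2.4× relative to Haiku.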

My own 90-day Claude Code bill: $18,136 → $4,163

I ran a script over my own session history (all ~2,300 JSONL files in ~/.claude/projects/ from the last 90 days) that:

Sums the actual token counts paid per assistant message across all models
Classifies each message’s task type using the same category heuristics as the benchmark
Applies the routing rules this benchmark validated (Opus→Sonnet for everything except refactor, Sonnet→Haiku for data/explain/test/docs/review) to compute what I would have paid

Scenario	90-day cost	Monthly
Actual spend (97% Opus)	$18,136	$6,045
Benchmark-backed routing	$4,163	$1,388

$13,973 in 90 days. $4,658/month. 77% save. That’s my actual bill — not a hypothetical. The script is in the repo (compute_real_savings.py). Run it over your own ~/.claude/projects/ history and see what your number looks like.
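For readers who want the shape of the calculation before opening the repo, here is a minimal sketch of the re-pricing step. It is not the contents of compute_real_savings.py. The `Message` type, the `PRICES` table, and the choice to chain the two rules (an Opus message in a Haiku category ends up on Haiku) are all assumptions for illustration, and cache tokens are ignored.

```python
from dataclasses import dataclass

# ASSUMPTION: USD per million tokens as (input, output).
# Check current pricing before relying on these values.
PRICES = {
    "opus": (15.0, 75.0),
    "sonnet": (3.0, 15.0),
    "haiku": (1.0, 5.0),
}

# Categories where Sonnet traffic is routed down to Haiku.
HAIKU_CATEGORIES = {"data", "explain", "test", "docs", "review"}


@dataclass
class Message:
    model: str          # "opus", "sonnet", or "haiku"
    category: str       # e.g. "debug", "refactor", "test"
    input_tokens: int
    output_tokens: int


def route(model, category):
    """Apply the two routing rules in sequence (chaining is an assumption)."""
    if model == "opus" and category != "refactor":
        model = "sonnet"
    if model == "sonnet" and category in HAIKU_CATEGORIES:
        model = "haiku"
    return model


def cost(model, input_tokens, output_tokens):
    """USD cost of one message at the assumed per-million-token prices."""
    price_in, price_out = PRICES[model]
    return (input_tokens * price_in + output_tokens * price_out) / 1_000_000


def actual_and_routed(messages):
    """Return (what was paid, what the routed models would have cost)."""
    actual = sum(cost(m.model, m.input_tokens, m.output_tokens) for m in messages)
    routed = sum(
        cost(route(m.model, m.category), m.input_tokens, m.output_tokens)
        for m in messages
    )
    return actual, routed
```

The sketch holds token counts fixed when swapping models, which is itself an approximation: in the benchmark, Haiku produced more output tokens than Sonnet on the same prompts.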

Caveat: I am a heavy Claude Code user with 116,000 assistant messages in 90 days and 97% of my spend on Opus. If your profile is different, and most people’s is, your absolute number will be smaller, and the percentage will probably be lower too if less of your spend goes to Opus. My guess is that many Claude Code users are still over-spending on Opus when Sonnet or Haiku would do.

What about Gemma 2 local? Can you really run for free?

Short answer: no, not as a general-purpose replacement. Gemma 2 2B averaged 12.9/30, about half of Sonnet’s 25.6. It hallucinates syntax, misses edge cases, and produces confidently-wrong code on most tasks.

But there are two narrow use cases where it shines at zero cost:

Structured extraction and classification. Gemma 2 scored 25/30 on the documentation task, much closer to cloud territory.
Privacy-sensitive prompts you don’t want sent to any cloud.

I also tested the 9.6 GB gemma4 MoE variant on the same hardware (M-series, 32 GB RAM, no GPU). It averaged ~1.5 tokens/sec and timed out on most prompts. Larger local models need real GPU acceleration. If someone tells you “just run Gemma 4 locally, it’s free”, ask them what hardware they’re on.
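A quick back-of-the-envelope check shows why 1.5 tokens/sec is unworkable. This sketch assumes the larger model would need roughly as many output tokens as Gemma 2 2B averaged in this benchmark (11,780 tokens over 20 prompts), which is an assumption, and it ignores prompt-processing time.

```python
def seconds_to_generate(output_tokens, tokens_per_second):
    """Wall-clock time to stream a response at a fixed generation rate."""
    return output_tokens / tokens_per_second


# Gemma 2 2B's average output length per prompt in this benchmark.
avg_output_tokens = 11_780 / 20

# ASSUMPTION: the larger model writes answers of similar length.
seconds = seconds_to_generate(avg_output_tokens, 1.5)
print(f"{seconds:.0f} s per prompt, about {seconds / 60:.1f} minutes")
```

That works out to more than six minutes for an average-length answer, before counting the time spent reading the prompt.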

Methodology and caveats

Test set

20 prompts, curated to match the category distribution of my real Claude Code usage
Each prompt is self-contained (no filesystem context, no tool use, no follow-ups), so all five models compete on equal footing
Prompts are realistic (based on actual tasks I ask Claude Code to do), not contrived gotchas
Full prompt list, raw outputs, and grader results are in the repo

Grading

Sonnet 4 was the judge. For each prompt it saw all five outputs labeled A through E in randomized order (seeded per prompt for reproducibility) and scored each on correctness, quality, and completeness (0–10 each, 30 max).
Randomized ordering reduces position bias, and the anonymous labels mean the judge isn’t told which output came from which model.
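Sonnet-as-judge may be biased toward Sonnet, and I accept that risk. Partial mitigation: Haiku nearly tying Sonnet overall, and beating it outright in implement, explain, and test, is not what a judge that simply favored its own output would produce. Sonnet topping the overall average is still consistent with self-bias, so treat the Sonnet-vs-Opus gap with caution until a second judge confirms it. Re-running with GPT-4o or Gemini as a secondary judge is on the roadmap.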
The rubric and judge prompt are in grade.py.
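The seeded, blinded labeling described in this list can be sketched as follows. This is not the code in grade.py; the function name, the `base_seed` parameter, and the seed format are hypothetical, chosen only to show how a per-prompt seed makes the shuffle reproducible.

```python
import random
import string


def blind_outputs(prompt_id, outputs, base_seed=0):
    """Shuffle model outputs and assign anonymous labels A, B, C, ...

    outputs: dict mapping model name -> response text.
    Returns (labeled, key): `labeled` maps label -> text and is what the
    judge sees; `key` maps label -> model name and stays private until
    scoring is finished.
    """
    # Seed from the prompt id so the same prompt always gets the same order.
    rng = random.Random(f"{base_seed}:{prompt_id}")
    models = sorted(outputs)  # fixed starting order before shuffling
    rng.shuffle(models)
    labels = string.ascii_uppercase[: len(models)]
    labeled = {label: outputs[model] for label, model in zip(labels, models)}
    key = dict(zip(labels, models))
    return labeled, key
```

Sorting the model names before shuffling matters: without it, the result would depend on the insertion order of the input dict as well as the seed.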

What this doesn’t test

Long-context tasks. All prompts are under 1,500 chars. Real Claude Code sessions often run 100K+ token context windows, where Opus’s extended thinking probably helps more.
Tool-using tasks. All models answered in a single turn without file access.
Extremely hard problems. These are daily developer work prompts, not algorithm interview questions.
Consistency. N=1 per prompt. No variance analysis. A v2 would run N=5 and report confidence intervals.
The specific Opus scoring anomaly on one debug prompt. Opus genuinely flubbed it. If you remove that outlier, Opus rises to about 25.2/30 (working from the rounded average: (24.5 × 20 − 12) / 19 ≈ 25.2), still below Sonnet’s 25.6. The pattern holds.
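On the consistency point, a v2 that runs each prompt N times could report a mean with a 95% confidence interval per model. A minimal sketch using only the standard library, with a small lookup table of Student's t critical values; the function name is hypothetical and nothing like this exists in the current harness.

```python
import math
import statistics

# Two-sided 95% critical values of Student's t, keyed by degrees of freedom.
T_CRIT_95 = {1: 12.706, 2: 4.303, 3: 3.182, 4: 2.776, 5: 2.571, 9: 2.262}


def mean_with_ci(scores):
    """Return (mean, low, high) for a 95% confidence interval on the mean.

    Raises KeyError if len(scores) - 1 is not in T_CRIT_95.
    """
    n = len(scores)
    mean = statistics.mean(scores)
    sem = statistics.stdev(scores) / math.sqrt(n)  # standard error of the mean
    half_width = T_CRIT_95[n - 1] * sem
    return mean, mean - half_width, mean + half_width
```

With N=5 the critical value is 2.776, so the interval is fairly wide. Comparing two models on the same prompts would call for a paired test rather than checking whether two such intervals overlap.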

What I built from this

The reason I ran this was that I wanted a real tool to do this routing automatically. That tool is InferLane — a free Claude Code plugin that gives your agent six cost-intelligence tools via MCP.

Install it from the marketplace:

/plugin marketplace add ComputeGauge/inferlane
/plugin install inferlane@inferlane

For the local Gemma fallback, one command sets it up:

curl -fsSL https://inferlane.dev/install.sh | bash

Source is on GitHub. The benchmark harness is in the benchmark/ directory — clone it, re-run it with your own prompts, and tell me what you find.

What I’m uncertain about

Three things I’m watching for commenters to correct me on:

The Opus result. I’m still slightly suspicious that Opus underperformed on this specific benchmark. The debug outlier was real — I checked by hand — but I want a second judge and N=5 runs before I’m certain it generalizes. If you run the benchmark and see different numbers, file an issue.
The $14K save. It’s my own number from my own usage, but my profile is unusual (97% Opus, 116K messages/quarter). Average Claude Code users probably see more like a 20-40% save on a much smaller base. The script is there — run it and share your own number.
Long-context behavior. The benchmark tests single-turn coding prompts. Real Claude Code sessions are long agentic loops with extensive file reads. I’d bet Opus’s advantage grows with context length and task complexity. This benchmark doesn’t capture that.

Published April 12, 2026. Raw data + methodology: github.com/ComputeGauge/inferlane/benchmark. Questions or a re-run? File an issue — I want to be proven wrong if I’m wrong.