Benchmarks · May 2026 · 10,000-prompt run

~90% quality preserved.
98.4% cost reduction.

Measured quality, projected cost from a 10,000-prompt routed-vs-baseline run that routed a 70B model's traffic down to an 8B model — an extreme, reproducible case, not a typical bill. The 98.4% is a cost reduction (a pricing overlay), not a quality figure. Judged across 10 independent model families: preservation spans 61–97% (median ~90%); the open 7B judge (Mistral) sits at the lenient end (96.5%). Methodology, the full panel, caveats, and per-prompt JSON are below.

Quality preserved

~90%

10-judge median · 61–97% range (7B judge: 96.5%)

Cost reduction (projected)

98.35%

$162.79 → $2.68 · pricing overlay

Prompts judged

9,995

9 categories, 3 tiers

Scope

70B → 8B

extreme case — not a typical bill

Why this matters if you're on AWS Bedrock.

The benchmark ran on Llama-3.1-70B vs 8B for reproducible open models on our own GPUs. The cost figure is a projection: we applied public Claude Opus / Haiku rate-card pricing to the measured token counts — the same rates AWS Bedrock charges. It is not a measured Bedrock invoice.

The arithmetic: an all-Opus Bedrock workload pays about $15 input / $75 output per million tokens; the router sends the easy majority to Haiku at $0.80 input / $4 output — a ~19× per-token saving on routed work, at the quality preservation rate above. The same arithmetic applies if your traffic mix matches.

A Bedrock-direct re-run (the same eval, routed across bedrock-claude-opus-4-7, bedrock-claude-sonnet-4-6, bedrock-claude-haiku-4-5) is in flight. We'll publish that number here next to the Llama one when it lands.

Method.

For each of 10,000 prompts in our eval set:

Baseline pass — Llama-3.1-70B generated a response (Anthropic-Opus-class).
Routed pass — Llama-3.1-8B generated a response to the same prompt (Anthropic-Haiku-class).
Judge pass — Mistral-7B scored both responses on a 1–5 rubric (different model family from the candidates — avoids own-family bias).
Preserved = routed response scored ≥ 3 out of 5.
Cost projection — public Claude Opus / Haiku rate-card pricing applied to the measured token counts (a pricing overlay, not an invoiced bill).

Eval set: ~10,000 prompts from 16 public benchmarks (MMLU-Pro, GSM8K, HumanEval, BBH, PubMedQA, NarrativeQA, OASST1, Dolly, CNN/DailyMail, TruthfulQA, and more), stratified across 9 categories and 3 complexity tiers. One prompt failed to receive a judge score and was excluded — 9,995 judged, 9,649 preserved. (The published eval set has since grown to ~11.4K prompts; the linked file reflects the current, larger set.)

Infrastructure: the run took ~33 hours on 8× Tesla V100 (~$150 of GPU time).

Reproducibility: the judge is an open model (Mistral-7B); we re-judged the identical run with a 9-model cross-family panel (see caveats) — preservation spans 61–97%, median ~90%, no proprietary-API dependency. The eval set is public (linked above) and the per-prompt JSON (prompt, baseline output, routed output, judge raw text, and score for each row) is available on request — re-run the script and verify every number.

The breakdown.

Headline rate hides whether routing fails on a specific slice. It doesn't. All rates in this section are under the open 7B judge (the lenient end); the 10-judge cross-family panel runs lower on hard slices — see caveats.

Complexity tier	n	Preserved	Rate
Low	747	740	99.1%
Medium	5,302	5,219	98.4%
High	3,946	3,690	93.5%

Task category	n	Preserved	Rate
Long context	10	10	100%
Structured output	732	724	98.9%
Summarization	1,006	994	98.8%
Planning	988	976	98.8%
Creative	696	687	98.7%
Extraction	978	964	98.6%
Coding	179	176	98.3%
QA	2,691	2,626	97.6%
Reasoning	2,715	2,492	91.8%

Score distribution from the judge (1–5): 5: 1,686 · 4: 6,448 · 3: 1,515 · 2: 346 · 1: 0. Mode is 4 — "slightly worse but useful" — exactly what tier-down routing should look like. Reasoning is the weak spot (91.8%), consistent with what tier-down routing struggles with: multi-step chain-of-thought is where a smaller model loses ground first.

Honest caveats.

Read these before quoting the numbers internally:

Not a cloud-Claude eval yet. Llama on our GPUs, Claude pricing applied as the reference. Bedrock-direct re-run is queued; we'll publish that data side-by-side.
Judge model is small. Mistral-7B grading 70B output is a known weak spot. We re-judged the identical run with a 10-model cross-family panel (Qwen-32B, Gemma-27B, Yi-34B, Phi-14B, Command-R, Mixtral, DeepSeek-67B, InternLM2, Aya): preservation spans 61–97%, median ~90% — the 7B judge sits at the lenient end. Single-judge scores overstate quality, so we report the full panel, not one number.
Reasoning is the slice to watch. 91.8% preservation on reasoning prompts is the weakest cell. Multi-step chain-of-thought traffic should stay on the strong model until we get a better classifier for that slice.
Cost is projected, not invoiced. We measured input + output tokens during the run and applied published Claude rates as a pricing overlay. Your actual Bedrock invoice depends on your own token mix.
One prompt failed to receive a judge score (Mistral returned an unparseable response) and is excluded from the rate denominator — 9,995 judged, 9,649 preserved.
4 prompts removed post-hoc for brand-name hygiene (they mentioned a third-party company in news-style content from a public benchmark dataset). The eval set + result file in the private repo have been filtered. Headline numbers were recomputed from the filtered rows; the rate moved < 0.005 percentage points.
346 prompts scored 2/5 (significantly worse) — the routed model lost ground but didn't refuse or hallucinate. The confidence-aware escalation path would catch many of these in production by re-routing to the strong model on low confidence; this benchmark ran without escalation to keep the measurement clean.

~90% quality preserved.98.4% cost reduction.

Why this matters if you're on AWS Bedrock.

Method.

The breakdown.

Honest caveats.

Want this run on your own traffic?

~90% quality preserved.
98.4% cost reduction.