Measured quality and projected cost from a 10,000-prompt routed-vs-baseline run that routed a 70B model's traffic down to an 8B model: an extreme, reproducible case, not a typical bill. The 98.4% is a cost reduction (a pricing overlay), not a quality figure. Outputs were judged by a separate model family. Methodology, caveats, and the full per-prompt JSON are below.
The benchmark compared Llama-3.1-70B against Llama-3.1-8B, open models chosen for reproducibility, on our own GPUs. The cost figure is a projection: we applied public Claude Opus / Haiku rate-card pricing (the same rates AWS Bedrock charges) to the measured token counts. It is not a measured Bedrock invoice.
The arithmetic: an all-Opus Bedrock workload pays about $15 input / $75 output per million tokens; the router sends the easy majority to Haiku at $0.80 input / $4 output — a ~19× per-token saving on routed work, at the quality preservation rate above. The same arithmetic applies if your traffic mix matches.
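The pricing overlay can be sketched in a few lines of Python. The rates are the ones quoted above; the function names and the token counts in the usage note are hypothetical, chosen only to illustrate the calculation, and are not the measured counts from the run.

```python
# Rate-card prices quoted in the text, in USD per million tokens.
OPUS = {"input": 15.00, "output": 75.00}
HAIKU = {"input": 0.80, "output": 4.00}


def cost_usd(prices, input_tokens, output_tokens):
    """Cost of a batch of tokens at the given per-million-token prices."""
    return (input_tokens * prices["input"]
            + output_tokens * prices["output"]) / 1_000_000


def price_ratio(kind):
    """How many times more Opus costs than Haiku for 'input' or 'output' tokens."""
    return OPUS[kind] / HAIKU[kind]


def cost_reduction(baseline_cost, routed_cost):
    """Fractional saving of the routed run relative to the all-Opus baseline."""
    return 1 - routed_cost / baseline_cost


def projected_reduction(baseline_tokens, opus_tokens, haiku_tokens):
    """Apply the overlay to (input, output) token-count pairs.

    baseline_tokens: tokens used by the all-Opus baseline run
    opus_tokens:     tokens the router still sent to Opus
    haiku_tokens:    tokens the router sent down to Haiku
    """
    baseline = cost_usd(OPUS, *baseline_tokens)
    routed = cost_usd(OPUS, *opus_tokens) + cost_usd(HAIKU, *haiku_tokens)
    return cost_reduction(baseline, routed)
```

For example, `projected_reduction((1_000_000, 1_000_000), (100_000, 100_000), (900_000, 900_000))` models a hypothetical workload in which 90% of tokens are routed down. The result depends entirely on the token counts you supply, so substitute your own measured counts before quoting a figure.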
A Bedrock-direct re-run (the same eval, routed across bedrock-claude-opus-4-7,
bedrock-claude-sonnet-4-6, bedrock-claude-haiku-4-5) is in flight. We'll
publish that number here next to the Llama one when it lands.
For each of 10,000 prompts in our eval set:
Eval set: ~10,000 prompts from 16 public benchmarks (MMLU-Pro, GSM8K, HumanEval, BBH, PubMedQA, NarrativeQA, OASST1, Dolly, CNN/DailyMail, TruthfulQA, and more), stratified across 9 categories and 3 complexity tiers. One prompt failed to receive a judge score and was excluded — 9,995 judged, 9,649 preserved. (The published eval set has since grown to ~11.4K prompts; the linked file reflects the current, larger set.)
Infrastructure: the run took ~33 hours on 8× Tesla V100 (~$150 of GPU time).
Reproducibility: the judge is an open model (Mistral-7B, with a Qwen-32B re-judge that held), so the run carries no proprietary-API dependency. The eval set is public (linked above) and the per-prompt JSON (prompt, baseline output, routed output, judge raw text, and score for each row) is available on request — re-run the script and verify every number.
A headline rate can hide a slice where routing fails. That is not the case here: every complexity tier and task category stays above 91%.
| Complexity tier | n | Preserved | Rate |
|---|---|---|---|
| Low | 747 | 740 | 99.1% |
| Medium | 5,302 | 5,219 | 98.4% |
| High | 3,946 | 3,690 | 93.5% |
| Task category | n | Preserved | Rate |
|---|---|---|---|
| Long context | 10 | 10 | 100% |
| Structured output | 732 | 724 | 98.9% |
| Summarization | 1,006 | 994 | 98.8% |
| Planning | 988 | 976 | 98.8% |
| Creative | 696 | 687 | 98.7% |
| Extraction | 978 | 964 | 98.6% |
| Coding | 179 | 176 | 98.3% |
| QA | 2,691 | 2,626 | 97.6% |
| Reasoning | 2,715 | 2,492 | 91.8% |
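The rates in both tables can be recomputed from the counts. The following is a minimal check, with the counts copied from the tables above; the dictionary and function names are illustrative only.

```python
# (n, preserved) pairs copied from the two tables above.
TIERS = {
    "Low": (747, 740),
    "Medium": (5302, 5219),
    "High": (3946, 3690),
}
CATEGORIES = {
    "Long context": (10, 10),
    "Structured output": (732, 724),
    "Summarization": (1006, 994),
    "Planning": (988, 976),
    "Creative": (696, 687),
    "Extraction": (978, 964),
    "Coding": (179, 176),
    "QA": (2691, 2626),
    "Reasoning": (2715, 2492),
}


def rate(n, preserved):
    """Preservation rate as a percentage, rounded to one decimal place."""
    return round(100 * preserved / n, 1)


def totals(table):
    """Sum the n and preserved columns of a table."""
    n = sum(row[0] for row in table.values())
    preserved = sum(row[1] for row in table.values())
    return n, preserved


if __name__ == "__main__":
    for name, table in (("tier", TIERS), ("category", CATEGORIES)):
        print(name, "totals:", totals(table))
        for label, (n, preserved) in table.items():
            print(f"  {label}: {rate(n, preserved)}%")
```

Both tables sum to the same 9,995 judged and 9,649 preserved prompts, which is a quick way to confirm that the two breakdowns partition the same run.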
Score distribution from the judge (1–5): 5: 1,686 · 4: 6,448 · 3: 1,515 · 2: 346 · 1: 0.
Mode is 4 — "slightly worse but useful" — exactly what tier-down routing should look like.
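The score distribution can be cross-checked against the preserved count. This sketch uses the counts quoted above; note that reading "preserved" as "judge score of 3 or higher" is an inference from the numbers lining up, not a threshold stated in the text.

```python
# Judge score -> number of prompts, copied from the distribution above.
SCORES = {5: 1686, 4: 6448, 3: 1515, 2: 346, 1: 0}


def total_judged(scores):
    """Total number of prompts that received a judge score."""
    return sum(scores.values())


def mode_score(scores):
    """The score assigned most often."""
    return max(scores, key=scores.get)


def count_at_or_above(scores, threshold):
    """Number of prompts scored at or above the threshold."""
    return sum(count for score, count in scores.items() if score >= threshold)


def mean_score(scores):
    """Count-weighted average judge score."""
    weighted = sum(score * count for score, count in scores.items())
    return weighted / total_judged(scores)


if __name__ == "__main__":
    print("judged:", total_judged(SCORES))
    print("mode:", mode_score(SCORES))
    # Assumption: a score of 3 or higher counts as "preserved".
    print("scored >= 3:", count_at_or_above(SCORES, 3))
    print("mean:", round(mean_score(SCORES), 2))
```

The distribution sums to the 9,995 judged prompts, and the prompts scored 3 or higher sum to 9,649, matching the preserved count reported for the run.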
Reasoning is the weak spot (91.8%), consistent with what tier-down routing struggles with:
multi-step chain-of-thought is where a smaller model loses ground first.
Read these before quoting the numbers internally:
Ship us a representative sample of your last seven days of Bedrock prompts (NDA available, on-premise replay option available). We hand back the same report — quality preservation, projected savings, per-task breakdown — on your prompts, not ours.