Benchmarks · May 2026 · 10,000-prompt run

96.5% quality preserved.
98.4% cost reduction.

Quality is measured; cost is projected. Both come from a 10,000-prompt routed-vs-baseline run that routed a 70B model's traffic down to an 8B model: an extreme, reproducible case, not a typical bill. The 98.4% is a cost reduction (a pricing overlay), not a quality figure. Responses were judged by a model from a separate family. Methodology, caveats, and the full per-prompt JSON are below.

Quality preserved: 96.54% (9,649 / 9,995 judged ≥ 3 of 5)
Cost reduction (projected): 98.35% ($162.75 → $2.68, pricing overlay)
Prompts judged: 9,995 (9 categories, 3 tiers)
Scope: 70B → 8B (extreme case, not a typical bill)

Why this matters if you're on AWS Bedrock.

The benchmark compared Llama-3.1-70B against Llama-3.1-8B, open models chosen for reproducibility, running on our own GPUs. The cost figure is a projection: we applied public Claude Opus / Haiku rate-card pricing, the same rates AWS Bedrock charges, to the measured token counts. It is not a measured Bedrock invoice.

The arithmetic: an all-Opus Bedrock workload pays about $15 input / $75 output per million tokens; the router sends the easy majority to Haiku at $0.80 input / $4 output — a ~19× per-token saving on routed work, at the quality preservation rate above. The same arithmetic applies if your traffic mix matches.
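As a quick check of the arithmetic above, the sketch below recomputes the per-token price ratio and the projected reduction from the figures quoted on this page. The function and variable names are ours, chosen for illustration. The rates and dollar totals are the ones stated above, not values read from a Bedrock invoice.

```python
# Rate-card prices quoted above, in USD per million tokens.
OPUS = {"input": 15.00, "output": 75.00}
HAIKU = {"input": 0.80, "output": 4.00}


def price_ratio(expensive: dict, cheap: dict) -> dict:
    """Per-token price ratio for each token direction (input, output)."""
    return {kind: expensive[kind] / cheap[kind] for kind in expensive}


def cost_reduction(baseline_usd: float, routed_usd: float) -> float:
    """Fractional cost reduction when spend drops from baseline to routed."""
    return 1.0 - routed_usd / baseline_usd


ratios = price_ratio(OPUS, HAIKU)
reduction = cost_reduction(162.75, 2.68)

print(ratios)               # both directions come out at 18.75, the "~19x" above
print(f"{reduction:.2%}")   # 98.35%
```

The ratio only describes prompts that are actually routed down. In this run every prompt was, which is why the projected reduction sits so close to the per-token ceiling.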

A Bedrock-direct re-run (the same eval, routed across bedrock-claude-opus-4-7, bedrock-claude-sonnet-4-6, bedrock-claude-haiku-4-5) is in flight. We'll publish that number here next to the Llama one when it lands.

Method.

For each of 10,000 prompts in our eval set:

  1. Baseline pass: Llama-3.1-70B, standing in for an Anthropic Opus-class model, generated a response.
  2. Routed pass: Llama-3.1-8B, standing in for an Anthropic Haiku-class model, generated a response to the same prompt.
  3. Judge pass: Mistral-7B scored both responses on a 1–5 rubric. The judge comes from a different model family than the candidates, which avoids own-family bias.
  4. Preserved: the routed response scored ≥ 3 out of 5.
  5. Cost projection: public Claude Opus / Haiku rate-card pricing applied to the measured token counts (a pricing overlay, not an invoiced bill).
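The steps above can be sketched as a small loop. This is a minimal illustration, not the script used for the run: `run_eval`, its parameters, and the toy stand-ins are hypothetical, and the judge is assumed to return the routed response's 1–5 score (or `None` when it fails to produce one).

```python
from typing import Callable, Iterable, Optional

PRESERVE_THRESHOLD = 3  # step 4: routed response scored >= 3 out of 5


def run_eval(
    prompts: Iterable[str],
    baseline: Callable[[str], str],                 # stand-in for the 70B model
    routed: Callable[[str], str],                   # stand-in for the 8B model
    judge: Callable[[str, str, str], Optional[int]],  # assumed: routed score, or None
) -> dict:
    """Run baseline, routed, and judge passes; count preserved prompts."""
    judged = 0
    preserved = 0
    for prompt in prompts:
        baseline_out = baseline(prompt)                  # step 1
        routed_out = routed(prompt)                      # step 2
        score = judge(prompt, baseline_out, routed_out)  # step 3
        if score is None:
            continue  # prompts without a judge score are excluded
        judged += 1
        if score >= PRESERVE_THRESHOLD:                  # step 4
            preserved += 1
    rate = preserved / judged if judged else 0.0
    return {"judged": judged, "preserved": preserved, "rate": rate}


# Toy stand-ins so the sketch runs without any model.
toy_scores = {"a": 4, "b": 2, "c": None, "d": 3}
result = run_eval(
    toy_scores,
    baseline=lambda p: f"long answer to {p}",
    routed=lambda p: f"short answer to {p}",
    judge=lambda p, b, r: toy_scores[p],
)
print(result["judged"], result["preserved"])  # 3 2
```

The cost projection (step 5) is a separate overlay on the measured token counts and is left out here.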

Eval set: ~10,000 prompts from 16 public benchmarks (MMLU-Pro, GSM8K, HumanEval, BBH, PubMedQA, NarrativeQA, OASST1, Dolly, CNN/DailyMail, TruthfulQA, and more), stratified across 9 categories and 3 complexity tiers. One prompt failed to receive a judge score and was excluded — 9,995 judged, 9,649 preserved. (The published eval set has since grown to ~11.4K prompts; the linked file reflects the current, larger set.)

Infrastructure: the run took ~33 hours on 8× Tesla V100 (~$150 of GPU time).

Reproducibility: the judge is an open model (Mistral-7B; the result held under a Qwen-32B re-judge), so the run carries no proprietary-API dependency. The eval set is public (linked above), and the per-prompt JSON (prompt, baseline output, routed output, judge raw text, and score for each row) is available on request, so you can re-run the script and verify every number.

The breakdown.

A headline rate can hide a failure on a specific slice. This one doesn't.

Complexity tier         n   Preserved    Rate
Low                   747         740   99.1%
Medium              5,302       5,219   98.4%
High                3,946       3,690   93.5%

Task category           n   Preserved    Rate
Long context           10          10    100%
Structured output     732         724   98.9%
Summarization       1,006         994   98.8%
Planning              988         976   98.8%
Creative              696         687   98.7%
Extraction            978         964   98.6%
Coding                179         176   98.3%
QA                  2,691       2,626   97.6%
Reasoning           2,715       2,492   91.8%
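The per-slice rates can be recomputed from the counts in the two tables. The sketch below copies the (n, preserved) pairs from above; the helper names are ours, for illustration only.

```python
# (n, preserved) pairs copied from the tables above.
TIERS = {
    "Low": (747, 740),
    "Medium": (5302, 5219),
    "High": (3946, 3690),
}
CATEGORIES = {
    "Long context": (10, 10),
    "Structured output": (732, 724),
    "Summarization": (1006, 994),
    "Planning": (988, 976),
    "Creative": (696, 687),
    "Extraction": (978, 964),
    "Coding": (179, 176),
    "QA": (2691, 2626),
    "Reasoning": (2715, 2492),
}


def rates(slices: dict) -> dict:
    """Preservation rate per slice, as a percentage rounded to one decimal."""
    return {name: round(100 * kept / n, 1) for name, (n, kept) in slices.items()}


def totals(slices: dict) -> tuple:
    """Sum of n and of preserved across all slices."""
    return (
        sum(n for n, _ in slices.values()),
        sum(kept for _, kept in slices.values()),
    )


print(rates(TIERS))        # {'Low': 99.1, 'Medium': 98.4, 'High': 93.5}
print(totals(TIERS))       # (9995, 9649)
print(totals(CATEGORIES))  # (9995, 9649)
```

Both breakdowns sum to the same 9,995 judged and 9,649 preserved reported in the headline.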

Score distribution from the judge (1–5): 5: 1,686 · 4: 6,448 · 3: 1,515 · 2: 346 · 1: 0. Mode is 4 — "slightly worse but useful" — exactly what tier-down routing should look like. Reasoning is the weak spot (91.8%), consistent with what tier-down routing struggles with: multi-step chain-of-thought is where a smaller model loses ground first.
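The score distribution ties back to the headline number directly: everything scored 3 or higher counts as preserved. A short check, using the counts quoted above (variable names are ours):

```python
# Judge score -> number of prompts, as quoted above.
SCORES = {5: 1686, 4: 6448, 3: 1515, 2: 346, 1: 0}

judged = sum(SCORES.values())
preserved = sum(count for score, count in SCORES.items() if score >= 3)
mode = max(SCORES, key=SCORES.get)  # most frequent score

print(judged, preserved, mode)        # 9995 9649 4
print(f"{preserved / judged:.2%}")    # 96.54%
```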

Honest caveats.

Read these before quoting the numbers internally:

Want this run on your own traffic?

Ship us a representative sample of your last seven days of Bedrock prompts (NDA available, on-premise replay option available). We hand back the same report — quality preservation, projected savings, per-task breakdown — on your prompts, not ours.
