Evaluation Report: Held-out v2 (April 2026)¶
Authoritative 6-provider comparison under the autoresearch v2 framework. Dev/held-out dataset split, champion prompts ported across providers, dual-judge scoring. Supersedes the v1 benchmark report and v1 smoke report for cloud API providers; v1 reports remain authoritative for Ollama and hybrid_ml pending v2 re-run.
| Field | Value |
|---|---|
| Date | April 2026 (2026-04-15) |
| Framework | autoresearch v2 (RFC-073) |
| Dev dataset | curated_5feeds_dev_v1 (10 ep, e01+e02 per feed) — iteration only |
| Held-out dataset | curated_5feeds_benchmark_v2 (5 ep, e03 per feed, ~32 min each) — never used during tuning |
| Silver (paragraphs) | silver_sonnet46_dev_v1_paragraph, silver_sonnet46_benchmark_v2_paragraph |
| Silver (bullets) | silver_sonnet46_dev_v1_bullets, silver_sonnet46_benchmark_v2_bullets |
| Judges | gpt-4o-mini + claude-haiku-4-5-20251001 (dual, fraction-based contestation) |
| Providers evaluated | OpenAI, Anthropic, Gemini, Mistral, DeepSeek, Grok |
For metric definitions and interpretation guidance, see the Evaluation Reports index. For the framework design rationale (why dev/held-out split, why fraction-based contestation, etc.), see RFC-073.
Six API providers evaluated end-to-end under the same framework with the same champion prompts (ported from OpenAI r7 champion). All champions validated on held-out content they were never tuned against. This is the authoritative reference for provider selection.
Headline Matrix (held-out, authoritative)¶
Final blended score = 0.70 * ROUGE-L + 0.30 * judge_mean. Higher better.
| Track | Approach | OpenAI | Anthropic | Gemini | Mistral | DeepSeek | Grok |
|---|---|---|---|---|---|---|---|
| Bullets | Bundled | 0.505 | 0.552 | 0.473 | 0.487 | 0.511 | 0.489 |
| Bullets | Non-bundled | 0.566 | 0.570 | 0.562 | 0.537 | 0.586 | 0.553 |
| Paragraph | Bundled | 0.469 | 0.548 | 0.461 | 0.487 | 0.523 | 0.481 |
| Paragraph | Non-bundled | 0.481 | 0.522 | 0.463 | 0.456 | 0.541 | 0.479 |
Cell winners on quality alone (held-out):
| Cell | Winner | Score |
|---|---|---|
| Bullets · Bundled | Anthropic haiku-4.5 | 0.552 |
| Bullets · Non-bundled | DeepSeek deepseek-chat | 0.586 |
| Paragraph · Bundled | Anthropic haiku-4.5 | 0.548 |
| Paragraph · Non-bundled | DeepSeek deepseek-chat | 0.541 |
Quality is one dimension. Once latency and cost are counted, Gemini 2.0-flash is the default pick for balanced use cases (0.562 quality, 2.0s/ep, $0.00035/ep). See the Compound analysis § Pareto frontier and Recommended option order by use case for full quality × latency × cost tradeoffs.
ROUGE-L breakdown (held-out)¶
| Track | Approach | OpenAI | Anthropic | Gemini | Mistral | DeepSeek | Grok |
|---|---|---|---|---|---|---|---|
| Bullets | Bundled | 33.2% | 39.3% | 28.5% | 30.4% | 34.4% | 30.5% |
| Bullets | Non-bundled | 39.6% | 40.7% | 40.1% | 37.3% | 43.1% | 38.6% |
| Paragraph | Bundled | 29.5% | 39.2% | 26.6% | 30.3% | 35.9% | 28.9% |
| Paragraph | Non-bundled | 31.7% | 36.4% | 29.3% | 27.8% | 40.7% | 31.3% |
Dev scores (for generalisation sanity-check)¶
| Track | Approach | OpenAI | Anthropic | Gemini | Mistral | DeepSeek | Grok |
|---|---|---|---|---|---|---|---|
| Bullets | Bundled | 0.476 | 0.549 | 0.489 | 0.475 | 0.519 | 0.502 |
| Bullets | Non-bundled | 0.564 | 0.572 | 0.566 | 0.521 | 0.572 | 0.511 |
| Paragraph | Bundled | 0.467 | 0.523 | 0.462 | 0.479 | 0.479 | 0.441 |
| Paragraph | Non-bundled | 0.460 | 0.526 | 0.472 | 0.456 | 0.536 | 0.478 |
All 24 champions generalise within ±5% dev→held-out. No overfitting caught.
Local models (Ollama)¶
Eleven local models evaluated as a full local-vs-cloud comparison. Same champion prompts ported from OpenAI r7. Non-bundled for all eleven; bundled also evaluated for the three 9B-35B models with reliable JSON mode (qwen3.5:9b, qwen3.5:35b, mistral-small3.2). qwen2.5:32b v2 not evaluated — qwen3.5 generation (9b, 27b, 35b) covers the Qwen family; qwen2.5:32b would not change recommendations.
Held-out non-bundled (5 ep e03) — sorted by bullets final¶
| Model | Size | Bullets ROUGE-L | Bullets final | Paragraph ROUGE-L | Paragraph final |
|---|---|---|---|---|---|
| qwen3.5:9b | 9B | 42.8% | 0.580 | 34.8% | 0.505 |
| qwen3.5:35b | 35B | 41.3% | 0.576 | 32.5% | 0.325 (contested) |
| qwen3.5:27b | 27B | 36.3% | 0.543 | 35.0% | 0.499 |
| mistral-small3.2 | ~22B | 36.1% | 0.536 | 28.8% | 0.288 (contested) |
| mistral:7b | 7B | 36.2% | 0.526 | 31.2% | 0.475 |
| llama3.1:8b | 8B | 33.8% | 0.518 | 30.5% | 0.305 (contested 5/5) |
| llama3.2:3b | 3B | 33.0% | 0.501 | 27.0% | 0.270 (contested) |
| mistral-nemo:12b | 12B | 30.4% | 0.497 | 27.1% | 0.445 |
| gemma2:9b | 9B | 30.3% | 0.492 | 28.0% | 0.453 |
| qwen2.5:7b | 7B | 29.6% | 0.477 | 28.6% | 0.463 |
| phi3:mini | 3.8B | 31.9% | 0.475 | 19.6% | 0.196 (contested) |
Dev non-bundled (10 ep e01+e02)¶
| Model | Bullets | Paragraph |
|---|---|---|
| qwen3.5:9b | 0.571 | 0.493 |
| qwen3.5:35b | 0.560 | 0.491 |
| qwen3.5:27b | 0.545 | 0.512 |
| mistral-small3.2 | 0.551 | 0.497 |
| mistral:7b | 0.512 | 0.472 |
| llama3.2:3b | 0.533 | 0.441 |
| llama3.1:8b | 0.497 | 0.455 |
| mistral-nemo:12b | 0.505 | 0.450 |
| gemma2:9b | 0.495 | 0.459 |
| qwen2.5:7b | 0.461 | 0.443 |
| phi3:mini | 0.483 | 0.209 (contested) |
Held-out bundled (5 ep e03)¶
| Model | Bullets ROUGE-L | Bullets final | Paragraph ROUGE-L | Paragraph final |
|---|---|---|---|---|
| qwen3.5:9b | 35.8% | 0.529 | 33.1% | 0.509 |
| qwen3.5:35b | 33.3% | 0.514 | 30.8% | 0.492 |
| mistral-small3.2 | 30.3% | 0.488 | 27.5% | 0.468 |
All 12 bundled cells ran cleanly with zero contestation — a stark contrast to the non-bundled paragraph track where 3 of 4 models contested. The structured JSON schema stabilises output across judges.
Dev (10 ep e01+e02)¶
| Model | Bullets final | Paragraph final |
|---|---|---|
| qwen3.5:9b | 0.571 | 0.493 |
| qwen3.5:35b | 0.560 | 0.491 |
| mistral-small3.2 | 0.551 | 0.497 |
| llama3.2:3b | 0.533 | 0.441 |
Full matrix with latency and cost (non-bundled bullets held-out)¶
Per-episode latency from actual run metrics (baseline.json stats.avg_time_seconds). Cost
per episode computed from approximate transcript size (~2.7k input tokens + ~500 output tokens)
and Mar 2026 pricing.
| Rank | Provider+model | Final | ROUGE-L | Latency | $/ep | Q/sec | Q/$ |
|---|---|---|---|---|---|---|---|
| 1 | DeepSeek (deepseek-chat) | 0.586 | 43.1% | 10.2s | $0.00052 | 0.058 | 1131 |
| 2 | Ollama qwen3.5:9b | 0.580 | 42.8% | 33.3s | $0 | 0.017 | ∞ |
| 3 | Ollama qwen3.5:35b | 0.576 | 41.3% | 23.1s | $0 | 0.025 | ∞ |
| 4 | Anthropic (haiku-4.5) | 0.570 | 40.7% | 4.8s | $0.00416 | 0.119 | 137 |
| 5 | OpenAI (gpt-4o) | 0.566 | 39.6% | 4.6s | $0.01175 | 0.123 | 48 |
| 6 | Gemini 2.5-flash-lite (new) | 0.564 | 39.6% | 1.5s | ~$0.00047 | 0.376 | 1200 |
| 7 | Gemini 2.0-flash | 0.562 | 40.1% | 2.0s | $0.00035 | 0.281 | 1606 |
| 8 | Grok (grok-3-mini) | 0.553 | 38.6% | 19.4s | $0.00106 | 0.029 | 522 |
| 9 | Ollama qwen3.5:27b† | 0.543 | 36.3% | 505s† | $0 | 0.001† | ∞ |
| 10 | gpt-4o-mini (new) | 0.540 | 37.1% | 6.6s | ~$0.00074 | 0.082 | 730 |
| 11 | Mistral (mistral-small-latest) | 0.537 | 37.3% | 2.2s | $0.00084 | 0.244 | 639 |
| 12 | Ollama mistral-small3.2 | 0.536 | 36.1% | 79.2s | $0 | 0.007 | ∞ |
| 13 | Ollama mistral:7b | 0.526 | 36.2% | 28.9s | $0 | 0.018 | ∞ |
| 14 | Ollama llama3.1:8b | 0.518 | 33.8% | 24.5s | $0 | 0.021 | ∞ |
| 15 | Ollama llama3.2:3b | 0.501 | 33.0% | 12.2s | $0 | 0.041 | ∞ |
| 16 | Ollama mistral-nemo:12b | 0.497 | 30.4% | 33.4s | $0 | 0.015 | ∞ |
| 17 | Ollama gemma2:9b | 0.492 | 30.3% | 28.6s | $0 | 0.017 | ∞ |
| 18 | Mistral-medium (new, surprisingly weak) | 0.488 | 29.1% | 5.9s | ~higher | — | worse |
| 19 | Ollama qwen2.5:7b | 0.477 | 29.6% | 22.9s | $0 | 0.021 | ∞ |
| 20 | Ollama phi3:mini | 0.475 | 31.9% | 17.7s | $0 | 0.027 | ∞ |
†qwen3.5:27b bullets held-out shows anomalous 505s — likely cold-start / model swap overhead from Ollama. Paragraph cell on same model was 188s, more typical. Treat this cell as warm-up noise; realistic per-episode latency is probably in the ~100s range.
Compound analysis — Pareto frontier¶
Three dimensions worth considering: quality (final score), latency (s/ep), cost ($/ep). A pick is on the Pareto frontier if no other option is strictly better on all three.
Pareto-optimal cloud:
- Gemini 2.0-flash — cheapest + fastest + not bottom-quality. Dominates on cost + latency.
- Anthropic haiku-4.5 — 2nd fastest, 4th quality, mid cost. Middle-of-frontier.
- DeepSeek — #1 quality, middling latency, very cheap. Quality-first frontier.
Pareto-optimal local:
- Ollama qwen3.5:9b — #1 local quality at moderate local latency.
- Ollama llama3.2:3b — fastest local (12s) at acceptable quality (0.501).
Dominated picks (avoid unless ecosystem-locked):
- OpenAI gpt-4o: Anthropic has higher quality, same latency, 3× cheaper. Nothing gained.
- Grok: slower than Anthropic, lower quality. No wins.
- Mistral cloud: faster than DeepSeek but lower quality; Gemini beats it on all three axes.
- Most local models (mistral:7b, llama3.1:8b, qwen2.5:7b, gemma2:9b, mistral-nemo:12b, phi3, mistral-small3.2, qwen3.5:27b, qwen3.5:35b): dominated by qwen3.5:9b on quality or llama3.2:3b on speed.
Recommended option order by use case¶
A. Quality first — you can pay, you can wait.
- DeepSeek non-bundled (0.586, 10s, $0.0005) — top quality, very cheap, worst-case a few seconds.
- Anthropic haiku-4.5 bundled (0.552, 7s, $0.004) — if you need single-call title+summary+bullets.
- Ollama qwen3.5:9b (0.580, 33s) — only if cloud is off the table.
B. Cost first — cloud, quality secondary.
- Gemini 2.0-flash non-bundled bullets (0.562, 2s, $0.00035) — 3× cheaper than DeepSeek, almost as fast, ~4% lower quality.
- DeepSeek non-bundled (0.586, 10s, $0.0005) — for 50% more cost, +4% quality, 5× slower.
- Mistral cloud (0.537, 2s, $0.00084) — only if Gemini locked out for some reason.
C. Throughput / latency first — batch processing, real-time serving.
- Gemini 2.0-flash (2.0s) — 2× faster than Anthropic, 5× faster than DeepSeek. Cheapest too.
- Mistral cloud (2.2s) — similar latency, weaker quality.
- Anthropic haiku-4.5 (4.8s) — fastest on the quality frontier.
D. Privacy / offline first — local only, no external calls.
- Ollama qwen3.5:9b (0.580, 33s) — best local quality. Close to DeepSeek cloud.
- Ollama llama3.2:3b (0.501, 12s) — for resource-constrained devices; faster, lower quality.
- Ollama mistral:7b (0.526, 29s) — only non-Qwen/Llama worth considering locally.
E. Balanced — you want all three: reasonable quality + fast + cheap.
- Anthropic haiku-4.5 non-bundled (0.570, 4.8s, $0.004) — on the frontier for balance.
- Gemini 2.0-flash non-bundled (0.562, 2s, $0.00035) — slightly worse quality, massively better latency+cost. Almost always the right balanced pick.
- DeepSeek non-bundled (0.586, 10s, $0.0005) — quality premium over Gemini for 5× latency.
The short answer¶
For most production deployments, Gemini 2.0-flash non-bundled bullets is the correct first pick. It sits on every Pareto frontier, wins on latency and cost, and loses only ~4% on quality vs the absolute best. Upgrade to DeepSeek (quality), Anthropic bundled (single-call), or Ollama qwen3.5:9b (privacy) only when a specific dimension justifies the tradeoff.
Headline finding: qwen3.5:9b (open-weights, local, free) lands 2nd in the whole matrix
for bullets held-out, 0.3pp ROUGE-L behind DeepSeek. On-prem / offline deployments can
essentially match the best cloud option on bullets quality. Seven of the top 11 entries are
local — local models are fully competitive with cloud APIs on this workload.
Size vs quality for Qwen3.5 family: 9B (0.580) > 35B (0.576) > 27B (0.543). The smallest
is the strongest — suggests the 9B is the pareto-optimal choice and the larger models add cost
without proportional gain. Similar pattern at 7B size: mistral:7b (0.526) noticeably beats
qwen2.5:7b (0.477) — family generation matters more than size at this scale.
Non-bundled paragraph contests on several local models; bundled fixes it¶
Five of eleven local models contested on held-out non-bundled paragraph (judges diverged >40% of episodes): llama3.2:3b, llama3.1:8b (5/5!), qwen3.5:35b, mistral-small3.2, phi3:mini. Six models were uncontested: qwen3.5:9b (2/5 under threshold), qwen3.5:27b, mistral:7b, mistral-nemo:12b, gemma2:9b, qwen2.5:7b.
Pattern: contestation is not strictly tied to size or generation. Contestation correlates with structural inconsistency across episodes, which is somewhat model-specific and hard to predict.
When the 3 largest bundled-evaluated models are run in bundled mode (same prompt content, wrapped in the JSON schema), all three produce uncontested paragraph output at meaningfully higher final scores:
| Model | Non-bundled paragraph final | Bundled paragraph final | Gain |
|---|---|---|---|
| qwen3.5:9b | 0.505 (2/5 contested) | 0.509 | +0.8% (and stable) |
| qwen3.5:35b | 0.325 (contested, ROUGE-only) | 0.492 | +51% |
| mistral-small3.2 | 0.288 (contested, ROUGE-only) | 0.468 | +63% |
The JSON schema appears to force local models into more consistent paragraph structure that judges score reliably. This mirrors the Anthropic cloud finding (bundled paragraph ≥ non-bundled paragraph) and inverts the assumption that bundled always has an attention-split penalty.
Practical takeaway for local deployment: use bundled mode for paragraph output on local models. The single-call cost is lower AND the quality is higher (or equal) than non-bundled. For bullets, non-bundled is still better (qwen3.5:9b non-bundled 0.580 vs bundled 0.529) — same attention-split penalty as on cloud providers (except Anthropic).
Three storylines¶
1. Non-bundled: DeepSeek is the sleeper winner¶
DeepSeek (deepseek-chat) narrowly beats Anthropic on both non-bundled tracks:
- Bullets: 0.586 vs Anthropic 0.570 (+2.8%)
- Paragraph: 0.541 vs Anthropic 0.522 (+3.6%)
DeepSeek leads ROUGE-L by 2-4pp on both non-bundled cells while holding judge agreement. This was not obvious from v1 numbers where DeepSeek looked mid-pack. The v2 framework surfaced it.
2. Bundled: Anthropic is in a league of its own¶
Anthropic's bundled scores essentially match its non-bundled scores — the attention-split penalty that costs OpenAI ~12% and Gemini ~19% on bundled bullets simply does not materialise on Haiku 4.5. Anthropic bundled paragraph (0.548) even beats Anthropic non-bundled paragraph (0.522).
For any workload where bundled (title + summary + bullets in one call) is the preferred shape, Anthropic is the only provider where it's not a quality compromise.
6. ML and hybrid_ml pipelines under v2 (2026-04-16)¶
Historical v1 ML champions re-run under the v2 framework to isolate framework effect from model effect.
Paragraph held-out (v1 results vs v2 under same configs):
| Config | v1 ROUGE-L | v2 Dev final | v2 Held-out final | v2 Held-out ROUGE-L | v2 judge_mean |
|---|---|---|---|---|---|
| bart-led (pure ML) | 20.5% | 0.230 | 0.206 | 19.4% | 0.23 |
| hybrid bart + llama3.2:3b | 21.1% | 0.402 | 0.430 | 25.4% | 0.84 |
| hybrid bart + qwen3.5:9b (new v2 swap) | — | 0.421 | 0.448 | 26.9% | 0.87 |
Findings:
-
Framework alone doesn't save bart-led. v1 → v2 on same config = essentially flat ROUGE-L (20.5% → 19.4%). Judge-mean 0.23 tells the story: Sonnet-4.6 silver's standard applied rigorously scores BART's output as genuinely poor, and no rubric change recovers that.
-
Hybrid's REDUCE stage does most of the work. Swapping llama3.2:3b → qwen3.5:9b as REDUCE only lifted hybrid by +4% (0.430 → 0.448). Most of hybrid's score comes from the LLM REDUCE, not from BART MAP.
-
Surprising: BART MAP helps a weak REDUCE, but hurts a capable one.
| Scenario | Standalone | Hybrid | Winner |
|---|---|---|---|
| Weak REDUCE (llama3.2:3b) | 0.270 (contested) | 0.430 | Hybrid (BART preprocessing stabilises output) |
| Capable REDUCE (qwen3.5:9b) | 0.509 bundled | 0.448 | Standalone (BART chunking loses information) |
The hybrid pipeline's value is tied to the weakness of the REDUCE model. With qwen3.5:9b now available, standalone beats hybrid.
Practical consequence: the hybrid pipeline (ml_hybrid_bart_*) loses its reason to
exist as a default recommendation. Retained in docs for narrow niches (truly
memory-constrained, explainability, etc.) but qwen3.5:9b standalone bundled is the
new Tier 3 pick.
6a. ML transformers standalone (HF, not Ollama) — 2026-04-16¶
Two Hugging Face transformer models evaluated as "pure ML, no Ollama daemon" alternatives,
run via scripts/eval/experiment/run_summllama_v2.py and scripts/eval/experiment/run_longt5xl_v2.py on Apple
MPS. Scored with the same v2 harness (ROUGE-L + dual-judge blended scalar).
Paragraph:
| Model | Params | Dev final | Held-out final | Held-out ROUGE-L | Held-out judge | Contested | Avg s/ep (held-out) |
|---|---|---|---|---|---|---|---|
| SummLlama3.2-3B (DISLab, DPO-tuned) | 3B | 0.442 | 0.485 | 32.6% | 0.856 | 0/5 | 66s |
| Long-T5-XL book-summary (pszemraj) | ~3B | — | 0.192 (contested) | 19.2% | 0.527 | 3/5 | 1096s |
Bullets (SummLlama only; Long-T5-XL skipped after paragraph null result):
| Model | Dev final | Held-out final | Held-out ROUGE-L | Held-out judge | Contested | Avg s/ep (held-out) |
|---|---|---|---|---|---|---|
| SummLlama3.2-3B | 0.467 | 0.416 | 26.6% | 0.764 | 0/5 | 156s |
SummLlama3.2-3B is the headline finding — for paragraph. Same Llama-3.2-3B base that scored 0.270 (standalone, contested 5/5) — with DPO fine-tuning on faithfulness/completeness/conciseness preferences, it lifts to 0.485 held-out paragraph, zero contestation, 0.856 judge mean. That's +80% over the same-base standalone and within 5% of qwen3.5:9b bundled (0.509, the top local pick).
Bullets is weaker (0.416 held-out, -14% vs paragraph). DPO was aligned on prose
summarization preferences, not list structure. Judge_mean drops from 0.856 (paragraph) to
0.764 (bullets) — judges see paragraphs as genuinely good, bullets as merely acceptable.
Still uncontested (0/5) and above all pure-ML baselines, but below Ollama llama3.2:3b
bullets (0.501). Practical rule: prefer SummLlama for paragraph-first deployments; if
bullets are equally important, qwen3.5:9b bundled remains the better local pick (0.529
bullets / 0.509 paragraph).
Notable properties (both tracks):
- No Ollama daemon required — runs via HF
transformerson MPS directly. Smaller operational footprint than Ollama (no server, no model swap). - Zero contestation on both dev (0/10) and held-out (0/5). The DPO alignment on the rubric axes our judges score against removes the instability that plagues same-size Llama/Qwen standalone paragraph runs.
- Latency 31-66s/ep on MPS — slower than Ollama on the same laptop (llama3.2:3b via Ollama is ~12s) but operationally simpler for deployments that already use HF transformers for BART.
Long-T5-XL is a clear null result. The pszemraj/long-t5-tglobal-xl-16384-book-summary
checkpoint produces 870-2800 char summaries (BART produces ~2000-4000 chars; SummLlama
produces ~3000-5000 chars) — far shorter than our target. ROUGE-L 19%, judge 0.53, 3/5
contested. The book-summary domain transfer does not hold on podcast transcripts. Paired
with prohibitive MPS beam-search latency (1096s/ep), this is unsuitable for our workload
regardless of how predictions are scored. Run artifacts kept; no config created (run
was driven directly by scripts/eval/experiment/run_longt5xl_v2.py).
Practical consequences:
- SummLlama3.2-3B is a Tier 2.5 paragraph pick. 0.485 held-out paragraph from a 3B model via HF transformers alone — massively above bart-led (0.206) and within 5% of qwen3.5:9b bundled (0.509). Bullets is weaker (0.416) — the DPO alignment is prose- shaped. Use it when paragraph is the primary output or when Ollama can't be run.
- Pure HF-transformers paragraph pipeline is viable for deployments that don't want Ollama dependency (e.g. strict airgapped environments, embedded services). Trade-off: slower per-episode latency vs Ollama on the same hardware (~60-150s vs 12-33s).
- Tier 2 prod default (
ml_bart_led_autoresearch_v1) unchanged this session. Flipping to SummLlama requires real ML-provider integration (registry entry, factory wiring,ml_summllama32_standalone_v1mode) — that's a follow-up. The eval data justifies the flip when that work is done. - Long-T5-XL is not competitive — do not recommend for podcast summarisation.
5. Q4_0 → Q4_K_M quant upgrade tested 2026-04-16 (null-to-negative result)¶
Research predicted Q4_K_M would give "free quality" on phi3:mini, gemma2:9b, and mistral-nemo:12b (all currently Q4_0 in Ollama's defaults). Tested; not supported.
| Model | Q4_0 held-out (B/P) | Q4_K_M held-out (B/P) | Result |
|---|---|---|---|
| phi3:mini | 0.475 / 0.196 | 0.069 / 0.079 | BROKEN — 4k variant truncates; judge_mean=0 |
| gemma2:9b | 0.492 / 0.453 | 0.473 / 0.270 (contested 3/5) | Bullets slightly worse; paragraph worse |
| mistral-nemo:12b | 0.497 / 0.445 | 0.509 / 0.292 (contested 3/5) | Bullets slightly better; paragraph contests |
Notes:
- phi3:mini has two Ollama variants: 4k context (
phi3:3.8b-mini-4k-instruct-q4_K_M) and 128k context. Research recommended the 4k variant because 128k "produces gibberish above 4k tokens," but held-out prompts are ~9k tokens, so 4k truncates silently. phi3:mini is structurally too small for our workload regardless of quant. - Mistral-nemo and gemma2 show paragraph contestation appearing where Q4_0 was uncontested — at least part of the "measurement" is judge variance, not model quality. Possibly different quants tickle different judge disagreements.
- Net: don't chase Q4 quant upgrades on these specific models for this workload. If re-quantising, go straight to Q6_K / Q8_0 for the small models and use the 128k-context phi3 variant to at least avoid truncation.
Quant upgrade configs kept (traceable as _q4km variants) but not recommended for use.
4. Cheaper-tier variants added 2026-04-16¶
Three cheaper-tier variants tested after the initial sweep, both non-bundled and bundled:
Non-bundled (held-out):
| Model | Bullets | Paragraph | vs comparable existing | Verdict |
|---|---|---|---|---|
| gemini-2.5-flash-lite | 0.564 | 0.479 | 2.0-flash: 0.562 / 0.463 | Strict upgrade — better on both tracks, same cost tier |
| gpt-4o-mini | 0.540 | 0.469 | gpt-4o: 0.566 / 0.481 (16× more expensive) | Viable cost-tier pick; 4% quality hit buys 16× cost savings |
| mistral-medium | 0.488 | 0.421 | mistral-small: 0.537 / 0.456 | Poor on non-bundled; but see bundled below |
Bundled (held-out):
| Model | Bullets | Paragraph | Δ vs its own non-bundled | Verdict |
|---|---|---|---|---|
| gemini-2.5-flash-lite | 0.536 | 0.462 | bullets −5%, paragraph −4% | Same "bundled < non-bundled" pattern as other Geminis/OpenAIs |
| gpt-4o-mini | 0.483 | 0.459 | bullets −10.6% | Classic OpenAI bundled attention-split penalty |
| mistral-medium | 0.533 | 0.480 | bullets +9%, paragraph +14%! | Flip: good in bundled, bad non-bundled. Joins Anthropic/Ollama-Qwen as "bundled > non-bundled" |
Practical consequences:
- gemini-2.5-flash-lite non-bundled becomes the new "balanced default" pick — replacing gemini-2.0-flash
- gpt-4o-mini non-bundled is a reasonable 16×-cheaper gpt-4o alternative; bundled is not
- mistral-medium is only viable in bundled mode — a genuine surprise that adds a 4th provider to the "bundled is fine" club (Anthropic, Ollama Qwen, now mistral-medium)
Practical consequence: gemini-2.5-flash-lite should replace 2.0-flash as the cheap-tier Gemini pick in the recommendations. gpt-4o-mini is a defensible OpenAI pick for cost-sensitive deployments; note that champion prompts were tuned for gpt-4o specifically — dedicated prompt tuning for gpt-4o-mini could close some of the 4% quality gap.
3. Gemini is the balanced-default champion once latency/cost are counted¶
DeepSeek and Anthropic win on quality; Gemini wins on everything else. At only ~4% below top quality (0.562 vs 0.586), it is:
- 2-5× faster than all other cloud providers (2.0s/ep vs 4.8-19s elsewhere)
- Cheapest cloud option ($0.00035/ep vs $0.0005-$0.012 elsewhere)
- On every Pareto frontier in the quality/latency/cost space
For any application where you care about all three dimensions, Gemini is the default first pick. Upgrade to DeepSeek only when the 4% quality premium matters more than 5× latency and 50% cost.
Bundled viability per provider (non-bundled − bundled, higher = bigger bundled penalty)¶
| Provider | Bullets gap | Paragraph gap | Bundled verdict |
|---|---|---|---|
| Anthropic | +0.018 (+3.3%) | −0.026 (bundled wins) | Bundled viable; best bundled quality by a wide margin |
| DeepSeek | +0.075 (+14.7%) | +0.018 (+3.5%) | Paragraph bundled ok; bullets noticeably worse |
| OpenAI | +0.061 (+12.1%) | +0.012 (+2.6%) | Clear bullets penalty |
| Grok | +0.064 (+13.1%) | −0.002 (~tied) | Similar to OpenAI — bullets penalty, paragraph ~same |
| Mistral | +0.050 (+10.3%) | −0.031 (bundled wins paragraph) | Small bullets penalty; paragraph actually prefers bundled |
| Gemini | +0.089 (+18.8%) | +0.002 (+0.4%) | Largest bullets penalty |
Three providers (Anthropic, Mistral, DeepSeek) have ≤3% paragraph penalty in bundled — bundled is a legitimate choice there. OpenAI, Grok, Gemini show the classic "attention split" penalty.
Cloud pricing reference (Mar 2026, $/M tokens)¶
| Provider | Model | $/M in | $/M out |
|---|---|---|---|
| Gemini | gemini-2.0-flash | $0.075 | $0.30 |
| DeepSeek | deepseek-chat | $0.14 | $0.28 |
| Mistral | mistral-small-latest | $0.20 | $0.60 |
| Grok | grok-3-mini | $0.30 | $0.50 |
| Anthropic | claude-haiku-4-5 | $0.80 | $4.00 |
| OpenAI | gpt-4o | $2.50 | $10.00 |
Ordered cheapest → most expensive (output tokens): Gemini < DeepSeek < Mistral < Grok < Anthropic Haiku < OpenAI GPT-4o. 33× spread from cheapest to most expensive.
Actual measured per-episode latency and cost are in the Full matrix above. See Compound analysis for recommended option order.
Provider-specific findings¶
OpenAI (gpt-4o)¶
- Strengths: Reliable, good non-bundled quality, strong tooling ecosystem, json_object mode for bundled.
- Weaknesses: Expensive. Bundled attention-split penalty is real.
temperature=0is not fully deterministic even with seed. - Quirks: Implemented seed plumbing (
openai_summary_seed) — helps contestation stability but not full reproducibility.
Anthropic (claude-haiku-4-5-20251001)¶
- Strengths: Best bundled quality by wide margin. Both tracks competitive. Judge agreement highest.
- Weaknesses: More expensive than DeepSeek/Mistral/Gemini. No seed param in API.
- Quirks: API has no
response_format: json_object; bundled requires JSON prefill ({"role": "assistant", "content": "{"}) — implemented in provider. No seed; relies on temp=0 (empirically more deterministic than OpenAI's temp=0).
Gemini (gemini-2.0-flash)¶
- Strengths: Cheapest by far on output tokens. Fast latency. Non-bundled bullets quality is close to top.
- Weaknesses: Weakest bundled quality. Paragraph lags.
- Quirks:
- Bundled JSON responses sometimes contain raw control characters — parser uses
json.loads(strict=False). - Gemini 2.5-flash (non-lite): Older
google-genai0.x could not setThinkingConfig.thinking_budget, so thinking tokens starved real output. The provider now depends ongoogle-genai1.x and sendsthinking_budget=0forgemini-2.5-flash(excluding…-flash-lite). This report’s numbers are still from 2.0-flash / 2.5-flash-lite until the four-cell autoresearch re-run ongemini-2.5-flash(GitHub #572).
Mistral (mistral-small-latest)¶
- Strengths: Mid-tier across all cells. Bundled paragraph surprisingly good (0.487 — beats OpenAI bundled paragraph 0.469).
- Weaknesses: Non-bundled paragraph is worst in the matrix (0.456). No ecosystem edge.
- Quirks: No API-specific issues encountered during v2 application.
DeepSeek (deepseek-chat)¶
- Strengths: Best non-bundled quality on both tracks. Very cheap ($0.28/M output).
- Weaknesses: API reliability — bundled calls sometimes time out under load (retries work). No seed param. max_tokens hard-capped at 8192 (provider clamps now).
- Quirks:
max_tokens > 8192returns 400; we cap at 8192 in the bundled path.temperature=0appears reasonably deterministic.
Grok (grok-3-mini)¶
- Strengths: Mid-pack across all cells. No surprises.
- Weaknesses: Nothing stands out — third or fourth in every cell. No compelling reason to pick over Anthropic (quality) or DeepSeek (cost).
- Quirks: None encountered.
Ollama (local, 11 models evaluated)¶
- Strengths:
qwen3.5:9bmatches DeepSeek cloud quality at $0/ep. Free, private, offline. Bundled mode fixes paragraph contestation (unlike non-bundled where 5 of 11 models contest on long-form held-out). - Weaknesses: Latency 12-80s/ep (cloud is 2-20s). Larger models often not better — qwen3.5:9b outperforms both 27b and 35b variants. Not all models reliably produce JSON for bundled.
- Quirks:
- JSON parser uses
strict=Falsedefensively (some models emit control characters). - Size-vs-quality: bigger ≠ better. qwen3.5 family peaks at 9B; mistral:7b beats qwen2.5:7b (generation matters more than size).
- Paragraph contestation unpredictable: llama3.1:8b contested 5/5 episodes on held-out paragraph; uncontested uncorrelated with size.
Framework validation (across 17 model variants)¶
All champions were validated under the v2 framework across 6 cloud providers + 11 local Ollama models:
- Champion prompts transferred cleanly across all 6 cloud providers and 11 local models. Zero provider-specific prompt tuning was required — the OpenAI champion prompts produced competitive numbers on every provider tested.
- All champions generalise on held-out content. Cloud: 24 champions, dev→held-out deltas within ±5%. Local: 11 champions + 3 bundled, same generalisation property.
- Framework reliability: 23 of 24 cloud cells and all 44 local cells ran cleanly first try (1 DeepSeek bundled transient API timeout retried successfully). One Ollama model (qwen3.5:27b held-out bullets) showed anomalous 505s latency — likely model-swap overhead, not a framework problem.
- Judge contestation: Fraction-based threshold (≥40%) correctly rejected high-divergence runs without flipping on single-episode noise. Cloud: zero runs fell back to ROUGE-only. Local non-bundled paragraph: 5 of 11 models triggered contestation, surfacing a real local-model behaviour (mitigated by bundled mode).
The framework's central claim — that these numbers are trustworthy for cross-provider decision making — is now backed by 68 independent held-out validations (24 cloud + 44 local). Ship it.
What changed from v1 (before this session)¶
- Framework: v1 had
curated_5feeds_smoke_v1⊂curated_5feeds_benchmark_v1(contaminated validation). v2 uses disjointdev_v1(e01+e02) + held-outbenchmark_v2(e03 only). - Contestation: v1 binary-OR flipped runs to ROUGE-only on any divergent episode. v2 fraction-based threshold (40%) absorbs single-episode noise.
- Rubric: v1 penalised long summaries via Conciseness dimension. v2 Efficiency dimension rewards content density without length penalty.
- Prompt extraction: v2 extracts JSON prose before judging so bundled outputs are judged on semantic content, not JSON formatting.
- Seed: OpenAI seed parameter plumbed through Config/Params/factory (partial mitigation for API non-determinism).
- Prompt improvements: Champion prompts from OpenAI r7 tuning + paragraph v2 tuning ported to all providers (few-shot bullets, style narration, anti-patterns, 4-6 para default, opening sentence, coverage, verbatim terminology).
Earlier v1 reports (e.g. autoresearch/openai_comparison_2026-04-14.md) are superseded by
this one. v1 used contaminated validation and should not be cited as authoritative.
Open items / deferred¶
-
Multi-run averaging (N=3 per experiment). Would tighten confidence intervals on the numbers above and fully absorb API non-determinism. ~3× compute cost. RFC-073 §Future Work.
-
Gemini 2.5-flash with thinking disabled. Likely ~3-5% quality lift across all Gemini cells. Blocked by
google-genaiSDK version pin. -
Provider-specific prompt tuning. All 6 providers run champion-ported prompts — no dedicated tuning. Expected 2-5% additional per cell with per-provider tuning. Low priority given DeepSeek and Anthropic already dominate their respective cells.
-
Larger held-out dataset. 5 episodes give ±5% noise on held-out. Would help distinguish similarly-good champions (e.g., DeepSeek vs Anthropic non-bundled bullets differs by 2.8%, within noise bounds).
-
Sonnet 4.6 as candidate (currently only silver). Would reveal how much headroom exists above Haiku 4.5 and DeepSeek. Not urgent given Haiku + DeepSeek already strong.
-
Ollama / local model providers (llama31, llama32, qwen, phi3, gemma, etc.). v1 had extensive ollama configs; v2 has not yet applied. Valuable for on-device deployment decisions.
-
Champion consolidation. We now have 24 (provider × mode × track) champion combinations. Long-term, only a handful matter for production. Pick 3-4 primary recommendations and mark the rest as reference-only.