
Evaluation Report: Held-out v2 (April 2026)

Authoritative 6-provider comparison under the autoresearch v2 framework. Dev/held-out dataset split, champion prompts ported across providers, dual-judge scoring. Supersedes the v1 benchmark report and v1 smoke report for cloud API providers; v1 reports remain authoritative for Ollama and hybrid_ml pending v2 re-run.

| Field | Value |
| --- | --- |
| Date | April 2026 (2026-04-15) |
| Framework | autoresearch v2 (RFC-073) |
| Dev dataset | curated_5feeds_dev_v1 (10 ep, e01+e02 per feed); iteration only |
| Held-out dataset | curated_5feeds_benchmark_v2 (5 ep, e03 per feed, ~32 min each); never used during tuning |
| Silver (paragraphs) | silver_sonnet46_dev_v1_paragraph, silver_sonnet46_benchmark_v2_paragraph |
| Silver (bullets) | silver_sonnet46_dev_v1_bullets, silver_sonnet46_benchmark_v2_bullets |
| Judges | gpt-4o-mini + claude-haiku-4-5-20251001 (dual, fraction-based contestation) |
| Providers evaluated | OpenAI, Anthropic, Gemini, Mistral, DeepSeek, Grok |

For metric definitions and interpretation guidance, see the Evaluation Reports index. For the framework design rationale (why dev/held-out split, why fraction-based contestation, etc.), see RFC-073.


Six API providers evaluated end-to-end under the same framework with the same champion prompts (ported from OpenAI r7 champion). All champions validated on held-out content they were never tuned against. This is the authoritative reference for provider selection.


Headline Matrix (held-out, authoritative)

Final blended score = 0.70 * ROUGE-L + 0.30 * judge_mean. Higher better.
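The blend is simple enough to sketch directly. This is an illustrative helper, not the harness code: the function name and the `contested` flag are assumptions, and the ROUGE-only fallback follows the "(contested, ROUGE-only)" entries reported for local models in this document.

```python
def final_score(rouge_l: float, judge_mean: float, contested: bool = False) -> float:
    """Blend ROUGE-L and judge_mean (both on a 0-1 scale) into the final score.

    Assumption: a contested run drops the judge term and reports ROUGE-L alone.
    """
    if contested:
        return rouge_l
    return 0.70 * rouge_l + 0.30 * judge_mean


# Held-out paragraph inputs reported in this document.
summllama = round(final_score(0.326, 0.856), 3)                 # SummLlama3.2-3B
qwen35b = round(final_score(0.325, 0.0, contested=True), 3)     # qwen3.5:35b, contested
print(summllama, qwen35b)
```

With the inputs reported for SummLlama3.2-3B (32.6% ROUGE-L, 0.856 judge mean) this gives 0.485, and the contested qwen3.5:35b paragraph run collapses to its ROUGE-L of 0.325.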

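| Track | Approach | OpenAI | Anthropic | Gemini | Mistral | DeepSeek | Grok |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Bullets | Bundled | 0.505 | 0.552 | 0.473 | 0.487 | 0.511 | 0.489 |
| Bullets | Non-bundled | 0.566 | 0.570 | 0.562 | 0.537 | 0.586 | 0.553 |
| Paragraph | Bundled | 0.469 | 0.548 | 0.461 | 0.487 | 0.523 | 0.481 |
| Paragraph | Non-bundled | 0.481 | 0.522 | 0.463 | 0.456 | 0.541 | 0.479 |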

Cell winners on quality alone (held-out):

| Cell | Winner | Score |
| --- | --- | --- |
| Bullets · Bundled | Anthropic haiku-4.5 | 0.552 |
| Bullets · Non-bundled | DeepSeek deepseek-chat | 0.586 |
| Paragraph · Bundled | Anthropic haiku-4.5 | 0.548 |
| Paragraph · Non-bundled | DeepSeek deepseek-chat | 0.541 |

Quality is one dimension. Once latency and cost are counted, Gemini 2.0-flash is the default pick for balanced use cases (0.562 quality, 2.0s/ep, $0.00035/ep). See the Compound analysis § Pareto frontier and Recommended option order by use case for full quality × latency × cost tradeoffs.

ROUGE-L breakdown (held-out)

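| Track | Approach | OpenAI | Anthropic | Gemini | Mistral | DeepSeek | Grok |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Bullets | Bundled | 33.2% | 39.3% | 28.5% | 30.4% | 34.4% | 30.5% |
| Bullets | Non-bundled | 39.6% | 40.7% | 40.1% | 37.3% | 43.1% | 38.6% |
| Paragraph | Bundled | 29.5% | 39.2% | 26.6% | 30.3% | 35.9% | 28.9% |
| Paragraph | Non-bundled | 31.7% | 36.4% | 29.3% | 27.8% | 40.7% | 31.3% |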

Dev scores (for generalisation sanity-check)

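| Track | Approach | OpenAI | Anthropic | Gemini | Mistral | DeepSeek | Grok |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Bullets | Bundled | 0.476 | 0.549 | 0.489 | 0.475 | 0.519 | 0.502 |
| Bullets | Non-bundled | 0.564 | 0.572 | 0.566 | 0.521 | 0.572 | 0.511 |
| Paragraph | Bundled | 0.467 | 0.523 | 0.462 | 0.479 | 0.479 | 0.441 |
| Paragraph | Non-bundled | 0.460 | 0.526 | 0.472 | 0.456 | 0.536 | 0.478 |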

All 24 champions generalise within ±0.05 absolute final score dev→held-out; the largest shift is +0.044 (DeepSeek bundled paragraph, 0.479 → 0.523). No overfitting caught.
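The generalisation check can be reproduced from the two cloud tables. The sketch below subtracts dev from held-out for each of the 24 cells and reports the largest shift; reading the tolerance as 0.05 absolute score points (rather than a relative percentage) is an assumption, and the helper name is hypothetical.

```python
PROVIDERS = ["OpenAI", "Anthropic", "Gemini", "Mistral", "DeepSeek", "Grok"]

# (track, approach) -> final scores in PROVIDERS order, copied from the tables above.
DEV = {
    ("Bullets", "Bundled"):       [0.476, 0.549, 0.489, 0.475, 0.519, 0.502],
    ("Bullets", "Non-bundled"):   [0.564, 0.572, 0.566, 0.521, 0.572, 0.511],
    ("Paragraph", "Bundled"):     [0.467, 0.523, 0.462, 0.479, 0.479, 0.441],
    ("Paragraph", "Non-bundled"): [0.460, 0.526, 0.472, 0.456, 0.536, 0.478],
}
HELD_OUT = {
    ("Bullets", "Bundled"):       [0.505, 0.552, 0.473, 0.487, 0.511, 0.489],
    ("Bullets", "Non-bundled"):   [0.566, 0.570, 0.562, 0.537, 0.586, 0.553],
    ("Paragraph", "Bundled"):     [0.469, 0.548, 0.461, 0.487, 0.523, 0.481],
    ("Paragraph", "Non-bundled"): [0.481, 0.522, 0.463, 0.456, 0.541, 0.479],
}


def generalisation_deltas():
    """Return (delta, track, approach, provider) per cell, held-out minus dev."""
    rows = []
    for cell, dev_scores in DEV.items():
        for provider, dev, held in zip(PROVIDERS, dev_scores, HELD_OUT[cell]):
            rows.append((round(held - dev, 3), *cell, provider))
    return rows


deltas = generalisation_deltas()
worst = max(deltas, key=lambda row: abs(row[0]))
print(worst)
print(all(abs(row[0]) <= 0.05 for row in deltas))
```

Every cell stays inside 0.05 absolute; in relative terms the largest shifts (for example Grok and DeepSeek bundled paragraph) are closer to 9%, which is why the absolute reading is used here.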


Local models (Ollama)

Eleven local models evaluated as a full local-vs-cloud comparison. Same champion prompts ported from OpenAI r7. Non-bundled for all eleven; bundled also evaluated for the three 9B-35B models with reliable JSON mode (qwen3.5:9b, qwen3.5:35b, mistral-small3.2). qwen2.5:32b v2 not evaluated — qwen3.5 generation (9b, 27b, 35b) covers the Qwen family; qwen2.5:32b would not change recommendations.

Held-out non-bundled (5 ep e03) — sorted by bullets final

| Model | Size | Bullets ROUGE-L | Bullets final | Paragraph ROUGE-L | Paragraph final |
| --- | --- | --- | --- | --- | --- |
| qwen3.5:9b | 9B | 42.8% | 0.580 | 34.8% | 0.505 |
| qwen3.5:35b | 35B | 41.3% | 0.576 | 32.5% | 0.325 (contested) |
| qwen3.5:27b | 27B | 36.3% | 0.543 | 35.0% | 0.499 |
| mistral-small3.2 | ~22B | 36.1% | 0.536 | 28.8% | 0.288 (contested) |
| mistral:7b | 7B | 36.2% | 0.526 | 31.2% | 0.475 |
| llama3.1:8b | 8B | 33.8% | 0.518 | 30.5% | 0.305 (contested 5/5) |
| llama3.2:3b | 3B | 33.0% | 0.501 | 27.0% | 0.270 (contested) |
| mistral-nemo:12b | 12B | 30.4% | 0.497 | 27.1% | 0.445 |
| gemma2:9b | 9B | 30.3% | 0.492 | 28.0% | 0.453 |
| qwen2.5:7b | 7B | 29.6% | 0.477 | 28.6% | 0.463 |
| phi3:mini | 3.8B | 31.9% | 0.475 | 19.6% | 0.196 (contested) |

Dev non-bundled (10 ep e01+e02)

| Model | Bullets | Paragraph |
| --- | --- | --- |
| qwen3.5:9b | 0.571 | 0.493 |
| qwen3.5:35b | 0.560 | 0.491 |
| qwen3.5:27b | 0.545 | 0.512 |
| mistral-small3.2 | 0.551 | 0.497 |
| mistral:7b | 0.512 | 0.472 |
| llama3.2:3b | 0.533 | 0.441 |
| llama3.1:8b | 0.497 | 0.455 |
| mistral-nemo:12b | 0.505 | 0.450 |
| gemma2:9b | 0.495 | 0.459 |
| qwen2.5:7b | 0.461 | 0.443 |
| phi3:mini | 0.483 | 0.209 (contested) |

Held-out bundled (5 ep e03)

| Model | Bullets ROUGE-L | Bullets final | Paragraph ROUGE-L | Paragraph final |
| --- | --- | --- | --- | --- |
| qwen3.5:9b | 35.8% | 0.529 | 33.1% | 0.509 |
| qwen3.5:35b | 33.3% | 0.514 | 30.8% | 0.492 |
| mistral-small3.2 | 30.3% | 0.488 | 27.5% | 0.468 |

All 12 bundled cells ran cleanly with zero contestation — a stark contrast to the non-bundled paragraph track where 3 of 4 models contested. The structured JSON schema stabilises output across judges.

Dev (10 ep e01+e02)

| Model | Bullets final | Paragraph final |
| --- | --- | --- |
| qwen3.5:9b | 0.571 | 0.493 |
| qwen3.5:35b | 0.560 | 0.491 |
| mistral-small3.2 | 0.551 | 0.497 |
| llama3.2:3b | 0.533 | 0.441 |

Full matrix with latency and cost (non-bundled bullets held-out)

Per-episode latency from actual run metrics (baseline.json stats.avg_time_seconds). Cost per episode computed from approximate transcript size (~2.7k input tokens + ~500 output tokens) and Mar 2026 pricing.

| Rank | Provider+model | Final | ROUGE-L | Latency | $/ep | Q/sec | Q/$ |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | DeepSeek (deepseek-chat) | 0.586 | 43.1% | 10.2s | $0.00052 | 0.058 | 1131 |
| 2 | Ollama qwen3.5:9b | 0.580 | 42.8% | 33.3s | $0 | 0.017 | |
| 3 | Ollama qwen3.5:35b | 0.576 | 41.3% | 23.1s | $0 | 0.025 | |
| 4 | Anthropic (haiku-4.5) | 0.570 | 40.7% | 4.8s | $0.00416 | 0.119 | 137 |
| 5 | OpenAI (gpt-4o) | 0.566 | 39.6% | 4.6s | $0.01175 | 0.123 | 48 |
| 6 | Gemini 2.5-flash-lite (new) | 0.564 | 39.6% | 1.5s | ~$0.00047 | 0.376 | 1200 |
| 7 | Gemini 2.0-flash | 0.562 | 40.1% | 2.0s | $0.00035 | 0.281 | 1606 |
| 8 | Grok (grok-3-mini) | 0.553 | 38.6% | 19.4s | $0.00106 | 0.029 | 522 |
| 9 | Ollama qwen3.5:27b† | 0.543 | 36.3% | 505s† | $0 | 0.001† | |
| 10 | gpt-4o-mini (new) | 0.540 | 37.1% | 6.6s | ~$0.00074 | 0.082 | 730 |
| 11 | Mistral (mistral-small-latest) | 0.537 | 37.3% | 2.2s | $0.00084 | 0.244 | 639 |
| 12 | Ollama mistral-small3.2 | 0.536 | 36.1% | 79.2s | $0 | 0.007 | |
| 13 | Ollama mistral:7b | 0.526 | 36.2% | 28.9s | $0 | 0.018 | |
| 14 | Ollama llama3.1:8b | 0.518 | 33.8% | 24.5s | $0 | 0.021 | |
| 15 | Ollama llama3.2:3b | 0.501 | 33.0% | 12.2s | $0 | 0.041 | |
| 16 | Ollama mistral-nemo:12b | 0.497 | 30.4% | 33.4s | $0 | 0.015 | |
| 17 | Ollama gemma2:9b | 0.492 | 30.3% | 28.6s | $0 | 0.017 | |
| 18 | Mistral-medium (new, surprisingly weak) | 0.488 | 29.1% | 5.9s | ~higher | | worse |
| 19 | Ollama qwen2.5:7b | 0.477 | 29.6% | 22.9s | $0 | 0.021 | |
| 20 | Ollama phi3:mini | 0.475 | 31.9% | 17.7s | $0 | 0.027 | |

†qwen3.5:27b bullets held-out shows anomalous 505s — likely cold-start / model swap overhead from Ollama. Paragraph cell on same model was 188s, more typical. Treat this cell as warm-up noise; realistic per-episode latency is probably in the ~100s range.

Compound analysis — Pareto frontier

Three dimensions worth considering: quality (final score), latency (s/ep), cost ($/ep). A pick is on the Pareto frontier if no other option is strictly better on all three.

Pareto-optimal cloud:

  • Gemini 2.0-flash — cheapest + fastest + not bottom-quality. Dominates on cost + latency.
  • Anthropic haiku-4.5: 2nd-highest cloud quality (4th in the full matrix), 4.8s latency, mid cost. Middle-of-frontier.
  • DeepSeek — #1 quality, middling latency, very cheap. Quality-first frontier.

Pareto-optimal local:

  • Ollama qwen3.5:9b — #1 local quality at moderate local latency.
  • Ollama llama3.2:3b — fastest local (12s) at acceptable quality (0.501).

Dominated picks (avoid unless ecosystem-locked):

  • OpenAI gpt-4o: Anthropic has higher quality, same latency, 3× cheaper. Nothing gained.
  • Grok: DeepSeek and Gemini both beat it on quality, latency, and cost. No wins.
  • Mistral cloud: faster than DeepSeek but lower quality; Gemini beats it on all three axes.
  • Most local models (mistral:7b, llama3.1:8b, qwen2.5:7b, gemma2:9b, mistral-nemo:12b, phi3, mistral-small3.2, qwen3.5:27b, qwen3.5:35b): dominated by qwen3.5:9b on quality or llama3.2:3b on speed.
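The cloud frontier above can be reproduced with a short dominance check. This is a minimal sketch, not the report's tooling: `Option`, `dominates`, `pareto_frontier`, and the `latency_tol` knob are hypothetical names, and the six rows are the cloud non-bundled bullets entries from the Full matrix. It uses the usual dominance test (no worse on every axis, strictly better on at least one).

```python
from typing import NamedTuple


class Option(NamedTuple):
    name: str
    quality: float  # final blended score, higher is better
    latency: float  # seconds per episode, lower is better
    cost: float     # dollars per episode, lower is better


# Non-bundled bullets held-out rows for the six cloud providers.
CLOUD = [
    Option("DeepSeek", 0.586, 10.2, 0.00052),
    Option("Anthropic", 0.570, 4.8, 0.00416),
    Option("OpenAI", 0.566, 4.6, 0.01175),
    Option("Gemini 2.0-flash", 0.562, 2.0, 0.00035),
    Option("Grok", 0.553, 19.4, 0.00106),
    Option("Mistral", 0.537, 2.2, 0.00084),
]


def dominates(a: Option, b: Option, latency_tol: float = 0.0) -> bool:
    """True if a is no worse than b on every axis and strictly better on one.

    latency_tol (assumption) treats latencies within that many seconds as a tie.
    """
    no_worse = (
        a.quality >= b.quality
        and a.latency <= b.latency + latency_tol
        and a.cost <= b.cost
    )
    strictly_better = (
        a.quality > b.quality or a.latency < b.latency or a.cost < b.cost
    )
    return no_worse and strictly_better


def pareto_frontier(options, latency_tol: float = 0.0):
    """Names of options that no other option dominates."""
    return [
        o.name
        for o in options
        if not any(dominates(p, o, latency_tol) for p in options if p is not o)
    ]


print(pareto_frontier(CLOUD, latency_tol=0.5))
print(pareto_frontier(CLOUD))
```

With `latency_tol=0.5` the 0.2s gap between gpt-4o (4.6s) and haiku-4.5 (4.8s) counts as a tie, which matches the "same latency" reading in the dominated list and leaves DeepSeek, Anthropic, and Gemini on the frontier. With no tolerance, gpt-4o stays on the frontier because it is marginally faster than Anthropic.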

Recommended option order by use case

A. Quality first: you can pay, you can wait.

  1. DeepSeek non-bundled (0.586, 10s, $0.0005): top quality, very cheap, about 10 seconds per episode.
  2. Anthropic haiku-4.5 bundled (0.552, 7s, $0.004) — if you need single-call title+summary+bullets.
  3. Ollama qwen3.5:9b (0.580, 33s) — only if cloud is off the table.

B. Cost first — cloud, quality secondary.

  1. Gemini 2.0-flash non-bundled bullets (0.562, 2s, $0.00035): about 1.5× cheaper than DeepSeek, 5× faster, ~4% lower quality.
  2. DeepSeek non-bundled (0.586, 10s, $0.0005) — for 50% more cost, +4% quality, 5× slower.
  3. Mistral cloud (0.537, 2s, $0.00084) — only if Gemini locked out for some reason.

C. Throughput / latency first — batch processing, real-time serving.

  1. Gemini 2.0-flash (2.0s) — 2× faster than Anthropic, 5× faster than DeepSeek. Cheapest too.
  2. Mistral cloud (2.2s) — similar latency, weaker quality.
  3. Anthropic haiku-4.5 (4.8s) — fastest on the quality frontier.

D. Privacy / offline first — local only, no external calls.

  1. Ollama qwen3.5:9b (0.580, 33s) — best local quality. Close to DeepSeek cloud.
  2. Ollama llama3.2:3b (0.501, 12s) — for resource-constrained devices; faster, lower quality.
  3. Ollama mistral:7b (0.526, 29s) — only non-Qwen/Llama worth considering locally.

E. Balanced — you want all three: reasonable quality + fast + cheap.

  1. Anthropic haiku-4.5 non-bundled (0.570, 4.8s, $0.004) — on the frontier for balance.
  2. Gemini 2.0-flash non-bundled (0.562, 2s, $0.00035) — slightly worse quality, massively better latency+cost. Almost always the right balanced pick.
  3. DeepSeek non-bundled (0.586, 10s, $0.0005) — quality premium over Gemini for 5× latency.

The short answer

For most production deployments, Gemini 2.0-flash non-bundled bullets is the correct first pick. It sits on every Pareto frontier, wins on latency and cost, and loses only ~4% on quality vs the absolute best. Upgrade to DeepSeek (quality), Anthropic bundled (single-call), or Ollama qwen3.5:9b (privacy) only when a specific dimension justifies the tradeoff.

Headline finding: qwen3.5:9b (open-weights, local, free) lands 2nd in the whole matrix for bullets held-out, 0.3pp ROUGE-L behind DeepSeek. On-prem / offline deployments can essentially match the best cloud option on bullets quality. Three of the top 11 entries are local (ranks 2, 3, and 9), so the best local models are fully competitive with cloud APIs on this workload.

Size vs quality for Qwen3.5 family: 9B (0.580) > 35B (0.576) > 27B (0.543). The smallest is the strongest — suggests the 9B is the pareto-optimal choice and the larger models add cost without proportional gain. Similar pattern at 7B size: mistral:7b (0.526) noticeably beats qwen2.5:7b (0.477) — family generation matters more than size at this scale.

Non-bundled paragraph contests on several local models; bundled fixes it

Five of eleven local models contested on held-out non-bundled paragraph (judges diverged >40% of episodes): llama3.2:3b, llama3.1:8b (5/5!), qwen3.5:35b, mistral-small3.2, phi3:mini. Six models were uncontested: qwen3.5:9b (2/5 under threshold), qwen3.5:27b, mistral:7b, mistral-nemo:12b, gemma2:9b, qwen2.5:7b.

Pattern: contestation is not strictly tied to size or generation. Contestation correlates with structural inconsistency across episodes, which is somewhat model-specific and hard to predict.
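The contestation rule itself is a small piece of logic. The sketch below contrasts the v1 binary-OR rule with the v2 fraction rule; how a single episode is flagged as divergent is not specified in this report, so the per-episode booleans are taken as given and the function names are hypothetical.

```python
def is_contested_v1(divergent: list[bool]) -> bool:
    """v1 binary-OR rule: any divergent episode flips the run to ROUGE-only."""
    return any(divergent)


def is_contested_v2(divergent: list[bool], threshold: float = 0.40) -> bool:
    """v2 fraction rule: contested only when more than `threshold` of episodes diverge."""
    if not divergent:
        return False
    return sum(divergent) / len(divergent) > threshold


two_of_five = [True, True, False, False, False]    # e.g. qwen3.5:9b paragraph
three_of_five = [True, True, True, False, False]

print(is_contested_v1(two_of_five), is_contested_v2(two_of_five))
print(is_contested_v2(three_of_five))
```

Under the strict "more than 40%" comparison, a 2/5 run stays under the threshold (as reported for qwen3.5:9b) while a 3/5 run is contested. The v1 rule would have flipped both.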

When the three bundled-capable local models are run in bundled mode (same prompt content, wrapped in the JSON schema), all three produce uncontested paragraph output. The two that contested in non-bundled mode gain 51-63% in final score; qwen3.5:9b is essentially flat:

| Model | Non-bundled paragraph final | Bundled paragraph final | Gain |
| --- | --- | --- | --- |
| qwen3.5:9b | 0.505 (2/5 contested) | 0.509 | +0.8% (and stable) |
| qwen3.5:35b | 0.325 (contested, ROUGE-only) | 0.492 | +51% |
| mistral-small3.2 | 0.288 (contested, ROUGE-only) | 0.468 | +63% |

The JSON schema appears to force local models into more consistent paragraph structure that judges score reliably. This mirrors the Anthropic cloud finding (bundled paragraph ≥ non-bundled paragraph) and inverts the assumption that bundled always has an attention-split penalty.

Practical takeaway for local deployment: use bundled mode for paragraph output on local models. The single-call cost is lower AND the quality is higher (or equal) than non-bundled. For bullets, non-bundled is still better (qwen3.5:9b non-bundled 0.580 vs bundled 0.529) — same attention-split penalty as on cloud providers (except Anthropic).


Three storylines

1. Non-bundled: DeepSeek is the sleeper winner

DeepSeek (deepseek-chat) narrowly beats Anthropic on both non-bundled tracks:

  • Bullets: 0.586 vs Anthropic 0.570 (+2.8%)
  • Paragraph: 0.541 vs Anthropic 0.522 (+3.6%)

DeepSeek leads ROUGE-L by 2-4pp on both non-bundled cells while holding judge agreement. This was not obvious from v1 numbers where DeepSeek looked mid-pack. The v2 framework surfaced it.

2. Bundled: Anthropic is in a league of its own

Anthropic's bundled scores essentially match its non-bundled scores — the attention-split penalty that costs OpenAI ~12% and Gemini ~19% on bundled bullets simply does not materialise on Haiku 4.5. Anthropic bundled paragraph (0.548) even beats Anthropic non-bundled paragraph (0.522).

For any workload where bundled (title + summary + bullets in one call) is the preferred shape, Anthropic is the only provider where it's not a quality compromise.

6. ML and hybrid_ml pipelines under v2 (2026-04-16)

Historical v1 ML champions re-run under the v2 framework to isolate framework effect from model effect.

Paragraph held-out (v1 results vs v2 under same configs):

| Config | v1 ROUGE-L | v2 Dev final | v2 Held-out final | v2 Held-out ROUGE-L | v2 judge_mean |
| --- | --- | --- | --- | --- | --- |
| bart-led (pure ML) | 20.5% | 0.230 | 0.206 | 19.4% | 0.23 |
| hybrid bart + llama3.2:3b | 21.1% | 0.402 | 0.430 | 25.4% | 0.84 |
| hybrid bart + qwen3.5:9b (new v2 swap) | | 0.421 | 0.448 | 26.9% | 0.87 |

Findings:

  1. Framework alone doesn't save bart-led. v1 → v2 on same config = essentially flat ROUGE-L (20.5% → 19.4%). Judge-mean 0.23 tells the story: Sonnet-4.6 silver's standard applied rigorously scores BART's output as genuinely poor, and no rubric change recovers that.

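  2. Hybrid's REDUCE stage does most of the work. Adding an LLM REDUCE to bart-led roughly doubles the held-out final score (0.206 → 0.430), while swapping llama3.2:3b → qwen3.5:9b as REDUCE lifted hybrid by only +4% (0.430 → 0.448). Most of hybrid's score comes from having an LLM REDUCE at all, not from BART MAP.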

  3. Surprising: BART MAP helps a weak REDUCE, but hurts a capable one.

| Scenario | Standalone | Hybrid | Winner |
| --- | --- | --- | --- |
| Weak REDUCE (llama3.2:3b) | 0.270 (contested) | 0.430 | Hybrid (BART preprocessing stabilises output) |
| Capable REDUCE (qwen3.5:9b) | 0.509 bundled | 0.448 | Standalone (BART chunking loses information) |

The hybrid pipeline's value is tied to the weakness of the REDUCE model. With qwen3.5:9b now available, standalone beats hybrid.

Practical consequence: the hybrid pipeline (ml_hybrid_bart_*) loses its reason to exist as a default recommendation. Retained in docs for narrow niches (truly memory-constrained, explainability, etc.) but qwen3.5:9b standalone bundled is the new Tier 3 pick.

6a. ML transformers standalone (HF, not Ollama) — 2026-04-16

Two Hugging Face transformer models evaluated as "pure ML, no Ollama daemon" alternatives, run via scripts/eval/experiment/run_summllama_v2.py and scripts/eval/experiment/run_longt5xl_v2.py on Apple MPS. Scored with the same v2 harness (ROUGE-L + dual-judge blended scalar).

Paragraph:

| Model | Params | Dev final | Held-out final | Held-out ROUGE-L | Held-out judge | Contested | Avg s/ep (held-out) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| SummLlama3.2-3B (DISLab, DPO-tuned) | 3B | 0.442 | 0.485 | 32.6% | 0.856 | 0/5 | 66s |
| Long-T5-XL book-summary (pszemraj) | ~3B | | 0.192 (contested) | 19.2% | 0.527 | 3/5 | 1096s |

Bullets (SummLlama only; Long-T5-XL skipped after paragraph null result):

| Model | Dev final | Held-out final | Held-out ROUGE-L | Held-out judge | Contested | Avg s/ep (held-out) |
| --- | --- | --- | --- | --- | --- | --- |
| SummLlama3.2-3B | 0.467 | 0.416 | 26.6% | 0.764 | 0/5 | 156s |

SummLlama3.2-3B is the headline finding — for paragraph. Same Llama-3.2-3B base that scored 0.270 (standalone, contested 5/5) — with DPO fine-tuning on faithfulness/completeness/conciseness preferences, it lifts to 0.485 held-out paragraph, zero contestation, 0.856 judge mean. That's +80% over the same-base standalone and within 5% of qwen3.5:9b bundled (0.509, the top local pick).

Bullets is weaker (0.416 held-out, -14% vs paragraph). DPO was aligned on prose summarization preferences, not list structure. Judge_mean drops from 0.856 (paragraph) to 0.764 (bullets) — judges see paragraphs as genuinely good, bullets as merely acceptable. Still uncontested (0/5) and above all pure-ML baselines, but below Ollama llama3.2:3b bullets (0.501). Practical rule: prefer SummLlama for paragraph-first deployments; if bullets are equally important, qwen3.5:9b bundled remains the better local pick (0.529 bullets / 0.509 paragraph).

Notable properties (both tracks):

  • No Ollama daemon required — runs via HF transformers on MPS directly. Smaller operational footprint than Ollama (no server, no model swap).
  • Zero contestation on both dev (0/10) and held-out (0/5). The DPO alignment on the rubric axes our judges score against removes the instability that plagues same-size Llama/Qwen standalone paragraph runs.
  • Latency 31-66s/ep on MPS — slower than Ollama on the same laptop (llama3.2:3b via Ollama is ~12s) but operationally simpler for deployments that already use HF transformers for BART.

Long-T5-XL is a clear null result. The pszemraj/long-t5-tglobal-xl-16384-book-summary checkpoint produces 870-2800 char summaries (BART produces ~2000-4000 chars; SummLlama produces ~3000-5000 chars) — far shorter than our target. ROUGE-L 19%, judge 0.53, 3/5 contested. The book-summary domain transfer does not hold on podcast transcripts. Paired with prohibitive MPS beam-search latency (1096s/ep), this is unsuitable for our workload regardless of how predictions are scored. Run artifacts kept; no config created (run was driven directly by scripts/eval/experiment/run_longt5xl_v2.py).

Practical consequences:

  1. SummLlama3.2-3B is a Tier 2.5 paragraph pick. 0.485 held-out paragraph from a 3B model via HF transformers alone: massively above bart-led (0.206) and within 5% of qwen3.5:9b bundled (0.509). Bullets is weaker (0.416) because the DPO alignment is prose-shaped. Use it when paragraph is the primary output or when Ollama can't be run.
  2. Pure HF-transformers paragraph pipeline is viable for deployments that don't want Ollama dependency (e.g. strict airgapped environments, embedded services). Trade-off: slower per-episode latency vs Ollama on the same hardware (~60-150s vs 12-33s).
  3. Tier 2 prod default (ml_bart_led_autoresearch_v1) unchanged this session. Flipping to SummLlama requires real ML-provider integration (registry entry, factory wiring, ml_summllama32_standalone_v1 mode) — that's a follow-up. The eval data justifies the flip when that work is done.
  4. Long-T5-XL is not competitive — do not recommend for podcast summarisation.

5. Q4_0 → Q4_K_M quant upgrade tested 2026-04-16 (null-to-negative result)

Research predicted Q4_K_M would give "free quality" on phi3:mini, gemma2:9b, and mistral-nemo:12b (all currently Q4_0 in Ollama's defaults). Tested; not supported.

| Model | Q4_0 held-out (B/P) | Q4_K_M held-out (B/P) | Result |
| --- | --- | --- | --- |
| phi3:mini | 0.475 / 0.196 | 0.069 / 0.079 | BROKEN: 4k variant truncates; judge_mean=0 |
| gemma2:9b | 0.492 / 0.453 | 0.473 / 0.270 (contested 3/5) | Bullets slightly worse; paragraph worse |
| mistral-nemo:12b | 0.497 / 0.445 | 0.509 / 0.292 (contested 3/5) | Bullets slightly better; paragraph contests |

Notes:

  • phi3:mini has two Ollama variants: 4k context (phi3:3.8b-mini-4k-instruct-q4_K_M) and 128k context. Research recommended the 4k variant because 128k "produces gibberish above 4k tokens," but held-out prompts are ~9k tokens, so 4k truncates silently. phi3:mini is structurally too small for our workload regardless of quant.
  • Mistral-nemo and gemma2 show paragraph contestation appearing where Q4_0 was uncontested — at least part of the "measurement" is judge variance, not model quality. Possibly different quants tickle different judge disagreements.
  • Net: don't chase Q4 quant upgrades on these specific models for this workload. If re-quantising, go straight to Q6_K / Q8_0 for the small models and use the 128k-context phi3 variant to at least avoid truncation.

Quant upgrade configs kept (traceable as _q4km variants) but not recommended for use.

4. Cheaper-tier variants added 2026-04-16

Three cheaper-tier variants tested after the initial sweep, both non-bundled and bundled:

Non-bundled (held-out):

| Model | Bullets | Paragraph | vs comparable existing | Verdict |
| --- | --- | --- | --- | --- |
| gemini-2.5-flash-lite | 0.564 | 0.479 | 2.0-flash: 0.562 / 0.463 | Strict upgrade: better on both tracks, same cost tier |
| gpt-4o-mini | 0.540 | 0.469 | gpt-4o: 0.566 / 0.481 (16× more expensive) | Viable cost-tier pick; 4% quality hit buys 16× cost savings |
| mistral-medium | 0.488 | 0.421 | mistral-small: 0.537 / 0.456 | Poor on non-bundled; but see bundled below |

Bundled (held-out):

| Model | Bullets | Paragraph | Δ vs its own non-bundled | Verdict |
| --- | --- | --- | --- | --- |
| gemini-2.5-flash-lite | 0.536 | 0.462 | bullets −5%, paragraph −4% | Same "bundled < non-bundled" pattern as other Geminis/OpenAIs |
| gpt-4o-mini | 0.483 | 0.459 | bullets −10.6% | Classic OpenAI bundled attention-split penalty |
| mistral-medium | 0.533 | 0.480 | bullets +9%, paragraph +14%! | Flip: good in bundled, bad non-bundled. Joins Anthropic/Ollama-Qwen as "bundled > non-bundled" |

Practical consequences:

  • gemini-2.5-flash-lite non-bundled becomes the new "balanced default" pick — replacing gemini-2.0-flash
  • gpt-4o-mini non-bundled is a reasonable 16×-cheaper gpt-4o alternative; bundled is not
  • mistral-medium is only viable in bundled mode: a genuine surprise that adds a third member to the "bundled is fine" club (Anthropic, Ollama Qwen, now mistral-medium)

Practical consequence: gemini-2.5-flash-lite should replace 2.0-flash as the cheap-tier Gemini pick in the recommendations. gpt-4o-mini is a defensible OpenAI pick for cost-sensitive deployments; note that champion prompts were tuned for gpt-4o specifically — dedicated prompt tuning for gpt-4o-mini could close some of the 4% quality gap.

3. Gemini is the balanced-default champion once latency/cost are counted

DeepSeek and Anthropic win on quality; Gemini wins on everything else. At only ~4% below top quality (0.562 vs 0.586), it is:

  • Fastest cloud option in the original sweep (2.0s/ep vs 2.2-19s elsewhere): 2-5× faster than OpenAI, Anthropic, and DeepSeek, and nearly 10× faster than Grok
  • Cheapest cloud option ($0.00035/ep vs $0.0005-$0.012 elsewhere)
  • On every Pareto frontier in the quality/latency/cost space

For any application where you care about all three dimensions, Gemini is the default first pick. Upgrade to DeepSeek only when the 4% quality premium matters more than 5× latency and 50% cost.


Bundled viability per provider (non-bundled − bundled, higher = bigger bundled penalty)

| Provider | Bullets gap | Paragraph gap | Bundled verdict |
| --- | --- | --- | --- |
| Anthropic | +0.018 (+3.3%) | −0.026 (bundled wins) | Bundled viable; best bundled quality by a wide margin |
| DeepSeek | +0.075 (+14.7%) | +0.018 (+3.5%) | Paragraph bundled ok; bullets noticeably worse |
| OpenAI | +0.061 (+12.1%) | +0.012 (+2.6%) | Clear bullets penalty |
| Grok | +0.064 (+13.1%) | −0.002 (~tied) | Similar to OpenAI: bullets penalty, paragraph ~same |
| Mistral | +0.050 (+10.3%) | −0.031 (bundled wins paragraph) | Small bullets penalty; paragraph actually prefers bundled |
| Gemini | +0.089 (+18.8%) | +0.002 (+0.4%) | Largest bullets penalty |

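On paragraphs, no provider loses more than ~3.5% in bundled mode, and Anthropic and Mistral actually score higher bundled, so bundled is a legitimate choice for paragraph output. On bullets, every provider except Anthropic (+3.3%) pays a 10-19% penalty, with DeepSeek, OpenAI, Grok, and Gemini showing the classic "attention split" most clearly.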
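The bullets gap column can be recomputed from the headline matrix. The sketch below assumes the percentage is taken relative to the bundled score, which is inferred from the published figures rather than stated; `bundled_gap` is a hypothetical helper name.

```python
# Held-out bullets final scores as (non_bundled, bundled), from the headline matrix.
BULLETS = {
    "Anthropic": (0.570, 0.552),
    "DeepSeek": (0.586, 0.511),
    "OpenAI": (0.566, 0.505),
    "Grok": (0.553, 0.489),
    "Mistral": (0.537, 0.487),
    "Gemini": (0.562, 0.473),
}


def bundled_gap(non_bundled: float, bundled: float) -> tuple[float, float]:
    """Return (absolute gap, gap as a percentage of the bundled score)."""
    gap = non_bundled - bundled
    return round(gap, 3), round(100 * gap / bundled, 1)


for provider, scores in BULLETS.items():
    absolute, percent = bundled_gap(*scores)
    print(f"{provider:10s} {absolute:+.3f} ({percent:+.1f}%)")
```

A positive gap means non-bundled scores higher, so larger values indicate a bigger bundled penalty. The same function applies to the paragraph rows, where Anthropic, Grok, and Mistral come out negative.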


Cloud pricing reference (Mar 2026, $/M tokens)

| Provider | Model | $/M in | $/M out |
| --- | --- | --- | --- |
| Gemini | gemini-2.0-flash | $0.075 | $0.30 |
| DeepSeek | deepseek-chat | $0.14 | $0.28 |
| Mistral | mistral-small-latest | $0.20 | $0.60 |
| Grok | grok-3-mini | $0.30 | $0.50 |
| Anthropic | claude-haiku-4-5 | $0.80 | $4.00 |
| OpenAI | gpt-4o | $2.50 | $10.00 |

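Ordered cheapest → most expensive (input tokens): Gemini < DeepSeek < Mistral < Grok < Anthropic Haiku < OpenAI GPT-4o. 33× spread from cheapest to most expensive. On output tokens the order differs slightly: DeepSeek ($0.28) undercuts Gemini ($0.30), and Grok ($0.50) undercuts Mistral ($0.60).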

Actual measured per-episode latency and cost are in the Full matrix above. See Compound analysis for recommended option order.
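The per-episode cost figures in the Full matrix follow directly from this pricing and the approximate transcript size quoted there (~2.7k input tokens, ~500 output tokens). A minimal sketch, with a hypothetical helper name and the token counts as defaults:

```python
# $/M tokens as (input, output), copied from the pricing table above.
PRICING = {
    "gemini-2.0-flash": (0.075, 0.30),
    "deepseek-chat": (0.14, 0.28),
    "mistral-small-latest": (0.20, 0.60),
    "grok-3-mini": (0.30, 0.50),
    "claude-haiku-4-5": (0.80, 4.00),
    "gpt-4o": (2.50, 10.00),
}


def cost_per_episode(model: str, in_tokens: int = 2700, out_tokens: int = 500) -> float:
    """Dollar cost of one episode at the listed per-million-token prices."""
    price_in, price_out = PRICING[model]
    return (in_tokens * price_in + out_tokens * price_out) / 1_000_000


for model in PRICING:
    print(f"{model:22s} ${cost_per_episode(model):.5f}")
```

At these token counts the input side dominates the bill for the cheaper providers, which is why ordering by input price matches the per-episode cost ranking.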


Provider-specific findings

OpenAI (gpt-4o)

  • Strengths: Reliable, good non-bundled quality, strong tooling ecosystem, json_object mode for bundled.
  • Weaknesses: Expensive. Bundled attention-split penalty is real. temperature=0 is not fully deterministic even with seed.
  • Quirks: Implemented seed plumbing (openai_summary_seed) — helps contestation stability but not full reproducibility.

Anthropic (claude-haiku-4-5-20251001)

  • Strengths: Best bundled quality by wide margin. Both tracks competitive. Judge agreement highest.
  • Weaknesses: More expensive than DeepSeek/Mistral/Gemini. No seed param in API.
  • Quirks: API has no response_format: json_object; bundled requires JSON prefill ({"role": "assistant", "content": "{"}) — implemented in provider. No seed; relies on temp=0 (empirically more deterministic than OpenAI's temp=0).
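The prefill quirk can be sketched without calling the API. With a prefilled assistant turn, the completion continues after the `{` rather than repeating it, so the brace has to be re-attached before parsing. The helper names below are hypothetical and this is not the provider implementation.

```python
import json

PREFILL = "{"


def build_messages(prompt: str) -> list[dict]:
    """Append an assistant turn containing '{' so the model continues a JSON object."""
    return [
        {"role": "user", "content": prompt},
        {"role": "assistant", "content": PREFILL},
    ]


def parse_prefilled(completion_text: str) -> dict:
    """Re-attach the prefilled '{' to the continuation, then parse."""
    return json.loads(PREFILL + completion_text)


messages = build_messages("Return title, summary and bullets as one JSON object.")
# Example continuation text, as it would arrive after the prefill.
parsed = parse_prefilled('"title": "Episode 3", "bullets": ["a", "b"]}')
print(messages[-1], parsed["title"])
```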

Gemini (gemini-2.0-flash)

  • Strengths: Cheapest per episode and on input tokens. Fast latency. Non-bundled bullets quality is close to top.
  • Weaknesses: Weakest bundled quality. Paragraph lags.
  • Quirks:
  • Bundled JSON responses sometimes contain raw control characters — parser uses json.loads(strict=False).
  • Gemini 2.5-flash (non-lite): Older google-genai 0.x could not set ThinkingConfig.thinking_budget, so thinking tokens starved real output. The provider now depends on google-genai 1.x and sends thinking_budget=0 for gemini-2.5-flash (excluding …-flash-lite). This report’s numbers are still from 2.0-flash / 2.5-flash-lite until the four-cell autoresearch re-run on gemini-2.5-flash (GitHub #572).
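The control-character quirk is easy to reproduce with the standard library. In the sketch below the payload is made up for illustration; the only API used is `json.loads` with its documented `strict` flag.

```python
import json

# A bundled response whose summary string contains a raw newline (a control character).
raw = '{"title": "Episode 3", "summary": "First paragraph.\nSecond paragraph."}'

try:
    json.loads(raw)  # strict=True by default: raw control characters are rejected
    strict_ok = True
except json.JSONDecodeError:
    strict_ok = False

# strict=False allows control characters inside JSON strings.
lenient = json.loads(raw, strict=False)
print(strict_ok, repr(lenient["summary"]))
```

The default parser rejects the payload, while `strict=False` keeps the newline inside the parsed string, so downstream code sees the summary text unchanged.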

Mistral (mistral-small-latest)

  • Strengths: Mid-tier across all cells. Bundled paragraph surprisingly good (0.487 — beats OpenAI bundled paragraph 0.469).
  • Weaknesses: Non-bundled paragraph is worst in the matrix (0.456). No ecosystem edge.
  • Quirks: No API-specific issues encountered during v2 application.

DeepSeek (deepseek-chat)

  • Strengths: Best non-bundled quality on both tracks. Very cheap ($0.28/M output).
  • Weaknesses: API reliability — bundled calls sometimes time out under load (retries work). No seed param. max_tokens hard-capped at 8192 (provider clamps now).
  • Quirks: max_tokens > 8192 returns 400; we cap at 8192 in the bundled path. temperature=0 appears reasonably deterministic.

Grok (grok-3-mini)

  • Strengths: Mid-pack across all cells. No surprises.
  • Weaknesses: Nothing stands out: fourth or fifth in every cell. No compelling reason to pick over Anthropic (quality) or DeepSeek (cost).
  • Quirks: None encountered.

Ollama (local, 11 models evaluated)

  • Strengths: qwen3.5:9b matches DeepSeek cloud quality at $0/ep. Free, private, offline. Bundled mode fixes paragraph contestation (unlike non-bundled where 5 of 11 models contest on long-form held-out).
  • Weaknesses: Latency 12-80s/ep (cloud is 2-20s). Larger models often not better — qwen3.5:9b outperforms both 27b and 35b variants. Not all models reliably produce JSON for bundled.
  • Quirks:
  • JSON parser uses strict=False defensively (some models emit control characters).
  • Size-vs-quality: bigger ≠ better. qwen3.5 family peaks at 9B; mistral:7b beats qwen2.5:7b (generation matters more than size).
  • Paragraph contestation unpredictable: llama3.1:8b contested 5/5 episodes on held-out paragraph; contestation is uncorrelated with size.

Framework validation (across 17 model variants)

All champions were validated under the v2 framework across 6 cloud providers + 11 local Ollama models:

  • Champion prompts transferred cleanly across all 6 cloud providers and 11 local models. Zero provider-specific prompt tuning was required — the OpenAI champion prompts produced competitive numbers on every provider tested.
  • All champions generalise on held-out content. Cloud: 24 champions, dev→held-out deltas within ±0.05 absolute. Local: 11 champions + 3 bundled, same generalisation property.
  • Framework reliability: 23 of 24 cloud cells and all 44 local cells ran cleanly first try (1 DeepSeek bundled transient API timeout retried successfully). One Ollama model (qwen3.5:27b held-out bullets) showed anomalous 505s latency — likely model-swap overhead, not a framework problem.
  • Judge contestation: Fraction-based threshold (>40% of episodes) correctly rejected high-divergence runs without flipping on single-episode noise. Cloud: zero runs fell back to ROUGE-only. Local non-bundled paragraph: 5 of 11 models triggered contestation, surfacing a real local-model behaviour (mitigated by bundled mode).

The framework's central claim — that these numbers are trustworthy for cross-provider decision making — is now backed by 68 independent held-out validations (24 cloud + 44 local). Ship it.


What changed from v1 (before this session)

  • Framework: v1 had curated_5feeds_smoke_v1 / curated_5feeds_benchmark_v1 (contaminated validation). v2 uses disjoint dev_v1 (e01+e02) + held-out benchmark_v2 (e03 only).
  • Contestation: v1 binary-OR flipped runs to ROUGE-only on any divergent episode. v2 fraction-based threshold (40%) absorbs single-episode noise.
  • Rubric: v1 penalised long summaries via Conciseness dimension. v2 Efficiency dimension rewards content density without length penalty.
  • Prompt extraction: v2 extracts JSON prose before judging so bundled outputs are judged on semantic content, not JSON formatting.
  • Seed: OpenAI seed parameter plumbed through Config/Params/factory (partial mitigation for API non-determinism).
  • Prompt improvements: Champion prompts from OpenAI r7 tuning + paragraph v2 tuning ported to all providers (few-shot bullets, style narration, anti-patterns, 4-6 para default, opening sentence, coverage, verbatim terminology).

Earlier v1 reports (e.g. autoresearch/openai_comparison_2026-04-14.md) are superseded by this one. v1 used contaminated validation and should not be cited as authoritative.


Open items / deferred

  1. Multi-run averaging (N=3 per experiment). Would tighten confidence intervals on the numbers above and fully absorb API non-determinism. ~3× compute cost. RFC-073 §Future Work.

  2. Gemini 2.5-flash with thinking disabled. Likely ~3-5% quality lift across all Gemini cells. Blocked by google-genai SDK version pin.

  3. Provider-specific prompt tuning. All 6 providers run champion-ported prompts — no dedicated tuning. Expected 2-5% additional per cell with per-provider tuning. Low priority given DeepSeek and Anthropic already dominate their respective cells.

  4. Larger held-out dataset. 5 episodes give ±5% noise on held-out. Would help distinguish similarly-good champions (e.g., DeepSeek vs Anthropic non-bundled bullets differs by 2.8%, within noise bounds).

  5. Sonnet 4.6 as candidate (currently only silver). Would reveal how much headroom exists above Haiku 4.5 and DeepSeek. Not urgent given Haiku + DeepSeek already strong.

  6. Ollama / local model providers (llama31, llama32, qwen, phi3, gemma, etc.). v1 had extensive ollama configs; v2 has not yet applied. Valuable for on-device deployment decisions.

  7. Champion consolidation. We now have 24 (provider × mode × track) champion combinations. Long-term, only a handful matter for production. Pick 3-4 primary recommendations and mark the rest as reference-only.