Evaluation Report: Benchmark v1 (April 2026)
First full benchmark sweep — ML baselines, hybrid MAP-REDUCE, 6 cloud APIs, and 12 Ollama local models, both paragraph and bullet JSON output, on the 10-episode benchmark dataset. Uses the new Sonnet 4.6 silver references. Extends the April 2026 smoke report to production-scale: 10 episodes gives more stable rankings than 5.
| Field | Value |
|---|---|
| Date | April 2026 |
| Dataset | curated_5feeds_benchmark_v1 (10 episodes, 5 feeds) |
| Reference (paragraphs) | silver_sonnet46_benchmark_v1 (Claude Sonnet 4.6) |
| Reference (bullets) | silver_sonnet46_benchmark_bullets_v1 (Claude Sonnet 4.6) |
| Schema | metrics_summarization_v2 |
| Paragraph configs | data/eval/configs/summarization/autoresearch_prompt_*_benchmark_paragraph_v1.yaml |
| Bullet configs | data/eval/configs/summarization_bullets/autoresearch_prompt_*_benchmark_bullets_v1.yaml |
| Previous report | Smoke v1 (2026-04) |
For metric definitions and interpretation guidance, see the Evaluation Reports index.
What changed vs. the April 2026 smoke report
Scale
The smoke report used 5 episodes; this benchmark report uses 10 episodes from the same 5 feeds. Numbers are more stable because single-episode outliers have less influence. The overall ranking order is consistent with smoke, but individual ROUGE-L scores shift by 0.1 to 3.8 percentage points (see the score delta table).
New Ollama exclusion: llama3.3:70b
llama3.3:70b-instruct-q3_K_M was present in the smoke report (showing OOM). Its
benchmark configs have been deleted — it crashes before completing episode 1 on this
hardware (Apple M-series, model runner OOM error). The 4 configs are removed from the
eval matrix.
Latency anomalies in large Ollama models
qwen3.5:27b (414 s/ep for paragraphs) and qwen3.5:9b (226 s/ep) were
CPU-offloaded in the paragraph benchmark, likely because of VRAM pressure during the
longer 10-episode run. Their bullet latencies are normal (63 s and 17 s respectively).
Do not use the paragraph latency figures for these two models for planning; re-run on
your hardware.
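One way to spot this kind of anomaly automatically is to compare each model's paragraph latency with its bullet latency. The sketch below uses latencies from the Ollama tables in this report; the function name and the ratio cut-off of 4.0 are illustrative assumptions, not part of the eval tooling.

```python
# Per-episode latencies in seconds: (paragraph, bullets), copied from this report.
LATENCY_S = {
    "qwen3.5:35b": (20.8, 14.1),
    "qwen3.5:27b": (414.3, 63.2),
    "qwen3.5:9b": (226.1, 16.7),
    "mistral-small3.2": (89.2, 39.9),
    "llama3.2:3b": (8.5, 5.2),
}

# Hypothetical cut-off: healthy models in this report stay below roughly 2.5x.
MAX_RATIO = 4.0


def flag_latency_anomalies(latencies, max_ratio=MAX_RATIO):
    """Return models whose paragraph/bullet latency ratio exceeds max_ratio."""
    return sorted(
        model
        for model, (paragraph_s, bullet_s) in latencies.items()
        if paragraph_s / bullet_s > max_ratio
    )


anomalous = flag_latency_anomalies(LATENCY_S)
print(anomalous)
```

With the figures above, only the two qwen3.5 models named in this section exceed the cut-off.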
Key takeaways
- ML baselines (no LLM) sit at 19–21% ROUGE-L at benchmark scale and are the only fully air-gapped option. ml_bart_led_autoresearch_v1 (20.5%, 26s/ep) is the Tier 2 production default; ml_hybrid_bart_llama32_3b_autoresearch_v1 (21.1%, 15s/ep; 23.7% in smoke) bridges the ML and LLM tiers but requires Ollama. See ML & hybrid baselines below.
- Anthropic Haiku 4.5 leads cloud paragraphs (33.7% ROUGE-L, 86.2% embed) and cloud bullets (38.6% ROUGE-L). Rankings are stable vs. smoke.
- DeepSeek is the strongest non-Anthropic provider: 29.5% ROUGE-L (paragraphs) and 38.0% ROUGE-L with the best embedding similarity across all cloud (85.7%, bullets).
- qwen3.5:35b is the best on-prem model for paragraphs (31.9% ROUGE-L, 20.8s/ep). At benchmark scale it outscores every cloud provider except Anthropic Haiku 4.5 (33.7%), including Gemini (28.7%) and OpenAI (26.8%).
- qwen3.5:35b also leads on-prem bullets (36.2% ROUGE-L, 14.1s/ep). Unlike smoke where llama3.2:3b led bullets, the 35b model takes the top spot at benchmark scale.
- qwen3.5:27b shows strong semantic alignment (88.4% embedding, bullets) but is impractical for paragraph inference (414 s/ep CPU-offload on this hardware).
- llama3.2:3b remains the best fast on-prem option: 24.4% ROUGE-L paragraphs (8.5s/ep), 33.6% bullets (5.2s/ep) — 3B params, 2 GB disk.
- phi3:mini is unsuitable for paragraph summaries: 157.7% WER, 175.5% coverage — hallucinates / repeats extensively. Its bullet score is mediocre but usable (28.3%).
- qwen2.5:7b breaks bullet JSON at benchmark scale (19.5% ROUGE-L, 65.0% embed) — do not use for structured output.
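A WER above 100% can look like a reporting error, so here is a minimal sketch of why it happens. It assumes the usual word-level definition (edit distance divided by reference length) and plain whitespace tokenization; the project's exact tokenization and normalization may differ, and the example strings are invented.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(
                min(
                    prev[j] + 1,         # delete a reference word
                    curr[j - 1] + 1,     # insert a hypothesis word
                    prev[j - 1] + cost,  # match or substitute
                )
            )
        prev = curr
    return prev[-1] / len(ref)


reference = "the host interviews a guest about solar power"
# A repetitive output much longer than the reference (invented example).
repetitive = " ".join(["the host talks about solar"] * 4)
wer = word_error_rate(reference, repetitive)
print(f"WER = {wer:.0%}")
```

Because every extra hypothesis word counts as an insertion, an output that is far longer than the reference pushes WER past 100%, which matches the repetition pattern described for phi3:mini.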
ML & hybrid baselines
ML and hybrid models run entirely on-device — no API key, no cloud dependency. They are
the fallback when Ollama is unavailable or the deployment is air-gapped. Numbers below
are benchmark-scale (10 episodes, curated_5feeds_benchmark_v1 vs
silver_sonnet46_benchmark_v1), matching all other tables in this report.
| Mode | Config ID | ROUGE-L | Embed | Coverage | Lat/ep | Dependencies |
|---|---|---|---|---|---|---|
| ML Dev (Tier 1) | baseline_ml_dev_authority_benchmark_v1 | 19.1% | 70.0% | ~40% | fast | None — CI safe |
| ML Prod (Tier 2) | baseline_ml_bart_led_autoresearch_benchmark_v1 | 20.5% | 68.2% | ~48% | 26s | None — air-gap safe |
| Hybrid | baseline_ml_hybrid_bart_llama32_3b_autoresearch_benchmark_v1 | 21.1% | 76.6% | ~80% | 15s | Ollama (llama3.2:3b) |
What these tiers are:
- ML Dev (Tier 1): BART-small MAP + authority REDUCE — fast, tiny, used in CI and unit tests. Quality is clearly below production threshold but sufficient for regression gating.
- ML Prod (Tier 2): BART-base MAP + LED-large REDUCE, a fully local transformer pipeline with no Ollama required. This is PROD_DEFAULT_SUMMARY_MODE_ID in config_constants.py. Suitable for air-gapped or low-resource environments.
- Hybrid: BART-base MAP + Llama 3.2:3b REDUCE (via Ollama). Uses the ML pipeline for extractive chunking and a small LLM for synthesis. On ROUGE-L it sits between ML Prod (20.5%) and direct llama3.2:3b (24.4%) while using only a 3B model. Requires Ollama but not a GPU.
Quality ladder in context:
Tier 1 ML Dev 19.1% ███████████████████
Tier 2 ML Prod 20.5% ████████████████████
Hybrid (BART+Llama3b) 21.1% █████████████████████
─── LLM threshold ─────────────────────────────────
Ollama llama3.2:3b 24.4% ████████████████████████
Ollama qwen3.5:35b 31.9% ████████████████████████████████
Gemini (cloud) 28.7% █████████████████████████████
Anthropic (cloud) 33.7% ██████████████████████████████████
Note on hybrid benchmark vs smoke: The hybrid dropped from 23.7% (smoke) to 21.1% (benchmark). This is expected: the REDUCE step samples at temperature=0.5, so scores vary between runs, and a 5-episode average is more sensitive to a few favourable samples than a 10-episode average. The ML Prod score (deterministic) is nearly identical across both scales (20.4% → 20.5%), which is consistent with the hybrid drop being a sampling artefact rather than a quality regression.
The hybrid tier is worth deploying when: (a) Ollama is available but a large model is not, or (b) episode transcripts exceed the direct-LLM context window. The BART MAP stage compresses the transcript chunk by chunk, so it handles arbitrary-length input.
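The shape of a MAP-REDUCE summarizer can be sketched in a few lines. This is an illustration of the general pattern only, not the project's pipeline: `chunk_words`, `map_reduce_summarize`, the chunk size, and the truncating stand-in summarizers are all hypothetical, with the stand-ins taking the place of BART (MAP) and the LLM (REDUCE).

```python
from typing import Callable, List


def chunk_words(text: str, chunk_size: int) -> List[str]:
    """Split text into consecutive chunks of at most chunk_size words."""
    words = text.split()
    return [
        " ".join(words[i:i + chunk_size])
        for i in range(0, len(words), chunk_size)
    ]


def map_reduce_summarize(
    text: str,
    map_fn: Callable[[str], str],
    reduce_fn: Callable[[str], str],
    chunk_size: int = 400,
) -> str:
    """Summarize each chunk (MAP), then summarize the joined partials (REDUCE)."""
    partials = [map_fn(chunk) for chunk in chunk_words(text, chunk_size)]
    return reduce_fn("\n".join(partials))


def first_words(n: int) -> Callable[[str], str]:
    """Stand-in 'summarizer' that keeps the first n words of its input."""
    return lambda text: " ".join(text.split()[:n])


# A synthetic 1000-word transcript; real input would be an episode transcript.
transcript = " ".join(f"w{i}" for i in range(1000))
summary = map_reduce_summarize(
    transcript, map_fn=first_words(5), reduce_fn=first_words(12), chunk_size=400
)
print(summary)
```

The REDUCE step only ever sees the concatenated partial summaries, so its input grows with the number of chunks rather than with the raw transcript length.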
Key finding (RFC-057 Track B, ADR-069/070): temperature=0.5 is optimal for the hybrid REDUCE step (BART chunk noise benefits from diversity), vs. temperature=0.3 for direct Llama on clean transcripts. Swapping the two values degrades quality. See ADR-072 for rationale.
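To make the two settings concrete, the sketch below builds request bodies for Ollama's generate endpoint, which accepts sampling parameters under `options`. The mode names, the helper function, and the prompt text are assumptions for illustration; the project's own configs may express this differently.

```python
import json

# Temperatures from the finding above; the mode names are hypothetical labels.
TEMPERATURE_BY_MODE = {
    "hybrid_reduce": 0.5,  # REDUCE over noisy BART chunk summaries
    "direct_llm": 0.3,     # direct Llama on clean transcripts
}


def build_generate_payload(prompt: str, mode: str, model: str = "llama3.2:3b") -> dict:
    """Build a JSON body for a non-streaming Ollama generate request."""
    if mode not in TEMPERATURE_BY_MODE:
        raise ValueError(f"unknown mode: {mode!r}")
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"temperature": TEMPERATURE_BY_MODE[mode]},
    }


payload = build_generate_payload("Combine these chunk summaries: ...", "hybrid_reduce")
print(json.dumps(payload, indent=2))
```

Keeping the mapping in one place makes it harder to apply the direct-LLM temperature to the hybrid REDUCE step by accident.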
Paragraph summaries — full metrics
Cloud providers (sorted by ROUGE-L)
| Provider | Model | Lat/ep | ROUGE-1 | ROUGE-2 | ROUGE-L | BLEU | Embed | Coverage | WER |
|---|---|---|---|---|---|---|---|---|---|
| Anthropic | claude-haiku-4-5 | 5.0s | 65.3% | 30.5% | 33.7% | 24.3% | 86.2% | 99.8% | 89.9% |
| DeepSeek | deepseek-chat | 8.9s | 62.7% | 24.2% | 29.5% | 17.6% | 83.6% | 89.3% | 89.2% |
| Gemini | gemini-2.0-flash | 2.7s | 57.3% | 23.4% | 28.7% | 16.3% | 82.5% | 78.8% | 87.2% |
| Mistral | mistral-small-latest | 4.6s | 61.3% | 23.7% | 28.0% | 18.9% | 82.3% | 108.4% | 96.5% |
| OpenAI | gpt-4o-mini | 8.5s | 57.7% | 22.4% | 26.8% | 16.9% | 84.1% | 96.0% | 92.6% |
| Grok | grok-3-mini | 7.5s | 57.5% | 20.9% | 26.7% | 14.9% | 81.7% | 90.5% | 91.0% |
Local Ollama — paragraphs (sorted by ROUGE-L)
Hardware: Apple M-series. Re-run on your machine for latency decisions.
* = anomalous latency (CPU offload detected during this run; re-run to confirm).
| Model | Tag | Lat/ep | ROUGE-1 | ROUGE-2 | ROUGE-L | BLEU | Embed | Coverage | WER |
|---|---|---|---|---|---|---|---|---|---|
| qwen3.5:35b | qwen3.5:35b | 20.8s | 63.7% | 24.4% | 31.9% | 18.6% | 81.5% | 108.0% | 96.7% |
| mistral-small3.2 | mistral-small3.2:latest | 89.2s | 58.7% | 21.7% | 28.1% | 16.4% | 81.4% | 84.4% | 90.5% |
| qwen3.5:27b | qwen3.5:27b | 414.3s* | 61.5% | 24.1% | 27.9% | 17.8% | 82.4% | 120.3% | 104.4% |
| mistral:7b | mistral:7b | 18.3s | 56.8% | 24.4% | 27.2% | 18.3% | 76.7% | 89.5% | 93.0% |
| llama3.1:8b | llama3.1:8b | 17.9s | 59.9% | 25.7% | 26.8% | 20.9% | 79.0% | 103.9% | 100.1% |
| mistral-nemo:12b | mistral-nemo:12b | 25.1s | 56.9% | 23.9% | 26.8% | 16.8% | 79.5% | 105.1% | 98.0% |
| qwen3.5:9b | qwen3.5:9b | 226.1s* | 59.6% | 21.4% | 25.7% | 14.4% | 78.0% | 99.6% | 95.6% |
| qwen2.5:32b | qwen2.5:32b | 78.3s | 55.5% | 19.9% | 24.6% | 12.4% | 80.7% | 77.4% | 90.0% |
| llama3.2:3b | llama3.2:3b | 8.5s | 55.3% | 21.5% | 24.4% | 17.3% | 78.6% | 107.6% | 103.5% |
| qwen2.5:7b | qwen2.5:7b | 17.1s | 55.2% | 19.5% | 23.6% | 13.9% | 78.9% | 89.9% | 93.6% |
| gemma2:9b | gemma2:9b | 19.5s | 50.2% | 14.5% | 22.4% | 9.6% | 79.1% | 82.4% | 91.3% |
| phi3:mini | phi3:mini | 16.2s | 47.7% | 13.2% | 19.2% | 7.5% | 77.7% | 175.5% | 157.7% |
| llama3.3:70b-q3km | — | — | — | — | — | — | — | — | — |
llama3.3:70b-instruct-q3_K_M: model runner crashed with OOM on this hardware. Configs deleted — see note above.
Bullet JSON summaries — full metrics
Bullet ROUGE scores are computed against silver_sonnet46_benchmark_bullets_v1. Not
comparable to paragraph scores.
Cloud providers — bullets (sorted by ROUGE-L)
| Provider | Model | Lat/ep | ROUGE-1 | ROUGE-2 | ROUGE-L | BLEU | Embed | Coverage | WER |
|---|---|---|---|---|---|---|---|---|---|
| Anthropic | claude-haiku-4-5 | 3.5s | 64.7% | 33.3% | 38.6% | 34.3% | 84.2% | 86.9% | 84.1% |
| DeepSeek | deepseek-chat | 5.6s | 63.5% | 35.4% | 38.0% | 33.3% | 85.7% | 73.1% | 83.2% |
| Gemini | gemini-2.0-flash | 1.6s | 61.8% | 33.9% | 34.9% | 30.6% | 79.9% | 64.4% | 84.2% |
| Mistral | mistral-small-latest | 2.0s | 60.1% | 28.8% | 33.8% | 30.3% | 85.1% | 78.6% | 87.6% |
| OpenAI | gpt-4o-mini | 5.7s | 59.6% | 31.5% | 32.1% | 29.9% | 83.8% | 67.4% | 82.8% |
| Grok | grok-3-mini | 8.3s | 55.3% | 26.9% | 29.2% | 24.9% | 80.9% | 65.6% | 88.2% |
Local Ollama — bullets (sorted by ROUGE-L)
| Model | Tag | Lat/ep | ROUGE-1 | ROUGE-2 | ROUGE-L | BLEU | Embed | Coverage | WER |
|---|---|---|---|---|---|---|---|---|---|
| qwen3.5:35b | qwen3.5:35b | 14.1s | 63.6% | 33.4% | 36.2% | 32.2% | 87.3% | 77.4% | 83.3% |
| qwen3.5:27b | qwen3.5:27b | 63.2s | 65.6% | 32.4% | 35.2% | 33.7% | 88.4% | 89.0% | 87.0% |
| mistral-small3.2 | mistral-small3.2:latest | 39.9s | 63.9% | 35.6% | 34.2% | 33.6% | 84.3% | 67.5% | 83.8% |
| llama3.2:3b | llama3.2:3b | 5.2s | 57.6% | 31.5% | 33.6% | 25.5% | 82.9% | 54.5% | 81.6% |
| qwen3.5:9b | qwen3.5:9b | 16.7s | 60.0% | 28.8% | 32.6% | 27.1% | 83.5% | 69.7% | 85.2% |
| llama3.1:8b | llama3.1:8b | 10.1s | 52.9% | 27.4% | 31.5% | 19.4% | 78.5% | 50.4% | 86.2% |
| qwen2.5:32b | qwen2.5:32b | 50.9s | 56.7% | 26.9% | 30.5% | 25.6% | 84.9% | 62.1% | 84.7% |
| mistral-nemo:12b | mistral-nemo:12b | 16.8s | 48.0% | 23.8% | 28.9% | 16.9% | 73.6% | 50.3% | 85.5% |
| mistral:7b | mistral:7b | 13.0s | 52.8% | 27.8% | 28.6% | 20.6% | 76.1% | 49.9% | 87.1% |
| phi3:mini | phi3:mini | 8.2s | 50.7% | 23.0% | 28.3% | 21.0% | 74.3% | 66.9% | 89.8% |
| gemma2:9b | gemma2:9b | 11.1s | 45.7% | 20.9% | 27.3% | 16.4% | 78.3% | 47.0% | 88.0% |
| qwen2.5:7b | qwen2.5:7b | 9.2s | 31.1% | 13.0% | 19.5% | 6.4% | 65.0% | 28.7% | 91.2% |
| llama3.3:70b-q3km | — | — | — | — | OOM | — | — | — | — |
qwen2.5:7b (19.5% ROUGE-L, 65.0% embed) does not reliably follow the JSON bullet format — inspect predictions before using. All other models produce valid JSON output.
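Inspecting predictions can be partly automated with a structural check. The sketch below assumes a bullet payload shaped like `{"bullets": ["...", "..."]}`; that key name and shape are hypothetical, since this report does not show the schema, so adapt the check to the real output format.

```python
import json


def validate_bullets(raw: str, key: str = "bullets") -> list:
    """Parse raw model output and return its bullets, or raise ValueError."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict) or key not in data:
        raise ValueError(f"expected a JSON object with a {key!r} key")
    bullets = data[key]
    if not isinstance(bullets, list) or not bullets:
        raise ValueError(f"{key!r} must be a non-empty list")
    if not all(isinstance(item, str) and item.strip() for item in bullets):
        raise ValueError("every bullet must be a non-empty string")
    return bullets


def is_valid(raw: str) -> bool:
    """Convenience wrapper that reports validity without raising."""
    try:
        validate_bullets(raw)
    except ValueError:
        return False
    return True


print(is_valid('{"bullets": ["Guest explains solar pricing", "Host recaps"]}'))
print(is_valid("Here are the bullets:\n- Guest explains solar pricing"))
```

Counting the share of episodes that fail such a check gives a format-compliance rate to read alongside ROUGE-L.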
On-prem Ollama vs cloud API
Decision guide for teams choosing between local and cloud summarization.
Quality (ROUGE-L, paragraphs)
Anthropic Haiku 4.5 (cloud): 33.7% ██████████████████████████████████
─── on-prem competitive zone ───────────────────────────────────────────
qwen3.5:35b (local, 21s): 31.9% ████████████████████████████████
DeepSeek (cloud): 29.5% █████████████████████████████
─── cloud mid-tier ──────────────────────────────────────────────────────
Gemini (cloud): 28.7% ████████████████████████████
mistral-small3.2 (local, 89s): 28.1% ████████████████████████████
Mistral (cloud): 28.0% ████████████████████████████
qwen3.5:27b (local, 414s): 27.9% ███████████████████████████
OpenAI (cloud): 26.8% ██████████████████████████
Grok (cloud): 26.7% ██████████████████████████
─── below cloud floor ───────────────────────────────────────────────────
llama3.2:3b (local, 8.5s): 24.4% ████████████████████████
qwen3.5:35b (31.9%) is the only on-prem model above the cloud median (28.35%). It outperforms every cloud provider except Anthropic Haiku 4.5, including DeepSeek, Gemini 2.0 Flash, OpenAI gpt-4o-mini, and Grok.
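The cloud-median comparison can be reproduced from the paragraph tables. This is a small worked check using the ROUGE-L values reported above; the variable names are illustrative.

```python
from statistics import median

# Paragraph ROUGE-L (%) from the cloud and Ollama tables in this report.
CLOUD = {
    "Anthropic": 33.7, "DeepSeek": 29.5, "Gemini": 28.7,
    "Mistral": 28.0, "OpenAI": 26.8, "Grok": 26.7,
}
ON_PREM = {
    "qwen3.5:35b": 31.9, "mistral-small3.2": 28.1, "qwen3.5:27b": 27.9,
    "mistral:7b": 27.2, "llama3.1:8b": 26.8, "mistral-nemo:12b": 26.8,
    "qwen3.5:9b": 25.7, "qwen2.5:32b": 24.6, "llama3.2:3b": 24.4,
    "qwen2.5:7b": 23.6, "gemma2:9b": 22.4, "phi3:mini": 19.2,
}

# With six providers, the median is the mean of the two middle scores.
cloud_median = median(CLOUD.values())
above_median = [model for model, score in ON_PREM.items() if score > cloud_median]

print(f"cloud median: {cloud_median:.2f}%")
print(f"on-prem above median: {above_median}")
```

mistral-small3.2 (28.1%) falls just short of the median, which is why qwen3.5:35b stands alone.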
Latency
- Cloud leaders: Gemini 2.0 Flash (2.7s/ep paragraphs, 1.6s/ep bullets); Mistral API (4.6s/ep, 2.0s/ep bullets); Anthropic (5.0s/ep, 3.5s/ep bullets).
- On-prem fast: llama3.2:3b (8.5s/ep paragraphs, 5.2s/ep bullets) is the fastest on-prem option. qwen3.5:35b (21s/ep paragraphs, 14s/ep bullets) is the best quality/latency balance.
Use-case decision table
| Scenario | Recommendation |
|---|---|
| Production, quality-sensitive | Anthropic Haiku 4.5 (cloud) — leads all providers |
| Production, cost-sensitive | DeepSeek — 2nd best quality at ~$0.02/100 eps |
| Production, speed-sensitive | Gemini 2.0 Flash — fastest cloud (1.6–2.7s/ep) |
| Air-gapped / no internet | ml_bart_led_autoresearch_v1 — 20.5% ROUGE-L, zero deps |
| Air-gapped + Ollama available | Hybrid ml_hybrid_bart_llama32_3b_autoresearch_v1 — 21.1% ROUGE-L, only llama3.2:3b needed |
| On-prem required, quality first | qwen3.5:35b (21s/ep, second only to Anthropic Haiku 4.5 on paragraph ROUGE-L) |
| On-prem required, speed/quality | llama3.2:3b (8.5s/ep, 2 GB) — best fast on-prem LLM |
| Very long transcripts (context limit) | Hybrid (BART MAP handles arbitrary length) |
| Bullet JSON, cloud | Anthropic (highest ROUGE-L) or DeepSeek (best semantic embed) |
| Bullet JSON, on-prem, quality | qwen3.5:35b (36.2% ROUGE-L, 14s/ep) |
| Bullet JSON, on-prem, fast | llama3.2:3b (33.6% ROUGE-L, 5.2s/ep) |
Smoke vs. benchmark — score delta
Scores at benchmark scale (10 eps) compared to smoke (5 eps):
| Provider | Smoke ROUGE-L | Benchmark ROUGE-L | Delta |
|---|---|---|---|
| Anthropic | 30.5% | 33.7% | +3.2% |
| DeepSeek | 29.4% | 29.5% | +0.1% |
| Gemini | 27.9% | 28.7% | +0.8% |
| OpenAI | 26.6% | 26.8% | +0.2% |
| Mistral | 26.6% | 28.0% | +1.4% |
| Grok | 22.9% | 26.7% | +3.8% |
| qwen3.5:35b | 29.9% | 31.9% | +2.0% |
| llama3.2:3b | 22.6% | 24.4% | +1.8% |
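All eight deltas are positive. Smoke uses only 5 episodes, which can skew harder or easier than the full set; here they appear to have skewed harder. The largest swings (Grok +3.8, Anthropic +3.2 percentage points) suggest those providers had an unlucky episode in smoke. Rankings are stable apart from Mistral moving ahead of OpenAI, which it tied in smoke (26.6% each).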
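The deltas and the ranking comparison can be recomputed from the table. The sketch below uses the smoke and benchmark ROUGE-L values listed above; `order_changes` is a hypothetical helper that reports pairs whose relative order (including ties) differs between the two runs.

```python
from itertools import combinations

# (smoke ROUGE-L %, benchmark ROUGE-L %) from the delta table.
SCORES = {
    "Anthropic": (30.5, 33.7),
    "DeepSeek": (29.4, 29.5),
    "Gemini": (27.9, 28.7),
    "OpenAI": (26.6, 26.8),
    "Mistral": (26.6, 28.0),
    "Grok": (22.9, 26.7),
    "qwen3.5:35b": (29.9, 31.9),
    "llama3.2:3b": (22.6, 24.4),
}

# Deltas in percentage points, rounded to the table's precision.
deltas = {name: round(bench - smoke, 1) for name, (smoke, bench) in SCORES.items()}


def sign(a: float, b: float) -> int:
    """Return 1 if a > b, -1 if a < b, and 0 for a tie."""
    return (a > b) - (a < b)


def order_changes(scores):
    """Pairs whose relative order differs between smoke and benchmark."""
    return [
        (a, b)
        for a, b in combinations(scores, 2)
        if sign(scores[a][0], scores[b][0]) != sign(scores[a][1], scores[b][1])
    ]


changed = order_changes(SCORES)
print(deltas)
print(changed)
```

The only pair that changes is OpenAI and Mistral, which tie in smoke and separate at benchmark scale.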
Model IDs (acceptance-tested)
| Provider | Model ID used | Notes |
|---|---|---|
| Anthropic | claude-haiku-4-5 | Fastest Anthropic option; Sonnet 4.6 used for silver only |
| OpenAI | gpt-4o-mini | max_completion_tokens required for gpt-5 series |
| DeepSeek | deepseek-chat | — |
| Gemini | gemini-2.0-flash | gemini-2.5-pro / gemini-3.1-pro-preview are thinking models |
| Grok | grok-3-mini | — |
| Mistral | mistral-small-latest | — |
How to reproduce
# Cloud paragraph (any provider)
make experiment-run \
CONFIG=data/eval/configs/summarization/autoresearch_prompt_anthropic_benchmark_paragraph_v1.yaml \
REFERENCE=silver_sonnet46_benchmark_v1
# Cloud bullets (any provider)
make experiment-run \
CONFIG=data/eval/configs/summarization_bullets/autoresearch_prompt_anthropic_benchmark_bullets_v1.yaml \
REFERENCE=silver_sonnet46_benchmark_bullets_v1
# Ollama paragraph
make experiment-run \
CONFIG=data/eval/configs/summarization/autoresearch_prompt_ollama_qwen35_35b_benchmark_paragraph_v1.yaml \
REFERENCE=silver_sonnet46_benchmark_v1
# Ollama bullets
make experiment-run \
CONFIG=data/eval/configs/summarization_bullets/autoresearch_prompt_ollama_qwen35_35b_benchmark_bullets_v1.yaml \
REFERENCE=silver_sonnet46_benchmark_bullets_v1
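The four commands above follow one naming pattern, so a small driver can generate them. This is a sketch under that assumption: the helper names are hypothetical, only the two config slugs shown above are used, and nothing is executed unless `execute=True` is passed.

```python
import shlex
import subprocess

CONFIG_ROOT = "data/eval/configs"

# Task name -> (config subdirectory, silver reference), as used in this report.
TASKS = {
    "paragraph": ("summarization", "silver_sonnet46_benchmark_v1"),
    "bullets": ("summarization_bullets", "silver_sonnet46_benchmark_bullets_v1"),
}

# Only the slugs that appear in the commands above; others are not assumed here.
SLUGS = ["anthropic", "ollama_qwen35_35b"]


def build_command(slug: str, task: str) -> list:
    """Build the make invocation for one config slug and task."""
    subdir, reference = TASKS[task]
    config = (
        f"{CONFIG_ROOT}/{subdir}/"
        f"autoresearch_prompt_{slug}_benchmark_{task}_v1.yaml"
    )
    return ["make", "experiment-run", f"CONFIG={config}", f"REFERENCE={reference}"]


def run_sweep(slugs, execute: bool = False) -> list:
    """Print each command and run it only when execute is True."""
    commands = [build_command(slug, task) for slug in slugs for task in TASKS]
    for command in commands:
        print(shlex.join(command))
        if execute:
            subprocess.run(command, check=True)
    return commands


commands = run_sweep(SLUGS)  # dry run: prints the four commands
```

Running with `execute=True` from the repository root would invoke `make` for each config in turn and stop at the first failure.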
Related
- Smoke v1 (2026-04) — smoke report with same silver references
- Smoke v1 (2026-03) — March report with GPT-4o silver
- Evaluation Reports index — methodology and metric definitions
- AI Provider Comparison Guide — decision guide
- Ollama Provider Guide — Ollama setup and model reference