Evaluation Report: Smoke v1 (April 2026)¶
Second full provider sweep — 6 cloud APIs + 12 Ollama local models, both paragraph and bullet JSON output, vs the new Claude Sonnet 4.6 silver reference. Includes prompt tuning (RFC-057 autoresearch round 2) results and a cloud vs. on-prem decision section.
| Field | Value |
|---|---|
| Date | April 2026 |
| Dataset | curated_5feeds_smoke_v1 (5 episodes, 5 feeds) |
| Reference (paragraphs) | silver_sonnet46_smoke_v1 (Claude Sonnet 4.6) |
| Reference (bullets) | silver_sonnet46_smoke_bullets_v1 (Claude Sonnet 4.6) |
| Schema | metrics_summarization_v2 |
| Paragraph configs | data/eval/configs/summarization/autoresearch_prompt_*_smoke_paragraph_v1.yaml |
| Bullet configs | data/eval/configs/summarization_bullets/autoresearch_prompt_*_smoke_bullets_v1.yaml |
| Ollama paragraph configs | data/eval/configs/summarization/autoresearch_prompt_ollama_*_smoke_paragraph_v1.yaml |
| Ollama bullet configs | data/eval/configs/summarization_bullets/autoresearch_prompt_ollama_*_smoke_bullets_v1.yaml |
| Previous report | Smoke v1 (2026-03) |
For metric definitions and interpretation guidance, see the Evaluation Reports index.
What changed since March 2026¶
New silver reference¶
The March 2026 report used silver_gpt4o_smoke_v1 (GPT-4o summaries) as the reference.
This report uses silver_sonnet46_smoke_v1 (Claude Sonnet 4.6), selected in April 2026
via a pairwise LLM judge tournament: Sonnet 4.6 beat GPT-5.4 (3-1-1) and swept Gemini
2.0 Flash (5-0) on 5 smoke episodes using dual OpenAI + Anthropic judges.
Why this matters for numbers: ROUGE and BLEU measure surface overlap with the reference. Changing the reference changes all scores — they are not directly comparable to the March 2026 figures. Specifically:
- March 2026 numbers were inflated for OpenAI (GPT family scored against a GPT reference). That bias no longer applies because the reference is now an Anthropic model; the same-family advantage shifts to Anthropic instead.
- Ranking order shifts: Anthropic now leads cloud providers on ROUGE-L (32.6%) where it was second-to-last in March. OpenAI dropped from first (58.8%) to a tie for fourth with Mistral (25.7%).
- Ollama Qwen 3.5:35b and 27b now lead local models (29.9% and 29.2%), displacing Mistral Small 3.2 and Qwen 2.5:32b, which tied in March (both at 38.4% vs GPT-4o silver).
The ranking shift is expected and meaningful: Sonnet 4.6 writes denser, more content-rich summaries than GPT-4o. Models that structure their output similarly (Anthropic family, verbose Qwen 3.5 models) score better against this reference.
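To make the reference dependence concrete, the sketch below implements a minimal ROUGE-L F1 (longest common subsequence over lowercased, whitespace-split tokens) and scores one candidate against two references. It is an illustration only: the tokenization is an assumption, the example sentences are invented, and the scorer used for this report may tokenize or stem differently.

```python
def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate: str, reference: str) -> float:
    """ROUGE-L F1 on whitespace tokens (simplified; no stemming)."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


# Invented sentences: the candidate stays fixed, only the reference changes.
candidate = "the host interviews a founder about pricing"
reference_a = "the host interviews a founder about pricing strategy"
reference_b = "a founder explains how pricing strategy shaped early growth"

score_a = rouge_l_f1(candidate, reference_a)
score_b = rouge_l_f1(candidate, reference_b)
print(f"vs reference A: {score_a:.3f}")  # 0.933
print(f"vs reference B: {score_b:.3f}")  # 0.375
```

The candidate text is identical in both calls; only the reference differs, which is the same effect that separates the March and April numbers.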
Prompt tuning (RFC-057 round 2)¶
All 6 cloud provider prompts were tuned via the autoresearch ratchet loop (≥1% gain to
accept, 3 consecutive fails = early stop) against silver_sonnet46_smoke_v1. Results:
| Provider | Round 1 score | Round 2 score | Key wins |
|---|---|---|---|
| Anthropic | 0.287 | 0.523 | Thesis opener (+0.201), vocab alignment (+0.024) |
| Gemini | 0.446 | 0.475 | Thesis opener + vocab + anchor |
| Grok | 0.437 | 0.456 | Thesis opener + vocab + anchor |
| DeepSeek | 0.502 | 0.502 | Saturated (focus line only, no new wins) |
| Mistral | 0.468 | 0.480 | Cause-effect relationships |
| OpenAI | 0.474 | 0.474 | Saturated (all new instructions caused judge divergence) |
Anthropic had the largest gain because it started from the lowest baseline and had the most missing structural instructions. DeepSeek and OpenAI were already near-saturated.
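The ratchet rule described above (accept a candidate only on a gain of at least 1%, stop after 3 consecutive rejections) can be sketched as follows. This is not the RFC-057 implementation: the `score` callable is hypothetical, the 1% threshold is assumed to be relative to the current best score, and the toy scores are invented for the example.

```python
from typing import Callable, Iterable


def ratchet(
    baseline_prompt: str,
    candidates: Iterable[str],
    score: Callable[[str], float],
    min_rel_gain: float = 0.01,       # assumed: gain is relative to current best
    max_consecutive_fails: int = 3,
) -> tuple[str, float]:
    """Keep the best prompt so far; accept only sufficiently large gains."""
    best_prompt = baseline_prompt
    best_score = score(baseline_prompt)
    fails = 0
    for candidate in candidates:
        candidate_score = score(candidate)
        if candidate_score >= best_score * (1 + min_rel_gain):
            # Accepted: the ratchet moves up and the fail counter resets.
            best_prompt, best_score = candidate, candidate_score
            fails = 0
        else:
            fails += 1
            if fails >= max_consecutive_fails:
                break  # early stop
    return best_prompt, best_score


# Invented scores standing in for a real evaluation run.
toy_scores = {
    "base": 0.500, "a": 0.503, "b": 0.520,
    "c": 0.510, "d": 0.520, "e": 0.521, "f": 0.900,
}
winner, winner_score = ratchet("base", ["a", "b", "c", "d", "e", "f"], toy_scores.__getitem__)
print(winner, winner_score)  # b 0.52
```

In the toy run, candidate `f` is never evaluated because `c`, `d`, and `e` fail in a row. A provider that stops this way without any accepted change would show identical scores in both rounds, as DeepSeek and OpenAI do in the table.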
Bullets track (new)¶
This report introduces a second output format: bullet JSON summaries
(bullets_json_v1 template). The bullets silver reference
(silver_sonnet46_smoke_bullets_v1) was generated by Sonnet 4.6 using the same
shared template. Bullet scores are not comparable to paragraph scores — they are a
separate track with a separate reference.
Round 1 bullet autoresearch (shared template): early stop after 3 fails — the template was already well-tuned from a prior session.
Model additions¶
Two new Ollama models added since the March sweep:
- llama3.2:3b — smallest model in the set (2GB, 7.3s/ep)
- llama3.3:70b-instruct-q3_K_M — largest model (34GB); crashed with OOM on this hardware and is excluded from results
Key takeaways¶
- Anthropic Haiku 4.5 is the best cloud provider on ROUGE-L (32.6%) — a complete reversal from March, driven by the new Sonnet 4.6 silver reference and round-2 prompt tuning. Its 86.8% embedding similarity is also the highest in the cloud set.
- For cloud providers, the ranking by ROUGE-L is now: Anthropic > DeepSeek > Gemini > OpenAI = Mistral > Grok. This largely inverts the March order, in which OpenAI was first and Anthropic second-to-last.
- Qwen 3.5:35b leads Ollama on ROUGE-L (29.9%), followed closely by Qwen 3.5:27b (29.2%). Both outperform every cloud provider except Anthropic.
- Mistral Small 3.2 remains a solid local option at 25.2% ROUGE-L, but it fell from joint first to fourth among local models due to the reference change, and its BLEU (11.4%) is well behind the Qwen 3.5 models (18.4–18.5%).
- On-prem vs. cloud: Qwen 3.5:35b (20s/ep local) is competitive with DeepSeek/Gemini/OpenAI on ROUGE-L while keeping data on-device. See the on-prem vs cloud section below.
- Bullets track: Anthropic leads cloud bullets (39.1% ROUGE-L), followed by
Gemini (37.1%). DeepSeek leads on embedding similarity (85.0%). For local bullets,
llama3.2:3b surprises at 35.5% ROUGE-L (fastest at 4.8s/ep); Mistral Small 3.2 leads
on embedding (85.8%). qwen2.5:7b breaks on bullet JSON, so skip it for this track.
- phi3:mini continues to show extreme verbosity (152.7% coverage, 141.1% WER in paragraphs). Its embedding similarity is decent (79.6%), but the length makes it unsuitable for production paragraph use.
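A WER above 100% can look like a bug, so the sketch below shows how it arises for an over-long summary. It uses the standard word-level edit distance divided by the reference length. The `length_ratio` helper is only an assumed reading of the "coverage" column (the authoritative definitions are in the Evaluation Reports index), and the sentences are invented.

```python
def word_error_rate(candidate: str, reference: str) -> float:
    """Word-level Levenshtein distance divided by the reference word count."""
    hyp = candidate.split()
    ref = reference.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # drop a reference word
                            curr[j - 1] + 1,      # extra candidate word
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


def length_ratio(candidate: str, reference: str) -> float:
    """Candidate word count over reference word count (assumed 'coverage')."""
    return len(candidate.split()) / len(reference.split())


# Invented sentences for illustration.
reference = "the episode covers pricing"
concise = "the episode covers pricing strategy"
verbose = "in this long episode the host covers many topics including pricing and hiring"

for name, text in [("concise", concise), ("verbose", verbose)]:
    print(name, f"WER={word_error_rate(text, reference):.2f}",
          f"length ratio={length_ratio(text, reference):.2f}")
```

Because every extra word counts as an insertion, a candidate much longer than the reference pushes WER past 1.0 even when most reference words are present.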
Paragraph summaries — full metrics¶
Cloud providers (sorted by ROUGE-L)¶
| Provider | Model | Lat/ep | ROUGE-1 | ROUGE-2 | ROUGE-L | BLEU | Embed | Coverage | WER |
|---|---|---|---|---|---|---|---|---|---|
| Anthropic | claude-haiku-4-5 | 5.1s | 63.1% | 24.9% | 32.6% | 18.9% | 86.8% | 91.0% | 86.6% |
| DeepSeek | deepseek-chat | 9.8s | 64.2% | 23.9% | 28.3% | 16.7% | 81.5% | 81.4% | 87.1% |
| Gemini | gemini-2.0-flash | 2.9s | 54.9% | 20.4% | 27.6% | 11.6% | 81.6% | 63.7% | 86.3% |
| OpenAI | gpt-4o-mini | 8.2s | 58.2% | 19.4% | 25.7% | 13.3% | 82.5% | 81.7% | 90.6% |
| Mistral | mistral-small-latest | 4.0s | 61.8% | 21.2% | 25.7% | 16.5% | 80.1% | 105.9% | 96.4% |
| Grok | grok-3-mini | 8.9s | 57.2% | 17.6% | 25.1% | 11.4% | 77.8% | 81.3% | 90.9% |
Local Ollama — paragraphs (sorted by ROUGE-L)¶
Hardware: Apple M-series. Re-run on your machine for latency decisions.
| Model | Tag | Lat/ep | ROUGE-1 | ROUGE-2 | ROUGE-L | BLEU | Embed | Coverage | WER |
|---|---|---|---|---|---|---|---|---|---|
| qwen3.5:35b | qwen3.5:35b | 20.4s | 65.1% | 26.0% | 29.9% | 18.4% | 81.8% | 94.0% | 91.7% |
| qwen3.5:27b | qwen3.5:27b | 82.8s | 61.5% | 23.5% | 29.2% | 18.5% | 80.4% | 103.8% | 96.9% |
| qwen2.5:32b | qwen2.5:32b | 58.5s | 54.3% | 16.8% | 25.4% | 9.1% | 76.6% | 63.3% | 88.5% |
| mistral-small3.2 | mistral-small3.2:latest | 45.5s | 58.1% | 19.3% | 25.2% | 11.4% | 77.1% | 73.8% | 89.0% |
| qwen3.5:9b | qwen3.5:9b | 23.9s | 59.2% | 20.8% | 24.7% | 12.1% | 75.8% | 82.9% | 92.1% |
| mistral-nemo:12b | mistral-nemo:12b | 24.6s | 53.4% | 18.8% | 23.6% | 12.2% | 77.2% | 91.2% | 96.3% |
| llama3.1:8b | llama3.1:8b | 15.7s | 57.4% | 18.3% | 23.0% | 13.4% | 77.9% | 88.8% | 95.1% |
| llama3.2:3b | llama3.2:3b | 7.3s | 56.3% | 18.5% | 22.6% | 12.4% | 78.9% | 85.4% | 93.0% |
| mistral:7b | mistral:7b | 16.1s | 54.0% | 18.4% | 22.3% | 11.6% | 71.4% | 78.3% | 91.3% |
| qwen2.5:7b | qwen2.5:7b | 13.8s | 55.1% | 17.4% | 21.4% | 10.8% | 78.4% | 77.2% | 92.6% |
| gemma2:9b | gemma2:9b | 16.4s | 50.4% | 13.0% | 21.1% | 7.1% | 78.0% | 72.2% | 90.3% |
| phi3:mini | phi3:mini | 16.5s | 50.3% | 12.1% | 18.0% | 6.9% | 79.6% | 152.7% | 141.1% |
| llama3.3:70b-q3km | — | — | — | — | — | — | — | — | — |
llama3.3:70b-instruct-q3_K_M (34GB): model runner crashed with OOM on this hardware. Excluded from results.
Bullet JSON summaries — full metrics¶
Bullet ROUGE scores are computed against silver_sonnet46_smoke_bullets_v1 (JSON bullet
text extracted and concatenated). Not comparable to paragraph scores.
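The "extracted and concatenated" step could look roughly like the sketch below. The payload shape (`{"bullets": [...]}`) is an assumption made for illustration; the report does not show the bullets_json_v1 schema. Returning `None` for unparseable output gives a simple way to flag models that do not follow the format.

```python
import json
from typing import Optional


def bullets_to_text(raw_output: str) -> Optional[str]:
    """Join bullet strings into one text for scoring; None if the output is invalid.

    Assumes a hypothetical payload of the form {"bullets": ["...", "..."]}.
    """
    try:
        payload = json.loads(raw_output)
    except json.JSONDecodeError:
        return None
    if not isinstance(payload, dict):
        return None
    bullets = payload.get("bullets")
    if not isinstance(bullets, list) or not all(isinstance(b, str) for b in bullets):
        return None
    return "\n".join(b.strip() for b in bullets)


valid = '{"bullets": ["Pricing drove early growth. ", "Hiring slowed in Q2."]}'
broken = "Here are the bullets:\n- Pricing drove early growth"

print(bullets_to_text(valid))
print(bullets_to_text(broken))  # None
```

Counting `None` results per model before computing ROUGE is one way to separate formatting failures from genuinely weak summaries.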
Cloud providers — bullets (sorted by ROUGE-L)¶
| Provider | Model | Lat/ep | ROUGE-1 | ROUGE-2 | ROUGE-L | BLEU | Embed |
|---|---|---|---|---|---|---|---|
| Anthropic | claude-haiku-4-5 | 3.4s | 62.3% | 30.6% | 39.1% | 32.2% | 82.8% |
| Gemini | gemini-2.0-flash | 1.9s | 64.1% | 35.1% | 37.1% | 30.7% | 82.9% |
| DeepSeek | deepseek-chat | 8.8s | 65.0% | 32.3% | 34.1% | 30.0% | 85.0% |
| Grok | grok-3-mini | 10.8s | 55.2% | 25.2% | 32.3% | 22.6% | 84.3% |
| OpenAI | gpt-4o-mini | 3.3s | 54.9% | 27.6% | 30.7% | 23.1% | 84.0% |
| Mistral | mistral-small-latest | 2.6s | 59.9% | 27.4% | 29.8% | 28.8% | 84.2% |
Local Ollama — bullets (sorted by ROUGE-L)¶
| Model | Tag | Lat/ep | ROUGE-1 | ROUGE-L | Embed |
|---|---|---|---|---|---|
| llama3.2:3b | llama3.2:3b | 4.8s | 57.0% | 35.5% | 79.8% |
| mistral-small3.2 | mistral-small3.2:latest | 30.5s | 62.0% | 34.9% | 85.8% |
| qwen3.5:27b | qwen3.5:27b | 53.6s | 63.4% | 33.5% | 84.7% |
| qwen3.5:9b | qwen3.5:9b | 15.4s | 60.7% | 33.3% | 83.4% |
| gemma2:9b | gemma2:9b | 11.7s | 50.4% | 31.1% | 81.3% |
| qwen3.5:35b | qwen3.5:35b | 16.4s | 63.0% | 30.8% | 84.6% |
| phi3:mini | phi3:mini | 8.0s | 53.4% | 30.7% | 78.4% |
| qwen2.5:32b | qwen2.5:32b | 40.9s | 52.9% | 30.1% | 82.4% |
| llama3.1:8b | llama3.1:8b | 10.3s | 54.2% | 29.6% | 77.3% |
| mistral:7b | mistral:7b | 11.1s | 52.5% | 27.8% | 81.2% |
| mistral-nemo:12b | mistral-nemo:12b | 16.2s | 53.1% | 27.6% | 77.9% |
| qwen2.5:7b | qwen2.5:7b | 7.3s | 18.9% | 11.9% | 52.0% |
| llama3.3:70b-q3km | — | — | — | OOM | — |
qwen2.5:7b (11.9% ROUGE-L, 52.0% embed) does not reliably follow the JSON bullet format; inspect predictions before using. All other models produce valid JSON output.
llama3.3:70b-instruct-q3_K_M crashed with OOM on this hardware.
On-prem Ollama vs cloud API¶
Decision guide for teams choosing between local and cloud summarization.
This section compares the best on-prem options against cloud APIs across the dimensions that matter for a production deployment: quality, latency, cost, and privacy.
Quality (ROUGE-L, paragraphs)¶
Anthropic Haiku 4.5 (cloud): 32.6% ██████████████████████████████████
DeepSeek (cloud): 28.3% ████████████████████████████
Gemini 2.0 Flash (cloud): 27.6% ███████████████████████████
─── on-prem competitive zone ───────────────────────────────────────────
qwen3.5:35b (local, 20s): 29.9% ██████████████████████████████
qwen3.5:27b (local, 83s): 29.2% █████████████████████████████
qwen2.5:32b (local, 59s): 25.4% █████████████████████████
mistral-small3.2 (local, 46s): 25.2% █████████████████████████
─── below cloud floor ───────────────────────────────────────────────────
qwen3.5:9b (local, 24s): 24.7% ████████████████████████
llama3.1:8b (local, 16s): 23.0% ███████████████████████
qwen3.5:35b and qwen3.5:27b are the only on-prem models above the cloud median (~27%), and qwen3.5:35b is about four times faster (20s vs 83s per episode). It outperforms DeepSeek and Gemini 2.0 Flash and is within 3 points of Anthropic.
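The cloud median and cloud floor used in this section can be recomputed from the paragraph tables. The sketch below copies the ROUGE-L values from those tables and lists which local models clear each threshold; the variable names are illustrative.

```python
from statistics import median

# ROUGE-L (%) for paragraphs, copied from the tables in this report.
cloud_rouge_l = {
    "Anthropic": 32.6, "DeepSeek": 28.3, "Gemini": 27.6,
    "OpenAI": 25.7, "Mistral": 25.7, "Grok": 25.1,
}
local_rouge_l = {
    "qwen3.5:35b": 29.9, "qwen3.5:27b": 29.2, "qwen2.5:32b": 25.4,
    "mistral-small3.2": 25.2, "qwen3.5:9b": 24.7, "mistral-nemo:12b": 23.6,
    "llama3.1:8b": 23.0, "llama3.2:3b": 22.6, "mistral:7b": 22.3,
    "qwen2.5:7b": 21.4, "gemma2:9b": 21.1, "phi3:mini": 18.0,
}

cloud_median = median(cloud_rouge_l.values())  # mean of the two middle values
cloud_floor = min(cloud_rouge_l.values())      # weakest cloud provider

above_median = [m for m, s in local_rouge_l.items() if s > cloud_median]
below_floor = [m for m, s in local_rouge_l.items() if s < cloud_floor]

print(f"cloud median: {cloud_median:.2f}%, cloud floor: {cloud_floor:.1f}%")
print("local models above cloud median:", above_median)
print("local models below cloud floor:", below_floor)
```

With six cloud providers the median is the mean of the third and fourth values (27.6% and 25.7%), which is where the "~27%" figure comes from.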
Latency¶
- Cloud leaders: Gemini 2.0 Flash (2.9s/ep), Mistral API (4.0s/ep) for paragraphs; Gemini (1.9s/ep) for bullets.
- On-prem: qwen3.5:35b at 20s/ep is roughly 7–9× slower than Gemini 2.0 Flash on a like-for-like basis (20.4s vs 2.9s for paragraphs, 16.4s vs 1.9s for bullets), but processes 5 episodes in ~100s total, which is acceptable for batch/background jobs.
- On-prem fast tier: llama3.2:3b (7.3s/ep), qwen2.5:7b (13.8s/ep) are the fastest local options, but their quality is below the cloud floor.
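The latency comparison above is simple arithmetic on the per-episode figures from the tables. A short sketch, with helper names chosen for illustration:

```python
# Seconds per episode, copied from the tables in this report.
PARAGRAPH_LATENCY = {"gemini-2.0-flash": 2.9, "mistral-small-latest": 4.0, "qwen3.5:35b": 20.4}
BULLET_LATENCY = {"gemini-2.0-flash": 1.9, "qwen3.5:35b": 16.4}
EPISODES = 5  # size of the smoke dataset


def slowdown(local_seconds: float, cloud_seconds: float) -> float:
    """How many times slower the local model is than the cloud API."""
    return local_seconds / cloud_seconds


def batch_seconds(seconds_per_episode: float, episodes: int) -> float:
    """Total wall time for a sequential batch."""
    return seconds_per_episode * episodes


para = slowdown(PARAGRAPH_LATENCY["qwen3.5:35b"], PARAGRAPH_LATENCY["gemini-2.0-flash"])
bullets = slowdown(BULLET_LATENCY["qwen3.5:35b"], BULLET_LATENCY["gemini-2.0-flash"])
total = batch_seconds(PARAGRAPH_LATENCY["qwen3.5:35b"], EPISODES)

print(f"paragraphs: {para:.1f}x slower, bullets: {bullets:.1f}x slower")
print(f"sequential batch of {EPISODES} episodes: {total:.0f}s")
```

The batch figure assumes episodes are processed one after another; local latencies come from one Apple M-series machine and should be re-measured on the target hardware.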
Privacy and cost¶
| Scenario | Recommendation |
|---|---|
| Production, quality-sensitive | Anthropic Haiku 4.5 (cloud) |
| Production, cost-sensitive | Gemini 2.0 Flash (cloud, fastest + strong quality) |
| On-prem required, quality first | qwen3.5:35b (local, 35B params, 23GB VRAM) |
| On-prem required, memory/quality balance | mistral-small3.2:latest (local, 15GB) |
| On-prem required, low-resource | llama3.2:3b (local, 2GB, 7.3s/ep) |
| Bullet JSON, cloud | Anthropic Haiku 4.5 (highest ROUGE-L) or Gemini (fastest) |
| Bullet JSON, on-prem, speed | llama3.2:3b (35.5% ROUGE-L, 4.8s/ep — best fast option) |
| Bullet JSON, on-prem, quality | mistral-small3.2 (34.9% ROUGE-L, best embed 85.8%) |
Summary¶
If privacy or data-residency requirements mandate on-prem, qwen3.5:35b is the strongest local option and reaches cloud-competitive quality (29.9% ROUGE-L against a cloud range of 25.1–32.6%). If memory is the constraint, drop to mistral-small3.2 (15GB, 46s/ep, 25.2%); note that it is slower than qwen3.5:35b, so it does not help with latency. Every local model below that tier trails the weakest cloud API (Grok, 25.1%), by 0.4 to 7.1 percentage points.
Why the rankings changed vs. March 2026¶
| What changed | Effect |
|---|---|
| Silver reference: GPT-4o → Sonnet 4.6 | GPT-family models lost their reference advantage; Anthropic gained one |
| Prompt tuning (round 2) for cloud APIs | Anthropic prompt improved most (+82%); DeepSeek/OpenAI already saturated |
| New Ollama models (llama3.2:3b, llama3.3:70b) | 3B adds a fast low-quality option; 70B OOM on this hardware |
| Reference is denser/longer (Sonnet 4.6) | All ROUGE-L numbers are lower than March; models that match density score better |
The most important takeaway: the reference matters more than the model. OpenAI dropped from 58.8% to 25.7% ROUGE-L not because it got worse, but because it no longer shares a generation family with the reference. Always re-run comparisons when the reference changes.
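The "+82%" figure in the table above is a relative gain computed from the round 1 and round 2 scores in the prompt tuning section. A small sketch that recomputes it for every provider:

```python
# (round 1 score, round 2 score), copied from the prompt tuning table.
ROUNDS = {
    "Anthropic": (0.287, 0.523),
    "Gemini": (0.446, 0.475),
    "Grok": (0.437, 0.456),
    "DeepSeek": (0.502, 0.502),
    "Mistral": (0.468, 0.480),
    "OpenAI": (0.474, 0.474),
}


def relative_gain(before: float, after: float) -> float:
    """Fractional improvement of `after` over `before`."""
    return (after - before) / before


gains = {provider: relative_gain(*scores) for provider, scores in ROUNDS.items()}
for provider, gain in sorted(gains.items(), key=lambda item: item[1], reverse=True):
    print(f"{provider:<10} {gain:+.1%}")
```

These are gains on the autoresearch score, not on ROUGE-L, so they are not directly comparable to the percentages in the metric tables.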
Model IDs (acceptance-tested)¶
| Provider | Model ID used | Notes |
|---|---|---|
| Anthropic | claude-haiku-4-5 | Fastest Anthropic option; Sonnet 4.6 used for silver only |
| OpenAI | gpt-4o-mini | max_completion_tokens required for gpt-5 series |
| DeepSeek | deepseek-chat | — |
| Gemini | gemini-2.0-flash | gemini-2.5-pro / gemini-3.1-pro-preview are thinking models; exhaust tokens on reasoning |
| Grok | grok-3-mini | — |
| Mistral | mistral-small-latest | — |
How to reproduce¶
# Cloud paragraph (any provider)
make experiment-run \
  CONFIG=data/eval/configs/summarization/autoresearch_prompt_anthropic_smoke_paragraph_v1.yaml \
  REFERENCE=silver_sonnet46_smoke_v1

# Cloud bullets (any provider)
make experiment-run \
  CONFIG=data/eval/configs/summarization_bullets/autoresearch_prompt_anthropic_smoke_bullets_v1.yaml \
  REFERENCE=silver_sonnet46_smoke_bullets_v1

# Ollama paragraph
make experiment-run \
  CONFIG=data/eval/configs/summarization/autoresearch_prompt_ollama_qwen35_35b_smoke_paragraph_v1.yaml \
  REFERENCE=silver_sonnet46_smoke_v1

# Ollama bullets
make experiment-run \
  CONFIG=data/eval/configs/summarization_bullets/autoresearch_prompt_ollama_qwen35_35b_smoke_bullets_v1.yaml \
  REFERENCE=silver_sonnet46_smoke_bullets_v1
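To cover the full sweep rather than one provider at a time, the commands above can be generated from the config globs listed at the top of this report. The sketch below is a dry run that only prints the `make` invocations. It assumes the script is run from the repository root and that every matching config should be paired with the silver reference for its track; the helper names are illustrative and not part of the project.

```python
from pathlib import Path

CONFIG_ROOT = Path("data/eval/configs")

# Track directory -> (config glob, silver reference), taken from this report.
TRACKS = {
    "summarization": (
        "autoresearch_prompt_*_smoke_paragraph_v1.yaml",
        "silver_sonnet46_smoke_v1",
    ),
    "summarization_bullets": (
        "autoresearch_prompt_*_smoke_bullets_v1.yaml",
        "silver_sonnet46_smoke_bullets_v1",
    ),
}


def build_command(config: Path, reference: str) -> list[str]:
    """Argument list for one experiment run."""
    return [
        "make",
        "experiment-run",
        f"CONFIG={config.as_posix()}",
        f"REFERENCE={reference}",
    ]


def iter_commands(root: Path = CONFIG_ROOT):
    """Yield one command per matching config, skipping missing directories."""
    for track, (pattern, reference) in TRACKS.items():
        track_dir = root / track
        if not track_dir.is_dir():
            continue
        for config in sorted(track_dir.glob(pattern)):
            yield build_command(config, reference)


if __name__ == "__main__":
    # Dry run: print the commands instead of executing them.
    for command in iter_commands():
        print(" ".join(command))
```

The paragraph and bullet globs also match the Ollama configs, since `autoresearch_prompt_ollama_*` falls under `autoresearch_prompt_*`. Passing each list to `subprocess.run(command, check=True)` would execute the sweep sequentially.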
Related¶
- Smoke v1 (2026-03) — previous report with GPT-4o silver
- Evaluation Reports index — methodology and metric definitions
- AI Provider Comparison Guide — decision guide
- Ollama Provider Guide — Ollama setup and model reference