Evaluation Report: Smoke v1 (April 2026)

Second full provider sweep — 6 cloud APIs + 12 Ollama local models, both paragraph and bullet JSON output, vs the new Claude Sonnet 4.6 silver reference. Includes prompt tuning (RFC-057 autoresearch round 2) results and a cloud vs. on-prem decision section.

| Field | Value |
| --- | --- |
| Date | April 2026 |
| Dataset | curated_5feeds_smoke_v1 (5 episodes, 5 feeds) |
| Reference (paragraphs) | silver_sonnet46_smoke_v1 (Claude Sonnet 4.6) |
| Reference (bullets) | silver_sonnet46_smoke_bullets_v1 (Claude Sonnet 4.6) |
| Schema | metrics_summarization_v2 |
| Paragraph configs | data/eval/configs/summarization/autoresearch_prompt_*_smoke_paragraph_v1.yaml |
| Bullet configs | data/eval/configs/summarization_bullets/autoresearch_prompt_*_smoke_bullets_v1.yaml |
| Ollama paragraph configs | data/eval/configs/summarization/autoresearch_prompt_ollama_*_smoke_paragraph_v1.yaml |
| Ollama bullet configs | data/eval/configs/summarization_bullets/autoresearch_prompt_ollama_*_smoke_bullets_v1.yaml |
| Previous report | Smoke v1 (2026-03) |

For metric definitions and interpretation guidance, see the Evaluation Reports index.


What changed since March 2026

New silver reference

The March 2026 report used silver_gpt4o_smoke_v1 (GPT-4o summaries) as the reference. This report uses silver_sonnet46_smoke_v1 (Claude Sonnet 4.6), selected in April 2026 via a pairwise LLM judge tournament: Sonnet 4.6 beat GPT-5.4 (3-1-1) and swept Gemini 2.0 Flash (5-0) on 5 smoke episodes using dual OpenAI + Anthropic judges.

Why this matters for numbers: ROUGE and BLEU measure surface overlap with the reference. Changing the reference changes all scores — they are not directly comparable to the March 2026 figures. Specifically:

  • March 2026 numbers were inflated for OpenAI (GPT-family models scored against a GPT reference). That bias is gone here, but the Anthropic reference now gives Anthropic models the same kind of advantage.
  • Ranking order shifts: Anthropic now leads cloud providers on ROUGE-L (32.6%) where it was second-to-last in March. OpenAI dropped from first (58.8%) to a tie for fourth with Mistral (25.7%).
  • Ollama Qwen 3.5:35b and 27b now lead local models (29.9% and 29.2%), displacing Mistral Small 3.2 and Qwen 2.5:32b that tied in March (both at 38.4% vs GPT-4o silver).

The ranking shift is expected and meaningful: Sonnet 4.6 writes denser, more content-rich summaries than GPT-4o. Models that structure their output similarly (Anthropic family, verbose Qwen 3.5 models) score better against this reference.
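To make the reference dependence concrete, here is a minimal sketch of ROUGE-L F1 based on the longest common subsequence. It assumes plain whitespace tokenization with no stemming, so it will not reproduce the report's exact numbers, and the example sentences are invented for illustration. The same candidate is scored against a terse and a dense reference.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """ROUGE-L F1 on lowercase whitespace tokens (a simplification)."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


# Invented sentences: one candidate, two references of different density.
candidate = "the host interviews a chip designer about power limits"
terse_ref = "the host interviews a chip designer"
dense_ref = ("in this episode the host argues that power limits now shape "
             "chip design and interviews a designer about the tradeoffs")

score_terse = rouge_l_f1(candidate, terse_ref)
score_dense = rouge_l_f1(candidate, dense_ref)
print(f"vs terse reference: {score_terse:.3f}")
print(f"vs dense reference: {score_dense:.3f}")
```

The candidate text never changes, yet its score drops against the denser reference because recall is measured against a longer target.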

Prompt tuning (RFC-057 round 2)

All 6 cloud provider prompts were tuned via the autoresearch ratchet loop (≥1% gain to accept, 3 consecutive fails = early stop) against silver_sonnet46_smoke_v1. Results:

| Provider | Round 1 score | Round 2 score | Key wins |
| --- | --- | --- | --- |
| Anthropic | 0.287 | 0.523 | Thesis opener (+0.201), vocab alignment (+0.024) |
| Gemini | 0.446 | 0.475 | Thesis opener + vocab + anchor |
| Grok | 0.437 | 0.456 | Thesis opener + vocab + anchor |
| DeepSeek | 0.502 | 0.502 | Saturated (focus line only, no new wins) |
| Mistral | 0.468 | 0.480 | Cause-effect relationships |
| OpenAI | 0.474 | 0.474 | Saturated (all new instructions caused judge divergence) |

Anthropic had the largest gain because it started from the lowest baseline and had the most missing structural instructions. DeepSeek and OpenAI were already near-saturated.
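The accept and early-stop rule can be sketched as below. This is an illustration, not the project's implementation: the function name `ratchet` is hypothetical, the ≥1% threshold is assumed to be relative to the current best score, and the candidate scores are made up.

```python
def ratchet(baseline, candidate_scores, min_gain=0.01, max_fails=3):
    """Keep a candidate only if it beats the current best by min_gain
    (relative, an assumption); stop after max_fails consecutive rejections."""
    best = baseline
    accepted = []  # indices of accepted candidates
    fails = 0
    for index, score in enumerate(candidate_scores):
        if score >= best * (1 + min_gain):
            best = score
            accepted.append(index)
            fails = 0  # an accepted change resets the failure streak
        else:
            fails += 1
            if fails >= max_fails:
                break  # early stop: remaining candidates are never evaluated
    return best, accepted


# Made-up scores: one accepted change, then three rejections in a row.
best, accepted = ratchet(0.500, [0.503, 0.520, 0.521, 0.519, 0.524, 0.900])
print(best, accepted)
```

Under these assumptions a saturated prompt ends the loop at its baseline score, which is the pattern the table shows for DeepSeek and OpenAI.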

Bullets track (new)

This report introduces a second output format: bullet JSON summaries (bullets_json_v1 template). The bullets silver reference (silver_sonnet46_smoke_bullets_v1) was generated by Sonnet 4.6 using the same shared template. Bullet scores are not comparable to paragraph scores — they are a separate track with a separate reference.

Round 1 bullet autoresearch (shared template): early stop after 3 fails — the template was already well-tuned from a prior session.

Model additions

Two new Ollama models were added since the March sweep:

  • llama3.2:3b — smallest model in the set (2GB, 7.3s/ep)
  • llama3.3:70b-instruct-q3_K_M — largest model (34GB); crashed with OOM on this hardware and is excluded from results

Key takeaways

  1. Anthropic Haiku 4.5 is the best cloud provider on ROUGE-L (32.6%) — a complete reversal from March, driven by the new Sonnet 4.6 silver reference and round-2 prompt tuning. Its 86.8% embedding similarity is also the highest in the cloud set.
  2. For cloud providers, the ranking by ROUGE-L is now: Anthropic > DeepSeek > Gemini > OpenAI = Mistral > Grok. This largely inverts the March order, in which OpenAI led and Anthropic was second-to-last.
  3. Qwen 3.5:35b leads Ollama on ROUGE-L (29.9%), followed closely by Qwen 3.5:27b (29.2%). Both outperform every cloud provider except Anthropic.
  4. Mistral Small 3.2 remains a strong local option at 25.2% ROUGE-L, but it fell from joint first to fourth among local models due to the reference change, and its 11.4% BLEU trails both Qwen 3.5 models (18.4% and 18.5%).
  5. On-prem vs. cloud: Qwen 3.5:35b (20s/ep local) is competitive with DeepSeek/Gemini/OpenAI on ROUGE-L while keeping data on-device. See the on-prem vs cloud section below.
  6. Bullets track: Anthropic leads cloud bullets (39.1% ROUGE-L), followed by Gemini (37.1%). DeepSeek leads on embedding similarity (85.0%). For local bullets, llama3.2:3b surprises at 35.5% ROUGE-L (fastest at 4.8s/ep); Mistral Small 3.2 leads on embedding (85.8%). qwen2.5:7b breaks on bullet JSON — skip it for this track.
  7. phi3:mini continues to show extreme verbosity (152.7% coverage, 141.1% WER in paragraphs) — embedding similarity is decent (79.6%) but length makes it unsuitable for production paragraph use.
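A WER above 100%, as reported for phi3:mini, is possible because insertions count as errors while the denominator is the reference length. The sketch below assumes the report's WER follows the usual definition (word-level edit distance divided by reference word count); the sentences are invented.

```python
def word_error_rate(hypothesis, reference):
    """Word-level Levenshtein distance divided by reference length."""
    hyp, ref = hypothesis.split(), reference.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edits needed to turn the first j hypothesis words into
    # the reference prefix processed so far.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, substitution))
        prev = curr
    return prev[-1] / len(ref)


reference = "chip design is power limited"
verbose = ("the episode says that chip design is now mostly "
           "power limited in practice today")

wer_verbose = word_error_rate(verbose, reference)
print(f"WER: {wer_verbose:.0%}")
```

Every reference word appears in the verbose hypothesis, but the nine extra words are nine insertions against a five-word reference, so the rate is 180%.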

Paragraph summaries — full metrics

Cloud providers (sorted by ROUGE-L)

| Provider | Model | Lat/ep | ROUGE-1 | ROUGE-2 | ROUGE-L | BLEU | Embed | Coverage | WER |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Anthropic | claude-haiku-4-5 | 5.1s | 63.1% | 24.9% | 32.6% | 18.9% | 86.8% | 91.0% | 86.6% |
| DeepSeek | deepseek-chat | 9.8s | 64.2% | 23.9% | 28.3% | 16.7% | 81.5% | 81.4% | 87.1% |
| Gemini | gemini-2.0-flash | 2.9s | 54.9% | 20.4% | 27.6% | 11.6% | 81.6% | 63.7% | 86.3% |
| OpenAI | gpt-4o-mini | 8.2s | 58.2% | 19.4% | 25.7% | 13.3% | 82.5% | 81.7% | 90.6% |
| Mistral | mistral-small-latest | 4.0s | 61.8% | 21.2% | 25.7% | 16.5% | 80.1% | 105.9% | 96.4% |
| Grok | grok-3-mini | 8.9s | 57.2% | 17.6% | 25.1% | 11.4% | 77.8% | 81.3% | 90.9% |

Local Ollama — paragraphs (sorted by ROUGE-L)

Hardware: Apple M-series. Re-run on your machine for latency decisions.

| Model | Tag | Lat/ep | ROUGE-1 | ROUGE-2 | ROUGE-L | BLEU | Embed | Coverage | WER |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| qwen3.5:35b | qwen3.5:35b | 20.4s | 65.1% | 26.0% | 29.9% | 18.4% | 81.8% | 94.0% | 91.7% |
| qwen3.5:27b | qwen3.5:27b | 82.8s | 61.5% | 23.5% | 29.2% | 18.5% | 80.4% | 103.8% | 96.9% |
| qwen2.5:32b | qwen2.5:32b | 58.5s | 54.3% | 16.8% | 25.4% | 9.1% | 76.6% | 63.3% | 88.5% |
| mistral-small3.2 | mistral-small3.2:latest | 45.5s | 58.1% | 19.3% | 25.2% | 11.4% | 77.1% | 73.8% | 89.0% |
| qwen3.5:9b | qwen3.5:9b | 23.9s | 59.2% | 20.8% | 24.7% | 12.1% | 75.8% | 82.9% | 92.1% |
| mistral-nemo:12b | mistral-nemo:12b | 24.6s | 53.4% | 18.8% | 23.6% | 12.2% | 77.2% | 91.2% | 96.3% |
| llama3.1:8b | llama3.1:8b | 15.7s | 57.4% | 18.3% | 23.0% | 13.4% | 77.9% | 88.8% | 95.1% |
| llama3.2:3b | llama3.2:3b | 7.3s | 56.3% | 18.5% | 22.6% | 12.4% | 78.9% | 85.4% | 93.0% |
| mistral:7b | mistral:7b | 16.1s | 54.0% | 18.4% | 22.3% | 11.6% | 71.4% | 78.3% | 91.3% |
| qwen2.5:7b | qwen2.5:7b | 13.8s | 55.1% | 17.4% | 21.4% | 10.8% | 78.4% | 77.2% | 92.6% |
| gemma2:9b | gemma2:9b | 16.4s | 50.4% | 13.0% | 21.1% | 7.1% | 78.0% | 72.2% | 90.3% |
| phi3:mini | phi3:mini | 16.5s | 50.3% | 12.1% | 18.0% | 6.9% | 79.6% | 152.7% | 141.1% |
| llama3.3:70b-q3km | llama3.3:70b-instruct-q3_K_M | OOM | | | | | | | |

llama3.3:70b-instruct-q3_K_M (34GB): model runner crashed with OOM on this hardware. Excluded from results.


Bullet JSON summaries — full metrics

Bullet ROUGE scores are computed against silver_sonnet46_smoke_bullets_v1 (JSON bullet text extracted and concatenated). Not comparable to paragraph scores.

Cloud providers — bullets (sorted by ROUGE-L)

| Provider | Model | Lat/ep | ROUGE-1 | ROUGE-2 | ROUGE-L | BLEU | Embed |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Anthropic | claude-haiku-4-5 | 3.4s | 62.3% | 30.6% | 39.1% | 32.2% | 82.8% |
| Gemini | gemini-2.0-flash | 1.9s | 64.1% | 35.1% | 37.1% | 30.7% | 82.9% |
| DeepSeek | deepseek-chat | 8.8s | 65.0% | 32.3% | 34.1% | 30.0% | 85.0% |
| Grok | grok-3-mini | 10.8s | 55.2% | 25.2% | 32.3% | 22.6% | 84.3% |
| OpenAI | gpt-4o-mini | 3.3s | 54.9% | 27.6% | 30.7% | 23.1% | 84.0% |
| Mistral | mistral-small-latest | 2.6s | 59.9% | 27.4% | 29.8% | 28.8% | 84.2% |

Local Ollama — bullets (sorted by ROUGE-L)

| Model | Tag | Lat/ep | ROUGE-1 | ROUGE-L | Embed |
| --- | --- | --- | --- | --- | --- |
| llama3.2:3b | llama3.2:3b | 4.8s | 57.0% | 35.5% | 79.8% |
| mistral-small3.2 | mistral-small3.2:latest | 30.5s | 62.0% | 34.9% | 85.8% |
| qwen3.5:27b | qwen3.5:27b | 53.6s | 63.4% | 33.5% | 84.7% |
| qwen3.5:9b | qwen3.5:9b | 15.4s | 60.7% | 33.3% | 83.4% |
| gemma2:9b | gemma2:9b | 11.7s | 50.4% | 31.1% | 81.3% |
| qwen3.5:35b | qwen3.5:35b | 16.4s | 63.0% | 30.8% | 84.6% |
| phi3:mini | phi3:mini | 8.0s | 53.4% | 30.7% | 78.4% |
| qwen2.5:32b | qwen2.5:32b | 40.9s | 52.9% | 30.1% | 82.4% |
| llama3.1:8b | llama3.1:8b | 10.3s | 54.2% | 29.6% | 77.3% |
| mistral:7b | mistral:7b | 11.1s | 52.5% | 27.8% | 81.2% |
| mistral-nemo:12b | mistral-nemo:12b | 16.2s | 53.1% | 27.6% | 77.9% |
| qwen2.5:7b | qwen2.5:7b | 7.3s | 18.9% | 11.9% | 52.0% |
| llama3.3:70b-q3km | llama3.3:70b-instruct-q3_K_M | OOM | | | |

qwen2.5:7b (11.9% ROUGE-L, 52.0% embed) does not reliably follow the JSON bullet format — inspect predictions before using. All other models produce valid JSON output. llama3.3:70b-instruct-q3_K_M crashed with OOM on this hardware.
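One way to inspect predictions before scoring is to check that each output parses and has the expected shape. The sketch below is an assumption-heavy illustration: the bullets_json_v1 schema is not shown in this report, so the top-level `"bullets"` key holding a list of strings is hypothetical, as is the helper name `extract_bullet_text`.

```python
import json


def extract_bullet_text(raw, key="bullets"):
    """Return the concatenated bullet text, or None if the output is not
    a JSON object with a non-empty list of non-empty strings under `key`.
    The key name is an assumed schema, not taken from bullets_json_v1."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(payload, dict):
        return None
    bullets = payload.get(key)
    if not isinstance(bullets, list) or not bullets:
        return None
    if not all(isinstance(b, str) and b.strip() for b in bullets):
        return None
    return " ".join(b.strip() for b in bullets)


valid = '{"bullets": ["Power limits shape chip design.", "Costs keep rising."]}'
prose = "Here are the key points: power limits shape chip design."

print(extract_bullet_text(valid))
print(extract_bullet_text(prose))
```

Counting the share of episodes where this returns None gives a quick format-compliance rate per model, separate from the ROUGE-L score.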


On-prem Ollama vs cloud API

Decision guide for teams choosing between local and cloud summarization.

This section compares the best on-prem options against cloud APIs across the dimensions that matter for a production deployment: quality, latency, cost, and privacy.

Quality (ROUGE-L, paragraphs)

Anthropic Haiku 4.5 (cloud):    32.6%  ██████████████████████████████████
DeepSeek (cloud):               28.3%  ████████████████████████████
Gemini 2.0 Flash (cloud):       27.6%  ███████████████████████████
─── on-prem competitive zone ───────────────────────────────────────────
qwen3.5:35b (local, 20s):       29.9%  ██████████████████████████████
qwen3.5:27b (local, 83s):       29.2%  █████████████████████████████
qwen2.5:32b (local, 59s):       25.4%  █████████████████████████
mistral-small3.2 (local, 46s):  25.2%  █████████████████████████
─── below cloud floor ───────────────────────────────────────────────────
qwen3.5:9b (local, 24s):        24.7%  ████████████████████████
llama3.1:8b (local, 16s):       23.0%  ███████████████████████

qwen3.5:35b and qwen3.5:27b are the only on-prem models that exceed the cloud median (~26.7%), and qwen3.5:35b does so at a quarter of the latency. It outperforms DeepSeek and Gemini 2.0 Flash and is within 3 points of Anthropic.
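The cloud median and cloud floor can be recomputed directly from the paragraph ROUGE-L columns above. This sketch only restates those table values; the variable names are illustrative.

```python
from statistics import median

# Paragraph ROUGE-L (%) copied from the two tables above.
cloud = {
    "Anthropic": 32.6, "DeepSeek": 28.3, "Gemini": 27.6,
    "OpenAI": 25.7, "Mistral": 25.7, "Grok": 25.1,
}
local = {
    "qwen3.5:35b": 29.9, "qwen3.5:27b": 29.2, "qwen2.5:32b": 25.4,
    "mistral-small3.2": 25.2, "qwen3.5:9b": 24.7, "mistral-nemo:12b": 23.6,
    "llama3.1:8b": 23.0, "llama3.2:3b": 22.6, "mistral:7b": 22.3,
    "qwen2.5:7b": 21.4, "gemma2:9b": 21.1, "phi3:mini": 18.0,
}

cloud_median = median(cloud.values())  # midpoint of 25.7 and 27.6
cloud_floor = min(cloud.values())      # weakest cloud provider

above_median = [m for m, score in local.items() if score > cloud_median]
above_floor = [m for m, score in local.items() if score > cloud_floor]

print(f"cloud median {cloud_median:.2f}%, floor {cloud_floor:.1f}%")
print("above median:", above_median)
print("above floor: ", above_floor)
```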

Latency

  • Cloud leaders: Gemini 2.0 Flash (2.9s/ep), Mistral API (4.0s/ep) for paragraphs; Gemini (1.9s/ep) for bullets.
  • On-prem: qwen3.5:35b at 20s/ep is roughly 5–7× slower than the fastest cloud APIs, but processes 5 episodes in ~100s total, which is acceptable for batch/background jobs.
  • On-prem fast tier: llama3.2:3b (7.3s/ep), qwen2.5:7b (13.8s/ep) are the fastest local options, but their quality is below the cloud floor.

Privacy and cost

| Scenario | Recommendation |
| --- | --- |
| Production, quality-sensitive | Anthropic Haiku 4.5 (cloud) |
| Production, cost-sensitive | Gemini 2.0 Flash (cloud, fastest + strong quality) |
| On-prem required, quality first | qwen3.5:35b (local, 35B params, 23GB VRAM) |
| On-prem required, smaller memory footprint | mistral-small3.2:latest (local, 15GB) |
| On-prem required, low-resource | llama3.2:3b (local, 2GB, 7.3s/ep) |
| Bullet JSON, cloud | Anthropic Haiku 4.5 (highest ROUGE-L) or Gemini (fastest) |
| Bullet JSON, on-prem, speed | llama3.2:3b (35.5% ROUGE-L, 4.8s/ep; best fast option) |
| Bullet JSON, on-prem, quality | mistral-small3.2 (34.9% ROUGE-L, best embed 85.8%) |

Summary

If privacy or data-residency requirements mandate on-prem: qwen3.5:35b is the best local choice, reaching cloud-competitive quality (29.9% ROUGE-L vs. a 25.1–32.6% cloud range) at 20s/ep; qwen3.5:27b scores nearly as well (29.2%) but takes 83s/ep. If memory is the constraint, drop to mistral-small3.2 (15GB, 46s/ep, 25.2%), which is slower as well as smaller. Every local model scoring below it falls under the weakest cloud API (Grok, 25.1%), by 0.4 to 7.1 points.
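The on-prem choice can be expressed as a small constrained selection. This is a sketch, not project code: `LocalModel` and `pick_local_model` are hypothetical names, and only the three local models whose memory footprint this report states are included.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LocalModel:
    tag: str
    memory_gb: float   # footprint as stated in the recommendation table
    latency_s: float   # paragraph latency per episode
    rouge_l: float     # paragraph ROUGE-L (%)


CANDIDATES = [
    LocalModel("qwen3.5:35b", 23, 20.4, 29.9),
    LocalModel("mistral-small3.2:latest", 15, 45.5, 25.2),
    LocalModel("llama3.2:3b", 2, 7.3, 22.6),
]


def pick_local_model(max_memory_gb, max_latency_s=float("inf")) -> Optional[LocalModel]:
    """Highest-ROUGE-L candidate that fits both budgets, or None."""
    eligible = [
        m for m in CANDIDATES
        if m.memory_gb <= max_memory_gb and m.latency_s <= max_latency_s
    ]
    return max(eligible, key=lambda m: m.rouge_l, default=None)


print(pick_local_model(max_memory_gb=24))
print(pick_local_model(max_memory_gb=16))
print(pick_local_model(max_memory_gb=16, max_latency_s=30))
```

Latencies here come from Apple M-series hardware, so the latency budget only means something after re-measuring on the target machine.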


Why the rankings changed vs. March 2026

| What changed | Effect |
| --- | --- |
| Silver reference: GPT-4o → Sonnet 4.6 | GPT-family models lost their reference advantage; Anthropic gained one |
| Prompt tuning (round 2) for cloud APIs | Anthropic prompt improved most (+82%); DeepSeek/OpenAI already saturated |
| New Ollama models (llama3.2:3b, llama3.3:70b) | 3B adds a fast low-quality option; 70B OOM on this hardware |
| Reference is denser/longer (Sonnet 4.6) | All ROUGE-L numbers are lower than March; models that match density score better |
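The +82% figure for Anthropic is a relative gain over the round 1 score. As a quick check, the arithmetic for all six providers can be reproduced from the round 1 and round 2 scores in the prompt tuning table; `relative_gain` is an illustrative helper name.

```python
# (round 1, round 2) scores copied from the prompt tuning table.
round_scores = {
    "Anthropic": (0.287, 0.523),
    "Gemini": (0.446, 0.475),
    "Grok": (0.437, 0.456),
    "DeepSeek": (0.502, 0.502),
    "Mistral": (0.468, 0.480),
    "OpenAI": (0.474, 0.474),
}


def relative_gain(before, after):
    """Fractional improvement of `after` over `before`."""
    return (after - before) / before


gains = {p: relative_gain(*scores) for p, scores in round_scores.items()}

for provider, gain in sorted(gains.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{provider:<10} {gain:+.1%}")
```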

The most important takeaway: the reference matters more than the model. OpenAI dropped from 58.8% to 25.7% ROUGE-L not because it got worse, but because it no longer shares a generation family with the reference. Always re-run comparisons when the reference changes.


Model IDs (acceptance-tested)

| Provider | Model ID used | Notes |
| --- | --- | --- |
| Anthropic | claude-haiku-4-5 | Fastest Anthropic option; Sonnet 4.6 used for silver only |
| OpenAI | gpt-4o-mini | max_completion_tokens required for gpt-5 series |
| DeepSeek | deepseek-chat | |
| Gemini | gemini-2.0-flash | gemini-2.5-pro / gemini-3.1-pro-preview are thinking models; exhaust tokens on reasoning |
| Grok | grok-3-mini | |
| Mistral | mistral-small-latest | |

How to reproduce

# Cloud paragraph (any provider)
make experiment-run \
  CONFIG=data/eval/configs/summarization/autoresearch_prompt_anthropic_smoke_paragraph_v1.yaml \
  REFERENCE=silver_sonnet46_smoke_v1

# Cloud bullets (any provider)
make experiment-run \
  CONFIG=data/eval/configs/summarization_bullets/autoresearch_prompt_anthropic_smoke_bullets_v1.yaml \
  REFERENCE=silver_sonnet46_smoke_bullets_v1

# Ollama paragraph
make experiment-run \
  CONFIG=data/eval/configs/summarization/autoresearch_prompt_ollama_qwen35_35b_smoke_paragraph_v1.yaml \
  REFERENCE=silver_sonnet46_smoke_v1

# Ollama bullets
make experiment-run \
  CONFIG=data/eval/configs/summarization_bullets/autoresearch_prompt_ollama_qwen35_35b_smoke_bullets_v1.yaml \
  REFERENCE=silver_sonnet46_smoke_bullets_v1
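To run the whole sweep rather than one config at a time, the commands above can be generated from the config glob patterns listed at the top of this report. The sketch below only prints the commands; `build_commands` is a hypothetical helper, while the `experiment-run` target and the `CONFIG` and `REFERENCE` variables are the ones used above.

```python
import shlex
from pathlib import Path


def build_commands(config_dir, pattern, reference):
    """One `make experiment-run` argv list per config file matching pattern."""
    commands = []
    for config in sorted(Path(config_dir).glob(pattern)):
        commands.append([
            "make", "experiment-run",
            f"CONFIG={config.as_posix()}",
            f"REFERENCE={reference}",
        ])
    return commands


# Paragraph track; the pattern also matches the ollama_* configs.
for command in build_commands(
    "data/eval/configs/summarization",
    "autoresearch_prompt_*_smoke_paragraph_v1.yaml",
    "silver_sonnet46_smoke_v1",
):
    print(shlex.join(command))  # dry run: print instead of executing
```

Printing first makes it easy to review the list; each argv list could then be handed to `subprocess.run(command, check=True)` to execute the sweep.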