Evaluation Report: Smoke v1 (April 2026)

Second full provider sweep — 6 cloud APIs + 12 Ollama local models, both paragraph and bullet JSON output, vs the new Claude Sonnet 4.6 silver reference. Includes prompt tuning (RFC-057 autoresearch round 2) results and a cloud vs. on-prem decision section.

Field	Value
Date	April 2026
Dataset	`curated_5feeds_smoke_v1` (5 episodes, 5 feeds)
Reference (paragraphs)	`silver_sonnet46_smoke_v1` (Claude Sonnet 4.6)
Reference (bullets)	`silver_sonnet46_smoke_bullets_v1` (Claude Sonnet 4.6)
Schema	`metrics_summarization_v2`
Paragraph configs	`data/eval/configs/summarization/autoresearch_prompt_*_smoke_paragraph_v1.yaml`
Bullet configs	`data/eval/configs/summarization_bullets/autoresearch_prompt_*_smoke_bullets_v1.yaml`
Ollama paragraph configs	`data/eval/configs/summarization/autoresearch_prompt_ollama_*_smoke_paragraph_v1.yaml`
Ollama bullet configs	`data/eval/configs/summarization_bullets/autoresearch_prompt_ollama_*_smoke_bullets_v1.yaml`
Previous report	Smoke v1 (2026-03)

For metric definitions and interpretation guidance, see the Evaluation Reports index.

What changed since March 2026

New silver reference

The March 2026 report used silver_gpt4o_smoke_v1 (GPT-4o summaries) as the reference. This report uses silver_sonnet46_smoke_v1 (Claude Sonnet 4.6), selected in April 2026 via a pairwise LLM judge tournament: Sonnet 4.6 beat GPT-5.4 (3-1-1) and swept Gemini 2.0 Flash (5-0) on 5 smoke episodes using dual OpenAI + Anthropic judges.

Why this matters for numbers: ROUGE and BLEU measure surface overlap with the reference. Changing the reference changes all scores — they are not directly comparable to the March 2026 figures. Specifically:

March 2026 numbers were inflated for OpenAI (GPT family scored against GPT reference). That bias no longer favors OpenAI; with an Anthropic reference it now tilts toward Anthropic-family outputs instead.
Ranking order shifts: Anthropic now leads cloud providers on ROUGE-L (32.6%) where it was second-to-last in March. OpenAI dropped from first (58.8%) to joint fourth (25.7%, tied with Mistral).
Ollama Qwen 3.5:35b and 27b now lead local models (29.9% and 29.2%), displacing Mistral Small 3.2 and Qwen 2.5:32b that tied in March (both at 38.4% vs GPT-4o silver).

The ranking shift is expected and meaningful: Sonnet 4.6 writes denser, more content-rich summaries than GPT-4o. Models that structure their output similarly (Anthropic family, verbose Qwen 3.5 models) score better against this reference.
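To make the reference dependence concrete, here is a minimal sketch of ROUGE-L as a longest-common-subsequence F1 over whitespace tokens. It is a simplified stand-in, not the project's scorer (which may stem, tokenize, or aggregate differently), and the three sentences are invented for illustration.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """Simplified ROUGE-L F1: lowercased whitespace tokens, no stemming."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


# Invented example: one candidate summary, two different references.
candidate = "the host interviews a chip designer about power limits"
terse_ref = "the host interviews a chip designer"
dense_ref = "a chip designer explains why power limits now shape every design decision"

score_vs_terse = rouge_l_f1(candidate, terse_ref)
score_vs_dense = rouge_l_f1(candidate, dense_ref)
print(round(score_vs_terse, 3), round(score_vs_dense, 3))
```

The candidate text never changes, yet its score drops sharply against the denser reference, which is the same effect that separates the March and April tables.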

Prompt tuning (RFC-057 round 2)

All 6 cloud provider prompts were tuned via the autoresearch ratchet loop (≥1% gain to accept, 3 consecutive fails = early stop) against silver_sonnet46_smoke_v1. Results:

Provider	Round 1 score	Round 2 score	Key wins
Anthropic	0.287	0.523	Thesis opener (+0.201), vocab alignment (+0.024)
Gemini	0.446	0.475	Thesis opener + vocab + anchor
Grok	0.437	0.456	Thesis opener + vocab + anchor
DeepSeek	0.502	0.502	Saturated (focus line only, no new wins)
Mistral	0.468	0.480	Cause-effect relationships
OpenAI	0.474	0.474	Saturated (all new instructions caused judge divergence)

Anthropic had the largest gain because it started from the lowest baseline and had the most missing structural instructions. DeepSeek and OpenAI were already near-saturated.
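The accept/stop rule of the ratchet loop can be sketched as follows. This is an illustration under stated assumptions, not the RFC-057 implementation: the 1% threshold is treated as a relative gain over the current best score, candidate edits are appended to the current best prompt, and `score_fn`, `ratchet`, and the toy gains are hypothetical names and values.

```python
from typing import Callable, Iterable, List, Tuple


def ratchet(
    base_prompt: str,
    edits: Iterable[str],
    score_fn: Callable[[str], float],
    min_rel_gain: float = 0.01,       # assumed: >=1% relative gain to accept
    max_consecutive_fails: int = 3,   # 3 consecutive fails = early stop
) -> Tuple[str, float, List[str]]:
    """Greedy ratchet: keep an edit only if it beats the best score by the threshold."""
    best_prompt, best_score = base_prompt, score_fn(base_prompt)
    accepted: List[str] = []
    fails = 0
    for edit in edits:
        candidate = best_prompt + "\n" + edit
        score = score_fn(candidate)
        if score >= best_score * (1 + min_rel_gain):
            best_prompt, best_score = candidate, score
            accepted.append(edit)
            fails = 0  # a win resets the failure streak
        else:
            fails += 1
            if fails >= max_consecutive_fails:
                break  # early stop: later edits are never scored
    return best_prompt, best_score, accepted


# Toy scorer (hypothetical): a baseline plus a fixed gain per instruction present.
TOY_GAINS = {"thesis opener": 0.201, "vocab alignment": 0.024}


def toy_score(prompt: str) -> float:
    return 0.287 + sum(g for name, g in TOY_GAINS.items() if name in prompt)


edits = ["thesis opener", "filler a", "vocab alignment",
         "filler b", "filler c", "filler d", "never tried"]
_, final_score, accepted = ratchet("summarize the episode", edits, toy_score)
print(accepted, round(final_score, 3))
```

With these toy gains the loop accepts two edits, then stops after three straight non-improving candidates, which mirrors the "saturated" outcome reported for DeepSeek and OpenAI when no candidate clears the threshold.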

Bullets track (new)

This report introduces a second output format: bullet JSON summaries (bullets_json_v1 template). The bullets silver reference (silver_sonnet46_smoke_bullets_v1) was generated by Sonnet 4.6 using the same shared template. Bullet scores are not comparable to paragraph scores — they are a separate track with a separate reference.

Round 1 bullet autoresearch (shared template): early stop after 3 fails — the template was already well-tuned from a prior session.

Model additions

Two new Ollama models added since the March sweep:

llama3.2:3b — smallest model in the set (2GB, 7.3s/ep)
llama3.3:70b-instruct-q3_K_M — largest model (34GB); crashed with OOM on this hardware and is excluded from results

Key takeaways

Anthropic Haiku 4.5 is the best cloud provider on ROUGE-L (32.6%) — a complete reversal from March, driven by the new Sonnet 4.6 silver reference and round-2 prompt tuning. Its 86.8% embedding similarity is also the highest in the cloud set.
For cloud providers, the ranking by ROUGE-L is now: Anthropic > DeepSeek > Gemini > OpenAI = Mistral > Grok. This nearly inverts the March order, in which OpenAI ranked first and Anthropic second-to-last.
Qwen 3.5:35b leads Ollama on ROUGE-L (29.9%), followed closely by Qwen 3.5:27b (29.2%). Both outperform every cloud provider except Anthropic.
Mistral Small 3.2 remains a strong local option at 25.2% ROUGE-L, but it fell from joint first to fourth among local models due to the reference change, and its BLEU (11.4%) now trails the Qwen 3.5 27b and 35b models (18.5% and 18.4%).
On-prem vs. cloud: Qwen 3.5:35b (20s/ep local) is competitive with DeepSeek/Gemini/OpenAI on ROUGE-L while keeping data on-device. See the on-prem vs cloud section below.
Bullets track: Anthropic leads cloud bullets (39.1% ROUGE-L), followed by Gemini (37.1%). DeepSeek leads on embedding similarity (85.0%). For local bullets, llama3.2:3b surprises at 35.5% ROUGE-L (fastest at 4.8s/ep); Mistral Small 3.2 leads on embedding (85.8%). qwen2.5:7b breaks on bullet JSON — skip it for this track.
phi3:mini continues to show extreme verbosity (152.7% coverage, 141.1% WER in paragraphs) — embedding similarity is decent (79.6%) but length makes it unsuitable for production paragraph use.
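A WER above 100%, as reported for phi3:mini, is possible because insertions count as errors while the denominator is the reference length. The sketch below assumes the standard word-level definition (substitutions + deletions + insertions, divided by reference word count); the project's exact metric is defined in the Evaluation Reports index, and the example strings are invented.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Classic Levenshtein dynamic programme, one row at a time.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


reference = "short summary of the episode"
verbose = "this is a very long and rambling summary of the whole episode and more"
print(word_error_rate(reference, verbose) > 1.0)
```

A hypothesis much longer than the reference needs many insertions to align, so the ratio can exceed 1 even when most reference words are present.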

Paragraph summaries — full metrics

Cloud providers (sorted by ROUGE-L)

Provider	Model	Lat/ep	ROUGE-1	ROUGE-2	ROUGE-L	BLEU	Embed	Coverage	WER
Anthropic	claude-haiku-4-5	5.1s	63.1%	24.9%	32.6%	18.9%	86.8%	91.0%	86.6%
DeepSeek	deepseek-chat	9.8s	64.2%	23.9%	28.3%	16.7%	81.5%	81.4%	87.1%
Gemini	gemini-2.0-flash	2.9s	54.9%	20.4%	27.6%	11.6%	81.6%	63.7%	86.3%
OpenAI	gpt-4o-mini	8.2s	58.2%	19.4%	25.7%	13.3%	82.5%	81.7%	90.6%
Mistral	mistral-small-latest	4.0s	61.8%	21.2%	25.7%	16.5%	80.1%	105.9%	96.4%
Grok	grok-3-mini	8.9s	57.2%	17.6%	25.1%	11.4%	77.8%	81.3%	90.9%

Local Ollama — paragraphs (sorted by ROUGE-L)

Hardware: Apple M-series. Re-run on your machine for latency decisions.

Model	Tag	Lat/ep	ROUGE-1	ROUGE-2	ROUGE-L	BLEU	Embed	Coverage	WER
qwen3.5:35b	qwen3.5:35b	20.4s	65.1%	26.0%	29.9%	18.4%	81.8%	94.0%	91.7%
qwen3.5:27b	qwen3.5:27b	82.8s	61.5%	23.5%	29.2%	18.5%	80.4%	103.8%	96.9%
qwen2.5:32b	qwen2.5:32b	58.5s	54.3%	16.8%	25.4%	9.1%	76.6%	63.3%	88.5%
mistral-small3.2	mistral-small3.2:latest	45.5s	58.1%	19.3%	25.2%	11.4%	77.1%	73.8%	89.0%
qwen3.5:9b	qwen3.5:9b	23.9s	59.2%	20.8%	24.7%	12.1%	75.8%	82.9%	92.1%
mistral-nemo:12b	mistral-nemo:12b	24.6s	53.4%	18.8%	23.6%	12.2%	77.2%	91.2%	96.3%
llama3.1:8b	llama3.1:8b	15.7s	57.4%	18.3%	23.0%	13.4%	77.9%	88.8%	95.1%
llama3.2:3b	llama3.2:3b	7.3s	56.3%	18.5%	22.6%	12.4%	78.9%	85.4%	93.0%
mistral:7b	mistral:7b	16.1s	54.0%	18.4%	22.3%	11.6%	71.4%	78.3%	91.3%
qwen2.5:7b	qwen2.5:7b	13.8s	55.1%	17.4%	21.4%	10.8%	78.4%	77.2%	92.6%
gemma2:9b	gemma2:9b	16.4s	50.4%	13.0%	21.1%	7.1%	78.0%	72.2%	90.3%
phi3:mini	phi3:mini	16.5s	50.3%	12.1%	18.0%	6.9%	79.6%	152.7%	141.1%
llama3.3:70b-q3km	—	—	—	—	—	—	—	—	—

llama3.3:70b-instruct-q3_K_M (34GB): model runner crashed with OOM on this hardware. Excluded from results.

Bullet JSON summaries — full metrics

Bullet ROUGE scores are computed against silver_sonnet46_smoke_bullets_v1 (JSON bullet text extracted and concatenated). Not comparable to paragraph scores.
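The "extracted and concatenated" step might look like the sketch below. The `bullets_json_v1` schema is not shown in this report, so the top-level `"bullets"` key holding a list of strings is an assumption, and `bullets_to_text` is a hypothetical helper name.

```python
import json


def bullets_to_text(raw: str, key: str = "bullets") -> str:
    """Parse a bullet JSON document and join its bullets into one string for scoring.

    Assumes the payload looks like {"bullets": ["...", "..."]}; the real
    template may use a different key or nesting.
    """
    data = json.loads(raw)
    return " ".join(str(b).strip() for b in data[key])


prediction = '{"bullets": [" First point. ", "Second point."]}'
print(bullets_to_text(prediction))
```

Applying the same flattening to both prediction and reference keeps the comparison on bullet text only, so JSON punctuation does not contribute to overlap.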

Cloud providers — bullets (sorted by ROUGE-L)

Provider	Model	Lat/ep	ROUGE-1	ROUGE-2	ROUGE-L	BLEU	Embed
Anthropic	claude-haiku-4-5	3.4s	62.3%	30.6%	39.1%	32.2%	82.8%
Gemini	gemini-2.0-flash	1.9s	64.1%	35.1%	37.1%	30.7%	82.9%
DeepSeek	deepseek-chat	8.8s	65.0%	32.3%	34.1%	30.0%	85.0%
Grok	grok-3-mini	10.8s	55.2%	25.2%	32.3%	22.6%	84.3%
OpenAI	gpt-4o-mini	3.3s	54.9%	27.6%	30.7%	23.1%	84.0%
Mistral	mistral-small-latest	2.6s	59.9%	27.4%	29.8%	28.8%	84.2%

Local Ollama — bullets (sorted by ROUGE-L)

Model	Tag	Lat/ep	ROUGE-1	ROUGE-L	Embed
llama3.2:3b	llama3.2:3b	4.8s	57.0%	35.5%	79.8%
mistral-small3.2	mistral-small3.2:latest	30.5s	62.0%	34.9%	85.8%
qwen3.5:27b	qwen3.5:27b	53.6s	63.4%	33.5%	84.7%
qwen3.5:9b	qwen3.5:9b	15.4s	60.7%	33.3%	83.4%
gemma2:9b	gemma2:9b	11.7s	50.4%	31.1%	81.3%
qwen3.5:35b	qwen3.5:35b	16.4s	63.0%	30.8%	84.6%
phi3:mini	phi3:mini	8.0s	53.4%	30.7%	78.4%
qwen2.5:32b	qwen2.5:32b	40.9s	52.9%	30.1%	82.4%
llama3.1:8b	llama3.1:8b	10.3s	54.2%	29.6%	77.3%
mistral:7b	mistral:7b	11.1s	52.5%	27.8%	81.2%
mistral-nemo:12b	mistral-nemo:12b	16.2s	53.1%	27.6%	77.9%
qwen2.5:7b	qwen2.5:7b	7.3s	18.9%	11.9%	52.0%
llama3.3:70b-q3km	—	—	—	OOM	—

qwen2.5:7b (11.9% ROUGE-L, 52.0% embed) does not reliably follow the JSON bullet format — inspect predictions before using. All other models produce valid JSON output. llama3.3:70b-instruct-q3_K_M crashed with OOM on this hardware.
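One way to "inspect predictions" at scale is to measure how many outputs fail a basic format check before looking at ROUGE at all. This is a sketch, assuming a top-level `"bullets"` list of non-empty strings (the actual `bullets_json_v1` schema is not shown here); both function names are hypothetical.

```python
import json
from typing import Iterable


def is_valid_bullet_json(raw: str, key: str = "bullets") -> bool:
    """True if raw parses as a JSON object with a non-empty list of non-empty strings."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return False
    if not isinstance(data, dict):
        return False
    bullets = data.get(key)
    return (
        isinstance(bullets, list)
        and len(bullets) > 0
        and all(isinstance(b, str) and bool(b.strip()) for b in bullets)
    )


def invalid_rate(predictions: Iterable[str]) -> float:
    """Fraction of predictions that fail the format check (0.0 for an empty input)."""
    preds = list(predictions)
    if not preds:
        return 0.0
    return sum(not is_valid_bullet_json(p) for p in preds) / len(preds)


samples = ['{"bullets": ["a", "b"]}', "Here are the bullets:\n- a\n- b"]
print(invalid_rate(samples))
```

A model with a high invalid rate on the five smoke episodes is a candidate for exclusion from the bullets track regardless of its overlap scores.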

On-prem Ollama vs cloud API

Decision guide for teams choosing between local and cloud summarization.

This section compares the best on-prem options against cloud APIs across the dimensions that matter for a production deployment: quality, latency, cost, and privacy.

Quality (ROUGE-L, paragraphs)

Anthropic Haiku 4.5 (cloud):    32.6%  ██████████████████████████████████
DeepSeek (cloud):               28.3%  ████████████████████████████
Gemini 2.0 Flash (cloud):       27.6%  ███████████████████████████
─── on-prem competitive zone ───────────────────────────────────────────
qwen3.5:35b (local, 20s):       29.9%  ██████████████████████████████
qwen3.5:27b (local, 83s):       29.2%  █████████████████████████████
qwen2.5:32b (local, 59s):       25.4%  █████████████████████████
mistral-small3.2 (local, 46s):  25.2%  █████████████████████████
─── below cloud floor ───────────────────────────────────────────────────
qwen3.5:9b (local, 24s):        24.7%  ████████████████████████
llama3.1:8b (local, 16s):       23.0%  ███████████████████████

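qwen3.5:35b and qwen3.5:27b are the only on-prem models that exceed the cloud median (about 26.7%), and qwen3.5:35b does so at roughly a quarter of the latency (20s vs 83s per episode). It outperforms DeepSeek and Gemini 2.0 Flash and is within 3 points of Anthropic.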
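The cloud median and cloud floor used in this section can be recomputed directly from the paragraph tables. The snippet below copies the ROUGE-L values from those tables; the variable names are illustrative.

```python
from statistics import median

# ROUGE-L (%) for paragraph summaries, copied from the tables above.
cloud = {"Anthropic": 32.6, "DeepSeek": 28.3, "Gemini": 27.6,
         "OpenAI": 25.7, "Mistral": 25.7, "Grok": 25.1}
local = {"qwen3.5:35b": 29.9, "qwen3.5:27b": 29.2, "qwen2.5:32b": 25.4,
         "mistral-small3.2": 25.2, "qwen3.5:9b": 24.7, "mistral-nemo:12b": 23.6,
         "llama3.1:8b": 23.0, "llama3.2:3b": 22.6, "mistral:7b": 22.3,
         "qwen2.5:7b": 21.4, "gemma2:9b": 21.1, "phi3:mini": 18.0}

cloud_median = median(cloud.values())   # mean of the two middle values
cloud_floor = min(cloud.values())       # lowest-scoring cloud API

above_median = [m for m, s in local.items() if s > cloud_median]
below_floor = [m for m, s in local.items() if s < cloud_floor]

print(round(cloud_median, 2), cloud_floor)
print(above_median)
print(len(below_floor), "local models below the cloud floor")
```

With six cloud providers the median is the average of the third and fourth scores, so it sits between Gemini and the OpenAI/Mistral tie.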

Latency

Cloud leaders: Gemini 2.0 Flash (2.9s/ep), Mistral API (4.0s/ep) for paragraphs; Gemini (1.9s/ep) for bullets.
On-prem: qwen3.5:35b at 20s/ep is roughly 5–7× slower than the fastest cloud APIs (2.9–4.0s/ep for paragraphs), but processes 5 episodes in ~100s total, which is acceptable for batch/background jobs.
On-prem fast tier: llama3.2:3b (7.3s/ep), qwen2.5:7b (13.8s/ep) are the fastest local options, but their quality is below the cloud floor.
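The latency comparison is simple arithmetic on the per-episode figures from the paragraph tables, sketched here so it can be rerun with latencies measured on other hardware. The helper names are illustrative, and the batch estimate assumes episodes are processed sequentially.

```python
def slowdown(local_s_per_ep: float, cloud_s_per_ep: float) -> float:
    """How many times slower the local model is per episode."""
    return local_s_per_ep / cloud_s_per_ep


def batch_seconds(s_per_ep: float, episodes: int) -> float:
    """Total wall-clock time for a sequential batch (no parallelism assumed)."""
    return s_per_ep * episodes


qwen35_35b = 20.4   # s/ep, local paragraphs
gemini = 2.9        # s/ep, fastest cloud paragraphs
mistral_api = 4.0   # s/ep, second-fastest cloud paragraphs

print(round(slowdown(qwen35_35b, gemini), 1))
print(round(slowdown(qwen35_35b, mistral_api), 1))
print(round(batch_seconds(qwen35_35b, 5)))
```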

Privacy and cost

Scenario	Recommendation
Production, quality-sensitive	Anthropic Haiku 4.5 (cloud)
Production, cost-sensitive	Gemini 2.0 Flash (cloud, fastest + strong quality)
On-prem required, quality first	qwen3.5:35b (local, 35B params, 23GB VRAM)
On-prem required, speed/quality balance	mistral-small3.2:latest (local, 15GB)
On-prem required, low-resource	llama3.2:3b (local, 2GB, 7.3s/ep)
Bullet JSON, cloud	Anthropic Haiku 4.5 (highest ROUGE-L) or Gemini (fastest)
Bullet JSON, on-prem, speed	llama3.2:3b (35.5% ROUGE-L, 4.8s/ep — best fast option)
Bullet JSON, on-prem, quality	mistral-small3.2 (34.9% ROUGE-L, best embed 85.8%)

Summary

If privacy or data-residency requirements mandate on-prem: qwen3.5:35b is the best local choice, reaching cloud-competitive quality (29.9% ROUGE-L vs a 25.1–32.6% cloud range) at 20s/ep; qwen3.5:27b is close on quality (29.2%) but about four times slower. If memory is the constraint, drop to mistral-small3.2 (15GB, 46s/ep, 25.2%), accepting that it is slower than qwen3.5:35b on this hardware. Every model below that tier scores under the lowest cloud API (Grok, 25.1%), by 0.4 to 7.1 points.
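The on-prem rows of the recommendation table can be read as a lookup keyed on available memory. The sketch below encodes only the three local models whose memory footprint this report states; `best_local_model` is a hypothetical helper, and it ranks purely on paragraph ROUGE-L, ignoring latency.

```python
from typing import Optional

# (model, memory in GB, paragraph ROUGE-L %) as stated in this report.
LOCAL_OPTIONS = [
    ("qwen3.5:35b", 23, 29.9),
    ("mistral-small3.2:latest", 15, 25.2),
    ("llama3.2:3b", 2, 22.6),
]


def best_local_model(memory_gb: float) -> Optional[str]:
    """Highest-ROUGE-L model that fits in the given memory budget, or None."""
    fitting = [(score, name) for name, mem, score in LOCAL_OPTIONS if mem <= memory_gb]
    if not fitting:
        return None
    return max(fitting)[1]


for budget in (24, 16, 8):
    print(budget, "GB ->", best_local_model(budget))
```

Extending the table with footprints for the other models would require measuring them on the target machine, since this report does not list them.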

Why the rankings changed vs. March 2026

What changed	Effect
Silver reference: GPT-4o → Sonnet 4.6	GPT-family models lost their reference advantage; Anthropic gained one
Prompt tuning (round 2) for cloud APIs	Anthropic prompt improved most (+82%); DeepSeek/OpenAI already saturated
New Ollama models (llama3.2:3b, llama3.3:70b)	3B adds a fast low-quality option; 70B OOM on this hardware
Reference is denser/longer (Sonnet 4.6)	All ROUGE-L numbers are lower than March; models that match density score better
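The +82% figure for Anthropic is the relative change between the round 1 and round 2 scores in the prompt tuning table. The same calculation for every provider, using the scores from that table:

```python
# (round 1 score, round 2 score) from the prompt tuning results table.
ROUNDS = {
    "Anthropic": (0.287, 0.523),
    "Gemini": (0.446, 0.475),
    "Grok": (0.437, 0.456),
    "DeepSeek": (0.502, 0.502),
    "Mistral": (0.468, 0.480),
    "OpenAI": (0.474, 0.474),
}


def relative_gain_pct(before: float, after: float) -> float:
    """Relative improvement in percent: (after - before) / before * 100."""
    return (after - before) / before * 100


gains = {p: relative_gain_pct(r1, r2) for p, (r1, r2) in ROUNDS.items()}
for provider, gain in sorted(gains.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{provider:10s} {gain:+.1f}%")
```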

The most important takeaway: the reference matters more than the model. OpenAI dropped from 58.8% to 25.7% ROUGE-L not because it got worse, but because it no longer shares a model family with the reference. Always re-run comparisons when the reference changes.

Model IDs (acceptance-tested)

Provider	Model ID used	Notes
Anthropic	`claude-haiku-4-5`	Fastest Anthropic option; Sonnet 4.6 used for silver only
OpenAI	`gpt-4o-mini`	`max_completion_tokens` required for gpt-5 series
DeepSeek	`deepseek-chat`	—
Gemini	`gemini-2.0-flash`	`gemini-2.5-pro` / `gemini-3.1-pro-preview` are thinking models; exhaust tokens on reasoning
Grok	`grok-3-mini`	—
Mistral	`mistral-small-latest`	—

How to reproduce

# Cloud paragraph (any provider)
make experiment-run \
  CONFIG=data/eval/configs/summarization/autoresearch_prompt_anthropic_smoke_paragraph_v1.yaml \
  REFERENCE=silver_sonnet46_smoke_v1

# Cloud bullets (any provider)
make experiment-run \
  CONFIG=data/eval/configs/summarization_bullets/autoresearch_prompt_anthropic_smoke_bullets_v1.yaml \
  REFERENCE=silver_sonnet46_smoke_bullets_v1

# Ollama paragraph
make experiment-run \
  CONFIG=data/eval/configs/summarization/autoresearch_prompt_ollama_qwen35_35b_smoke_paragraph_v1.yaml \
  REFERENCE=silver_sonnet46_smoke_v1

# Ollama bullets
make experiment-run \
  CONFIG=data/eval/configs/summarization_bullets/autoresearch_prompt_ollama_qwen35_35b_smoke_bullets_v1.yaml \
  REFERENCE=silver_sonnet46_smoke_bullets_v1
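To run a whole track rather than one config, the commands above can be generated from the config globs listed in the metadata table. This is a dry-run sketch that only prints the `make` invocations; the helper names are hypothetical, and it assumes it is run from the repository root so the relative paths resolve.

```python
import shlex
from pathlib import Path
from typing import List


def experiment_command(config: str, reference: str) -> List[str]:
    """Build the argv for one `make experiment-run` call, as shown above."""
    return ["make", "experiment-run", f"CONFIG={config}", f"REFERENCE={reference}"]


def sweep_commands(config_dir: str, pattern: str, reference: str) -> List[List[str]]:
    """One command per config file matching the glob pattern, in sorted order."""
    return [
        experiment_command(path.as_posix(), reference)
        for path in sorted(Path(config_dir).glob(pattern))
    ]


# Dry run: print the paragraph-track commands instead of executing them.
for argv in sweep_commands(
    "data/eval/configs/summarization",
    "autoresearch_prompt_*_smoke_paragraph_v1.yaml",
    "silver_sonnet46_smoke_v1",
):
    print(shlex.join(argv))
```

Passing each `argv` to `subprocess.run(argv, check=True)` would execute the sweep; printing first makes it easy to confirm that the glob picked up the intended configs.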

Smoke v1 (2026-03) — previous report with GPT-4o silver
Evaluation Reports index — methodology and metric definitions
AI Provider Comparison Guide — decision guide
Ollama Provider Guide — Ollama setup and model reference