Evaluation Report: Smoke v1 (March 2026)

First full provider sweep: 6 cloud APIs and 11 local Ollama models, each scored against the silver GPT-4o reference.

| Field | Value |
| --- | --- |
| Date | March 2026 |
| Dataset | `curated_5feeds_smoke_v1` (5 episodes, 5 feeds) |
| Reference | `silver_gpt4o_smoke_v1` (GPT-4o summaries) |
| Schema | `metrics_summarization_v2` |
| Configs | `data/eval/configs/llm_*_smoke_v1.yaml` |
| Run data | `data/eval/runs/llm_*_smoke_v1/metrics.json` |

For metric definitions and interpretation guidance, see the Evaluation Reports index.

Key takeaways

OpenAI GPT-4o scores best on every quality metric, although it is also the slowest cloud provider (15.4s/ep). The quality lead is expected, since the silver reference is also GPT-4o. Use non-OpenAI comparisons for a fairer picture.
Gemini (gemini-2.0-flash) is the best non-OpenAI cloud provider on ROUGE-L (33.3%) and embedding similarity (87.3%), with the fastest cloud latency (2.7s/ep).
Mistral API (`mistral-small-latest`) is a close second on ROUGE-L (32.5%) and nearly as fast as Gemini, at 2.8s/ep.
On Ollama, Mistral Small 3.2 and Qwen 2.5:32b tie at 38.4% ROUGE-L, and both beat every cloud provider except OpenAI on that metric. Of the two, Mistral Small 3.2 leads on ROUGE-1, BLEU, and embedding similarity.
Qwen 3.5:9b (with `reasoning_effort: none`) is the best-value Qwen 3.5 model: 30.3% ROUGE-L, on par with the 27B and 35B variants, and the highest embedding similarity of the three (85.2%).
phi3:mini has the highest embedding similarity among Ollama models (86.9%) but extreme verbosity (189% coverage, 173% WER) — check output length before using.

Full metrics table

All providers, sorted by category (cloud first, then Ollama), then by ROUGE-L descending within each category.

| Run | Latency/ep | ROUGE-1 | ROUGE-2 | ROUGE-L | BLEU | Embed | Coverage | WER |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| llm_openai_smoke_v1 | 15.4s | 77.2% | 54.0% | 58.8% | 50.0% | 92.7% | 100.1% | 61.5% |
| llm_gemini_smoke_v1 | 2.7s | 61.3% | 23.4% | 33.3% | 16.3% | 87.3% | 78.1% | 83.8% |
| llm_mistral_smoke_v1 | 2.8s | 59.7% | 23.8% | 32.5% | 17.2% | 84.8% | 99.5% | 92.5% |
| llm_grok_smoke_v1 | 13.2s | 59.0% | 21.9% | 29.5% | 15.5% | 85.4% | 90.9% | 89.1% |
| llm_anthropic_smoke_v1 | 4.8s | 58.8% | 23.6% | 29.4% | 16.9% | 81.8% | 100.7% | 93.2% |
| llm_deepseek_smoke_v1 | 14.2s | 60.5% | 22.3% | 26.3% | 16.1% | 85.0% | 100.7% | 98.5% |
| llm_ollama_qwen25_32b_smoke_v1 | 54.8s | 65.6% | 30.8% | 38.4% | 22.0% | 85.2% | 78.4% | 81.3% |
| llm_ollama_mistral_small3_2_smoke_v1 | 48.6s | 69.4% | 33.0% | 38.4% | 27.8% | 85.8% | 91.4% | 83.2% |
| llm_ollama_mistral_7b_smoke_v1 | 17.4s | 61.4% | 24.8% | 32.8% | 18.9% | 80.4% | 97.0% | 92.2% |
| llm_ollama_mistral_nemo_12b_smoke_v1 | 22.8s | 59.2% | 23.4% | 30.7% | 16.0% | 82.5% | 112.9% | 102.9% |
| llm_ollama_qwen35_9b_smoke_v1 | 21.9s | 60.1% | 23.2% | 30.3% | 16.4% | 85.2% | 102.7% | 93.9% |
| llm_ollama_qwen35_27b_smoke_v1 | 81.5s | 61.2% | 24.8% | 30.2% | 17.2% | 82.4% | 128.5% | 113.1% |
| llm_ollama_qwen35_35b_smoke_v1 | 23.7s | 61.1% | 23.7% | 30.1% | 17.5% | 80.8% | 116.4% | 108.9% |
| llm_ollama_llama31_8b_smoke_v1 | 16.4s | 59.6% | 25.7% | 28.8% | 20.0% | 78.8% | 110.0% | 104.4% |
| llm_ollama_qwen25_7b_smoke_v1 | 12.1s | 63.0% | 24.1% | 28.3% | 17.7% | 84.9% | 95.6% | 95.3% |
| llm_ollama_gemma2_9b_smoke_v1 | 19.6s | 58.2% | 19.0% | 26.7% | 13.1% | 82.8% | 89.4% | 89.2% |
| llm_ollama_phi3_mini_smoke_v1 | 17.2s | 53.0% | 17.2% | 22.6% | 9.6% | 86.9% | 189.1% | 172.9% |
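Because the OpenAI row shares a model family with the silver reference, a combined ranking that leaves it out can be easier to read than the per-category ordering. The sketch below hard-codes the ROUGE-L and embedding columns from the table; the `rank_runs` helper and its tie-break on embedding similarity are illustrative assumptions, not part of the evaluation tooling.

```python
# (ROUGE-L %, embedding similarity %) copied from the full metrics table.
RUNS = {
    "llm_openai_smoke_v1": (58.8, 92.7),
    "llm_gemini_smoke_v1": (33.3, 87.3),
    "llm_mistral_smoke_v1": (32.5, 84.8),
    "llm_grok_smoke_v1": (29.5, 85.4),
    "llm_anthropic_smoke_v1": (29.4, 81.8),
    "llm_deepseek_smoke_v1": (26.3, 85.0),
    "llm_ollama_qwen25_32b_smoke_v1": (38.4, 85.2),
    "llm_ollama_mistral_small3_2_smoke_v1": (38.4, 85.8),
    "llm_ollama_mistral_7b_smoke_v1": (32.8, 80.4),
    "llm_ollama_mistral_nemo_12b_smoke_v1": (30.7, 82.5),
    "llm_ollama_qwen35_9b_smoke_v1": (30.3, 85.2),
    "llm_ollama_qwen35_27b_smoke_v1": (30.2, 82.4),
    "llm_ollama_qwen35_35b_smoke_v1": (30.1, 80.8),
    "llm_ollama_llama31_8b_smoke_v1": (28.8, 78.8),
    "llm_ollama_qwen25_7b_smoke_v1": (28.3, 84.9),
    "llm_ollama_gemma2_9b_smoke_v1": (26.7, 82.8),
    "llm_ollama_phi3_mini_smoke_v1": (22.6, 86.9),
}


def rank_runs(runs, exclude=("llm_openai_smoke_v1",)):
    """Sort runs by ROUGE-L, then embedding similarity, both descending.

    `exclude` drops runs that share a model family with the reference.
    """
    rows = [(name, rouge_l, embed)
            for name, (rouge_l, embed) in runs.items()
            if name not in exclude]
    return sorted(rows, key=lambda row: (row[1], row[2]), reverse=True)


for name, rouge_l, embed in rank_runs(RUNS)[:3]:
    print(f"{name}: ROUGE-L {rouge_l}%, embed {embed}%")
```

With the table's values, the top three are the two tied Ollama runs (Mistral Small 3.2 ahead on the embedding tie-break), followed by Gemini.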

Cloud LLMs (ranked by ROUGE-L)

| Rank | Provider | Model (eval config) | ROUGE-L | Embed | Latency | Note |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | OpenAI | GPT-4o | 58.8% | 92.7% | 15.4s | Same family as silver reference |
| 2 | Gemini | gemini-2.0-flash | 33.3% | 87.3% | 2.7s | Best non-OpenAI ROUGE-L + embed |
| 3 | Mistral | mistral-small-latest | 32.5% | 84.8% | 2.8s | Very fast; strong ROUGE-L |
| 4 | Grok | grok-3-mini | 29.5% | 85.4% | 13.2s | Higher latency than Gemini/Mistral |
| 5 | Anthropic | claude-haiku-4-5 | 29.4% | 81.8% | 4.8s | Lowest embed in cloud set |
| 6 | DeepSeek | (API default) | 26.3% | 85.0% | 14.2s | Lowest ROUGE-L; solid embed |

Local Ollama (ranked by ROUGE-L)

Hardware and Ollama builds vary; re-run on your machine for latency-based decisions.

| Run | ROUGE-L | Embed | Latency | Note |
| --- | --- | --- | --- | --- |
| llm_ollama_qwen25_32b_smoke_v1 | 38.4% | 85.2% | 54.8s | Ties top ROUGE-L; `qwen2.5:32b` |
| llm_ollama_mistral_small3_2_smoke_v1 | 38.4% | 85.8% | 48.6s | Ties top ROUGE-L; best ROUGE-1/BLEU; `mistral-small3.2:latest` |
| llm_ollama_mistral_7b_smoke_v1 | 32.8% | 80.4% | 17.4s | Strong for 7B; `mistral:7b` |
| llm_ollama_mistral_nemo_12b_smoke_v1 | 30.7% | 82.5% | 22.8s | Good balance; `mistral-nemo:12b` |
| llm_ollama_qwen35_9b_smoke_v1 | 30.3% | 85.2% | 21.9s | Best Qwen 3.5 size/quality; use `reasoning_effort: none` |
| llm_ollama_qwen35_27b_smoke_v1 | 30.2% | 82.4% | 81.5s | Similar to 9B/35B; slowest in block |
| llm_ollama_qwen35_35b_smoke_v1 | 30.1% | 80.8% | 23.7s | Similar ROUGE-L; faster than 27B |
| llm_ollama_llama31_8b_smoke_v1 | 28.8% | 78.8% | 16.4s | `llama3.1:8b` |
| llm_ollama_qwen25_7b_smoke_v1 | 28.3% | 84.9% | 12.1s | Fastest Ollama row; `qwen2.5:7b` |
| llm_ollama_gemma2_9b_smoke_v1 | 26.7% | 82.8% | 19.6s | `gemma2:9b` |
| llm_ollama_phi3_mini_smoke_v1 | 22.6% | 86.9% | 17.2s | Highest embed in block but extreme verbosity (189% coverage) |
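One way to read the latency and ROUGE-L columns together is to keep only the runs that no other run beats on both at once (a Pareto front). The sketch below uses the table's numbers; the `pareto_front` helper is hypothetical, it ignores embedding similarity and verbosity, and the latencies are specific to the machine used for this report.

```python
# (latency in seconds per episode, ROUGE-L %) copied from the Ollama table.
OLLAMA_RUNS = {
    "llm_ollama_qwen25_32b_smoke_v1": (54.8, 38.4),
    "llm_ollama_mistral_small3_2_smoke_v1": (48.6, 38.4),
    "llm_ollama_mistral_7b_smoke_v1": (17.4, 32.8),
    "llm_ollama_mistral_nemo_12b_smoke_v1": (22.8, 30.7),
    "llm_ollama_qwen35_9b_smoke_v1": (21.9, 30.3),
    "llm_ollama_qwen35_27b_smoke_v1": (81.5, 30.2),
    "llm_ollama_qwen35_35b_smoke_v1": (23.7, 30.1),
    "llm_ollama_llama31_8b_smoke_v1": (16.4, 28.8),
    "llm_ollama_qwen25_7b_smoke_v1": (12.1, 28.3),
    "llm_ollama_gemma2_9b_smoke_v1": (19.6, 26.7),
    "llm_ollama_phi3_mini_smoke_v1": (17.2, 22.6),
}


def pareto_front(runs):
    """Return run names not dominated on (lower latency, higher ROUGE-L).

    A run is dominated when another run is at least as fast and at least
    as accurate, and strictly better on one of the two.
    """
    front = []
    for name, (latency, rouge_l) in runs.items():
        dominated = any(
            o_lat <= latency and o_rouge >= rouge_l
            and (o_lat < latency or o_rouge > rouge_l)
            for other, (o_lat, o_rouge) in runs.items()
            if other != name
        )
        if not dominated:
            front.append(name)
    # Order the survivors from fastest to slowest.
    return sorted(front, key=lambda n: runs[n][0])


for name in pareto_front(OLLAMA_RUNS):
    latency, rouge_l = OLLAMA_RUNS[name]
    print(f"{name}: {latency}s/ep, ROUGE-L {rouge_l}%")
```

On these numbers the front is `qwen2.5:7b`, `llama3.1:8b`, `mistral:7b`, and `mistral-small3.2:latest`. Qwen 3.5:9b drops out on ROUGE-L alone even though its embedding similarity is higher than `mistral:7b`, so treat the front as a first filter rather than a final ranking.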

Model IDs (acceptance-tested)

Use these model IDs in eval/acceptance configs to avoid API errors:

| Provider | Recommended model ID | Deprecated / not found |
| --- | --- | --- |
| Anthropic | `claude-haiku-4-5` | `claude-3-5-haiku-20241022` (404 deprecated) |
| Gemini | `gemini-2.0-flash` | (none) |
| Grok | `grok-3-mini` | `grok-2` (400 model not found) |

Eval configs: `data/eval/configs/llm_*_smoke_v1.yaml`. Acceptance configs: `config/acceptance/summarization/acceptance_planet_money_*.yaml`.
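A small guard can catch the failing IDs before a run starts. This is a minimal sketch built only from the table of model IDs: the `check_model_id` function is hypothetical, and how a model ID is read out of a config file is left out because the config schema is not shown here.

```python
# Failing model IDs mapped to the acceptance-tested replacement.
DEPRECATED_MODEL_IDS = {
    "claude-3-5-haiku-20241022": "claude-haiku-4-5",  # returned 404 (deprecated)
    "grok-2": "grok-3-mini",                          # returned 400 (model not found)
}


def check_model_id(model_id):
    """Return a warning for a known-bad model ID, or None if it is not listed."""
    replacement = DEPRECATED_MODEL_IDS.get(model_id)
    if replacement is None:
        return None
    return (f"model ID {model_id!r} failed acceptance testing; "
            f"use {replacement!r} instead")


for candidate in ("gemini-2.0-flash", "grok-2"):
    print(candidate, "->", check_model_id(candidate) or "ok")
```

An ID that is not in the mapping is only "not known to fail"; it has not necessarily been acceptance-tested.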

Observations and notes

Qwen 3.5 family (9b/27b/35b): All three sizes cluster tightly around 30% ROUGE-L (30.1% to 30.3%). The 9B model is the best value, with similar quality at roughly a quarter of the 27B model's latency (21.9s vs 81.5s per episode). Use `reasoning_effort: none` in the Ollama path for Qwen 3.5.
Mistral family: Mistral Small 3.2 (local) outperforms the Mistral API (`mistral-small-latest`) on ROUGE-L (38.4% vs 32.5%), likely because the local model runs without the API's token limits and timeout constraints.
phi3:mini: Despite having the highest embedding similarity (86.9%) among Ollama models, its 189% coverage ratio and 173% WER indicate severe verbosity. Inspect output length before using in production.
Silver reference bias: OpenAI's 58.8% ROUGE-L and 92.7% embedding similarity reflect family similarity, not necessarily superior quality. For a fairer comparison, focus on the non-OpenAI rows.
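The verbosity concern can be turned into a simple screen on the coverage column. The sketch below uses the Ollama coverage values from the full metrics table; the 120% cutoff (`COVERAGE_LIMIT`) is an arbitrary illustrative threshold, not a value defined by the evaluation schema.

```python
# Coverage (%) per Ollama run, copied from the full metrics table.
COVERAGE = {
    "llm_ollama_qwen25_32b_smoke_v1": 78.4,
    "llm_ollama_mistral_small3_2_smoke_v1": 91.4,
    "llm_ollama_mistral_7b_smoke_v1": 97.0,
    "llm_ollama_mistral_nemo_12b_smoke_v1": 112.9,
    "llm_ollama_qwen35_9b_smoke_v1": 102.7,
    "llm_ollama_qwen35_27b_smoke_v1": 128.5,
    "llm_ollama_qwen35_35b_smoke_v1": 116.4,
    "llm_ollama_llama31_8b_smoke_v1": 110.0,
    "llm_ollama_qwen25_7b_smoke_v1": 95.6,
    "llm_ollama_gemma2_9b_smoke_v1": 89.4,
    "llm_ollama_phi3_mini_smoke_v1": 189.1,
}

COVERAGE_LIMIT = 120.0  # assumed cutoff; tune to your own tolerance


def flag_verbose(coverage, limit=COVERAGE_LIMIT):
    """Return runs whose coverage exceeds `limit`, most verbose first."""
    flagged = [name for name, value in coverage.items() if value > limit]
    return sorted(flagged, key=lambda name: coverage[name], reverse=True)


for name in flag_verbose(COVERAGE):
    print(f"{name}: coverage {COVERAGE[name]}%")
```

At this cutoff the screen flags phi3:mini (189.1%) and Qwen 3.5:27b (128.5%); a high embedding score on a flagged run deserves a manual look at the actual output length.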

How to reproduce

```bash
# Cloud providers (requires API keys)
make experiment-run CONFIG=data/eval/configs/llm_openai_smoke_v1.yaml \
     REFERENCE=silver_gpt4o_smoke_v1

# Ollama (requires ollama serve + model pulled)
make experiment-run CONFIG=data/eval/configs/llm_ollama_qwen25_32b_smoke_v1.yaml \
     REFERENCE=silver_gpt4o_smoke_v1
```
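To repeat the whole sweep rather than a single run, the same `make` invocation can be generated for every smoke config. The following is a sketch under assumptions: it is run from the repository root, the helper names (`find_configs`, `build_command`, `run_all`) are hypothetical, and it defaults to a dry run that only prints the commands.

```python
import subprocess
from pathlib import Path

REFERENCE = "silver_gpt4o_smoke_v1"


def find_configs(config_dir="data/eval/configs"):
    """List the smoke configs matching llm_*_smoke_v1.yaml, sorted by name."""
    return sorted(Path(config_dir).glob("llm_*_smoke_v1.yaml"))


def build_command(config_path, reference=REFERENCE):
    """Build the argv for one `make experiment-run` invocation."""
    return [
        "make",
        "experiment-run",
        f"CONFIG={Path(config_path).as_posix()}",
        f"REFERENCE={reference}",
    ]


def run_all(config_dir="data/eval/configs", dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for config in find_configs(config_dir):
        command = build_command(config)
        print(" ".join(command))
        if not dry_run:
            # check=True stops the sweep on the first failing run.
            subprocess.run(command, check=True)


if __name__ == "__main__":
    run_all(dry_run=True)
```

Cloud configs still need their API keys, and Ollama configs still need `ollama serve` running with the model pulled, so a dry run is a cheap way to confirm which configs will be picked up first.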

Evaluation Reports index — methodology and metric definitions
AI Provider Comparison Guide — decision guide using these results
Provider Deep Dives — per-provider reference cards with links to measured performance