Evaluation Reports

Methodology, metrics rationale, and a library of every provider comparison we have run.

This index is the single entry point for understanding how we measure summarization quality and why each metric was chosen. Individual reports (linked below) contain the raw numbers; this page gives you the context to interpret them.

For the full metric catalog (intrinsic gates, length, cost, performance) see the Metrics Guide. This page focuses on the vs-reference comparison metrics used across provider evaluation reports.


Why we measure

Choosing a summarization provider is a multi-dimensional decision: quality, latency, cost, and privacy all matter. Subjective "this looks good" assessments do not scale and cannot be reproduced. Our evaluation framework produces repeatable, quantitative comparisons so that every provider claim in the AI Provider Comparison Guide is backed by data.


Methodology

Dataset

All smoke evaluations use the curated_5feeds_smoke_v1 dataset: 5 episodes drawn from 5 distinct podcast feeds, covering a range of topics and episode lengths. The dataset is frozen (versioned in data/eval/datasets/) so that every run processes identical input.

Silver reference

We compare against a silver reference (silver_gpt4o_smoke_v1) — summaries generated by GPT-4o on the same episodes. A silver reference is pragmatic: it is cheaper and faster to produce than human-written gold references, and GPT-4o is widely regarded as a strong summarizer. The trade-off is that OpenAI models are systematically favored: they share a model family with the reference, so stylistic and lexical overlap inflates their scores. We acknowledge this bias explicitly in every report.

Execution

Each run is executed with:

make experiment-run CONFIG=data/eval/configs/<config>.yaml \
     REFERENCE=silver_gpt4o_smoke_v1

The runner processes every episode, writes predictions.jsonl, then calls score_run() to compute metrics. Results land in data/eval/runs/<run_id>/metrics.json.
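
A minimal sketch of consuming those results, assuming a hypothetical flat JSON schema (the key names `rouge1`, `embedding_cosine`, `coverage` below are illustrative — check an actual metrics.json for the real layout):

```python
import json
from pathlib import Path

def load_run_metrics(run_id, runs_root="data/eval/runs"):
    """Load the metrics.json written for one run. Path layout per the docs;
    the file's internal schema is assumed here, not confirmed."""
    path = Path(runs_root) / run_id / "metrics.json"
    with path.open() as f:
        return json.load(f)

def headline(metrics):
    """Pick out a few headline numbers for a quick comparison table."""
    keys = ("rouge1", "embedding_cosine", "coverage")  # illustrative keys
    return {k: metrics[k] for k in keys if k in metrics}

# Example with an in-memory record (no files needed):
sample = {"rouge1": 0.62, "rouge2": 0.31, "embedding_cosine": 0.86, "coverage": 1.05}
print(headline(sample))  # {'rouge1': 0.62, 'embedding_cosine': 0.86, 'coverage': 1.05}
```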

Hardware note

Latency numbers for Ollama (local) runs depend heavily on hardware (GPU VRAM, CPU cores, memory bandwidth). Cloud API latencies depend on network and provider load. Always re-run on your own machine before making latency-based decisions.


Metrics explained

Each metric captures a different facet of summarization quality. No single number tells the whole story — read them together.

ROUGE-1 / ROUGE-2 / ROUGE-L

What it measures: N-gram overlap between the generated summary and the reference.

  • ROUGE-1 — unigram (single word) recall and precision. Captures vocabulary overlap.
  • ROUGE-2 — bigram overlap. Captures phrase-level similarity; more sensitive to word order than ROUGE-1.
  • ROUGE-L — longest common subsequence. Rewards summaries that preserve the reference's sentence structure without requiring exact n-gram matches.

Why it is a good choice: ROUGE is the de facto standard for summarization evaluation in NLP research (Lin, 2004). It is fast, deterministic, and well understood. Its main limitation is that it measures only surface-level overlap: two sentences can mean the same thing with zero shared words, and ROUGE will score the pair at 0.

Interpretation: Higher is better. Typical ranges in our smoke set: 55-78% ROUGE-1, 20-55% ROUGE-2, 25-60% ROUGE-L. OpenAI scores highest due to silver-reference bias.
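
The core of ROUGE-1 can be sketched in a few lines. This is the bare unigram-overlap F1; library implementations (e.g. the rouge-score package we do not claim to use here) add stemming and tokenization rules on top:

```python
from collections import Counter

def rouge1_f1(reference: str, candidate: str) -> float:
    """Unigram-overlap F1 between reference and candidate (whitespace tokens)."""
    ref = Counter(reference.lower().split())
    cand = Counter(candidate.lower().split())
    overlap = sum((ref & cand).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

# One substituted word out of six: F1 = 5/6
print(rouge1_f1("the cat sat on the mat", "the cat lay on the mat"))
```

Note how a fully paraphrased sentence with no shared words scores exactly 0, which is the limitation described above.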

BLEU

What it measures: Modified n-gram precision (1- through 4-grams) with a brevity penalty. Originally designed for machine translation.

Why it is a good choice: BLEU complements ROUGE by being precision-oriented (ROUGE is recall-oriented). A summary that is very concise but accurate will score higher on BLEU than on ROUGE. The brevity penalty discourages overly short outputs.

Interpretation: Higher is better. Ranges: 9-50%. Scores below 15% are common for summaries that diverge stylistically from the reference but may still be good.
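
The brevity penalty mentioned above is the simplest part of BLEU to show concretely (Papineni et al., 2002). A sketch of just that term, independent of the n-gram precision computation:

```python
import math

def brevity_penalty(ref_len: int, cand_len: int) -> float:
    """BLEU brevity penalty: 1.0 when the candidate is at least as long as
    the reference, exp(1 - r/c) when it is shorter."""
    if cand_len >= ref_len:
        return 1.0
    if cand_len == 0:
        return 0.0
    return math.exp(1 - ref_len / cand_len)

print(brevity_penalty(100, 100))  # 1.0 — no penalty
print(brevity_penalty(100, 50))   # exp(-1) ≈ 0.368 — half-length candidate
```

The final BLEU score is this penalty multiplied by the geometric mean of the 1- through 4-gram precisions.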

Embedding cosine similarity

What it measures: Cosine similarity between sentence-transformer embeddings of the generated summary and the reference. Uses all-MiniLM-L6-v2 by default.

Why it is a good choice: This is the metric that captures semantic similarity beyond surface words. Two summaries that convey the same meaning using different vocabulary will score high on embedding similarity even if ROUGE/BLEU are low. It is the best single proxy for "do these summaries say the same thing?"

Interpretation: Higher is better. Ranges: 78-93%. Most providers cluster in the 82-88% band, which means they all capture the core meaning reasonably well. Differences below 2% are likely noise.
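
The similarity computation itself is plain cosine similarity over the two embedding vectors. A sketch with toy vectors (in the real pipeline the vectors would come from all-MiniLM-L6-v2, which produces 384-dimensional embeddings; the exact encoding call is not shown here):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # 1.0 — identical direction
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0 — orthogonal
```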

Coverage ratio

What it measures: Length of the generated summary divided by length of the reference summary (in tokens).

Why it is a good choice: Coverage reveals whether a provider is systematically verbose or terse relative to the reference. A coverage of 100% means the output is the same length as the reference. Values above ~110% suggest verbosity; below ~80% suggest the summary may be missing content.

Interpretation: Closer to 100% is better, but the "right" length depends on your use case. A provider at 120% coverage with high ROUGE may simply be more detailed. A provider at 190% coverage (like phi3:mini in our smoke) is likely repeating itself or including extraneous content.
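
Coverage is a simple ratio; a sketch that also applies the rough thresholds from the text (the ~110% / ~80% cutoffs are guidance, not hard rules):

```python
def coverage_ratio(candidate_tokens: int, reference_tokens: int) -> float:
    """Candidate length divided by reference length, as a percentage."""
    return 100.0 * candidate_tokens / reference_tokens

def flag_coverage(pct: float) -> str:
    """Heuristic labels using the thresholds described above."""
    if pct > 110:
        return "verbose"
    if pct < 80:
        return "possibly missing content"
    return "ok"

# A 190-token summary against a 100-token reference (the phi3:mini-like case):
print(flag_coverage(coverage_ratio(190, 100)))  # verbose
```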

WER (Word Error Rate)

What it measures: Word-level edit distance (insertions, deletions, substitutions) normalized by reference length. Borrowed from speech recognition evaluation.

Why it is a good choice: WER is a strict, order-sensitive metric. It penalizes reordering, paraphrasing, and length differences more harshly than ROUGE. This makes it useful as a divergence signal — high WER means the summary is structurally very different from the reference, even if the meaning is preserved.

Interpretation: Lower is better. Ranges: 60-175%. Values above 100% mean the edit distance exceeds the reference length (common when summaries are much longer or use very different vocabulary). WER should not be used alone — pair it with embedding similarity to distinguish "different but good" from "different and bad."
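
Word error rate is a word-level Levenshtein distance. A self-contained sketch of the standard dynamic-programming computation:

```python
def wer(reference: str, candidate: str) -> float:
    """Word-level edit distance divided by reference length.
    Can exceed 1.0 when the candidate is much longer than the reference."""
    ref, cand = reference.split(), candidate.split()
    # Levenshtein DP over words: prev[j] = distance(ref[:i-1], cand[:j]).
    prev = list(range(len(cand) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, c in enumerate(cand, 1):
            cost = 0 if r == c else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)

print(wer("the cat sat", "the cat sat"))       # 0.0 — exact match
print(wer("the cat sat", "a cat stood here"))  # 1.0 — 2 substitutions + 1 insertion
```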

Numbers retained

What it measures: Fraction of numeric values (dates, dollar amounts, percentages, counts) from the reference that also appear in the generated summary.

Why it is a good choice: Factual fidelity is critical for podcast summaries that mention statistics, dates, or financial figures. ROUGE and embeddings can miss factual errors because a single wrong digit changes meaning but barely affects n-gram overlap.

Interpretation: Higher is better. This metric is newer and may be null in older runs where it was not yet implemented.
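
A sketch of the idea behind this metric, using an illustrative regex for numeric tokens (the production extractor's exact pattern and normalization rules may differ):

```python
import re

# Integers, decimals, percentages, and dollar amounts — an assumed pattern.
NUM_RE = re.compile(r"\$?\d+(?:[.,]\d+)*%?")

def numbers_retained(reference: str, candidate: str) -> float:
    """Fraction of numeric tokens in the reference that also appear in the
    candidate. Returns 1.0 when the reference contains no numbers."""
    ref_nums = set(NUM_RE.findall(reference))
    if not ref_nums:
        return 1.0
    cand_nums = set(NUM_RE.findall(candidate))
    return len(ref_nums & cand_nums) / len(ref_nums)

print(numbers_retained("Revenue grew 12% to $3.4 million in 2024",
                       "In 2024 revenue rose 12% to $3.4 million"))  # 1.0
```

Note that exact-match comparison is deliberately strict: "3.4" vs "3.5" counts as a lost number, which is precisely the kind of error ROUGE and embeddings gloss over.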


How to interpret results

  1. Start with embedding similarity — it is the best single indicator of semantic quality. If a provider scores above 85%, it is capturing the core meaning well.
  2. Check ROUGE-L for structural fidelity — does the summary follow a similar flow to the reference?
  3. Look at coverage — is the provider too verbose or too terse?
  4. Use WER as a warning signal — very high WER (>110%) combined with low embedding similarity suggests poor quality. High WER with high embedding similarity just means the provider has a different writing style.
  5. Remember the silver bias — OpenAI tends to score highest because the reference was generated by GPT-4o. Compare non-OpenAI providers against each other for a fairer picture.
  6. Latency is hardware-dependent — cloud API latencies are comparable across runs; Ollama latencies are not (they depend on your GPU/CPU).
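
Steps 1 and 4 of the checklist above can be sketched as a small triage function. The thresholds (85% similarity, 110% WER) come from the guidance in this section and are heuristics, not hard rules; the label strings are illustrative:

```python
def triage(embedding_sim: float, wer_pct: float) -> str:
    """Combine embedding similarity (0-1) and WER (%) into a rough verdict."""
    if embedding_sim >= 0.85:
        # High WER here just means a different writing style.
        return "different style, semantically sound" if wer_pct > 110 else "close match"
    return "poor quality" if wer_pct > 110 else "borderline, inspect manually"

print(triage(0.88, 150))  # different style, semantically sound
print(triage(0.78, 160))  # poor quality
```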

Report library

Each report below is a self-contained snapshot of one evaluation campaign. Reports are never modified after creation — new measurements get a new report file.

| Report | Date | Dataset | Reference | Providers | Highlights |
| --- | --- | --- | --- | --- | --- |
| Smoke v1 (2026-03) | Mar 2026 | curated_5feeds_smoke_v1 | silver_gpt4o_smoke_v1 | 6 cloud + 11 Ollama | First full provider sweep; Gemini leads cloud non-OpenAI; Mistral Small 3.2 / Qwen 2.5:32b tie for Ollama lead |

How to add a new report

  1. Run the experiment(s):
make experiment-run CONFIG=data/eval/configs/<your_config>.yaml \
     REFERENCE=silver_gpt4o_smoke_v1
  2. Create a new report file in this directory:
docs/guides/eval-reports/EVAL_<CAMPAIGN>_<YYYY_MM>.md

Use an existing report as a template. Include: run configs used, full metrics tables, observations, and any methodology changes.

  3. Update the table above — add a row to the "Report library" table in this file.

  4. Update the main guide — if conclusions changed (e.g., a new provider leads), update the takeaways in the AI Provider Comparison Guide.