
AI Provider Comparison Guide

Authoritative v2 reference: eval-reports/EVAL_HELDOUT_V2_2026_04.md, covering 6 cloud APIs + 11 Ollama local models and 100+ held-out cells under the v2 framework (RFC-073), compound-scored on quality × latency × cost. v1 benchmark numbers later in this guide are superseded by the v2 report above.


⭐ Use these two. Everything else is a tradeoff

After evaluating 20 model variants across 100+ held-out cells, two picks cover ~95% of real podcast_scraper deployments. Pick one based on whether you need cloud or local.

🌐 Cloud / dev / prod / corpus building → gemini-2.5-flash-lite non-bundled

backend:
  type: "gemini"
  model: "gemini-2.5-flash-lite"
prompts:
  system: "shared/summarization/system_bullets_v1"   # for bullets
  user: "shared/summarization/bullets_json_v1"
  # (or gemini/summarization/{system_v1,long_v1} for paragraph)
params:
  temperature: 0.0

Use this for dev, test, corpus building, and most production traffic.

  • Quality: 0.564 bullets / 0.479 paragraph held-out (bullets within 4% of the matrix best, 0.586)
  • Latency: 1.5s per episode, 2-7× faster than any alternative
  • Cost: ~$0.00047 per episode, or ~$0.47 per 1,000 episodes
  • Why non-bundled: bundled mode costs 5-12% quality on Gemini and OpenAI for no real gain. Just make two API calls (one for bullets, one for paragraph); that is still only 3s and $0.00094 total per episode.

Upgrade to deepseek-chat non-bundled only if quality matters more than throughput (0.586/0.541 vs 0.564/0.479, but ~7× slower at 10s/ep). Switch to claude-haiku-4-5 bundled only if you specifically need title+summary+bullets in a single API call (bundled is viable on Anthropic; it's the only cloud provider where that's true).

🏠 Local / offline / privacy → qwen3.5:9b bundled

backend:
  type: "ollama"
  model: "qwen3.5:9b"
llm_pipeline_mode: "bundled"
params:
  temperature: 0.0

Use this for any fully-local, offline, or privacy-constrained deployment.

  • Quality: 0.529 bullets / 0.509 paragraph held-out (free, matches mid-tier cloud)
  • Latency: ~44s per episode for BOTH outputs (vs ~72s for two separate non-bundled calls)
  • Cost: $0 (your hardware)
  • Why bundled: non-bundled paragraph output is contested on 5 of 11 Ollama models; bundled mode's JSON schema stabilises structure and makes paragraph output reliable. A single call returns title + summary + bullets.

Upgrade to qwen3.5:9b non-bundled for bullets only if you need the absolute maximum bullets quality locally (0.580 vs 0.529, +10%). Fall back to llama3.2:3b non-bundled bullets if latency matters more than quality (12s vs 33s, quality drops to 0.501).


LLM pipeline modes (llm_pipeline_mode)

Four end-to-end pipelines produce the {title, summary, bullets} artifact plus insights / topics / entities that feed the knowledge graph:

| Mode | API calls | Best for | Tier-1 providers |
| --- | --- | --- | --- |
| staged (default) | 3–4 (summary + GIL + KG/NER) | Local/Ollama; when bullets/summary quality must be tuned independently | All |
| bundled | 1 summary + 2 extraction | Compact cloud runs where summary+bullets fit one call | Anthropic, Ollama |
| extraction_bundled | 1 summary + 1 extraction | Balanced cloud: summary stays in a provider-tuned prompt, extraction collapses into one call | All cloud providers |
| mega_bundled | 1 | Quality-first cloud, tier-1 providers only: single call returns everything | Anthropic (tier 1), DeepSeek (tier 2) |

Real-episode validation (2026-04-21, #646)

All 6 cloud providers were tested end-to-end via scripts/validate/validate_phase3c.py on two real investor-podcast transcripts (7.8 KB short ~8 min, 47 KB medium ~45 min). Every provider passes the call-count / prefilled-propagation / artifact-count gates in mega_bundled mode, so the #632 "tier-1/tier-2 only" claim did not hold on real production traffic.

Medium transcript (~45 min, representative of production audio):

| Provider | Model | Calls | Cost $ | Time | Ins / Top / Ent | Notes |
| --- | --- | --- | --- | --- | --- | --- |
| Gemini | flash-lite | 1 | 0.00127 | 17.8 s | 12 / 10 / 15 | Cost/latency winner; default for cloud_balanced. Occasional 503 UNAVAILABLE (retried). |
| Mistral | small-latest | 1 | 0.00201 | 8.6 s | 12 / 10 / 14 | Fastest of all providers. |
| OpenAI | gpt-4o-mini | 1 | 0.00153 | 23.9 s | 12 / 10 / 15 | Reliable mid-cost. |
| DeepSeek | deepseek-chat | 1 | 0.00223 | 56.2 s | 12 / 10 / 15 | Slowest; needs deepseek_timeout ≥ 300 (default 600 s in #646). |
| Anthropic | haiku-4.5 | 1 | 0.01530 | 16.3 s | 12 / 10 / 15 | Most specific entity names ("Salomon Brothers", "Duke University"); default for cloud_quality. |
| Grok | grok-3-mini | 1 | 0.03346 | 43.4 s | 12 / 10 / 13 | 25× more expensive than Gemini with no quality edge. Not recommended for volume. |

Baseline comparison on Gemini (medium transcript): staged = 3 calls, $0.00445, 13.5 s; bundled = 3 calls, $0.00438, 11.2 s; extraction_bundled = 2 calls, $0.00265, 9.7 s; mega_bundled = 1 call, $0.00127, 17.8 s. Mega-bundled is about 71 % cheaper than staged (1 - 0.00127 / 0.00445) for the same artifact counts, although it is also the slowest of the four modes.
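The trade-off in the baseline comparison can be recomputed directly. The sketch below copies the four Gemini figures quoted above into a dictionary and derives each mode's cost reduction relative to staged; the names `MODES` and `savings_vs_staged` are illustrative and not part of podcast_scraper.

```python
# Figures copied from the Gemini baseline comparison (medium transcript).
MODES = {
    "staged": {"calls": 3, "cost_usd": 0.00445, "seconds": 13.5},
    "bundled": {"calls": 3, "cost_usd": 0.00438, "seconds": 11.2},
    "extraction_bundled": {"calls": 2, "cost_usd": 0.00265, "seconds": 9.7},
    "mega_bundled": {"calls": 1, "cost_usd": 0.00127, "seconds": 17.8},
}


def savings_vs_staged(mode: str) -> float:
    """Fractional cost reduction of `mode` relative to the staged pipeline."""
    baseline = MODES["staged"]["cost_usd"]
    return 1.0 - MODES[mode]["cost_usd"] / baseline


if __name__ == "__main__":
    for name, row in MODES.items():
        print(
            f"{name:<20} {row['calls']} call(s)  ${row['cost_usd']:.5f}  "
            f"{row['seconds']:>5.1f} s  {savings_vs_staged(name):6.1%} cheaper"
        )
```

Read alongside the latency column, the numbers show that the cheapest mode is not the fastest one on this transcript, so the choice depends on whether cost or wall time matters more.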

Profile defaults after validation:

  • cloud_balanced → Gemini flash-lite mega_bundled (cheapest, same quality as extraction_bundled, simpler pipeline).
  • cloud_quality → Anthropic haiku-4.5 mega_bundled (entity F1 = 1.000 per #632, most specific entity names, 15 s latency).

Quality caveat: entity F1 numbers from #632 (Anthropic 1.000 vs DeepSeek 0.88) were on fixture transcripts. Real-episode artifact counts are now comparable across all 6 providers; differences are subtler (name specificity, edge-case entity capture). Re-score with the eval harness if entity fidelity is critical.

Set the mode in your profile:

llm_pipeline_mode: "mega_bundled"        # or: "extraction_bundled", "bundled", "staged"
cloud_llm_structured_min_output_tokens: 4096  # floor for JSON responses (#645)

Full reference: v2 matrix details

The two picks above cover most deployments. Everything below is the detailed reference for edge cases and the data that backs those recommendations.

All 6 cloud LLM providers evaluated under the v2 framework: dev/held-out split, fraction-based judge contestation, champion prompts ported across providers. Held-out ROUGE-L vs Sonnet 4.6 silver on curated_5feeds_benchmark_v2 (5 unseen episodes, ~32 min each):

| Track | OpenAI | Anthropic | Gemini | Mistral | DeepSeek | Grok |
| --- | --- | --- | --- | --- | --- | --- |
| Bullets non-bundled | 39.6% | 40.7% | 40.1% | 37.3% | 43.1% | 38.6% |
| Bullets bundled | 33.2% | 39.3% | 28.5% | 30.4% | 34.4% | 30.5% |
| Paragraph non-bundled | 31.7% | 36.4% | 29.3% | 27.8% | 40.7% | 31.3% |
| Paragraph bundled | 29.5% | 39.2% | 26.6% | 30.3% | 35.9% | 28.9% |

Cell winners:

  • Non-bundled (either track): DeepSeek (deepseek-chat). Cheapest cloud option that also leads on both quality metrics, which makes it the clear sweet spot.
  • Bundled (either track): Anthropic (claude-haiku-4-5). Only provider where bundled is competitive with non-bundled; bundled paragraph even beats non-bundled paragraph.

Local (Ollama): v2 held-out, 11 models evaluated → Core 5 standardized

ADR-077 (2026-04-18): Standardized on 5 models for regular sweeps and pipeline validation. One per family + one large-scale reference: qwen3.5:9b (champion), llama3.2:3b (speed), mistral:7b (mid-tier), gemma2:9b (diversity), qwen3.5:35b (scale reference). Dropped 6 models that were same-family duplicates or structurally unsuitable. See ADR-077 for rationale.

Top 6 local bullets non-bundled (held-out):

| Rank | Model | Size | Bullets final |
| --- | --- | --- | --- |
| 1 | qwen3.5:9b | 9B | 0.580 |
| 2 | qwen3.5:35b | 35B | 0.576 |
| 3 | qwen3.5:27b | 27B | 0.543 |
| 4 | mistral-small3.2 | ~22B | 0.536 |
| 5 | mistral:7b | 7B | 0.526 |
| 6 | llama3.1:8b | 8B | 0.518 |

Cross-matrix ranking (bullets non-bundled held-out):

| Rank | Provider+model | Final |
| --- | --- | --- |
| 1 | DeepSeek (cloud) | 0.586 |
| 2 | Ollama qwen3.5:9b (local, free) | 0.580 |
| 3 | Ollama qwen3.5:35b | 0.576 |
| 4 | Anthropic haiku-4.5 | 0.570 |
| 5 | OpenAI gpt-4o | 0.566 |

Paragraph: use bundled on local (big finding)

| Model | Non-bundled paragraph | Bundled paragraph |
| --- | --- | --- |
| qwen3.5:9b | 0.505 (2/5 contested) | 0.509 (uncontested) |
| qwen3.5:35b | 0.325 (contested → ROUGE-only) | 0.492 |
| mistral-small3.2 | 0.288 (contested → ROUGE-only) | 0.468 |

Non-bundled paragraph contests on 3 of 4 local models (long transcripts → inconsistent structure → judge disagreement). Bundled paragraph is never contested, because the JSON schema stabilises output. Bundled is the correct local-deployment choice for paragraph.

Local picks:

  • Bullets (either mode) → qwen3.5:9b non-bundled (0.580).
  • Paragraph → qwen3.5:9b bundled (0.509). Same single call produces both.
  • Single call, title+summary+bullets → qwen3.5:9b bundled. Loses ~9% on bullets vs non-bundled but gains reliability and cost efficiency.
  • No-Ollama-daemon alternative → DISLab/SummLlama3.2-3B via HF transformers directly. v2 held-out paragraph 0.485 (uncontested 0/5, dev 0.442), bullets 0.416 (uncontested 0/5, dev 0.467). Runs via run_summllama_v2.py (HF transformers + MPS / CUDA). DPO-tuned on faithfulness/completeness/conciseness: the same Llama-3.2-3B base that scores only 0.270 paragraph standalone lifts to 0.485 with alignment. Paragraph-strong, bullets-weaker (DPO was prose-shaped, not list-shaped). Latency 60-156s/ep on Apple MPS, slower than Ollama but operationally simpler (no daemon, one Python process). Pick this for paragraph-first deployments or when Ollama can't be run. For bullet-heavy workloads, qwen3.5:9b bundled stays the better local pick. See Held-out v2 report §6a.

Default picks by use case (compound-scored across quality, latency, cost; see Held-out v2 report §Compound analysis):

| Priority | Best pick | Why |
| --- | --- | --- |
| Balanced default | Gemini 2.5-flash-lite non-bundled | 0.564 / 0.479, 1.5s, ~$0.00047/ep. New 2026-04-16; strict upgrade over 2.0-flash. |
| Quality first | DeepSeek non-bundled | 0.586 (#1), 10s, $0.0005/ep. Top quality at near-bottom cost. |
| Single-call bundled | Anthropic haiku-4.5 bundled | 0.552 bullets / 0.548 paragraph, 7s, $0.006/ep. Only provider where bundled is competitive. |
| Throughput / real-time | Gemini 2.5-flash-lite | 1.5s/ep, fastest in the matrix. 2× faster than Gemini 2.0-flash at same quality tier. |
| OpenAI-ecosystem (cost-sensitive) | gpt-4o-mini | 0.540 / 0.469, 6.6s, ~$0.00074/ep. 16× cheaper than gpt-4o for 4% quality hit. |
| Privacy / offline | Ollama qwen3.5:9b | 0.580 (local #1), 33s/ep, free. Matches DeepSeek quality offline. |

Avoid / dominated (unless ecosystem-locked):

  • OpenAI gpt-4o: Anthropic haiku-4.5 is better on quality AND ~3× cheaper.
  • Grok: slower than Anthropic/Gemini without compensating quality.
  • OpenAI bundled, Gemini bundled, Mistral non-bundled paragraph, local paragraph on long transcripts: structural weak spots visible in the matrix.
  • Ollama qwen3.5:27b / qwen3.5:35b: larger but not better than qwen3.5:9b.

See v2 eval report for blended scores, dev numbers, generalisation analysis, and provider-specific quirks.

Full pipeline validation (2026-04-18, PR #603)

All providers were tested through the complete pipeline (summary → GI → KG → bridge) on 5 held-out episodes. make pipeline-validate verifies each stage produces valid output.

| Provider | Summary | GI | Grounding | KG | Bridge |
| --- | --- | --- | --- | --- | --- |
| openai/gpt-4o-mini | ✅ | ✅ | 100% | ✅ | ✅ |
| gemini/flash-lite | ✅ | ✅ | 98% | ✅ | ✅ |
| anthropic/haiku-4.5 | ✅ | ✅ | 100% | ✅ | ✅ |
| deepseek/deepseek-chat | ✅ | ✅ | 100% | ✅ | ✅ |
| mistral/mistral-small | ✅ | ✅ | 100% | ✅ | ✅ |
| grok/grok-3-mini | ✅ | ✅ | 100% | ✅ | ✅ |
| ollama/qwen3.5:9b | ✅ | ✅ | 100% | ✅ | ✅ |
| ollama/llama3.1:8b | ✅ | ✅ | ✅ | ✅ | ✅ |
| ollama/mistral:7b | ✅ | ✅ | 98% | ✅ | ✅ |
| ollama/gemma2:9b | ✅ | ⚠️ | 95% | ✅ | ✅ |
| ollama/qwen3.5:35b | ✅ | ✅ | 98% | ✅ | ✅ |

All 11 providers pass the full pipeline. gemma2:9b is borderline on GI insight count (7.8/ep vs 8 threshold, instruction-following gap). Minimum viable model size for full pipeline: 7-8B (3B models fail on KG entity extraction). See ADR-077.


Implementation Status

All providers below are implemented and acceptance-tested (v2.4.0+).

| Provider | Status | RFC | Notes |
| --- | --- | --- | --- |
| Local ML | Implemented | - | Default provider (Whisper + spaCy + Transformers): transcription, speaker detection, summarization |
| Hybrid ML | Implemented | RFC-042 | Summarization only: MAP (LongT5) + REDUCE (transformers / Ollama / llama_cpp) |
| OpenAI | Implemented | RFC-013 | Transcription + summarization (Whisper API + GPT API) |
| Gemini | Implemented | RFC-035 | Transcription + summarization (no speaker detection) |
| Mistral | Implemented | RFC-033 | Summarization only (EU data residency) |
| Anthropic | Implemented | RFC-032 | Summarization only (no transcription or speaker detection) |
| DeepSeek | Implemented | RFC-034 | Summarization only; ultra low-cost |
| Grok | Implemented | RFC-036 | Summarization only; real-time information access |
| Ollama | Implemented | RFC-037 | Transcription, speaker detection, summarization; local self-hosted LLMs, zero cost, complete privacy |

For hybrid_ml (MAP-REDUCE) configuration and REDUCE backends (Ollama, llama_cpp, transformers), see ML Provider Reference and Configuration API. Transcript cleaning: hybrid_ml honors transcript_cleaning_strategy like API providers; with pattern, use hybrid_internal_preprocessing_after_pattern (CLI: --hybrid-internal-preprocessing-after-pattern) to control internal MAP preprocessing after workflow cleaning (RFC-042). LLM / hybrid LLM stage: long transcripts can hit the length-ratio guard, which discards the LLM output when the pattern-cleaned input is ≥2000 chars and the output is below 20% of that length; see CONFIGURATION, "LLM cleaning length guard".
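The length-ratio guard described above amounts to a two-part condition. The following is a minimal sketch of that rule only, using the two thresholds stated in the text; the function and constant names are hypothetical, and the real check in podcast_scraper may differ in detail (for example in how it counts characters).

```python
MIN_INPUT_CHARS = 2000   # guard applies only at or above this input length
MIN_OUTPUT_PERCENT = 20  # output below this share of the input is discarded


def keep_llm_cleaned_output(pattern_cleaned: str, llm_output: str) -> bool:
    """Return True when the LLM-cleaned text should be kept.

    Hypothetical helper: short inputs always pass; for longer inputs the
    output must be at least 20% of the pattern-cleaned length.
    """
    if len(pattern_cleaned) < MIN_INPUT_CHARS:
        return True
    # Integer arithmetic avoids floating-point edge cases at exactly 20%.
    return len(llm_output) * 100 >= MIN_OUTPUT_PERCENT * len(pattern_cleaned)


if __name__ == "__main__":
    transcript = "word " * 1000            # 5000 characters
    print(keep_llm_cleaned_output(transcript, "too short"))
    print(keep_llm_cleaned_output(transcript, transcript[:1500]))
```

An output that fails the guard usually means the model summarised instead of cleaning, which is why long transcripts are the ones most likely to trip it.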


Key Statistics at a Glance

┌─────────────────────────────────────────────────────────────────────────────┐
│                         PROVIDER LANDSCAPE OVERVIEW                         │
├─────────────────────────────────────────────────────────────────────────────┤
│  9 Summarization Options  │  (Hybrid = MAP+REDUCE)   │  3 Full-Stack Ready  │
│  ═══════════════════════  │  ═════════════════════   │  ══════════════════  │
│  Local ML                 │  Hybrid ML (RFC-042)     │  Local ML            │
│  Hybrid ML                │  MAP + Ollama/llama_cpp  │  OpenAI (tx+sum)     │
│  OpenAI                   │  or transformers REDUCE  │  Ollama              │
│  Gemini                   │                          │                      │
│  Mistral                  │                          │                      │
│  Anthropic / DeepSeek     │                          │                      │
│  Grok / Ollama            │                          │                      │
├─────────────────────────────────────────────────────────────────────────────┤
│                      COST SPECTRUM (per 100 episodes)                       │
│                                                                             │
│  $0 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ $37  │
│  │                                                                     │    │
│  ▼                                                                     ▼    │
│  Local/Ollama                                                OpenAI (full)  │
│  ($0)                                                                ($37)  │
│                                                                             │
│  DeepSeek ─── Grok ─── Anthropic ─── Gemini ─── OpenAI (text) ─── OpenAI    │
│  ($0.02)     ($0.03)    ($0.40)      ($0.95)       ($0.55)        ($37)     │
└─────────────────────────────────────────────────────────────────────────────┘

Quick Decision Matrix

Updated picks (v2 data, 2026-04-16): Default local-LLM is now qwen3.5:9b (not llama3.2:3b). Hybrid ML is demoted from default recommendation to narrow niche: v2 showed standalone qwen3.5:9b (0.509 paragraph held-out) beats hybrid bart+qwen3.5:9b (0.448) for our workload.

| If you need... | Choose | Why |
| --- | --- | --- |
| Complete Privacy | Local ML / Ollama | Data never leaves your device |
| Lowest Cost | Local ML / Ollama | $0 (just electricity) |
| Air-gapped (no Ollama) | Local ML (ml_bart_led_autoresearch_v1) | v2 held-out 0.206 (weak but zero-deps floor) |
| Air-gapped + Ollama | Ollama (qwen3.5:9b bundled) | v2 held-out 0.509 paragraph / 0.529 bullets; the recommended local pick |
| Highest Quality (cloud) | DeepSeek (non-bundled) | v2 held-out 0.586 bullets / 0.541 paragraph; top of matrix |
| Fastest Cloud | Gemini 2.5-flash-lite | 1.5s/ep; fastest in v2 matrix |
| On-prem, quality first | Ollama qwen3.5:9b bundled | Best local quality (0.509 paragraph, uncontested) |
| On-prem, speed first | Ollama (llama3.2:3b) | 12s/ep, quality floor 0.501 bullets |
| On-prem, bullets-only max quality | Ollama (qwen3.5:9b non-bundled) | 0.580 held-out (2nd overall, beats most cloud) |
| On-prem, no Ollama daemon, paragraph-first | HF transformers (DISLab/SummLlama3.2-3B) | 0.485 held-out paragraph / 0.416 bullets; DPO-tuned 3B via HF transformers on MPS/CUDA directly; operationally simpler than Ollama. Paragraph-strong, bullets-weaker (DPO was prose-shaped). |
| Full Capabilities | OpenAI / Local ML | All 3 capabilities (transcription + speaker + summary) |
| Hybrid MAP-REDUCE | Hybrid ML (Ollama/llama_cpp) | Retained as niche option (RFC-042); not recommended as default (see v2 findings) |
| Real-Time Info | Grok | Real-time information access (RFC-036) |
| Lowest Cloud Cost | Gemini 2.5-flash-lite / DeepSeek | ~$0.0005/ep both; comparable |
| EU Data Residency | Mistral | European servers (RFC-033) |
| Huge Context | Gemini | 2 million token window (RFC-035) |
| Free Development | Gemini / Grok | Generous free tiers (RFC-035, RFC-036) |
| Self-Hosted | Ollama | Offline/air-gapped (RFC-037) |

Screenplay formatting vs transcription (GitHub #562)

| transcription_provider | screenplay: true in transcript body |
| --- | --- |
| whisper (local) | Yes: gap-based speaker labels via MLProvider |
| openai (whisper-1 API) | No: plain text transcript; segment JSON may still exist |
| gemini / mistral | No: plain text in our integration (provider wiring here) |

Validation coerces a truthy screenplay to false when the transcription provider is not whisper. It logs a single INFO message per run: a process-level gate suppresses repeats and resets when each run_pipeline finishes, so the next Config build can log again. Details: CONFIGURATION.md, "Screenplay vs transcription".
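As an illustration of the coercion rule, the sketch below models the behaviour with a module-level flag standing in for the process gate. All names here (`coerce_screenplay`, `reset_gate`, the logger name) are hypothetical; the actual validation lives in the project's Config and may be structured differently.

```python
import logging

logger = logging.getLogger("screenplay_sketch")  # hypothetical logger name

_info_logged = False  # hypothetical stand-in for the process-level gate


def coerce_screenplay(transcription_provider: str, screenplay: bool) -> bool:
    """Return the effective screenplay flag for the given provider."""
    global _info_logged
    if screenplay and transcription_provider != "whisper":
        if not _info_logged:
            logger.info(
                "screenplay=true ignored: provider %r returns plain text",
                transcription_provider,
            )
            _info_logged = True  # suppress repeats until the gate is reset
        return False
    return bool(screenplay)


def reset_gate() -> None:
    """Re-arm the INFO message, as happens when a pipeline run finishes."""
    global _info_logged
    _info_logged = False
```

Only local whisper keeps `screenplay: true`; every other provider in the table above ends up with plain text regardless of the configured value.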


Empirical Highlights

All claims below are backed by measured data. For the full metrics tables, methodology, and metric definitions, see the Evaluation Reports.

Note on the silver reference: Results were re-measured in April 2026 against silver_sonnet46_benchmark_v1 (Claude Sonnet 4.6, 10-episode benchmark scale). Rankings shifted significantly from March 2026; see why the rankings changed. The March 2026 numbers (vs GPT-4o silver) are preserved in the March report for reference.

Full quality ladder β€” all four tiers

Every summarization option in one view, ordered by tier and then by ROUGE-L. All numbers are benchmark-scale (10 eps, curated_5feeds_benchmark_v1 vs silver_sonnet46_benchmark_v1).

| Tier | Mode | ROUGE-L | Embed | Lat/ep | Dependencies |
| --- | --- | --- | --- | --- | --- |
| 1: ML Dev | ml_small_authority | 19.1% | 70.0% | fast | None (CI safe) |
| 2: ML Prod | ml_bart_led_autoresearch_v1 | 20.5% | 68.2% | 26s | None (air-gap safe) |
| Hybrid | ml_hybrid_bart_llama32_3b_autoresearch_v1 | 21.1% | 76.6% | 15s | Ollama (3B only) |
| 3: LLM Local (small) | llama3.2:3b direct | 24.4% | 78.6% | 8.5s | Ollama |
| 3: LLM Local (large) | qwen3.5:35b direct | 31.9% | 81.5% | 21s | Ollama |
| 4: LLM Cloud (mid) | Gemini 2.0 Flash | 28.7% | 82.5% | 2.7s | API key |
| 4: LLM Cloud (mid) | DeepSeek | 29.5% | 83.6% | 8.9s | API key |
| 4: LLM Cloud (best) | Anthropic Haiku 4.5 | 33.7% | 86.2% | 5.0s | API key |

Key observations:

  • The jump from ML-prod (20.5%) to direct-LLM (24.4%) is ~4 ROUGE-L points. The hybrid (21.1%) closes only ~0.6 of those points at benchmark scale, less than the smoke run suggested (23.7%). The gap exists because temperature=0.5 sampling variance averages down over 10 episodes. The hybrid is still valuable for long transcripts (BART MAP chunks arbitrary-length input).
  • qwen3.5:35b (31.9%) is the only on-prem model in the cloud quality range: it exceeds Gemini, OpenAI, and Grok.
  • The hybrid is the right choice when: transcripts exceed LLM context windows, Ollama is available but only a small model fits in VRAM, or quality must improve over ML-prod without paying for cloud.

Cloud providers: paragraphs (vs Sonnet 4.6 silver, April 2026)

Best cloud provider: Anthropic (claude-haiku-4-5), which leads on ROUGE-L and embedding similarity across both smoke (5 eps) and benchmark (10 eps) runs. Gemini (gemini-2.0-flash) remains the fastest cloud option. Numbers below are benchmark (10-episode, more stable).

| Provider | ROUGE-L | Embed | Latency |
| --- | --- | --- | --- |
| Anthropic | 33.7% | 86.2% | 5.0s |
| DeepSeek | 29.5% | 83.6% | 8.9s |
| Gemini | 28.7% | 82.5% | 2.7s |
| Mistral | 28.0% | 82.3% | 4.6s |
| OpenAI | 26.8% | 84.1% | 8.5s |
| Grok | 26.7% | 81.7% | 7.5s |

The Anthropic model used here is claude-haiku-4-5 (smallest/fastest Haiku). The silver reference is Claude Sonnet 4.6, so Anthropic scores well partly because the models share a generation family.

Full table: Benchmark v1 report, Cloud providers

Local Ollama: paragraphs (vs Sonnet 4.6 silver, April 2026)

Best local model: Qwen 3.5:35b at 31.9% ROUGE-L (21s/ep), the only on-prem model above the cloud median and competitive with Gemini and Mistral API. Mistral Small 3.2 (28.1%, 89s/ep) is a strong mid-tier option. llama3.2:3b (24.4%, 8.5s) is the best fast/low-resource choice. Numbers below are benchmark scale (10 eps).

| Model | ROUGE-L | Embed | Latency |
| --- | --- | --- | --- |
| qwen3.5:35b | 31.9% | 81.5% | 21s |
| mistral-small3.2 | 28.1% | 81.4% | 89s |
| qwen2.5:32b | 24.6% | 80.7% | 78s |
| qwen3.5:9b | 25.7% | 78.0% | 226s† |
| llama3.2:3b | 24.4% | 78.6% | 8.5s |

Latencies are hardware-dependent (Apple M-series). Re-run on your machine. †qwen3.5:9b and qwen3.5:27b showed CPU-offload latency anomalies in the benchmark run.

Local Ollama: bullets (vs Sonnet 4.6 bullets silver, April 2026)

For bullet JSON output, qwen3.5:35b leads at benchmark scale (36.2% ROUGE-L, 14s/ep). llama3.2:3b is the fastest option (33.6% ROUGE-L, 5.2s/ep). qwen2.5:7b does not reliably follow the JSON format; avoid it for the bullets track.

| Model | ROUGE-L | Embed | Latency |
| --- | --- | --- | --- |
| qwen3.5:35b | 36.2% | 87.3% | 14.1s |
| qwen3.5:27b | 35.2% | 88.4% | 63.2s |
| mistral-small3.2 | 34.2% | 84.3% | 39.9s |
| llama3.2:3b | 33.6% | 82.9% | 5.2s |
| qwen3.5:9b | 32.6% | 83.5% | 16.7s |

Full tables: Benchmark v1 report (April 2026)


Detailed Cost Analysis

Per 100 Episodes: Complete Breakdown

| Provider | Transcription | Speaker | Summary | Total | vs OpenAI |
| --- | --- | --- | --- | --- | --- |
| Local ML | $0 | $0 | $0 | $0 | -100% |
| Ollama | N/A | $0 | $0 | $0 | -100% |
| DeepSeek | N/A | N/A | $0.016 | $0.016 | -97% |
| Grok (beta) | N/A | N/A | $0.00 | $0.00 | -100% |
| Mistral (Small) | N/A | N/A | $0.11 | $0.11 | -80% |
| Anthropic (Haiku) | N/A | N/A | $0.40 | $0.40 | -27% |
| Gemini (Flash) | $0.90 | N/A | $0.05 | $0.95 | +73% |
| OpenAI (Nano) | $36.00 | N/A | $0.28 | $36.28 | baseline |
| OpenAI (Mini) | $36.00 | N/A | $1.40 | $37.40 | +3% |
| Mistral (Large) | N/A | N/A | $9.00 | $9.00 | -75% |

Monthly Cost Projections

Monthly costs at different scales
═══════════════════════════════════════════════════════════════════════════

                    100 ep/month        1,000 ep/month      10,000 ep/month
                    ────────────        ──────────────      ───────────────
Local ML            $0                  $0                  $0
DeepSeek            $0.02               $0.16               $1.60
Grok                $0.03               $0.26               $2.60
Anthropic           $0.40               $4.00               $40.00
OpenAI (text only)  $0.55               $5.50               $55.00
OpenAI (full)       $37.40              $374.00             $3,740.00

  At 10,000 episodes/month, OpenAI full stack costs $3,740!
    Using local transcription + DeepSeek: $1.60 (99.96% savings)

Key insight: Transcription dominates cloud costs (90%+). Use local Whisper + cloud text processing to save massively.
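The projections above scale linearly with episode count, so they are easy to reproduce for other volumes. This sketch copies the per-100-episode figures from the cost tables in this section; the helper names are illustrative, and DeepSeek uses the unrounded $0.016 figure, which the projection table shows rounded to $0.02 at 100 episodes.

```python
# Per-100-episode totals copied from the cost tables in this section.
USD_PER_100_EPISODES = {
    "Local ML": 0.00,
    "DeepSeek": 0.016,
    "Anthropic": 0.40,
    "OpenAI (text only)": 0.55,
    "OpenAI (full)": 37.40,
}
OPENAI_TRANSCRIPTION_PER_100 = 36.00  # transcription part of the OpenAI (Mini) row


def monthly_cost(provider: str, episodes: int) -> float:
    """Linear projection of monthly spend in USD."""
    return USD_PER_100_EPISODES[provider] * episodes / 100


def savings(cheaper: str, baseline: str, episodes: int) -> float:
    """Fraction saved by choosing `cheaper` over `baseline`."""
    return 1.0 - monthly_cost(cheaper, episodes) / monthly_cost(baseline, episodes)


# Share of the OpenAI full-stack bill that is transcription.
transcription_share = OPENAI_TRANSCRIPTION_PER_100 / USD_PER_100_EPISODES["OpenAI (full)"]

if __name__ == "__main__":
    for n in (100, 1_000, 10_000):
        row = "  ".join(
            f"{name}: ${monthly_cost(name, n):,.2f}" for name in USD_PER_100_EPISODES
        )
        print(f"{n:>6} ep/month  {row}")
    print(f"transcription share of OpenAI full stack: {transcription_share:.1%}")
    print(f"DeepSeek vs OpenAI full at 10,000 ep: {savings('DeepSeek', 'OpenAI (full)', 10_000):.2%} saved")
```

The savings figure assumes transcription runs locally at no API cost, which is the scenario the key insight describes.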


Decision Flowchart

                            START
                              │
                              ▼
                     ┌─────────────────┐
                     │  What's your    │
                     │  TOP priority?  │
                     └────────┬────────┘
                              │
         ┌────────────────────┼────────────────────┐
         │                    │                    │
         ▼                    ▼                    ▼
    ┌─────────┐          ┌─────────┐          ┌─────────┐
    │ PRIVACY │          │  COST   │          │ QUALITY │
    └────┬────┘          └────┬────┘          └────┬────┘
         │                    │                    │
         ▼                    ▼                    ▼
Need transcription?  Need transcription?    Budget matters?
         │                    │                    │
    ┌────┴────┐          ┌────┴────┐          ┌────┴────┐
    │Yes      │No        │Yes      │No        │Yes      │No
    ▼         ▼          ▼         ▼          ▼         ▼
┌───────┐ ┌───────┐  ┌───────┐ ┌───────┐  ┌───────┐ ┌───────┐
│ LOCAL │ │OLLAMA │  │ LOCAL │ │ DEEP  │  │ GPT-5 │ │ GPT-5 │
│  ML   │ │       │  │Whisper│ │ SEEK  │  │ Mini  │ │       │
│       │ │       │  │   +   │ │       │  │       │ │       │
│       │ │       │  │DeepSk │ │       │  │       │ │       │
└───────┘ └───────┘  └───────┘ └───────┘  └───────┘ └───────┘

         ┌────────────────────┼────────────────────┐
         │                    │                    │
         ▼                    ▼                    ▼
    ┌─────────┐          ┌─────────┐          ┌─────────┐
    │  SPEED  │          │ CONTEXT │          │   EU    │
    └────┬────┘          └────┬────┘          └────┬────┘
         │                    │                    │
         ▼                    ▼                    ▼
    ┌─────────┐          ┌─────────┐          ┌─────────┐
    │  GROK   │          │ GEMINI  │          │ MISTRAL │
    │         │          │   Pro   │          │         │
    │Real-Time│          │   2M    │          │  Full   │
    │ faster  │          │ tokens  │          │  Stack  │
    └─────────┘          └─────────┘          └─────────┘

Autoresearch-derived defaults (2026-04)

These are the research-backed settings used across config/profiles/ (main profiles + config/profiles/freeze/*.yaml per-provider captures). Source data in docs/guides/eval-reports/ and data/eval/runs/ner_*_smoke_v1/.

Since 2026-04, the defaults are split across named presets so a single edit updates every deployment profile that references them (GitHub #634):

  • Audio preprocessing: config/profiles/audio/speech_optimal_v1.yaml, referenced by all 5 deployment profiles via audio_preprocessing_profile: speech_optimal_v1. See the file's own comment block for rationale and data; see config/profiles/audio/README.md for the pattern (both live under config/, outside the MkDocs tree).
  • Text cleaning (ML-only): ml_preprocessing_profile on Config (e.g. cleaning_v4), overrides the mode_cfg.preprocessing_profile default in the ML summary registry. Cloud LLM and Ollama providers send raw transcripts and ignore this field.

NER / speaker detection

Smoke eval (5 episodes, 15 gold entities, data/eval/runs/ner_*_smoke_v1/):

| Backend | F1 | Notes |
| --- | --- | --- |
| spaCy trf (en_core_web_trf) | 1.000 | Free, local, deterministic |
| spaCy sm (en_core_web_sm) | 0.966 | Faster, slightly lower recall |
| OpenAI, Anthropic, Gemini, Mistral, DeepSeek, Grok | 1.000 | All six cloud LLMs tie spaCy trf |
| Ollama qwen3.5:9b | 0.750 | Misses 6/15 entities; weakest tested |
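To see how a miss count turns into the F1 values above, here is a small worked calculation. It assumes the backends produce no false positives, which the smoke eval notes do not state; under that assumption, 6 misses out of 15 gold entities gives exactly 0.750, and a single miss gives 0.966. The function name is illustrative.

```python
def f1_score(gold: int, predicted: int, correct: int) -> float:
    """F1 from entity counts: gold entities, predicted entities, correct matches."""
    if gold == 0 or predicted == 0 or correct == 0:
        return 0.0
    precision = correct / predicted
    recall = correct / gold
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Assumption: every predicted entity is correct (no false positives).
    print(round(f1_score(gold=15, predicted=9, correct=9), 3))    # 6 misses
    print(round(f1_score(gold=15, predicted=14, correct=14), 3))  # 1 miss
    print(round(f1_score(gold=15, predicted=15, correct=15), 3))  # no misses
```

With only 15 gold entities, one extra miss moves F1 by roughly three points, so small differences in this table should be read with care.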

Default for main profiles: speaker_detector_provider: spacy + ner_model: en_core_web_trf. Ties every cloud LLM on the available data while saving the API call. Post-reingestion validation on a larger/harder corpus is tracked in POST_REINGESTION_PLAN.md Step 6.

Exception: per-provider freeze profiles (config/profiles/freeze/<name>.yaml) intentionally route NER through <provider> to exercise each provider's full surface in profile/cost capture runs.

All 6 cloud providers + Ollama have detect_speakers wired (see src/podcast_scraper/providers/*/...). Earlier "summarization only" notes in this guide are obsolete.

Grounded Insights (GI)

  • gi_insight_source: provider (not summary_bullets) → +10pp insight coverage vs silver. Used by cloud_balanced, cloud_quality, local. airgapped keeps summary_bullets because SummLlama3.2-3B is summary-only, not chat.
  • gi_max_insights: 12 (autoresearch sweet spot; the default is 20, which is long-tail and hurts precision).
  • gi_require_grounding: true (default): drops insights without grounded quote evidence.

Knowledge Graph (KG)

  • kg_extraction_source: provider → +37pp topic coverage vs silver (measured on KG pipeline, not summary-bullet proxy).
  • kg_max_topics: 10 (autoresearch sweet spot; the default is 20).
  • kg_max_entities: 15 (matches the default; standardised across profiles).
  • KG v3 prompt (noun-phrase enforcement, promoted in #625 / PR #628): already the default.

Insight clustering

  • Default threshold 0.75 (src/podcast_scraper/search/insight_clusters.py). Configured at call site, not via profile YAML.

Pipeline mode

  • llm_pipeline_mode: bundled (only for local qwen3.5:9b): the bundled schema stabilises paragraph output for Ollama (+structure reliability).
  • llm_pipeline_mode: staged (default): for Gemini, OpenAI, Anthropic (except Haiku bundled edge case), DeepSeek, Mistral, Grok. On Gemini and OpenAI, bundled costs 5-12% quality for no real gain.

Transcription

  • whisper_model: small.en (9.5% WER sweet spot, 148s/ep on CPU). Use this whenever the provider doesn't offer its own transcription.
  • whisper_model: base.en (40s/ep, ~13% WER). Faster but noticeably worse quality; reserved for dev profile.
  • Providers with native transcription: OpenAI (whisper-1), Gemini (gemini-2.5-flash-lite), Mistral (voxtral-mini-latest). For these, per-provider capture profiles use the provider's own transcription path.
  • Anthropic, DeepSeek, Grok, Ollama have no transcription and fall back to local Whisper (small.en).
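The routing described in the bullets above can be summarised as a lookup with a local fallback. This is a sketch of that decision only; the dictionary and function names are hypothetical and do not reflect how podcast_scraper wires its providers internally.

```python
# Providers with native transcription and the model each capture profile uses,
# as listed above.
NATIVE_TRANSCRIPTION = {
    "openai": "whisper-1",
    "gemini": "gemini-2.5-flash-lite",
    "mistral": "voxtral-mini-latest",
}
LOCAL_FALLBACK = ("whisper", "small.en")  # local Whisper for everyone else


def transcription_route(provider: str) -> tuple[str, str]:
    """Return (transcription_provider, model) for a summarization provider."""
    key = provider.lower()
    model = NATIVE_TRANSCRIPTION.get(key)
    if model is None:
        return LOCAL_FALLBACK
    return (key, model)


if __name__ == "__main__":
    for name in ("openai", "gemini", "mistral", "anthropic", "deepseek", "grok", "ollama"):
        print(f"{name:<10} -> {transcription_route(name)}")
```

Native availability does not imply suitability: the head-to-head results further down matter when deciding whether to use a provider's own transcription path.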

Transcription head-to-head (#577 Exp 3, 2026-04-20, 5 NPR eps ~30 min each)

Audio fixed at 32 kbps / 16 kHz / mono (Exp 1 winner). Reference transcripts from curated_5feeds_benchmark_v2. Local Whisper small.en baseline WER ~11%.

| Provider / Model | WER (avg) | $/ep (32-min) | Wall | Notes |
| --- | --- | --- | --- | --- |
| openai/whisper-1 | 8.2% | $0.20 | 68s | No caps; hard-coded anti-loop filters make it the most stable cloud option |
| mistral/voxtral-mini-latest | 8.6% (clean) | $0.034 | 21s | 6× cheaper than whisper-1 but 1/5 eps hallucinated (109K-char loop); no native anti-loop filters. Requires temperature=0.0 (applied) + output-length sanity check (applied) + pre-chunk to <25 min. New model voxtral-mini-transcribe-2 (Feb 2026) ships with diarization/timestamps and is expected to replace voxtral-mini-latest for batch transcription; upgrade tracked separately. |
| openai/gpt-4o-transcribe | - | - | - | Hard 1400s (23 min) duration cap. All 5 eps failed. Needs chunking (see #286). |
| openai/gpt-4o-mini-transcribe | - | - | - | Token budget cap (narrower than the 1400s cap). All 5 eps failed. Needs chunking. |
| gemini/gemini-2.5-flash-lite | 72–931% | $0.01 | 16–121s | Not suitable for verbatim transcription. LLM-based audio without anti-loop filters. At the default max_output_tokens=8192 it silently truncates → summary-style output (72% WER). At a raised cap, it runs into the hallucination loop (931% WER). Use gemini-2.5-pro or -flash with thinking for better results if Gemini is required; otherwise prefer Whisper/Voxtral. |
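
The output-length sanity check mentioned for Voxtral could look roughly like the sketch below. The function names and both thresholds are assumptions for illustration, not the values used in podcast_scraper; the idea is only that a 109K-char transcript for a ~30-minute episode is far outside any plausible speaking rate, and that a summary-style truncation is far below it.

```python
from typing import Callable


def transcript_length_ok(
    transcript: str,
    audio_seconds: float,
    min_chars_per_min: float = 300.0,   # assumed floor: catches summary-style truncation
    max_chars_per_min: float = 1500.0,  # assumed ceiling: catches runaway loops
) -> bool:
    """Return True when the transcript length is plausible for the audio duration."""
    if audio_seconds <= 0:
        return False
    chars_per_min = len(transcript) / (audio_seconds / 60.0)
    return min_chars_per_min <= chars_per_min <= max_chars_per_min


def transcribe_with_fallback(
    primary: Callable[[str], str],
    fallback: Callable[[str], str],
    audio_path: str,
    audio_seconds: float,
) -> str:
    """Use the primary transcriber, falling back when its output fails the length check."""
    text = primary(audio_path)
    if transcript_length_ok(text, audio_seconds):
        return text
    return fallback(audio_path)
```

Here `primary` and `fallback` stand in for whatever callables wrap the Voxtral and whisper-1 clients. The check is deliberately cheap, so it can run on every episode before the transcript is accepted.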

Cost lever hierarchy for Whisper API path

  1. File size vs duration - API cost is duration-based. Bitrate affects file size (upload speed, 25 MB cap) but not per-minute cost. Lower bitrate is still worth it for smaller files (Exp 1: 32 kbps is the sweet spot).
  2. Silence trim - direct duration cut. On tightly-edited benchmark fixtures the filter removed 0% (fixtures have no silences > 1 s at any threshold). On production NPR audio (#577 Exp 2-prod, 2026-04-20, 2 × 32-min episodes): -50 dB / 2.0 s (the previous default) trims 0%; -30 dB / 0.5 s trims 3.6% with < 1% transcript char drop; -25 dB / 0.5 s trims 6.7% with a similar char drop (held pending validation on more diverse podcast types). speech_optimal_v1.yaml now uses -30 dB / 0.5 s as the moderate sweet spot; across all 5 deployment profiles that's a direct 3.6% API $ saving on top of whatever provider they pick.
  3. Cheaper model - voxtral-mini-latest is 6× cheaper at similar clean-run quality (pending hallucination mitigations). gpt-4o-mini-transcribe is 50% cheaper than whisper-1 but blocked on chunking.
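
The first two levers reduce to simple arithmetic, sketched below. The helper names are hypothetical, the per-minute rate is derived from the $0.20 per 32-min figure in the head-to-head table rather than taken from a price list, and constant-bitrate audio is assumed for the size estimate.

```python
def file_size_mb(bitrate_kbps: float, duration_min: float) -> float:
    """Approximate encoded size in MB (1 MB = 1,000,000 bytes) for constant-bitrate audio."""
    return bitrate_kbps * 1000 / 8 * duration_min * 60 / 1_000_000


def api_cost_usd(duration_min: float, usd_per_min: float, trim_fraction: float = 0.0) -> float:
    """Duration-based API cost after removing a fraction of the audio as silence."""
    return duration_min * (1.0 - trim_fraction) * usd_per_min


WHISPER_1_USD_PER_MIN = 0.20 / 32  # derived from the $0.20 per 32-min episode figure

size = file_size_mb(32, 32)
base = api_cost_usd(32, WHISPER_1_USD_PER_MIN)
trimmed = api_cost_usd(32, WHISPER_1_USD_PER_MIN, trim_fraction=0.036)
print(f"{size:.2f} MB, ${base:.4f} untrimmed, ${trimmed:.4f} trimmed")
```

For a 32-minute episode at 32 kbps this gives about 7.68 MB, well under the 25 MB cap, and the 3.6% trim lowers the whisper-1 cost from $0.2000 to $0.1928. Bitrate appears only in the size formula, which is why it does not move the per-minute cost.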

Cross-provider caps and constraints

| Model | File size | Duration | Token budget | Notes |
| --- | --- | --- | --- | --- |
| openai/whisper-1 | 25 MB | - | - | Most permissive OpenAI option |
| openai/gpt-4o-transcribe | - | 1400s | - | Hard duration cap |
| openai/gpt-4o-mini-transcribe | - | - | yes (tighter than 1400s) | Needs chunking for any real podcast ep |
| mistral/voxtral-mini-latest | n/a | 30 min (documented ceiling) | - | Exceeding 30 min triggers hallucination loops |
| gemini/gemini-2.5-flash-lite | inline ~20 MB (Files API ≥ 20 MB) | - | 8192 output default, 65536 cap | Silent summary-truncation at default cap |
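
For the duration-capped models, chunking can be planned up front. The sketch below is a hypothetical planner, not the chunking tracked in #286: it splits an episode into the fewest equal-length chunks that fit under a cap. A real implementation would likely cut at silences instead of fixed offsets.

```python
import math


def plan_chunks(total_seconds: float, max_chunk_seconds: float) -> list[tuple[float, float]]:
    """Split [0, total_seconds] into the fewest equal chunks no longer than the cap."""
    if total_seconds <= 0 or max_chunk_seconds <= 0:
        raise ValueError("durations must be positive")
    total_seconds = float(total_seconds)
    n = math.ceil(total_seconds / max_chunk_seconds)
    size = total_seconds / n
    chunks = []
    for i in range(n):
        start = i * size
        # Pin the last boundary to the exact total to avoid float drift.
        end = total_seconds if i == n - 1 else (i + 1) * size
        chunks.append((start, end))
    return chunks


print(plan_chunks(32 * 60, 1400))
```

A 32-minute episode against the 1400s cap yields two 960-second chunks. The same call with a 25-minute cap (1500s) gives the same split, so a single plan covers both the gpt-4o-transcribe limit and the Voxtral pre-chunk target.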

Local vs API breakeven: when to pick which (#577 Exp 4)

Inputs (measured, 32-min NPR episode, 32 kbps input):

| Path | Wall per ep | $ per ep | Throughput (parallel) |
| --- | --- | --- | --- |
| Local Whisper small.en on MPS | ~100s | $0.00 | 1 ep at a time (MPS shared) |
| Cloud whisper-1 | ~68s | $0.20 | 50+ concurrent (tier-1 rate limit) |
| Cloud voxtral-mini-latest (clean runs) | ~21s | $0.034 | 10+ concurrent |

Decision rules:

  • Volume < 100 eps/day AND you're not time-sensitive → local wins. Zero cost; the 100s/ep on MPS is fine to run overnight. Going local saves at most $20/day versus whisper-1 and avoids the API complexity and hallucination risk.
  • Volume 100–1,000 eps/day OR you need results inside an hour → cloud API wins on wall-clock time. Pick whisper-1 for reliability (no chunking needed, no hallucination surprises). Expect $20–$200/day.
  • Volume > 1,000 eps/day OR cost-sensitive at scale → voxtral-mini with hallucination mitigations applied (temperature=0.0, pre-chunk to < 25 min, post-hoc length sanity check with fallback to whisper-1). 6× cheaper than whisper-1; expected ~$35/1,000 eps net after fallback overhead.
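
The decision rules above can be encoded as a small function. This is a sketch with hypothetical names, and it makes one assumption the rules leave open: "cost-sensitive at scale" is read here as cost sensitivity at 100 or more episodes per day.

```python
def pick_transcription_path(
    eps_per_day: int,
    need_within_hour: bool = False,
    cost_sensitive: bool = False,
) -> str:
    """Map daily volume and constraints to a transcription path (hypothetical helper)."""
    # Assumption: "at scale" means at least 100 eps/day.
    if eps_per_day > 1000 or (cost_sensitive and eps_per_day >= 100):
        return "voxtral-mini (with mitigations)"
    if eps_per_day >= 100 or need_within_hour:
        return "whisper-1"
    return "local whisper small.en"


for volume in (50, 500, 5000):
    print(volume, pick_transcription_path(volume))
```

The order of the checks matters: the high-volume rule is tested first so that a cost-sensitive mid-volume deployment lands on Voxtral instead of whisper-1.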

Pathological case: a single ~3,000-word episode as fast as possible. voxtral-mini-latest wins wall-time (~21s). whisper-1 is the conservative alternative (~68s, predictable). Both beat local Whisper (~100s sequential).

Break-even table (rough, 32-min NPR avg episode):

| Daily volume | Local (1× MPS) | whisper-1 (50 parallel) | voxtral-mini (10 parallel + fallback) |
| --- | --- | --- | --- |
| 10 eps | 17 min, $0 | 68s, $2 | 21s, $0.40 |
| 100 eps | 2.8 hours, $0 | 2.3 min, $20 | 3.5 min, $4 |
| 1,000 eps | 28 hours, $0 | 23 min, $200 | 35 min, $40 |
| 10,000 eps | 11.5 days, $0 | 3.8 hours, $2,000 | 5.8 hours, $400 |

Rule of thumb: once you need throughput above 1 ep/min sustained, local on a single Mac is out. Cloud API cost becomes the binding constraint, which is why voxtral-mini with mitigations is the cost-optimal long-horizon pick.
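
The break-even figures follow from a simple wave model: episodes run in batches of `parallel`, each batch takes one per-episode wall time, and cost scales linearly with volume. The sketch below assumes that model. The names are hypothetical, and the $0.04 voxtral-mini cost is the per-episode figure implied by the table (fallback overhead included), not the $0.034 clean-run price.

```python
import math

# Per-episode wall time, cost, and concurrency from the measured inputs above.
PATHS = {
    "local": {"wall_s": 100, "usd": 0.00, "parallel": 1},
    "whisper-1": {"wall_s": 68, "usd": 0.20, "parallel": 50},
    "voxtral-mini": {"wall_s": 21, "usd": 0.04, "parallel": 10},
}


def batch_estimate(n_eps: int, wall_s: float, usd: float, parallel: int) -> tuple[float, float]:
    """Return (total wall seconds, total USD) for n_eps processed in parallel waves."""
    waves = math.ceil(n_eps / parallel)
    return waves * wall_s, n_eps * usd


for name, p in PATHS.items():
    wall, cost = batch_estimate(1_000, **p)
    print(f"{name}: {wall / 60:.0f} min, ${cost:.0f}")
```

For 1,000 episodes this prints roughly 1667 min for local (about 28 hours), 23 min and $200 for whisper-1, and 35 min and $40 for voxtral-mini. Because a wave can never finish faster than one episode, any batch that fits in a single wave costs the full per-episode wall time.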


Configuration 1: Ultra-Budget ($0.016/100 episodes)

# 97% cheaper than OpenAI. DeepSeek has detect_speakers wired but NER smoke
# has spaCy trf tied at F1=1.000, so local spaCy is cheaper and equivalent.
transcription_provider: whisper             # Free (local; DeepSeek no transcription API)
whisper_model: small.en                     # Production quality
speaker_detector_provider: spacy            # Free, local; F1=1.000 smoke
ner_model: en_core_web_trf
summary_provider: deepseek                  # $0.016/100, leads v2 bullets
deepseek_summary_model: deepseek-chat
deepseek_api_key: ${DEEPSEEK_API_KEY}

Configuration 2: Quality-First (~$42/100 episodes)

# Maximum quality via OpenAI (transcription + summarization).
transcription_provider: openai
openai_transcription_model: whisper-1
speaker_detector_provider: spacy            # F1 tied; skip the API call
ner_model: en_core_web_trf
summary_provider: openai
openai_summary_model: gpt-4o-mini           # v2 eval model; gpt-4o for top quality
openai_api_key: ${OPENAI_API_KEY}

Configuration 3: Privacy-First ($0)

# Data never leaves your device.
transcription_provider: whisper             # Local
whisper_model: small.en                     # 9.5% WER sweet spot
speaker_detector_provider: spacy            # Local spaCy trf > qwen3.5:9b NER (F1 1.0 vs 0.75)
ner_model: en_core_web_trf
summary_provider: ollama                    # Local Ollama
ollama_summary_model: qwen3.5:9b            # v2 local champion
llm_pipeline_mode: bundled                  # Stabilises Ollama JSON output

Configuration 4: Speed-First (~$0.25/100 episodes)

# Fast cloud summarization via Grok.
transcription_provider: whisper             # Local (Grok no transcription API)
whisper_model: small.en
speaker_detector_provider: spacy            # Free, tied F1 on smoke
ner_model: en_core_web_trf
summary_provider: grok
grok_summary_model: grok-3-mini             # Eval-validated; grok-2/grok-beta are stale IDs
grok_api_key: ${GROK_API_KEY}

Configuration 5: EU Compliant (Mistral)

# European data residency end-to-end (Mistral has its own transcription).
transcription_provider: mistral
mistral_transcription_model: voxtral-mini-latest
speaker_detector_provider: spacy            # Free, tied F1 on smoke
ner_model: en_core_web_trf
summary_provider: mistral
mistral_summary_model: mistral-large-latest
mistral_api_key: ${MISTRAL_API_KEY}

Configuration 6: Free Development (~$0)

# Maximize free tiers. Gemini free tier covers transcription + summary.
transcription_provider: gemini              # Gemini has audio input
gemini_transcription_model: gemini-2.5-flash-lite
speaker_detector_provider: spacy            # Free, tied F1 on smoke
ner_model: en_core_web_trf
summary_provider: gemini
gemini_summary_model: gemini-2.5-flash-lite
gemini_api_key: ${GEMINI_API_KEY}

Summary: v2 held-out key takeaways

See eval-reports/EVAL_HELDOUT_V2_2026_04.md for the full matrix. Headline findings:

| Axis | Winner | Score / note |
| --- | --- | --- |
| Cheapest cloud | DeepSeek | $0.016/100 eps, leads bullets non-bundled (43.1%) |
| Compound-scored cloud default | Gemini 2.5-flash-lite | Non-bundled: 0.564/0.479 bullets/para, 1.5s/ep, $0.47/1k eps |
| Best cloud bundled | Anthropic claude-haiku-4-5 | Only provider where bundled is competitive (39.3% bullets, 39.2% para) |
| Best local | qwen3.5:9b bundled | 0.529/0.509 bullets/para, ~44s/ep, $0 |
| EU residency | Mistral | End-to-end: voxtral-mini-latest transcription + mistral-large-latest |
| Real-time info | Grok | X/Twitter integration |
| Local bullets leader (non-bundled) | qwen3.5:9b | 0.580; beats qwen3.5:35b (0.576) at 1/4 the size |

Cost insight: transcription is 90%+ of cloud pipeline cost. Local Whisper small.en + cloud summarization is the high-leverage combination. Per-provider cost numbers are in EVAL_HELDOUT_V2; older v1 numbers in this guide are superseded.
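
A quick check of the cost split, using per-episode figures quoted elsewhere in this guide: $0.20 (whisper-1) or $0.034 (voxtral-mini-latest) for transcription, and $0.00094 for two gemini-2.5-flash-lite summarization calls. The helper name is hypothetical, and other provider pairings will give different shares.

```python
def transcription_share(transcription_usd: float, summary_usd: float) -> float:
    """Fraction of per-episode cloud cost that goes to transcription."""
    total = transcription_usd + summary_usd
    return transcription_usd / total if total > 0 else 0.0


SUMMARY_USD = 0.00094  # two gemini-2.5-flash-lite calls (bullets + paragraph) per episode

for name, usd in [("whisper-1", 0.20), ("voxtral-mini-latest", 0.034)]:
    print(f"{name}: {transcription_share(usd, SUMMARY_USD):.1%}")
```

This prints 99.5% for whisper-1 and 97.3% for voxtral-mini-latest, which is why moving transcription to local Whisper removes nearly the whole cloud bill for this pairing.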

Rankings change when the silver reference changes: Sonnet 4.6 silver favours verbose paragraph style; different silvers may produce different orderings. See eval methodology for detail.