AI Provider Comparison Guide¶
Authoritative v2 reference:
eval-reports/EVAL_HELDOUT_V2_2026_04.mdβ 6 cloud APIs + 11 Ollama local models, 100+ held-out cells under the v2 framework (RFC-073), compound-scored on quality Γ latency Γ cost. v1 benchmark numbers later in this guide are superseded by the v2 report above.
β Use these two. Everything else is a tradeoff¶
After evaluating 20 model variants across 100+ held-out cells, two picks cover ~95% of real podcast_scraper deployments. Pick one based on whether you need cloud or local.
π Cloud / dev / prod / corpus building β gemini-2.5-flash-lite non-bundled¶
backend:
type: "gemini"
model: "gemini-2.5-flash-lite"
prompts:
system: "shared/summarization/system_bullets_v1" # for bullets
user: "shared/summarization/bullets_json_v1"
# (or gemini/summarization/{system_v1,long_v1} for paragraph)
params:
temperature: 0.0
Use this for dev, test, corpus building, and most production traffic.
- Quality: 0.564 bullets / 0.479 paragraph held-out (within 4% of the absolute best)
- Latency: 1.5s per episode β 2-7Γ faster than any alternative
- Cost: ~$0.00047 per episode β ~$0.47 per 1,000 episodes
- Why non-bundled: bundled mode costs 5-12% quality on Gemini and OpenAI for no real gain. Just make two API calls (one for bullets, one for paragraph) β still only 3s and $0.00094 total per episode.
Upgrade to deepseek-chat non-bundled only if quality matters more than throughput (0.586/0.541 vs 0.564/0.479, but ~7Γ slower at 10s/ep). Switch to claude-haiku-4-5 bundled only if you specifically need title+summary+bullets in a single API call (bundled is viable on Anthropic; it's the only cloud provider where that's true).
π Local / offline / privacy β qwen3.5:9b bundled¶
backend:
type: "ollama"
model: "qwen3.5:9b"
llm_pipeline_mode: "bundled"
params:
temperature: 0.0
Use this for any fully-local, offline, or privacy-constrained deployment.
- Quality: 0.529 bullets / 0.509 paragraph held-out (free, matches mid-tier cloud)
- Latency: ~44s per episode for BOTH outputs (vs ~72s for two separate non-bundled calls)
- Cost: $0 (your hardware)
- Why bundled: local paragraph contests on 5 of 11 Ollama models non-bundled; bundled mode's JSON schema stabilises structure and makes paragraph output reliable. Single call returns title + summary + bullets.
Upgrade to qwen3.5:9b non-bundled for bullets only if you need absolute max bullets quality locally (0.580 vs 0.529, +10%). Fallback to llama3.2:3b non-bundled bullets if latency matters more than quality (12s vs 33s, quality drops to 0.501).
LLM pipeline modes (llm_pipeline_mode)¶
Four end-to-end pipelines produce the {title, summary, bullets} artifact plus
insights / topics / entities that feed the knowledge graph:
| Mode | API calls | Best for | Tier-1 providers |
|---|---|---|---|
staged (default) |
3β4 (summary + GIL + KG/NER) | Local/Ollama; when bullets/summary quality must be tuned independently | All |
bundled |
1 summary + 2 extraction | Compact cloud runs where summary+bullets fit one call | Anthropic, Ollama |
extraction_bundled |
1 summary + 1 extraction | Balanced cloud β summary stays in a provider-tuned prompt, extraction collapses into one call | All cloud providers |
mega_bundled |
1 | Quality-first cloud, tier-1 providers only β single call returns everything | Anthropic (tier 1), DeepSeek (tier 2) |
Real-episode validation (2026-04-21, #646)¶
All 6 cloud providers tested end-to-end via scripts/validate/validate_phase3c.py
on two real investor-podcast transcripts (7.8 KB short ~8 min, 47 KB medium ~45 min).
Every provider passes the call-count / prefilled-propagation / artifact-count gates
in mega_bundled mode β the #632 "tier-1/tier-2 only" claim did not hold on
real production traffic.
Medium transcript (~45 min, representative of production audio):
| Provider | Model | Calls | Cost $ | Time | Ins / Top / Ent | Notes |
|---|---|---|---|---|---|---|
| Gemini | flash-lite | 1 | 0.00127 | 17.8 s | 12 / 10 / 15 | Cost/latency winner; default for cloud_balanced. Occasional 503 UNAVAILABLE (retried). |
| Mistral | small-latest | 1 | 0.00201 | 8.6 s | 12 / 10 / 14 | Fastest of all providers. |
| OpenAI | gpt-4o-mini | 1 | 0.00153 | 23.9 s | 12 / 10 / 15 | Reliable mid-cost. |
| DeepSeek | deepseek-chat | 1 | 0.00223 | 56.2 s | 12 / 10 / 15 | Slowest; needs deepseek_timeout β₯ 300 (default 600 s in #646). |
| Anthropic | haiku-4.5 | 1 | 0.01530 | 16.3 s | 12 / 10 / 15 | Most specific entity names ("Salomon Brothers", "Duke University"); default for cloud_quality. |
| Grok | grok-3-mini | 1 | 0.03346 | 43.4 s | 12 / 10 / 13 | 25Γ more expensive than Gemini with no quality edge. Not recommended for volume. |
Baseline comparison on Gemini (medium transcript): staged = 3 calls, $0.00445, 13.5 s; bundled = 3 calls, $0.00438, 11.2 s; extraction_bundled = 2 calls, $0.00265, 9.7 s; mega_bundled = 1 call, $0.00127, 17.8 s. Mega-bundled = 72 % cheaper than staged for the same artifact counts.
Profile defaults after validation:
cloud_balancedβ Gemini flash-lite mega_bundled (cheapest, same quality as extraction_bundled, simpler pipeline).cloud_qualityβ Anthropic haiku-4.5 mega_bundled (entity F1 = 1.000 per #632, most specific entity names, 15 s latency).
Quality caveat: entity F1 numbers from #632 (Anthropic 1.000 vs DeepSeek 0.88) were on fixture transcripts. Real-episode artifact counts are now comparable across all 6 providers; differences are subtler (name specificity, edge-case entity capture). Re-score with the eval harness if entity fidelity is critical.
Set the mode in your profile:
llm_pipeline_mode: "mega_bundled" # or: "extraction_bundled", "bundled", "staged"
cloud_llm_structured_min_output_tokens: 4096 # floor for JSON responses (#645)
Full reference β v2 matrix details¶
The two picks above cover most deployments. Everything below is the detailed reference for edge cases and the data that backs those recommendations.
All 6 cloud LLM providers evaluated under the v2 framework: dev/held-out split, fraction-based
judge contestation, champion prompts ported across providers. Held-out ROUGE-L vs Sonnet 4.6
silver on curated_5feeds_benchmark_v2 (5 unseen episodes, ~32 min each):
| Track | OpenAI | Anthropic | Gemini | Mistral | DeepSeek | Grok |
|---|---|---|---|---|---|---|
| Bullets non-bundled | 39.6% | 40.7% | 40.1% | 37.3% | 43.1% | 38.6% |
| Bullets bundled | 33.2% | 39.3% | 28.5% | 30.4% | 34.4% | 30.5% |
| Paragraph non-bundled | 31.7% | 36.4% | 29.3% | 27.8% | 40.7% | 31.3% |
| Paragraph bundled | 29.5% | 39.2% | 26.6% | 30.3% | 35.9% | 28.9% |
Cell winners:
- Non-bundled (either track): DeepSeek (
deepseek-chat). Cheapest cloud non-local option that also leads on both quality metrics β the clear sweet spot. - Bundled (either track): Anthropic (
claude-haiku-4-5). Only provider where bundled is competitive with non-bundled; bundled paragraph even beats non-bundled paragraph.
Local (Ollama) β v2 held-out, 11 models evaluated β Core 5 standardized¶
ADR-077 (2026-04-18): Standardized on 5 models for regular sweeps and pipeline validation. One per family + one large-scale reference: qwen3.5:9b (champion), llama3.2:3b (speed), mistral:7b (mid-tier), gemma2:9b (diversity), qwen3.5:35b (scale reference). Dropped 6 models that were same-family duplicates or structurally unsuitable. See ADR-077 for rationale.
Top 6 local bullets non-bundled (held-out):
| Rank | Model | Size | Bullets final |
|---|---|---|---|
| 1 | qwen3.5:9b | 9B | 0.580 |
| 2 | qwen3.5:35b | 35B | 0.576 |
| 3 | qwen3.5:27b | 27B | 0.543 |
| 4 | mistral-small3.2 | ~22B | 0.536 |
| 5 | mistral:7b | 7B | 0.526 |
| 6 | llama3.1:8b | 8B | 0.518 |
Cross-matrix ranking (bullets non-bundled held-out):
| Rank | Provider+model | Final |
|---|---|---|
| 1 | DeepSeek (cloud) | 0.586 |
| 2 | Ollama qwen3.5:9b (local, free) | 0.580 |
| 3 | Ollama qwen3.5:35b | 0.576 |
| 4 | Anthropic haiku-4.5 | 0.570 |
| 5 | OpenAI gpt-4o | 0.566 |
Paragraph β use bundled on local (big finding):
| Model | Non-bundled paragraph | Bundled paragraph |
|---|---|---|
| qwen3.5:9b | 0.505 (2/5 contested) | 0.509 (uncontested) |
| qwen3.5:35b | 0.325 (contested β ROUGE-only) | 0.492 |
| mistral-small3.2 | 0.288 (contested β ROUGE-only) | 0.468 |
Non-bundled paragraph contests on 3 of 4 local models (long transcripts β inconsistent structure β judge disagreement). Bundled paragraph doesn't contest at all β the JSON schema stabilises output. Bundled is the correct local-deployment choice for paragraph.
Local picks:
- Bullets (either mode) β
qwen3.5:9bnon-bundled (0.580). - Paragraph β
qwen3.5:9bbundled (0.509). Same single call produces both. - Single call, title+summary+bullets β
qwen3.5:9bbundled. Loses ~9% on bullets vs non-bundled but gains reliability and cost efficiency. - No-Ollama-daemon alternative β
DISLab/SummLlama3.2-3Bvia HF transformers directly. v2 held-out paragraph 0.485 (uncontested 0/5, dev 0.442), bullets 0.416 (uncontested 0/5, dev 0.467). Runs viarun_summllama_v2.py(HFtransformers+ MPS / CUDA). DPO-tuned on faithfulness/completeness/conciseness β the same Llama-3.2-3B base that scores only 0.270 paragraph standalone lifts to 0.485 with alignment. Paragraph-strong, bullets- weaker (DPO was prose-shaped, not list-shaped). Latency 60-156s/ep on Apple MPS, slower than Ollama but operationally simpler (no daemon, one Python process). Pick this for paragraph-first deployments or when Ollama can't be run. For bullet-heavy workloads, qwen3.5:9b bundled stays the better local pick. See Held-out v2 report Β§6a.
Default picks by use case (compound-scored across quality, latency, cost β see Held-out v2 report Β§Compound analysis):
| Priority | Best pick | Why |
|---|---|---|
| Balanced default | Gemini 2.5-flash-lite non-bundled | 0.564 / 0.479, 1.5s, ~$0.00047/ep. New 2026-04-16 β strict upgrade over 2.0-flash. |
| Quality first | DeepSeek non-bundled | 0.586 (#1), 10s, $0.0005/ep. Top quality at near-bottom cost. |
| Single-call bundled | Anthropic haiku-4.5 bundled | 0.552 bullets / 0.548 paragraph, 7s, $0.006/ep. Only provider where bundled is competitive. |
| Throughput / real-time | Gemini 2.5-flash-lite | 1.5s/ep β fastest in the matrix. 2Γ faster than Gemini 2.0-flash at same quality tier. |
| OpenAI-ecosystem (cost-sensitive) | gpt-4o-mini | 0.540 / 0.469, 6.6s, ~$0.00074/ep. 16Γ cheaper than gpt-4o for 4% quality hit. |
| Privacy / offline | Ollama qwen3.5:9b | 0.580 (local #1), 33s/ep, free. Matches DeepSeek quality offline. |
Avoid / dominated (unless ecosystem-locked):
- OpenAI gpt-4o: Anthropic haiku-4.5 is better on quality AND ~3Γ cheaper.
- Grok: slower than Anthropic/Gemini without compensating quality.
- OpenAI bundled, Gemini bundled, Mistral non-bundled paragraph, local paragraph on long transcripts: structural weak spots visible in the matrix.
- Ollama qwen3.5:27b / qwen3.5:35b: larger but not better than qwen3.5:9b.
See v2 eval report for blended scores, dev numbers, generalisation analysis, and provider-specific quirks.
Full pipeline validation (2026-04-18, PR #603)¶
All providers tested through the complete pipeline (summary β GI β KG β bridge) on 5 held-out episodes.
make pipeline-validateverifies each stage produces valid output.
| Provider | Summary | GI | Grounding | KG | Bridge |
|---|---|---|---|---|---|
| openai/gpt-4o-mini | β | β | 100% | β | β |
| gemini/flash-lite | β | β | 98% | β | β |
| anthropic/haiku-4.5 | β | β | 100% | β | β |
| deepseek/deepseek-chat | β | β | 100% | β | β |
| mistral/mistral-small | β | β | 100% | β | β |
| grok/grok-3-mini | β | β | 100% | β | β |
| ollama/qwen3.5:9b | β | β | 100% | β | β |
| ollama/llama3.1:8b | β | β | β | β | β |
| ollama/mistral:7b | β | β | 98% | β | β |
| ollama/gemma2:9b | β | β οΈ | 95% | β | β |
| ollama/qwen3.5:35b | β | β | 98% | β | β |
All 11 providers pass the full pipeline. gemma2:9b is borderline on GI insight count (7.8/ep vs 8 threshold, instruction-following gap). Minimum viable model size for full pipeline: 7-8B (3B models fail on KG entity extraction). See ADR-077.
Implementation Status¶
All providers below are implemented and acceptance-tested (v2.4.0+).
| Provider | Status | RFC | Notes |
|---|---|---|---|
| Local ML | Implemented | - | Default provider (Whisper + spaCy + Transformers): transcription, speaker detection, summarization |
| Hybrid ML | Implemented | RFC-042 | Summarization only: MAP (LongT5) + REDUCE (transformers / Ollama / llama_cpp) |
| OpenAI | Implemented | RFC-013 | Transcription + summarization (Whisper API + GPT API) |
| Gemini | Implemented | RFC-035 | Transcription + summarization (no speaker detection) |
| Mistral | Implemented | RFC-033 | Summarization only (EU data residency) |
| Anthropic | Implemented | RFC-032 | Summarization only (no transcription or speaker detection) |
| DeepSeek | Implemented | RFC-034 | Summarization only; ultra low-cost |
| Grok | Implemented | RFC-036 | Summarization only; real-time information access |
| Ollama | Implemented | RFC-037 | Transcription, speaker detection, summarization; local self-hosted LLMs, zero cost, complete privacy |
For hybrid_ml (MAP-REDUCE) configuration and REDUCE backends (Ollama, llama_cpp,
transformers), see ML Provider Reference and
Configuration API. Transcript cleaning: hybrid_ml honors
transcript_cleaning_strategy like API providers; with pattern, use
hybrid_internal_preprocessing_after_pattern (CLI: --hybrid-internal-preprocessing-after-pattern)
to control internal MAP preprocessing after workflow cleaning (RFC-042).
LLM / hybrid LLM stage: long transcripts can hit the length-ratio guard (pattern-cleaned
input β₯2000 chars and LLM output below 20% of that length is discarded); see
CONFIGURATION β LLM cleaning length guard.
Key Statistics at a Glance¶
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β PROVIDER LANDSCAPE OVERVIEW β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€
β 9 Summarization Options β (Hybrid = MAP+REDUCE) β 3 Full-Stack Ready β
β ββββββββββββββββββββ β βββββββββββββββββββββββ β βββββββββββββββ β
β Local ML β Hybrid ML (RFC-042) β Local ML β
β Hybrid ML β MAP + Ollama/llama_cpp β OpenAI (tx+sum) β
β OpenAI β or transformers REDUCE β Ollama β
β Gemini β β β
β Mistral β β β
β Anthropic / DeepSeek β β β
β Grok / Ollama β β β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€
β COST SPECTRUM (per 100 episodes) β
β β
β $0 βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ $37 β
β β β β
β βΌ βΌ β
β Local/Ollama OpenAI (full) β
β ($0) ($37) β
β β
β DeepSeek βββ Grok βββ Anthropic βββ Gemini βββ OpenAI (text) βββ OpenAI β
β ($0.02) ($0.03) ($0.40) ($0.95) ($0.55) ($37) β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Quick Decision Matrix¶
Updated picks (v2 data, 2026-04-16): Default local-LLM is now qwen3.5:9b (not llama3.2:3b). Hybrid ML is demoted from default recommendation to narrow niche β v2 showed standalone qwen3.5:9b (0.509 paragraph held-out) beats hybrid bart+qwen3.5:9b (0.448) for our workload.
| If you need... | Choose | Why |
|---|---|---|
| Complete Privacy | Local ML / Ollama | Data never leaves your device |
| Lowest Cost | Local ML / Ollama | $0 (just electricity) |
| Air-gapped (no Ollama) | Local ML (ml_bart_led_autoresearch_v1) |
v2 held-out 0.206 (weak but zero-deps floor) |
| Air-gapped + Ollama | Ollama (qwen3.5:9b bundled) |
v2 held-out 0.509 paragraph / 0.529 bullets β the recommended local pick |
| Highest Quality (cloud) | DeepSeek (non-bundled) | v2 held-out 0.586 bullets / 0.541 paragraph β top of matrix |
| Fastest Cloud | Gemini 2.5-flash-lite | 1.5s/ep β fastest in v2 matrix |
| On-prem, quality first | Ollama qwen3.5:9b bundled |
Best local quality (0.509 paragraph, uncontested) |
| On-prem, speed first | Ollama (llama3.2:3b) |
12s/ep, quality floor 0.501 bullets |
| On-prem, bullets-only max quality | Ollama (qwen3.5:9b non-bundled) |
0.580 held-out (2nd overall, beats most cloud) |
| On-prem, no Ollama daemon, paragraph-first | HF transformers (DISLab/SummLlama3.2-3B) |
0.485 held-out paragraph / 0.416 bullets β DPO-tuned 3B via HF transformers on MPS/CUDA directly; operationally simpler than Ollama. Paragraph-strong, bullets-weaker (DPO was prose-shaped). |
| Full Capabilities | OpenAI / Local ML | All 3 capabilities (transcription + speaker + summary) |
| Hybrid MAP-REDUCE | Hybrid ML (Ollama/llama_cpp) | Retained as niche option (RFC-042); not recommended as default β see v2 findings |
| Real-Time Info | Grok | Real-time information access (RFC-036) |
| Lowest Cloud Cost | Gemini 2.5-flash-lite / DeepSeek | ~$0.0005/ep both β comparable |
| EU Data Residency | Mistral | European servers (RFC-033) |
| Huge Context | Gemini | 2 million token window (RFC-035) |
| Free Development | Gemini / Grok | Generous free tiers (RFC-035, RFC-036) |
| Self-Hosted | Ollama | Offline/air-gapped (RFC-037) |
Screenplay formatting vs transcription (GitHub #562)¶
transcription_provider |
screenplay: true in transcript body |
|---|---|
whisper (local) |
Yes β gap-based speaker labels via MLProvider |
openai (whisper-1 API) |
No β plain text transcript; segment JSON may still exist |
gemini / mistral |
No β plain text in our integration (provider wiring here) |
Validation coerces truthy screenplay to false when the transcription provider is not whisper, with a single INFO while a process gate is set (the gate resets when each run_pipeline finishes so another Config build can log again). Details: CONFIGURATION.md β Screenplay vs transcription.
Empirical Highlights¶
All claims below are backed by measured data. For the full metrics tables, methodology, and metric definitions, see the Evaluation Reports.
Note on the silver reference: Results were re-measured in April 2026 against
silver_sonnet46_benchmark_v1(Claude Sonnet 4.6, 10-episode benchmark scale). Rankings shifted significantly from March 2026 β see why the rankings changed. The March 2026 numbers (vs GPT-4o silver) are preserved in the March report for reference.
Full quality ladder β all four tiers¶
Every summarization option in one view, ordered by ROUGE-L. ML/hybrid numbers are smoke-scale (5 eps); cloud and Ollama numbers are benchmark-scale (10 eps).
All numbers benchmark-scale (10 eps, curated_5feeds_benchmark_v1 vs
silver_sonnet46_benchmark_v1).
| Tier | Mode | ROUGE-L | Embed | Lat/ep | Dependencies |
|---|---|---|---|---|---|
| 1 β ML Dev | ml_small_authority |
19.1% | 70.0% | fast | None (CI safe) |
| 2 β ML Prod | ml_bart_led_autoresearch_v1 |
20.5% | 68.2% | 26s | None (air-gap safe) |
| β Hybrid | ml_hybrid_bart_llama32_3b_autoresearch_v1 |
21.1% | 76.6% | 15s | Ollama (3B only) |
| 3 β LLM Local (small) | llama3.2:3b direct |
24.4% | 78.6% | 8.5s | Ollama |
| 3 β LLM Local (large) | qwen3.5:35b direct |
31.9% | 81.5% | 21s | Ollama |
| 4 β LLM Cloud (mid) | Gemini 2.0 Flash | 28.7% | 82.5% | 2.7s | API key |
| 4 β LLM Cloud (mid) | DeepSeek | 29.5% | 83.6% | 8.9s | API key |
| 4 β LLM Cloud (best) | Anthropic Haiku 4.5 | 33.7% | 86.2% | 5.0s | API key |
Key observations:
- The jump from ML-prod (20.5%) to direct-LLM (24.4%) is ~4 ROUGE-L points. The hybrid (21.1%) closes only ~1.5 of those points at benchmark scale β less than smoke suggested (23.7%). The gap exists because temperature=0.5 sampling variance averages down over 10 episodes. The hybrid is still valuable for long transcripts (BART MAP chunks arbitrary-length input).
- qwen3.5:35b (31.9%) is the only on-prem model in the cloud quality range β it exceeds Gemini, OpenAI, and Grok.
- The hybrid is the right choice when: transcripts exceed LLM context windows, Ollama is available but only a small model fits in VRAM, or quality must improve over ML-prod without paying for cloud.
Cloud providers β paragraphs (vs Sonnet 4.6 silver, April 2026)¶
Best cloud provider: Anthropic (claude-haiku-4-5) β leads on ROUGE-L and
embedding similarity across both smoke (5 eps) and benchmark (10 eps) runs. Gemini
(gemini-2.0-flash) remains the fastest cloud option. Numbers below are benchmark
(10-episode, more stable).
| Provider | ROUGE-L | Embed | Latency |
|---|---|---|---|
| Anthropic | 33.7% | 86.2% | 5.0s |
| DeepSeek | 29.5% | 83.6% | 8.9s |
| Gemini | 28.7% | 82.5% | 2.7s |
| Mistral | 28.0% | 82.3% | 4.6s |
| OpenAI | 26.8% | 84.1% | 8.5s |
| Grok | 26.7% | 81.7% | 7.5s |
The Anthropic model used here is
claude-haiku-4-5(smallest/fastest Haiku). The silver reference is Claude Sonnet 4.6 β Anthropic scores well partly because the models share a generation family.
Full table: Benchmark v1 report β Cloud providers
Local Ollama β paragraphs (vs Sonnet 4.6 silver, April 2026)¶
Best local model: Qwen 3.5:35b at 31.9% ROUGE-L (21s/ep) β the only on-prem model above the cloud median, competitive with Gemini and Mistral API. Mistral Small 3.2 (28.1%, 89s/ep) is a strong mid-tier option. llama3.2:3b (24.4%, 8.5s) is the best fast/low-resource choice. Numbers below are benchmark scale (10 eps).
| Model | ROUGE-L | Embed | Latency |
|---|---|---|---|
| qwen3.5:35b | 31.9% | 81.5% | 21s |
| mistral-small3.2 | 28.1% | 81.4% | 89s |
| qwen2.5:32b | 24.6% | 80.7% | 78s |
| qwen3.5:9b | 25.7% | 78.0% | 226sβ |
| llama3.2:3b | 24.4% | 78.6% | 8.5s |
Latencies are hardware-dependent (Apple M-series). Re-run on your machine. β qwen3.5:9b and qwen3.5:27b showed CPU-offload latency anomalies in the benchmark run.
Local Ollama β bullets (vs Sonnet 4.6 bullets silver, April 2026)¶
For bullet JSON output, qwen3.5:35b leads at benchmark scale (36.2% ROUGE-L, 14s/ep). llama3.2:3b is the fastest option (33.6% ROUGE-L, 5.2s/ep). qwen2.5:7b does not reliably follow the JSON format β avoid it for the bullets track.
| Model | ROUGE-L | Embed | Latency |
|---|---|---|---|
| qwen3.5:35b | 36.2% | 87.3% | 14.1s |
| qwen3.5:27b | 35.2% | 88.4% | 63.2s |
| mistral-small3.2 | 34.2% | 84.3% | 39.9s |
| llama3.2:3b | 33.6% | 82.9% | 5.2s |
| qwen3.5:9b | 32.6% | 83.5% | 16.7s |
Full tables: Benchmark v1 report (April 2026)
Detailed Cost Analysis¶
Per 100 Episodes β Complete Breakdown¶
| Provider | Transcription | Speaker | Summary | Total | vs OpenAI |
|---|---|---|---|---|---|
| Local ML | $0 | $0 | $0 | $0 | -100% |
| Ollama | N/A | $0 | $0 | $0 | -100% |
| DeepSeek | N/A | N/A | $0.016 | $0.016 | -97% |
| Grok (beta) | N/A | N/A | $0.00 | $0.00 | -100% |
| Mistral (Small) | N/A | N/A | $0.11 | $0.11 | -80% |
| Anthropic (Haiku) | N/A | N/A | $0.40 | $0.40 | -27% |
| Gemini (Flash) | $0.90 | N/A | $0.05 | $0.95 | +73% |
| OpenAI (Nano) | $36.00 | N/A | $0.28 | $36.28 | baseline |
| OpenAI (Mini) | $36.00 | N/A | $1.40 | $37.40 | +3% |
| Mistral (Large) | N/A | N/A | $9.00 | $9.00 | -75% |
Monthly Cost Projections¶
Monthly costs at different scales
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
100 ep/month 1,000 ep/month 10,000 ep/month
ββββββββββββ ββββββββββββββ βββββββββββββββ
Local ML $0 $0 $0
DeepSeek $0.02 $0.16 $1.60
Grok $0.03 $0.26 $2.60
Anthropic $0.40 $4.00 $40.00
OpenAI (text only) $0.55 $5.50 $55.00
OpenAI (full) $37.40 $374.00 $3,740.00
At 10,000 episodes/month, OpenAI full stack costs $3,740!
Using local transcription + DeepSeek: $1.60 (99.96% savings)
Key insight: Transcription dominates cloud costs (90%+). Use local Whisper + cloud text processing to save massively.
Decision Flowchart¶
START
β
βΌ
βββββββββββββββββββ
β What's your β
β TOP priority? β
ββββββββββ¬βββββββββ
β
ββββββββββββββββββββββΌβββββββββββββββββββββ
β β β
βΌ βΌ βΌ
βββββββββββ βββββββββββ βββββββββββ
β PRIVACY β β COST β β QUALITY β
ββββββ¬βββββ ββββββ¬βββββ ββββββ¬βββββ
β β β
βΌ βΌ βΌ
Need transcription? Need transcription? Budget matters?
β β β
ββββββ΄βββββ ββββββ΄βββββ ββββββ΄βββββ
βYes βNo β βYes βNo β βYes βNo β
βΌ βΌ βΌ βΌ βΌ βΌ βΌ βΌ βΌ
ββββββββ ββββββββ ββββββββ ββββββββ ββββββββ ββββββββ
βLOCAL β βOLLAMAβ βLOCAL β βDEEP β βGPT-5 β βGPT-5 β
β ML β β β βWhisperβ βSEEK β β Mini β β β
β β β β β + β β β β β β β
β β β β βDeepSk β β β β β β β
ββββββββ ββββββββ ββββββββ ββββββββ ββββββββ ββββββββ
ββββββββββββββββββββββΌβββββββββββββββββββββ
β β β
βΌ βΌ βΌ
βββββββββββ βββββββββββ βββββββββββ
β SPEED β β CONTEXT β β EU β
ββββββ¬βββββ ββββββ¬βββββ ββββββ¬βββββ
β β β
βΌ βΌ βΌ
βββββββββββ βββββββββββ βββββββββββ
β GROK β β GEMINI β β MISTRAL β
β β β Pro β β β
β Real-Timeβ β 2M β β Full β
β faster β β tokens β β Stack β
βββββββββββ βββββββββββ βββββββββββ
Autoresearch-derived defaults (2026-04)¶
These are the research-backed settings used across config/profiles/ (main
profiles + config/profiles/freeze/*.yaml per-provider captures). Source data in
docs/guides/eval-reports/ and data/eval/runs/ner_*_smoke_v1/.
Since 2026-04, the defaults are split across named presets so a single edit updates every deployment profile that references them (GitHub #634):
- Audio preprocessing:
config/profiles/audio/speech_optimal_v1.yamlβ referenced by all 5 deployment profiles viaaudio_preprocessing_profile: speech_optimal_v1. See the file's own comment block for rationale and data; seeconfig/profiles/audio/README.mdfor the pattern (both live underconfig/, outside the MkDocs tree). - Text cleaning (ML-only):
ml_preprocessing_profileonConfig(e.g.cleaning_v4), overrides themode_cfg.preprocessing_profiledefault in the ML summary registry. Cloud LLM and Ollama providers send raw transcripts and ignore this field.
NER / speaker detection¶
Smoke eval (5 episodes, 15 gold entities, data/eval/runs/ner_*_smoke_v1/):
| Backend | F1 | Notes |
|---|---|---|
spaCy trf (en_core_web_trf) |
1.000 | Free, local, deterministic |
spaCy sm (en_core_web_sm) |
0.966 | Faster, slightly lower recall |
| OpenAI, Anthropic, Gemini, Mistral, DeepSeek, Grok | 1.000 | All six cloud LLMs tie spaCy trf |
Ollama qwen3.5:9b |
0.750 | Misses 6/15 entities; weakest tested |
Default for main profiles: speaker_detector_provider: spacy +
ner_model: en_core_web_trf. Ties every cloud LLM on the available data while
saving the API call. Post-reingestion validation on a larger/harder corpus is
tracked in POST_REINGESTION_PLAN.md Step 6.
Exception β per-provider freeze profiles (config/profiles/freeze/<name>.yaml):
these intentionally route NER through <provider> to exercise each provider's
full surface in profile/cost capture runs.
All 6 cloud providers + Ollama have detect_speakers wired (see
src/podcast_scraper/providers/*/...). Earlier "summarization only" notes in
this guide are obsolete.
Grounded Insights (GI)¶
gi_insight_source: provider(notsummary_bullets) β +10pp insight coverage vs silver. Used bycloud_balanced,cloud_quality,local.airgappedkeepssummary_bulletsbecause SummLlama3.2-3B is summary-only, not chat.gi_max_insights: 12β autoresearch sweet spot; default is 20, which is long-tail and hurts precision.gi_require_grounding: true(default) β drops insights without grounded quote evidence.
Knowledge Graph (KG)¶
kg_extraction_source: providerβ +37pp topic coverage vs silver (measured on KG pipeline, not summary-bullet proxy).kg_max_topics: 10β autoresearch sweet spot; default is 20.kg_max_entities: 15β matches default; standardised across profiles.- KG v3 prompt (noun-phrase enforcement, promoted in #625 / PR #628) β already the default.
Insight clustering¶
- Default threshold 0.75 (
src/podcast_scraper/search/insight_clusters.py). Configured at call site, not via profile YAML.
Pipeline mode¶
llm_pipeline_mode: bundledβ only for localqwen3.5:9b: bundled schema stabilises paragraph output for Ollama (+structure reliability).llm_pipeline_mode: staged(default) β for Gemini, OpenAI, Anthropic (except Haiku bundled edge case), DeepSeek, Mistral, Grok. On Gemini and OpenAI, bundled costs 5-12% quality for no real gain.
Transcription¶
whisper_model: small.enβ 9.5% WER sweet spot, 148s/ep on CPU. Use this whenever the provider doesn't offer its own transcription.whisper_model: base.enβ 40s/ep, ~13% WER. Faster but noticeably worse quality; reserved fordevprofile.- Providers with native transcription: OpenAI (
whisper-1), Gemini (gemini-2.5-flash-lite), Mistral (voxtral-mini-latest). For these, per-provider capture profiles use the provider's own transcription path. - Anthropic, DeepSeek, Grok, Ollama have no transcription β fall back to
local Whisper (
small.en).
Transcription head-to-head (#577 Exp 3, 2026-04-20, 5 NPR eps ~30 min each)¶
Audio fixed at 32 kbps / 16 kHz / mono (Exp 1 winner). Reference transcripts
from curated_5feeds_benchmark_v2. Local Whisper small.en baseline WER ~11%.
| Provider / Model | WER (avg) | $/ep (32-min) | Wall | Notes |
|---|---|---|---|---|
openai/whisper-1 |
8.2% | $0.20 | 68s | No caps; hard-coded anti-loop filters make it the most stable cloud option |
mistral/voxtral-mini-latest |
8.6% (clean) | $0.034 | 21s | 6Γ cheaper than whisper-1 but 1/5 eps hallucinated (109K-char loop); no native anti-loop filters. Requires temperature=0.0 (applied) + output-length sanity check (applied) + pre-chunk to <25 min. New model voxtral-mini-transcribe-2 (Feb 2026) ships with diarization/timestamps and is expected to replace voxtral-mini-latest for batch transcription β upgrade tracked separately. |
openai/gpt-4o-transcribe |
β | β | β | Hard 1400s (23 min) duration cap. All 5 eps failed. Needs chunking (see #286). |
openai/gpt-4o-mini-transcribe |
β | β | β | Token budget cap (narrower than the 1400s cap). All 5 eps failed. Needs chunking. |
gemini/gemini-2.5-flash-lite |
72β931% | $0.01 | 16β121s | Not suitable for verbatim transcription. LLM-based audio without anti-loop filters. At default max_output_tokens=8192 silently truncates β summary-style (72% WER). At raised cap, runs the hallucination loop (931% WER). Use gemini-2.5-pro or -flash with thinking for better results if Gemini is required; otherwise prefer Whisper/Voxtral. |
Cost lever hierarchy for Whisper API path¶
- File-size vs duration β API cost is duration-based. Bitrate affects file size (upload speed, 25 MB cap) but not per-minute cost. Lower bitrate is still worth it for smaller files (Exp 1: 32 kbps is the sweet spot).
- Silence trim β direct duration cut. On tightly-edited benchmark fixtures
the filter removed 0% (fixtures have no silences > 1 s at any threshold).
On production NPR audio (#577 Exp 2-prod, 2026-04-20, 2 Γ 32-min episodes):
-50 dB / 2.0 s(the previous default) trims 0%.-30 dB / 0.5 strims 3.6% with < 1% transcript char drop.-25 dB / 0.5 strims 6.7% with similar char drop (held pending validation on more diverse podcast types).speech_optimal_v1.yamlnow uses-30 dB / 0.5 sas the moderate sweet spot β across all 5 deployment profiles that's a direct 3.6% API $ saving on top of whatever provider they pick. - Cheaper model β
voxtral-mini-latestis 6Γ cheaper at similar clean-run quality (pending hallucination mitigations).gpt-4o-mini-transcribeis 50% cheaper thanwhisper-1but blocked on chunking.
Cross-provider caps and constraints¶
| Model | File size | Duration | Token budget | Notes |
|---|---|---|---|---|
openai/whisper-1 |
25 MB | β | β | Most permissive OpenAI option |
openai/gpt-4o-transcribe |
β | 1400s | β | Hard duration cap |
openai/gpt-4o-mini-transcribe |
β | β | yes (tighter than 1400s) | Needs chunking for any real podcast ep |
mistral/voxtral-mini-latest |
n/a | 30 min (documented ceiling) | β | Exceeding 30 min triggers hallucination loops |
gemini/gemini-2.5-flash-lite |
inline ~20 MB (Files API β₯ 20 MB) | β | 8192 output default, 65536 cap | Silent summary-truncation at default cap |
Local vs API breakeven β when to pick which (#577 Exp 4)¶
Inputs (measured, 32-min NPR episode, 32 kbps input):
| Path | Wall per ep | $ per ep | Throughput (parallel) |
|---|---|---|---|
Local Whisper small.en on MPS |
~100s | $0.00 | 1 ep at a time (MPS shared) |
Cloud whisper-1 |
~68s | $0.20 | 50+ concurrent (tier-1 rate limit) |
Cloud voxtral-mini-latest (clean runs) |
~21s | $0.034 | 10+ concurrent |
Decision rules:
- Volume < 100 eps/day AND you're not time-sensitive β local wins. Zero cost; the 100s/ep on MPS is fine to run overnight. API $ savings (at most $20/day on whisper-1) don't justify the API complexity or hallucination risk.
- Volume 100β1,000 eps/day OR you need results inside an hour β cloud API
wins on wall-clock time. Pick
whisper-1for reliability (no chunking needed, no hallucination surprises). Expect$20β$200/day. - Volume > 1,000 eps/day OR cost-sensitive at scale β
voxtral-miniwith hallucination mitigations applied (temperature=0.0, pre-chunk to < 25 min, post-hoc length sanity check with fallback towhisper-1). 6Γ cheaper thanwhisper-1; expected ~$35/1,000 epsnet after fallback overhead.
Pathological case β a single ~3,000-word episode as fast as possible:
voxtral-mini-latest wins wall-time (~21s). whisper-1 is the conservative
alternative (~68s, predictable). Both beat local Whisper (~100s sequential).
Break-even table (rough, 32-min NPR avg episode):
| Daily volume | Local (1Γ MPS) | whisper-1 (50 parallel) | voxtral-mini (10 parallel + fallback) |
|---|---|---|---|
| 10 eps | 17 min, $0 | 14s, $2 | 21s, $0.40 |
| 100 eps | 2.8 hours, $0 | 2.3 min, $20 | 3.5 min, $4 |
| 1,000 eps | 28 hours, $0 | 23 min, $200 | 35 min, $40 |
| 10,000 eps | 11.5 days, $0 | 3.8 hours, $2,000 | 5.8 hours, $400 |
Rule of thumb: once you need throughput above 1 ep/min sustained, local on a
single Mac is out. Cloud API cost becomes the binding constraint, which is
why voxtral-mini with mitigations is the cost-optimal long-horizon pick.
Recommended Configurations¶
Configuration 1: Ultra-Budget ($0.016/100 episodes)¶
# 97% cheaper than OpenAI. DeepSeek has detect_speakers wired but NER smoke
# has spaCy trf tied at F1=1.000, so local spaCy is cheaper and equivalent.
transcription_provider: whisper # Free (local; DeepSeek no transcription API)
whisper_model: small.en # Production quality
speaker_detector_provider: spacy # Free, local; F1=1.000 smoke
ner_model: en_core_web_trf
summary_provider: deepseek # $0.016/100, leads v2 bullets
deepseek_summary_model: deepseek-chat
deepseek_api_key: ${DEEPSEEK_API_KEY}
Configuration 2: Quality-First (~$42/100 episodes)¶
# Maximum quality via OpenAI (transcription + summarization).
transcription_provider: openai
openai_transcription_model: whisper-1
speaker_detector_provider: spacy # F1 tied; skip the API call
ner_model: en_core_web_trf
summary_provider: openai
openai_summary_model: gpt-4o-mini # v2 eval model; gpt-4o for top quality
openai_api_key: ${OPENAI_API_KEY}
Configuration 3: Privacy-First ($0)¶
# Data never leaves your device.
transcription_provider: whisper # Local
whisper_model: small.en # 9.5% WER sweet spot
speaker_detector_provider: spacy # Local spaCy trf > qwen3.5:9b NER (F1 1.0 vs 0.75)
ner_model: en_core_web_trf
summary_provider: ollama # Local Ollama
ollama_summary_model: qwen3.5:9b # v2 local champion
llm_pipeline_mode: bundled # Stabilises Ollama JSON output
Configuration 4: Speed-First (~$0.25/100 episodes)¶
# Fast cloud summarization via Grok.
transcription_provider: whisper # Local (Grok no transcription API)
whisper_model: small.en
speaker_detector_provider: spacy # Free, tied F1 on smoke
ner_model: en_core_web_trf
summary_provider: grok
grok_summary_model: grok-3-mini # Eval-validated; grok-2/grok-beta are stale IDs
grok_api_key: ${GROK_API_KEY}
Configuration 5: EU Compliant (Mistral)¶
# European data residency end-to-end (Mistral has its own transcription).
transcription_provider: mistral
mistral_transcription_model: voxtral-mini-latest
speaker_detector_provider: spacy # Free, tied F1 on smoke
ner_model: en_core_web_trf
summary_provider: mistral
mistral_summary_model: mistral-large-latest
mistral_api_key: ${MISTRAL_API_KEY}
Configuration 6: Free Development (~$0)¶
# Maximize free tiers. Gemini free tier covers transcription + summary.
transcription_provider: gemini # Gemini has audio input
gemini_transcription_model: gemini-2.5-flash-lite
speaker_detector_provider: spacy # Free, tied F1 on smoke
ner_model: en_core_web_trf
summary_provider: gemini
gemini_summary_model: gemini-2.5-flash-lite
gemini_api_key: ${GEMINI_API_KEY}
Summary β v2 held-out key takeaways¶
See eval-reports/EVAL_HELDOUT_V2_2026_04.md
for the full matrix. Headline findings:
| Axis | Winner | Score / note |
|---|---|---|
| Cheapest cloud | DeepSeek | $0.016/100 eps, leads bullets non-bundled (43.1%) |
| Compound-scored cloud default | Gemini 2.5-flash-lite | Non-bundled: 0.564/0.479 bullets/para, 1.5s/ep, $0.47/1k eps |
| Best cloud bundled | Anthropic claude-haiku-4-5 |
Only provider where bundled is competitive (39.3% bullets, 39.2% para) |
| Best local | qwen3.5:9b bundled | 0.529/0.509 bullets/para, ~44s/ep, $0 |
| EU residency | Mistral | End-to-end: voxtral-mini-latest transcription + mistral-large-latest |
| Real-time info | Grok | X/Twitter integration |
| Local bullets leader (non-bundled) | qwen3.5:9b | 0.580 β beats qwen3.5:35b (0.576) at 1/4 the size |
Cost insight: transcription is 90%+ of cloud pipeline cost. Local Whisper
small.en + cloud summarization is the high-leverage combination. Per-provider
cost numbers are in EVAL_HELDOUT_V2;
older v1 numbers in this guide are superseded.
Rankings change when the silver reference changes β Sonnet 4.6 silver favours verbose paragraph style; different silvers may produce different orderings. See eval methodology for detail.
Related Documentation¶
- Provider Deep Dives β per-provider cards, magic quadrant, visual comparisons
- Evaluation Reports β methodology, metrics, and full comparison data
- Provider Configuration Quick Reference
- Ollama Provider Guide β complete Ollama setup and troubleshooting
- Provider Implementation Guide
- ML Provider Reference
- PRD-006: OpenAI Provider
- PRD-009: Anthropic Provider
- PRD-010: Mistral Provider
- PRD-011: DeepSeek Provider
- PRD-012: Gemini Provider
- PRD-013: Grok Provider (xAI)
- PRD-014: Ollama Provider