ADR-069: Hybrid ML Pipeline as Primary Production Summarization Direction¶
- Status: Accepted
- Date: 2026-04-03
- Authors: Podcast Scraper Team
- Related RFCs: RFC-057, RFC-042
- Supersedes: —
- See Also: ADR-043, ADR-068, ADR-070
Context & Problem Statement¶
RFC-057 Track B autoresearch established BART+LED as the best pure-ML local baseline at 18.82% ROUGE-L (ADR-068). A persistent ~10pp gap remained vs cloud models (28–32% ROUGE-L). The hybrid ML architecture (RFC-042) — classic HF model for MAP compression + local LLM for REDUCE synthesis — had been implemented but not empirically validated against the pure-ML baseline or across LLM candidates. RFC-057 Track B extended the sweep harness to cover hybrid Ollama configs to close this question with measurement.
Decision¶
Adopt the hybrid ML pipeline (HF MAP model + Ollama LLM REDUCE) as the primary production summarization direction for podcast content, superseding pure-ML BART+LED in the default path once fully validated.
The validated champion configuration is:
ml_hybrid_bart_llama32_3b_autoresearch_v1 — BART-base MAP + Llama 3.2:3b REDUCE
(temperature=0.5, top_p=1.0) at 23.1% ROUGE-L, 77.7% embedding cosine, ~17s/episode.
LLM Reduce Candidate Evaluation¶
All candidates evaluated on curated_5feeds_smoke_v1 vs. silver_sonnet46_smoke_v1 at
baseline parameters (temperature=0.3, top_p=0.9). LongT5-base MAP for all.
| Model | Size | ROUGE-L (baseline) | Embed | Latency | Notes |
|---|---|---|---|---|---|
| Llama 3.2:3b | 3B | 18.9% | 76.3% | 15s | Meta, instruction-tuned |
| Qwen 2.5:7b | 7B | 17.5% | 76.4% | 20s | Alibaba, multilingual |
| Gemma 2:9b | 9B | ~17.8% | ~75.8% | 23s | Google, academic-tuned |
| Llama 3.1:8b | 8B | ~17.2% | ~75.1% | 21s | Meta, general |
| Mistral-Nemo:12b | 12B | 18.7% | 77.7% | 19s | Mistral AI, multilingual |
Key finding — instruction-following beats raw size: Llama 3.2:3b (3B parameters) outperformed all 7–12B models at baseline. Smaller but instruction-optimized models synthesize podcast content more effectively than larger general-purpose models. Qwen 2.5:7b at 7B underperformed the 3B Llama by 1.4pp ROUGE-L despite being 2.3x larger.
Temperature as the Critical Hybrid Tuning Lever¶
The default ollama_reduce_temperature=0.3 was inherited from the Ollama provider's general
summarization default. Autoresearch sweep on Llama 3.2:3b revealed this was too conservative:
| Temperature | ROUGE-L | Delta vs 0.3 |
|---|---|---|
| 0.1 | 17.1% | -9.5% |
| 0.5 | 20.8% | +10.0% ← accepted |
| 0.7 | 19.5% | +3.2% (later rejected by top_p sweep) |
0.5 is the sweet spot: enough randomness for creative synthesis (the LLM needs to paraphrase and connect ideas across chunks) but not so much that it drifts from the source material. Temperature 0.3 produced overly conservative, extractive-style output. Temperature 0.7 introduced hallucination-adjacent drift.
The same finding held for Qwen 2.5:7b (temperature=0.5: +4.23% vs baseline).
Full Autoresearch Sweep — Llama 3.2:3b (LongT5 MAP)¶
| Param Group | Param | Candidate | ROUGE-L | Delta | Decision |
|---|---|---|---|---|---|
| ollama_reduce | temperature | 0.1 | 17.1% | -9.5% | |
| ollama_reduce | temperature | 0.5 | 20.8% | +10.0% | Yes |
| ollama_reduce | top_p | 0.7 | 20.7% | -0.5% | |
| ollama_reduce | top_p | 0.95 | 20.5% | -1.4% | |
| ollama_reduce | top_p | 1.0 | 20.4% | -1.9% | |
| — | — | — | — | early stop | — |
Round total: baseline 18.9% → 20.8% ROUGE-L (+10.0%), 1 param accepted.
Full Autoresearch Sweep — BART MAP + Llama 3.2:3b REDUCE¶
Base: hybrid_ml_bart_llama32_3b_smoke_paragraph_v1 (ROUGE-L 20.66%, post BART map swap)
| Param Group | Param | Candidate | ROUGE-L | Delta | Decision |
|---|---|---|---|---|---|
| ollama_reduce | top_p | 0.7 | 20.65% | -0.05% | |
| ollama_reduce | top_p | 0.95 | 22.71% | +9.91% | Yes |
| ollama_reduce | top_p | 1.0 | 23.12% | +1.84% | Yes |
| ollama_reduce | frequency_penalty | 0.3 | 19.55% | -15.44% | |
| ollama_reduce | frequency_penalty | 0.6 | 22.66% | -1.99% | |
| ollama_reduce | frequency_penalty | 1.0 | 20.04% | -13.35% | |
| — | — | — | early stop | — | — |
Round total: 20.66% → 23.12% ROUGE-L (+11.93%), 2 params accepted.
top_p=1.0 finding: No nucleus sampling. With temperature=0.5 already controlling randomness, removing token probability filtering lets the model access its full vocabulary for synthesis — particularly important for domain-specific podcast terminology that might sit below top-p cutoffs.
frequency_penalty finding: Any frequency penalty hurts quality with BART map input. BART produces sufficiently diverse chunk summaries that the LLM does not need external repetition penalties — adding them suppresses legitimate recurring concepts.
Final Champion vs. All Alternatives¶
| Mode | ROUGE-L | Embed | Latency | Privacy |
|---|---|---|---|---|
ml_hybrid_bart_llama32_3b_autoresearch_v1 |
23.1% | 77.7% | 17s | 100% local |
ml_hybrid_llama32_3b_autoresearch_v1 (LongT5 MAP) |
20.8% | 76.3% | 15s | 100% local |
| Mistral-Nemo:12b baseline | 18.7% | 77.7% | 19s | 100% local |
ml_bart_led_autoresearch_v1 (pure ML) |
18.8% | 72.6% | ~30s | 100% local |
| Qwen 2.5:7b (swept) | 18.2% | 76.4% | 20s | 100% local |
| OpenAI GPT-4o (cloud reference) | ~28–32% | ~82% | ~3s | cloud |
| Anthropic Claude Sonnet 4.6 (cloud) | ~32.6% | ~85% | ~3s | cloud |
The hybrid champion closes 70% of the gap between pure-ML local (18.8%) and best cloud (32.6%) at zero variable cost and with full data privacy.
Alternatives Considered¶
-
Pure ML BART+LED only — Rejected as the primary direction. Caps at ~19% ROUGE-L due to LED's abstractive quality ceiling. Useful as a privacy-maximum fallback when Ollama is unavailable.
-
LLM-only (no MAP stage) — Not evaluated. Full transcript exceeds LLM context windows without chunking; MAP compression is architecturally necessary. See RFC-042.
-
Cloud LLM reduce — Rejected. Breaks the privacy-first local principle (ADR-009) and introduces variable cost. The 9pp gap to cloud is acceptable given the privacy/cost benefit.
-
Larger Ollama models (Qwen 2.5:32b, Llama 3.3:70b) — Not swept in this phase. Results from the 3–12B comparison suggest returns diminish beyond 3B for instruction-tuned models on this task. Can be revisited if hardware constraints change.
Consequences¶
- Positive: +22.9% relative ROUGE-L over pure-ML baseline. Closes 70% of cloud quality gap.
- Positive: 17s/episode — faster than larger cloud-competing Ollama models (19–23s) and faster than pure-ML BART+LED (~30s due to beam search on CPU-loaded LED).
- Positive: Empirically validated parameter choices (temperature, top_p) replace ad-hoc defaults across all hybrid configs.
- Neutral: Requires Ollama to be running locally. Graceful fallback to
ml_bart_led_autoresearch_v1when Ollama is unavailable (existing routing logic, RFC-042). - Neutral:
PROD_DEFAULT_SUMMARY_MODE_IDmigration from pure-ML to hybrid deferred until full authority benchmark confirms generalisation beyond the smoke dataset. - Negative: Adds a runtime dependency (Ollama daemon). Production deployment must ensure Ollama is pre-started and the target model is pulled.
Implementation Notes¶
- Registry entry:
src/podcast_scraper/providers/ml/model_registry.py→_mode_registry["ml_hybrid_bart_llama32_3b_autoresearch_v1"] - Ollama params wiring:
OllamaReduceParamsdataclass insrc/podcast_scraper/evaluation/experiment_config.py; passed throughscripts/eval/experiment/run_experiment.py→src/podcast_scraper/config.py→src/podcast_scraper/providers/ollama/ollama_provider.py - Canonical eval config:
data/eval/configs/ml/baseline_ml_hybrid_bart_llama32_3b_autoresearch_v1.yaml - Sweep TSVs:
autoresearch/ml_param_tuning/results/bart_llama32_3b_sweep_*.tsv