ADR-067: Pegasus/LED Retirement for Podcast Content¶
- Status: Accepted
- Date: 2026-04-03
- Authors: Podcast Scraper Team
- Related RFCs: RFC-057
- Supersedes: —
- See Also: ADR-010, ADR-043
Context & Problem Statement¶
Pegasus/CNN-DailyMail + LED-base-16384 (ml_prod_authority_v1) was the production summarization
mode for podcast content. During RFC-057 Track B autoresearch, empirical evaluation on
curated_5feeds_smoke_v1 revealed a fundamental architectural mismatch that cannot be resolved
through parameter tuning.
Decision¶
Retire ml_prod_authority_v1 (Pegasus MAP + LED REDUCE) as the production summarization mode
for podcast content. Preserve the entry in model_registry.py as deprecated, reserved for a
future news content type where the architectural fit is correct.
Root Cause Analysis: Three-Layer Failure¶
Layer 1 — Pretraining Domain Mismatch¶
Pegasus-CNN/DailyMail was pretrained with Gap Sentence Generation (GSG) on the CNN/DailyMail news corpus. GSG selects the most "important" sentences (by ROUGE to the rest of the document) as prediction targets. News articles have a canonical inverted-pyramid structure — a single dominant lead paragraph contains the key facts, with subsequent paragraphs adding detail.
Podcast transcripts have the opposite structure: conversational, topic-distributed, no single dominant segment. GSG applied to podcast chunks consistently selects overlapping sentences because all chunks are roughly equal in salience.
Measured evidence: Across curated_5feeds_smoke_v1, Pegasus MAP produced near-duplicate
chunk summaries — p02_e01 had 4/5 chunks with cosine similarity > 0.93 between each other.
Layer 2 — Redundant Input Exhausts LED's Diversity Mechanism¶
LED-base-16384 uses no_repeat_ngram_size=3 to prevent repetitive output. When its input
(combined Pegasus chunk summaries) is near-duplicate, the constraint fires within the first
50–70 tokens of generation. Every valid continuation is blocked — LED produces an EOS token
at ~55–65 tokens regardless of max_new_tokens.
Measured evidence: All reduce max_new_tokens candidates (450, 550, 750) produced
identical output length of ~58 avg tokens — proof that max_new_tokens was not the constraint.
Layer 3 — Parameter Tuning Cannot Fix Architecture¶
Sweeping no_repeat_ngram_size (2, 4, 5, 6, 8) produced no improvement — relaxing the
constraint did not help because the input redundancy is the root cause, not the constraint
itself. Sweeping length_penalty (1.2, 1.5, 2.0) and min_new_tokens also produced no gain.
Sweep result: 0 accepted params across all reduce and map param groups. Baseline ROUGE-L remained at ~5–7% throughout — well below BART+LED (18.8%) and hybrid modes (20–23%).
Measured Performance¶
Evaluated against silver_sonnet46_smoke_v1 reference on curated_5feeds_smoke_v1:
| Metric | ml_prod_authority_v1 (Pegasus+LED) | ml_bart_led_autoresearch_v1 (BART+LED) | Delta |
|---|---|---|---|
| ROUGE-L F1 | ~6.5% | 18.8% | +189% |
| Embedding Cosine | ~0.41 | 72.6% | +77% |
| Avg Output Tokens | ~58 | ~230 | +297% |
| Truncation Rate | 0% | 0% | — |
Why Pegasus Is Appropriate for News¶
The GSG pretraining objective is well-matched to news content:
- News has a dominant lead paragraph → GSG selects it consistently → diverse chunk summaries
- Factual, formal register matches CNN/DailyMail pretraining distribution
- Inverted-pyramid structure means later chunks genuinely add less, so LED gets real signal
When a news content type is added, ml_prod_authority_v1 should be re-evaluated without
modifications. Its architectural properties are correct for that domain.
Alternatives Considered¶
-
Tune Pegasus map params — Rejected.
repetition_penaltysweeps (1.3, 1.5, 1.7) did not produce diverse chunk summaries. The near-duplication is structural, not a decoding artifact. -
Replace LED reduce with Ollama LLM — Rejected as a Pegasus fix. This becomes the hybrid ML architecture (see ADR-069), which uses BART not Pegasus as the map model.
-
Fine-tune Pegasus on podcast data — Not evaluated. Out of scope for RFC-057 Track B, which focuses on production-ready zero-shot models.
Consequences¶
- Positive: Removes a consistently underperforming mode from the default path. Reduces confusion from having a deprecated-in-practice mode still listed as production.
- Positive: Documented retirement creates a clear record for the news content type decision.
- Neutral:
ml_prod_authority_v1remains inmodel_registry.pywithdeprecated_atanddeprecation_reasonfields. Existing configs referencing it continue to work. - Negative: Any existing deployments relying on
ml_prod_authority_v1must migrate toml_bart_led_autoresearch_v1(pure ML) orml_hybrid_bart_llama32_3b_autoresearch_v1(hybrid, recommended).
Implementation Notes¶
- Registry:
src/podcast_scraper/providers/ml/model_registry.py—ml_prod_authority_v1marked withdeprecated_at="2026-04-03"anddeprecation_reasonpointing to news content type. - Tombstone experiment:
data/eval/baselines/baseline_ml_pegasus_retirement_smoke_v1/withRETIREMENT.mddocumenting the full root cause. - Tombstone config:
data/eval/configs/ml/baseline_ml_pegasus_retirement_smoke_v1.yaml - Default:
PROD_DEFAULT_SUMMARY_MODE_IDupdated toml_bart_led_autoresearch_v1insrc/podcast_scraper/config_constants.py.
References¶
- RFC-057: AutoResearch Optimization Loop
- ADR-010: Hierarchical Summarization Pattern
- ADR-043: Hybrid MAP-REDUCE Summarization
- Tombstone:
data/eval/baselines/baseline_ml_pegasus_retirement_smoke_v1/RETIREMENT.md