ADR-068: BART+LED as Local ML Production Baseline¶
- Status: Accepted
- Date: 2026-04-03
- Authors: Podcast Scraper Team
- Related RFCs: RFC-057
- Supersedes:
ml_prod_authority_v1(Pegasus+LED) — see ADR-067 - See Also: ADR-048
Context & Problem Statement¶
Following Pegasus retirement (ADR-067), the project needed a validated local ML summarization
baseline for podcast content. ml_small_authority (BART-small + LED) existed as a development
baseline but had not been swept for optimal parameters. RFC-057 Track B defined a greedy
one-param-at-a-time sweep ratchet to find the best local ML configuration empirically.
Decision¶
Promote ml_bart_led_autoresearch_v1 — BART-base MAP + LED-base-16384 REDUCE with
autoresearch-tuned parameters — as the canonical local ML production baseline.
Register in model_registry.py and set as PROD_DEFAULT_SUMMARY_MODE_ID.
Sweep Methodology¶
RFC-057 Track B uses a greedy ratchet:
- Accept threshold: ≥ +1% relative ROUGE-L gain over current best
- Early stop: 3 consecutive rejections within a param group
- Reference:
silver_sonnet46_smoke_v1(Claude Sonnet 4.6 silver labels) - Dataset:
curated_5feeds_smoke_v1(5 episodes, 4 podcast feeds)
Sweep Results — Round 1 (Reduce Params)¶
Base config: baseline_ml_dev_authority (BART-base MAP, num_beams=4; LED REDUCE, max_new_tokens=650)
| Param | Candidate | ROUGE-L | Delta | Decision |
|---|---|---|---|---|
reduce max_new_tokens |
450 | — | rejected | |
reduce max_new_tokens |
550 | 18.54% | +2.89% | Accepted |
reduce max_new_tokens |
750 | — | rejected | |
reduce num_beams |
6 | 18.82% | +1.15% | Accepted |
reduce num_beams |
8 | — | rejected | |
reduce length_penalty |
1.2 | — | rejected | |
reduce length_penalty |
1.5 | — | rejected | |
reduce length_penalty |
0.8 | — | early stop |
Round 1 outcome: ROUGE-L 18.05% → 18.82% (+4.26%), 2 params accepted.
Sweep Results — Round 2 (Map Params)¶
Base: round-1 winner (max_new_tokens=550, num_beams=6)
| Param | Candidate | Delta | Decision |
|---|---|---|---|
map num_beams |
6 | +0.0% | |
map num_beams |
8 | — | early stop |
reduce no_repeat_ngram_size |
4, 5, 2 | ≤0% | all |
reduce min_new_tokens |
150, 280, 320 | ≤0% | all |
reduce repetition_penalty |
1.1, 1.5, 1.0 | ≤0% | all |
Round 2 outcome: No further gain. Round 1 winner is the stable optimum.
Final Promoted Configuration¶
Registered as ml_bart_led_autoresearch_v1 in model_registry.py:
map_model: bart-small (facebook/bart-base)
map_params: num_beams=4, max_new_tokens=200, min_new_tokens=80,
no_repeat_ngram_size=3, repetition_penalty=1.3
reduce_model: long-fast (allenai/led-base-16384)
reduce_params: num_beams=6, max_new_tokens=550, min_new_tokens=220,
no_repeat_ngram_size=3, repetition_penalty=1.3
preprocessing: cleaning_v4
chunking: word_chunking, word_chunk_size=900, word_overlap=150
tokenize: map_max_input_tokens=1024, reduce_max_input_tokens=4096
Measured Performance vs. Alternatives¶
Evaluated on curated_5feeds_smoke_v1 vs. silver_sonnet46_smoke_v1:
| Mode | ROUGE-L F1 | Embedding Cosine | Avg Tokens | Privacy |
|---|---|---|---|---|
ml_bart_led_autoresearch_v1 |
18.82% | 72.6% | ~230 | 100% local |
ml_prod_authority_v1 (Pegasus) |
~6.5% | ~41% | ~58 | 100% local |
ml_small_authority (pre-sweep) |
~16.3% | ~70% | ~185 | 100% local |
| OpenAI GPT-4o (cloud reference) | ~28–32% | ~82% | ~420 | cloud |
Key finding: Sweeping just 2 reduce parameters (max_new_tokens, num_beams) gave +4.26%
over the development baseline. The remaining ~10pp gap to cloud models is the motivation for
the hybrid ML architecture (ADR-069).
Why BART over Pegasus for Local ML¶
| Property | BART-base | Pegasus-CNN |
|---|---|---|
| Pretraining objective | Text infilling (BERT-like denoising) | GSG (gap sentence generation) |
| Domain | General (books + web) | News (CNN/DailyMail) |
| Podcast chunk diversity | High — produces topically diverse summaries | Low — near-duplicate (see ADR-067) |
| LED compatibility | Compatible — diverse input enables ngram budget | Incompatible — exhausts ngram budget |
Consequences¶
- Positive: 189% ROUGE-L improvement over Pegasus baseline in production.
- Positive: Establishes a reproducible sweep methodology (RFC-057 Track B ratchet) for future model promotions.
- Neutral: This mode is now the privacy-first fallback. The recommended production path is the hybrid ML pipeline (ADR-069), which surpasses this by a further +22.9%.
- Neutral:
PROD_DEFAULT_SUMMARY_MODE_IDpoints to this mode as the pure-ML anchor while hybrid validation completes.
Implementation Notes¶
- Registry entry:
src/podcast_scraper/providers/ml/model_registry.py→_mode_registry["ml_bart_led_autoresearch_v1"] - Canonical eval config:
data/eval/configs/ml/baseline_ml_bart_led_autoresearch_v1.yaml - Sweep TSVs:
autoresearch/ml_param_tuning/results/bart_led_sweep_*.tsv - Default constant:
src/podcast_scraper/config_constants.py→PROD_DEFAULT_SUMMARY_MODE_ID = "ml_bart_led_autoresearch_v1"