ADR-070: BART-base as Hybrid MAP Stage Over LongT5-base¶
- Status: Accepted
- Date: 2026-04-04
- Authors: Podcast Scraper Team
- Related RFCs: RFC-057, RFC-042
- See Also: ADR-069, ADR-068
Context & Problem Statement¶
The hybrid ML pipeline (ADR-069) requires a MAP model to compress podcast transcript chunks into dense summaries that an Ollama LLM then synthesises. Two candidates were evaluated:
- BART-base (
facebook/bart-base): 1024-token context, pretrained on general text with BERT-style denoising. Already proven in pure-ML BART+LED pipeline. - LongT5-base (
google/long-t5-tglobal-base): 4096-token context, pretrained with span-corruption (T5-style). Designed for longer document understanding.
The intuition was that LongT5's 4x larger context window would produce richer chunk summaries → better LLM synthesis. This was tested empirically.
Decision¶
Use BART-base (bart-small in model registry) as the MAP stage in the hybrid ML pipeline.
LongT5-base is retained as a supported MAP model but is not the default. It remains available for experimentation and may be reconsidered if the Ollama reduce model changes.
Experimental Evidence¶
Head-to-Head Comparison (identical Ollama reduce: Llama 3.2:3b, temp=0.5, top_p=0.9)¶
Evaluated on curated_5feeds_smoke_v1 vs. silver_sonnet46_smoke_v1:
| MAP Model | ROUGE-L | Embed Cosine | Avg Output Tokens | Latency |
|---|---|---|---|---|
| BART-base | 21.24% | 77.5% | 461 | 16s |
| LongT5-base | 20.82% | 76.3% | ~430 | 15s |
| Delta | +0.42pp | +1.2pp | +31 | +1s |
BART wins on both ROUGE-L and embedding cosine despite LongT5's larger context window.
After Autoresearch Sweep (BART MAP, same reduce)¶
Further sweeping ollama_reduce params on the BART-map baseline:
| Config | ROUGE-L | Embed | Latency |
|---|---|---|---|
| BART MAP + Llama 3.2:3b baseline | 21.24% | 77.5% | 16s |
| + top_p=0.95 | 22.71% | — | — |
| + top_p=1.0 | 23.12% | 77.7% | 17s |
BART MAP config reached 23.12% ROUGE-L after sweep — 0.3pp above the LongT5 MAP best (20.82%) even before any additional tuning on the LongT5 path.
LongT5+LED Pure-ML Control¶
To further understand LongT5's properties, a pure-ML LongT5+LED mode was evaluated:
| Config | ROUGE-L | Embed | Avg Tokens | Notes |
|---|---|---|---|---|
| LongT5+LED baseline | 8.4% | 46.6% | 42 | LED generates EOS at ~42 tokens |
| + map max_new_tokens=150 | 9.9% | 49.6% | 48 | Only improvement found |
| BART+LED autoresearch_v1 | 18.8% | 72.6% | ~230 | 2x ROUGE-L, 5x tokens |
LongT5 in pure-ML mode is severely constrained: its span-corruption pretraining produces
compressed, abstract chunk summaries (~15–25 tokens/chunk) that starve LED of synthesis
material. Counter-intuitively, max_new_tokens=150 (shorter) outperformed 200/250/300
because the constraint forced more selective, information-dense compression.
This explains the hybrid result: When the LLM replace LED as the reduce model, the constraint inverts — the LLM is powerful enough to synthesise even from terse input. But BART still produces richer notes than LongT5, giving the LLM more to work with.
Why Pretraining Alignment Beats Context Window Size¶
| Property | BART-base | LongT5-base |
|---|---|---|
| Context window | 1024 tokens | 4096 tokens |
| Pretraining objective | BERT denoising (reconstruct masked spans) | Span corruption (fill in gaps) |
| Output style | Content-rich, descriptive | Compressed, abstract, short |
| Typical chunk output length | 100–150 tokens | 15–25 tokens |
| Training data | Books + Common Crawl (diverse) | C4 (web-crawled) |
| Podcast chunk compression | Preserves named entities, quotes, examples | Strips specifics, retains structure |
The podcast transcription domain shares more surface properties with BART's diverse training data than with LongT5's structured web text. BART "describes" chunk content; LongT5 "abstracts" it. For LLM synthesis, description is more useful — the LLM already knows how to abstract.
The context window advantage of LongT5 (4096 vs 1024) is not relevant in practice: podcast chunks are chunked to 900 words (~1100 tokens) to fit BART's window. LongT5's extra capacity is unused since the chunking strategy is constrained by the smaller model.
Alternatives Considered¶
-
LongT5 with tuned map params — Swept
max_new_tokens(150, 200, 250, 300),num_beams(4, 6, 8),no_repeat_ngram_size(2, 3, 4). Onlymax_new_tokens=150accepted (+17.5% in pure-ML mode). Even after tuning, LongT5+LED reached only 9.9% — 50% below BART+LED. In hybrid mode, the gap is smaller (21.2% vs 20.8%) but BART still wins. -
Larger chunk size for LongT5 — Not evaluated. Would require extending the chunking pipeline for model-specific windows. Deferred: BART wins without this change.
-
PEGASUS as MAP — Rejected. Near-duplicate chunk output makes LLM synthesis degenerate regardless of reduce model. See ADR-067.
-
Flan-T5 as MAP — Not evaluated in this round. Flan-T5 is instruction-tuned (good for prompted summarization) but not evaluated in the hybrid context. Candidate for future work.
Consequences¶
- Positive: BART-base map produces richer chunk summaries → 23.1% ROUGE-L vs 20.8% for LongT5 after equivalent sweep effort.
- Positive: Reuses the same BART model already loaded for pure-ML fallback mode — no additional model download for hybrid users.
- Neutral: LongT5 remains available as an alternative MAP stage for users who want to experiment with the longer context window (e.g., for very long episodes with fewer chunks).
- Neutral: Chunking strategy (word_chunk_size=900) continues to be driven by BART's 1024 token limit, not LongT5's 4096 limit.
- Negative: The 4096-token context advantage of LongT5 is currently unexploited. If episode length distribution shifts (e.g., very long episodes > 2h), LongT5 may become preferable with a larger chunk size.
Implementation Notes¶
- Champion config:
data/eval/configs/ml/baseline_ml_hybrid_bart_llama32_3b_autoresearch_v1.yaml - Registry entry:
src/podcast_scraper/providers/ml/model_registry.py→_mode_registry["ml_hybrid_bart_llama32_3b_autoresearch_v1"] - Sweep TSVs:
autoresearch/ml_param_tuning/results/bart_llama32_3b_sweep_*.tsv - LongT5 sweep TSVs:
autoresearch/ml_param_tuning/results/longt5_llama32_3b_sweep_*.tsv,autoresearch/ml_param_tuning/results/longt5_led_sweep_*.tsv