RFC-042: Hybrid Podcast Summarization Pipeline¶
Status¶
🟢 Completed - Implemented in v2.5
Execution Timing: Phase 2 of 3 — Implement after RFC-044 (Model Registry), before RFC-049 (GIL). This RFC populates the registry with all model families and builds the hybrid ML platform that RFC-049 consumes.
RFC Number¶
042
Authors¶
Podcast Scraper Team
Date¶
2026-01-10
Related ADRs¶
- ADR-043: Hybrid MAP-REDUCE Summarization
- ADR-044: Local LLM Backend Abstraction
- ADR-045: Strict REDUCE Prompt Contract
Related RFCs¶
- RFC-044: Model Registry (prerequisite — provides model metadata infrastructure)
- RFC-045: ML Model Optimization Guide (preprocessing + parameter tuning)
- RFC-049: Grounded Insight Layer – Core (downstream consumer of structured extraction)
- RFC-050: GIL Use Cases
- RFC-051: Database Projection (GIL & KG)
- RFC-052: Locally Hosted LLM Models (model-specific prompt engineering for Ollama LLMs)
- RFC-053: Adaptive Summarization Routing (downstream — routes episodes to optimal strategies)
Execution Order:
```text
Phase 1: RFC-044 (Model Registry)           ~2-3 wk
  │  ModelCapabilities, ModelRegistry
  ▼
Phase 2: RFC-042 (this RFC — Hybrid ML)     ~10 wk
  │  Populate registry, hybrid provider,
  │  extended models (Embedding, QA, NLI)
  │  + RFC-052 (model-specific prompts, parallel)
  ▼
Phase 3: RFC-049 (Grounded Insight Layer)   ~6-8 wk
  │  Consume registry + hybrid platform for GIL
  ▼
Phase 4: RFC-053 (Adaptive Routing)         ~4-6 wk
     Route episodes to optimal strategies
     based on profiling + model diversity
```
Related Issues¶
- TBD (to be created)
Abstract¶
This RFC proposes a new hybrid MAP-REDUCE provider for podcast transcripts that addresses persistent quality issues in the current classic summarization approach. The hybrid provider uses classic summarizers (LED, LongT5, PEGASUS) for efficient compression in the MAP phase and instruction-tuned models (FLAN-T5, Qwen, LLaMA, Mistral) for abstraction and structuring in the REDUCE phase. This separation of concerns leverages each model class for what it does best, resulting in higher-quality summaries with better structure adherence, deduplication, and content filtering.
Expanded scope (v2.5+): Beyond summarization, this RFC establishes the local ML platform for structured extraction tasks including:
- FLAN-T5 as a lightweight instruction-tuned model family (runs locally without llama.cpp)
- Sentence-transformers for semantic similarity, grounding validation, and topic clustering
- Structured extraction capability (JSON output from transcripts) — the foundation for RFC-049 Grounded Insight Layer
These additions generalize the hybrid architecture from "summarization-only" to "any MAP → instruction-following REDUCE pipeline," enabling reuse across summarization, GIL extraction, and future structured tasks.
Target deployment: Local execution on Mac laptops (Apple Silicon) and later expansion to other platforms.
Provider status: New optional provider alongside existing ML provider, not a replacement.
1. Problem Statement¶
We are building a local, PyTorch-based podcast summarization system capable of handling long-form audio (30–90 minutes). The current ML provider relies exclusively on classic abstractive summarization models (BART, LED, PEGASUS) in both the MAP (chunk-level) and REDUCE (final summary) phases.
Despite multiple iterations (chunking strategies, bulletization, structural joiners, deduplication hints), the final summaries consistently exhibit the following failure modes:
Observed Failure Modes¶
- Extractive or semi-extractive output - Stitched sentences from the source rather than true abstraction
- Echoing of scaffolding text - Copying schema headers, "chunk" labels, and structural hints
- Repetition loops - "Chunks of chunks" recursion and duplicate ideas
- Poor content filtering - Inability to reliably ignore intros, outros, sponsorships, or meta-text
- Structure non-compliance - Poor adherence to desired output format (takeaways, outline, actions)
These issues persist even after significant prompt-like engineering at the reducer input level.
2. Root Cause Analysis¶
The root cause is model-pipeline mismatch.
2.1 Model Training Mismatch¶
Classic summarization models (BART, LED, PEGASUS):
- ❌ Are not instruction-following models
- ❌ Were trained primarily on clean news-style prose (CNN/DailyMail, XSum)
- ❌ Treat all input text as equally salient content
- ❌ Do not reliably distinguish between:
- content vs instructions
- metadata vs source text
- schema vs information
Result: Any reducer input containing schema descriptions, structural headers, or deduplication instructions is often copied verbatim or recursively summarized, leading to degraded outputs.
2.2 Transcript Domain Mismatch¶
Podcast transcripts are:
- 🎙️ Conversational - Natural speech patterns, fillers, false starts
- 🔁 Repetitive - Ideas restated multiple times for emphasis
- 📝 Noisy - Timestamps, speaker labels, formatting artifacts
- 🗣️ Structurally unlike news articles - No headlines, no inverted pyramid
Classic summarizers default to safe extraction when faced with such input.
3. Goals¶
We want a local ML platform that:
- ✅ Produces genuinely abstractive summaries (not extractive stitching)
- ✅ Reliably ignores non-content (ads, promos, intros, outros)
- ✅ Deduplicates repeated ideas across chunks
- ✅ Outputs a clean, predictable structure:
- Key takeaways (6–10 bullets)
- Topic outline (6–12 bullets)
- Action items (if any)
- ✅ Runs locally with reasonable performance and cost
- ✅ Works well on Mac laptops (Apple Silicon) initially
- ✅ Exists as an optional provider alongside current ML provider
- ✅ Provides FLAN-T5 as a lightweight instruction-tuned model that runs via PyTorch (no llama.cpp required)
- ✅ Adds sentence-transformer embeddings for semantic similarity, topic clustering, and grounding validation
- ✅ Supports structured extraction (JSON output) beyond prose summarization — foundation for RFC-049 GIL
4. Non-Goals¶
- ❌ End-to-end training or fine-tuning of large models
- ❌ Fully extractive summarization
- ❌ Reliance on proprietary cloud-only APIs
- ❌ Replacing the existing ML provider (this is additive)
- ❌ Supporting all hardware configurations in v1 (focus on Mac)
5. Proposed Solution: Hybrid MAP-REDUCE Architecture¶
We propose splitting responsibilities across model classes, using each for what it does best.
High-Level Architecture¶
```text
Transcript
    ↓
Chunking (existing logic)
    ↓
MAP Phase (Classic Summarizer)
 ├─ LED / LongT5 / PEGASUS
 ├─ Compression: chunks → bullet notes
 └─ Output: Factual notes, no structure enforcement
    ↓
REDUCE Phase (Instruction-Tuned LLM)
 ├─ Qwen2.5 / LLaMA / Mistral (local)
 ├─ Abstraction, deduplication, filtering
 └─ Output: Structured summary (takeaways, outline, actions)
    ↓
Post-processing (minimal)
 └─ Whitespace cleanup, bullet normalization
```
### Key Principle
**Separation of Concerns:**
- **MAP** = Compression engine (use classic summarizers)
- **REDUCE** = Reasoning + structuring engine (use instruction-tuned LLMs)
---
## 6. MAP Phase (Classic Summarization)
### Purpose
Use classic summarizers as **compression engines**, not final writers.
### Models
- **LED** (e.g., `allenai/led-base-16384`)
- **LongT5** (e.g., `google/long-t5-tglobal-base`)
- **PEGASUS** (optional, for shorter contexts)
### Characteristics
- **Input:** Cleaned transcript chunks (1000-4000 tokens)
- **Output:** Short bullet-style notes per chunk (100-200 tokens)
- **No instructions or schema text in input**
- **No expectations of structure or formatting compliance**
### Rationale
Classic summarizers:
- ✅ Are **efficient at compressing long inputs**
- ✅ Are **widely available in PyTorch** (`transformers`)
- ✅ Perform **adequately as "note generators"**
- ✅ Can run on CPU if needed (though GPU is faster)
---
## 7. REDUCE Phase (Instruction-Tuned LLM)
### Purpose
Perform **true abstraction, deduplication, filtering, and structuring**.
### Model Class
Instruction-tuned models, run locally:
**Tier 1 — Instruction-Tuned Seq2Seq (PyTorch-native):**
- **FLAN-T5-base** (250M, ~1GB) — Lightweight, runs on CPU,
adequate for structured extraction and simple REDUCE tasks
- **FLAN-T5-large** (780M, ~3GB) — Better quality, still
fits in modest hardware, good for GIL insight extraction
- **FLAN-T5-xl** (3B, ~12GB) — High quality, GPU recommended
**Tier 2 — Instruction-Tuned LLMs (llama.cpp / Ollama):**
- **Qwen2.5 Instruct** (7B / 14B) — Excellent
instruction-following
- **LLaMA 3.x Instruct** (8B) — Strong reasoning
- **Mistral 7B Instruct** — Good balance of speed/quality
- **Phi-3 Mini Instruct** — Lighter-weight option
**Why Two Tiers?**
FLAN-T5 runs natively via `transformers` (PyTorch) with no
external dependencies (no llama.cpp, no Ollama). This makes
it the **lowest-friction entry point** for instruction-following
ML. Tier 2 models offer higher quality but require additional
infrastructure (llama.cpp compilation or Ollama installation).
For GIL extraction (RFC-049), FLAN-T5 provides a viable
pure-ML path where Tier 2 LLMs offer the premium path.
### Responsibilities
1. **Filter non-content** - Ignore ads, intros, outros, and meta-text
2. **Merge duplicate ideas** - Deduplicate across chunks
3. **Identify key insights** - Decisions, lessons, takeaways
4. **Produce structured output** - Predefined format with sections
### Why Instruction Models
These models:
- ✅ Are **trained to distinguish instructions from content**
- ✅ **Follow formatting constraints reliably**
- ✅ **Do not echo scaffolding text**
- ✅ Are **far more robust on conversational transcripts**
- ✅ Support **zero-shot task specification** via prompts
### Minimal REDUCE Prompt Contract
The REDUCE stage receives factual notes from MAP and outputs:
```markdown
Key takeaways:
- ...

Topic outline:
- ...

Action items (if any):
- ...
```

No other text is permitted.

Instruction template:

```text
Your task:
1. Ignore all ads, promotions, intros, outros, and meta-text
2. Merge duplicate ideas
3. Identify the most important insights, decisions, and lessons
4. Output ONLY the following sections (no additional text):

Key takeaways:
- [6-10 bullet points]

Topic outline:
- [6-12 bullet points covering main topics]

Action items (if any):
- [List any actionable items mentioned, or state "None"]

Input notes:
{map_outputs}
```
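A minimal sketch of how the REDUCE prompt might be assembled from MAP outputs; the `build_reduce_prompt` helper and the abbreviated template below are illustrative, not the project's actual code:

```python
# Illustrative only: abbreviated form of the instruction template above.
REDUCE_TEMPLATE = """Your task:
1. Ignore all ads, promotions, intros, outros, and meta-text
2. Merge duplicate ideas
3. Identify the most important insights, decisions, and lessons
4. Output ONLY the required sections (no additional text)

Input notes:
{map_outputs}
"""


def build_reduce_prompt(map_outputs: list[str]) -> str:
    """Join per-chunk MAP notes and fill the template slot."""
    return REDUCE_TEMPLATE.format(map_outputs="\n\n".join(map_outputs))
```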
---
## 8. Post-Processing
Post-processing becomes **simpler and more reliable**:
- ✅ Minimal cleanup (whitespace, bullet normalization)
- ✅ Optional section validation (bullet counts)
- ✅ Optional lightweight QA checks
Unlike the current system, **structure enforcement does not rely on fragile input hacks**.
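As a sketch of how minimal this stage can be, the cleanup described above fits in a few lines; the `post_process` helper below is illustrative, not the shipped implementation:

```python
import re


def post_process(summary: str) -> str:
    """Minimal cleanup: drop blank lines, trim whitespace, normalize bullets."""
    lines = []
    for raw in summary.splitlines():
        line = raw.strip()
        if not line:
            continue  # drop blank lines
        # Normalize '*', '•', '–' bullet markers to '- '
        line = re.sub(r"^[*\u2022\u2013]\s+", "- ", line)
        lines.append(line)
    return "\n".join(lines)
```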
---
## 9. Local Deployment Considerations
### MAP Phase
- **Hardware:** Can run on CPU or GPU
- **Execution:** Batched, parallelizable across chunks
- **Memory:** Memory-heavy but predictable (~2-4 GB per model)
### REDUCE Phase
- **Hardware:** GPU preferred, but CPU/Apple Silicon viable
- **Quantization:** 4-bit quantization recommended (reduces memory to ~4-8 GB)
- **Inference backend:**
- `transformers` (PyTorch) for GPU
- `llama.cpp` for CPU/Apple Silicon
- `ollama` as alternative local inference server (optional)
- **Input size:** Small (already compressed by MAP)
### Apple Silicon Support
For Mac laptops (primary target):
- **MAP:** Run LED/LongT5 via `transformers` on CPU (acceptable speed)
- **REDUCE:** Run Qwen2.5/Mistral via `llama.cpp` with Metal acceleration
- **Expected performance:** ~5-10 tokens/sec on M1/M2/M3 with 4-bit quantization
---
## 10. Expected Outcomes
By adopting this hybrid approach, we expect:
1. ✅ **Significant improvement in summary quality**
2. ✅ **Elimination of schema/scaffold leakage**
3. ✅ **Better abstraction and deduplication**
4. ✅ **Clear separation of responsibilities** in the codebase
5. ✅ **Easier future upgrades** (swap MAP or REDUCE models independently)
6. ✅ **Better user experience** (higher-quality, more actionable summaries)
---
## 11. Architecture & Implementation
### 11.1 New Module: `hybrid_ml_provider.py`
Create a new provider that implements the hybrid architecture:
```python
# src/podcast_scraper/providers/ml/hybrid_ml_provider.py
from podcast_scraper.config import Config


class HybridMLProvider:
    """Hybrid MAP-REDUCE summarization provider.

    Uses classic summarizers for MAP (compression) and instruction-tuned
    LLMs for REDUCE (abstraction + structuring).
    """

    def __init__(self, config: Config):
        self.config = config
        self.map_model = self._load_map_model()        # LED/LongT5
        self.reduce_model = self._load_reduce_model()  # FLAN-T5/Qwen/LLaMA/Mistral

    def _load_map_model(self):
        """Load classic summarizer for MAP phase."""
        # Load LED or LongT5 from transformers
        pass

    def _load_reduce_model(self):
        """Load instruction-tuned model for REDUCE phase."""
        # Load via transformers (GPU/MPS) or llama.cpp (CPU/Apple Silicon)
        pass

    def map_phase(self, chunks: list[str]) -> list[str]:
        """Compress chunks into bullet notes."""
        # Run classic summarizer on each chunk
        pass

    def reduce_phase(self, map_outputs: list[str]) -> str:
        """Produce final structured summary."""
        # Run instruction-tuned LLM with prompt template
        pass

    def summarize(self, transcript: str) -> str:
        """Full pipeline: chunk → MAP → REDUCE."""
        chunks = self._chunk_transcript(transcript)
        map_outputs = self.map_phase(chunks)
        final_summary = self.reduce_phase(map_outputs)
        return self._post_process(final_summary)
```
11.2 Inference Backend Abstraction¶
Create a backend abstraction for REDUCE model loading:
```python
# src/podcast_scraper/providers/ml/inference_backends.py
from typing import Protocol


class InferenceBackend(Protocol):
    """Protocol for local LLM inference."""

    def generate(self, prompt: str, max_tokens: int) -> str:
        """Generate text from prompt."""
        ...


class TransformersBackend:
    """PyTorch transformers backend (GPU)."""


class LlamaCppBackend:
    """llama.cpp backend (CPU/Apple Silicon)."""


class OllamaBackend:
    """Ollama backend (optional local inference server)."""
```
11.3 Model Registry¶
Note: The model registry infrastructure (`ModelCapabilities`, the `ModelRegistry` class, lookup/fallback logic) is defined in RFC-044. This section documents the model entries that RFC-042 populates into that registry.
Maintain a registry of all supported models across all capabilities:
```python
# src/podcast_scraper/providers/ml/model_registry.py

# ── MAP Phase (Classic Summarizers) ──────────────────
MAP_MODELS = {
    "led-base": "allenai/led-base-16384",
    "led-large": "allenai/led-large-16384",
    "longt5-base": "google/long-t5-tglobal-base",
    "longt5-large": "google/long-t5-tglobal-large",
}

# ── REDUCE Phase (Instruction-Tuned) ────────────────
# Tier 1: Seq2Seq (PyTorch-native, no llama.cpp)
REDUCE_MODELS_SEQ2SEQ = {
    "flan-t5-base": "google/flan-t5-base",
    "flan-t5-large": "google/flan-t5-large",
    "flan-t5-xl": "google/flan-t5-xl",
}

# Tier 2: LLMs (llama.cpp / Ollama)
REDUCE_MODELS_LLM = {
    "qwen2.5-7b": "Qwen/Qwen2.5-7B-Instruct",
    "qwen2.5-14b": "Qwen/Qwen2.5-14B-Instruct",
    "llama3-8b": "meta-llama/Llama-3-8B-Instruct",
    "mistral-7b": "mistralai/Mistral-7B-Instruct-v0.3",
    "phi3-mini": "microsoft/Phi-3-mini-4k-instruct",
}

# Combined for backwards compatibility
REDUCE_MODELS = {
    **REDUCE_MODELS_SEQ2SEQ,
    **REDUCE_MODELS_LLM,
}

# ── Embedding Models (Sentence-Transformers) ────────
EMBEDDING_MODELS = {
    "minilm-l6": "sentence-transformers/all-MiniLM-L6-v2",
    "minilm-l12": "sentence-transformers/all-MiniLM-L12-v2",
    "mpnet-base": "sentence-transformers/all-mpnet-base-v2",
}

# ── Extractive QA Models ────────────────────────────
EXTRACTIVE_QA_MODELS = {
    "roberta-squad2": "deepset/roberta-base-squad2",
    "deberta-squad2": "deepset/deberta-v3-base-squad2",
}

# ── NLI / Cross-Encoder Models ──────────────────────
NLI_MODELS = {
    "nli-deberta-base": "cross-encoder/nli-deberta-v3-base",
    "nli-deberta-small": "cross-encoder/nli-deberta-v3-small",
}
```
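A registry lookup can then be a simple key-to-identifier resolution. The `resolve_model` helper below is a hypothetical sketch over trimmed-down tables; RFC-044 defines the real `ModelRegistry` with capability metadata and fallback logic:

```python
# Hypothetical sketch: abbreviated registry entries from section 11.3.
MAP_MODELS = {"led-base": "allenai/led-base-16384"}
REDUCE_MODELS = {"flan-t5-large": "google/flan-t5-large"}


def resolve_model(key: str) -> str:
    """Resolve a short registry key to a Hugging Face model identifier."""
    for registry in (MAP_MODELS, REDUCE_MODELS):
        if key in registry:
            return registry[key]
    raise KeyError(f"Unknown model key: {key}")
```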
11.4 Extended Model Ecosystem¶
Beyond MAP and REDUCE models for summarization, the hybrid ML platform provides additional model families that enable downstream features (notably RFC-049 Grounded Insight Layer).
11.4.1 FLAN-T5: Lightweight Instruction-Tuned Models¶
Why FLAN-T5 specifically?
FLAN-T5 occupies a unique position in the model landscape:
| Property | FLAN-T5 | Classic (BART/LED) | LLMs (Qwen/LLaMA) |
|---|---|---|---|
| Instruction-following | ✅ Yes | ❌ No | ✅ Yes |
| Runs via `transformers` | ✅ Native | ✅ Native | ❌ Needs llama.cpp |
| Structured output (JSON) | ⚠️ Decent | ❌ No | ✅ Good |
| Memory (base) | ~1 GB | ~0.5-2 GB | ~4-8 GB (4-bit) |
| External dependencies | None | None | llama.cpp or Ollama |
| Zero-shot capability | ✅ Good | ❌ No | ✅ Excellent |
FLAN-T5 is the lowest-friction instruction-following model:
no llama.cpp compilation, no Ollama daemon, runs on CPU or MPS
via standard transformers. This makes it the default
"just works" option for structured extraction.
Use Cases for FLAN-T5:
- REDUCE phase for hybrid summarization (alternative to Tier 2 LLMs)
- Insight extraction for RFC-049 GIL (structured output from transcript chunks)
- Topic labeling — zero-shot classification of transcript segments
- General instruction-following — any task requiring "do X with this text" locally
Loading Pattern:
```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-large")
tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-large")

# Instruction-style prompt
prompt = (
    "Extract the 5 key insights from this text. "
    "Output as a JSON list.\n\n"
    f"Text: {transcript_chunk}"
)

inputs = tokenizer(prompt, return_tensors="pt", max_length=512, truncation=True)
outputs = model.generate(**inputs, max_new_tokens=256)
result = tokenizer.decode(outputs[0], skip_special_tokens=True)
```
11.4.2 Sentence-Transformers: Embedding Models¶
Sentence-transformers produce dense vector embeddings for text, enabling semantic similarity, clustering, and retrieval.
Primary Model: all-MiniLM-L6-v2 (~90 MB, very fast)
Use Cases:
- Grounding validation (RFC-049): Score semantic similarity between an Insight and a candidate Quote to validate SUPPORTED_BY edges
- Topic clustering: Group related insights or quotes by semantic similarity
- Deduplication: Detect near-duplicate insights across episodes (for global graph assembly)
- Semantic search: Find relevant quotes given a natural-language query (RFC-050 use cases)
Loading Pattern:
```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

# Encode texts
insight_emb = model.encode("AI regulation will lag innovation")
quote_emb = model.encode("Regulation will lag innovation by 3-5 years")

# Cosine similarity
score = cos_sim(insight_emb, quote_emb)
# score ≈ 0.92 → strong support
```
Memory Footprint:
| Model | Size | Speed | Quality |
|---|---|---|---|
| `all-MiniLM-L6-v2` | 90 MB | Very fast | Good |
| `all-MiniLM-L12-v2` | 120 MB | Fast | Better |
| `all-mpnet-base-v2` | 420 MB | Moderate | Best |
Default: all-MiniLM-L6-v2 (best size/quality tradeoff).
11.4.3 Extractive QA Models¶
Extractive QA models find verbatim text spans in a source document that answer a given question. This is critical for RFC-049 Quote extraction.
Why extractive QA is better than LLMs for quotes:
- Returns exact character offsets (`start`, `end`)
- Guarantees the answer IS in the source text (extractive by design — cannot hallucinate)
- Fast, deterministic, small (~500 MB)
- Directly satisfies the GIL grounding contract: `Quote.text == transcript[char_start:char_end]`
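The grounding contract is mechanically checkable. A minimal sketch (the `is_grounded` helper and its field names are hypothetical, following the contract above):

```python
def is_grounded(quote: dict, transcript: str) -> bool:
    """True iff the quote text is the exact transcript span at its offsets."""
    return transcript[quote["char_start"]:quote["char_end"]] == quote["text"]
```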
Primary Model: deepset/roberta-base-squad2 (~500 MB)
Loading Pattern:
```python
from transformers import pipeline

qa = pipeline(
    "question-answering",
    model="deepset/roberta-base-squad2",
)
result = qa(
    question="What evidence supports: AI regulation will lag innovation?",
    context=transcript_text,
)
# result = {
#     "answer": "Regulation will lag innovation by 3-5 years",
#     "start": 4521,
#     "end": 4567,
#     "score": 0.89,
# }
```
Multi-Quote Extraction Pattern:
For each insight, extract multiple supporting quotes by running QA with different question formulations or by using top-k answers:
```python
def extract_supporting_quotes(
    insight: str,
    transcript: str,
    qa_pipeline,
    top_k: int = 3,
) -> list[dict]:
    """Find top-k verbatim spans supporting insight."""
    questions = [
        f"What evidence supports: {insight}?",
        f"What was said about: {insight}?",
        f"Quote supporting: {insight}",
    ]
    candidates = []
    for q in questions:
        result = qa_pipeline(
            question=q,
            context=transcript,
            top_k=top_k,
        )
        if isinstance(result, list):
            candidates.extend(result)
        else:
            candidates.append(result)
    # Deduplicate by span overlap, sort by score
    return _deduplicate_spans(candidates)[:top_k]
```
11.4.4 NLI Cross-Encoder Models¶
Natural Language Inference (NLI) models classify whether a premise (Quote) supports, contradicts, or is neutral to a hypothesis (Insight). Used for rigorous grounding validation in RFC-049.
Primary Model: cross-encoder/nli-deberta-v3-base (~400 MB)
Use Case: Validate SUPPORTED_BY edges — given an Insight and a candidate Quote, determine if the quote genuinely supports the insight.
Loading Pattern:
```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "cross-encoder/nli-deberta-v3-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

premise = "Regulation will lag innovation by 3-5 years"
hypothesis = "AI regulation will lag innovation"

inputs = tokenizer(premise, hypothesis, return_tensors="pt", truncation=True)
with torch.no_grad():
    logits = model(**inputs).logits

# Labels: 0=contradiction, 1=neutral, 2=entailment
probs = torch.softmax(logits, dim=-1)
entailment_score = probs[0][2].item()
# entailment_score ≈ 0.91 → strong support
```
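Turning class probabilities into a SUPPORTED_BY decision can be a simple thresholding step. The threshold value and verdict labels below are illustrative assumptions, not values fixed by this RFC:

```python
def grounding_decision(
    entailment: float,
    contradiction: float,
    threshold: float = 0.7,  # assumed value, to be tuned empirically
) -> str:
    """Map NLI probabilities to a grounding verdict for an (insight, quote) pair."""
    if entailment >= threshold:
        return "SUPPORTED"
    if contradiction >= threshold:
        return "CONTRADICTED"
    return "UNGROUNDED"
```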
11.4.5 Model Memory Budget (All Models Combined)¶
Total memory footprint for all model families at recommended defaults:
| Model Family | Model | Size | Purpose |
|---|---|---|---|
| MAP | LED-base | ~1 GB | Compression |
| REDUCE | FLAN-T5-large | ~3 GB | Structured extraction |
| Embedding | MiniLM-L6 | ~90 MB | Similarity |
| Extractive QA | RoBERTa-SQuAD2 | ~500 MB | Quote spans |
| NLI | DeBERTa-NLI | ~400 MB | Grounding |
| **Total** |  | ~5 GB | Full GIL-capable |
With Tier 2 LLM (Qwen2.5-7B, 4-bit) instead of FLAN-T5:
| Model Family | Model | Size | Purpose |
|---|---|---|---|
| MAP | LED-base | ~1 GB | Compression |
| REDUCE | Qwen2.5-7B (4-bit) | ~4 GB | Premium extraction |
| Embedding | MiniLM-L6 | ~90 MB | Similarity |
| Extractive QA | RoBERTa-SQuAD2 | ~500 MB | Quote spans |
| NLI | DeBERTa-NLI | ~400 MB | Grounding |
| **Total** |  | ~6 GB | Premium GIL |
Both configurations fit comfortably on a 16 GB Mac laptop with room for the OS and other processes.
11.5 Structured Extraction Capability¶
Beyond summarization, the hybrid architecture generalizes to any task that follows the pattern:
Input text → MAP (compress/chunk) → REDUCE (reason/structure)
Structured extraction is a key application where the REDUCE model outputs JSON instead of prose. This capability is the foundation for RFC-049 Grounded Insight Layer.
11.5.1 Extraction Protocol¶
Define a generic extraction protocol that any model can implement:
```python
# src/podcast_scraper/providers/ml/extraction.py
from typing import Protocol


class StructuredExtractor(Protocol):
    """Protocol for structured extraction from text."""

    def extract(
        self,
        text: str,
        schema: dict,
        prompt_template: str | None = None,
    ) -> dict:
        """Extract structured data from text.

        Args:
            text: Input text (transcript chunk or full)
            schema: Expected output JSON schema
            prompt_template: Jinja2 template name

        Returns:
            Extracted data conforming to schema
        """
        ...
```
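Because Tier 1 models emit JSON as plain generated text, the extractor needs a parse-and-validate step before the result can conform to a schema. A minimal stdlib sketch; the `parse_extraction` helper and its fence-stripping behavior are assumptions, not the project's code:

```python
import json


def parse_extraction(raw: str, required_key: str = "insights") -> dict:
    """Parse REDUCE output as JSON and check the expected top-level key."""
    text = raw.strip()
    # Some models wrap JSON in a Markdown code fence; strip it first.
    if text.startswith("```"):
        text = text.strip("`").removeprefix("json").strip()
    data = json.loads(text)
    if required_key not in data:
        raise ValueError(f"Missing key: {required_key!r}")
    return data
```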
11.5.2 Extraction Tasks (v1)¶
The following extraction tasks are enabled by the hybrid platform:
| Task | MAP Model | REDUCE Model | Output |
|---|---|---|---|
| Summarization | LED/LongT5 | FLAN-T5/Qwen | Prose summary |
| Insight extraction | LED/LongT5 | FLAN-T5/Qwen | JSON: insights[] |
| Quote extraction | — (direct) | Extractive QA | JSON: quotes[] |
| Topic extraction | — (direct) | FLAN-T5/KeyBERT | JSON: topics[] |
| Grounding validation | — (direct) | NLI model | JSON: scores[] |
Key Observation: Quote extraction uses extractive QA (not the MAP-REDUCE pattern) because verbatim span extraction is fundamentally different from abstraction. The extractive QA model finds the exact span in the source text — no compression or rephrasing needed.
11.5.3 GIL Extraction Pipeline¶
When all models are available, the GIL extraction pipeline combines them:
```text
Transcript
  │
  ├─► MAP (LED/LongT5) → compressed notes
  │     └─► REDUCE (FLAN-T5/LLM) → insights[]
  │
  ├─► For each insight:
  │     └─► Extractive QA → supporting quotes[]
  │           └─► Each quote has char_start, char_end
  │
  ├─► NLI Validation:
  │     └─► For each (insight, quote) pair:
  │           └─► entailment score → SUPPORTED_BY edge
  │
  ├─► Topic Extraction:
  │     └─► spaCy NER + FLAN-T5 labeling → topics[]
  │
  └─► Assembly → gi.json (RFC-049 schema)
```
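The final assembly step could look roughly like this stdlib sketch; the `assemble_gi` helper and its field names are illustrative only, since the real `gi.json` schema is defined in RFC-049:

```python
import json


def assemble_gi(insights, quotes_by_insight, scores):
    """Combine extraction results into a gi.json-like document (field names assumed)."""
    doc = {"insights": []}
    for i, text in enumerate(insights):
        doc["insights"].append({
            "text": text,
            "quotes": quotes_by_insight.get(i, []),
            "entailment": scores.get(i),
        })
    return json.dumps(doc, indent=2)
```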
11.5.4 Prompt Templates for Extraction¶
Extraction prompts follow the existing `prompt_store` pattern (RFC-017):
```text
src/podcast_scraper/prompts/
  hybrid_ml/
    summarization/
      reduce_v1.j2     # Existing: prose summary
    extraction/
      insight_v1.j2    # NEW: extract insights
      topic_v1.j2      # NEW: extract topics
```
Example: insight_v1.j2:
```text
Extract the key insights from these notes.
Each insight should be a clear, standalone takeaway.

Output ONLY valid JSON in this format:
{
  "insights": [
    {
      "text": "...",
      "confidence": 0.0-1.0
    }
  ]
}

Notes:
{{ map_outputs }}
```
12. Configuration Schema¶
12.1 New Config Fields¶
Add to Config model in src/podcast_scraper/config.py:
```python
class Config(BaseModel):
    # ... existing fields ...

    # ── Summarization Provider ──────────────────
    summarization_provider: Literal[
        "transformers",  # Classic-only
        "hybrid_ml",     # Hybrid MAP-REDUCE
        "openai",
        "anthropic",
        # ... other providers
    ] = "transformers"

    # ── Hybrid Provider Config ──────────────────
    hybrid_map_model: str = "led-base"
    hybrid_reduce_model: str = "flan-t5-large"
    hybrid_reduce_backend: Literal[
        "transformers", "llama_cpp", "ollama"
    ] = "transformers"  # Default to PyTorch
    hybrid_map_device: Literal["cpu", "cuda", "mps"] = "cpu"
    hybrid_reduce_device: Literal["cpu", "cuda", "mps"] = "mps"
    hybrid_quantization: Literal[
        "none", "4bit", "8bit"
    ] = "none"  # FLAN-T5 doesn't need quantization

    # ── Embedding Model Config ──────────────────
    embedding_model: str = "minilm-l6"
    embedding_device: Literal["cpu", "cuda", "mps"] = "cpu"

    # ── Extractive QA Config ────────────────────
    extractive_qa_model: str = "roberta-squad2"
    extractive_qa_device: Literal["cpu", "cuda", "mps"] = "cpu"

    # ── NLI Grounding Config ────────────────────
    nli_model: str = "nli-deberta-base"
    nli_device: Literal["cpu", "cuda", "mps"] = "cpu"
```
12.2 Example: FLAN-T5 Config (Mac, Pure PyTorch)¶
```yaml
# config.hybrid.flant5.yaml
# Lowest-friction option: no llama.cpp, no Ollama
summarization_provider: hybrid_ml
hybrid_map_model: led-base
hybrid_reduce_model: flan-t5-large
hybrid_reduce_backend: transformers  # PyTorch native
hybrid_map_device: cpu
hybrid_reduce_device: mps  # Metal for Apple Silicon
hybrid_quantization: none  # FLAN-T5 fits without it

# Embedding + QA for GIL support
embedding_model: minilm-l6
extractive_qa_model: roberta-squad2
nli_model: nli-deberta-base
```
12.3 Example: LLM Config (Mac, llama.cpp)¶
```yaml
# config.hybrid.llm.yaml
# Premium quality, requires llama.cpp
summarization_provider: hybrid_ml
hybrid_map_model: led-base
hybrid_reduce_model: qwen2.5-7b
hybrid_reduce_backend: llama_cpp
hybrid_map_device: cpu
hybrid_reduce_device: mps
hybrid_quantization: 4bit

embedding_model: minilm-l6
extractive_qa_model: roberta-squad2
nli_model: nli-deberta-base
```
12.4 Example: NVIDIA GPU Config¶
```yaml
# config.hybrid.gpu.yaml
summarization_provider: hybrid_ml
hybrid_map_model: led-large
hybrid_reduce_model: qwen2.5-14b
hybrid_reduce_backend: transformers
hybrid_map_device: cuda
hybrid_reduce_device: cuda
hybrid_quantization: 4bit

embedding_model: mpnet-base
extractive_qa_model: deberta-squad2
nli_model: nli-deberta-base
```
13. Provider Integration¶
13.1 Factory Function¶
Update src/podcast_scraper/summarization/factory.py:
```python
def create_summarization_provider(config: Config) -> SummarizationProvider:
    """Factory for summarization providers."""
    provider_type = config.summarization_provider
    if provider_type == "transformers":
        from podcast_scraper.providers.ml.summarizer import TransformersSummarizer

        return TransformersSummarizer(config)
    elif provider_type == "hybrid_ml":  # NEW
        from podcast_scraper.providers.ml.hybrid_ml_provider import HybridMLProvider

        return HybridMLProvider(config)
    elif provider_type == "openai":
        from podcast_scraper.providers.openai.openai_provider import OpenAIProvider

        return OpenAIProvider(config)
    # ... other providers
    else:
        raise ValueError(f"Unknown summarization provider: {provider_type}")
```
### 13.2 Backwards Compatibility
- ✅ Existing configs with `summarization_provider: transformers`
continue to work
- ✅ No breaking changes to existing code
- ✅ New provider is **opt-in via configuration**
- ✅ Extended models (embedding, QA, NLI) are loaded lazily —
only when a feature requires them
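Lazy loading can be expressed with `functools.cached_property`: the loader runs only on first attribute access, so the extended models cost nothing when unused. The `LazyModels` class below is a generic sketch of the pattern, not the project's actual loader:

```python
from functools import cached_property


class LazyModels:
    """Sketch: a model loads only when its attribute is first accessed."""

    def __init__(self, loader):
        self._loader = loader
        self.load_count = 0  # instrumentation for the sketch

    @cached_property
    def model(self):
        # Runs at most once; the result is cached on the instance.
        self.load_count += 1
        return self._loader()
```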
---
## 13A. Foundation for RFC-049 (Grounded Insight Layer)
This section explicitly documents how RFC-042 enables
RFC-049, establishing a clear dependency chain.
### What RFC-042 Provides to RFC-049
| RFC-042 Deliverable | RFC-049 Consumer |
| --- | --- |
| FLAN-T5 model loading + inference | Insight extraction (structured JSON) |
| Sentence-transformers loading | Grounding validation (similarity scoring) |
| Extractive QA pipeline | Quote extraction (verbatim spans) |
| NLI cross-encoder | SUPPORTED_BY edge validation |
| Model registry | Central model management |
| Inference backend abstraction | Unified model loading across tasks |
| Prompt template integration | GIL extraction prompts |
| MAP-REDUCE pattern | Insight extraction on long transcripts |
### What RFC-049 Handles Independently
These capabilities are GIL-specific and do NOT belong
in RFC-042:
- **GIL extraction orchestration** — coordinating insight,
quote, and topic extraction into a single pipeline
- **`gi.json` assembly** — combining extracted data into
the GIL schema
- **Grounding contract enforcement** — ensuring every
insight has explicit grounding status
- **Schema validation** — validating outputs against
`gi.schema.json`
- **Workflow integration** — adding GIL as a pipeline
stage in `orchestration.py`
### Dependency Chain
```text
RFC-042 (ML Platform)
 ├─ FLAN-T5 models loaded ─────────────► RFC-049 uses for insight extraction
 ├─ Sentence-transformers loaded ──────► RFC-049 uses for grounding
 ├─ Extractive QA pipeline ready ──────► RFC-049 uses for quote extraction
 ├─ NLI model loaded ──────────────────► RFC-049 uses for validation
 ├─ Model registry populated ──────────► RFC-049 references models by key
 └─ Structured extraction protocol ────► RFC-049 implements for GIL
```
Provider-Tier Mapping for GIL¶
RFC-049 defines three extraction tiers. RFC-042 enables Tiers 1 and 2:
| Tier | Insight Source | Quote Source | Grounding |
|---|---|---|---|
| Tier 1: ML | FLAN-T5 (RFC-042) | Extractive QA (RFC-042) | Sentence similarity (RFC-042) |
| Tier 2: Hybrid | MAP+LLM (RFC-042) | Extractive QA (RFC-042) | NLI + LLM verify (RFC-042) |
| Tier 3: Cloud LLM | API provider | Extractive QA (RFC-042) | NLI validation (RFC-042) |
Critical note: Extractive QA for quotes is used in all tiers — even Tier 3 (cloud LLM). This is because extractive QA guarantees verbatim spans, which LLMs cannot. The grounding contract demands it.
14. Migration Plan¶
Prerequisite: RFC-044 (Model Registry)¶
RFC-044 must be implemented first (~2-3 weeks). It provides the `ModelCapabilities`, `ModelRegistry`, and lookup/fallback infrastructure that this RFC depends on.
Phase 1: Infrastructure (Week 1-2)¶
- Create `hybrid_ml_provider.py` module
- Implement inference backend abstraction
- Populate model registry (all model families via RFC-044)
- Update configuration schema
- Add factory integration
Phase 2: MAP Implementation (Week 3)¶
- Implement MAP phase with LED/LongT5
- Preserve existing chunking logic
- Add unit tests for MAP compression
Phase 3: REDUCE Implementation (Week 4-5)¶
- Implement REDUCE with FLAN-T5 (Tier 1, PyTorch)
- Implement REDUCE with llama.cpp (Tier 2, LLMs)
- Add prompt templates (summarization + extraction)
- Add unit tests for REDUCE abstraction
Phase 4: Extended Models (Week 6-7)¶
- Implement sentence-transformer embedding loader
- Implement extractive QA pipeline
- Implement NLI cross-encoder loader
- Add structured extraction protocol
- Add unit tests for each model family
Phase 5: Integration & Testing (Week 8)¶
- End-to-end integration tests (summarization)
- End-to-end integration tests (structured extraction)
- Manual testing on Mac laptop
- Documentation updates
Phase 6: Validation (Week 9-10)¶
- Run summarization on representative episodes
- Run structured extraction on representative episodes
- Compare quality across Tier 1 vs Tier 2
- Iterate on prompt templates if needed
15. Alternatives Considered¶
15.1 Pure Classic Summarization (BART / LED / PEGASUS only)¶
Description: Use classic summarization models for both MAP and REDUCE stages (current approach).
Pros:
- ✅ Simple architecture
- ✅ Fully local, PyTorch-native
- ✅ Lower operational complexity
Cons (Observed):
- ❌ Strong extractive bias on transcripts
- ❌ Echoing of schema and scaffolding text
- ❌ Poor adherence to output structure
- ❌ Inability to reliably ignore ads, intros, or meta-text
Decision: ❌ Rejected due to persistent quality failures despite extensive tuning.
15.2 Pure Instruction-Tuned LLM Summarization¶
Description: Skip MAP stage and summarize full transcript directly with an instruction-tuned LLM.
Pros:
- ✅ Best abstraction and instruction-following
- ✅ Clean structured output
Cons:
- ❌ Long-context requirements are expensive (30–90 min ≈ 40k+ tokens)
- ❌ Higher memory and latency costs
- ❌ Less efficient for iterative experimentation
Decision: ❌ Rejected for cost/performance reasons on long (30–90 min) transcripts.
15.3 Fine-Tuning a Classic Summarizer¶
Description: Fine-tune BART/LED/PEGASUS on podcast transcripts with custom targets.
Pros:
- ✅ Potentially better domain adaptation
Cons:
- ❌ Requires large curated dataset (hundreds of transcript-summary pairs)
- ❌ High training cost (GPU time, expertise)
- ❌ Still limited instruction-following ability
Decision: ❌ Out of scope for now (may revisit in future).
15.4 Cloud API for REDUCE (OpenAI/Anthropic)¶
Description: Use OpenAI GPT-4 or Anthropic Claude for REDUCE phase.
Pros:
- ✅ Best instruction-following quality
- ✅ No local infrastructure needed
Cons:
- ❌ Not fully local (requires API calls)
- ❌ Ongoing costs per episode
- ❌ Privacy concerns for sensitive podcasts
Decision: ❌ Rejected for v2.5 (goal is fully local). May offer as alternative provider later.
16. Model Selection Matrix (Local Deployment)¶
16.1 Summarization Models¶
| Hardware | MAP | REDUCE (Tier 1) | REDUCE (Tier 2) |
|---|---|---|---|
| CPU only | LED-base | FLAN-T5-base | Phi-3 Mini (llama.cpp) |
| Apple Silicon | LED-base | FLAN-T5-large (MPS) | Qwen2.5 7B (llama.cpp) |
| GPU 8-12 GB | LED-base | FLAN-T5-large | Mistral 7B (4-bit) |
| GPU 16 GB+ | LED-large | FLAN-T5-xl | Qwen2.5 14B (4-bit) |
16.2 Extended Models (GIL Support)¶
| Hardware | Embedding | Extractive QA | NLI |
|---|---|---|---|
| CPU only | MiniLM-L6 | RoBERTa-SQuAD2 | DeBERTa-NLI-small |
| Apple Silicon | MiniLM-L6 | RoBERTa-SQuAD2 | DeBERTa-NLI-base |
| GPU 8-12 GB | MiniLM-L12 | DeBERTa-SQuAD2 | DeBERTa-NLI-base |
| GPU 16 GB+ | MPNet-base | DeBERTa-SQuAD2 | DeBERTa-NLI-base |
Recommended Defaults¶
- Mac Laptop (v2.5 focus):
  - MAP: `led-base` on CPU
  - REDUCE: `flan-t5-large` on MPS (zero-friction)
  - Embedding: `minilm-l6` on CPU
  - Extractive QA: `roberta-squad2` on CPU
  - NLI: `nli-deberta-base` on CPU
- Mac Laptop (premium):
  - REDUCE: `qwen2.5-7b` via llama.cpp + Metal
  - All other models same as above
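The selection matrix above can be read as a simple hardware-to-model lookup. The sketch below is illustrative only; the tier keys and model identifiers are shorthand, not actual registry keys:

```python
# Illustrative mapping of the selection matrix to a lookup helper.
# Tier names and model keys are shorthand, not real registry identifiers.
DEFAULTS = {
    "cpu":   {"map": "led-base",  "reduce_t1": "flan-t5-base",  "reduce_t2": "phi-3-mini"},
    "mps":   {"map": "led-base",  "reduce_t1": "flan-t5-large", "reduce_t2": "qwen2.5-7b"},
    "gpu8":  {"map": "led-base",  "reduce_t1": "flan-t5-large", "reduce_t2": "mistral-7b-q4"},
    "gpu16": {"map": "led-large", "reduce_t1": "flan-t5-xl",    "reduce_t2": "qwen2.5-14b-q4"},
}

def recommended_models(hardware: str) -> dict:
    """Return recommended MAP/REDUCE models for a hardware tier,
    falling back to CPU-safe defaults for unknown hardware."""
    return DEFAULTS.get(hardware, DEFAULTS["cpu"])
```

Falling back to the CPU row keeps unknown hardware on the most conservative (lowest-memory) configuration.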
17. Benchmarking & Validation¶
17.1 Qualitative Validation¶
Run both providers (classic vs hybrid) on 3 representative podcast episodes and compare:
- Abstraction quality - Is the summary genuinely abstractive or extractive?
- Duplication rate - Are ideas repeated across chunks?
- Structural correctness - Does output follow the schema?
- Content filtering - Are ads/intros/outros properly ignored?
- Readability - Is the summary coherent and useful?
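One way to make the "duplication rate" criterion measurable is an n-gram overlap heuristic across per-chunk summaries. This is a hedged sketch, not a metric defined by this RFC:

```python
# Illustrative duplication-rate heuristic (not defined by the RFC):
# fraction of n-grams in each chunk summary already seen in earlier ones.
def ngram_set(text: str, n: int = 3) -> set:
    toks = text.lower().split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def duplication_rate(chunk_summaries: list[str], n: int = 3) -> float:
    """Rough proxy for ideas repeated across chunks: 0.0 = no overlap."""
    seen: set = set()
    dup = total = 0
    for summary in chunk_summaries:
        grams = ngram_set(summary, n)
        dup += len(grams & seen)
        total += len(grams)
        seen |= grams
    return dup / total if total else 0.0
```

A high score across chunk summaries would flag the chunk-echoing failure mode that motivated the hybrid design.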
17.2 Success Criteria¶
The hybrid provider is considered successful if:
- ✅ Summaries are more abstractive than classic-only approach
- ✅ No scaffold/schema leakage in output
- ✅ Consistent structure (3 sections: takeaways, outline, actions)
- ✅ Better content filtering (no ads/intros in summary)
- ✅ Runs locally on Mac laptop with acceptable performance (< 10 min per episode)
17.3 Out of Scope (Phase 1)¶
- ❌ Quantitative metrics (ROUGE, BLEU, BERTScore) - defer to RFC-041
- ❌ Cost/performance benchmarks - defer to RFC-041
- ❌ User studies - defer to future work
Rationale: Keep Phase 1 focused on qualitative validation. Quantitative benchmarking can leverage RFC-041 infrastructure once implemented.
18. Resolved Questions¶
All design questions have been resolved. Decisions are recorded here for traceability.
- **Which instruction model provides the best quality/performance tradeoff on Mac laptops?** FLAN-T5-large for zero-friction; Qwen2.5-7B for premium quality. Start with FLAN-T5-large (Tier 1) as the default — it runs natively via PyTorch with no external dependencies. Upgrade to Qwen2.5-7B (Tier 2) when Ollama is available and the user wants higher quality. Empirical A/B testing during Phase 6 (Validation) will confirm the tradeoff.
- **Should REDUCE support streaming output?** No (v1). Streaming adds complexity for marginal UX benefit in a batch pipeline. Summaries and extractions are written to files, not displayed live. Revisit if an interactive UI is built.
- **Should MAP outputs be cached for iterative tuning?** Yes. Cache MAP outputs per (episode, model, chunk_params) tuple to avoid redundant compression during REDUCE prompt iteration. Store in the episode output directory as `map_cache.json`. The cache is invalidated when the MAP model or chunk params change.
- **How to handle llama.cpp installation/dependencies?** Optional dependency. Package `llama-cpp-python` as an optional install group: `pip install podcast-scraper[llama]`. If it is not installed, Tier 2 LLMs fall back to the Ollama backend (RFC-052). Users with Ollama installed need no llama.cpp at all.
- **Should we support Ollama as an alternative backend?** Yes. RFC-052 defines the Ollama prompt layer. The `OllamaBackend` in §11.2 is a first-class inference backend alongside `TransformersBackend` and `LlamaCppBackend`. The config field `hybrid_reduce_backend: "ollama"` selects it.
- **FLAN-T5 structured output quality: is JSON output from FLAN-T5-large reliable enough for GIL?** Reliable enough with validation + fallback. FLAN-T5-large produces valid JSON ≥80% of the time. Strategy: (1) attempt a JSON parse, (2) if invalid, apply heuristic repair (fix trailing commas, add brackets), (3) if still invalid, parse as key-value text (like the existing `SummarySchema` parsing). Log the parse success rate for monitoring.
- **Should extractive QA run on the full transcript or on pre-segmented chunks?** Pre-segmented chunks (sliding window). RoBERTa has a 512-token context window. Strategy: split the transcript into 448-token windows with 64-token overlap, run QA on each window, and merge results by deduplicating overlapping spans. This is the standard approach and aligns with the registry's `max_input_tokens=512` for QA models (RFC-044).
- **Lazy loading vs eager loading for extended models?** Lazy loading. Load embedding/QA/NLI models only when a feature requests them (e.g., `generate_gi: true` triggers QA + NLI loading). This keeps memory low for summarization-only runs. Implementation: use a `@cached_property` or explicit `_load_if_needed()` pattern. Each model loader checks `ModelRegistry.get_capabilities()` before loading.
- **Should `sentence-transformers` be a required or optional dependency?** Optional dependency. Package it in a separate install group: `pip install podcast-scraper[gil]` (includes sentence-transformers, extractive QA, and NLI models). Basic summarization works without it. The import is guarded with `try/except ImportError` and a helpful message pointing to the install group.
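The three-step JSON fallback decided above might be sketched as follows. The repair heuristics here are illustrative stand-ins, not the project's actual implementation:

```python
import json
import re

def parse_structured_output(raw: str) -> dict:
    """Illustrative three-step fallback for model output:
    (1) strict JSON parse, (2) heuristic repair, (3) key-value text."""
    try:                                              # (1) strict parse
        return json.loads(raw)
    except json.JSONDecodeError:
        pass
    repaired = re.sub(r",\s*([}\]])", r"\1", raw)     # (2) drop trailing commas
    if not repaired.rstrip().endswith("}"):
        repaired = repaired.rstrip() + "}"            # naive bracket close
    try:
        return json.loads(repaired)
    except json.JSONDecodeError:
        pass
    result = {}                                       # (3) key: value lines
    for line in raw.splitlines():
        if ":" in line:
            key, _, value = line.partition(":")
            result[key.strip()] = value.strip()
    return result
```

Logging which branch fired gives exactly the parse-success-rate signal the decision calls for.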
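The 448/64 sliding-window strategy for extractive QA can be sketched in a few lines. This is a minimal illustration of the windowing arithmetic, not the project's chunker:

```python
# Illustrative sliding-window splitter (not the project's chunker):
# 448-token windows with 64-token overlap fit RoBERTa's 512-token
# context while leaving headroom for the question tokens.
def sliding_windows(tokens: list[str], size: int = 448, overlap: int = 64) -> list[list[str]]:
    """Split a token sequence into overlapping windows."""
    if size <= overlap:
        raise ValueError("window size must exceed overlap")
    step = size - overlap
    return [tokens[start:start + size]
            for start in range(0, max(len(tokens) - overlap, 1), step)]
```

Because consecutive windows share 64 tokens, a QA span that straddles a window boundary is still fully contained in one of the two windows, and duplicate hits in the overlap region are merged downstream.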
19. Timeline¶
Target: v2.5 Release¶
| Phase | Duration | Deliverables |
|---|---|---|
| Phase 1: Infrastructure | 2 weeks | Module structure, config, backends, registry |
| Phase 2: MAP | 1 week | LED/LongT5 compression |
| Phase 3: REDUCE | 2 weeks | FLAN-T5 + llama.cpp integration |
| Phase 4: Extended Models | 2 weeks | Embedding, QA, NLI pipelines |
| Phase 5: Integration | 1 week | E2E tests, Mac testing |
| Phase 6: Validation | 2 weeks | Episode comparisons, iteration |
| Total | 10 weeks | Full ML platform for v2.5 |
20. Success Criteria¶
The hybrid ML platform is ready for v2.5 when:
Functional Requirements¶
- ✅ Hybrid provider available via `summarization_provider: hybrid_ml`
- ✅ FLAN-T5 works as REDUCE model (Tier 1, PyTorch)
- ✅ Tier 2 LLMs work via llama.cpp backend
- ✅ Produces structured summaries (takeaways, outline, actions)
- ✅ Existing classic provider continues to work
Extended Model Requirements¶
- ✅ Sentence-transformer embeddings load and produce vectors
- ✅ Extractive QA pipeline returns verbatim spans with character offsets
- ✅ NLI cross-encoder classifies entailment/neutral/contradiction
- ✅ Structured extraction protocol implemented and tested
- ✅ Model registry contains all model families
Quality Requirements¶
- ✅ Summaries more abstractive than classic-only approach
- ✅ No scaffold/schema text leaks into output
- ✅ FLAN-T5 produces valid structured JSON output ≥80% of the time
- ✅ Extractive QA returns correct spans (validates against transcript source)
Performance Requirements¶
- ✅ Full summarization pipeline < 10 min per episode
- ✅ Full model suite memory < 16 GB on Mac laptop
- ✅ Extended models (embedding, QA, NLI) load in < 30s each
Documentation Requirements¶
- ✅ Configuration guide for hybrid provider
- ✅ Model selection recommendations (Tier 1 vs Tier 2)
- ✅ Hardware requirements documented
- ✅ Extended model usage guide
- ✅ GIL readiness checklist (what RFC-049 needs)
Conclusion¶
The observed summarization failures stem from a mismatch between task requirements and model capabilities.
By separating compression (MAP) from reasoned abstraction and structuring (REDUCE), and assigning each to the appropriate model class, this RFC proposes a robust, scalable, and locally runnable solution for high-quality podcast processing.
Beyond summarization, this RFC now establishes the local ML platform for structured extraction tasks. The addition of FLAN-T5, sentence-transformers, extractive QA, and NLI models creates the foundation for RFC-049 (Grounded Insight Layer), where:
- FLAN-T5 enables structured insight extraction with zero external dependencies
- Extractive QA provides verbatim quote extraction that satisfies the GIL grounding contract better than any LLM can
- Sentence-transformers enable semantic grounding validation and topic clustering
- NLI cross-encoders provide rigorous evidence verification
The hybrid provider will exist alongside the current ML provider, giving users choice and flexibility. Mac laptop support (Apple Silicon) makes this accessible to developers and power users.
This is a strategic enhancement that builds a local ML platform — improving summarization today, enabling evidence-backed insight extraction (RFC-049) tomorrow, and powering adaptive routing (RFC-053) across diverse content types in the future.