RFC-045: ML Model Optimization Guide¶
- Status: Completed
- Authors: Marko Dragoljević
- Stakeholders: ML/AI Engineers, Developers using podcast_scraper
- Related PRDs:
docs/prd/PRD-005-summarization.md(Summarization feature)docs/prd/PRD-007-ai-experiments.md(AI experiment pipeline)- Related ADRs:
docs/adr/ADR-017-registered-preprocessing-profiles.md(Preprocessing profiles)- Related RFCs:
docs/rfc/RFC-012-episode-summarization.md(Episode summarization)docs/rfc/RFC-041-podcast-ml-benchmarking-framework.md(ML benchmarking)- Related Documents:
docs/wip/preprocessing_improvements_plan.md(Source document)docs/wip/baseline_bart_experiment_plan.md(Source document)
Abstract¶
This RFC provides a comprehensive guide for maximizing ML model quality in podcast_scraper through two orthogonal approaches: preprocessing optimization (model-agnostic text cleaning) and generation parameter tuning (model-specific settings). The goal is to extract maximum quality from smaller, faster models before graduating to larger, more expensive alternatives.
Architecture Alignment: This RFC builds on RFC-016's modular provider architecture and RFC-041's benchmarking framework, providing concrete optimization strategies that apply across all ML providers.
Problem Statement¶
First Principle: Generic CNN Summarizers Hate Raw Dialogue¶
Models like facebook/bart-base are not trained to summarize messy dialogue transcripts. They were trained on news articles, Wikipedia, and other well-structured prose. When you feed them:
- Speaker turns (
Maya: ...,Liam: ...) - Interruptions and overlapping speech
- Filler words (
uh,um,you know,like) - Stage directions (
(Pause),(Laughter),[music]) - Transcription artifacts (
/////,=-,Desc-)
...they will often produce exactly what we're seeing: repetition loops + label hallucinations + garbage output.
This is not a bug—it's a fundamental mismatch between training data and input format.
The Core Question¶
Before touching parameters, you must decide:
Do you want your summarizer to work on dialogue transcripts?
If yes, you must either:
- Preprocess dialogue into a cleaner format (this RFC's primary approach), or
- Use a dialogue-tuned summarization model (future consideration)
Observed Quality Issues¶
Current baseline experiments reveal significant quality issues:
| Issue | Baseline Value | Impact |
|---|---|---|
| Speaker Name Leak Rate | 80% | Names appear in summaries |
| "Too Short" Warnings | 5/5 episodes | Incomplete summaries |
| Repetitive Output | Present | Garbage artifacts (=-, ////////) |
| Failed Episodes | 20% | Pipeline failures |
Root Causes¶
These issues stem from two orthogonal root causes that must both be addressed:
- Inadequate Preprocessing (Step A)
- Current
cleaning_v3preserves speaker names likeMaya:andLiam:that leak into outputs - Junk lines (punctuation artifacts, stage directions) confuse the model
-
Name:format triggers verbatim copying behavior -
Suboptimal Generation Parameters (Step B)
- Default settings prioritize speed over quality
- No anti-repetition controls enabled
- Length constraints too restrictive for podcast content
Why Both Steps Are Required¶
| If You Only Do... | Result |
|---|---|
| Step A (preprocessing) only | Cleaner input, but model may still produce short/repetitive output |
| Step B (parameters) only | Better generation behavior, but garbage-in-garbage-out |
| Both A + B | Clean input + constrained generation = quality output |
Without addressing both fronts, users will either:
- Waste compute on larger models that don't solve preprocessing issues
- Miss easy quality wins from parameter tuning
- Chase phantom bugs that are actually input format problems
Use Cases¶
- Resource-Constrained Deployment: Maximize quality from
bart-small+led-basebefore requiring larger models - Quality Debugging: Systematically identify whether issues are preprocessing or model-related
- Baseline Establishment: Create reproducible quality benchmarks for model comparisons
- Dialogue-to-Prose Conversion: Transform raw transcripts into summarizer-friendly format
Architecture Context¶
Big Picture: Three Processing Flows¶
The podcast_scraper system has three interconnected processing flows. Understanding where preprocessing fits is essential for effective optimization:
┌─────────────────────────────────────────────────────────────────────────────────┐
│ PODCAST SCRAPER ARCHITECTURE │
├─────────────────────────────────────────────────────────────────────────────────┤
│ │
│ FLOW 1: PRODUCTION PIPELINE (workflow/orchestration.py) │
│ ═══════════════════════════════════════════════════════ │
│ │
│ RSS Feed ──► Episode ──► Download/ ──► Metadata ──► Summarization │
│ Metadata Transcription Generation │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ [Whisper] [metadata.json] [summary.txt] │
│ │ │ │
│ ▼ │ │
│ transcript.txt ───────────────────────────►│ │
│ │ │
│ PREPROCESSING HAPPENS HERE │ │
│ (inside summarizer at runtime) │ │
│ │
├─────────────────────────────────────────────────────────────────────────────────┤
│ │
│ FLOW 2: EVALUATION PIPELINE (scripts/eval/run_experiment.py) │
│ ═══════════════════════════════════════════════════════════ │
│ │
│ experiment.yaml ──► Load Dataset ──► For Each Episode ──► provider.summarize() │
│ │ │ │ │ │
│ │ ▼ ▼ ▼ │
│ │ curated_5feeds_ raw transcript Same cleanup applies │
│ │ smoke_v1.json file.txt (inside model) │
│ │ │ │
│ └──────────────── preprocessing_profile: cleaning_v3 ───────►│ │
│ │
├─────────────────────────────────────────────────────────────────────────────────┤
│ │
│ FLOW 3: PREPROCESSING SUBSYSTEM (preprocessing/core.py, profiles.py) │
│ ═════════════════════════════════════════════════════════════════════ │
│ │
│ ┌──────────────────────────────────────────────────────────────────┐ │
│ │ PREPROCESSING PROFILES │ │
│ │ │ │
│ │ cleaning_none ──► (passthrough) │ │
│ │ cleaning_v1 ────► Basic: timestamps, speakers, blanks │ │
│ │ cleaning_v2 ────► v1 + sponsor blocks + outro blocks │ │
│ │ cleaning_v3 ────► v2 + credits + garbage lines + artifacts │◄─ DEFAULT │
│ │ cleaning_v4 ────► v3 + speaker anonymization + header stripping │◄─ PROPOSED│
│ └──────────────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────────────┘
Preprocessing is a Horizontal Concern¶
Preprocessing isn't a single stage in one pipeline—it's a horizontal concern that cuts across multiple flows:
| Flow | Where Preprocessing Happens | Profile Used | Configurable? |
|---|---|---|---|
| Production Pipeline | Inside summarize_long_text() at runtime |
cleaning_v3 (via apply_profile) |
✅ Yes |
| Eval Pipeline | Inside provider's summarize() at runtime |
From experiment config | ✅ Yes |
| Offline Cleaning | clean_for_summarization() for .cleaned.txt |
cleaning_v3 equivalent |
❌ No |
Three Layers of Processing¶
┌─────────────────────────────────────────────────────────────────────────┐
│ THREE LAYERS OF PROCESSING │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ LAYER 1: ACQUISITION │
│ ───────────────────── │
│ RSS → Download → Transcribe → Save transcript.txt │
│ │
│ ═══════════════════════════════════════════════════════════════════ │
│ │
│ LAYER 2: PREPROCESSING (horizontal concern) │
│ ─────────────────────────────────────────── │
│ │
│ ┌────────────────────────────────────────────────────────────────┐ │
│ │ PREPROCESSING PROFILE SYSTEM │ │
│ │ │ │
│ │ Input ──► apply_profile(text, "cleaning_v4") ──► Output │ │
│ │ │ │ │
│ │ ▼ │ │
│ │ ┌──────────────────────────────────────────────────────────┐ │ │
│ │ │ strip_episode_header → strip_credits → strip_garbage │ │ │
│ │ │ → clean_transcript → anonymize_speakers → remove_sponsor │ │ │
│ │ │ → remove_outro → remove_artifacts → filter_junk_lines │ │ │
│ │ └──────────────────────────────────────────────────────────┘ │ │
│ └────────────────────────────────────────────────────────────────┘ │
│ │
│ ═══════════════════════════════════════════════════════════════════ │
│ │
│ LAYER 3: ML INFERENCE │
│ ───────────────────── │
│ Cleaned text ──► MAP (BART) ──► REDUCE (LED) ──► Final Summary │
│ │
└─────────────────────────────────────────────────────────────────────────┘
The Fundamental Problem: Late-Stage Preprocessing Can't Fix Early-Stage Input¶
RAW TRANSCRIPT (what models currently receive):
┌─────────────────────────────────────────────────────────────┐
│ Maya: Hi, Maya. I'm here to talk about building trails that │
│ last. What do you recommend? │
│ Liam: We're talking about riding trails that are rideable │
│ longer, and the soil structure stays intact instead │
│ of turning to ruts. │
│ Maya: And what do you teach that to someone new? │
└─────────────────────────────────────────────────────────────┘
│
▼
AFTER cleaning_v3 (current):
┌─────────────────────────────────────────────────────────────┐
│ Maya: Hi, Maya. I'm here to talk about building trails that │ ← Still dialogue!
│ last. What do you recommend? │ ← Still speaker names!
│ Liam: We're talking about riding trails... │ ← Still Q&A format!
│ Maya: And what do you teach that to someone new? │
└─────────────────────────────────────────────────────────────┘
│
▼
MODEL OUTPUT (garbage):
┌─────────────────────────────────────────────────────────────┐
│ Maya: Hi, Maya. Hi, Maya... What do you recommend? │
│ \\\\\\\\\\\\\\\\ │
│ Liam says Right. Weap- Atras- Athens- │
└─────────────────────────────────────────────────────────────┘
AFTER cleaning_v4 (proposed):
┌─────────────────────────────────────────────────────────────┐
│ A: Hi. I'm here to talk about building trails that last. │ ← Anonymized!
│ What do you recommend? │
│ B: We're talking about riding trails that are rideable │ ← No name leak!
│ longer, and the soil structure stays intact instead │
│ of turning to ruts. │
│ A: And what do you teach that to someone new? │
└─────────────────────────────────────────────────────────────┘
│
▼
MODEL OUTPUT (improved):
┌─────────────────────────────────────────────────────────────┐
│ Building trails that last requires attention to soil │
│ structure and drainage. Rideable trails maintain their │
│ structure instead of turning to ruts... │
└─────────────────────────────────────────────────────────────┘
Key Insight: cleaning_v3 doesn't address speaker names, episode headers, or junk line patterns—the primary causes of the 80% speaker leak rate observed in experiments.
Goals¶
- Eliminate Speaker Name Leakage: Reduce from 80% to <10% through preprocessing
- Improve Output Completeness: Eliminate "too short" warnings through parameter tuning
- Reduce Repetition Artifacts: Use n-gram blocking and repetition penalties
- Establish Model-Agnostic Best Practices: Preprocessing improvements that benefit all models
- Document Parameter Sensitivity: Identify which parameters have highest quality impact
Constraints & Assumptions¶
Constraints:
- Preprocessing must not destroy semantic content
- Parameter changes must be testable via experiment configs
- Changes must be backward compatible with existing pipelines
- No new external dependencies
Assumptions:
- Synthetic test transcripts have intentional repetition (affects baseline interpretation)
- Speaker anonymization acceptable for summarization use case
- Faster models (BART-small) are preferred over larger models when quality is comparable
Design & Implementation¶
1. Preprocessing Optimization (Step A)¶
Fix the input so BART/LED isn't doomed from the start. These improvements help all models (BART, LED, T5, OpenAI) by providing cleaner input. This alone often converts "garbage" into "usable".
1.1 Normalize Transcript into "Dialogue-Safe" Text¶
Apply these deterministic transforms before the Map stage:
| Transform | What It Does | Why It Helps |
|---|---|---|
| Remove junk lines | Strips ////, ====, ---, =- |
Prevents garbage in output |
| Remove stage directions | Strips (Pause), [music], Desc- |
Removes non-content |
| Collapse whitespace | \n\n\n → \n\n |
Cleaner token boundaries |
| Limit repeated phrases | Deduplicate obvious repetition | Prevents copying loops |
Junk Line Detection:
def is_junk_line(line: str) -> bool:
"""Detect lines dominated by punctuation or artifacts."""
stripped = line.strip()
if not stripped:
return False
# Lines that are mostly punctuation
punct_ratio = sum(1 for c in stripped if c in '=/\\-_|') / len(stripped)
if punct_ratio > 0.5:
return True
# Known artifacts
artifacts = ['Desc-', '(Pause)', '[music]', 'subscribe', '////']
return any(art.lower() in stripped.lower() for art in artifacts)
1.2 Speaker Anonymization (HIGH PRIORITY)¶
Problem: Maya: and Liam: labels leak into summaries because normalize_speakers only removes generic labels (Host:, Speaker 1:).
Solution: Replace real speaker names with anonymous labels:
def anonymize_speakers(text: str) -> str:
"""Replace speaker names with anonymous labels (A:, B:, C:...).
Args:
text: Transcript text with speaker labels like "Maya: Hello"
Returns:
Text with anonymized labels like "A: Hello"
Example:
>>> anonymize_speakers("Maya: Hi there\\nLiam: Hello")
'A: Hi there\\nB: Hello'
"""
speaker_pattern = r"^([A-Z][a-z]+(?:\s[A-Z][a-z]+)?)(?:\s*\([^)]+\))?\s*:"
speakers_seen = {}
lines = text.splitlines()
result = []
for line in lines:
match = re.match(speaker_pattern, line)
if match:
name = match.group(1)
if name not in speakers_seen:
speakers_seen[name] = chr(ord('A') + len(speakers_seen))
anon = speakers_seen[name]
line = re.sub(speaker_pattern, f"{anon}:", line)
result.append(line)
return "\n".join(result)
Expected Impact: Speaker leak rate 80% → <10%
1.3 Why Speaker Labels Trigger Copying¶
Small CNN models tend to copy Name: patterns verbatim. This is because:
- Pattern Recognition:
Name:at line start looks like a label the model should preserve - Training Data: News articles don't have dialogue turns, so models weren't trained to ignore them
- Attention Mechanism: Capitalized names at predictable positions get high attention scores
The Key Insight: If you do only one preprocessing step, do speaker anonymization. It addresses the most common failure mode.
Alternative formats that work better (if anonymization isn't enough):
# Instead of:
Maya: Welcome back to the show.
Liam: Thanks for having me.
# Convert to turn-based:
Turn 1: Welcome back to the show.
Turn 2: Thanks for having me.
# Or prose-style (more aggressive):
The host welcomed listeners back to the show.
The guest expressed gratitude for the invitation.
The A:, B: format is a good middle ground—it preserves turn structure while removing name-copying triggers.
1.4 Episode Header Stripping (HIGH PRIORITY)¶
Problem: BART copies metadata headers directly into summaries:
# Singletrack Sessions — Episode ← Copied!
## Building Trails That Last ← Copied!
Host: Maya ← Copied!
Guest: Liam ← Copied!
Solution: Strip header blocks before summarization:
def strip_episode_header(text: str) -> str:
"""Remove episode title/metadata header block.
Args:
text: Transcript text possibly containing header metadata
Returns:
Text with header lines removed
"""
patterns = [
r"^#\s+.*?\n", # Markdown H1: "# Title"
r"^##\s+.*?\n", # Markdown H2: "## Subtitle"
r"^Host:\s*\w+.*?\n", # "Host: Maya"
r"^Guest:\s*\w+.*?\n", # "Guest: Liam"
r"^Guests?:\s*.*?\n", # "Guests: A, B, C"
]
cleaned = text
for pattern in patterns:
cleaned = re.sub(pattern, "", cleaned, flags=re.MULTILINE)
return cleaned.strip()
1.5 Current Gap: Profiles Exist But Aren't Wired¶
Important: The preprocessing profile system exists but is not connected to the summarization pipeline.
What exists today:
# In preprocessing/profiles.py - WORKS
register_profile("cleaning_v3", _cleaning_v3)
apply_profile(text, "cleaning_v3") # Can be called
What the summarizer actually does:
# In providers/ml/summarizer.py line 1222 - HARDCODED
cleaned_text = preprocessing.clean_for_summarization(text) # Ignores profiles!
The clean_for_summarization() function is essentially cleaning_v3 logic, but it bypasses the profile registry entirely.
What needs to change:
| Component | Current | Required |
|---|---|---|
summarizer.py |
Calls clean_for_summarization() |
Call apply_profile(text, profile_id) |
ExperimentConfig |
No profile field | Add preprocessing_profile: str field |
run_experiment.py |
Hardcodes "cleaning_v3" in fingerprint | Pass profile to summarizer |
| Provider interface | No profile param | Accept preprocessing_profile param |
Wiring path:
experiment.yaml # User specifies: preprocessing_profile: "cleaning_v4"
↓
run_experiment.py # Reads config, passes to provider
↓
MLProvider.summarize() # Accepts preprocessing_profile param
↓
summarize_long_text() # Passes to preprocessing
↓
apply_profile(text, profile_id) # Actually uses the profile!
Until this wiring is complete, creating cleaning_v4 won't have any effect on experiments.
1.7 New Preprocessing Profile: cleaning_v4¶
Combine all improvements into a new profile:
def _cleaning_v4(text: str) -> str:
"""Enhanced cleaning with speaker anonymization and header stripping."""
# 1. Strip episode header (title, Host:, Guest:)
cleaned = strip_episode_header(text)
# 2. Strip credits (before chunking)
cleaned = strip_credits(cleaned)
# 3. Strip garbage lines
cleaned = strip_garbage_lines(cleaned)
# 4. Anonymize speakers (Maya: → A:)
cleaned = anonymize_speakers(cleaned)
# 5. Standard cleaning (timestamps, normalize generic speakers)
cleaned = clean_transcript(
cleaned,
remove_timestamps=True,
normalize_speakers=True,
collapse_blank_lines=True,
remove_fillers=False, # Keep disabled for safety
)
# 6. Remove sponsor/outro blocks
cleaned = remove_sponsor_blocks(cleaned)
cleaned = remove_outro_blocks(cleaned)
# 7. Remove BART/LED artifacts
cleaned = remove_summarization_artifacts(cleaned)
return cleaned.strip()
2. Generation Parameter Tuning (Step B)¶
These parameters constrain generation to stop repetition and garbage output. They are tuned per-stage (Map vs Reduce) because each stage has different requirements.
2.1 Core Principle: Constrain Generation¶
For Hugging Face generate() on both Map and Reduce stages, add anti-loop controls:
| Control Type | Purpose |
|---|---|
do_sample: false |
Deterministic output (beam search, not sampling) |
no_repeat_ngram_size |
Prevents exact n-gram repetition |
repetition_penalty |
Penalizes token-level repetition |
encoder_no_repeat_ngram_size |
Prevents copying n-grams from input (future) |
2.2 Map Stage Parameters (BART)¶
The Map stage summarizes individual chunks. Use tighter constraints:
| Parameter | Current Baseline | Code Default | Recommended | Notes |
|---|---|---|---|---|
do_sample |
false | false | false | Already deterministic ✓ |
num_beams |
4 | 4 | 4-6 | 6 if stable on MPS |
no_repeat_ngram_size |
3 | 3 | 4 | Stricter repetition blocking |
repetition_penalty |
- | 1.3 (hardcoded) | 1.15-1.3 | Already set, consider exposing |
length_penalty |
1.0 | 1.0 | 1.0-1.1 | Slight preference for longer |
max_new_tokens |
200 | 150 | 150-200 | Per-chunk summary length |
min_new_tokens |
80 | - | 60-100 | Prevent ultra-short outputs |
early_stopping |
true | true | true | Already enabled ✓ |
Map Stage Config (matching actual schema):
map_params:
max_new_tokens: 200
min_new_tokens: 80
num_beams: 4
no_repeat_ngram_size: 4
length_penalty: 1.0
early_stopping: true
2.3 Reduce Stage Parameters (LED)¶
The Reduce stage combines Map summaries into final output. Allow longer output:
| Parameter | Current Baseline | Code Default | Recommended | Notes |
|---|---|---|---|---|
do_sample |
false | false | false | Already deterministic ✓ |
num_beams |
4 | 4 | 4 | Stable for longer context |
no_repeat_ngram_size |
3 | 3 | 4 | Stricter repetition blocking |
repetition_penalty |
- | 1.3 (hardcoded) | 1.12-1.3 | Already set, consider exposing |
length_penalty |
1.0 | 1.0 | 1.0 | Neutral |
max_new_tokens |
650 | 300 | 600-900 | Final summary length |
min_new_tokens |
220 | - | 200-350 | Ensure substantial output |
early_stopping |
true | true | true | Already enabled ✓ |
Reduce Stage Config (matching actual schema):
reduce_params:
max_new_tokens: 650
min_new_tokens: 220
num_beams: 4
no_repeat_ngram_size: 4
length_penalty: 1.0
early_stopping: true
2.4 Advanced Anti-Repetition Controls (Future Enhancement)¶
Note: These parameters are not currently exposed in the experiment config or summarizer code. Implementation requires changes to
run_experiment.pyandSummaryModel.summarize().
If repetition persists after applying the above, these additional knobs could help:
| Parameter | Value | Purpose |
|---|---|---|
encoder_no_repeat_ngram_size |
3 | Prevents copying 3-grams directly from input |
bad_words_ids |
[tokenizer.encode("Name:")] |
Block specific problematic tokens |
When encoder_no_repeat_ngram_size would help:
- Model copies phrases verbatim from input
- Summaries contain transcript artifacts
- Speaker labels appear despite preprocessing
# Example: Future implementation in generation call
outputs = model.generate(
input_ids,
encoder_no_repeat_ngram_size=3, # Prevents input copying
**other_params
)
Implementation Task: Add encoder_no_repeat_ngram_size support to GenerationParams schema and wire through the summarizer.
2.5 Chunk Size Parameters¶
| Setting | Words/Chunk | Trade-off |
|---|---|---|
| Smaller chunks | 600 | More focused map summaries, more chunks |
| Default | 800-1000 | Balance |
| Larger chunks | 1200 | More context per chunk, fewer chunks |
When to use smaller chunks: Dense technical content, many topic shifts
When to use larger chunks: Narrative content, fewer distinct topics
2.6 Parameter Tuning Strategy¶
Start with conservative settings and adjust based on output quality:
Repetition in output?
→ Increase no_repeat_ngram_size (4 → 5)
→ Note: repetition_penalty already at 1.3 (hardcoded)
→ Future: Add encoder_no_repeat_ngram_size: 3
Output too short?
→ Increase min_new_tokens
→ Increase length_penalty (1.0 → 1.1)
→ Check if input is being truncated
Output too long/rambling?
→ Decrease max_new_tokens
→ Decrease length_penalty (1.0 → 0.9)
Copying from input?
→ Future: Add encoder_no_repeat_ngram_size: 3
→ Check preprocessing (Step A)
3. Experiment Matrix¶
Run experiments systematically to isolate effects:
| Config ID | Preprocessing | Parameters | Hypothesis |
|---|---|---|---|
| v1 (baseline) | cleaning_v3 | default | Control |
| v2 | cleaning_v3 | +50% length | Longer outputs reduce "too short" |
| v3 | cleaning_v3 | ngram=4, beams=5 | Reduce repetition |
| v4 | cleaning_v3 | chunks=600 | More focused summaries |
| v5 | cleaning_v3 | chunks=1200 | More context |
| v6 | cleaning_v3 | best combined | Best parameters |
| v7 | cleaning_v4 | default | Preprocessing impact |
| v8 | cleaning_v4 | best combined | Maximum quality |
Key Insight: v7 isolates preprocessing impact, v8 combines both improvements.
Already Created Configs (parameter experiments v2-v6):
data/eval/configs/baseline_bart_v2_longer_output.yaml- +50% max_new_tokens, length_penalty=1.2data/eval/configs/baseline_bart_v3_stronger_ngram.yaml- ngram=4/5, beams=5data/eval/configs/baseline_bart_v4_smaller_chunks.yaml- chunks=600 wordsdata/eval/configs/baseline_bart_v5_larger_chunks.yaml- chunks=1200 wordsdata/eval/configs/baseline_bart_v6_combined_best.yaml- combined best params
To Be Created (preprocessing experiments v7-v8):
data/eval/configs/baseline_bart_v7_cleaning_v4.yamldata/eval/configs/baseline_bart_v8_cleaning_v4_optimized.yaml
3.1 Experiment Results (v1-v6)¶
Parameter tuning experiments (v1-v6) have been completed. Results validate the preprocessing-first approach:
Metrics Summary¶
| Experiment | Boilerplate Leak | Speaker Name Leak | Failed Episodes | Avg Tokens | Latency (ms) |
|---|---|---|---|---|---|
| v1 baseline | 0% | 80% | 0 | 445 | 37,179 |
| v2 longer output | 20% ⚠️ | 80% | 1 (p03_e01) | 607.4 | 52,943 |
| v3 stronger ngram | 20% ⚠️ | 80% | 1 (p05_e01) | 375.2 | 37,590 |
| v4 smaller chunks | 0% ✓ | 80% | 0 | 445 | 36,585 |
| v5 larger chunks | 0% ✓ | 80% | 0 | 477.4 | 38,430 |
| v6 combined | 20% ⚠️ | 80% | 1 (p05_e01) | 438.2 | 45,214 |
Key Findings¶
-
Speaker Name Leak is Constant at 80% — Proves preprocessing is the root cause, not generation parameters
-
Longer Output = More Garbage
- v2 produced more tokens (607 vs 445) but introduced boilerplate leakage
-
More room for output = more room for repetition loops
-
Stronger N-gram Blocking Didn't Help
- v3 with
no_repeat_ngram_size=4andnum_beams=5still showed 80% speaker leak -
Repetition blocking can't help when input itself is repetitive
-
Chunk Size Has Minimal Impact
- v4 (600 words) and v5 (1200 words) produced nearly identical quality
-
Both maintained 0% boilerplate leak (same as baseline)
-
Combined Parameters Combined Problems
- v6 inherited failures from v2/v3, didn't combine benefits
Conclusions¶
| Finding | Implication |
|---|---|
| 80% speaker leak across ALL experiments | Preprocessing is the bottleneck |
| Map-stage outputs already broken | Problem occurs before reduce stage |
| Parameter tweaks don't fix core issues | Input quality must be improved first |
| v1/v4/v5 are equivalently "best" | Simpler is better until preprocessing fixed |
Recommendation: Implement cleaning_v4 preprocessing before further parameter experiments. The 80% speaker name leak rate being constant across all 6 experiments is definitive proof that preprocessing must be addressed first.
Key Decisions¶
- Anonymous Labels Over Complete Removal
- Decision: Use
A:,B:,C:instead of removing speaker labels entirely -
Rationale: Preserves turn-taking structure useful for some downstream tasks
-
Profile-Based Preprocessing
- Decision: Create
cleaning_v4profile instead of modifyingcleaning_v3 -
Rationale: Maintains backward compatibility, allows A/B testing
-
Parameter Exposure via Config
- Decision: Expose key parameters in experiment YAML configs
- Rationale: Enables systematic experimentation without code changes
Alternatives Considered¶
- Aggressive Speaker Stripping
- Description: Remove all speaker labels entirely (
Maya: Hello→Hello) - Pros: Completely eliminates speaker leakage
- Cons: Loses turn-taking structure
-
Why Rejected: Anonymous labels achieve same goal while preserving structure
-
Model-Specific Preprocessing
- Description: Different preprocessing for BART vs LED vs OpenAI
- Pros: Optimal for each model
- Cons: Complexity explosion, harder to maintain
-
Why Rejected: Model-agnostic preprocessing is simpler and broadly effective
-
Fine-Tuning Instead of Parameter Tuning
- Description: Fine-tune BART on podcast data
- Pros: Potentially higher quality ceiling
- Cons: Requires training data, compute, maintenance
- Why Rejected: Out of scope; parameter tuning is sufficient for initial quality
Testing Strategy¶
Test Coverage:
- Unit tests:
strip_episode_header(),anonymize_speakers(),is_junk_line()edge cases - Integration tests: Full preprocessing pipeline with various input formats
- E2E tests: Complete summarization runs comparing v3 vs v4 profiles
Test Organization:
tests/
unit/
test_preprocessing_v4.py # New functions
integration/
test_preprocessing_profiles.py # Profile comparison
e2e/
test_summarization_quality.py # Quality metrics
Test Markers:
@pytest.mark.unit
def test_anonymize_speakers_basic(): ...
@pytest.mark.integration
def test_cleaning_v4_profile(): ...
@pytest.mark.e2e
@pytest.mark.slow
def test_summarization_with_v4(): ...
Test Execution:
# Quick validation
make test-unit
# Integration testing
make test-integration
# Full quality validation (slow)
make experiment-run CONFIG=data/eval/configs/baseline_bart_v7_cleaning_v4.yaml
Rollout & Monitoring¶
Rollout Plan:
- Phase 1 (Week 1): Implement preprocessing functions, unit tests
- Phase 2 (Week 2): Register
cleaning_v4profile, run experiments v1-v6 - Phase 3 (Week 3): Run experiments v7-v8, compare results
- Phase 4 (Week 4): Document findings, update defaults if quality improved
Monitoring:
| Metric | Current | Target | Measurement |
|---|---|---|---|
| Speaker Leak Rate | 80% | <10% | Manual review + automated detection |
| "Too Short" Rate | 100% | <20% | Summary token count |
| Repetition Artifacts | Present | Rare | Automated detection |
| Failed Episodes | 20% | <5% | Pipeline success rate |
Success Criteria:
- ✅ Speaker leak rate reduced to <10%
- ✅ "Too short" warnings reduced to <20%
- ✅ No regression in latency (within 20%)
- ✅ At least one v4-based config outperforms v3 baseline
Relationship to Other RFCs¶
This RFC (RFC-045) complements the ML quality improvement initiative:
- RFC-012: Episode Summarization - Core summarization implementation
- RFC-041: ML Benchmarking Framework - Measurement infrastructure
- RFC-044: Model Registry - Model configuration management
Key Distinction:
- This RFC (045): How to optimize ML quality (preprocessing + parameters)
- RFC-041: How to measure ML quality (benchmarking framework)
- RFC-044: How to configure ML models (registry)
Together, these RFCs provide a complete ML quality management system.
Benefits¶
- Immediate Quality Gains: Up to 70% reduction in speaker leakage
- Systematic Experimentation: Clear methodology for quality optimization
- Resource Efficiency: Extract maximum value from smaller models
- Model-Agnostic Foundation: Preprocessing improvements benefit all providers
- Reproducibility: Documented configs enable reproducible experiments
Migration Path¶
- Phase 1: Implement
cleaning_v4as opt-in profile - Phase 2: Run A/B experiments comparing v3 vs v4
- Phase 3: If v4 consistently outperforms, make it default in next minor version
- Phase 4: Deprecate v3 in future major version (with migration notice)
Backward Compatibility:
# Explicit v3 (old behavior)
preprocessing:
profile: "cleaning_v3"
# New v4 (opt-in initially)
preprocessing:
profile: "cleaning_v4"
Open Questions¶
- Filler Word Removal: Should
remove_fillers=Truebe enabled in v4? (Language-specific concerns) - Repetition Penalty Exposure: Should
repetition_penalty(currently hardcoded at 1.3) be exposed in YAML configs for tuning? - Encoder No-Repeat N-gram: Should
encoder_no_repeat_ngram_sizebe implemented to prevent input copying? - Chunk Overlap: Would overlapping chunks improve quality? (Note: already supported via
word_overlapparameter) - Multi-Language Support: How should preprocessing handle non-English transcripts?
Implementation Checklist¶
New Functions¶
- [ ]
is_junk_line(line: str) -> boolinpreprocessing/core.py - [ ]
strip_episode_header(text: str) -> strinpreprocessing/core.py - [ ]
anonymize_speakers(text: str) -> strinpreprocessing/core.py
New Profile¶
- [ ]
_cleaning_v4()function inpreprocessing/profiles.py - [ ]
register_profile("cleaning_v4", _cleaning_v4)call
Config Schema & Wiring (Critical Path)¶
- [ ] Add
preprocessing_profilefield toExperimentConfiginevaluation/config.py - [ ] Update
run_experiment.pyto read profile from config and pass to provider - [ ] Update
MLProvider.summarize()to acceptpreprocessing_profileparam - [ ] Update
summarize_long_text()to callapply_profile()instead ofclean_for_summarization() - [ ] Update
OpenAIProvider.summarize()similarly (if applicable)
Parameter Exposure (Future)¶
- [ ] Add
encoder_no_repeat_ngram_sizetoGenerationParamsschema - [ ] Expose
repetition_penaltyin experiment config (currently hardcoded at 1.3)
Experiment Configs¶
- [x]
baseline_bart_v2_longer_output.yaml(created) - [x]
baseline_bart_v3_stronger_ngram.yaml(created) - [x]
baseline_bart_v4_smaller_chunks.yaml(created) - [x]
baseline_bart_v5_larger_chunks.yaml(created) - [x]
baseline_bart_v6_combined_best.yaml(created) - [ ]
baseline_bart_v7_cleaning_v4.yaml - [ ]
baseline_bart_v8_cleaning_v4_optimized.yaml
Documentation¶
- [ ] Update
docs/guides/ML_MODEL_COMPARISON_GUIDE.md - [ ] Update
docs/guides/EXPERIMENT_GUIDE.md
References¶
- Source Documents:
docs/wip/preprocessing_improvements_plan.mddocs/wip/baseline_bart_experiment_plan.md- Preprocessing Code:
src/podcast_scraper/preprocessing/core.py - Profile Registry:
src/podcast_scraper/preprocessing/profiles.py - Experiment Configs:
data/eval/configs/ - Baseline Results:
data/eval/runs/