# RFC-059: Speaker Detection Refactor & Test Audio Improvements
- Status: Draft
- Authors: Architecture Review
- Stakeholders: Core Pipeline, ML Providers, Test Infrastructure
- Related PRDs:
  - docs/prd/PRD-020-audio-speaker-diarization.md
  - docs/prd/PRD-008-speaker-name-detection.md
  - docs/prd/PRD-002-whisper-fallback.md
- Related RFCs:
  - docs/rfc/RFC-010-speaker-name-detection.md (current NER design)
  - docs/rfc/RFC-058-audio-speaker-diarization.md (companion: diarization capability)
  - docs/rfc/RFC-060-diarization-aware-commercial-cleaning.md (downstream: diarization-aware sponsor detection)
  - docs/rfc/RFC-040-audio-preprocessing-pipeline.md (audio preprocessing)
  - docs/rfc/RFC-018-test-structure-reorganization.md (test organization)
  - docs/rfc/RFC-054-e2e-mock-response-strategy.md (E2E test strategy)
- Related Issues:
  - Issue #269: Refactor `speaker_detection.py` into `speaker_detectors/` submodules
  - Issue #111: Use unique voices for each speaker in generated mock podcast audio
  - Issue #109: Add commercial segments to mock podcast transcripts
  - Issue #414: Audio Pipeline Separation
- Related Documents:
  - `tests/fixtures/FIXTURES_SPEC.md`
  - `tests/fixtures/scripts/transcripts_to_mp3.py`
## Abstract
This RFC addresses two related infrastructure improvements that collectively prepare the audio and speaker detection systems for the diarization capability defined in RFC-058:
- Speaker detection modularization — Refactor `speaker_detection.py` (1,376 lines) into focused submodules within `speaker_detectors/` (Issue #269)
- Test audio improvements — Assign unique TTS voices per speaker and add commercial segments to mock transcripts (Issues #111, #109)
These are prerequisite and companion work for RFC-058 (diarization): modular speaker detection is needed to cleanly integrate diarization name-mapping, and distinct test voices are required for diarization to actually distinguish speakers in test audio.
Note: Audio chunking for API providers (Issue #286) was originally scoped in this RFC but has been moved out as a standalone item — it is independent of diarization and speaker detection.
Architecture Alignment: Follows RFC-016 modularization principles and matches existing patterns in summarization/ (chunking, map_reduce, prompts submodules).
## Problem Statement
Two distinct but related problems need solving:
### Problem 1: Monolithic Speaker Detection (Issue #269)
`speaker_detection.py` is 1,376 lines containing 23+ functions spanning seven concerns: spaCy model loading, entity extraction, name validation, pattern analysis, title position analysis, speaker detection orchestration, and guest detection scoring. This monolith:
- Makes it hard to add new detection strategies (e.g., diarization-based mapping from RFC-058)
- Complicates testing (tests need the whole module to test one concern)
- Violates the modularization pattern used elsewhere (e.g., `summarization/` with `chunking.py`, `map_reduce.py`, `prompts.py`)
### Problem 2: Indistinguishable Test Audio (Issue #111)
The TTS audio generation script (`transcripts_to_mp3.py`) uses only two voices for all speakers:
- Host voice: `Samantha` (en_US) for all hosts (Maya, Ethan, Rina, Leo, Nora)
- Guest voice: `Daniel` (en_GB) for all guests
This means all hosts sound identical and all guests sound identical. When RFC-058 adds diarization, pyannote cannot distinguish speakers in test audio because they share the same voice. Additionally, there are no commercial segments (Issue #109), making it impossible to test sponsor-block cleaning against realistic fixtures.
Use Cases:
- Diarization integration: Clean speaker detection module with clear extension points for diarization name-mapping
- Diarization testing: Test audio where pyannote can actually distinguish speakers by voice
- Cleaning validation: Test transcripts with commercial segments for `remove_sponsor_blocks()` testing
## Goals
- Modularize speaker detection into eight focused submodules with clear responsibilities
- Maintain backward compatibility: `speaker_detection.py` remains the public API (thin re-export wrapper)
- Unique TTS voices: Each fixture speaker gets a distinct macOS `say` voice with accent variety
- Commercial segments: Add realistic sponsor blocks to all 15 mock transcripts
- Enable diarization testing: Test audio fixtures usable for pyannote speaker separation validation
## Constraints & Assumptions
Constraints:
- Speaker detection refactor must not break any existing tests (zero-diff behavior)
- Test audio generation requires macOS (uses `/usr/bin/say`) — CI runs on Linux, so audio fixtures are pre-generated and checked in (Git LFS)
- Commercial segments must use patterns detectable by the existing `remove_sponsor_blocks()` and the evolving eval tooling
Assumptions:
- macOS `say` has sufficient voice variety (~20+ English voices with different accents)
- pyannote can distinguish macOS TTS voices that use different voice models
- Fixture regeneration is an infrequent, developer-initiated operation
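The voice-variety assumption is easy to verify on a dev machine. A minimal sketch, assuming the common `say -v '?'` listing format (which varies slightly across macOS versions), so the parser is best-effort and the helper names are hypothetical:

```python
import re
import subprocess

# `say -v '?'` lists installed voices, roughly one per line:
#   "Samantha            en_US    # Hello! My name is Samantha."
VOICE_LINE_RE = re.compile(r"^(?P<name>\S[^#]*?)\s+(?P<locale>[a-z]{2}[_-]\w+)\s+#")

def parse_voice_listing(listing: str) -> dict[str, str]:
    """Map voice name -> locale from `say -v '?'` output."""
    voices: dict[str, str] = {}
    for line in listing.splitlines():
        m = VOICE_LINE_RE.match(line)
        if m:
            voices[m.group("name").strip()] = m.group("locale")
    return voices

def english_voice_count() -> int:
    """Count installed English voices (macOS only; shells out to `say`)."""
    out = subprocess.run(
        ["say", "-v", "?"], capture_output=True, text=True, check=True
    ).stdout
    return sum(1 for loc in parse_voice_listing(out).values() if loc.startswith("en"))
```

If `english_voice_count()` comes back below ~20, the accent-variety assumption should be revisited before committing to the voice map.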
## Design & Implementation
### 1. Speaker Detection Modularization (Issue #269)
Current structure:
```
src/podcast_scraper/
├── providers/ml/
│   └── speaker_detection.py   # 1,376 lines — ALL logic
└── speaker_detectors/
    ├── __init__.py
    ├── base.py                # SpeakerDetector protocol (61 lines)
    └── factory.py             # Factory function (67 lines)
```
Target structure:
```
src/podcast_scraper/
├── providers/ml/
│   └── speaker_detection.py   # ~80 lines — thin re-export wrapper
└── speaker_detectors/
    ├── __init__.py            # Re-exports public API
    ├── base.py                # SpeakerDetector protocol (existing)
    ├── factory.py             # Factory function (existing)
    ├── constants.py           # All threshold/configuration constants
    ├── ner.py                 # spaCy model loading & management
    ├── entities.py            # Entity extraction from text (NER)
    ├── normalization.py       # Name validation & sanitization
    ├── patterns.py            # Pattern analysis for titles/descriptions
    ├── guests.py              # Guest detection scoring & selection
    ├── hosts.py               # Host detection from feed/transcript
    └── detection.py           # Main orchestration (detect_speaker_names)
```
Module responsibilities:
| Module | Lines (est.) | Responsibility | Key functions |
|---|---|---|---|
| `constants.py` | ~60 | All threshold and scoring constants | `MIN_NAME_LENGTH`, `POSITION_SCORE_BONUS`, etc. |
| `ner.py` | ~100 | spaCy model loading, validation, download | `_load_spacy_model`, `get_ner_model`, `_validate_model_name` |
| `entities.py` | ~160 | Extract PERSON entities from text via spaCy | `extract_person_entities`, `_extract_entities_from_doc`, `_extract_entities_from_segments` |
| `normalization.py` | ~80 | Name sanitization and validation | `_sanitize_person_name`, `_validate_person_entity`, `_extract_confidence_score` |
| `patterns.py` | ~160 | Title position analysis, prefix/suffix, pattern fallback | `analyze_episode_patterns`, `_pattern_based_fallback`, `_find_common_patterns` |
| `guests.py` | ~180 | Guest candidate scoring, selection, context-aware filtering | `_build_guest_candidates`, `_select_best_guest`, `_is_likely_actual_guest` |
| `hosts.py` | ~120 | Host detection from feed metadata and transcript intro | `detect_hosts_from_feed`, `detect_hosts_from_transcript_intro` |
| `detection.py` | ~200 | Main orchestration, speaker list building | `detect_speaker_names`, `_build_speaker_names_list` |
Migration approach (preserving backward compatibility):
```python
# speaker_detection.py — becomes a thin re-export wrapper
"""Backward-compatible re-export of the speaker detection public API.

All logic has moved to speaker_detectors/ submodules.
This module re-exports the public API for existing imports.
"""
from ..speaker_detectors.constants import DEFAULT_SPEAKER_NAMES
from ..speaker_detectors.detection import detect_speaker_names
from ..speaker_detectors.entities import extract_person_entities
from ..speaker_detectors.guests import _has_guest_intent_cue
from ..speaker_detectors.hosts import detect_hosts_from_feed
from ..speaker_detectors.ner import get_ner_model
from ..speaker_detectors.normalization import (
    filter_default_speaker_names,
    is_default_speaker_name,
)

__all__ = [
    "detect_speaker_names",
    "detect_hosts_from_feed",
    "get_ner_model",
    "extract_person_entities",
    "DEFAULT_SPEAKER_NAMES",
    "is_default_speaker_name",
    "filter_default_speaker_names",
    "_has_guest_intent_cue",
]
```
Phased extraction order:
1. `constants.py` — no dependencies, pure values
2. `normalization.py` — depends only on constants
3. `ner.py` — depends on config, constants
4. `entities.py` — depends on normalization, constants
5. `patterns.py` — depends on entities, constants
6. `guests.py` — depends on entities, patterns, constants
7. `hosts.py` — depends on entities, constants
8. `detection.py` — depends on guests, hosts, entities, normalization, constants
After each phase, run `make ci-fast` to verify zero regressions.
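The extraction order above can also be checked mechanically before starting. A small sketch; the module names and dependency edges are transcribed from this RFC, not read from the codebase:

```python
# Dependency edges as listed in the phased extraction order above.
DEPENDS_ON: dict[str, set[str]] = {
    "constants": set(),
    "normalization": {"constants"},
    "ner": {"constants"},  # the config dependency is external, omitted here
    "entities": {"normalization", "constants"},
    "patterns": {"entities", "constants"},
    "guests": {"entities", "patterns", "constants"},
    "hosts": {"entities", "constants"},
    "detection": {"guests", "hosts", "entities", "normalization", "constants"},
}

EXTRACTION_ORDER = [
    "constants", "normalization", "ner", "entities",
    "patterns", "guests", "hosts", "detection",
]

def order_is_valid(order: list[str], deps: dict[str, set[str]]) -> bool:
    """Every module must be extracted after all of its dependencies."""
    extracted: set[str] = set()
    for module in order:
        if not deps[module] <= extracted:
            return False
        extracted.add(module)
    return True
```

Running `order_is_valid(EXTRACTION_ORDER, DEPENDS_ON)` confirms the phases form a valid topological order, so each extracted module only imports modules that already exist.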
### 2. Test Audio Voice Improvements (Issue #111)
Speaker-to-voice mapping:
```python
SPEAKER_VOICE_MAP: dict[str, str] = {
    # Hosts (varied accents)
    "Maya": "Samantha",    # en_US female
    "Ethan": "Alex",       # en_US male
    "Rina": "Karen",       # en_AU female
    "Leo": "Daniel",       # en_GB male
    "Nora": "Moira",       # en_IE female
    "Alex": "Evan",        # en_US male (p07-p09)
    # Guests (varied accents for speaker distinction)
    "Liam": "Fred",        # en_US male
    "Sophie": "Flo",       # en_GB female
    "Noah": "Tom",         # en_US male
    "Priya": "Isha",       # en_IN female
    "Jonas": "Eddy",       # en_US male (distinct character)
    "Camila": "Paulina",   # es_MX female
    "Marco": "Luca",       # it_IT male
    "Hanna": "Anna",       # de_DE female
    "Owen": "Reed",        # en_US male
    "Ava": "Kathy",        # en_US female
    "Tariq": "Rishi",      # en_IN male
    "Elise": "Amelie",     # fr_CA female
    "Daniel": "Oliver",    # en_GB male
    "Isabel": "Monica",    # es_ES female
    "Kasper": "Ralph",     # en_US male
}
```
Fallback strategy:
```python
import hashlib

# Hash-based fallback for unmapped speakers
FALLBACK_VOICES = ["Albert", "Bruce", "Junior", "Nicky", "Ralph", "Shelley", "Trinoids"]

def get_voice_for_speaker(speaker_name: str, is_host: bool) -> str:
    # is_host is kept for the --host-voice/--guest-voice fallback path
    if speaker_name in SPEAKER_VOICE_MAP:
        return SPEAKER_VOICE_MAP[speaker_name]
    # Deterministic selection. Note: the built-in hash() is salted per process
    # for strings (PYTHONHASHSEED), so a stable digest is used instead.
    digest = hashlib.sha256(speaker_name.encode("utf-8")).digest()
    idx = int.from_bytes(digest[:4], "big") % len(FALLBACK_VOICES)
    return FALLBACK_VOICES[idx]
```
Changes to `transcripts_to_mp3.py`:
- Replace the binary host/guest voice logic with a per-speaker voice lookup
- Parse `Speaker: text` lines to extract the actual speaker name (not just host vs. non-host)
- Map each unique speaker name to a unique voice via `SPEAKER_VOICE_MAP`
- Keep the `--host-voice` and `--guest-voice` CLI args as fallback defaults
- Add `--list-voices` to show available macOS voices
### 3. Commercial Segments in Transcripts (Issue #109)
Placement per episode:
- Opening ad — After intro greeting, before main content (~30-60s)
- Mid-roll ad — At ~50% through content, natural conversation break (~30-60s)
- Closing ad — Before outro/signoff (~30-60s)
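The mid-roll position can be derived from the dialogue rather than hard-coded. A sketch, assuming one `Speaker: text` line per dialogue turn; the snapping rule (advance to the next host line) is an assumption, not something the fixture scripts are committed to:

```python
def midroll_index(lines: list[str], hosts: set[str]) -> int:
    """Index at which to insert the mid-roll ad, at ~50% of the episode,
    snapped forward to the next host-spoken line (or the midpoint if none)."""
    mid = len(lines) // 2
    for i in range(mid, len(lines)):
        speaker = lines[i].split(":", 1)[0]
        if speaker in hosts:
            return i
    return mid
```

Snapping to a host line keeps the "We'll be right back" handoff in the host's voice, which matches the mid-roll template below.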
Template for host-read ads:
```
Maya: This episode is brought to you by Stripe. Stripe makes it easy to accept
payments online and in person. Whether you're building a marketplace or a
subscription service, Stripe handles the complexity. Get started at
stripe.com/podcast. That's stripe.com/podcast.
```
Template for mid-roll break:
```
Maya: We'll be right back after a quick word from our sponsors.

Ad: Today's episode is sponsored by Linear. Linear is the issue tracker built
for speed. With keyboard shortcuts, powerful search, and beautiful design,
Linear makes project management effortless. Try Linear free at
linear.app/podcast.

Maya: Welcome back. We were just discussing...
```
Brand distribution across episodes:
- Use well-known podcast sponsor brands: Figma, Stripe, Linear, Notion, Vanta, Miro, Zapier, HubSpot, Squarespace, Shopify
- Each episode gets 3 different brands (opening, mid-roll, closing)
- No brand repeats within an episode; brands can repeat across episodes
- Commercial content uses phrases matching the existing `SPONSOR_PATTERNS` in `preprocessing.py`
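One way to satisfy these distribution rules deterministically is a fixed rotation; the offsets below are illustrative, not prescribed by this RFC:

```python
BRANDS = ["Figma", "Stripe", "Linear", "Notion", "Vanta",
          "Miro", "Zapier", "HubSpot", "Squarespace", "Shopify"]

def brands_for_episode(episode: int) -> tuple[str, str, str]:
    """(opening, mid-roll, closing) brands for one episode.

    Offsets 0/3/7 are pairwise distinct mod 10, so the three slots
    never collide within an episode; brands still repeat across episodes.
    """
    n = len(BRANDS)
    return (BRANDS[episode % n], BRANDS[(episode + 3) % n], BRANDS[(episode + 7) % n])
```

A fixed rotation keeps regeneration reproducible and makes the per-episode brand assignment reviewable in a diff.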
Speaker label for ads:
- Host-read ads use the host's speaker label (e.g., `Maya:`)
- Pre-recorded ads use the `Ad:` label — this exercises multi-speaker parsing
- `transcripts_to_mp3.py` needs an `Ad` voice entry (e.g., `"Ad": "Zarvox"` for a distinctly synthetic sound)
## Key Decisions
- Speaker detection: modularize, don't rewrite
  - Decision: Extract functions into submodules with identical behavior; `speaker_detection.py` becomes a re-export wrapper
  - Rationale: Zero behavior change reduces risk; the existing 245+ tests validate correctness through the refactor
- Voice mapping: explicit over hash-based
  - Decision: Curated `SPEAKER_VOICE_MAP` with a hash-based fallback for unknown speakers
  - Rationale: Explicit mapping ensures voice quality and deterministic, reviewable assignments; the fallback handles future fixture additions
- Commercial segments: host-read style
  - Decision: Most ads are host-read (using the host speaker label); one mid-roll uses the `Ad:` label
  - Rationale: Matches real podcast patterns; exercises existing cleaning detection; the `Ad:` label tests multi-speaker parsing
## Alternatives Considered
- Rewrite speaker detection from scratch
  - Why Rejected: Too risky; the existing logic is well-tested and functional. Modularization preserves behavior while improving structure.
- Use pyttsx3 instead of macOS `say` for test audio
  - Why Rejected: pyttsx3's voice variety is limited and platform-dependent; macOS `say` has 20+ distinct English voices with accent variety. Test audio generation is already macOS-only.
- Generate fixture audio in CI (Linux)
  - Why Rejected: Linux TTS options (espeak) have lower quality; fixture audio should be pre-generated and stable. Regeneration is an infrequent developer operation.
## Testing Strategy
Speaker detection refactor:
- All existing tests in `tests/unit/podcast_scraper/test_speaker_detection.py` must pass unchanged
- All integration tests in `tests/integration/providers/ollama/test_*_speaker.py` must pass
- After the refactor, add per-module unit tests in `tests/unit/podcast_scraper/speaker_detectors/`
- Run `make ci-fast` after each extraction phase
Test audio improvements:
- Regenerate all fixture audio with unique voices: `cd tests/fixtures/scripts && bash generate_audio.sh`
- Manual verification: listen to samples to confirm voice distinction
- If RFC-058 diarization is implemented, run pyannote on fixture audio to verify it can distinguish speakers
Commercial segments:
- After adding commercials, verify that `remove_sponsor_blocks()` correctly removes the commercial segments
- Validate with the evolving eval tooling that cleaning quality improves
- Check that summaries generated from cleaned transcripts exclude sponsor content
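A cheap smoke test before running the full cleaning pipeline is to count sponsor-cue lines in a fixture transcript. The patterns below are illustrative stand-ins, not the real `SPONSOR_PATTERNS` from `preprocessing.py`:

```python
import re

# Illustrative cues only; the actual SPONSOR_PATTERNS in preprocessing.py differ.
_SPONSOR_CUES = [
    re.compile(r"brought to you by", re.IGNORECASE),
    re.compile(r"sponsored by", re.IGNORECASE),
    re.compile(r"word from our sponsors", re.IGNORECASE),
]

def count_sponsor_lines(transcript: str) -> int:
    """Number of transcript lines that trip at least one sponsor cue."""
    return sum(
        1
        for line in transcript.splitlines()
        if any(pattern.search(line) for pattern in _SPONSOR_CUES)
    )
```

A fixture with three ad slots should trip the cues at least three times before cleaning and zero times after; large deviations flag either a template or a pattern problem.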
Test Organization:
```
tests/
├── unit/podcast_scraper/
│   └── speaker_detectors/     # NEW — per-module unit tests
│       ├── test_constants.py
│       ├── test_ner.py
│       ├── test_entities.py
│       ├── test_normalization.py
│       ├── test_patterns.py
│       ├── test_guests.py
│       ├── test_hosts.py
│       └── test_detection.py
```
## Rollout & Monitoring
Rollout Plan:
- Phase 1 (~6h): Speaker detection modularization (Issue #269)
  - Extract modules one at a time; run `make ci-fast` after each
  - PR with a zero-diff behavior guarantee
- Phase 2 (~4h): Test audio voice improvements (Issue #111)
  - Update `transcripts_to_mp3.py` with the voice map
  - Regenerate fixture audio
  - Verify voice distinction
- Phase 3 (~4h): Commercial segments (Issue #109)
  - Add sponsor blocks to all 15 transcripts
  - Regenerate audio
  - Validate cleaning with the evolving eval tooling
Success Criteria:
- All existing tests pass after speaker detection refactor (zero behavior change)
- Each fixture speaker has a distinct TTS voice (verified by listening)
- Commercial segments are detectable by the cleaning pipeline and validated via eval tooling
## Relationship to Other RFCs
This RFC (RFC-059) is the companion to RFC-058 (Audio-Based Speaker Diarization):
- RFC-058: Adds the diarization capability (pyannote integration)
- RFC-059 (this): Improves the infrastructure that diarization depends on
Dependencies:
- RFC-058's diarization name-mapping integrates into the modularized `speaker_detectors/detection.py`
- RFC-058's integration tests depend on the unique-voice test audio from this RFC
## Benefits
- Maintainable speaker detection: eight focused modules instead of one monolith
- Testable in isolation: Each concern can be unit-tested independently
- Diarization-ready: Clean extension points for RFC-058 name-mapping
- Realistic test fixtures: Unique voices enable meaningful diarization testing
- Cleaning validation: Commercial segments enable `remove_sponsor_blocks()` testing
## Migration Path
- Speaker detection: Existing imports from `speaker_detection.py` continue to work (re-export wrapper). No external API changes.
- Test audio: Regenerate fixtures; Git LFS handles the audio file updates. No test code changes (same filenames).
- Commercial segments: Transcript files updated in-place; tests that check transcript content may need minor updates.
## Open Questions
- Should the refactored speaker detection modules live in `providers/ml/speaker_detection/` (closer to the current location) or `speaker_detectors/` (the current plan, which already has `base.py` and `factory.py`)?
- Should voice assignments in `SPEAKER_VOICE_MAP` try to match speaker name demographics (e.g., Priya → an Indian-accented voice)?
- Should commercial segments use exact timestamps or relative positions?
## References
- Issue #269: Refactor `speaker_detection.py` into `speaker_detectors/` submodules
- Issue #111: Use unique voices for each speaker in generated mock podcast audio
- Issue #109: Add commercial segments to mock podcast transcripts
- Issue #414: Audio Pipeline Separation
- Source Code: `src/podcast_scraper/providers/ml/speaker_detection.py`
- Source Code: `tests/fixtures/scripts/transcripts_to_mp3.py`
- Fixture Spec: `tests/fixtures/FIXTURES_SPEC.md`