PRD-020: Audio-Based Speaker Diarization & Commercial Content Cleaning
- Status: Draft
- Related RFCs: RFC-058, RFC-059, RFC-060
- Supersedes non-goals in: PRD-008
Summary
Add true audio-based speaker diarization to the local transcription pipeline, replacing the current gap-based round-robin speaker rotation with neural voice-embedding-driven attribution. This enables accurate "who said what" labeling in screenplay-format transcripts, especially for interviews with rapid exchanges, multi-guest panels, and overlapping speech.
Downstream value — commercial content cleaning: Accurate diarization unlocks a critical downstream capability: identifying and removing host-read sponsor segments from transcripts before summarization. Commercial content leaking into summaries, quotes, and knowledge graphs fundamentally breaks the pipeline's core value. While expanded pattern matching and positional heuristics can improve commercial detection independently (RFC-060 Phase 1), diarization provides the "who + when" context needed to catch host-read ads that blend seamlessly into conversation (RFC-060 Phase 2). This makes diarization not just an accuracy improvement but an enabler of clean, trustworthy output across the entire pipeline.
Background & Context
The current speaker attribution system has two separate layers that do fundamentally different jobs:
- Name detection (PRD-008, spaCy NER): Extracts host/guest names from RSS metadata and episode titles. This identifies who the speakers are but not when they speak.
- Screenplay formatting (`format_screenplay_from_segments()`): Assigns detected names to Whisper segments using a time-gap heuristic: when the silence between segments exceeds `screenplay_gap_s`, it rotates to the next speaker index round-robin. This has no knowledge of actual voice characteristics.
This gap-based rotation is fundamentally unreliable:
- Rapid exchanges (< gap threshold): Consecutive turns from different speakers get merged into one speaker
- Long pauses by the same speaker: Trigger a false speaker switch
- Multi-guest panels (3+ speakers): Round-robin cycles through speakers in fixed order, producing nonsensical attribution
- Cross-talk / overlap: Merged into whichever speaker index is current
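The failure modes above can be reproduced with a simplified model of the gap heuristic. The sketch below is not the actual `format_screenplay_from_segments()` implementation; `Segment`, `rotate_speakers_by_gap`, and the timings are hypothetical and only assume the behavior described here (rotate to the next speaker index when the silence exceeds a threshold).

```python
from dataclasses import dataclass


@dataclass
class Segment:
    start: float  # seconds
    end: float    # seconds
    text: str


def rotate_speakers_by_gap(segments, num_speakers, gap_s):
    """Return one speaker index per segment using only silence gaps (no audio)."""
    labels = []
    current = 0
    prev_end = None
    for seg in segments:
        # Rotate round-robin whenever the pause before this segment is long enough.
        if prev_end is not None and seg.start - prev_end > gap_s:
            current = (current + 1) % num_speakers
        labels.append(current)
        prev_end = seg.end
    return labels


# Hypothetical timings: the host asks, the guest answers after 0.2 s,
# then the guest pauses for 1.5 s in the middle of the sentence.
segments = [
    Segment(0.0, 4.0, "So tell me about your journey."),
    Segment(4.2, 9.0, "Well, it started about five years ago when I was working at"),
    Segment(10.5, 12.0, "a startup in San Francisco."),
]
true_speakers = [0, 1, 1]
labels = rotate_speakers_by_gap(segments, num_speakers=2, gap_s=1.0)
print(labels)  # [0, 0, 1]
```

With these inputs the guest's reply is merged into the host's turn (rapid exchange), and the pause inside the guest's sentence triggers a speaker switch, so the single guest turn ends up split across two labels.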
PRD-008 explicitly listed audio-based diarization as a non-goal. With the maturation of pyannote.audio (v4.0+) and WhisperX, production-quality diarization is now accessible as an optional ML dependency, making it practical to offer as an opt-in enhancement.
Goals
- Replace gap-based speaker rotation with neural speaker diarization for local Whisper transcription
- Automatically detect the number of speakers from audio (eliminate manual `screenplay_num_speakers`)
- Produce screenplay transcripts with accurate, voice-consistent speaker labels
- Maintain the existing NER name-mapping pipeline (detect names from metadata, then map to diarized speaker IDs)
- Offer diarization as an optional, opt-in feature behind a CLI flag (`--diarize`)
- Keep the current gap-based fallback as default for users without GPU or HuggingFace token
- Enable diarization-enhanced commercial detection (RFC-060 Phase 2) — providing the "who + when" context that makes host-read sponsor identification reliable
Non-Goals
- Real-time / streaming diarization during transcription
- Cross-episode speaker identification (recognizing the same speaker across different episodes via voiceprints) — future consideration
- Replacing cloud transcription providers (OpenAI, Gemini, Mistral APIs) — diarization applies only to local Whisper path
- Custom speaker model training or fine-tuning
Personas
- Podcast Archivist: Maintains large transcript archives
  - Needs accurate speaker labels across hundreds of episodes
  - Gap-based rotation produces unreliable archives that require manual correction
- Researcher / Analyst: Studies conversation dynamics, discourse patterns
  - Needs to know exactly who said what for citation and analysis
  - Current round-robin attribution makes quantitative speaker analysis meaningless
- Developer / Integrator: Builds downstream tools on top of transcripts
  - Needs structured, reliable speaker-attributed segments for NLP pipelines
  - Unreliable attribution propagates errors into downstream systems (KG, GIL)
User Stories
- As a Podcast Archivist, I can enable `--diarize` so that my screenplay transcripts accurately attribute each line to the correct speaker based on their voice.
- As a Researcher, I can process interview episodes and get correctly separated host vs. guest turns so that I can analyze speaking patterns and extract per-speaker quotes.
- As a Developer, I can access diarized segments with speaker IDs and timestamps so that I can build speaker-aware downstream features (knowledge graphs, quote extraction).
- As any operator, I can run without `--diarize` and get the same gap-based behavior as before so that nothing breaks if I lack a GPU or HuggingFace token.
Functional Requirements
FR1: Diarization Pipeline
- FR1.1: When `--diarize` is enabled, run speaker diarization on the audio file after (or during) Whisper transcription to produce a speaker timeline: `[(start, end, speaker_id), ...]`
- FR1.2: Align Whisper transcription segments with the diarization timeline to assign each segment (or word) a speaker ID
- FR1.3: Auto-detect the number of speakers from audio (override with `--num-speakers N` if known)
- FR1.4: Support a minimum of 2 and maximum of 20 speakers
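FR1.2 leaves the alignment method open. One common approach, sketched below as an assumption rather than the chosen design, gives each Whisper segment the speaker whose diarization turns overlap it for the longest total time. The function name `assign_speakers` and the tuple shapes are hypothetical; only the `(start, end, speaker_id)` timeline format comes from FR1.1.

```python
def assign_speakers(segments, timeline):
    """Label each segment with the speaker that overlaps it the most.

    segments: list of (start, end, text) tuples from transcription.
    timeline: list of (start, end, speaker_id) tuples from diarization.
    Returns a list of (speaker_id or None, text).
    """
    assigned = []
    for seg_start, seg_end, text in segments:
        overlap_by_speaker = {}
        for turn_start, turn_end, speaker in timeline:
            # Length of the intersection of the two intervals (negative if disjoint).
            overlap = min(seg_end, turn_end) - max(seg_start, turn_start)
            if overlap > 0:
                overlap_by_speaker[speaker] = overlap_by_speaker.get(speaker, 0.0) + overlap
        if overlap_by_speaker:
            speaker = max(overlap_by_speaker, key=overlap_by_speaker.get)
        else:
            speaker = None  # no diarized speech overlaps this segment
        assigned.append((speaker, text))
    return assigned


# Hypothetical timeline and segments, in seconds.
timeline = [(0.0, 4.1, "SPEAKER_00"), (4.1, 12.0, "SPEAKER_01")]
segments = [
    (0.0, 4.0, "So tell me about your journey."),
    (4.2, 9.0, "Well, it started about five years ago when I was working at"),
    (10.5, 12.0, "a startup in San Francisco."),
]
aligned = assign_speakers(segments, timeline)
for speaker, text in aligned:
    print(f"{speaker}: {text}")
```

Because the label depends on overlap with voice-based turns instead of silence length, the pause inside the second speaker's sentence no longer causes a switch. A segment that spans a real speaker change still gets a single label, which is the limitation of segment-level assignment.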
FR2: Speaker Name Mapping
- FR2.1: Map diarized speaker IDs (e.g., `SPEAKER_00`, `SPEAKER_01`) to detected names from the existing NER pipeline (hosts from feed metadata, guests from episode titles)
- FR2.2: Use heuristic mapping: first speaker in intro → likely host; map to detected host name. Remaining speakers → map to detected guest names in order
- FR2.3: Fall back to `SPEAKER_00`, `SPEAKER_01`, etc. when NER name detection produces fewer names than diarized speakers
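A minimal sketch of the FR2.2 and FR2.3 heuristic follows. It assumes "first speaker in intro" can be approximated by the speaker with the earliest turn, and that a single host name is available; `map_speaker_names` and its parameters are hypothetical names, not part of the existing NER pipeline.

```python
def map_speaker_names(timeline, host_name, guest_names):
    """Map diarized speaker IDs to detected names.

    timeline: list of (start, end, speaker_id) tuples.
    host_name: detected host name, or None if NER found none.
    guest_names: detected guest names, in detection order.
    """
    # Order speaker IDs by the start time of their first turn.
    ordered_ids = []
    for _start, _end, speaker in sorted(timeline):
        if speaker not in ordered_ids:
            ordered_ids.append(speaker)

    mapping = {}
    remaining_guests = list(guest_names)
    for position, speaker in enumerate(ordered_ids):
        if position == 0 and host_name:
            mapping[speaker] = host_name                # first voice heard: assume host
        elif position > 0 and remaining_guests:
            mapping[speaker] = remaining_guests.pop(0)  # later voices: guests in order
        else:
            mapping[speaker] = speaker                  # FR2.3: keep the raw ID
    return mapping


timeline = [
    (0.0, 4.1, "SPEAKER_00"),
    (4.1, 12.0, "SPEAKER_01"),
    (12.0, 15.0, "SPEAKER_02"),
]
names = map_speaker_names(timeline, "Lenny Rachitsky", ["Sarah Chen"])
print(names)
```

Here three voices are diarized but only two names are detected, so the third speaker keeps its `SPEAKER_02` label instead of receiving a guessed name.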
FR3: Integration with Existing Pipeline
- FR3.1: Diarization must be optional: disabled by default, enabled via `--diarize` CLI flag or `diarize: true` in YAML config
- FR3.2: When disabled, current gap-based `format_screenplay_from_segments()` behavior is preserved exactly
- FR3.3: Diarized output uses the same screenplay text format (`SPEAKER: text\n`) for backward compatibility
- FR3.4: Diarization result is cached alongside transcript cache (keyed by audio content hash + diarization config)
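One way to build the FR3.4 cache key is to hash the audio bytes and a canonical serialization of the diarization options separately, then combine the two digests. This is a sketch under assumptions: `diarization_cache_key`, the key layout, and the choice of SHA-256 are illustrative, and the config dict should hold only options that change the result (for example speaker bounds), not secrets such as the HuggingFace token.

```python
import hashlib
import json


def diarization_cache_key(audio_bytes: bytes, config: dict) -> str:
    """Build a cache key from audio content and diarization settings."""
    audio_hash = hashlib.sha256(audio_bytes).hexdigest()
    # sort_keys makes the serialization independent of dict insertion order.
    config_blob = json.dumps(config, sort_keys=True, separators=(",", ":")).encode("utf-8")
    config_hash = hashlib.sha256(config_blob).hexdigest()
    return f"{audio_hash[:16]}-{config_hash[:16]}"


audio = b"placeholder audio bytes"  # stand-in for the preprocessed audio file contents
key_a = diarization_cache_key(audio, {"num_speakers": None, "min_speakers": 2, "max_speakers": 20})
key_b = diarization_cache_key(audio, {"max_speakers": 20, "min_speakers": 2, "num_speakers": None})
key_c = diarization_cache_key(audio, {"num_speakers": 3, "min_speakers": 2, "max_speakers": 20})
print(key_a == key_b, key_a == key_c)
```

The same audio with the same settings reuses the cached result regardless of option order, while changing `num_speakers` produces a different key and forces a fresh diarization run.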
FR4: Configuration
- FR4.1: New CLI flags and config options:

| Option | Default | Description |
|---|---|---|
| `diarize` | `false` | Enable audio-based speaker diarization |
| `hf_token` | `None` | HuggingFace access token for pyannote models |
| `num_speakers` | `None` (auto) | Override auto-detected speaker count |
| `min_speakers` | `2` | Minimum speakers for diarization |
| `max_speakers` | `20` | Maximum speakers for diarization |
| `diarization_device` | `auto` | Device for diarization model (`cpu`, `cuda`, `mps`) |

- FR4.2: HuggingFace token can be provided via `--hf-token`, `HF_TOKEN` env var, or `~/.huggingface/token`
- FR4.3: Clear error message when `--diarize` is used without a valid HuggingFace token
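FR4.2 names three token sources but not their precedence. The sketch below assumes the order CLI flag, then `HF_TOKEN`, then the token file, and raises a descriptive error for FR4.3 when none is found. `resolve_hf_token` and `MissingHuggingFaceTokenError` are hypothetical names, and checking that the token is actually valid for the gated models is out of scope here.

```python
import os
from pathlib import Path


class MissingHuggingFaceTokenError(RuntimeError):
    """Raised when --diarize is requested but no token can be found."""


def resolve_hf_token(cli_token=None, env=None, token_file=None):
    """Return the first non-empty token from: CLI value, HF_TOKEN, token file."""
    env = os.environ if env is None else env
    if token_file is None:
        token_file = Path.home() / ".huggingface" / "token"
    token_file = Path(token_file)

    if cli_token and cli_token.strip():
        return cli_token.strip()

    env_token = env.get("HF_TOKEN", "").strip()
    if env_token:
        return env_token

    if token_file.is_file():
        file_token = token_file.read_text(encoding="utf-8").strip()
        if file_token:
            return file_token

    raise MissingHuggingFaceTokenError(
        "--diarize requires a HuggingFace token. Pass --hf-token, set the "
        "HF_TOKEN environment variable, or save the token to ~/.huggingface/token."
    )


# Explicit env and file arguments keep the example independent of the machine it runs on.
token = resolve_hf_token(cli_token=None, env={"HF_TOKEN": "hf_example"}, token_file="/nonexistent/token")
print(token)
```

Passing `env` and `token_file` explicitly is a design choice that makes the lookup easy to unit test without touching the real environment or home directory.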
FR5: Audio Preprocessing Compatibility
- FR5.1: Diarization must work with the existing FFmpeg preprocessing pipeline (mono, resample, loudnorm)
- FR5.2: If preprocessing is enabled, diarize the preprocessed audio (not the raw download)
Success Metrics
- Diarization Error Rate (DER) < 15% on a benchmark set of 10+ podcast episodes with known speaker turns
- Screenplay attribution accuracy: >= 85% of lines attributed to the correct speaker (manual spot-check on 5 episodes)
- Processing overhead: < 2x total wall-clock time compared to Whisper-only (with GPU)
- Zero regression: all existing tests pass with `diarize=false` (default)
- Zero impact on cloud provider paths (OpenAI, Gemini, Mistral transcription)
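For reference, DER is conventionally the sum of missed speech, false-alarm speech, and speaker-confusion time, divided by the total reference speech time. The sketch below applies that definition and the attribution-accuracy ratio to made-up numbers; the durations and line counts are illustrative inputs, not benchmark results, and the helper names are hypothetical.

```python
def diarization_error_rate(missed_s, false_alarm_s, confusion_s, reference_speech_s):
    """DER = (missed + false alarm + confusion) / total reference speech, all in seconds."""
    if reference_speech_s <= 0:
        raise ValueError("reference_speech_s must be positive")
    return (missed_s + false_alarm_s + confusion_s) / reference_speech_s


def attribution_accuracy(correct_lines, total_lines):
    """Share of screenplay lines attributed to the correct speaker."""
    if total_lines <= 0:
        raise ValueError("total_lines must be positive")
    return correct_lines / total_lines


# Illustrative values only: a 50-minute episode of annotated speech.
der = diarization_error_rate(
    missed_s=60.0, false_alarm_s=45.0, confusion_s=150.0, reference_speech_s=3000.0
)
accuracy = attribution_accuracy(correct_lines=180, total_lines=200)

meets_der_target = der < 0.15            # DER target from the metrics above
meets_accuracy_target = accuracy >= 0.85  # attribution target from the metrics above
print(f"DER={der:.3f} accuracy={accuracy:.2f}")
```

In practice the three error durations would come from comparing the diarization output against hand-annotated speaker turns for each benchmark episode.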
Dependencies
- PRD-008: Automatic Speaker Name Detection (NER pipeline provides names to map onto diarized speaker IDs)
- PRD-002: Whisper Fallback Transcription (provides Whisper segments to align with diarization)
- RFC-040: Audio Preprocessing (preprocessed audio feeds into diarization)
- External: HuggingFace account + accepted model terms for pyannote gated models
Constraints & Assumptions
Constraints:
- GPU strongly recommended for practical diarization speed (CPU: ~8.5 min for 60 min audio; GPU: ~75 sec with WhisperX)
- HuggingFace token required for pyannote model access (gated models)
- Audio must be available locally (diarization cannot run on cloud-transcribed episodes without audio)
- Added dependency weight: pyannote.audio pulls in speechbrain, asteroid, torchaudio
Assumptions:
- Users opting into `--diarize` have GPU access or accept significantly longer processing times on CPU
- HuggingFace model terms are acceptable for the user's use case
- Podcast audio quality is generally sufficient for diarization (studio-recorded or high-quality remote)
- Most podcast episodes have 2-4 speakers
Design Considerations
DC1: WhisperX (Bundled) vs. Whisper + pyannote (Separate)
- Option A: WhisperX. Single library that wraps Whisper + pyannote + wav2vec2 alignment
  - Pros: Single pipeline, batched inference (70x realtime), word-level timestamps (~50ms), VAD pre-filtering, 228x speedup in speaker assignment, near 100% GPU utilization
  - Cons: Replaces current Whisper integration entirely, heavier dependency, slightly lower diarization accuracy (~5% more misattributions vs raw pyannote), less control over individual components
- Option B: Whisper (current) + pyannote.audio (new). Keep existing Whisper, add pyannote as a second pass
  - Pros: Minimal changes to existing Whisper code, best diarization accuracy, modular (can swap diarization engine later)
  - Cons: Manual alignment code needed, pyannote has known GPU utilization issues (~10%), slower overall, two separate model loads
- Recommendation: Start with Option B (additive pyannote) for lower integration risk. Evaluate Option A (WhisperX) as a future optimization if diarization becomes a core feature and speed matters more. Option B preserves all existing Whisper code paths and tests.
DC2: Diarization Scope
- Segment-level diarization: Assign one speaker per Whisper segment (~5-30 sec chunks). Simpler, matches current screenplay format.
- Word-level diarization: Assign speakers per word. Higher accuracy for rapid exchanges but requires WhisperX or custom forced alignment.
- Recommendation: Start with segment-level (aligns with Option B above). Word-level is a natural follow-up if WhisperX is adopted.
Integration with Existing Pipeline
Audio-based diarization touches the transcript pipeline at the following points:
- Transcription stage: After Whisper produces segments, an optional diarization step assigns speaker IDs to each segment
- Speaker detection stage: NER-detected names (from PRD-008) are mapped to diarized speaker IDs instead of being assigned round-robin
- Screenplay formatting: `format_screenplay_from_segments()` uses diarized speaker assignments instead of gap-based rotation
- Cache: Diarization results cached alongside transcripts to avoid re-processing
- GIL / KG downstream: More accurate speaker attribution improves quote extraction and knowledge graph speaker nodes
- Commercial cleaning: Diarization data feeds into the `CommercialDetector` (RFC-060) to boost confidence when identifying host-read sponsor segments; host monologues in mid-episode, topic discontinuity, and single-speaker candidate regions are strong signals
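To illustrate the "host monologue in mid-episode" signal, the sketch below merges consecutive turns by the same speaker and keeps long host-only runs that sit away from the start and end of the episode. It does not use or describe the `CommercialDetector` interface from RFC-060; the function name, the 45-second minimum, and the 10% edge margin are all hypothetical choices for illustration.

```python
def host_monologue_candidates(timeline, host_id, episode_duration_s,
                              min_length_s=45.0, edge_fraction=0.1):
    """Return (start, end) host-only runs located away from the episode edges.

    timeline: list of (start, end, speaker_id) tuples.
    """
    # Merge back-to-back turns by the same speaker into single runs.
    runs = []
    for start, end, speaker in sorted(timeline):
        if runs and runs[-1][2] == speaker:
            runs[-1] = (runs[-1][0], end, speaker)
        else:
            runs.append((start, end, speaker))

    earliest = episode_duration_s * edge_fraction
    latest = episode_duration_s * (1 - edge_fraction)
    return [
        (start, end)
        for start, end, speaker in runs
        if speaker == host_id
        and end - start >= min_length_s
        and start >= earliest
        and end <= latest
    ]


# Hypothetical 30-minute episode where SPEAKER_00 is the host.
timeline = [
    (0.0, 30.0, "SPEAKER_00"),      # intro, too close to the start
    (30.0, 600.0, "SPEAKER_01"),
    (600.0, 640.0, "SPEAKER_00"),   # two adjacent host turns...
    (640.0, 700.0, "SPEAKER_00"),   # ...merge into one 100 s run
    (700.0, 1800.0, "SPEAKER_01"),
]
candidates = host_monologue_candidates(timeline, "SPEAKER_00", episode_duration_s=1800.0)
print(candidates)  # [(600.0, 700.0)]
```

A run flagged this way is only a candidate region. Text-based signals such as sponsor patterns or topic discontinuity would still be needed to decide whether it is actually a sponsor read.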
Example Output
Current (gap-based, often wrong):
Lenny Rachitsky: Welcome back to the show. Today I'm joined by an amazing guest.
Sarah Chen: Thank you so much for having me, Lenny. I've been looking forward to this.
Lenny Rachitsky: So tell me about your journey.
Sarah Chen: Well, it started about five years ago when I was working at...
Lenny Rachitsky: ...a startup in San Francisco.
Note: The last line is misattributed — it's Sarah continuing her sentence after a brief pause, but the gap triggered a speaker rotation.
With diarization (correct):
Lenny Rachitsky: Welcome back to the show. Today I'm joined by an amazing guest.
Sarah Chen: Thank you so much for having me, Lenny. I've been looking forward to this.
Lenny Rachitsky: So tell me about your journey.
Sarah Chen: Well, it started about five years ago when I was working at a startup in San Francisco.
Open Questions
- Should diarization be part of the `ml` optional extra or a separate `diarize` extra in `pyproject.toml`?
- How to handle the HuggingFace token UX: should we guide users through model acceptance on first run?
- Should we expose diarization confidence scores in the output (e.g., per-segment confidence)?
- Priority: segment-level only, or invest in word-level from the start?
Related Work
- PRD-008: Automatic Speaker Name Detection (current NER-based system, lists audio diarization as non-goal)
- RFC-010: Speaker Name Detection Technical Design
- RFC-005: Whisper Integration
- RFC-040: Audio Preprocessing Pipeline
- pyannote/pyannote-audio — v4.0.4 (Feb 2026)
- m-bain/whisperX — v3.3.2, bundles Whisper + pyannote + alignment
- pyannote.audio documentation
Release Checklist
- [ ] PRD reviewed and approved
- [ ] RFC-NNN created with technical design (diarization provider, alignment algorithm, config schema)
- [ ] pyannote.audio added to optional `[diarize]` extra in `pyproject.toml`
- [ ] `DiarizationProvider` implemented with segment-level speaker assignment
- [ ] Alignment logic: Whisper segments mapped to pyannote speaker IDs
- [ ] NER name mapping: diarized speaker IDs -> detected host/guest names
- [ ] CLI flags: `--diarize`, `--hf-token`, `--num-speakers`
- [ ] Cache integration: diarization results cached by audio content hash + diarization config (per FR3.4)
- [ ] Tests: unit tests for alignment, integration tests with sample audio
- [ ] Documentation: README section, config examples, HuggingFace setup guide
- [ ] Benchmark: DER measured on 10+ episodes, results documented
- [ ] Backward compatibility verified: all existing tests pass with `diarize=false`