ADR-065: Additive pyannote Diarization with Separate [diarize] Extra
- Status: Accepted
- Date: 2026-04-03
- Authors: Podcast Scraper Team
- Related RFCs: RFC-058
- Related PRDs: PRD-020
Context & Problem Statement
The current screenplay formatting uses a time-gap heuristic to assign speakers: if a
silence gap exceeds gap_s, the next segment is attributed to a different speaker via
round-robin cycling. This produces systematically wrong attribution in rapid exchanges,
same-speaker pauses, and multi-guest panels.
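To make the failure mode concrete, here is a minimal sketch of a gap-based round-robin assigner. It is not the project's actual code: the `Segment` type, the `assign_speakers_by_gap` name, the `SPEAKER_n` label format, and the 1.0 s default for `gap_s` are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    start: float  # seconds
    end: float    # seconds
    text: str


def assign_speakers_by_gap(segments, gap_s=1.0, num_speakers=2):
    """Round-robin labels: move to the next speaker whenever silence exceeds gap_s."""
    labels = []
    speaker = 0
    prev_end = None
    for seg in segments:
        if prev_end is not None and seg.start - prev_end > gap_s:
            speaker = (speaker + 1) % num_speakers
        labels.append(f"SPEAKER_{speaker}")
        prev_end = seg.end
    return labels


segments = [
    Segment(0.0, 2.0, "Welcome to the show."),    # host
    Segment(2.2, 4.0, "Thanks for having me."),   # guest answers after 0.2 s
    Segment(6.0, 8.0, "So, as I was saying..."),  # guest resumes after a 2.0 s pause
]
labels = assign_speakers_by_gap(segments, gap_s=1.0)
print(labels)
```

In this toy input the quick reply stays attributed to the host (rapid exchange) and the guest's own pause triggers a speaker change (same-speaker pause), so two of the three segments end up mislabeled.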
Neural speaker diarization (voice-embedding-driven "who said what") solves this, but there are multiple integration strategies: replace Whisper entirely with WhisperX, add pyannote as a second pass after Whisper, or use cloud diarization APIs.
Decision
We add pyannote.audio as an additive second pass after Whisper transcription, with diarization as a separate optional dependency group.
- Additive, not replacement: Whisper transcription is preserved as-is. pyannote runs after Whisper to produce speaker timelines, which are aligned to Whisper segments via maximum-overlap matching. The existing gap-based path remains the default (`diarize=false`).
- Segment-level diarization: One speaker is assigned per Whisper segment (not per word). This matches the current screenplay format and avoids the complexity of forced word-level alignment.
- Waveform loading via `torchaudio`: Audio is loaded with `torchaudio.load()` and passed as a waveform to pyannote, avoiding a known 4x performance penalty when passing file paths (pyannote issue #1702).
- Separate `[diarize]` optional extra: pyannote adds `speechbrain`, `asteroid`, and HuggingFace model downloads, which makes it significantly heavier than the base `[ml]` extra. Users who want Whisper without diarization should not pay this dependency cost. Install via `pip install -e ".[diarize]"`.
- Lazy import: pyannote dependencies are imported at function level (matching the ADR-005 pattern) so the package loads normally without `[diarize]` installed.
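The segment-level, maximum-overlap alignment chosen above can be sketched as follows. This is an illustrative sketch, not the shipped implementation: the function names, the tuple shapes, and the `SPEAKER_00` default label are assumptions. Summing overlap per speaker (rather than taking the single longest turn) is also an assumption. The carry-forward behaviour for segments that overlap no turn follows the Implementation Notes.

```python
def overlap(a_start, a_end, b_start, b_end):
    """Length in seconds of the intersection of two intervals (0.0 if disjoint)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def align_speakers(segments, turns, default="SPEAKER_00"):
    """Assign one speaker per segment.

    segments: list of (start, end) from the transcriber.
    turns:    list of (start, end, speaker) from the diarizer.
    """
    labels = []
    previous = default
    for seg_start, seg_end in segments:
        totals = {}
        for t_start, t_end, speaker in turns:
            dur = overlap(seg_start, seg_end, t_start, t_end)
            if dur > 0:
                totals[speaker] = totals.get(speaker, 0.0) + dur
        if totals:
            # Speaker with the largest total overlap wins the segment.
            previous = max(totals, key=totals.get)
        # If nothing overlaps, carry the previous speaker forward.
        labels.append(previous)
    return labels


segments = [(0.0, 3.0), (3.0, 5.0), (5.2, 5.6), (6.0, 9.0)]
turns = [
    (0.0, 2.5, "SPEAKER_00"),
    (2.5, 5.0, "SPEAKER_01"),
    (6.0, 9.0, "SPEAKER_00"),
]
aligned = align_speakers(segments, turns)
print(aligned)
```

The third segment falls in a gap between diarization turns, so it inherits `SPEAKER_01` from the segment before it.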
Rationale
- Additive: Lower integration risk than replacing the entire Whisper pipeline with WhisperX. All existing Whisper code paths, tests, and behaviors are preserved. If pyannote has issues, the gap-based fallback works.
- Segment-level: Simpler alignment algorithm, matches current screenplay format. Word-level diarization requires forced alignment (WhisperX territory) and can be added later if needed.
- Waveform loading: Measured 12 s with a preloaded waveform vs 50 s with a file path for 3-minute clips, consistent with the roughly 4x penalty reported upstream. This is a straightforward performance optimization that should always be used.
- Separate extra: pyannote's transitive dependency tree (`speechbrain`, `asteroid`, HuggingFace model downloads) is large. Bundling it into `[ml]` would penalize every ML user, not just those who want diarization.
Alternatives Considered
- WhisperX as full pipeline replacement: Rejected; replaces proven Whisper integration, slightly lower diarization accuracy (~5%), larger blast radius. Can be evaluated after diarization proves value.
- Voice-activity-based speaker change detection (Silero VAD): Rejected; still a heuristic with no voice identity. Only marginally better than gap-based rotation.
- Cloud diarization APIs (Google, AssemblyAI): Rejected; per-minute costs, vendor lock-in, no offline support. Conflicts with local-first philosophy.
- Merge `[diarize]` into `[ml]`: Rejected; significantly increases dependency weight for all ML users.
Consequences
- Positive: Accurate, voice-based speaker attribution. Auto speaker count detection (eliminates manual `screenplay_num_speakers`). Multi-speaker panels correctly handled. Downstream quality improves for GIL quotes and KG speaker nodes.
- Negative: HuggingFace token required (gated model). GPU strongly recommended (CPU ~8.5 min vs GPU ~1.5 min for 60 min audio). New dependency group to maintain.
- Neutral: `diarize=false` is the default. Zero impact on users who do not opt in.
Implementation Notes
- Module: `src/podcast_scraper/providers/ml/diarization/`
- Protocol: `DiarizationProvider` (PEP 544) with `diarize()` method
- Alignment: Maximum-overlap matching between Whisper segments and diarization timeline, with carry-forward for gaps
- Caching: Diarization results cached by `sha256(audio_content) + diarization_config_hash` in `.cache/diarization/`
- Config: `diarize: bool = False`, `hf_token`, `num_speakers`, `min_speakers`, `max_speakers`, `diarization_device`, `diarization_model`
- Relationship to ADR-005: Follows lazy ML dependency loading pattern
- Relationship to ADR-036: Preprocessed audio (RFC-040) feeds into diarization