PRD-008: Automatic Speaker Name Detection¶
- Status: ✅ Implemented (v2.1.0)
- Related RFCs: RFC-010
Summary¶
Automatically detect and label speaker names (hosts and guests) in Whisper transcriptions using Named Entity Recognition (NER), eliminating the need for manual --speaker-names configuration.
Background & Context¶
When Whisper generates transcriptions, speakers are labeled generically as "Host" and "Guest" unless the user manually provides names via --speaker-names. This manual step:
- Requires feed-specific knowledge that users may not have
- Breaks when episodes have unique guests
- Adds friction when running at scale or embedding the pipeline
- Reduces transcript quality and readability
Users want transcripts that automatically use real speaker names without requiring manual configuration for each podcast or episode.
Goals¶
- Automatically extract host names from RSS feed metadata (author tags, titles, descriptions)
- Automatically detect guest names from episode metadata (titles, descriptions)
- Provide speaker names to Whisper screenplay formatting for higher-quality transcripts
- Maintain backward compatibility with manual
--speaker-namesas fallback - Support multiple languages through configurable NER models
Non-Goals¶
- Voice fingerprinting or audio-based speaker identification
- Multi-speaker diarization beyond host/guest distinction
- Real-time speaker detection during transcription
User Stories¶
-
As a podcast archivist, I want transcripts to automatically show "Lenny Rachitsky" and "guest name" instead of "Host" and "Guest" so my archive is more searchable and readable.
-
As a researcher, I want to process hundreds of episodes without manually looking up speaker names for each one.
-
As a developer integrating the pipeline, I want automatic speaker detection so I don't need to maintain a database of podcast hosts.
Functional Requirements¶
Host Detection (Feed-Level)¶
- Extract host names from RSS author tags (highest priority):
- RSS 2.0
<author>tag - iTunes
<itunes:author>tag -
iTunes
<itunes:owner><itunes:name>tag -
Fall back to NER extraction from feed title/description if no author tags
-
Validate detected hosts by checking if they appear in episode content
Guest Detection (Episode-Level)¶
- Extract guest names from episode titles (e.g., "Interview with John Smith")
- Extract from episode descriptions using NER
- Filter out detected hosts to avoid duplication
Configuration¶
| Option | Default | Description |
|---|---|---|
auto_speakers |
true |
Enable automatic speaker detection |
language |
"en" |
Language for NER model selection |
ner_model |
derived from language | Override NER model (e.g., en_core_web_sm) |
speaker_names |
[] |
Manual fallback names if detection fails |
CLI Flags¶
```bash
Enable/disable auto detection¶
--auto-speakers / --no-auto-speakers
Set language (affects both NER and Whisper)¶
--language en
Override NER model¶
--ner-model en_core_web_sm
Manual fallback¶
--speaker-names "Lenny" "Guest" ```python
Success Criteria¶
- ✅ Host names correctly detected from RSS author metadata
- ✅ Guest names correctly extracted from episode titles/descriptions
- ✅ Detected names appear in Whisper screenplay output
- ✅ Manual
--speaker-namesworks as fallback when detection fails - ✅ No regression in transcript quality when names aren't detected
- ✅ Performance impact < 100ms per episode for NER processing
Dependencies¶
- spaCy: NER library for named entity extraction
- spaCy language models:
en_core_web_sm(English), others for additional languages
Related Documents¶
- RFC-010: Automatic Speaker Name Detection - Technical implementation
- PRD-002: Whisper Fallback Transcription - Whisper integration
- Architecture - System design