Release v2.1.0 - Automatic Speaker Detection & Metadata Generation Design¶
Release Date: November 13, 2025 Type: Minor Release
Summary¶
v2.1.0 introduces automatic speaker name detection using Named Entity Recognition (NER) and establishes the design foundation for per-episode metadata generation. This release significantly enhances the transcript pipeline with intelligent host and guest identification, while laying the groundwork for comprehensive metadata documents.
What's New¶
✨ RFC-010: Automatic Speaker Name Detection (Implemented)¶
Automatic host and guest identification from episode metadata:
- Named Entity Recognition (NER): Uses spaCy to automatically extract person names from episode titles and descriptions
- RSS Author Tag Support: Prioritizes RSS
<author>,<itunes:author>, and<itunes:owner>tags for host identification - Host/Guest Distinction: Intelligently distinguishes recurring hosts from episode-specific guests
- Host Validation: Validates detected hosts using the first episode's metadata
- Pattern-Based Heuristics: Analyzes sample episodes to learn title patterns (position, prefixes, suffixes) for better guest selection
- Confidence Scoring: Uses overlap detection and confidence scores to select the best guest candidate
- Name Sanitization: Removes punctuation, normalizes whitespace, and deduplicates extracted names
- Graceful Fallback: Falls back to manual speaker names if automatic detection fails
- Language-Aware: Single
languageconfiguration drives both Whisper model selection and NER processing
Configuration:
--auto-speakers/auto_speakers: Enable automatic speaker detection (default:true)--language/language: Language for both Whisper and NER (default:"en")--ner-model/ner_model: spaCy model to use (default:"en_core_web_sm")--cache-detected-hosts/cache_detected_hosts: Cache host detection across episodes (default:true)--speaker-names/speaker_names: Manual fallback names (first = host, second = guest)
Features:
- Auto-downloads spaCy models when needed (similar to Whisper)
- Clear logging of detected hosts and guests at INFO level
- Detailed extraction details available at DEBUG level
- Works seamlessly in dry-run mode
- Pattern analysis from first few episodes improves accuracy
📋 PRD-004 & RFC-011: Metadata Generation Design¶
Comprehensive design documents for per-episode metadata generation:
- PRD-004: Product Requirements Document defining metadata schema, use cases, and functional requirements
- RFC-011: Technical design document with:
- Pydantic model definitions for type-safe metadata structure
- Database integration design (PostgreSQL, MongoDB, Elasticsearch, ClickHouse)
- ID generation strategy (feed_id, episode_id, content IDs)
- Unified JSON format with
snake_casefield names and ISO 8601 date serialization - Database loading examples for all target databases
Key Design Decisions:
- Opt-in feature (default
falsefor backwards compatibility) - JSON (default) and YAML format support
- Database-friendly schema for direct ingestion without transformation
- Stable, deterministic ID generation suitable for primary keys
- Schema versioning strategy (semantic versioning starting at 1.0.0)
Note: Implementation will follow in a future release. This release establishes the design foundation.
Improvements¶
RSS Parser Enhancements¶
- Author Tag Extraction: Extracts RSS
<author>,<itunes:author>, and<itunes:owner>tags from feed channel - HTML Stripping: Removes HTML tags and decodes entities from episode descriptions
- Improved Episode Description Extraction: Better handling of HTML content in RSS descriptions
Code Quality & Tooling¶
- CI/CD Alignment: GitHub Actions workflow now uses
make cidirectly, ensuring local and CI checks match exactly - Markdown Linting: Added markdownlint to CI pipeline, catching formatting issues early
- Complexity Management: Added complexity warnings to
.flake8per-file-ignores for appropriate modules - Code Cleanup: Removed unused variables and imports
Documentation Updates¶
- ARCHITECTURE.md: Updated to reflect speaker detection integration and metadata generation design
- TESTING_STRATEGY.md: Expanded with speaker detection and metadata generation testing requirements
- PRD-002 & PRD-003: Updated to reflect RFC-010 speaker detection features
- RFC-010: Expanded with final implementation details (RSS author tags, host validation, pattern heuristics)
- RFC-011: Comprehensive technical design for metadata generation
Technical Details¶
New Dependencies¶
- spacy>=3.7.0: Required dependency for Named Entity Recognition
Module Changes¶
speaker_detection.py(new): Complete NER-based speaker detection implementationrss_parser.py: Enhanced with author tag extraction and HTML strippingworkflow.py: Integrated speaker detection into pipelineepisode_processor.py: Updated to use detected speaker nameswhisper_integration.py: Renamed fromwhisper.pyto avoid naming conflictsmodels.py: Addedauthorsfield toRssFeeddataclass
Configuration Changes¶
New configuration fields:
language: "en" # Language for Whisper and NER
auto_speakers: true # Enable automatic speaker detection
ner_model: "en_core_web_sm" # spaCy model name
cache_detected_hosts: true # Cache host detection
```yaml
- Closes #21: Implement NER as per RFC-010
- Closes #15: Create PRD and RFC for generating metadata document per episode
- Related to #28: CI/CD improvements
## Documentation
- Speaker Detection: [RFC-010](../rfc/RFC-010-speaker-name-detection.md)
- Metadata Generation Design: [PRD-004](../prd/PRD-004-metadata-generation.md), [RFC-011](../rfc/RFC-011-metadata-generation.md)
- Architecture: [ARCHITECTURE.md](../architecture/ARCHITECTURE.md)
- Testing Strategy: [TESTING_STRATEGY.md](../architecture/TESTING_STRATEGY.md)
## Migration Notes
### For Users Upgrading from v2.0.1
1. **New Required Dependency**: Install spaCy models automatically or manually:
```bash
python -m spacy download en_core_web_sm
- Automatic Speaker Detection: Enabled by default (
auto_speakers: true). To disable:
auto_speakers: false
-
Manual Speaker Names: Now used as fallback only if automatic detection fails. First name = host, second = guest.
-
Language Configuration: Single
languagefield now controls both Whisper and NER:
language: "en" # Used for both Whisper model selection and NER
- No Breaking Changes: All existing functionality preserved. New features are additive.
Configuration Example¶
rss: "https://example.com/feed.xml"
output_dir: "./transcripts"
language: "en" # New: Controls Whisper + NER
auto_speakers: true # New: Enable automatic detection
ner_model: "en_core_web_sm" # New: spaCy model
cache_detected_hosts: true # New: Cache hosts across episodes
speaker_names: "Host, Guest" # Fallback if detection fails
```text
- 63 tests passing
- Comprehensive coverage for speaker detection:
- RSS author tag extraction
- NER model loading and validation
- Host detection from feed metadata
- Guest detection from episode metadata
- Pattern analysis and heuristics
- Name sanitization and deduplication
- Fallback to manual names
- Dry-run mode support
## Contributors
- Marko Dragoljevic (@chipi)
## Next Steps
- Implement metadata generation module (`podcast_scraper/metadata.py`) per RFC-011
- Integrate metadata generation into workflow pipeline
- Add configuration fields and CLI flags for metadata generation
- Generate metadata documents alongside transcripts
**Full Changelog**: <https://github.com/chipi/podcast_scraper/compare/v2.0.1...v2.1.0>